I am currently a Senior Research Engineer at Qualcomm AI Research Korea, where I work on developing more human-like, real-time voice agents.
I received my Ph.D. from the Data Science & AI Lab (DSAIL) at Seoul National University (SNU).
My research there primarily focused on speech synthesis (text-to-speech, voice conversion) and speech large language and dialog models (speech LLMs), and I also explored generative models in other domains such as vision and NLP.
I received my B.S. in Electrical and Computer Engineering from Seoul National University.
Selected Publications
I have a broad interest in generative models, with a current focus on multimodal large language models in speech and audio.
Specifically, I am interested in speech large language models (speech LLMs) and spoken dialog models.
I have also conducted research on diffusion models, both in the past and in my ongoing work.
My previous research primarily centered on speech synthesis, where I worked on tasks such as text-to-speech and voice conversion with an emphasis on personalization and data efficiency.
Below are my representative works.
USDM is a paralinguistic-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal pretrained model trained using a speech-text interleaving technique.
We propose ContextDialog, a benchmark for evaluating a model's ability to use information from earlier turns, and find that open-source multi-turn voice interaction models often fail to recall past information and, even in RAG scenarios, remain highly error-prone, resulting in inadequate responses.
UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.
VoiceTailor is a one-shot speaker-adaptive text-to-speech model that uses low-rank adapters to perform speaker adaptation in a parameter-efficient manner.
Projects
During my time at DSAIL, I collaborated with NAVER Cloud on developing spoken language and spoken dialog models based on pre-trained large language models (LLMs).
In this project, we extended a pre-trained LLM into a spoken language model by leveraging a large-scale speech-text paired dataset and further expanded it into a spoken dialog model through supervised fine-tuning.
I built USDM, an English end-to-end spoken dialog model that incorporates paralinguistic features, and contributed to the construction of NAVER's end-to-end Korean spoken dialog model, SpeechX.
Project Overview
2022.03 ~ 2023.07: Built a speaker-adaptive TTS model for more natural voice generation. Incorporated speech input into a GPT-2–scale model by attaching an encoder for ASR (Automatic Speech Recognition) and SER (Speech Emotion Recognition), aiming to enrich speech understanding capabilities.
2023.08 ~ 2024.05: Developed an English spoken dialog system with an end-to-end pipeline and paralinguistic awareness. This research led to a publication at NeurIPS 2024 and laid the groundwork for NAVER's proprietary speech LLM.
2024.06 ~ 2025.01: Focused on developing a text-to-speech model for synthetic spoken dialog generation.
AI that can talk and understand human emotions - a spoken dialog model that engages in spoken interactions while understanding and responding to human emotions, highlighting advances in emotion recognition and natural conversation.
Education
Ph.D., Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025
VoiceGuider is a personalized text-to-speech model that proposes a guidance method during inference, enabling robust speaker adaptation even for out-of-domain speakers.
PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.
HyperCLOVA X is a series of large language models (LLMs) specifically designed to accommodate the Korean language and culture, while also excelling in English, mathematics, and coding tasks.
EdiText is a general-purpose text editing method that leverages a diffusion language model to perform fine-to-coarse edits on a given text within a desired range.
Edit-A-Video is a diffusion-based one-shot video editing model that solves background inconsistency problems via a new sparse-causal mask blending method.
Diptych Prompting is a novel zero-shot subject-driven text-to-image generation approach that treats generation as an inpainting task, leveraging a 'diptych' property for stable subject alignment.
The Style-friendly sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expand the scope of style-driven generation.
SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs, enabling better conditional generation on imbalanced real-world data.
The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.