I am a Ph.D. candidate at the Data Science & AI Lab (DSAIL) at Seoul National University (SNU).
My research focuses primarily on speech synthesis (text-to-speech, voice conversion) and speech large language and dialog models (speech LLMs); I have also worked on generative models in other domains (vision, NLP, etc.).
I received my B.S. in Electrical and Computer Engineering from Seoul National University.
Selected Publications
I have a broad interest in generative models, and I am currently particularly focused on multimodal large language models in speech and audio.
Specifically, I am interested in speech large language models (speech LLMs) and spoken dialog models.
I have also worked on diffusion models, both in past and ongoing research.
My earlier research centered on speech synthesis, including text-to-speech and voice conversion, with an emphasis on personalization and data efficiency.
Below are my representative works.
USDM is a paralinguistic-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal pretrained model trained using a speech-text interleaving technique.
UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.
VoiceTailor is a one-shot speaker-adaptive text-to-speech model that uses low-rank adapters to perform speaker adaptation in a parameter-efficient manner.
Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.
Projects
During my time at DSAIL, I collaborated with NAVER Cloud on developing spoken language and spoken dialog models based on pre-trained large language models (LLMs).
In this project, we extended a pre-trained LLM into a spoken language model by leveraging a large-scale speech-text paired dataset and further expanded it into a spoken dialog model through supervised fine-tuning.
I built USDM, an English end-to-end spoken dialog model that incorporates paralinguistic features, and contributed to the construction of Naver’s end-to-end Korean spoken dialog model, SpeechX.
Project Overview
2022.03 ~ 2023.07: Built a speaker-adaptive TTS model for more natural voice generation. Incorporated speech input into a GPT-2–scale model by attaching an encoder for ASR (Automatic Speech Recognition) and SER (Speech Emotion Recognition), aiming to enrich speech understanding capabilities.
2023.08 ~ 2024.05: Developed an English spoken dialog system with an end-to-end pipeline and paralinguistic awareness. This research led to a publication at NeurIPS 2024, while simultaneously laying the groundwork for Naver’s proprietary Speech LLM.
2024.06 ~ 2025.01: Focused on developing a text-to-speech model for synthetic spoken dialog generation.
2025.02 ~ 2025.05 (Ongoing): Beyond supervised fine-tuning, exploring preference optimization and reinforcement learning methods (e.g., DPO, GRPO) to capture more natural and affective nuances in spoken dialog. Currently investigating strategies for collecting reward/preference data, from human annotators or through synthetic approaches, to enhance voice interaction quality.
AI that can talk and understand human emotions: a spoken dialog model that engages in spoken interactions while understanding and responding to human emotions, highlighting advances in emotion recognition and natural conversation.
Education
Ph.D. Candidate at Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025 (Expected)
Publications
USDM is a paralinguistic-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal pretrained model trained using a speech-text interleaving technique.
UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.
VoiceTailor is a one-shot speaker-adaptive text-to-speech model that uses low-rank adapters to perform speaker adaptation in a parameter-efficient manner.
ContextDialog is a benchmark for evaluating a model’s ability to use information from earlier turns; our analysis shows that open-source multi-turn voice interaction models often fail to recall past information and, even in RAG scenarios, remain highly error-prone, resulting in inadequate responses.
VoiceGuider is a personalized text-to-speech model that applies an inference-time guidance method, enabling robust speaker adaptation even for out-of-domain speakers.
PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.
HyperCLOVA X is a series of large language models (LLMs) specifically designed to accommodate the Korean language and culture, while also excelling in English, mathematics, and coding tasks.
EdiText is a general-purpose text editing method that leverages a diffusion language model to perform fine-to-coarse edits on a given text within a desired range.
Edit-A-Video is a diffusion-based one-shot video editing model that solves background inconsistency problems via a new sparse-causal mask blending method.
Diptych Prompting is a novel zero-shot subject-driven text-to-image generation approach that treats generation as an inpainting task, leveraging a 'diptych' property for stable subject alignment.
The style-friendly sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expand the scope of style-driven generation.
SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs, enabling better conditional generation on imbalanced real-world data.
The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.