Heeseung Kim

Email  /  CV  /  LinkedIn  /  Google Scholar  /  GitHub

I am a Ph.D. candidate at the Data Science & AI Lab (DSAIL) at Seoul National University (SNU). My research primarily focuses on speech synthesis (text-to-speech, voice conversion) and spoken language and dialog models, and I have also researched generative models in other domains (vision, NLP, ...). I received my B.S. in Electrical and Computer Engineering from Seoul National University.



Research

I have a broad interest in generative models and am currently focused on multimodal large language models for speech and audio, in particular large spoken language models (LSLMs) and spoken dialog models. I have also conducted, and continue to conduct, research on diffusion models. My earlier work centered on speech synthesis, where I worked on tasks such as text-to-speech and voice conversion with an emphasis on personalization and data efficiency. Representative papers are highlighted.

Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
Neural Information Processing Systems (NeurIPS), 2024
arXiv / code / demo

USDM is a paralinguistics-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal model pretrained with a speech-text interleaving technique.

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon
INTERSPEECH, Oral Presentation, 2023
arXiv / code / demo

UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Heeseung Kim*, Sungwon Kim*, Sungroh Yoon
International Conference on Machine Learning (ICML), 2022
arXiv / demo

Guided-TTS is a method for building a TTS model using long-form untranscribed speech data of the target speaker.

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
arXiv preprint, 2024
project page / arXiv

Diptych Prompting is a novel zero-shot subject-driven text-to-image generation approach that reinterprets the task as inpainting with precise subject alignment, leveraging the emergent diptych generation capability of large-scale text-to-image models.

Style-Friendly SNR Sampler for Style-Driven Generation
Jooyoung Choi*, Chaehun Shin*, Yeongtak Oh, Heeseung Kim, Sungroh Yoon
arXiv preprint, 2024
project page / arXiv

The style-friendly SNR sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expanding the scope of style-driven generation.

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Nohil Park, Heeseung Kim, Che Hyun Lee, Jooyoung Choi, Jiheum Yeom, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page / arXiv

NanoVoice is a method that, given multiple speakers' voices, efficiently learns a personalized adapter for each speaker simultaneously.

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom, Heeseung Kim, Jooyoung Choi, Che Hyun Lee, Nohil Park, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page / arXiv

VoiceGuider is a personalized text-to-speech model with an inference-time guidance method that enables robust speaker adaptation even for out-of-domain speakers.

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon
INTERSPEECH, 2024
project page / arXiv

VoiceTailor is a one-shot speaker-adaptive text-to-speech model that combines low-rank adapters to perform speaker adaptation in a parameter-efficient manner.

HyperCLOVA X Technical Report
HyperCLOVA X Team, NAVER Cloud
arXiv preprint, 2024
arXiv

HyperCLOVA X is a series of large language models (LLMs) specifically designed to accommodate the Korean language and culture, while also excelling in English, mathematics, and coding tasks.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*, Heeseung Kim*, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Oral Presentation, Best Paper Award, 2023
project page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.

Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement
Hyunjun Yoo*, Eunji Kim*, Jong Won Chung*, Hyeon Cho, Sujin Jeong, Heeseung Kim, Dongju Jang, Hayun Kim, Jinsu Yoon, Gae Hwang Lee, Hyunbum Kang, Joo-Young Kim, Youngjun Yun, Sungroh Yoon, Yongtaek Hong
ACS Appl. Mater. Interfaces, 2022

The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Sungwon Kim*, Heeseung Kim*, Sungroh Yoon
arXiv preprint, 2022
arXiv / demo

Guided-TTS 2 is a model that enables personalized text-to-speech using only 10 seconds of untranscribed speech.

Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings
Sangwon Yu, Jongyoon Song, Heeseung Kim, Seong-min Lee, Woo-Jong Ryu, Sungroh Yoon
ACL, 2022
arXiv / code

AGG addresses the degeneration problem in neural language models by gating the specific part of the gradient for rare token embeddings.

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code / poster

PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.

Stein Latent Optimization for Generative Adversarial Networks
Uiwon Hwang, Heeseung Kim, Dahuin Jung, Hyemi Jang, Hyungyu Lee, Sungroh Yoon
International Conference on Learning Representations (ICLR), 2022
arXiv / code

SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs with clustered latent spaces, enabling better conditional generation of imbalanced real-world data.



Education
Ph.D. Candidate at Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025 (Expected)
  • Integrated M.S./Ph.D. Program.   Advisor: Sungroh Yoon.

B.S. at Seoul National University
Electrical and Computer Engineering
Mar 2015 - Feb 2019
  • Cum Laude



Invited Talks, Services, Honors, and Awards