Heeseung Kim

Email  /  CV  /  LinkedIn  /  Google Scholar  /  GitHub

I am a Ph.D. candidate at the Data Science & AI Lab (DSAIL) at Seoul National University (SNU). My research focuses primarily on speech synthesis (text-to-speech, voice conversion) and on speech large language models (speech LLMs) and spoken dialog models, and I have also worked on generative models in other domains such as vision and NLP. I received my B.S. in Electrical and Computer Engineering from Seoul National University.



Selected Publications

I have a broad interest in generative models and am currently focused on multimodal large language models for speech and audio, in particular speech large language models (speech LLMs) and spoken dialog models. Diffusion models have been a recurring focus throughout my research. My earlier work centered on speech synthesis, including text-to-speech and voice conversion, with an emphasis on personalization and data efficiency. Below are my representative works.

Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
Neural Information Processing Systems (NeurIPS), 2024
arXiv / code / demo

USDM is a paralinguistics-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal pretrained model trained using a speech-text interleaving technique.

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon
INTERSPEECH, Oral Presentation, 2023
arXiv / code / demo

UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Heeseung Kim*, Sungwon Kim*, Sungroh Yoon
International Conference on Machine Learning (ICML), 2022
arXiv / demo

Guided-TTS is a method for building a TTS model using long-form untranscribed speech data of the target speaker.

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon
INTERSPEECH, 2024
project page / arXiv

VoiceTailor is a one-shot speaker-adaptive text-to-speech model that incorporates low-rank adapters to perform speaker adaptation in a parameter-efficient manner.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*, Heeseung Kim*, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Oral, Best Paper Award, 2023
project page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.


Projects

During my time at DSAIL, I collaborated with NAVER Cloud on developing spoken language and spoken dialog models based on pre-trained large language models (LLMs). In this project, we extended a pre-trained LLM into a spoken language model by leveraging a large-scale speech-text paired dataset and then expanded it into a spoken dialog model through supervised fine-tuning. I built USDM, an English end-to-end spoken dialog model that incorporates paralinguistic features, and contributed to the construction of NAVER's end-to-end Korean spoken dialog model, SpeechX.
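
As a rough illustration of the cross-modal setup described above, the hypothetical Python sketch below shows one way a text LLM's vocabulary could be extended with discrete speech-unit tokens and how aligned text/speech spans might be interleaved into a single training sequence; the function names, vocabulary sizes, and alignment scheme are illustrative assumptions rather than the exact recipe used in the project.

    # Hypothetical sketch (not the project's actual code): extend a text LLM
    # vocabulary with discrete speech-unit tokens and interleave aligned
    # text/speech spans into one token sequence for cross-modal training.
    from __future__ import annotations

    TEXT_VOCAB_SIZE = 32_000   # assumed size of the base LLM's text vocabulary
    NUM_SPEECH_UNITS = 1_000   # assumed number of discrete speech units

    def unit_to_token_id(unit: int) -> int:
        """Map a discrete speech unit to a token id appended after the text vocabulary."""
        assert 0 <= unit < NUM_SPEECH_UNITS
        return TEXT_VOCAB_SIZE + unit

    def interleave(aligned_spans: list[tuple[list[int], list[int]]]) -> list[int]:
        """Interleave aligned (text-token span, speech-unit span) pairs into one sequence."""
        sequence: list[int] = []
        for text_span, unit_span in aligned_spans:
            sequence.extend(text_span)
            sequence.extend(unit_to_token_id(u) for u in unit_span)
        return sequence

    # Toy example with placeholder token ids and speech units.
    example = [([101, 2057], [17, 404, 404]), ([2024, 102], [991, 23, 23])]
    print(interleave(example))

Sequences built this way can be used to continue pre-training the LLM on paired speech-text data before the supervised fine-tuning stage described above.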

Project Overview

  • 2022.03 ~ 2023.07: Built a speaker-adaptive TTS model for more natural voice generation. Incorporated speech input into a GPT-2–scale model by attaching an encoder for ASR (Automatic Speech Recognition) and SER (Speech Emotion Recognition), aiming to enrich speech understanding capabilities.
  • 2023.08 ~ 2024.05: Developed an English spoken dialog system with an end-to-end pipeline and paralinguistic awareness. This research led to a publication at NeurIPS 2024, while simultaneously laying the groundwork for Naver’s proprietary Speech LLM.
  • 2024.06 ~ 2025.01: Focused on developing a text-to-speech model for synthetic spoken dialog generation.
  • 2025.02 ~ 2025.05 (Ongoing): Beyond supervised fine-tuning, exploring RLHF and preference-optimization methods (e.g., DPO, GRPO) to capture more natural and affective nuances in spoken dialog. Currently investigating strategies for collecting reward/preference data, either from human annotators or through synthetic approaches, to enhance voice interaction quality.

Related Articles & Blog Posts


Education

Ph.D. Candidate at Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025 (Expected)
  • Integrated M.S./Ph.D. Program. Advisor: Sungroh Yoon.

B.S. at Seoul National University
Electrical and Computer Engineering
Mar 2015 - Feb 2019
  • Cum Laude

Invited Talks, Services, Honors, and Awards


Research

Below is a list of my publications.

Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
Neural Information Processing Systems (NeurIPS), 2024
arXiv / code / demo

USDM is a paralinguistics-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal pretrained model trained using a speech-text interleaving technique.

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon
INTERSPEECH, Oral Presentation, 2023
arXiv / code / demo

UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Heeseung Kim*, Sungwon Kim*, Sungroh Yoon
International Conference on Machine Learning (ICML), 2022
arXiv / demo

Guided-TTS is a method for building a TTS model using long-form untranscribed speech data of the target speaker.

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon
INTERSPEECH, 2024
project page / arXiv

VoiceTailor is a one-shot speaker-adaptive text-to-speech model that incorporates low-rank adapters to perform speaker adaptation in a parameter-efficient manner.

Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models
Heeseung Kim*, Che Hyun Lee*, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, Sungroh Yoon
arXiv preprint, 2025
arXiv / demo / dataset

We propose ContextDialog, a benchmark for evaluating a model's ability to use information from earlier turns, and show that open-source multi-turn voice interaction models often fail to recall past information and remain highly error-prone even in retrieval-augmented (RAG) scenarios, leading to inadequate responses.

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Sungwon Kim*, Heeseung Kim*, Sungroh Yoon
arXiv preprint, 2022
arXiv / demo

Guided-TTS 2 is a model that enables personalized text-to-speech using only 10 seconds of untranscribed speech.

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Nohil Park, Heeseung Kim, Che Hyun Lee, Jooyoung Choi, Jiheum Yeom, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Oral Presentation, 2025
project page / arXiv

NanoVoice is a method that, given the voices of multiple speakers, efficiently learns a personalized adapter for each speaker simultaneously.

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom, Heeseung Kim, Jooyoung Choi, Che Hyun Lee, Nohil Park, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page / arXiv

VoiceGuider is a personalized text-to-speech model that applies a guidance method at inference time, enabling robust speaker adaptation even for out-of-domain speakers.

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code / poster

PriorGrad presents an efficient method for constructing a data-dependent, non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.

HyperCLOVA X Technical Report
HyperCLOVA X Team, NAVER Cloud
arXiv preprint, 2024
arXiv

HyperCLOVA X is a series of large language models (LLMs) specifically designed for the Korean language and culture, while also excelling in English, mathematics, and coding tasks.

EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon
arXiv preprint, 2025
arXiv

EdiText is a general-purpose text editing method that leverages a diffusion language model to perform coarse-to-fine edits on a given text within a desired range.

Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings
Sangwon Yu, Jongyoon Song, Heeseung Kim, Seong-min Lee, Woo-Jong Ryu, Sungroh Yoon
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
arXiv / code

AGG addresses the degeneration problem in neural language models by gating specific parts of the gradient for rare token embeddings.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*, Heeseung Kim*, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Oral, Best Paper Award, 2023
project page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves background inconsistency problems via a new sparse-causal mask blending method.

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
project page / arXiv

Diptych Prompting is a novel zero-shot subject-driven text-to-image generation approach that treats generation as an inpainting task, leveraging a 'diptych' property for stable subject alignment.

Style-Friendly SNR Sampler for Style-Driven Generation
Jooyoung Choi*, Chaehun Shin*, Yeongtak Oh, Heeseung Kim, Sungroh Yoon
arXiv preprint, 2024
project page / arXiv

The Style-Friendly SNR Sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expanding the scope of style-driven generation.

Stein Latent Optimization for Generative Adversarial Networks
Uiwon Hwang, Heeseung Kim, Dahuin Jung, Hyemi Jang, Hyungyu Lee, Sungroh Yoon
International Conference on Learning Representations (ICLR), 2022
arXiv / code

SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs, enabling better conditional generation on imbalanced real-world data.

Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement
Hyunjun Yoo*, Eunji Kim*, Jong Won Chung*, Hyeon Cho, Sujin Jeong, Heeseung Kim, Dongju Jang, Hayun Kim, Jinsu Yoon, Gae Hwang Lee, Hyunbum Kang, Joo-Young Kim, Youngjun Yun, Sungroh Yoon, Yongtaek Hong
ACS Applied Materials & Interfaces, 2022

The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.