Heeseung Kim

Email  /  CV  /  LinkedIn  /  Google Scholar  /  GitHub

I am a Ph.D. candidate at the Data Science & AI Lab (DSAIL) at Seoul National University (SNU). My research primarily focuses on speech synthesis (text-to-speech, voice conversion) and spoken language and dialog models, and I have also researched generative models in other domains (vision, NLP, ...). I received my B.S. in Electrical and Computer Engineering from Seoul National University.



Research

I have a broad interest in generative models and am currently focused on multimodal large language models for speech and audio, in particular large spoken language models (LSLMs) and spoken dialog models. I have also conducted, and continue to conduct, research on diffusion models. My earlier work centered on speech synthesis, where I worked on tasks such as text-to-speech and voice conversion with an emphasis on personalization and data efficiency. Representative papers are highlighted.

Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
Neural Information Processing Systems (NeurIPS), 2024
arXiv / code / demo

USDM is a paralinguistics-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modal model pretrained with a speech-text interleaving technique.

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon
INTERSPEECH, Oral Presentation, 2023
arXiv / code / demo

UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Heeseung Kim*, Sungwon Kim*, Sungroh Yoon
International Conference on Machine Learning (ICML), 2022
arXiv / demo

Guided-TTS is a method for building a TTS model using long-form untranscribed speech data of the target speaker.

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
arXiv preprint, 2024
project page / arXiv

Diptych Prompting is a novel zero-shot subject-driven text-to-image generation approach that reinterprets the task as inpainting with precise subject alignment, leveraging the emergent diptych generation capability of large-scale text-to-image models.

Style-Friendly SNR Sampler for Style-Driven Generation
Jooyoung Choi*, Chaehun Shin*, Yeongtak Oh, Heeseung Kim, Sungroh Yoon
arXiv preprint, 2024
project page / arXiv

The style-friendly SNR sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expanding the scope of style-driven generation.

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Nohil Park, Heeseung Kim, Che Hyun Lee, Jooyoung Choi, Jiheum Yeom, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page / arXiv

NanoVoice is a method that, given multiple speakers' voices, efficiently learns a personalized adapter for each speaker simultaneously.

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom, Heeseung Kim, Jooyoung Choi, Che Hyun Lee, Nohil Park, Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page / arXiv

VoiceGuider is a personalized text-to-speech model with an inference-time guidance method that enables robust speaker adaptation even for out-of-domain speakers.

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon
INTERSPEECH, 2024
project page / arXiv

VoiceTailor is a one-shot speaker-adaptive text-to-speech model that combines low-rank adapters to perform speaker adaptation in a parameter-efficient manner.

HyperCLOVA X Technical Report
HyperCLOVA X Team, NAVER Cloud
arXiv preprint, 2024
arXiv

HyperCLOVA X is a series of large language models (LLMs) specifically designed to accommodate the Korean language and culture, while also excelling in English, mathematics, and coding tasks.

Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*, Heeseung Kim*, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
Asian Conference on Machine Learning (ACML), Oral Presentation, Best Paper Award, 2023
project page / arXiv

Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.

Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement
Hyunjun Yoo*, Eunji Kim*, Jong Won Chung*, Hyeon Cho, Sujin Jeong, Heeseung Kim, Dongju Jang, Hayun Kim, Jinsu Yoon, Gae Hwang Lee, Hyunbum Kang, Joo-Young Kim, Youngjun Yun, Sungroh Yoon, Yongtaek Hong
ACS Appl. Mater. Interfaces, 2022

The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Sungwon Kim*, Heeseung Kim*, Sungroh Yoon
arXiv preprint, 2022
arXiv / demo

Guided-TTS 2 is a model that enables personalized text-to-speech using only 10 seconds of untranscribed speech.

Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings
Sangwon Yu, Jongyoon Song, Heeseung Kim, Seong-min Lee, Woo-Jong Ryu, Sungroh Yoon
ACL, 2022
arXiv / code

AGG addresses the degeneration problem in neural language models by gating the specific part of the gradient for rare token embeddings.

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code / poster

PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.

Stein Latent Optimization for Generative Adversarial Networks
Uiwon Hwang, Heeseung Kim, Dahuin Jung, Hyemi Jang, Hyungyu Lee, Sungroh Yoon
International Conference on Learning Representations (ICLR), 2022
arXiv / code

SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs with clustered latent spaces, enabling better conditional generation of imbalanced real-world data.



Education
Ph.D. Candidate at Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025 (Expected)
  • Integrated M.S./Ph.D. Program.   Advisor: Sungroh Yoon.

B.S. at Seoul National University
Electrical and Computer Engineering
Mar 2015 - Feb 2019
  • Cum Laude



Invited Talks, Services, Honors, and Awards