Research
I have a broad interest in generative models and am currently focused on multimodal large language models for speech and audio.
Specifically, I am interested in large spoken language models (LSLMs) and spoken dialog models.
I have also conducted research on diffusion models, both in the past and ongoing.
My earlier research centered on speech synthesis, where I worked on tasks such as text-to-speech and voice conversion, with an emphasis on personalization and data efficiency.
Representative papers are highlighted.
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim,
Soonshin Seo,
Kyeongseok Jeong,
Ohsung Kwon,
Soyoon Kim,
Jungwhan Kim,
Jaehong Lee,
Eunwoo Song,
Myungwoo Oh,
Jung-Woo Ha,
Sungroh Yoon,
Kang Min Yoo
Neural Information Processing Systems (NeurIPS), 2024
arXiv /
code /
demo
USDM is a paralinguistics-aware spoken dialog model built through supervised fine-tuning (SFT) on spoken dialog data, on top of a cross-modally pretrained model trained with a speech-text interleaving technique.
UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim,
Sungwon Kim,
Jiheum Yeom,
Sungroh Yoon
INTERSPEECH, Oral Presentation, 2023
arXiv /
code /
demo
UnitSpeech is a speaker adaptation model that enables personalized text-to-speech and any-to-any voice conversion with only 5 to 10 seconds of untranscribed speech.
Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Heeseung Kim*,
Sungwon Kim*,
Sungroh Yoon
International Conference on Machine Learning (ICML), 2022
arXiv /
demo
Guided-TTS is a method for building a TTS model using long-form untranscribed speech data of the target speaker.
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin,
Jooyoung Choi,
Heeseung Kim,
Sungroh Yoon
arXiv preprint, 2024
project page /
arXiv
Diptych Prompting is a novel zero-shot subject-driven text-to-image generation method that reinterprets the task as inpainting with precise subject alignment, leveraging the emergent diptych-generation property of large-scale text-to-image models.
Style-Friendly SNR Sampler for Style-Driven Generation
Jooyoung Choi*,
Chaehun Shin*,
Yeongtak Oh,
Heeseung Kim,
Sungroh Yoon
arXiv preprint, 2024
project page /
arXiv
The style-friendly SNR sampler shifts diffusion fine-tuning toward higher noise levels, enabling FLUX and SD3.5 to effectively learn new, unique artistic styles and expanding the scope of style-driven generation.
NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Nohil Park,
Heeseung Kim,
Che Hyun Lee,
Jooyoung Choi,
Jiheum Yeom,
Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page /
arXiv
Given multiple speakers' voices, NanoVoice efficiently learns a personalized adapter for each speaker simultaneously.
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom,
Heeseung Kim,
Jooyoung Choi,
Che Hyun Lee,
Nohil Park,
Sungroh Yoon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
project page /
arXiv
VoiceGuider is a personalized text-to-speech model that applies an autoguidance-based method at inference time, enabling robust speaker adaptation even for out-of-domain speakers.
VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech
Heeseung Kim,
Sang-gil Lee,
Jiheum Yeom,
Che Hyun Lee,
Sungwon Kim,
Sungroh Yoon
INTERSPEECH, 2024
project page /
arXiv
VoiceTailor is a one-shot speaker-adaptive text-to-speech model that plugs low-rank adapters into a diffusion-based TTS model to perform speaker adaptation in a parameter-efficient manner.
HyperCLOVA X Technical Report
HyperCLOVA X Team, NAVER Cloud
arXiv preprint, 2024
arXiv
HyperCLOVA X is a series of large language models (LLMs) specifically designed to accommodate the Korean language and culture, while also excelling in English, mathematics, and coding tasks.
Edit-A-Video: Single Video Editing with Object-Aware Consistency
Chaehun Shin*,
Heeseung Kim*,
Che Hyun Lee,
Sang-gil Lee,
Sungroh Yoon
Asian Conference on Machine Learning (ACML), Oral Presentation, Best Paper Award, 2023
project page /
arXiv
Edit-A-Video is a diffusion-based one-shot video editing model that solves a background inconsistency problem using a new sparse-causal mask blending method.
Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement
Hyunjun Yoo*,
Eunji Kim*,
Jong Won Chung*,
Hyeon Cho,
Sujin Jeong,
Heeseung Kim,
Dongju Jang,
Hayun Kim,
Jinsu Yoon,
Gae Hwang Lee,
Hyunbum Kang,
Joo-Young Kim,
Youngjun Yun,
Sungroh Yoon,
Yongtaek Hong
ACS Applied Materials & Interfaces, 2022
The proposed high-performance strain sensors, optimally positioned using deep learning analysis, enable accurate detection of directional facial muscle movement for silent speech recognition.
Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Sungwon Kim*,
Heeseung Kim*,
Sungroh Yoon
arXiv preprint, 2022
arXiv /
demo
Guided-TTS 2 is a model that enables personalized text-to-speech using only 10 seconds of untranscribed speech.
Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings
Sangwon Yu,
Jongyoon Song,
Heeseung Kim,
Seong-min Lee,
Woo-Jong Ryu,
Sungroh Yoon
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
arXiv /
code
AGG addresses the degeneration problem in neural language models by adaptively gating part of the gradient for rare token embeddings.
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Sang-gil Lee,
Heeseung Kim,
Chaehun Shin,
Xu Tan,
Chang Liu,
Qi Meng,
Tao Qin,
Wei Chen,
Sungroh Yoon,
Tie-Yan Liu
International Conference on Learning Representations (ICLR), 2022
project page /
arXiv /
code /
poster
PriorGrad presents an efficient method for constructing a data-dependent non-standard Gaussian prior for training and sampling from diffusion models applied to speech synthesis.
Stein Latent Optimization for Generative Adversarial Networks
Uiwon Hwang,
Heeseung Kim,
Dahuin Jung,
Hyemi Jang,
Hyungyu Lee,
Sungroh Yoon
International Conference on Learning Representations (ICLR), 2022
arXiv /
code
SLOGAN introduces Stein latent optimization and a novel unsupervised conditional contrastive loss for GANs with clustered latent spaces, enabling better conditional generation of imbalanced real-world data.
Ph.D. Candidate at Seoul National University
Electrical and Computer Engineering
Mar 2019 - Aug 2025 (Expected)
Integrated M.S./Ph.D. Program. Advisor: Sungroh Yoon.
B.S. at Seoul National University
Electrical and Computer Engineering
Mar 2015 - Feb 2019
Cum Laude
Invited Talks, Services, Honors, and Awards
- Reviewer, The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
- Invited Talk "Latest Trends in Spoken Dialog Models and Voice Agents", Qualcomm, 2024
- Reviewer, The Thirteenth International Conference on Learning Representations (ICLR), 2025
- Invited Talk "Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation", HMG TECH SUMMIT, 2024
- Invited Talk "Speech and Spoken Dialog Modeling", Neosapience, 2024
- Top Reviewer, The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
- Best Paper Award, Asian Conference on Machine Learning (ACML), 2023
- Invited Talk "A case study of research and development at Seoul National University using Amazon Mechanical Turk", AWS Summit Seoul, 2024
- Invited Talk "Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance", Kakao Enterprise, 2022
- Best Poster Award, AIIS Fall Retreat, 2022
- Outstanding Paper Award, Hyundai AI Consortium, 2022
- Cum Laude, Seoul National University, 2019
- Academic Performance Scholarship, Seoul National University, Spring 2016, Spring and Fall 2018