Seongsu Ha

I am a deep learning researcher interested in computer vision and machine learning. Specifically, my research focuses on improving the quality of multi-modal representations and their interactions for various downstream applications, such as video understanding, video corpus moment retrieval, video scene boundary segmentation, and visual grounding. Previously, I received my Master's degree in Data Science from the Visual Information Processing Lab at Seoul National University, advised by Prof. Joonseok Lee. I also received my Bachelor's degree in Computer Science and Engineering from the University of Illinois at Urbana-Champaign.

Email  /  LinkedIn

profile photo
News

  • 08/2024: TWLV-I, analysis and insights from a holistic evaluation of video foundation models, released! Tech Report
  • 07/2024: Paper on referring image segmentation accepted at ECCV 2024
  • 07/2024: Paper on video frame sampling accepted at BMVC 2024
  • 03/2024: Pegasus-1, a new SOTA video-to-text generative model, released! Tech Report
  • 03/2024: Marengo-2.6, a new SOTA video foundation model for any-to-any search, released! Tech Blog
  • 01/2024: Paper on video moment localization accepted at AISTATS 2024
  • 09/2023: Started working at Twelve Labs as a research scientist
  • 06/2023: Started working at Twelve Labs as a research intern
  • 05/2023: Paper on talking head generation accepted at the Sight and Sound Workshop, CVPR 2023
  • 06/2022: Paper on scene boundary segmentation accepted at ACCV 2022
  • 01/2022: Started working at KakaoBrain as a research intern
  • 03/2021: Started MS in Data Science at Seoul National University Graduate School of Data Science

Research
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
Twelve Labs
Technical Report arXiv, 24.08
paper
Pegasus-1: a new SOTA Video-to-Text Generative Model
Twelve Labs
Technical Report arXiv, 24.04
paper
Marengo-2.6: a new SOTA Video Foundation Model for Any-to-Any Search
Twelve Labs
Technical Blog, 24.03
blog
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
Seongsu Ha*, Chaeyun Kim*, Donghwa Kim*, Junho Lee, Sangho Lee, Joonseok Lee
ECCV, 2024
paper
Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space
Junho Lee*, Jeongwoo Shin, Seung Woo Ko, Seongsu Ha, Joonseok Lee
BMVC, 2024
paper
Towards a Complete Benchmark on Video Moment Localization
Jinyeong Chae*, Donghwa Kim*, Kwanseok Kim, Doyeon Lee, Sangho Lee, Seongsu Ha, Jonghwan Mun, Woo-Young Kang, Byungseok Roh, Joonseok Lee
AISTATS, 2024
paper
Disentangled Audio-Driven NeRF: Talking Head Generation with Detailed Identity-Specific Micro-expressions
Seoyoung Lee*, Seongsu Ha*, Joonseok Lee
CVPRW, 2023
paper
Boundary-aware Self-supervised Learning for Video Scene Segmentation
Jonghwan Mun*, Minchul Shin*, Gunsoo Han, Sangho Lee, Seongsu Ha, Joonseok Lee, Eun-Sol Kim
ACCV, 2022
paper
Experience
ML Research Scientist, Twelve Labs.

Sep. 2023 ~ Present

ML Research Intern, Twelve Labs.

Jun. 2023 ~ Sep. 2023

Graduate Researcher, Visual Information Processing Lab.

Mar. 2021 ~ Jun. 2023

Research Intern, KakaoBrain.

Jan. 2022 ~ Mar. 2022

Research Assistant, Perform Research Group.

Jun. 2018 ~ Sep. 2018


Source code credit to Dr. Jon Barron