May 26, 2026 · arXiv:2605.26486

LongCat-Video-Avatar 1.5 Technical Report

Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang

AI Summary

LongCat-Video-Avatar 1.5 improves audio-driven video generation through systematic engineering and production readiness rather than architectural novelty. The upgrade adopts Whisper Large as the audio encoder and scales the training recipes to improve lip-sync, full-body temporal stability, and identity consistency in long videos. Data curation and RLHF training help the model generalize to stylized domains (anime, animals) and complex real-world scenarios (multi-person interactions, object handling), while step distillation reduces inference to 8 NFE. In human evaluations on a benchmark of over 500 test cases, it achieves competitive or superior performance compared with leading closed-source systems.

Key Contribution

Open-source LongCat-Video-Avatar 1.5 achieves competitive or superior performance relative to leading closed-source systems in audio-driven video generation by prioritizing practical engineering over architectural novelty, targeting commercial-grade quality and inference speed.

Abstract

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, including multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to 8 NFE (number of function evaluations), achieving a favorable trade-off between serving efficiency and visual fidelity. Our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.
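The abstract measures inference cost in NFE, the number of times the generative network is evaluated during sampling. The sketch below is only a toy illustration of that accounting and is not the report's sampler or distillation method: `toy_velocity` is a hypothetical stand-in for the video model, and a fixed-step Euler integrator is assumed purely for simplicity. It shows that cutting the step count on an undistilled model increases discretization error, which is the gap that step distillation is generally meant to close by training a few-step student to match a many-step teacher.

```python
import math
from typing import Callable, Dict, Tuple

Velocity = Callable[[float, float], float]


def make_counted(fn: Velocity) -> Tuple[Velocity, Dict[str, int]]:
    """Wrap fn so that every call is counted as one function evaluation."""
    counter = {"nfe": 0}

    def wrapped(x: float, t: float) -> float:
        counter["nfe"] += 1
        return fn(x, t)

    return wrapped, counter


def toy_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for a generative network: dx/dt = -x.
    # Its exact solution at t = 1, starting from x = 1, is exp(-1).
    return -x


def euler_sample(velocity: Velocity, x0: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)  # one network call per step
    return x


def run(num_steps: int) -> Tuple[float, int]:
    """Return the final sample and the NFE spent to produce it."""
    velocity, counter = make_counted(toy_velocity)
    sample = euler_sample(velocity, 1.0, num_steps)
    return sample, counter["nfe"]


def error_vs_exact(num_steps: int) -> float:
    """Absolute error of the Euler sample against the exact solution exp(-1)."""
    sample, _ = run(num_steps)
    return abs(sample - math.exp(-1.0))


if __name__ == "__main__":
    for steps in (8, 50):
        sample, nfe = run(steps)
        print(f"steps={steps} NFE={nfe} sample={sample:.4f} error={error_vs_exact(steps):.4f}")
```

In this toy setting each step costs exactly one evaluation, so NFE equals the step count. Samplers that call the network more than once per step (for example, to combine conditional and unconditional predictions) would spend proportionally more NFE for the same number of steps.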

Computer Vision · Open-Source Models & Weights · Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
