Mar 31, 2026 · arXiv:2603.29931

Gloria: Consistent Character Video Generation via Content Anchors

Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha

AI Summary

The paper introduces Gloria, a method for generating consistent character videos by representing a character's visual attributes through a compact set of anchor frames extracted from large-scale video data. To prevent copy-pasting and resolve the multi-reference conflicts inherent in reference-based video generation, the authors propose two mechanisms: Superset Content Anchoring and RoPE as Weak Condition. Experiments show that the method generates high-quality character videos exceeding 10 minutes with expressive identity and appearance consistency across views, outperforming existing approaches.

Key Contribution

Forget uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.

Abstract

Digital characters are central to modern media, yet generating long-duration character videos with consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or use non-character-centric information as memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces two challenges: copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from large-scale video data. Experiments show that our method generates high-quality character videos exceeding 10 minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.
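The abstract does not spell out how positional offsets are assigned, so the following is only a minimal sketch of the general idea, using standard rotary position embedding (RoPE) in plain Python. The helper names (`rope`, `dot`), the clip length `T`, and the values in `ANCHOR_OFFSETS` are illustrative assumptions, not the paper's implementation. It shows that two anchors with identical content receive different attention scores once they are placed at different positional offsets, which is one way a model could tell multiple anchors apart.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates pairs, so the dimension must be even"
    out = []
    for i in range(0, d, 2):
        # Pair i/2 rotates by pos * base^(-i/d), the usual RoPE frequency.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical layout (an assumption): video tokens use temporal positions
# 0..T-1, and each anchor frame is placed at its own offset beyond the clip.
T = 8
ANCHOR_OFFSETS = [T + 4, T + 8]  # illustrative values, not from the paper

query = [0.3, -1.2, 0.7, 0.5]        # a video token's query at position 3
anchor_key = [1.0, 0.2, -0.4, 0.9]   # identical content for both anchors

q = rope(query, pos=3)
scores = [dot(q, rope(anchor_key, pos=p)) for p in ANCHOR_OFFSETS]
print(scores)  # two different scores for the same anchor content
```

Because RoPE scores depend only on the relative position between query and key, the offset acts as a soft positional cue rather than a hard constraint on content, which is consistent with the "weak condition" naming, though the paper's exact scheme may differ.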

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
