The paper introduces Talking Avatar generation from Video Reference (TAVR), a framework that generates talking avatars using cross-scene video inputs, overcoming the limitations of single-view, same-scene reference images. TAVR employs a token selection module and a three-stage training scheme involving same-scene pretraining, cross-scene fine-tuning, and reinforcement learning with identity-based rewards to handle extended temporal contexts and bridge domain gaps. Experiments on a new cross-scene video pair benchmark demonstrate that TAVR outperforms existing methods in generating high-fidelity talking avatars with flexible video referencing.
Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.
Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).
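The abstract names a token selection module and an identity-based reward but does not say how either is computed. The sketch below illustrates one plausible reading and is not the authors' implementation. It assumes that reference-video tokens carry a relevance score and that the top-k are kept, and that the reward is the mean cosine similarity between a reference identity embedding and per-frame embeddings of the generated video. The function names, the scoring scheme, and the use of cosine similarity are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_k_tokens(scores: Sequence[float], k: int) -> List[int]:
    """Hypothetical token selection: keep the k highest-scoring reference
    tokens and return their indices in original (temporal) order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def identity_reward(reference: Vector, generated_frames: Sequence[Vector]) -> float:
    """Hypothetical identity reward: mean cosine similarity between the
    reference identity embedding and each generated-frame embedding."""
    if not generated_frames:
        return 0.0
    sims = [cosine_similarity(reference, frame) for frame in generated_frames]
    return sum(sims) / len(sims)
```

In a setup like this, `select_top_k_tokens` would shorten the extended temporal context taken from the reference video before conditioning, and `identity_reward` would serve as the scalar signal for the reinforcement learning stage. The embeddings themselves would come from a face-recognition encoder, which is outside the scope of this sketch.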