Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity to video generation than to audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and Foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
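To make the architecture concrete, below is a minimal PyTorch sketch of one dual-stream block under stated assumptions: each stream is a transformer layer of its own width, the two streams are coupled by bidirectional cross-attention, and a single shared timestep embedding drives the AdaLN modulation of both streams (cross-modality AdaLN). All module names, dimensions, and omitted details (temporal positional embeddings, feed-forward sublayers) are illustrative, not the released implementation.

```python
# Illustrative sketch (not the released LTX-2 code): one dual-stream block
# coupling a wider video stream and a narrower audio stream via bidirectional
# cross-attention, with cross-modality AdaLN from a shared timestep embedding.
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    def __init__(self, d_video=2048, d_audio=1024, n_heads=16, d_cond=512):
        super().__init__()
        self.d_video, self.d_audio = d_video, d_audio
        # Self-attention within each modality.
        self.video_self = nn.MultiheadAttention(d_video, n_heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        # Bidirectional audio-video cross-attention; kdim/vdim let each
        # stream keep its own width while attending to the other.
        self.v_from_a = nn.MultiheadAttention(d_video, n_heads, kdim=d_audio,
                                              vdim=d_audio, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(d_audio, n_heads, kdim=d_video,
                                              vdim=d_video, batch_first=True)
        # Cross-modality AdaLN: one shared timestep embedding produces the
        # scale/shift pairs that modulate both streams' layer norms.
        self.norm_v = nn.LayerNorm(d_video, elementwise_affine=False)
        self.norm_a = nn.LayerNorm(d_audio, elementwise_affine=False)
        self.adaln = nn.Linear(d_cond, 2 * (d_video + d_audio))

    def forward(self, v, a, t_emb):
        # v: (B, Nv, d_video) video tokens; a: (B, Na, d_audio) audio tokens;
        # t_emb: (B, d_cond) timestep embedding shared by both modalities.
        scale_v, shift_v, scale_a, shift_a = self.adaln(t_emb).split(
            [self.d_video, self.d_video, self.d_audio, self.d_audio], dim=-1)
        hv = self.norm_v(v) * (1 + scale_v[:, None]) + shift_v[:, None]
        ha = self.norm_a(a) * (1 + scale_a[:, None]) + shift_a[:, None]
        v = v + self.video_self(hv, hv, hv, need_weights=False)[0]
        a = a + self.audio_self(ha, ha, ha, need_weights=False)[0]
        # Each stream queries the other, keeping the modalities synchronized
        # (temporal positional embeddings omitted for brevity).
        v = v + self.v_from_a(v, a, a, need_weights=False)[0]
        a = a + self.a_from_v(a, v, v, need_weights=False)[0]
        return v, a
```

The asymmetry in the abstract (14B video vs. 5B audio parameters) would correspond here to choosing a larger `d_video` and stacking more video-side capacity, while the cross-attention coupling keeps the streams aligned.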
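The modality-CFG mechanism can likewise be sketched under an assumption the abstract does not spell out: that classifier-free guidance is applied per modality with independent scales. The `model` interface, `null_emb`, and the scale values below are hypothetical placeholders, not the paper's exact scheme.

```python
# Hypothetical sketch of one modality-aware classifier-free guidance step,
# assuming separate guidance scales for the video and audio predictions.
def modality_cfg(model, v_lat, a_lat, t, text_emb, null_emb,
                 w_video=7.0, w_audio=4.0):
    # Conditional pass: both streams see the text conditioning.
    v_cond, a_cond = model(v_lat, a_lat, t, text_emb)
    # Unconditional pass: text conditioning replaced by a null embedding.
    v_unc, a_unc = model(v_lat, a_lat, t, null_emb)
    # Steer each modality with its own scale, so video prompt adherence
    # and audiovisual alignment can be tuned independently.
    v_out = v_unc + w_video * (v_cond - v_unc)
    a_out = a_unc + w_audio * (a_cond - a_unc)
    return v_out, a_out
```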