Current audio-visual generation models struggle to maintain coherence and alignment when scaling to minute-long content, a problem exposed by the new LongAV-Compass benchmark.
Moving beyond purely text-based chain-of-thought improves audio-visual reasoning: textual steps are interleaved with a unified latent space that preserves dense sensory information.