Search papers, labs, and topics across Lattice.
S-Lab
2
0
5
Ditching modular architectures unlocks surprisingly competitive vision-language performance, proving that end-to-end pixel-to-word models can rival traditional approaches at scale.
Current audio-visual generation models struggle to maintain coherence and alignment when scaling to minute-long content, a problem exposed by the new LongAV-Compass benchmark.