LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
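The general recipe — answer, verify each claim against the image, revise — can be sketched as a simple loop. Below is a hypothetical Python sketch of that pattern, not the paper's actual pipeline; `answer_fn`, `extract_claims`, and `verify` are assumed stand-ins for the LVLM call, a claim extractor, and a visual verifier.

```python
from typing import Callable, List

def grounded_self_refine(
    image,
    question: str,
    answer_fn: Callable[..., str],               # LVLM: (image, question, feedback) -> answer
    extract_claims: Callable[[str], List[str]],  # answer -> atomic visual claims
    verify: Callable[[object, str], bool],       # (image, claim) -> supported by image?
    max_rounds: int = 3,
) -> str:
    """Training-free loop: answer, check claims against the image, refine."""
    answer = answer_fn(image, question, feedback=None)
    for _ in range(max_rounds):
        unsupported = [c for c in extract_claims(answer) if not verify(image, c)]
        if not unsupported:
            break  # every claim is visually grounded; stop refining
        feedback = (
            "These claims are not supported by the image: "
            + "; ".join(unsupported)
            + ". Rewrite the answer using only verified visual evidence."
        )
        answer = answer_fn(image, question, feedback=feedback)
    return answer
```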
Finally, a single model that can convincingly generate both your face and voice, controlled by text prompts and reference clips.
Text-to-video generation gets a 1.58x speed boost with CalibAtt, a training-free method that exploits consistent sparsity patterns in attention layers.
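Calibrate-once sparse attention is a general pattern: run a few calibration prompts, record which attention positions are consistently negligible, then reuse that static mask at inference. The PyTorch sketch below illustrates that pattern under assumed shapes and a made-up `keep_ratio` parameter; it is not CalibAtt's actual algorithm, and a real speedup would come from a sparse kernel that skips masked positions rather than filling them with `-inf`.

```python
import torch

def calibrate_sparse_mask(attn_maps: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
    """Derive a static sparsity mask from attention maps collected on
    calibration prompts.

    attn_maps: (num_samples, heads, q_len, k_len) softmax attention weights.
    Returns a boolean mask of shape (heads, q_len, k_len); True = keep.
    """
    mean_attn = attn_maps.mean(dim=0)                    # average over calibration runs
    k = max(1, int(keep_ratio * mean_attn.shape[-1]))
    thresh = mean_attn.topk(k, dim=-1).values[..., -1:]  # per-row top-k cutoff
    return mean_attn >= thresh

def sparse_attention(q, k, v, mask):
    """Attention that reuses the calibrated sparsity pattern."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))    # drop calibrated-out positions
    return torch.softmax(scores, dim=-1) @ v
```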
Tri-modal masked diffusion models can now be trained from scratch, achieving strong results in text generation, text-to-image, and text-to-speech, thanks to a systematic exploration of the design space and a novel SDE-based batch-size reparameterization.
RL fine-tuning can make vision-language models *less* reliable reasoners, as gains in benchmark accuracy come at the cost of chains-of-thought that stay faithful to the underlying visual evidence.