Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields better lip-sync accuracy, video quality, and audio quality than entangled approaches.
Executable visual transformations enable MLLMs to achieve continuous self-evolution without the pitfalls of pseudo-labels, leading to superior performance in dynamic VQA tasks.
Forget external retrieval controllers: GRIP lets your language model decide when and how to retrieve information, all within its own token-level decoding process.
Noisy multi-turn dialogue data hurts instruction tuning, but selecting entire conversations based on topic grounding and information flow yields surprisingly robust models.
Turns out, what makes for good code pre-training data depends heavily on the downstream task you're targeting.
LLMs can now leverage visual structure, not just text, to pinpoint bugs in multimodal programs, thanks to a novel graph alignment approach that bridges the gap between GUI screenshots and code.
Forget static datasets: this iterative training loop uses diagnostic feedback to continuously patch the blind spots in large multimodal models, leading to consistent performance gains.