MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Achieve world-consistent video generation by directly optimizing geometry in the latent space of pre-trained video diffusion models, sidestepping costly RGB-space operations and architectural changes.
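To make the idea concrete, here is a minimal sketch in PyTorch of latent-space geometry optimization: treat the video latent as the optimization variable and descend a geometry-consistency loss through a frozen decoder. `decode_fn` and `geometry_loss` are hypothetical placeholders, not the paper's actual API.

```python
import torch

def optimize_latent(latent, decode_fn, geometry_loss, steps=50, lr=1e-2):
    """Sketch: refine a video latent so decoded frames satisfy a geometry loss.

    decode_fn: frozen pre-trained video diffusion decoder (assumed)
    geometry_loss: differentiable world-consistency penalty, e.g. multi-view
                   depth/point agreement (assumed)
    """
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        frames = decode_fn(latent)    # decode only to score geometry, no RGB editing
        loss = geometry_loss(frames)  # penalize world-inconsistent structure
        loss.backward()
        opt.step()
    return latent.detach()
```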
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
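The idea lends itself to a toy illustration. The NumPy sketch below (feature extractors assumed given) correlates frame-to-frame change curves across candidate offsets, so alignment depends on *when* things change rather than what they are; it is a simplification, not V2M-Zero's actual algorithm.

```python
import numpy as np

def change_curve(features):
    """features: (T, D) per-frame or per-beat embeddings -> normalized change signal."""
    deltas = np.linalg.norm(np.diff(features, axis=0), axis=1)
    return (deltas - deltas.mean()) / (deltas.std() + 1e-8)

def best_offset(video_feats, music_feats, max_shift=50):
    """Score each temporal shift by correlation of the two change curves."""
    v, m = change_curve(video_feats), change_curve(music_feats)
    n = min(len(v), len(m))
    scores = {
        s: float(np.dot(v[max(0, -s):n - max(0, s)], m[max(0, s):n - max(0, -s)]))
        for s in range(-max_shift, max_shift + 1)
    }
    return max(scores, key=scores.get)  # offset with the strongest timing match
```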
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
Forget local semantic alignment: CAST unlocks temporally coherent video retrieval and generation by explicitly modeling visual state transitions.
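As a rough intuition for transition-aware matching (a toy stand-in, not CAST's method), one can represent a clip by the *change* between its start and end embeddings instead of a pooled average, so retrieval keys on temporal dynamics:

```python
import torch
import torch.nn.functional as F

def transition_embedding(frame_feats):
    """frame_feats: (T, D) per-frame features (assumed given)."""
    start, end = frame_feats[0], frame_feats[-1]
    return F.normalize(end - start, dim=-1)  # the visual state transition

def retrieve(query_feats, clip_bank):
    """Return the index of the clip whose state transition best matches the query."""
    q = transition_embedding(query_feats)
    sims = torch.stack([q @ transition_embedding(c) for c in clip_bank])
    return int(sims.argmax())
```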
AI-generated videos can now respect physics, thanks to a framework that uses a physical simulator to guide diffusion models, resulting in more realistic and coherent motion.
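A hedged sketch of what simulator guidance during sampling can look like: after each denoising step, nudge the predicted clean video toward the simulator's physically consistent rollout. `denoise_step`, `predict_x0`, and `simulate` are placeholder names for the diffusion model, its x0-prediction, and the physics simulator, not the framework's real interface.

```python
import torch

@torch.no_grad()
def guided_sample(x_t, timesteps, denoise_step, predict_x0, simulate, weight=0.1):
    for t in timesteps:
        x_t = denoise_step(x_t, t)
        x0 = predict_x0(x_t, t)              # current estimate of the clean video
        x0_phys = simulate(x0)               # simulator's physically consistent version
        x_t = x_t + weight * (x0_phys - x0)  # steer sampling toward the simulation
    return x_t
```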
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
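One plausible shape for such a mixed memory, sketched under assumptions (the captioner `summarize_clip` is hypothetical): keep recent observations as raw video, and compress anything about to be evicted into a text log so the horizon extends far beyond the raw buffer.

```python
from collections import deque

class HybridMemory:
    def __init__(self, summarize_clip, max_clips=8):
        self.summarize = summarize_clip        # hypothetical video -> text captioner
        self.clips = deque(maxlen=max_clips)   # recent raw clips, kept verbatim
        self.log = []                          # older events, compressed to text

    def add(self, clip):
        if len(self.clips) == self.clips.maxlen:
            self.log.append(self.summarize(self.clips[0]))  # summarize before eviction
        self.clips.append(clip)

    def context(self):
        """Everything the policy conditions on: long text history + short video window."""
        return {"text_history": self.log, "recent_video": list(self.clips)}
```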
Multimodal web agents are surprisingly vulnerable to cross-modal attacks, but a novel adversarial training approach can double task completion efficiency while mitigating these risks.
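For flavor, here is the generic cross-modal adversarial-training recipe (an FGSM-style perturbation of the visual channel), offered as an illustration rather than the paper's specific method: push the screenshot toward higher task loss, then train the agent on that worst-case view.

```python
import torch

def adversarial_loss(model, loss_fn, image, text, target, eps=4 / 255):
    """Return the task loss on an adversarially perturbed screenshot."""
    image = image.clone().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(model(image, text), target), image)[0]
    adv = (image + eps * grad.sign()).clamp(0, 1).detach()  # worst-case visual input
    return loss_fn(model(adv, text), target)                # backprop into model params
```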
DINOv2's impressive unimodal performance doesn't translate to cross-modal understanding, but a simple training tweak can align embeddings across RGB, depth, and segmentation without sacrificing feature quality.
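The "simple training tweak" reads like a contrastive alignment objective; under that assumption, a minimal sketch keeps the pre-trained RGB encoder frozen and trains the other modality's encoder to match it with an InfoNCE loss (encoders assumed given):

```python
import torch
import torch.nn.functional as F

def alignment_loss(rgb_enc, depth_enc, rgb_batch, depth_batch, temp=0.07):
    with torch.no_grad():
        z_rgb = F.normalize(rgb_enc(rgb_batch), dim=-1)    # frozen anchor features
    z_depth = F.normalize(depth_enc(depth_batch), dim=-1)  # trainable modality encoder
    logits = z_depth @ z_rgb.T / temp                      # (B, B) cross-modal similarities
    targets = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, targets)                # pull matched pairs together
```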
State-of-the-art emotion recognition in conversations is now possible by decoupling modality-specific context modeling from multimodal fusion, using a mixture-of-experts approach that doesn't require speaker identity.
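A minimal sketch of that decoupled design, under assumed feature sizes: each modality gets its own context encoder (a GRU here), and a separate mixture-of-experts layer handles fusion. Nothing below uses speaker identity; it is an illustration of the architecture shape, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, dims, d=256, n_experts=4):
        # dims: per-modality feature sizes, e.g. {"text": 768, "audio": 128, "video": 512}
        super().__init__()
        self.ctx = nn.ModuleDict({m: nn.GRU(k, d, batch_first=True) for m, k in dims.items()})
        self.experts = nn.ModuleList([nn.Linear(len(dims) * d, d) for _ in range(n_experts)])
        self.gate = nn.Linear(len(dims) * d, n_experts)

    def forward(self, feats):  # feats: {modality: (B, T, dim)} conversational context
        ctx = [self.ctx[m](x)[1][-1] for m, x in feats.items()]  # per-modality context
        h = torch.cat(ctx, dim=-1)                               # concatenated contexts
        w = torch.softmax(self.gate(h), dim=-1)                  # expert routing weights
        return sum(w[:, i:i + 1] * e(h) for i, e in enumerate(self.experts))
```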
Forget painstakingly labeling audio datasets: AuditoryHuM uses LLMs and targeted human input to automatically generate and cluster high-quality auditory scene labels.
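The pipeline shape is easy to sketch, with TF-IDF and k-means as stand-ins for whatever embedding and clustering the system actually uses: an LLM proposes free-text scene labels (that call is outside this snippet), the labels are clustered, and one representative per cluster goes to a human for targeted review.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_labels(labels, n_clusters=10):
    """labels: LLM-proposed free-text scene labels -> one representative per cluster."""
    vecs = TfidfVectorizer().fit_transform(labels)    # simple text embedding (stand-in)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vecs)
    reps = {}
    for i, c in enumerate(km.labels_):
        reps.setdefault(int(c), labels[i])            # first label seen in each cluster
    return reps  # {cluster_id: representative label} -> send to human verification
```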
Existing deforestation monitoring maps misclassify smallholder agroforestry as "forest," risking unfair penalties under regulations like the EUDR.
Despite recent advances, multimodal models still struggle to understand spatial relationships from an egocentric perspective, as shown by a 37.66% performance gap on the new SAW-Bench benchmark.
LLMs can now generate physics explanation videos up to 6 minutes long, but their visual reasoning and the reliability of auto-generated Manim code still need significant improvement.
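For context on why reliability is hard, here is a minimal scene using the real Manim Community API, the kind of snippet such systems must emit correctly; a multi-minute physics explainer chains hundreds of these steps, which is where generated code tends to break. The scene itself is an illustrative example, not output from the paper's system.

```python
from manim import Scene, Circle, Create, FadeOut, DOWN

class FallingBall(Scene):
    def construct(self):
        ball = Circle(radius=0.3)
        self.play(Create(ball))
        self.play(ball.animate.shift(3 * DOWN))  # simple "gravity" motion
        self.play(FadeOut(ball))
```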