13 papers from Berkeley AI Research (BAIR) on Multimodal Models
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
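A minimal sketch of what such a training-free verify-then-revise loop could look like. The `vlm` and `verify` callables are hypothetical stand-ins for a vision-language model call and a visual-grounding checker, not the paper's actual API:

```python
from typing import Callable

def grounded_refine(vlm: Callable[[str, bytes], str],
                    verify: Callable[[str, bytes], list[str]],
                    image: bytes,
                    question: str,
                    max_rounds: int = 3) -> str:
    """Training-free loop: answer, ground each claim against the image,
    then regenerate conditioned only on the verified evidence."""
    answer = vlm(question, image)
    for _ in range(max_rounds):
        supported = verify(answer, image)  # claims the checker confirms
        prompt = (f"Question: {question}\n"
                  f"Verified visual evidence: {'; '.join(supported) or 'none'}\n"
                  "Rewrite the answer using only this evidence.")
        revised = vlm(prompt, image)
        if revised == answer:  # converged: no further refinement needed
            break
        answer = revised
    return answer
```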
Teaching robots to manipulate objects just got easier: OCRA learns directly from human demonstration videos by focusing on object interactions and incorporating tactile feedback.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends only to the most informative patches.
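The core idea, pruning to a small subset of high-information visual tokens before the expensive transformer pass, fits in a few lines. Scoring patches by embedding norm is an illustrative assumption here, not AutoGaze's actual criterion:

```python
import numpy as np

def select_informative_patches(patch_embeddings: np.ndarray,
                               keep_ratio: float = 0.01) -> np.ndarray:
    """Keep only the top-k visual tokens by an informativeness score.

    Scoring by L2 norm is an illustrative stand-in; a real system might
    use attention rollout, saliency, or a learned gating head.
    """
    scores = np.linalg.norm(patch_embeddings, axis=1)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k, spatial order kept
    return patch_embeddings[keep]

# A 4K frame tiled into 16x16 patches yields 240 * 135 = 32,400 tokens;
# keeping 1% shrinks the sequence 100x, and since attention cost is
# quadratic in token count, the per-layer saving is larger still.
tokens = np.random.randn(32_400, 768)
print(select_informative_patches(tokens).shape)  # (324, 768)
```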
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
Multimodal web agents are surprisingly vulnerable to cross-modal attacks, but a novel adversarial training approach can double task completion efficiency while mitigating these risks.
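For readers unfamiliar with the setup, a generic adversarial-training step on the visual input looks roughly like the sketch below. The FGSM-style inner attack and the `model(image, text_ids) -> logits` interface are assumptions for illustration, not the paper's method:

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, optimizer, image, text_ids, target, eps=4/255):
    """One robust-training step: craft a worst-case visual perturbation,
    then update on both the clean and the perturbed input."""
    # Inner maximization: nudge the image in the loss-increasing direction.
    image = image.clone().requires_grad_(True)
    F.cross_entropy(model(image, text_ids), target).backward()
    adv_image = (image + eps * image.grad.sign()).clamp(0, 1).detach()

    # Outer minimization: average the clean and adversarial losses.
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(image.detach(), text_ids), target)
                  + F.cross_entropy(model(adv_image, text_ids), target))
    loss.backward()
    optimizer.step()
    return loss.item()
```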
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Achieve globally consistent 3D reconstruction over sequences exceeding 19,000 frames by combining test-time training with sliding window attention, outperforming prior state-of-the-art methods by over 74% in Absolute Trajectory Error (ATE) on KITTI.
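The sliding-window half of the recipe is straightforward to sketch; the test-time-training half, which carries state across windows, is what makes long sequences globally consistent and is not shown here:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where frame i attends only to frames within
    `window` steps, keeping attention cost O(N * W) instead of O(N^2)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Demo on a short sequence; the same mask pattern scales to 19,000
# frames because its memory and compute grow linearly with length.
mask = sliding_window_mask(2_000, 64)
print(mask.mean())  # fraction of frame pairs actually attended (~6%)
```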
You can now train autonomous-driving vision-language-action (VLA) models on 60% less data and without any reasoning annotations, thanks to a fix for difficulty bias in Group Relative Policy Optimization.
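Group Relative Policy Optimization scores each rollout against its prompt group's mean reward, and a commonly cited source of difficulty bias is the per-group standard-deviation division, which inflates updates on nearly-all-correct or nearly-all-wrong prompts. The sketch below shows that normalization and one published style of remedy (dropping the std term); whether this matches the paper's exact fix is an assumption:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, fix_bias: bool = True) -> np.ndarray:
    """Group-relative advantages for one prompt's G sampled rollouts.

    Standard GRPO uses A_i = (r_i - mean) / std; low-variance (too easy
    or too hard) prompts then get disproportionately large updates.
    Mean-centering without the std division is one way to remove that
    difficulty bias.
    """
    centered = rewards - rewards.mean()
    if fix_bias:
        return centered                       # mean-centering only
    return centered / (rewards.std() + 1e-8)  # classic normalization

easy = np.array([1.0, 1.0, 1.0, 0.0])          # mostly solved prompt
print(grpo_advantages(easy, fix_bias=False))   # inflated magnitudes
print(grpo_advantages(easy, fix_bias=True))    # tempered update
```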
Human-level 3D perception can emerge from a surprisingly simple, scalable learning objective using multi-view images, finally closing the gap between AI and human performance on this fundamental visual task.
An educational RAG system achieves 84% accuracy in answering student questions with minimal human editing, suggesting a practical path towards scalable AI-assisted teaching.
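The retrieve-then-generate pattern behind such a system can be sketched in a few lines. Cosine retrieval over precomputed embeddings and the `generate` callable are generic stand-ins, not the system's actual components:

```python
import numpy as np

def answer_student_question(question: str,
                            question_vec: np.ndarray,
                            doc_vecs: np.ndarray,
                            docs: list[str],
                            generate,  # hypothetical LLM callable
                            top_k: int = 3) -> str:
    """Retrieve the most relevant course material by cosine similarity,
    then generate an answer grounded in it."""
    sims = doc_vecs @ question_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-8)
    context = "\n".join(docs[i] for i in np.argsort(sims)[-top_k:][::-1])
    return generate(f"Course material:\n{context}\n\n"
                    f"Student question: {question}\nAnswer:")
```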
Forget clunky skeletons: this new model lets you prompt your way to accurate 3D human meshes from single images, even in the wildest poses.
Key contribution not extracted.
An end-to-end learned robotic system can now clean your kitchen in a completely new house, thanks to a novel co-training approach on diverse data.
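In its simplest form, co-training just means every batch mixes samples from several heterogeneous sources at fixed ratios; the actual sources and mixture weights used in the paper are not specified in this summary:

```python
import random

def cotrain_batches(datasets: dict[str, list], weights: dict[str, float],
                    batch_size: int = 32):
    """Yield batches drawn across heterogeneous datasets (e.g. in-home
    robot demos, web-scale video, data from other embodiments)
    according to fixed mixture weights."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    while True:
        batch = [random.choice(datasets[random.choices(names, weights=probs)[0]])
                 for _ in range(batch_size)]
        yield batch
```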