13 papers from Berkeley AI Research (BAIR) on Multimodal Models
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
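A minimal sketch of what such a training-free verify-then-revise loop could look like. The `vlm` and `verify` callables are hypothetical stand-ins for a vision-language model call and a visual-grounding checker, not the paper's actual API:

```python
from typing import Callable

def grounded_refine(vlm: Callable[[str, bytes], str],
                    verify: Callable[[str, bytes], list[str]],
                    image: bytes,
                    question: str,
                    max_rounds: int = 3) -> str:
    """Training-free loop: answer, ground each claim against the image,
    then regenerate conditioned only on the verified evidence."""
    answer = vlm(question, image)
    for _ in range(max_rounds):
        supported = verify(answer, image)  # claims the checker confirms
        prompt = (f"Question: {question}\n"
                  f"Verified visual evidence: {'; '.join(supported) or 'none'}\n"
                  "Rewrite the answer using only this evidence.")
        revised = vlm(prompt, image)
        if revised == answer:  # converged: no further refinement needed
            break
        answer = revised
    return answer
```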
Teaching robots to manipulate objects just got easier: OCRA learns directly from human demonstration videos by focusing on object interactions and incorporating tactile feedback.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends only to the most informative patches.
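The core idea, pruning to a small subset of high-information visual tokens before the expensive transformer pass, fits in a few lines. Scoring patches by embedding norm is an illustrative assumption here, not AutoGaze's actual criterion:

```python
import numpy as np

def select_informative_patches(patch_embeddings: np.ndarray,
                               keep_ratio: float = 0.01) -> np.ndarray:
    """Keep only the top-k visual tokens by an informativeness score.

    Scoring by L2 norm is an illustrative stand-in; a real system might
    use attention rollout, saliency, or a learned gating head.
    """
    scores = np.linalg.norm(patch_embeddings, axis=1)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k, spatial order kept
    return patch_embeddings[keep]

# A 4K frame tiled into 16x16 patches yields 240 * 135 = 32,400 tokens;
# keeping 1% shrinks the sequence 100x, and since attention cost is
# quadratic in token count, the per-layer saving is larger still.
tokens = np.random.randn(32_400, 768)
print(select_informative_patches(tokens).shape)  # (324, 768)
```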
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
Multimodal web agents are surprisingly vulnerable to cross-modal attacks, but a novel adversarial training approach can double task completion efficiency while mitigating these risks.
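For readers unfamiliar with the setup, a generic adversarial-training step on the visual input looks roughly like the sketch below. The FGSM-style inner attack and the `model(image, text_ids) -> logits` interface are assumptions for illustration, not the paper's method:

```python
import torch
import torch.nn.functional as F

def adversarial_step(model, optimizer, image, text_ids, target, eps=4/255):
    """One robust-training step: craft a worst-case visual perturbation,
    then update on both the clean and the perturbed input."""
    # Inner maximization: nudge the image in the loss-increasing direction.
    image = image.clone().requires_grad_(True)
    F.cross_entropy(model(image, text_ids), target).backward()
    adv_image = (image + eps * image.grad.sign()).clamp(0, 1).detach()

    # Outer minimization: average the clean and adversarial losses.
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(image.detach(), text_ids), target)
                  + F.cross_entropy(model(adv_image, text_ids), target))
    loss.backward()
    optimizer.step()
    return loss.item()
```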
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Achieve globally consistent 3D reconstruction over sequences exceeding 19,000 frames by combining test-time training with sliding window attention, outperforming prior state-of-the-art methods by over 74% in Absolute Trajectory Error (ATE) on KITTI.
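The sliding-window half of the recipe is straightforward to sketch; the test-time-training half, which carries state across windows, is what makes long sequences globally consistent and is not shown here:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where frame i attends only to frames within
    `window` steps, keeping attention cost O(N * W) instead of O(N^2)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Demo on a short sequence; the same mask pattern scales to 19,000
# frames because its memory and compute grow linearly with length.
mask = sliding_window_mask(2_000, 64)
print(mask.mean())  # fraction of frame pairs actually attended (~6%)
```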
You can now train autonomous-driving vision-language-action (VLA) models on 60% less data and without any reasoning annotations, thanks to a fix for difficulty bias in Group Relative Policy Optimization.
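Group Relative Policy Optimization scores each rollout against its prompt group's mean reward, and a commonly cited source of difficulty bias is the per-group standard-deviation division, which inflates updates on nearly-all-correct or nearly-all-wrong prompts. The sketch below shows that normalization and one published style of remedy (dropping the std term); whether this matches the paper's exact fix is an assumption:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, fix_bias: bool = True) -> np.ndarray:
    """Group-relative advantages for one prompt's G sampled rollouts.

    Standard GRPO uses A_i = (r_i - mean) / std; low-variance (too easy
    or too hard) prompts then get disproportionately large updates.
    Mean-centering without the std division is one way to remove that
    difficulty bias.
    """
    centered = rewards - rewards.mean()
    if fix_bias:
        return centered                       # mean-centering only
    return centered / (rewards.std() + 1e-8)  # classic normalization

easy = np.array([1.0, 1.0, 1.0, 0.0])          # mostly solved prompt
print(grpo_advantages(easy, fix_bias=False))   # inflated magnitudes
print(grpo_advantages(easy, fix_bias=True))    # tempered update
```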
Human-level 3D perception can emerge from a surprisingly simple, scalable learning objective using multi-view images, finally closing the gap between AI and human performance on this fundamental visual task.
An educational RAG system achieves 84% accuracy in answering student questions with minimal human editing, suggesting a practical path towards scalable AI-assisted teaching.
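The retrieve-then-generate pattern behind such a system can be sketched in a few lines. Cosine retrieval over precomputed embeddings and the `generate` callable are generic stand-ins, not the system's actual components:

```python
import numpy as np

def answer_student_question(question: str,
                            question_vec: np.ndarray,
                            doc_vecs: np.ndarray,
                            docs: list[str],
                            generate,  # hypothetical LLM callable
                            top_k: int = 3) -> str:
    """Retrieve the most relevant course material by cosine similarity,
    then generate an answer grounded in it."""
    sims = doc_vecs @ question_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-8)
    context = "\n".join(docs[i] for i in np.argsort(sims)[-top_k:][::-1])
    return generate(f"Course material:\n{context}\n\n"
                    f"Student question: {question}\nAnswer:")
```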
Forget clunky skeletons: this new model lets you prompt your way to accurate 3D human meshes from single images, even in the wildest poses.
Key contribution not extracted.
An end-to-end learned robotic system can now clean your kitchen in a completely new house, thanks to a novel co-training approach on diverse data.
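In its simplest form, co-training just means every batch mixes samples from several heterogeneous sources at fixed ratios; the actual sources and mixture weights used in the paper are not specified in this summary:

```python
import random

def cotrain_batches(datasets: dict[str, list], weights: dict[str, float],
                    batch_size: int = 32):
    """Yield batches drawn across heterogeneous datasets (e.g. in-home
    robot demos, web-scale video, data from other embodiments)
    according to fixed mixture weights."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    while True:
        batch = [random.choice(datasets[random.choices(names, weights=probs)[0]])
                 for _ in range(batch_size)]
        yield batch
```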