Vision-language-action models (VLAs) aren't just memorizing training data; sparse autoencoders reveal a hidden layer of generalizable motion primitives that can be steered to control robot behavior across tasks.
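A minimal sketch of that recipe, assuming a PyTorch policy whose hidden states we probe: fit a sparse autoencoder to layer activations, then add one learned dictionary direction back in at inference to bias behavior. The dimensions, feature index, and steering rule below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder for probing policy activations.
    Sizes are placeholders, not taken from the paper."""
    def __init__(self, d_model=1024, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, h):
        z = torch.relu(self.enc(h))   # sparse codes over learned features
        return self.dec(z), z

def steer(h, sae, feature_idx, alpha=4.0):
    """Add one dictionary direction to a hidden state, e.g. a feature
    that fires for a specific motion primitive, to bias behavior."""
    direction = sae.dec.weight[:, feature_idx]   # shape: (d_model,)
    return h + alpha * direction
```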
RADAR offers a scalable, interpretable framework for understanding robot policy generalization by directly linking test-time performance to the training data, revealing the specific types of generalization required.
Current Large Audio Language Models (LALMs) excel at speech recognition yet struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance.
Forget retraining: you can steer a robot's behavior in real-time by nudging its internal representations with lightweight, targeted interventions.
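One common form such a lightweight intervention takes is activation addition: compute a steering vector from contrasting contexts and inject it through a forward hook, leaving the weights frozen. The layer choice, scale, and helper names below are assumptions for illustration, not the paper's exact method.

```python
import functools
import torch

def steering_vector(acts_pos, acts_neg):
    """Difference-of-means steering vector: mean activation under
    contexts showing the target behavior minus the contrast contexts."""
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def add_vector_hook(module, inputs, output, v, alpha=2.0):
    # Forward hook: nudge the layer output at inference time.
    # No gradients, no retraining; the base policy stays frozen.
    return output + alpha * v

# Hypothetical usage: attach to one transformer block of the policy.
# v = steering_vector(acts_pos, acts_neg)
# handle = model.blocks[12].register_forward_hook(
#     functools.partial(add_vector_hook, v=v))
```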
Robots can now remember what they've done and what they need to do next for 15 minutes straight, thanks to a new memory architecture that mixes video and text.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Turns out, the best memory design for robotic manipulation depends heavily on the task, with no single architecture dominating across the board.
Forget OCR: powerful MLLMs can extract information from business documents just as accurately from raw images alone, challenging the necessity of traditional OCR pipelines.
Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
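A sketch of how those two objectives might sit side by side, under the standard conditional flow-matching setup. The teacher-alignment term here is an MSE placeholder marking where the paper's mode-seeking divergence would plug in, not its actual loss.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Supervised conditional flow matching on the (scarce) long-form
    clips: regress the velocity of the path x_t = (1 - t) x0 + t x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    return F.mse_loss(model(xt, t, cond), x1 - x0)

def local_alignment_loss(student_v, teacher_v):
    """Stand-in for mode-seeking alignment to the short-video teacher
    on local windows; the real objective (e.g. a reverse-KL-style
    distillation) is more involved than this MSE surrogate."""
    return F.mse_loss(student_v, teacher_v.detach())
```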
By unifying hand motion estimation and generation into a single diffusion framework, UniHand handles heterogeneous inputs and challenging conditions like occlusions better than task-specific models.
XR gets real: control virtual worlds with your head and hands, not just text prompts.
Achieve spatially faithful image-to-image translation without cross-domain supervision by bridging diffusion models with self-supervised semantic representations.
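One plausible instantiation of that bridge is feature-space guidance during sampling: at each step, nudge the sample so its denoised estimate stays close to the source image in a self-supervised embedding (e.g. DINO features). Everything named below (`denoiser`, `encoder`, the guidance scale) is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def semantic_guidance_step(x_t, x_src, denoiser, encoder, t, scale=1.0):
    """One guided sampling step, sketched: penalize drift between the
    denoised estimate and the source image in feature space, which
    preserves spatial layout without any cross-domain pairs."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                        # predicted clean image
    loss = F.mse_loss(encoder(x0_hat), encoder(x_src))
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - scale * grad).detach()             # nudge before next step
```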
Verification at test time can be a surprisingly effective alternative to scaling policy learning for vision-language-action alignment, yielding substantial gains in both simulated and real-world robotic tasks.
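The simplest version of test-time verification is best-of-N: sample several candidate actions from the frozen policy and let a learned verifier pick one. A sketch with assumed interfaces (`policy.sample` and `verifier` are illustrative, not the paper's API):

```python
import torch

def act_with_verifier(policy, verifier, obs, instruction, n=16):
    """Best-of-N at inference: draw candidate action chunks from the
    frozen policy and execute the one the verifier scores highest."""
    candidates = [policy.sample(obs, instruction) for _ in range(n)]
    scores = torch.stack([verifier(obs, instruction, a) for a in candidates])
    return candidates[int(scores.argmax())]
```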
Closing the reality gap: iteratively refining a world model with real-world robot data yields a significant boost in vision-language-action policy performance.
You can now detect harmful memes with 17% better accuracy and understand *why* they're toxic, thanks to a new framework that injects cultural context and explains its reasoning.
A unified Vision-Language Model and Diffusion architecture unlocks surprisingly effective optical flow forecasting from noisy web data, enabling language-conditioned robot control and video generation.
An end-to-end learned robotic system can now clean your kitchen in a completely new house, thanks to a novel co-training approach on diverse data.