Robots can now "see" hidden objects and understand articulation by learning from human egocentric video, even when they can't physically explore those occluded regions themselves.
Video generative models already contain powerful image restoration priors and can be coaxed into state-of-the-art restoration performance with just 1,000 training examples.
MLLMs can now handle 4K videos up to 100x faster thanks to AutoGaze, which selectively attends to only the most informative patches.
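As a rough intuition for how attending to "only the most informative patches" buys such speedups, here is a minimal, hypothetical sketch: score each visual token against the text query and keep only the top fraction before the LLM sees them. The scoring rule, `keep_ratio`, and function names are assumptions for illustration, not AutoGaze's published method.

```python
# Hypothetical top-k patch selection: keep only query-relevant visual tokens.
import torch

def select_informative_patches(patch_feats, query_feat, keep_ratio=0.01):
    """patch_feats: (N, d) visual tokens; query_feat: (d,) pooled text query."""
    scores = patch_feats @ query_feat                   # relevance score per patch
    k = max(1, int(keep_ratio * patch_feats.shape[0]))  # keeping ~1% -> ~100x fewer tokens
    top = torch.topk(scores, k).indices                 # indices of the k best patches
    return patch_feats[top], top

patches = torch.randn(100_000, 768)   # stand-in for a 4K video's patch tokens
query = torch.randn(768)
kept, idx = select_informative_patches(patches, query)
print(kept.shape)                     # torch.Size([1000, 768])
```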
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
By dynamically adjusting contrastive learning temperatures based on data density, MM-TS achieves state-of-the-art results on multimodal long-tail datasets.
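To make the "density-adjusted temperature" idea concrete, here is a hedged InfoNCE sketch: samples in dense (head) regions get a sharper temperature, sparse (tail) samples a softer one. The density proxy (mean k-NN similarity) and the linear scaling are illustrative assumptions, not MM-TS's actual formula.

```python
# Sketch of density-adaptive temperature for a cross-modal contrastive loss.
import torch
import torch.nn.functional as F

def adaptive_temperature_infonce(z_a, z_b, t_min=0.05, t_max=0.2, k=10):
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    sim = z_a @ z_b.T                                        # (B, B) cross-modal similarities
    # density proxy: mean similarity to the k nearest in-modality neighbours
    knn = torch.topk(z_a @ z_a.T, k + 1, dim=1).values[:, 1:].mean(dim=1)
    dens = (knn - knn.min()) / (knn.max() - knn.min() + 1e-8)
    tau = t_max - dens * (t_max - t_min)                     # dense -> t_min, sparse -> t_max
    logits = sim / tau.unsqueeze(1)                          # per-sample temperature
    targets = torch.arange(z_a.shape[0])
    return F.cross_entropy(logits, targets)

loss = adaptive_temperature_infonce(torch.randn(256, 512), torch.randn(256, 512))
```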
Forget hand-engineered features: this approach learns symbolic representations for robotic planning directly from pixels using VLMs, enabling impressive zero-shot generalization to new environments and goals.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Independently trained multimodal models like CLIP aren't so independent after all: a single orthogonal transformation can align their embedding spaces across both image and text modalities.
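The "single orthogonal transformation" claim maps naturally onto the classic orthogonal Procrustes problem; the sketch below shows that framing on synthetic paired embeddings. The shapes, data, and the assumption that Procrustes is the right solver here are illustrative, not taken from the paper.

```python
# Orthogonal Procrustes: find orthogonal R minimizing ||X R - Y||_F via SVD,
# then reuse the single rotation for both image and text embeddings.
import numpy as np

def orthogonal_alignment(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
R_true = np.linalg.qr(rng.normal(size=(512, 512)))[0]   # synthetic ground-truth rotation
X = rng.normal(size=(1000, 512))                        # embeddings from model A
Y = X @ R_true                                          # "model B" embeddings
R = orthogonal_alignment(X, Y)
print(np.allclose(X @ R, Y))                            # True: spaces aligned
```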
VLMs can be easily swayed by subtle, optimized visual prompts, revealing vulnerabilities in their decision-making processes that could be exploited in real-world applications.
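For readers unfamiliar with how such "subtle, optimized visual prompts" are made, here is a generic PGD-style loop: gradient steps on a small, norm-bounded perturbation that pushes the model toward an attacker-chosen output. This is a textbook pattern under assumed `model` and `target_loss_fn` callables, not the paper's specific attack.

```python
# Generic projected-gradient attack: optimize a bounded image perturbation.
import torch

def pgd_visual_prompt(model, image, target_loss_fn, eps=8/255, steps=100, lr=1/255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = target_loss_fn(model(image + delta))   # how far from the attacker's target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()           # step toward the target output
            delta.clamp_(-eps, eps)                   # keep the perturbation subtle
            delta.grad.zero_()
    return (image + delta).detach()
```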
By cleverly repurposing text-to-video diffusion models, VideoSketcher achieves high-quality sequential sketch generation from extremely limited human-drawn sketch data.
Injecting spatial transcriptomics data into existing pathology foundation models unlocks significant performance gains across a range of downstream tasks, including molecular status prediction and gene-to-image retrieval.
Quadrupedal robots can now nimbly navigate stairs and rough terrain thanks to a new multimodal RL approach that doesn't require probing the terrain with their front feet.
Forget expensive human annotation: this dual-loop method automatically cleans remote sensing image-text datasets, boosting T2I model performance by over 35%.
Forget hand-annotated data: ChartGen automatically generates 222.5K chart-image/code pairs, exposing surprising weaknesses in today's VLMs at reconstructing plotting scripts.
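A toy sketch of what generating a "chart-image/code pair" can look like: emit a plotting script as text, execute it to render the image, and store the two together as one training example. The template and file naming are assumptions; ChartGen's real pipeline is far richer.

```python
# Toy chart-image/code pair generator: the stored code is exactly what
# rendered the stored image.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering

def make_pair(seed, out_png):
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 10, size=5).tolist()
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar(range(5), {y})\n"
        f"plt.savefig({out_png!r})\n"
    )
    exec(code, {})           # render the image from the very code we store
    return out_png, code

img_path, plot_code = make_pair(0, "chart_0000.png")
```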
Achieving 80% accuracy on VQA v2.0 shows that combining Visual BERT, ViLT, and memory-augmented attention can significantly outperform traditional VQA models.