CMU Machine Learning

Multimodal Models

23 papers from CMU Machine Learning on Multimodal Models

Apr 21, 2026

CMU ML · Apr 21, 2026

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

Mid-training VLMs on a carefully curated subset of VLM data that is closely aligned with the VLA domain significantly boosts their performance on embodied tasks, rivaling much larger models.

Yiyang Du, Zhanqiu Guo, Xin Ye +2

Multimodal Models Robotics & Embodied AI Training Efficiency & Optimization

Apr 16, 2026

CMU ML · Apr 16, 2026 · also IIT Kanpur, IIT Kharagpur

Towards Design Compositing

Mismatched visual elements torpedo design harmony, but GIST offers a training-free fix that stylistically blends components, boosting aesthetic quality in existing pipelines.

Abhinav Mahajan, Abhikhya Tripathy, Sudeeksha Reddy Pala +3

Computer Vision Multimodal Models Natural Language Processing

Apr 16, 2026 · also CMU ML, BIT, PKU, SJTU

Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation

Dramatically improve multimodal recommendation accuracy without any training by initializing user embeddings with item modality features and user cluster information.

Jinfeng Xu, Zheyu Chen, Shuo Yang +6

Multimodal Models Recommendation & Information Retrieval Training Efficiency & Optimization

Apr 14, 2026

CMU ML · Apr 14, 2026 · also Microsoft Research

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Iterative visual refinement lets agents navigate dense coding IDEs with superhuman precision, outperforming single-shot methods and paving the way for more reliable software engineering agents.

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso

Computer Vision Multimodal Models Tool Use & Agents

CMU ML · Apr 14, 2026

Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Unlock 20x faster and more accurate 3D human-object contact estimation in complex, multi-person scenes with Pi-HOC, a framework that doesn't require object meshes.

Sravan Chittupalli, Ayush Jain, Dong Huang

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 12, 2026

CMU ML · Apr 12, 2026 · also UW-Madison

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Stop reimplementing multimodal models: TorchUMM offers a unified codebase for evaluation, analysis, and post-training, streamlining research across diverse architectures and tasks.

Yinyi Luo, Wenwen Wang, Hayes Bai +5

Eval Frameworks & Benchmarks Multimodal Models Open-Source Models & Weights

Apr 12, 2026 · also CMU ML, PKU

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Achieve sub-centimeter robotic placement accuracy from compositional language instructions by decomposing the task into visual goal representation and goal-conditioned execution.

Zhaofeng Hu, Sifan Zhou, Qinbo Zhang +3

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 9, 2026

CMU ML · Apr 9, 2026 · also Northeastern, Tongji

Visually-grounded Humanoid Agents

Imagine populating any 3D environment with digital humans that spontaneously navigate and interact, driven only by visual input and goals.

Hang Ye, Xiaoxuan Ma +7

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Adelaide University · Apr 9, 2026 · also CMU ML

Novel View Synthesis as Video Completion

Video diffusion models already contain implicit multi-view knowledge, making them surprisingly effective for novel view synthesis when adapted to ignore temporal coherence.

Qi Wu, Khiem Vuong +6

Computer Vision Multimodal Models

Mar 4, 2026

CMU ML · Mar 4, 2026 · also BAIR, MIT CSAIL, NVIDIA, Tsinghua AI +11

ManipulationNet: An Infrastructure for Benchmarking Real-World Robot Manipulation with Physical Skill Challenges and Embodied Multimodal Reasoning

Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.

Kenneth Kimble, Edward H. Adelson +23

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Feb 26, 2026

CMU ML · Feb 26, 2026 · also BJTU, Institute of Science Tokyo, Shanda AI Research Tokyo, UTokyo

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Finally, digital humans can have realistic, socially aware conversations: DyaDiT generates dyadic gestures that users strongly prefer over existing methods.

Yichen Peng, Jyun-Ting Song +14

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

CMU ML · Feb 26, 2026 · also Microsoft Research, Beihang

pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Forget monolithic models: pMoE shows that ensembling diverse expert prompts within a single model framework yields surprisingly large gains in visual adaptation across a wide range of tasks.

Shentong Mo, Xufang Luo +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Feb 25, 2026

Feb 25, 2026 · also CMU ML, UNC

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

By decomposing long-horizon manipulation into transport and object-centric interaction, LiLo-VLA achieves state-of-the-art zero-shot generalization and robustness, outperforming end-to-end VLA models by a large margin.

Shuo Cheng, Daniel Szafir

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Feb 23, 2026

Feb 23, 2026 · also CMU ML, ANU, NJU

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Forget cloud GPUs – a new model brings unified multimodal understanding and generation to your iPhone, running 6x faster than alternatives.

Abdelrahman M. Shaker, Ahmed Heakl +14

Computer Vision Inference & Quantization Multimodal Models

Feb 17, 2026

CMU ML · Feb 17, 2026

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.

Manav Nitin Kapadnis, Lawanya Baghel, Carolyn Rosé

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Multimodal Models

CMU ML · Feb 17, 2026 · also Georgia Tech, Purdue

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Forget slow text-based communication: Vision Wormhole unlocks faster multi-agent reasoning by turning VLMs into telepathic hubs, slashing runtime without sacrificing fidelity.

Xiaoze Liu, Ruowang Zhang +13

Multimodal Models Tool Use & Agents

CMU ML · Feb 17, 2026

GMAIL: Generative Modality Alignment for generated Image Learning

Stop treating generated images like real ones: GMAIL aligns them as separate modalities in a shared latent space, unlocking significant gains in vision-language tasks.

Shentong Mo, Sukmin Yun

Computer Vision Data Curation & Synthetic Data Multimodal Models

Feb 16, 2026

IIT · Feb 16, 2026 · also CMU ML

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.

Ayush Shrivastava, Kirtan Gangani, Laksh Jain +2

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Feb 14, 2026

BAIR · Feb 14, 2026 · also CMU ML, Google Research, Department of Computational and Data

TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment

Youngsun Wi, Jessica Yin +7

Multimodal Models Robotics & Embodied AI

Feb 13, 2026

Tsinghua AI · Feb 13, 2026 · also CMU ML, HIT, Lumos Robotics, Peking University +1

RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Forget static datasets – RL-based co-training unlocks +20% real-world VLA performance by interactively leveraging simulation while preserving real-world capabilities.

Yinuo Chen, Kang Chen, Tonghe Zhang +2

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

May 29, 2025

CMU ML · May 29, 2025 · also Florida State

Leveraging generative AI for cross-regional small object detection in satellite imagery

Synthetic data generated by fine-tuning Stable Diffusion on multi-region satellite imagery boosts small object detection accuracy by 20%, even when real labeled data is scarce.

Zheyang Qin, Stanislav Panev, Celso de Melo +3

Computer Vision Data Curation & Synthetic Data Multimodal Models

Apr 11, 2025

CMU ML · Apr 11, 2025 · also CAS

FlexDataset: Crafting Annotated Dataset Generation for Diverse Applications

Forget tedious manual annotation: FlexDataset crafts customized, high-fidelity annotated datasets with 5x faster annotation times using a composition-to-data approach.

Ellen Yi-Ge, Leo Shawn5

Computer Vision Data Curation & Synthetic Data Multimodal Models

Jan 2, 2025

CMU ML · Jan 2, 2025 · also Snap Research, TAU

Object-level Visual Prompts for Compositional Image Generation

Achieve semantically coherent image compositions by mixing layout-focused and appearance-focused visual representations in a diffusion model's cross-attention.

Gaurav Parmar, Or Patashnik, K. Wang +515

Computer Vision Multimodal Models
