May 1 – May 8, 2026

Multimodal Models - Weekly Roundup

88 papers published across 9 labs.

Selected Labs publishing this week

Tsinghua AI (2), ETH (2), Microsoft Research (2), BAIR (1), DAMO (1)

Top Papers

May 6, 2026

Universidad Autónoma de Madrid·2w ago

MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education

Forget bulky atlases and unreliable image searches: MIRAGE offers medical students a free, interactive tool to retrieve, generate, and understand medical images using only open-source models.

Miguel Díaz Benito, Cecilia Diana-Albelda, Álvaro García-Martín +3

Data Curation & Synthetic Data Multimodal Models Recommendation & Information Retrieval

2w ago·also CUHK, HKU, University of California

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

OpenSearch-VL offers a fully transparent recipe for training state-of-the-art multimodal search agents, finally democratizing access to a capability previously locked behind closed doors.

Shuang Chen, Kaituo Feng, Hangting Chen +7

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Jiayang Li +7·2w ago

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.

Jiayang Li, Shuo Cao, Xiaohui Li +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Sahar Askari +4·2w ago

Physiologically Grounded Driver Behavior Classification: SHAP-Driven Elite Feature Selection and Hybrid Gradient Boosting for Multimodal Physiological Signals

Driver-behavior classification accuracy rises from 73% to 81% when EEG, EMG, and GSR signals are fused, with the method pinpointing the physiological markers that matter most.

Sahar Askari, Mohammad Mahdi Mirza Ali Mohammadi, Fatemeh Ensafdoust +2

Interpretability & Mechanistic Interp Multimodal Models

2w ago

Gated Multimodal Learning for Interpretable Property Energy Performance Prediction and Retrofit Scenario Analysis

Forget expensive on-site inspections: this multimodal model uses assessor text and GIS data to accurately predict building energy performance, enabling scalable retrofit planning.

Yunfei Bai, Aaron Tesfa Tsion, Raul Rosales +2

Interpretability & Mechanistic Interp Multimodal Models Scientific Discovery & Drug Design

All Papers (88)

May 6, 2026

2w ago·also Huawei

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.

Hongxu Chen, Yanghao Wang, Bowei Zhu +6

Computer Vision Multimodal Models Training Efficiency & Optimization

2w ago

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.

Mohamed Elhabebe, Ayman El-Baz

Computer Vision Constitutional AI & AI Ethics Multimodal Models

Yangchen Yu +7·2w ago

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.

Yangchen Yu, Qian Chen, Jia Li +5

Multimodal Models Natural Language Processing Speech & Audio

University of Science and Technology·2w ago

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.

Huatian Zhang, Zhendong Mao, Lei Zhang +1

Computer Vision Multimodal Models RLHF & Preference Learning

University of Nebraska-Lincoln·2w ago·also Ohio State

Look Once, Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity

End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.

Avhishek Biswas, Apala Pramanik, Eylem Ekici +1

Computer Vision Multimodal Models Robotics & Embodied AI

Anju Rani +2·2w ago

DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring

A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

E. Denteh +3·2w ago

Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.

E. Denteh, Blessing Agyei Kyem, Joshua Kofi Asamoah +1

Computer Vision Multimodal Models

Yuanzhi Wang +9·2w ago

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng +7

Computer Vision Multimodal Models Natural Language Processing

Jingtao Liu +4·2w ago

Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding

Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.

Jingtao Liu, Peiliang Gong, Chuhang Zheng +2

Computer Vision Multimodal Models

2w ago·also ByteDance, SEU

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.

Zishen Qu, Xuesong Li, Haijian Gu +4

Computer Vision Multimodal Models Natural Language Processing

Binh Long Nguyen +4·2w ago

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.

Binh Long Nguyen, Kien Nguyen, S. Sridharan +2

Computer Vision Multimodal Models Robotics & Embodied AI

Jingtao Zhou +3·2w ago

SpecPL: Disentangling Spectral Granularity for Prompt Learning

Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.

Jingtao Zhou, Xirui Kang, Feiyang Huang +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Leying Zhang +4·2w ago

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.

Leying Zhang, Bowen Shi, Haibin Wu +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yuancheng Wei +9·2w ago

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Current image difference captioning benchmarks neither capture semantic consistency nor penalize hallucinations; DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.

Yuancheng Wei, Haojie Zhang, Linli Yao +7

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

2w ago

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli +3

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Wei Liu +2·2w ago

Open-Source Image Editing Models Are Zero-Shot Vision Learners

Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.

Wei Liu, Jiaxin Lin, Rui Chen

Computer Vision Multimodal Models Open-Source Models & Weights

2w ago·also HIT, PKU

Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification

Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.

Wen Wen, Hao Chen, Shiliang Zhang

Computer Vision Inference & Quantization Multimodal Models

Friedrich-Alexander University·2w ago·also Helmholtz, Imperial, Technical University Munich

Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.

Bernhard Kainz, Johanna P Mueller, Matthew Baugh +1

Computer Vision Multimodal Models Scientific Discovery & Drug Design

2w ago

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.

Yuan Wu, Zhiqiang Yan, Jiawei Lian +2

Computer Vision Multimodal Models Robotics & Embodied AI

CARIAD SE·2w ago·also TU Berlin, Vision & Robotics GmbH

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.

Gasser Elazab, Frank Neuhaus, Tilman Koß +5

Computer Vision Multimodal Models Robotics & Embodied AI

School of Computer Science·2w ago·also Hubei Key Laboratory of Multimedia and Network, Institute of Artificial Intelligence, National Engineering Research Center for Multimedia, WHU

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.

Haibin He, Maoyuan Ye, Juhua Liu +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Jinzhen Han +3·2w ago

Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation

Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.

Jinzhen Han, JinByeong Lee, Jisung Kim +1

Computer Vision Multimodal Models

Laura Bravo-Sánchez +5·2w ago

Anny-Fit: All-Age Human Mesh Recovery

Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.

Laura Bravo-Sánchez, M. Armando, Romain Brégier +3

Computer Vision Multimodal Models Robotics & Embodied AI

Qiming Li +11·2w ago

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%—no training required.

Qiming Li, Zekai Ye, Xiaocheng Feng +9

Computer Vision Multimodal Models

Boyue Xu +3·2w ago

VL-UniTrack: A Unified Framework with Visual-Language Prompts for UAV-Ground Visual Tracking

Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.

Boyue Xu, Ruichao Hou, Tongwei Ren +1

Computer Vision Multimodal Models Robotics & Embodied AI

Liang Yao +8·2w ago

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning, not relying on human-annotated coordinates.

Liang Yao, Fan Liu, Shengxiang Xu +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Xiaojian Li +8·2w ago

Autonomous Laparoscope Control through Unified Mechanics-Based Representation of Multimodal Intraoperative Information

Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.

Xiaojian Li, Jin Fang, Yudong Shi +6

Computer Vision Multimodal Models Robotics & Embodied AI

Cyril Allauzen +4·2w ago

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.

Cyril Allauzen, Tom Bagby, G. Heigold +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yating Wang +4·2w ago

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

Make your prompts 5x more interpretable without hurting accuracy: IPL combines discrete token selection with continuous optimization, and it's plug-and-play with existing methods.

Yating Wang, Yaqi Zhao, Yongshun Gong +2

Interpretability & Mechanistic Interp Multimodal Models Training Efficiency & Optimization

2w ago

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.

Wei Li, Jizhihui Liu, Li Yixing +3

Computer Vision Multimodal Models Robotics & Embodied AI

Yupeng Gao +3·2w ago

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.

Yupeng Gao, Tianyu Li, Guoqing Wang +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

2w ago

Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.

Yiran Qiao, Yiren Lu, Yunlai Zhou +5

Computer Vision Multimodal Models

2w ago·also Key Lab of MIMS, Northwestern, School of Computer Science and Engineering

A cross-modal network for facial expression recognition

Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.

Chunwei Tian, Jingyuan Xie, Qi Zhang +3

Computer Vision Multimodal Models

Shuo Liu +5·2w ago

Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding

Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.

Shuo Liu, Lei Shi, Haowen Liu +3

Computer Vision Multimodal Models Robotics & Embodied AI

Jiaming Hu +4·2w ago

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.

Jiaming Hu, Jiamu Bai, Haoyu Wang +2

Computer Vision Multimodal Models RLHF & Preference Learning

Zicheng Zhao +3·2w ago

From Priors to Perception: Grounding Video-LLMs in Physical Reality

Video-LLMs aren't failing at perception, they're being tricked by their own assumptions, but a new dataset and reasoning chain can fix it.

Zicheng Zhao, Chaofan Gan, Shijie Li +1

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Anagh Malik +5·2w ago

Velox: Learning Representations of 4D Geometry and Appearance

Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.

Anagh Malik, Dorian Chan, Xiaoming Zhao +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Lihua Zhou +8·2w ago

Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.

Lihua Zhou, Mao Ye, Xiatian Zhu +6

Computer Vision Multimodal Models

Muyao Peng +4·2w ago

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.

Muyao Peng, Shun Zou, Pei An +2

Computer Vision Multimodal Models Robotics & Embodied AI

Zhiwei Yang +5·2w ago·also CAS

DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.

Zhiwei Yang, Pengfei Song, Yucong Meng +3

Computer Vision Multimodal Models

Kai Zou +3·2w ago

Advancing Aesthetic Image Generation via Composition Transfer

Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.

Kai Zou, Zhiwei Zhao, Bin Liu +1

Computer Vision Multimodal Models

Tsinghua AI·2w ago·also BAIR

Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.

Xiaopei Zhu, Guanning Zeng, Zhanhao Hu +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Yihan Lin +6·2w ago

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.

Yihan Lin, Haoyang Li, Yang Li +4

Computer Vision Multimodal Models Robotics & Embodied AI

Andranik Sargsyan +1·2w ago

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.

Andranik Sargsyan, Shant Navasardyan

Computer Vision Multimodal Models

Phenikaa University·2w ago

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le +3

Computer Vision Multimodal Models

Wei Luo +34·2w ago

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Current video generation benchmarks overlook crucial aspects of physical plausibility and temporal coherence, highlighting the need for holistic evaluation metrics like PhyScore.

Wei Luo, Yiting Lu, Xin Li +32

Eval Frameworks & Benchmarks Multimodal Models World Models & Planning

May 5, 2026

Free University of Bozen-Bolzano·2w ago

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Get expert-level feedback on your performance, not just a score, thanks to a new approach that uses language generation for proficiency estimation.

E. Bianchi, Antonio Liotta

Computer Vision Multimodal Models Training Efficiency & Optimization

Dongyoung Kim +67·2w ago

RLDX-1 Technical Report

RLDX-1 achieves double the success rate of existing VLAs on complex humanoid tasks, suggesting a leap in robots' ability to handle contact-rich, dynamic manipulation.

Dongyoung Kim, Huiwon Jang, Myungkyu Koo +65

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Lin Song +18·2w ago

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Bidirectional interaction between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables a unified multimodal model to achieve spatial intelligence beyond general visual competence.

Lin Song, Wenbo Li, Guoqing Ma +16

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

2w ago·also HKU, Rice

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

A hierarchical agent that separates visual and textual contexts drastically improves multi-step reasoning on complex charts, outperforming monolithic MLLMs.

Qihua Dong, Ruozhen He, Junwen Chen +4

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Achuth Chandrasekhar +3·2w ago

Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining

Automating materials science database construction is now feasible: a multi-agent system extracts structured data from scientific literature with high speed and accuracy.

Achuth Chandrasekhar, Omid Barati Farimani, Radheesh Sharma Meda +1

Multimodal Models Scientific Discovery & Drug Design Tool Use & Agents

DAMO·2w ago

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.

Zhipeng Xu, Junhao Ji, Zulong Chen +10

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Akshay Syal +4·2w ago

A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.

Akshay Syal, L. Prince, E. Gultepe +2

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

ETH·2w ago

Laundering AI Authority with Adversarial Examples

Production VLMs like GPT-4, Claude Opus, Gemini, and Grok can be easily manipulated into confidently providing false information via subtle adversarial perturbations to images, even without compromising model alignment.

Jie Zhang, Pura Peetathawatchai, Florian Tramèr +1

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Mustafa Sakhaia +3·2w ago

InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making

Event cameras can significantly boost the reliability of autonomous driving in high-dynamic-range and high-speed scenarios, achieving perfect route completion in CARLA benchmarks.

Mustafa Sakhaia, Kaung Sithua, Min Khant Soe Okea +1

Computer Vision Multimodal Models Robotics & Embodied AI

Kristy Sakano +2·2w ago

From Language to Logic: A Theoretical Architecture for VLM-Grounded Safe Navigation

Guaranteeing safe robot navigation in unstructured environments just got easier: translate human language rules into formal logic, ground them with VLMs, and let the robot navigate.

Kristy Sakano, Kalonji Harrington, Mumu Xu

Multimodal Models Reasoning & Chain-of-Thought Robotics & Embodied AI

Hao Wu +12·2w ago

RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Robot video world models can be significantly improved by distilling a multimodal reward function and stabilizing long-horizon inference, leading to better instruction following and manipulation accuracy.

Hao Wu, Yuqi Li, Yuan Gao +10

Multimodal Models Robotics & Embodied AI World Models & Planning

Zhiyuan Li +6·2w ago

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

Robots can now learn manipulation skills from human videos with greater morphological accuracy and temporal consistency, thanks to a new method that disentangles task and embodiment.

Zhiyuan Li, Wenyan Yang, Wenshuai Zhao +4

Computer Vision Multimodal Models Robotics & Embodied AI

Timon Homberger +4·2w ago

FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

Achieve scalable open-vocabulary semantic maps of entire buildings by fusing both dense and instance-level semantic information in a novel dual-layer voxel representation.

Timon Homberger, F. Busch, Jesús Gerardo Ortega Peimbert +2

Computer Vision Multimodal Models Robotics & Embodied AI

Chenhao Yu +5·2w ago

BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

Unlock agile humanoid robots by ditching teleoperation and training directly from human VR demos.

Chenhao Yu, Hongwu Wang, Youhao Hu +3

Data Curation & Synthetic Data Multimodal Models Robotics & Embodied AI

University of Surrey·2w ago

TACO: Trajectory Aligning Cross-view Optimisation

Ditch the GPS: This CVGL pipeline achieves a 5.9x improvement in localization accuracy over IMU-only by intelligently fusing satellite imagery with inertial measurements, needing only a single initial GPS fix.

Tavis Shore, Oscar Mendez, Simon Hadfield

Computer Vision Multimodal Models Robotics & Embodied AI

Jingjing Zhou +7·2w ago

Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection

Multimodal graph unlearning doesn't have to destroy utility: carefully protecting high-dimensional input projections during the unlearning process preserves performance while still enabling effective forgetting.

Jingjing Zhou, Yongshuai Yang, Qing Qing +5

Multimodal Models Training Efficiency & Optimization

Xun Jiang +7·2w ago

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

Conformal prediction offers a surprisingly effective way to handle both modality imbalance and noisy corruption in multimodal learning by explicitly modeling predictive uncertainty during training.

Xun Jiang, Yufan Gu, Disen Hu +5

Data Curation & Synthetic Data Multimodal Models Training Efficiency & Optimization

Jing Gong·2w ago

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.

Jing Gong

Multimodal Models Open-Source Models & Weights Speech & Audio

May 4, 2026

AI2·2w ago·also NUS, UW, JHU, UMich +1

MolmoAct2: Action Reasoning Models for Real-world Deployment

Open-sourcing a VLA model that beats closed-source giants on embodied reasoning tasks could finally make real-world robot deployment practical.

Haoquan Fang, Jiafei Duan, Donovan Clay +26

Multimodal Models Open-Source Models & Weights Robotics & Embodied AI

2w ago·also AI Laboratory, HKUST

Perceptual Flow Network for Visually Grounded Reasoning

LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.

Yangfu Li, Yuning Gong, Hongjian Zhan +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

2w ago·also National Center for High-Performance, National Chung Cheng University

Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation

Achieve state-of-the-art object detection in multi-camera surveillance without compromising data privacy by fusing models trained on synthetically augmented and federated data.

Peggy Joy Lu, Wei-Yu Chen, Yao-Tsung Huang +1

Computer Vision Data Curation & Synthetic Data Multimodal Models

ETH·2w ago·also UZH

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Pehuén Moure, Niclas Pokel, Bilal Bounajma +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yian Zhao +6·2w ago

Video Generation with Predictive Latents

Encoding temporal prediction into video VAEs unlocks faster training, better generative performance, and improved downstream task performance, all at once.

Yian Zhao, Feng Wang, Qiushan Guo +4

Computer Vision Multimodal Models World Models & Planning

May 3, 2026

2w ago·also Microsoft Research, Forschungszentrum Jülich GmbH, Snowflake

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

Optimizing for runtime in multimodal training can be energy-inefficient, as data movement and overlap on Grace Hopper chips dominate energy consumption, not raw compute.

Mahmoud Ahmed, Sameh Abdulah, Olatunji Ruwase +4

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

2w ago

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

MLLMs hallucinate less when you nudge them to pay more attention to non-text inputs during inference, without any training.

Itai Allouche, Joseph Keshet

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Xiaoda Yang +12·2w ago

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

Current audio-visual models nail unimodal quality but still struggle to make music and dance move together rhythmically, highlighting a key gap TMD-Bench is designed to address.

Xiaoda Yang, Majun Zhang, Changhao Pan +10

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Xinmeng Xu +5·2w ago

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Audio-visual models can be significantly improved by delaying perceptual commitment, correcting intermediate fusion states only when they have sufficient cross-layer and cross-modal support.

Xinmeng Xu, Haoran Xie, S. Joe Qin +3

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

May 2, 2026

Zhaoyang Li +2·3w ago

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

Standard hard projection in multi-modal point cloud completion severs the connection between modalities, but SplAttN's differentiable Gaussian splatting fixes this, leading to state-of-the-art results.

Zhaoyang Li, Zhichao You, Tianrui Li

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

3w ago·also HKU, Tencent AI

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Forget sifting through walls of text – now you can pinpoint exactly where the AI found its answer, down to the pixel, even in complex visuals like charts and diagrams.

Peiyang Liu, Ziqiang Cui, Xi Wang +2

Computer Vision Multimodal Models Recommendation & Information Retrieval

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Minghui Chen +7·3w ago

Online Self-Calibration Against Hallucination in Vision-Language Models

LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, Hengjie Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

Yi Wang +17·3w ago

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.

Yi Wang, Xincheng Li, Pengwei Xie +15

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

Stanford HAI·3w ago·also Tsinghua AI, Beihang, CUHK, HKUST +1

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.

Houyuan Chen, Hong Li, Xianghao Kong +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Microsoft Research·3w ago·also SNU

Map2World: Segment Map Conditioned Text to 3D World Generation

Forget grid layouts: Map2World lets you generate consistent 3D worlds from arbitrary segment maps, offering unprecedented control and scalability.

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang +2

Computer Vision Multimodal Models World Models & Planning

Yan Fang +9·3w ago

Let ViT Speak: Generative Language-Image Pre-training

Ditch the complex multimodal pre-training pipelines: GenLIP proves a simple language modeling objective can effectively align vision encoders with LLMs, achieving strong performance with less data.

Yan Fang, Mengcheng Lan, Zilong Huang +7

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Siyuan Huang +8·3w ago

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

LVLMs can maintain sharper visual focus during long-form generation by adding a lightweight, learnable memory module that bypasses attention dilution.

Siyuan Huang, Xiaoye Qu, Yafu Li +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Massimo Rondelli +2·3w ago

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

LLMs can now generate 3D objects from text that are syntactically correct and geometrically consistent 70% of the time, thanks to retrieval-augmented code synthesis.

Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval