May 4 – May 11, 2026

Multimodal Models - Weekly Roundup

75 papers published across 7 labs.

279% acceleration

Selected Labs publishing this week

Tsinghua AI (2), ETH (2), BAIR (1), DAMO (1), AI2 (1)

Top Papers

May 6, 2026

Universidad Autónoma de Madrid·3w ago

MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education

Forget bulky atlases and unreliable image searches: MIRAGE offers medical students a free, interactive tool to retrieve, generate, and understand medical images using only open-source models.

Miguel Díaz Benito, Cecilia Diana-Albelda, Álvaro García-Martín +3

Data Curation & Synthetic Data Multimodal Models Recommendation & Information Retrieval

May 8, 2026

Tsinghua AI·3w ago·also Cambridge, IMATI-CNR

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

Reconstructing 3D animals in the wild just got a whole lot easier, even in crowded and occluded scenes, thanks to a new promptable framework.

Xu Hu, J. Lyu, Jiuming Liu +4

Computer Vision Multimodal Models Robotics & Embodied AI

May 6, 2026

Tsinghua AI·3w ago·also CUHK, HKU, Tencent AI, University of California

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

OpenSearch-VL offers a fully transparent recipe for training state-of-the-art multimodal search agents, finally democratizing access to a capability previously locked behind closed doors.

Shuang Chen, Kaituo Feng, Hangting Chen +7

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Jiayang Li +7·3w ago

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.

Jiayang Li, Shuo Cao, Xiaohui Li +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Sahar Askari +4·3w ago

Physiologically Grounded Driver Behavior Classification: SHAP-Driven Elite Feature Selection and Hybrid Gradient Boosting for Multimodal Physiological Signals

Decoding driver behavior jumps from 73% to 81% accuracy by fusing EEG, EMG, and GSR signals, pinpointing the physiological markers that matter most.

Sahar Askari, Mohammad Mahdi Mirza Ali Mohammadi, Fatemeh Ensafdoust +2

Interpretability & Mechanistic Interp Multimodal Models

All Papers (75)

May 6, 2026

3w ago·also TokenRhythm AI

Gated Multimodal Learning for Interpretable Property Energy Performance Prediction and Retrofit Scenario Analysis

Forget expensive on-site inspections: this multimodal model uses assessor text and GIS data to accurately predict building energy performance, enabling scalable retrofit planning.

Yunfei Bai, Aaron Tesfa Tsion, Raul Rosales +2

Interpretability & Mechanistic Interp Multimodal Models Scientific Discovery & Drug Design

3w ago·also Huawei

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Decoupling radial and angular dynamics in vision-language model adaptation unlocks significant gains in few-shot performance, outperforming existing flow matching methods.

Hongxu Chen, Yanghao Wang, Bowei Zhu +4

Computer Vision Multimodal Models Training Efficiency & Optimization

3w ago

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Training vision-language models to detect glaucoma fairly across demographics requires debiasing both text *and* images, which this paper achieves with a novel pretraining strategy.

Mohamed Elhabebe, Ayman El-Baz

Computer Vision Constitutional AI & AI Ethics Multimodal Models

Yangchen Yu +7·3w ago·also HKUST, SMU

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Standard multimodal fusion can hurt performance in emotion recognition, but this new approach knows when to drop modalities, leading to state-of-the-art results.

Yangchen Yu, Qian Chen, Jia Li +5

Multimodal Models Natural Language Processing Speech & Audio

University of Science and Technology·3w ago

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.

Huatian Zhang, Zhendong Mao, Yongdong Zhang

Computer Vision Multimodal Models RLHF & Preference Learning

University of Nebraska-Lincoln·3w ago·also Ohio State

Look Once, Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity

End-to-end ML models get smoked in real-world mmWave vehicular connectivity: a hybrid vision-primed approach slashes outage rates by leveraging model-based reasoning and RF feedback.

Avhishek Biswas, Apala Pramanik, Eylem Ekici +1

Computer Vision Multimodal Models Robotics & Embodied AI

Anju Rani +2·3w ago

DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring

A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

E. Denteh +3·3w ago

Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

Achieve near-perfect traffic congestion classification by fusing motion-guided visual attention with data-adaptive temporal decomposition, outperforming existing vision-based and signal-based methods.

E. Denteh, Blessing Agyei Kyem, Joshua Kofi Asamoah +1

Computer Vision Multimodal Models

Yuanzhi Wang +9·3w ago·also SJTU

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng +7

Computer Vision Multimodal Models Natural Language Processing

Jingtao Liu +4·3w ago·also Huawei

Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding

Achieve 80.5% Top-1 accuracy in zero-shot EEG-to-image retrieval by mimicking the human visual system, substantially outperforming existing methods.

Jingtao Liu, Peiliang Gong, Chuhang Zheng +2

Computer Vision Multimodal Models

3w ago·also ByteDance, SEU

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Ditching diffusion's noise-denoising, RLFSeg uses Rectified Flow to directly predict segmentation masks from text prompts, unlocking zero-shot performance gains.

Zishen Qu, Haijian Gu, Hongwei Kang +3

Computer Vision Multimodal Models Natural Language Processing

Binh Long Nguyen +4·3w ago

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Unlock zero-shot 3D scene understanding: Ilov3Splat lets you identify and segment arbitrary objects in 3D scenes using only natural language, no category supervision needed.

Binh Long Nguyen, Kien Nguyen, S. Sridharan +2

Computer Vision Multimodal Models Robotics & Embodied AI

Jingtao Zhou +3·3w ago

SpecPL: Disentangling Spectral Granularity for Prompt Learning

Freezing your VAE and permuting high-frequency visual signals unlocks a new SOTA for VLM prompt learning, boosting harmonic-mean accuracy to 81.51%.

Jingtao Zhou, Xirui Kang, Feiyang Huang +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Leying Zhang +4·3w ago

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.

Leying Zhang, Bowen Shi, Haibin Wu +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yuancheng Wei +8·3w ago

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.

Yuancheng Wei, Haojie Zhang, Linli Yao +6

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

VLMs can be easily tricked into "hallucinating" object relationships with simple image rotations or noise, revealing a surprising fragility in their multimodal reasoning.

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

3w ago

Open-Source Image Editing Models Are Zero-Shot Vision Learners

Open-source image editing models can match or beat fine-tuned models on visual understanding tasks *without any task-specific training*.

Wei Liu, Jiaxin Lin

Computer Vision Multimodal Models Open-Source Models & Weights

3w ago·also HIT, HKUST, PKU

Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification

Freezing a text encoder and distilling prompts from vision-language models can stabilize semantics and boost performance in lifelong person re-identification, even across unseen domains.

Wen Wen, Hao Chen, Shiliang Zhang

Computer Vision Inference & Quantization Multimodal Models

Friedrich-Alexander University·3w ago·also Helmholtz, Imperial, Technical University Munich

Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.

Bernhard Kainz, Johanna P Mueller, Matthew Baugh +1

Computer Vision Multimodal Models Scientific Discovery & Drug Design

3w ago·also Beihang

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

By intelligently incorporating LiDAR-derived height information, HiPR overcomes limitations of fixed projection spaces, achieving state-of-the-art camera-LiDAR occupancy prediction with real-time performance.

Yuan Wu, Zhiqiang Yan, Jiawei Lian +2

Computer Vision Multimodal Models Robotics & Embodied AI

CARIAD SE·3w ago·also TU Berlin, Vision & Robotics GmbH

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

Finally, a driving dataset that doesn't just assume perfectly paved roads, offering 6.5x more depth data than KITTI for realistic autonomous driving scenarios.

Gasser Elazab, Frank Neuhaus, Tilman Koß +5

Computer Vision Multimodal Models Robotics & Embodied AI

School of Computer Science·3w ago·also Hubei Key Laboratory of Multimedia and Network, Institute of Artificial Intelligence, National Engineering Research Center for Multimedia, WHU

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.

Haibin He, Maoyuan Ye, Juhua Liu +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Jinzhen Han +3·3w ago

Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation

Encoding cross-task relationships between building footprints and heights slashes height estimation error by 7% – more effective than just refining individual encoders.

Jinzhen Han, JinByeong Lee, Jisung Kim +1

Computer Vision Multimodal Models

Laura Bravo-Sánchez +5·3w ago

Anny-Fit: All-Age Human Mesh Recovery

Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.

Laura Bravo-Sánchez, M. Armando, Romain Brégier +3

Computer Vision Multimodal Models Robotics & Embodied AI

Qiming Li +11·3w ago·also Faculty of Computing, HIT

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

Steer LVLMs' attention with caption guidance and watch object hallucinations drop by 6%—no training required.

Qiming Li, Zekai Ye, Xiaocheng Feng +9

Computer Vision Multimodal Models

Boyue Xu +3·3w ago

VL-UniTrack: A Unified Framework with Visual-Language Prompts for UAV-Ground Visual Tracking

Bridging the gap between aerial and ground-level tracking, VL-UniTrack uses visual-language prompts to achieve robust object tracking even with significant viewpoint differences.

Boyue Xu, Ruichao Hou, Tongwei Ren +1

Computer Vision Multimodal Models Robotics & Embodied AI

Liang Yao +8·3w ago

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning, not relying on human-annotated coordinates.

Liang Yao, Fan Liu, Shengxiang Xu +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Xiaojian Li +8·3w ago·also Institute of Software, Intelligent Software Research Center

Autonomous Laparoscope Control through Unified Mechanics-Based Representation of Multimodal Intraoperative Information

Achieve autonomous laparoscope control by translating multimodal surgical data into a single "wrench" that guides the robot's movements.

Xiaojian Li, Jin Fang, Yudong Shi +6

Computer Vision Multimodal Models Robotics & Embodied AI

Cyril Allauzen +4·3w ago

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.

Cyril Allauzen, Tom Bagby, G. Heigold +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yating Wang +4·3w ago

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

Make your prompts 5x more interpretable without hurting accuracy: IPL combines discrete token selection with continuous optimization, and it's plug-and-play with existing methods.

Yating Wang, Yaqi Zhao, Yongshun Gong +2

Interpretability & Mechanistic Interp Multimodal Models Training Efficiency & Optimization

3w ago·also Shenzhen Loop Area Institute

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.

Wei Li, Jizhihui Liu, Li Yixing +3

Computer Vision Multimodal Models Robotics & Embodied AI

Yupeng Gao +3·3w ago

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.

Yupeng Gao, Tianyu Li, Guoqing Wang +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago·also UQ

Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

Forget training from scratch: surprisingly, off-the-shelf 2D diffusion models can unlock generalizable style control in 3D generation models, even for out-of-distribution styles.

Yiran Qiao, Yiren Lu, Yunlai Zhou +5

Computer Vision Multimodal Models

3w ago·also Key Lab of MIMS, Northwestern

A cross-modal network for facial expression recognition

Face symmetry and half-face alignment can be combined to achieve state-of-the-art facial expression recognition.

Chunwei Tian, Jingyuan Xie, Qi Zhang +3

Computer Vision Multimodal Models

Shuo Liu +5·3w ago·also NJU

Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding

Stop feeding LLMs redundant and conflicting sensor data in autonomous driving: a new architecture slashes hallucinated entities by coordinating multi-sensor inputs *before* reasoning.

Shuo Liu, Lei Shi, Haowen Liu +3

Computer Vision Multimodal Models Robotics & Embodied AI

Jiaming Hu +4·3w ago

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.

Jiaming Hu, Jiamu Bai, Haoyu Wang +2

Computer Vision Multimodal Models RLHF & Preference Learning

Chaofan Gan +2·3w ago

From Priors to Perception: Grounding Video-LLMs in Physical Reality

Video-LLMs aren't failing at perception, they're being tricked by their own assumptions, but a new dataset and reasoning chain can fix it.

Chaofan Gan, Shijie Li, Weiyao Lin

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Anagh Malik +5·3w ago

Velox: Learning Representations of 4D Geometry and Appearance

Unlock efficient 4D object understanding from dynamic point clouds with Velox, a representation that's descriptive, compressive, and accessible.

Anagh Malik, Dorian Chan, Xiaoming Zhao +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Lihua Zhou +8·3w ago

Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.

Lihua Zhou, Mao Ye, Xiatian Zhu +6

Computer Vision Multimodal Models

Muyao Peng +4·3w ago

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Even with noisy initial matches, Angle-I2P leverages angular consistency and hierarchical attention to achieve state-of-the-art image-to-point cloud registration.

Muyao Peng, Shun Zou, Pei An +2

Computer Vision Multimodal Models Robotics & Embodied AI

Zhiwei Yang +4·3w ago

DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

By fusing CLIP with a diffusion model, DiCLIP unlocks surprisingly strong weakly supervised segmentation, outperforming prior methods and slashing training costs.

Zhiwei Yang, Pengfei Song, Yucong Meng +2

Computer Vision Multimodal Models

Kai Zou +3·3w ago

Advancing Aesthetic Image Generation via Composition Transfer

Stop letting semantics dictate composition: Composer unlocks semantic-agnostic control over image aesthetics, letting you transfer and plan compositions with unprecedented precision.

Kai Zou, Zhiwei Zhao, Bin Liu +1

Computer Vision Multimodal Models

Tsinghua AI·3w ago·also BAIR

Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.

Xiaopei Zhu, Guanning Zeng, Zhanhao Hu +1

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Yihan Lin +6·3w ago·also UMich

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Image-based latent actions are your secret weapon for long-horizon reasoning in VLAs, while action-based latent actions unlock complex motor coordination.

Yihan Lin, Haoyang Li, Yang Li +4

Computer Vision Multimodal Models Robotics & Embodied AI

Andranik Sargsyan +1·3w ago

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

FlowDIS achieves state-of-the-art dichotomous image segmentation by using flow matching, even allowing for precise, pixel-level control via text prompts.

Andranik Sargsyan, Shant Navasardyan

Computer Vision Multimodal Models

Phenikaa University·3w ago

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

ScriptHOI reveals that current HOI detectors over-rely on object affordance and phrase co-occurrence, and proposes a novel approach to explicitly model interaction scripts for improved open-vocabulary generalization.

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le +3

Computer Vision Multimodal Models

Yiting Lu +30·3w ago·also CAS, HKUST, PKU, SJTU +2

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Current video generation benchmarks overlook crucial aspects of physical plausibility and temporal coherence, highlighting the need for holistic evaluation metrics like PhyScore.

Yiting Lu, Haoran Li, Fengbin Guan +28

Eval Frameworks & Benchmarks Multimodal Models World Models & Planning

May 5, 2026

Free University of Bozen-Bolzano·3w ago

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Get expert-level feedback on your performance, not just a score, thanks to a new approach that uses language generation for proficiency estimation.

E. Bianchi, Antonio Liotta

Computer Vision Multimodal Models Training Efficiency & Optimization

Dongyoung Kim +67·3w ago

RLDX-1 Technical Report

RLDX-1 achieves double the success rate of existing VLAs on complex humanoid tasks, suggesting a leap in robots' ability to handle contact-rich, dynamic manipulation.

Dongyoung Kim, Huiwon Jang, Myungkyu Koo +65

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Lin Song +18·3w ago·also HKUST

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Bidirectional interaction between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables a unified multimodal model to achieve spatial intelligence beyond general visual competence.

Lin Song, Wenbo Li, Guoqing Ma +16

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

3w ago·also HKU, Nat'l Eng. Research Center of Visual, PKU, Rice

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

A hierarchical agent that separates visual and textual contexts drastically improves multi-step reasoning on complex charts, outperforming monolithic MLLMs.

Qihua Dong, Ruozhen He, Junwen Chen +4

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Achuth Chandrasekhar +3·3w ago

Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining

Automating materials science database construction is now feasible: a multi-agent system extracts structured data from scientific literature with high speed and accuracy.

Achuth Chandrasekhar, Omid Barati Farimani, Radheesh Sharma Meda +1

Multimodal Models Scientific Discovery & Drug Design Tool Use & Agents

DAMO·3w ago

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.

Zhipeng Xu, Junhao Ji, Zulong Chen +10

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Akshay Syal +4·3w ago

A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.

Akshay Syal, L. Prince, E. Gultepe +2

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

ETH·3w ago

Laundering AI Authority with Adversarial Examples

Production VLMs like GPT-4, Claude Opus, Gemini, and Grok can be easily manipulated into confidently providing false information via subtle adversarial perturbations to images, even without compromising model alignment.

Jie Zhang, Pura Peetathawatchai, Florian Tramèr +1

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Mustafa Sakhaia +3·3w ago

InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making

Event cameras can significantly boost the reliability of autonomous driving in high-dynamic-range and high-speed scenarios, achieving perfect route completion in CARLA benchmarks.

Mustafa Sakhaia, Kaung Sithua, Min Khant Soe Okea +1

Computer Vision Multimodal Models Robotics & Embodied AI

Kristy Sakano +2·3w ago

From Language to Logic: A Theoretical Architecture for VLM-Grounded Safe Navigation

Guaranteeing safe robot navigation in unstructured environments just got easier: translate human language rules into formal logic, ground them with VLMs, and let the robot navigate.

Kristy Sakano, Kalonji Harrington, Mumu Xu

Multimodal Models Reasoning & Chain-of-Thought Robotics & Embodied AI

Yuqi Li +10·3w ago

RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Robot video world models can be significantly improved by distilling a multimodal reward function and stabilizing long-horizon inference, leading to better instruction following and manipulation accuracy.

Yuqi Li, Yuan Gao, Fan Xu +8

Multimodal Models Robotics & Embodied AI World Models & Planning

Zhiyuan Li +6·3w ago

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

Robots can now learn manipulation skills from human videos with greater morphological accuracy and temporal consistency, thanks to a new method that disentangles task and embodiment.

Zhiyuan Li, Wenyan Yang, Wenshuai Zhao +4

Computer Vision Multimodal Models Robotics & Embodied AI

Timon Homberger +4·3w ago

FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

Achieve scalable open-vocabulary semantic maps of entire buildings by fusing both dense and instance-level semantic information in a novel dual-layer voxel representation.

Timon Homberger, F. Busch, Jesús Gerardo Ortega Peimbert +2

Computer Vision Multimodal Models Robotics & Embodied AI

Chenhao Yu +5·3w ago·also CAU, PKU

BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

Unlock agile humanoid robots by ditching teleoperation and training directly from human VR demos.

Chenhao Yu, Hongwu Wang, Youhao Hu +3

Data Curation & Synthetic Data Multimodal Models Robotics & Embodied AI

University of Surrey·3w ago

TACO: Trajectory Aligning Cross-view Optimisation

Ditch the GPS: This CVGL pipeline achieves a 5.9x improvement in localization accuracy over IMU-only by intelligently fusing satellite imagery with inertial measurements, needing only a single initial GPS fix.

Tavis Shore, Oscar Mendez, Simon Hadfield

Computer Vision Multimodal Models Robotics & Embodied AI

Jingjing Zhou +7·3w ago

Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection

Multimodal graph unlearning doesn't have to destroy utility: carefully protecting high-dimensional input projections during the unlearning process preserves performance while still enabling effective forgetting.

Jingjing Zhou, Yongshuai Yang, Qing Qing +5

Multimodal Models Training Efficiency & Optimization

Xun Jiang +7·3w ago

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

Conformal prediction offers a surprisingly effective way to handle both modality imbalance and noisy corruption in multimodal learning by explicitly modeling predictive uncertainty during training.

Xun Jiang, Yufan Gu, Disen Hu +5

Data Curation & Synthetic Data Multimodal Models Training Efficiency & Optimization

Jing Gong·3w ago

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.

Jing Gong

Multimodal Models Open-Source Models & Weights Speech & Audio

May 4, 2026

AI2·3w ago·also NUS, UW, JHU, UMich +1

MolmoAct2: Action Reasoning Models for Real-world Deployment

Open-sourcing a VLA model that beats closed-source giants on embodied reasoning tasks could finally make real-world robot deployment practical.

Haoquan Fang, Jiafei Duan, Donovan Clay +26

Multimodal Models Open-Source Models & Weights Robotics & Embodied AI

3w ago·also HKUST, PhotoFlow, SCU, SJTU +1

Perceptual Flow Network for Visually Grounded Reasoning

LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.

Yangfu Li, Yuning Gong, Hongjian Zhan +7

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

3w ago·also National Center for High-Performance, National Chung Cheng University

Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation

Achieve state-of-the-art object detection in multi-camera surveillance without compromising data privacy by fusing models trained on synthetically augmented and federated data.

Peggy Joy Lu, Wei-Yu Chen, Yao-Tsung Huang +1

Computer Vision Data Curation & Synthetic Data Multimodal Models

ETH·3w ago·also UZH

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Pehuén Moure, Niclas Pokel, Bilal Bounajma +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yian Zhao +6·3w ago

Video Generation with Predictive Latents

Encoding temporal prediction into video VAEs unlocks faster training, better generative performance, and improved downstream task performance, all at once.

Yian Zhao, Feng Wang, Qiushan Guo +4

Computer Vision Multimodal Models World Models & Planning