100 papers published across 7 labs.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Giving medical imaging AIs the same tools as human doctors actually *hurts* their performance, revealing a surprising lack of spatial reasoning.
Forget redrawing diagrams by hand: VFIG, a new vision-language model, can automatically convert rasterized figures into editable SVGs with near GPT-5.2 quality.
Training data is not enough: reasoning traces from diverse cultural backgrounds are critical for safe and reliable autonomous driving in rare, long-tail scenarios.
Forget random back-view hallucinations – Know3D lets you *prompt* the unseen side of 3D models using language, opening the door to controllable 3D asset creation.
Forget generating plausible-but-fake details: 3DreamBooth bakes a robust 3D prior into video generation models using only a single-frame optimization, enabling truly view-consistent customized subject videos.
Skip reinforcement learning and still get SOTA vision-language reasoning performance with a simple loss re-weighting scheme that cuts training time by 7x.
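The teaser doesn't spell out the scheme, but a minimal sketch of the general idea, per-sample loss re-weighting standing in for an RL preference objective, might look like this (the softmax weighting and all names are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def reweighted_ce_loss(logits: torch.Tensor, targets: torch.Tensor,
                       rewards: torch.Tensor) -> torch.Tensor:
    """Cross-entropy where each sample's contribution is scaled by a
    reward signal, approximating a preference objective without an RL
    loop. Hypothetical: the paper's actual weighting rule isn't given.
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Higher-reward samples get proportionally larger gradient weight;
    # rescaling keeps the mean weight at 1 so the LR stays comparable.
    weights = torch.softmax(rewards, dim=0) * rewards.numel()
    return (weights * per_sample).mean()
```

Because the weights enter a plain supervised loss, training stays a single forward/backward pass per batch, which is presumably where the claimed speedup over RL comes from.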
A million-sequence, high-quality, open-source motion dataset finally lets text-to-motion models generalize beyond toy benchmarks.
Fine-tuning a visual geometry transformer with SEAR unlocks surprisingly accurate RGB-Thermal 3D reconstruction, even surpassing SOTA methods despite training on significantly less multimodal data.
Closed-loop feedback using VLMs can dramatically improve text-to-image generation quality, even without additional training.
Text-only pre-training quietly endows different LLMs with surprisingly different levels of auditory knowledge, directly impacting their effectiveness as backbones for audio language models.
Current VLMs struggle with multi-hop spatial reasoning, often failing to compose even simple spatial relations across multiple steps, highlighting a critical gap for real-world VLA agent deployment.
Strategic visual aids are the secret weapon for geometric reasoning, and this work shows how to teach MLLMs to wield them effectively via reinforcement learning.
Skip annotating image rationales: this method transfers text-based rationales to images for explainable crisis classification, saving annotation effort while boosting performance.
Ditch the syntax-only grind: a multi-modal assessment strategy proves that introductory programming courses can boost both coding skills and crucial soft skills like communication and critical thinking.
Synthesizing realistic room acoustics from a single recording is now possible, thanks to a novel flow-matching approach that captures the uncertainty inherent in acoustic environments.
Forget waiting a minute for garment generation: SwiftTailor slashes inference times while boosting accuracy by representing 3D garments as geometry images.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
Forget brute-force scaling: intelligently selecting just 1% of video frames can actually *improve* video QA accuracy and cut compute by 93%.
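As a rough illustration of query-conditioned frame selection (cosine top-k is an assumption; the paper's selector is presumably more sophisticated):

```python
import numpy as np

def select_frames(frame_feats: np.ndarray, query_feat: np.ndarray,
                  keep_ratio: float = 0.01) -> np.ndarray:
    """Return indices of the frames most relevant to the question.

    frame_feats: (num_frames, dim), L2-normalized per frame
    query_feat:  (dim,), L2-normalized question embedding
    """
    k = max(1, int(round(len(frame_feats) * keep_ratio)))
    scores = frame_feats @ query_feat        # cosine similarities
    top_k = np.argsort(scores)[-k:]          # best-matching frames
    return np.sort(top_k)                    # restore temporal order
```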
Fine-tuning LVLMs on counting alone boosts general visual reasoning by over 1.5%, revealing counting as a surprisingly central skill.
VLAs aren't just memorizing training data; sparse autoencoders reveal a hidden layer of generalizable motion primitives that can be steered to control robot behavior across tasks.
Multimodal LLMs suffer a major performance hit when asked to switch from text-based to image-based tasks mid-conversation, revealing a surprising asymmetry in their ability to handle task interference.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Achieve more physically realistic video generation by explicitly modeling 3D geometry and physical attributes across multiple viewpoints.
VLMs can now better detect when they're seeing something they shouldn't, even as the world changes around them, thanks to a new method that dynamically fuses visual and textual cues.
VLMs selectively ignore visual information based on question framing, even when the visual reasoning task remains identical, highlighting a critical vulnerability in their grounding capabilities.
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Get faithful and plausible natural language explanations for chest X-rays with as few as 5 human-annotated examples per diagnosis, and even boost classification accuracy in the process.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
VLMs struggle with spatial reasoning, but a clever decomposition into sub-problems and probabilistic recombination unlocks significantly better metric-semantic grounding.
Unlocking fairer vision-language models may be as simple as intervening in the latent space of a sparse autoencoder, enabling targeted bias removal without harming performance.
Forget generic textures – CustomTex lets you clone real-world object appearances onto your 3D scenes with uncanny fidelity.
Proactive VideoLLMs can finally be both accurate AND efficient thanks to a novel propose-match framework that decouples semantic understanding from streaming perception.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
Achieve state-of-the-art panoramic depth estimation without any training by cleverly exploiting the 3D consistency priors embedded within existing vision foundation models.
AI can now handle the tedious copywriting and real-time Q&A for live-streaming commerce, freeing up human streamers to focus on engagement.
Turns out, VLA models are mostly just looking at the scene: visual pathways dominate action generation, and language only matters when the visuals are ambiguous.
Automating motor insurance from vehicle damage analysis to claims evaluation is now possible with a vertically integrated AI paradigm.
DriveTok achieves unified multi-view reconstruction and understanding by learning scene tokens that integrate semantic, geometric, and textural information, outperforming existing 2D tokenizers in autonomous driving scenarios.
State Space Models can outperform Vision Transformers as vision encoders in VLMs, particularly when model size is a constraint.
Diffusion models can now generate rare concepts and execute complex edits with greater fidelity, thanks to a training-free prompt blending technique that leverages statistical properties of the diffusion process itself.
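One plausible reading of training-free prompt blending, sketched below with a linear schedule (the paper derives its blending from statistical properties of the diffusion process, so treat this purely as an illustration):

```python
import torch

def blended_prompt(common_emb: torch.Tensor, rare_emb: torch.Tensor,
                   step: int, num_steps: int) -> torch.Tensor:
    """Interpolate text-conditioning embeddings across denoising steps,
    starting from a common, well-learned concept and drifting toward
    the rare target. The linear schedule is an assumption.
    """
    alpha = step / max(1, num_steps - 1)   # 0 at the first step, 1 at the last
    return torch.lerp(common_emb, rare_emb, alpha)
```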
Ditch the finetuning: this training-free method uses attention scores to generate rare concepts in images with more precision and control than LLM-guided approaches.
Real-time robotic perception just got a major upgrade: OnlinePG achieves open-vocabulary panoptic mapping with 3D Gaussian Splatting, enabling robots to understand and interact with environments in a way that was previously impossible.
Synthesized PET scans from MRI can nearly match the diagnostic accuracy of real PET for Alzheimer's, potentially unlocking wider access to crucial functional insights.
Visual language models can now explicitly reason about object trajectories in videos, thanks to a simple yet effective method that augments training data and uses discrete motion tags.
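A toy version of the discrete-motion-tag idea, with thresholds and vocabulary invented for illustration:

```python
def motion_tag(dx: float, dy: float, eps: float = 5.0) -> str:
    """Map a per-interval object displacement (in pixels) to a discrete
    motion tag that can be spliced into training captions. Thresholds
    and the tag set are illustrative assumptions, not the paper's.
    """
    horiz = "right" if dx > eps else "left" if dx < -eps else ""
    vert = "down" if dy > eps else "up" if dy < -eps else ""
    return "-".join(t for t in (vert, horiz) if t) or "static"
```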
LVLMs can gain a surprising amount of spatial reasoning ability by explicitly generating segmentation and depth tokens before answering questions.
Radiometric disentanglement from a single image becomes tractable by exploiting the shared illumination constraint across multiple objects, enabling stochastic sampling of reflectance, texture, and illumination.
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Ditch the mask decoder: a single segmentation token can unlock competitive image segmentation directly from MLLMs.
Reconstructing realistic hand-object interactions from video just got an order of magnitude faster, thanks to a novel Gaussian Splatting approach that ensures physical consistency.
Automating linguistically-grounded sign language annotation is now possible, unlocking scalable dataset curation previously limited by manual effort.
Pixel-perfect geospatial reasoning is now possible, thanks to a vision-language model that adaptively fuses multi-modal and multi-temporal Earth observation data.
Overcoming occlusion in hand-object pose estimation just got easier: GenHOI leverages hierarchical semantic knowledge and hand priors to achieve state-of-the-art results on challenging benchmarks.
Get GPT-4o-level long-video QA performance with 10x fewer FLOPs by using a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.
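To make the fuzzy-logic part concrete, here is a hypothetical combiner over per-expert frame scores (the expert names and the AND/OR structure are assumptions):

```python
def fuzzy_frame_relevance(expert_scores: dict[str, float]) -> float:
    """Combine per-expert relevance scores (each in [0, 1]) with
    standard fuzzy operators: AND = min, OR = max. Reading the rule as
    'visually relevant AND (caption-match OR audio-match)' is an
    illustrative guess, not the paper's exact formula.
    """
    visual = expert_scores["visual"]
    textual = max(expert_scores["caption"], expert_scores["audio"])
    return min(visual, textual)              # fuzzy AND of the two cues
```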
Humanoid robots can now generate more empathetic and instruction-aware gestures thanks to a new diffusion framework conditioned on affective estimation and pedagogical reasoning.
Forget painstakingly designing simulation environments: generative 3D world models let you RL-fine-tune robot VLAs with massive scene diversity, boosting real-world transfer by 3x.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
Achieve state-of-the-art joint audio-video generation with fewer resources by fixing key flaws in cross-modal context handling within dual-stream transformers.
Token compression and multi-agent systems are enabling more efficient and interpretable multimodal reasoning in computational pathology, paving the way for trustworthy AI-assisted diagnosis.
Flow-based VLAs can react to environmental changes ten times faster by adaptively prioritizing near-term actions during sampling, unlocking unprecedented real-time responsiveness.
High-dimensional discrete tokens, previously out of reach for generative models, can now be directly generated, unlocking a unified token prediction paradigm for multimodal architectures.
Text-to-3D generation gets a semantic upgrade: DreamPartGen creates 3D objects with parts that not only look right but also understand their relationships and align with textual descriptions.
VLMs' safety judgments are easily manipulated by simple semantic cues, revealing a reliance on superficial associations rather than true visual understanding.
Seemingly efficient VLA models can be surprisingly inefficient when deployed on robots, highlighting the need to move beyond standard metrics like FLOPs and parameters.
SLMs are shockingly vulnerable: combining adversarial audio and text unlocks 1.5x to 10x higher jailbreak rates than attacking either modality alone.
Multi-modal federated learning can be made communication-efficient and robust to outliers by learning a shared latent space, even with heterogeneous client architectures.
Spatial awareness is the secret ingredient to unlocking better visual in-context learning, boosting performance across diverse vision tasks.
MLLMs can gain surprisingly strong 3D spatial reasoning abilities simply by tapping into the latent knowledge already present in video generation models.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
Unlock geometry-precise 3D generation by directly conditioning diffusion models on readily available point cloud priors, outperforming existing image- or text-conditioned methods.
Hierarchical memory, inspired by human cognition, beats standard approaches in robotic manipulation tasks requiring both precise tracking and long-term retention.
Robots can now manipulate objects with greater dexterity and adaptability thanks to a new world model that leverages both vision and high-frequency tactile feedback to predict and react to contact dynamics.
Medical vision-language models are surprisingly brittle: clinically plausible image manipulations, like those introduced during routine acquisition and delivery, can drastically degrade their performance.
Tactile sensing closes the sim2real gap for deformable object tracing, enabling a single imitation learning model to achieve impressive generalization across diverse objects.
Even the most advanced VLMs like GPT-4o, GPT-5, and Gemini 2.5 Flash are outperformed in multi-actor human-robot interaction grounding by a system that selectively invokes VLMs based on a lightweight perception pipeline.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
AdaMuS overcomes the bias towards high-dimensional data in multi-view learning by adaptively pruning redundant parameters and sparsely fusing views, leading to improved performance on dimensionally unbalanced data.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
By iteratively reasoning over video snippets with a Chain-of-Thought, R²VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
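A schematic of the snippet-by-snippet loop, where `vlm` is a hypothetical callable standing in for the model:

```python
def estimate_progress(snippets, vlm, task: str) -> float:
    """Chain-of-thought over a long video, one snippet at a time.

    `vlm` is a hypothetical callable (snippet, prompt) -> (thought, progress);
    only the accumulated reasoning, not raw frames, is carried forward,
    so memory stays bounded regardless of video length.
    """
    reasoning, progress = "", 0.0
    for snippet in snippets:
        prompt = f"Task: {task}\nReasoning so far: {reasoning}"
        thought, progress = vlm(snippet, prompt)
        reasoning += thought + "\n"
    return progress          # estimated task completion in [0, 1]
```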
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
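In PyTorch terms, such a template could be a 64-token soft prompt prepended to the text embeddings (the token count comes from the teaser; the class name and embedding width are assumptions):

```python
import torch
import torch.nn as nn

class PromptTemplate(nn.Module):
    """A reusable bank of 64 learnable embeddings prepended to the text
    encoder's token embeddings; once trained, it replaces per-prompt
    iterative engineering. Hypothetical sketch, not the paper's code.
    """
    def __init__(self, num_tokens: int = 64, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, text_embs: torch.Tensor) -> torch.Tensor:
        # text_embs: (batch, seq_len, dim) -> (batch, 64 + seq_len, dim)
        prefix = self.tokens.unsqueeze(0).expand(text_embs.shape[0], -1, -1)
        return torch.cat([prefix, text_embs], dim=1)
```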
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting underfitting, not overfitting, is the bigger risk.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Naive fine-tuning of VLMs for multimodal sequential recommendation causes catastrophic modality collapse, but can be fixed with gradient rebalancing and cross-modal regularization.
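A guess at what gradient rebalancing could look like in practice, equalizing per-modality gradient norms after the backward pass (the paper's actual rule may differ):

```python
import torch

@torch.no_grad()
def rebalance_modality_grads(text_params, image_params) -> None:
    """After loss.backward(), rescale image-branch gradients so their
    global norm matches the text branch, preventing one modality from
    dominating updates and collapsing the other. Hypothetical sketch.
    """
    def global_norm(params):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return torch.tensor(0.0)
        return torch.sqrt(sum((g ** 2).sum() for g in grads))

    g_text, g_image = global_norm(text_params), global_norm(image_params)
    if g_image > 0:
        scale = g_text / g_image
        for p in image_params:
            if p.grad is not None:
                p.grad.mul_(scale)
```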
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
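A minimal sketch of attention-based vision-token pruning (the scoring rule and keep ratio are illustrative, not the paper's exact recipe):

```python
import torch

def prune_vision_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                        keep_ratio: float = 0.38) -> torch.Tensor:
    """Keep only the vision tokens the [CLS] token attends to most.

    tokens:   (batch, num_tokens, dim) patch features
    cls_attn: (batch, num_tokens) attention from [CLS] to each patch
    keep_ratio is illustrative; the attention-based scoring is an
    assumption about how such pruning is typically done.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values  # keep order
    batch_idx = torch.arange(tokens.shape[0], device=tokens.device).unsqueeze(1)
    return tokens[batch_idx, idx]            # (batch, k, dim)
```

The same gather can run once after the ViT and again inside the LLM's early layers, which matches the teaser's claim that pruning in both stages compounds the savings.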
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
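One way to picture the disentangled LoRA updates: a frozen base layer plus a shared branch and per-language branches (names, ranks, and the additive combination are all assumptions):

```python
import torch
import torch.nn as nn

class ZipperLoRALinear(nn.Module):
    """Sketch of the teaser's idea: one LoRA branch shared across
    languages (stability, cross-lingual transfer) plus one branch per
    language (plasticity for low-resource ASR). Hypothetical sketch.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, num_langs: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        self.shared_a = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.shared_b = nn.Parameter(torch.zeros(rank, d_out))
        self.lang_a = nn.Parameter(torch.randn(num_langs, d_in, rank) * 0.01)
        self.lang_b = nn.Parameter(torch.zeros(num_langs, rank, d_out))

    def forward(self, x: torch.Tensor, lang: int) -> torch.Tensor:
        shared = x @ self.shared_a @ self.shared_b
        specific = x @ self.lang_a[lang] @ self.lang_b[lang]
        return self.base(x) + shared + specific
```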
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, with the LLM having to recover via attention-mediated refinement.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.