April 20 – April 27, 2026

Multimodal Models - Weekly Roundup

100 papers published across 9 labs.

279% acceleration

Selected Labs publishing this week

Tsinghua AI (5), UW (2), CMU ML (2), NVIDIA (1), Microsoft Research (1)

Top Papers

Apr 27, 2026

Apr 27, 2026·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

NVIDIA·Apr 27, 2026·also Amazon Science, Microsoft Research, UW, Music X Lab +1

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.

NVIDIA, Amala Sanjay Deshmukh, K. Chumachenko, Tuomas Rintamaki +209

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Speech & Audio

Hao Wang +6·Apr 27, 2026

X2SAM: Any Segmentation in Images and Videos

Finally, a single model that handles any segmentation task in both images and videos, understanding both text and visual prompts.

Hao Wang, Limeng Qiao, Chi Zhang +4

Computer Vision Multimodal Models

Shiyi Zhang +10·Apr 27, 2026

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.

Shiyi Zhang, Yiji Cheng, Tiankai Hang +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Google Research·Apr 27, 2026·also LinkedIn Corporation

Co-Director: Agentic Generative Video Storytelling

Forget handcrafted prompts: a hierarchical multi-agent framework turns diffusion models into coherent storytelling engines by globally optimizing for semantic coherence.

Yale Song, Yiwen Song +27

Computer Vision Multimodal Models Tool Use & Agents

All Papers (100)

Apr 27, 2026

Apr 27, 2026

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.

Yiming Zhang, Jiacheng Chen, Jiaqi Tan +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Zhiheng Liu +14·Apr 27, 2026

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Ditching the vision encoder actually *improves* multimodal understanding at scale, proving that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.

Zhiheng Liu, Weiming Ren, Xiaoke Huang +12

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

G. Channing +6·Apr 27, 2026

Contrastive Image-Metadata Pre-Training for Materials Transmission Electron Microscopy

Unlock the secrets hidden in your lab's backed-up microscopy data: style transfer networks can now "re-imagine" images as if they were captured with different instrument settings.

G. Channing, D. Keller, M. Rossell +4

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Jun Li +8·Apr 27, 2026·also Tsinghua AI

Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

Frozen vision-language models can dramatically improve abnormality grounding in rare disease imaging by iteratively refining decisions through optimized instructions and visual perturbations.

Jun Li, Mingxuan Liu, Che Liu +6

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Yifei Wei +7·Apr 27, 2026·also Quantstamp

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Decomposing robotic manipulation into coarse and fine-grained actions isn't just conceptually cleaner—it actually unlocks a sweet spot where learning difficulty is balanced, boosting performance.

Yifei Wei, Linqing Zhong, Yi Liu +5

Computer Vision Multimodal Models Robotics & Embodied AI

Soyeon Kim +5·Apr 27, 2026

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.

Soyeon Kim, Cheon-kyu Kang, Myeongjin Lee +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Zhihan Zhang +3·Apr 27, 2026·also SMU

Aligned Multi-View Scripts for Universal Chart-to-Code Generation

Training on semantically equivalent chart renderings in Python, R, and LaTeX unlocks surprisingly effective multi-lingual chart-to-code generation from a single model.

Zhihan Zhang, Lizi Liao +1

Code Generation & Program Synthesis Data Curation & Synthetic Data Multimodal Models

Mohamad Zamini +1·Apr 27, 2026

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

Achieve SOTA zero-shot segmentation by simply fusing two CLIP branches, one focusing on local token reliability and the other on structural priors, all without training.

Mohamad Zamini, Diksha Shukla

Computer Vision Multimodal Models

Apr 27, 2026

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Agentic AI struggles with Earth Observation because reprojection, resampling, and other geospatial operations silently corrupt data, demanding a new agent design paradigm.

Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir +5

Computer Vision Multimodal Models Tool Use & Agents

Maitreya Patel +4·Apr 27, 2026

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.

Maitreya Patel, Jingtao Li, Weiming Zhuang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 27, 2026·also UofT

ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

Text-guided 3D medical image segmentation just got a whole lot more practical: ESICA achieves state-of-the-art accuracy with a "Lite" variant that slashes parameter count without sacrificing performance.

Yuelin Xin, Gorkem Can Ates, Jun Ma +4

Computer Vision Multimodal Models Natural Language Processing

Nikesh Subedi +2·Apr 27, 2026

Interactive Episodic Memory with User Feedback

Interactive feedback slashes error rates in episodic memory retrieval, outperforming even large vision-language models while remaining efficient.

Nikesh Subedi, Loris Bazzani, Ziad Al-Halah

Computer Vision Multimodal Models Natural Language Processing

Weijie Wang +9·Apr 27, 2026·also Microsoft Research

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Text-to-video models can now learn geometrically consistent world dynamics via reinforcement learning, without expensive architectural changes.

Weijie Wang, Youping Gu, Zeyu Zhang +7

Computer Vision Multimodal Models World Models & Planning

Guangdong University of Technology·Apr 27, 2026·also PKU, SYSU

Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

Test-time adaptation of vision-language models can actually *hurt* performance when modalities shift asymmetrically; MG-MTTA fixes this by explicitly modeling modality reliability.

Lixian Chen, Mingxuan Huang, Yan-Hong Chen +2

Computer Vision Multimodal Models Natural Language Processing

Haoxiao Wang +10·Apr 27, 2026·also ZJU

Diffusion Model as a Generalist Segmentation Learner

Turns out, your image-generating diffusion model already knows how to segment anything you ask it to.

Haoxiao Wang, Antao Xiang, Haiyang Sun +8

Computer Vision Multimodal Models

Hamed Rahimi +4·Apr 27, 2026

IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

Robots can now understand human intentions with near-human accuracy thanks to a new video-language model that reasons about goals like a human.

Hamed Rahimi, Clémence Grislain, Adrien Jacquet Cretides +2

Computer Vision Multimodal Models Robotics & Embodied AI

Yifan Xie +5·Apr 27, 2026

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

Robots can now leverage human intuition for manipulation tasks, learning from a massive video dataset to improve motion plausibility and robustness, even when conditions change.

Yifan Xie, Yuan Wang, Guangyu Chen +3

Data Curation & Synthetic Data Multimodal Models Robotics & Embodied AI

Kai Yang +8·Apr 27, 2026

AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

Network jitter in cloud-based robot control can be overcome by converting temporal lag into spatial pose offsets, restoring the VLA's original geometric intent without fine-tuning.

Kai Yang, Zedong Chu, Yingnan Guo +6

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Zihao Zheng +9·Apr 27, 2026

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

Frequency-domain analysis unlocks 1.59x speedups in Vision-Language-Navigation by guiding which tokens to cache, an optimization that earlier visual-domain approaches could not fully exploit.

Zihao Zheng, Xingyu Zhou, Z. Mao +7

Inference & Quantization Multimodal Models Robotics & Embodied AI

Kaijun Zhou +5·Apr 27, 2026

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Edge NPUs can outperform flagship GPUs in cost and energy efficiency for on-robot VLA model deployment, but only with hardware-aware optimizations that tackle the models' distinct compute and memory-bound phases.

Kaijun Zhou, Qiwei Chen, Dajiang Peng +3

Inference & Quantization Multimodal Models Robotics & Embodied AI

Siyao Xiao +11·Apr 27, 2026·also Pinterest

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

Forget end-to-end fine-tuning: $M^2$-VLA unlocks the power of generalized VLMs for robotic manipulation by intelligently mixing layers and incorporating meta-skills.

Siyao Xiao, Yuhong Zhang, Zhifang Liu +9

Computer Vision Multimodal Models Robotics & Embodied AI

Esteban Rodríguez-Betancourt +1·Apr 27, 2026

Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval

Self-supervised vision models that ace linear probing can still flop at semantic image retrieval because of skewed latent space geometry that breaks approximate nearest neighbor search.

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

Computer Vision Multimodal Models Recommendation & Information Retrieval

MIT CSAIL·Apr 27, 2026·also AI for Responsible, Beth Israel Deaconess Medical Center, Bordeaux Population Health Research Center, Clinical Research Center +8

Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings

Quantum kernels unlock signal in medical image embeddings where classical methods fail, suggesting a new path for extracting value from medical foundation models.

Sebastian Cajas Ordóñez, Felipe Ocampo Osorio +16

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Apr 27, 2026·also Macquarie, PKU, UNSW

MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG

Semantic grounding, not token probability, is the key to better multimodal RAG.

Xihang Wang, Chengkai Huang, Quan Z. Sheng +2

Multimodal Models Natural Language Processing Recommendation & Information Retrieval

Apr 27, 2026·also CAS, SUSTech, United Nova Technology

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Forget slow, multi-step action generation: CF-VLA's coarse-to-fine approach slashes latency by 75% while boosting real-robot success rates to a new high of 83%.

Fan Du, Feng Yan, Jianxiong Wu +6

Multimodal Models Robotics & Embodied AI Training Efficiency & Optimization

Jiawei Wang +10·Apr 27, 2026

DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

Species identification and discovery, traditionally treated as separate problems, can be unified into a single framework that leverages retrieval-augmented reasoning for improved accuracy and interpretability.

Jiawei Wang, Min Lei, Yaning Yang +8

Multimodal Models Recommendation & Information Retrieval Scientific Discovery & Drug Design

Hai Wang +3·Apr 27, 2026

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.

Hai Wang, Xiaocheng Yang, Mingzhi Dong +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Weixing Wang +7·Apr 27, 2026

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.

Weixing Wang, Liudvikas Zekas, Anton Hackl +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Cheng-Han Lee +5·Apr 27, 2026·also Meituan, Northeastern

Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

A new large-scale dataset of human-annotated video crops enables training models that adapt videos to different aspect ratios while preserving visual quality and meaning.

Cheng-Han Lee, Maniratnam Mandal, N. Birkbeck +3

Computer Vision Multimodal Models

Apr 27, 2026·also New Laboratory of Pattern Recognition

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

You don't need billions of parameters to accurately ground GUI elements: GoClick, a 230M parameter model, matches the performance of much larger models, opening the door for on-device GUI agents.

Hongxin Li, Yuntao Chen +3

Computer Vision Multimodal Models Tool Use & Agents

Apr 27, 2026·also ZJU

Improving Vision-language Models with Perception-centric Process Reward Models

VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.

Yingqian Min, Kun Zhou, Yifan Li +6

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 27, 2026

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Audio-Language models are cheating on benchmarks, acing tests even when they barely listen.

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Apr 27, 2026·also New Laboratory of Pattern Recognition, PolyU

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Existing GUI agents can parrot actions, but AutoGUI-v2 reveals they still lack a deep understanding of GUI functionality and struggle to predict the outcomes of even simple interactions.

Hongxin Li, Xiping Wang +10

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Apr 26, 2026

Apr 26, 2026·also Cornell, Technion

Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions

Achieve surgical 3D edits without training: Prox-E lets you reshape objects with language by manipulating a compact set of geometric primitives.

Etai Sella, Hao Phung, Nitay Amiel +3

Computer Vision Multimodal Models Natural Language Processing

Zhen Ye +10·Apr 26, 2026

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Disentangling high-level cross-modal reasoning from low-level modality-specific refinement in talking head generation yields superior lip-sync accuracy, video quality, and audio quality compared to entangled approaches.

Zhen Ye, Xu Tan, Aoxiong Yin +8

Computer Vision Multimodal Models Speech & Audio

UW·Apr 26, 2026

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

LLM agents struggle to maintain performance in multi-day collaborative tasks, dropping significantly after just one environmental update, revealing a critical gap in adaptation to evolving real-world conditions.

Fanqing Meng, Lingxiao Du, Zijian Wu +42

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Qi Li +7·Apr 26, 2026

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

VLA models introduce a fundamentally new risk landscape compared to LLMs or robotics alone, demanding a unified safety perspective that considers irreversible physical consequences and multimodal attack surfaces.

Qi Li, Bo Yin, Weiqi Huang +5

Multimodal Models Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Apr 25, 2026

Yida Xue +7·Apr 25, 2026·also ZJU

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Unlock the secrets of the deep: OceanPile, a massive, meticulously curated multimodal dataset, finally brings the power of foundation models to the vast and underexplored ocean.

Yida Xue, Ningyu Zhang, Tingwei Wu +5

Data Curation & Synthetic Data Multimodal Models Scientific Discovery & Drug Design

Tsinghua AI·Apr 25, 2026·also Cambridge

AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval

Finding similar analog circuits across netlists, schematics, and descriptions just got way easier: a new model achieves 75% recall, unlocking better circuit design automation.

Yihan Wang, Lei Li, Yao Lai +2

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval

Apr 23, 2026

AI4Bharat·Apr 23, 2026·also IIT Madras

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

VLM evaluators, despite their growing use, can miss over 50% of targeted errors in generated images and text, especially when those errors involve fine-grained details or spatial relationships.

Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand +1

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Yao Zhang +3·Apr 23, 2026

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Transforming human motion into structured language allows LLMs to achieve unprecedented accuracy in motion understanding without the constraints of traditional encoding methods.

Yao Zhang, Zhu Liu, T. Ploetz +1

Multimodal Models Natural Language Processing Robotics & Embodied AI

Xiaojie Xu +8·Apr 23, 2026·also Shanda AI Research Tokyo, UTokyo

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Stop guessing which interactive video model is best: WorldMark offers the first apples-to-apples comparison across leading models on identical scenes and trajectories.

Xiaojie Xu, Zhengyuan Lin, Zhe Lin +6

Eval Frameworks & Benchmarks Multimodal Models World Models & Planning

Ceyuan Yang +19·Apr 23, 2026

Context Unrolling in Omni Models

Training a single model across text, images, video, 3D geometry, and hidden representations unlocks "Context Unrolling," where the model reasons across modalities to improve reasoning fidelity.

Ceyuan Yang, Zhijie Lin, Yang Zhao +17

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Reasoning & Chain-of-Thought

Apr 23, 2026

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

LVLMs are often tripped up not by faulty vision, but by over-trusting the textual prompt, leading to surprisingly easy-to-fix hallucinations.

Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny +3

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Wenxuan Bao +2·Apr 23, 2026

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Ramen achieves robust test-time adaptation of VLMs in mixed-domain scenarios by selecting the right samples to adapt to, sidestepping the common pitfall of performance degradation when faced with diverse and inconsistent test data.

Wenxuan Bao, Yanjun Zhao, Xiyuan Yang

Computer Vision Multimodal Models

Eghbal A. Hosseini +3·Apr 23, 2026

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Stimuli that vision models agree on most strongly drive alignment with language models, doubling cross-modal convergence.

Eghbal A. Hosseini, Brian Cheung, E. Fedorenko +1

Computer Vision Multimodal Models

Apr 23, 2026

Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

LLMs struggle to answer human-generated questions about multi-chart images, highlighting a critical gap in their ability to reason about real-world data visualizations.

Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Multimodal Models

Apr 23, 2026·also JD.com, Tencent AI

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

Learnable critics that evaluate the model's own GUI grounding proposals, rather than relying on static geometric heuristics, unlock substantial gains in accuracy.

Wenkai Wang, Xiyun Li, Hongcan Guo +5

Computer Vision Multimodal Models Tool Use & Agents

Apr 23, 2026

Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

Ignoring why clinical data is missing can lead to suboptimal treatment policies; this work shows how explicitly modeling informative missingness in multimodal time series data significantly improves both offline treatment policy learning and outcome prediction.

Zihan Liang, Ziwen Pan, Ruoxuan Xiong

Multimodal Models Natural Language Processing

CMU ML·Apr 23, 2026·also Datadog

ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

Even GPT-5 only achieves 63% accuracy on time series anomaly questions from real software incidents, but a model-expert combination reaches 87%, highlighting the potential for hybrid intelligence in incident response.

Stephan Xie, Ben Cohen, Mononito Goswami +6

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

College of Information Science·Apr 23, 2026·also University of Nebraska Omaha

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

LLMs can extract events more effectively when combined with graph-based document representations that overcome their "lost-in-the-middle" limitations.

Praval Sharma

Multimodal Models Natural Language Processing

Yuehan Zhu +4·Apr 23, 2026

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Forget rigid workflows: HiCrew's planning layer dynamically orchestrates agents for video understanding, adapting roles and execution paths to the nuances of each question.

Yuehan Zhu, Jingqi Zhao, Jiawen Zhao +2

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Andrew Shin·Apr 23, 2026

AI-Gram: When Visual Agents Interact in a Social Network

LLM-driven visual agents form complex communication structures, but stubbornly resist stylistic convergence, revealing a fundamental tension between social expression and individual identity.

Andrew Shin

Computer Vision Multimodal Models Tool Use & Agents

B. Lim +3·Apr 23, 2026

VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.

B. Lim, Kyeonghyun Kim, Jung-Shin Yun +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Mohit Vaishnav +1·Apr 23, 2026

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.

Mohit Vaishnav, T. Tammet

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Jin Guo +2·Apr 23, 2026

Can MLLMs "Read" What is Missing?

MLLMs struggle to "read" missing text directly from visual context, even when they possess the necessary visual grounding and layout understanding.

Jin Guo, Xi Fang, Chaozheng Huang

Eval Frameworks & Benchmarks Multimodal Models

Tasnim Kabir +5·Apr 23, 2026

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar +3

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Breno Matos +4·Apr 23, 2026

Misinformation Span Detection in Videos via Audio Transcripts

Pinpointing exactly *when* misinformation occurs in videos is now possible, thanks to two new datasets and a strong baseline for misinformation span detection.

Breno Matos, Rennan C. Lima, Savvas Zannettou +2

Multimodal Models Natural Language Processing Speech & Audio

Hao-Yu Hsu +4·Apr 23, 2026·also UIUC

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Imagine reconstructing detailed human motion and scene layouts using just your smartwatch and earbuds – no cameras needed.

Hao-Yu Hsu, Tianhang Cheng, Jing Wen +2

Computer Vision Multimodal Models Robotics & Embodied AI

Katharina Prasse +6·Apr 23, 2026

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

VLMs can reliably reveal population-level trends in climate change discourse on social media, even when per-image accuracy is only moderate.

Katharina Prasse, Steffen Jung, Isaac Bravo +4

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·Apr 23, 2026

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.

Chentao Li, Zirui Gao, Mingze Gao +3

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Apr 23, 2026

Grounding Video Reasoning in Physical Signals

Current video Q&A benchmarks can be fooled by textual regularities, failing to actually ground reasoning in the video's physical reality.

Alibay Osmanli, Zixu Cheng, Shaogang Gong

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Zixu Li +5·Apr 23, 2026

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Multi-modification image retrieval is now possible: TEMA handles complex, real-world instructions that go beyond simple changes, outperforming existing methods on new datasets M-FashionIQ and M-CIRR.

Zixu Li, Yupeng Hu, Zhiheng Fu +3

Computer Vision Multimodal Models Recommendation & Information Retrieval

Apr 23, 2026·also Tsinghua AI, Westlake

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.

Zeyu Cai, Yuliang Xiu, Renke Wang +8

Computer Vision Multimodal Models Robotics & Embodied AI

Chang-Fu Wang +6·Apr 23, 2026

CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction

H&E slides can now predict spatial gene expression with significantly improved accuracy and robustness, even when faced with unseen slide variations, thanks to a novel post-hoc calibration technique.

Chang-Fu Wang, Xinran Wang, Donghai Liu +4

Computer Vision Multimodal Models Scientific Discovery & Drug Design

S. Pintea +1·Apr 23, 2026

Deep kernel video approximation for unsupervised action segmentation

Forget optimal transport – MMD with Neural Tangent Kernels offers a faster, easier-to-optimize path to unsupervised video action segmentation with competitive accuracy.

S. Pintea, J. Dijkstra

Computer Vision Multimodal Models

Qingxiao Li +6·Apr 23, 2026

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.

Qingxiao Li, Lifeng Xu, Qinglin Wang +4

Code Generation & Program Synthesis Multimodal Models Reasoning & Chain-of-Thought

Chanhong Hwang +4·Apr 23, 2026

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Spatial reasoning gets a boost: a new framework dynamically orchestrates vision-language agents at test time, outperforming fixed-pipeline approaches by adapting to the reliability of different spatial cues.

Chanhong Hwang, Miso Choi, S. On +2

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Zhaohong Huang +4·Apr 23, 2026

Prototype-Based Test-Time Adaptation of Vision-Language Models

Ditch the cache: Prototype-Based Test-Time Adaptation (PTA) boosts vision-language model accuracy by nearly 4% while *doubling* inference speed compared to existing cache-based methods.

Zhaohong Huang, Yuxin Zhang, Wenjing Liu +2

Computer Vision Inference & Quantization Multimodal Models

Southern Illinois University Carbondale·Apr 23, 2026

FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment

By adversarially removing camera-specific fingerprints, FryNet forces models to learn genuine chemical representations from thermal images, enabling robust and generalizable frying oil oxidation assessment.

Khaled R. Ahmed, Toqi Tahamid Sarker, Taminul Islam +2

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Wenmin Huang +3·Apr 23, 2026

AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing

Achieve more precise facial attribute editing by decoupling attribute manipulation from image synthesis, sidestepping the optimization challenges of directly combining GANs and diffusion models.

Wenmin Huang, Weiqi Luo, Xiaochun Cao +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 23, 2026·also JD.com

KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

Forget boring ads: this new method uses creative knowledge to generate videos that actually match product features and move realistically.

Linkai Liu, Wei Feng, Xi Zhao +9

Computer Vision Multimodal Models Natural Language Processing

Zhiyong Li +5 Apr 23, 2026

Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

Unsupervised video-based person re-identification is now possible without hard pseudo-label assignments, thanks to a hierarchical temporal prototyping approach that significantly outperforms existing methods.

Zhiyong Li, Wei Jiang, Haojie Liu +3

Computer Vision Multimodal Models

Dhruv Parikh +3 Apr 23, 2026

Latent Denoising Improves Visual Alignment in Large Multimodal Models

LMMs can gain surprising robustness and visual understanding by learning to denoise corrupted visual tokens, even without extra inference overhead.

Dhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan +1

Computer Vision Multimodal Models Training Efficiency & Optimization

Apr 23, 2026·also Tsinghua AI, Sheffield

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.

Jingkun Chen, Ru Xu, Mingqi Gao +2

Computer Vision Multimodal Models Robotics & Embodied AI

Wenmin Huang +3 Apr 23, 2026

LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

Achieve state-of-the-art facial attribute editing and style manipulation with a diffusion model by ditching semantic directions for style codes and a clever forward-backward consistency training strategy that avoids paired images.

Wenmin Huang, Weiqi Luo, Xiaochun Cao +1

Computer Vision Multimodal Models

I. Liu +9 Apr 23, 2026

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

Forget brittle visual-history buffers: LoHo-Manip uses a VLM task manager with visual trace prompts to achieve robust long-horizon robotic manipulation through implicit closed-loop replanning.

I. Liu, An-Chieh Cheng, Rui Yan +7

Multimodal Models Robotics & Embodied AI Tool Use & Agents

CMU ML Apr 23, 2026·also NTU, UB

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Real-world robots can now navigate complex environments with human-level instructions, thanks to a new system that combines efficient perception with high-level reasoning, all while running in real-time on limited hardware.

Kuan Xu, Ruimeng Liu, Yizhuo Yang +5

Computer Vision Multimodal Models Robotics & Embodied AI

Amir Rasouli +6 Apr 23, 2026

How VLAs (Really) Work In Open-World Environments

Current VLA benchmarks may be overstating real-world readiness, as models succeeding by standard metrics often exhibit unsafe behaviors and poor robustness.

Amir Rasouli, Yangzheng Wu, Zhiyuan Li +4

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Byounggun Park +1 Apr 23, 2026

Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning

Fine-tuning VLMs with action-aligned language supervision and terrain-aware preference optimization unlocks more robust off-road autonomous driving, outperforming prior approaches on key traversability metrics.

Byounggun Park, Soonmin Hwang

Multimodal Models Robotics & Embodied AI World Models & Planning

Dachong Li +3 Apr 23, 2026·also Shenzhen University

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

Explicitly constraining action generation with predicted spatial "corridors" boosts VLA model performance by up to 12.4% on challenging robotic manipulation tasks.

Dachong Li, Zhuangzhuang Chen, Jin Zhang +1

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Robotics & Embodied AI

Yanjun Zhao +8 Apr 23, 2026·also University of Illinois Urbana-Champaign

PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

Current multimodal LLMs still struggle to integrate information and reason critically when assessed on real scientific papers, despite progress on isolated tasks.

Yanjun Zhao, Tianxin Wei, Jiaru Zou +6

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

L. V. D. Heuvel +2 Apr 23, 2026

Neurodiversity and Technostress: Towards a Multimodal Research Design for Evaluating Subjective, Physiological, and Behavioral Responses

Current technostress research overlooks neurodiversity, but this multimodal design could reveal hidden vulnerabilities and inform more inclusive digital work environments.

L. V. D. Heuvel, Igor Ivkić, René Riedl

Multimodal Models Natural Language Processing

Yiming Zhong +7 Apr 23, 2026

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

By spectrally decoupling robot control into intent and dynamics, ResVLA offers a more efficient and robust approach to generative VLA policies.

Yiming Zhong, Yaoyu He, Zemin Yang +5

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Robotics & Embodied AI

Apr 23, 2026

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

Early fusion UMR models lean too heavily on text, while late fusion struggles to relate semantically similar content – MiMIC offers a fix.

Juanxi Li, Chuanghao Ding, Xujie Zhang +1

Computer Vision Multimodal Models Recommendation & Information Retrieval

Yuanchen Fei +5 Apr 23, 2026

Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

Synthetic data can significantly boost controllable human video generation, but only if you carefully select which synthetic samples to use.

Yuanchen Fei, Yude Zou, Zejian Kang +3

Computer Vision Data Curation & Synthetic Data Multimodal Models

AI2 Apr 23, 2026

Seeing Fast and Slow: Learning the Flow of Time in Videos

Time is a learnable visual concept: models can now reason about and manipulate the flow of time in videos, opening doors to temporally controllable video generation and temporal forensics.

Yen-Siang Wu, Rundong Luo, Jingsen Zhu +6

Computer Vision Multimodal Models

Apr 23, 2026·also Eyeline Labs

Vista4D: Video Reshooting with 4D Point Clouds

Reshooting video from arbitrary viewpoints just got a whole lot better thanks to a 4D point cloud representation that maintains temporal consistency and precise camera control.

Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca +9

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 23, 2026·also RUC

Probing Visual Planning in Image Editing Models

Image editing models can learn to solve visual planning puzzles with finetuning, but still lag far behind humans in zero-shot efficiency, revealing a key gap in neural visual reasoning.

Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao +3

Computer Vision Multimodal Models World Models & Planning

Apr 22, 2026

Apr 22, 2026·also Adobe Research

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Ditch the fixed trade-offs: ParetoSlider lets you smoothly navigate competing generative goals in diffusion models at inference time, without retraining.

Shelly Golan, Michael Finkelson, Ariel Bereslavsky +2

Computer Vision Multimodal Models RLHF & Preference Learning

Tsinghua AI Apr 22, 2026·also Imperial, State Key Laboratory of CAD & CG, ZJU

Exploring Spatial Intelligence from a Generative Perspective

Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.

Muzhi Zhu, Shunyao Jiang, Huan Zheng +11

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Humanoid Robot (Shanghai) Co. Apr 22, 2026·also HIT, Tongji, UMich

VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation

Vision-based tactile signals in the VTOUCH dataset significantly enhance bimanual manipulation capabilities, paving the way for more effective robotic interactions.

Qianxi Hua, Xinyue Li, Zheng Yan +3

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 22, 2026·also Aristotle University of Thessaloniki, Max Planck

LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

Ditch sparse contact cues: LEXIS-Flow uses a learned manifold of interaction signatures to capture dense, continuous proximity between humans and objects, leading to more realistic 3D HOI reconstructions.

Dimitrije Antić, Alvaro Budria, George Paschalidis +2

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 22, 2026·also Meituan

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Open-source MLLMs can now achieve state-of-the-art accuracy on complex tabular reasoning tasks, even outperforming models 18x their size, by explicitly penalizing visual hallucinations and shortcut guessing through process-supervised RL.

Yubo Jiang, Yitong An, Abudukelimu Wuerkaixi +3

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026·also Aarhus University, Beihang, JDT AI Infra

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Current MLLMs fail to detect covert advertisements, revealing a critical gap in social media moderation that could mislead consumers and pose ethical risks.

Jingyi Zheng, Tianyi Hu, Yule Liu +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Multimodal Models
