100 papers published across 9 labs.
LMMs can slash FLOPs by 89% without sacrificing accuracy, thanks to a frequency-modulated visual restoration technique that preserves crucial visual semantics even with fewer tokens.
Tactile robotic perception gets a boost with a new pretraining method that explicitly encodes force, geometry, and orientation, leading to a 52% reduction in regression error.
Achieve up to 1.28x faster VLA model inference for robotic manipulation without retraining, simply by merging visual tokens based on depth.
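The blurb doesn't spell out the algorithm, but depth-based token merging can be sketched generically: bucket patch tokens by their estimated depth and average-merge each bucket before the frozen backbone sees them. Everything below (function name, bin count, shapes) is a hypothetical illustration under that assumption, not the paper's implementation.

```python
import torch

def merge_tokens_by_depth(tokens, depth, num_bins=8):
    """Hypothetical depth-bin token merging (illustrative sketch only).

    tokens: (N, D) visual token embeddings
    depth:  (N,)  per-token depth estimates (e.g., from a monocular depth map)
    Returns a shorter sequence where tokens in the same depth bin are averaged.
    """
    # Normalize depth to [0, 1) and assign each token to a bin.
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)
    bins = (d * num_bins).long().clamp(max=num_bins - 1)

    merged = []
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            merged.append(tokens[mask].mean(dim=0))  # average-merge within a bin
    return torch.stack(merged)                       # (<= num_bins, D)

# Usage: shrink 256 patch tokens to at most 8 depth-merged tokens before the
# (frozen) VLA backbone processes them -- no retraining of the model itself.
tokens = torch.randn(256, 1024)
depth = torch.rand(256)
print(merge_tokens_by_depth(tokens, depth).shape)
```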
Video reasoning models can suffer up to a 35% drop in accuracy and a 28% drop in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.
Forget paired video-music training data: V2M-Zero aligns video and music by matching the *timing* of changes within each modality, not the content itself.
VLA-controlled robots can now detect anomalies in under 100ms using a plug-and-play module, enabling real-time recovery from unexpected situations.
Automating museum video metadata curation is now possible with a locally deployable video language model, unlocking previously inaccessible audiovisual archives.
Autonomous driving's next leap hinges on reasoning, not just perception, but current LLM-based approaches are too slow for real-time control.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
LVLMs can now provide depth-aware pedestrian navigation guidance by grounding language reasoning and segmentation, without needing user-provided cues or anchor points.
Explicitly aligning audio and video streams in a multimodal Transformer boosts emotion recognition, showing that ignoring frame-rate differences hurts performance.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
Forget catastrophic forgetting: this imitation learning framework remembers up to 65% more while improving AUC by 10-17 points on the LIBERO benchmark.
Achieve robust humanoid task execution in complex environments by turning high-level language instructions into verifiable, geometrically-grounded task programs that can recover from failures.
Speech tokenizers, despite being crucial for multimodal LLMs, primarily capture phonetic information, creating a semantic mismatch with text-derived semantics that hinders performance.
This new OCR model beats Gemini-3.1-Pro and Qwen3-VL-235B on key information extraction, thanks to its clever "Layout-as-Thought" process that recovers layout grounding in end-to-end OCR.
Ditch discrete visual tokens: UniCom achieves SOTA multimodal generation by compressing continuous semantic representations, unlocking better controllability and consistency in image editing.
Achieve 2.5x higher success in UAV navigation by decoupling target generation from progress monitoring, enabling safer and more efficient zero-shot flight.
A compact 0.9B multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
Forget fine-tuning: surprisingly, single neuron activations in VLMs can be directly probed to create classifiers that outperform the full model, with 5x speedups.
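As a rough illustration of what probing a single neuron can look like, the sketch below thresholds one recorded activation to build a binary classifier; the function name and the threshold search are assumptions made for the example, not the paper's procedure.

```python
import numpy as np

def neuron_probe_classifier(activations, labels):
    """Hypothetical single-neuron probe (illustrative, not the paper's method).

    activations: (N,) activation of ONE hidden unit, recorded while the frozen
                 VLM processes N labeled images.
    labels:      (N,) binary labels.
    Returns a threshold and its accuracy; sign(activation - threshold) predicts the label.
    """
    best_thr, best_acc = 0.0, 0.0
    for thr in np.unique(activations):
        # Accept either polarity of the neuron (high or low activation = positive class).
        acc = max(((activations > thr) == labels).mean(),
                  ((activations <= thr) == labels).mean())
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

# Usage: reading one unit is far cheaper than decoding through the language head,
# which is where the reported speedups would come from.
acts = np.random.randn(200) + np.repeat([0.0, 1.5], 100)
labs = np.repeat([0, 1], 100).astype(bool)
print(neuron_probe_classifier(acts, labs))
```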
Generative AI's ability to reason about and refine images based on authenticity criteria inadvertently creates a powerful evasion strategy that renders current deepfake detectors ineffective.
A training-free visual distillation method boosts VLA model performance in cluttered environments by over 34%, proving that targeted noise reduction is more effective than brute-force scaling.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
By decoupling visual and motor information during pretraining, FutureVLA unlocks more effective visuomotor prediction for vision-language-action models, boosting performance without modifying downstream architectures.
By jointly modeling video dynamics and actions, DiT4DiT achieves 10x sample efficiency and 7x faster convergence in robot policy learning, showing that video generation can be a powerful scaling proxy.
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
Vision-language models can significantly enhance language models through knowledge distillation, even without direct textual understanding, challenging conventional KD paradigms.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
Autonomous vehicles can now better "see" the world even when cameras fail, thanks to a new method that fills in the blanks by leveraging spatial overlaps and learned semantic priors.
Skip expensive manual annotation: this method extracts accurate 3D UAV trajectories and classifications directly from readily available internet videos.
Generate realistic and controllable videos of humans interacting with objects using only sparse motion cues, like wrist positions and object bounding boxes.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
By converting point clouds into a format VLMs can understand, VLM-Loc significantly boosts text-to-point-cloud localization accuracy, outperforming prior methods that rely on shallower text-point cloud correspondences.
Sports expose surprising limitations in VLMs' spatial reasoning, as current models struggle to generalize from existing benchmarks despite fine-tuning gains on a new, large-scale dataset.
A 4B-parameter model, InternVL-U, outperforms 14B-parameter models in multimodal generation and editing, proving that size isn't everything.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
Forget manual labeling: STONE offers a massive, automatically-labeled dataset for off-road robot navigation, unlocking scalable training for robust 3D traversability prediction.
Ditch the map: a diffusion model learns to plan UAV swarm trajectories directly from RGB images, enabling reactive and adaptive navigation in cluttered environments.
Human-in-the-loop learning can now boost dexterous manipulation VLA models by 25%, thanks to a new framework that smartly samples corrective actions and enables real-time intervention.
Explicitly teaching LVLMs to reason step-by-step with reinforcement learning unlocks state-of-the-art performance on multimodal object-entity relation extraction.
Achieve SOTA multi-modal object tracking by adaptively fusing modalities with a Mixture of Experts and decoupling temporal propagation with separate State Space Models.
By explicitly bridging the gap between on-body appearances and flat layouts, BridgeDiff achieves state-of-the-art virtual try-off results, generating more realistic and structurally sound flat-garment representations.
Unlock real-time semantic SLAM and multimodal interaction with 3D Gaussian Splatting using X-GS, a unified and extensible open framework.
Steer clear of catastrophic forgetting in VLMs with EvoPrompt, a new method that evolves prompts by preserving learned semantic directions while adapting their magnitude.
Large models are emerging as a promising new paradigm for translating complex-layout document images, as shown by the ICDAR 2025 DIMT competition.
LVLMs can be jailbroken by "Reasoning-Oriented Programming," which chains together harmless visual inputs to trigger harmful reasoning, much like return-oriented programming in traditional security exploits.
By explicitly modeling how abnormalities relate within and across different medical image views, GIIM achieves significantly higher diagnostic accuracy and robustness, even with incomplete data.
Skip the expensive proxy model training: this training-free method boosts VLLM performance by up to 4.8% using only 10-15% of the data, simply by measuring how much the question *changes* the model's view of the answer.
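One plausible way to operationalize "how much the question changes the model's view of the answer" is a KL divergence between the answer distributions with and without the question; the sketch below assumes that framing and is not the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def question_influence_score(logits_with_q, logits_without_q):
    """Hypothetical data-scoring sketch (illustrative only).

    Measures how much the question shifts the answer distribution:
    KL( p(answer | image, question) || p(answer | image) ).
    Samples with a low score add little and could be dropped, keeping
    only the most informative fraction of the data.
    """
    log_p_with = F.log_softmax(logits_with_q, dim=-1)
    log_p_without = F.log_softmax(logits_without_q, dim=-1)
    # torch's kl_div(input, target) computes KL(target || input).
    return F.kl_div(log_p_without, log_p_with, log_target=True, reduction="batchmean")
```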
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
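A minimal sketch of what an input permutation strategy could look like at inference time: run the model over several orderings of the clips and majority-vote the answers. The wrapper below is illustrative; `answer_fn` stands in for whatever LALM inference call you already have, so no retraining is involved.

```python
import itertools
from collections import Counter

def permutation_vote(audio_clips, question, answer_fn, max_perms=6):
    """Hypothetical permutation-and-vote wrapper (illustrative sketch only).

    audio_clips: list of audio inputs the LALM receives in sequence
    answer_fn:   callable (clips, question) -> answer string
    Runs the model on several clip orderings and majority-votes the answers,
    reducing sensitivity to the order in which concurrent audios are presented.
    """
    perms = list(itertools.permutations(audio_clips))[:max_perms]  # cap factorial growth
    answers = [answer_fn(list(p), question) for p in perms]
    return Counter(answers).most_common(1)[0][0]
```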
Controllable emotion style transfer in speech is now possible without needing paired data, opening new avenues for data augmentation and expressive AI.
Forget retraining: Ego personalizes VLMs on the fly by extracting and leveraging visual tokens that represent specific concepts using the model's internal attention.
A 4B-parameter model outperforms Gemini-3-Pro in autonomous driving by incorporating physics-informed constraints and style-aware training, suggesting specialized models can surpass larger, general-purpose models in domain-specific tasks.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Forget blurry sketch-to-image outputs: this method uses component-aware self-attention and coordinate-preserving fusion to generate photorealistic images with unprecedented fidelity and spatial accuracy.
Finally, a GelSight-style sensor that doesn't force you to choose between pre-contact vision and high-fidelity tactile sensing.
Ditch the flat scene graphs: TopoOR models surgical environments as higher-order topological structures, unlocking superior performance in safety-critical tasks by preserving complex relationships and multimodal data.
Precisely steer text-to-image generation along cognitive dimensions like valence and memorability with CogBlender, a framework that lets you dial in psychological intent.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
Unlock the power of web videos for embodied AI: implicit geometry representations let agents learn to navigate from real-world room tours without relying on fragile 3D reconstruction.
By representing visual inputs as 3D Gaussian primitives, GST-VLA unlocks a new level of geometric understanding for vision-language-action models, leading to substantial performance gains in robotic manipulation tasks.
Unlock realistic acoustic simulations with a text prompt: fine-tuning a text-to-audio model generates plausible room impulse responses, even with limited paired data.
Reverse image search, a key tool for visual fact-checking, often amplifies misinformation and irrelevant content, burying debunking information.
Domain-specific biosignal foundation models, fused with multimodal ECG and PPG data, substantially outperform general time-series models on clinically relevant tasks, but bigger isn't always better.
Imagine writing a script and instantly seeing it come to life – Doki makes generative video authoring as intuitive as writing a text document.
A new large-scale dataset could jumpstart Vietnamese VQA research by providing a crucial resource for training and evaluating multimodal models in a low-resource language.
MLLMs still struggle to reliably predict the long-term consequences of actions in egocentric videos, even with structured scene annotations.
VLMs can now self-evolve from *zero* data, thanks to a multi-agent RL framework that synthesizes its own visual concepts and reasoning tasks.
A robot can now achieve 90% success in peg-in-hole tasks, even with only 0.1mm clearance, by intelligently fusing vision and tactile feedback when visual occlusion occurs.
Even GPT-5 struggles with multi-modal robustness and turn overhead when user personas and multi-modal inputs are considered in agent evaluation, revealing critical gaps in current LLM agent capabilities.
Combining pre-trained and custom neural networks with data augmentation and transfer learning yields a robust autonomous driving system capable of accurately perceiving and reacting to its environment.
Finally, a single model that can generate both your face and voice, convincingly controlled by text prompts and reference clips.
Provably secure steganography can now withstand real-world image compression and processing thanks to a clever latent-space optimization technique.
Medical multi-agent systems can reason deeply, but fall apart when switching between medical specialties, highlighting a critical need for more robust architectures.
LLMs can drive pedagogical agents to be more engaging and effective by dynamically generating speech and gestures that align with the semantic context of instructional content.
Panoramic vision-language models can achieve a level of holistic scene understanding and robustness in adverse conditions that's impossible for traditional pinhole-based VLMs.
Robots can now recover from failures during manipulation tasks by explicitly tracking progress against spatial subgoals, without needing extra training data or models.
Adapt your action anticipation model on-the-fly to new viewpoints (egocentric or exocentric) with a novel test-time adaptation method that leverages multi-label prototype growing and dual-clue consistency.
Ditch global embeddings for text-motion retrieval: this method uses joint-angle motion images and token-patch late interaction to achieve state-of-the-art accuracy and interpretability.
Generate more realistic and nuanced human movements from text by explicitly modeling individual body parts, overcoming the limitations of existing holistic approaches.
Skip the costly policy training: this zero-shot method nails text-goal instance navigation by grounding language in 3D geometry for smarter exploration and verification.
Current AI models fall short when asked to understand a situation from the combined perspectives of multiple embodied agents, as revealed by a new challenging benchmark.
FetalAgents leapfrogs existing fetal ultrasound analysis tools by dynamically orchestrating specialized AI agents, outperforming monolithic models across diverse clinical tasks and delivering structured clinical reports from video streams.
Multimodal models that seem robust can still fail when some modalities are systematically missing, a problem MissBench exposes with new metrics for modality equity and learning balance.
By fusing confidence-weighted point cloud projections with a Kalman-inspired update mechanism, ConfCtrl enables diffusion models to generate geometrically consistent novel views from sparse inputs, even under significant viewpoint shifts.
By translating visual observations into language, LAP achieves state-of-the-art procedure planning by disambiguating visually similar actions, outperforming vision-only methods.
By injecting symbolic reasoning into vision-language-action models, NS-VLA achieves remarkable gains in data efficiency and generalization for robotic manipulation.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
A single spatial token, learned via occupancy prediction on a massive dataset, is surprisingly effective at injecting crucial spatial awareness into vision-language navigation, leading to state-of-the-art performance.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
By having a single VLM critique its own SVG renderings, IntroSVG learns to generate more complex, semantically aligned, and editable vector graphics from text prompts.
Forget training separate models for different field-of-views in geo-localization — SinGeo achieves SOTA robustness with a single model, even outperforming specialized architectures.
Stop letting sparse rewards bottleneck your VLN agent: SACA disentangles failed trajectories into valid prefixes and divergence points for dense supervision, unlocking SOTA performance.
Unlock scalable, privacy-sensitive image steganography with MIDAS, a training-free diffusion framework that grants user-specific access control to hidden multi-image content.
Even with 80% of brain scan data missing, ACADiff can accurately generate the missing modalities and maintain robust diagnostic performance for Alzheimer's disease.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
Text-only foundation models can perform surprisingly well on complex 3D spatial reasoning tasks, rivaling multimodal models, when equipped with a structured spatial representation derived from 3D reconstruction.
Ditch slow, iterative ODE solvers for robot control: this method distills flow-based policies into a single-step model that's fast enough for real-time replanning without sacrificing multi-modal action diversity.