Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Image-conditioned video diffusion models can now be fine-tuned to produce more realistic motion dynamics and long-term temporal coherence via a novel reward-driven approach that avoids common pitfalls like reward hacking.
AdaMuS overcomes the bias towards high-dimensional data in multi-view learning by adaptively pruning redundant parameters and sparsely fusing views, leading to improved performance on dimensionally unbalanced data.
By explicitly reasoning in 3D, VolumeDP leaps ahead of 2D-based imitation learning methods, achieving a remarkable 14.8% improvement on the LIBERO benchmark and robust real-world generalization.
By iteratively reasoning over video snippets with a Chain-of-Thought, R²VLM achieves state-of-the-art long-horizon task progress estimation without needing to process entire videos at once.
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Unleash creativity in text-to-image models with a single, reusable 64-token template, sidestepping costly iterative prompt engineering and reasoning.
Even with a 98:1 test-to-train ratio, PEFT methods like QLoRA can unlock surprisingly strong generalization from billion-parameter vision models for agricultural image classification, suggesting that underfitting, not overfitting, is the bigger risk.
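A minimal sketch of the kind of setup this result points at, using Hugging Face `transformers` and `peft`; the backbone, label count, and LoRA hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForImageClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit (NF4) quantization of the frozen backbone, as in QLoRA.
quant = BitsAndBytesConfig(load_in_4bit=True,
                           bnb_4bit_quant_type="nf4",
                           bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-large-patch16-224",   # stand-in for a billion-parameter vision model
    num_labels=10,                    # e.g. crop-disease classes (assumed)
    ignore_mismatched_sizes=True,
    quantization_config=quant,
)

# Low-rank adapters on the attention projections; only these weights train.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["query", "value"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()    # a tiny fraction of total parameters
```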
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
This model beats clinical reports in quantitative coronary angiography, opening the door to automated, comprehensive assessment of coronary artery disease at the point of care.
Forget verbose instructions: this new VLN paradigm uses floor plans to guide navigation with concise commands, boosting success rates by 60%.
Existing 3D visual grounding methods crumble in complex scenes, but PC-CrossDiff's dual-level attention unlocks a +10% accuracy boost by parsing subtle spatial cues.
Naive fine-tuning of VLMs for multimodal sequential recommendation causes catastrophic modality collapse, but can be fixed with gradient rebalancing and cross-modal regularization.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.
Pruning vision tokens across both the ViT and LLM can yield a 62% efficiency boost in video VLMs with minimal performance loss, and without complex text conditioning.
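For intuition, here is a generic attention-score token-pruning step of the sort this summary describes (not the paper's exact method); the keep ratio of 0.38 simply mirrors the quoted 62% reduction.

```python
import torch

def prune_tokens(vision_tokens: torch.Tensor,
                 cls_attention: torch.Tensor,
                 keep_ratio: float = 0.38) -> torch.Tensor:
    """Keep only the vision tokens the [CLS] token attends to most.

    vision_tokens: (B, N, D) patch embeddings from the ViT.
    cls_attention: (B, N) attention weights from the [CLS] token to each patch.
    """
    _, num_tokens, dim = vision_tokens.shape
    keep = max(1, int(num_tokens * keep_ratio))
    idx = cls_attention.topk(keep, dim=-1).indices   # most-attended patches
    idx = idx.sort(dim=-1).values                    # preserve spatial order
    return torch.gather(vision_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, dim))

# Example: 576 patch tokens reduced before they reach the LLM.
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
print(prune_tokens(tokens, scores).shape)  # torch.Size([2, 218, 1024])
```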
Forget fixed layer counts: LaDe generates fully editable, layered media designs with a *flexible* number of semantically meaningful layers, outperforming existing methods in text-to-layer alignment.
Robots can now nimbly navigate complex, multi-floor environments without prior training, thanks to a new strategy that dynamically switches between exploration, recovery, and memory recall.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
Stop struggling with the stability-plasticity dilemma in multilingual Speech-LLMs: Zipper-LoRA dynamically disentangles LoRA updates to boost low-resource ASR without sacrificing cross-lingual transfer.
MLLMs' image segmentation prowess isn't a given: a critical adapter layer actually *hurts* performance, with the LLM having to recover via attention-mediated refinement.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Injecting semantic information from related modalities early in the embedding process significantly boosts performance on multimodal medical image classification tasks.
Achieve state-of-the-art semantic 3D reconstruction from sparse views by intelligently pruning redundant Gaussians and blending 2D and 3D semantic cues.
Ruyi2.5 achieves comparable performance to Qwen3-VL on general multimodal benchmarks while significantly outperforming it in privacy-constrained surveillance, demonstrating the effectiveness of its edge-cloud architecture.
Simply translating symbolic sign language notations into natural language unlocks significantly better motion generation when conditioning on phonological attributes with CLIP.
Finally, a unified framework lets you control both facial appearance and voice timbre for personalized audio-video generation across multiple identities.
Synthesizing realistic 6-DOF object manipulation trajectories in complex 3D environments just got a whole lot better with GMT, a multimodal transformer that substantially outperforms existing methods.
Unlock automated creation of production-ready 3D assets from untextured meshes with TAPESTRY, which generates geometrically consistent turntable videos that can be back-projected into UV textures or used to supervise neural rendering.
Interactive avatars can now exhibit more emotionally appropriate and contextually aware facial behaviors thanks to a novel architecture that disentangles audio-driven lip movements from user-driven non-lip facial expressions.
By disentangling semantic and contextual cues in vision-language models, PCA-Seg achieves state-of-the-art open-vocabulary segmentation with only 0.35M additional parameters per block.
Radiologist dictation, combined with foundation models and minimal parameter updates, can achieve state-of-the-art MRI brain tumor segmentation.
Achieve high-fidelity transparent text animations from image-to-video models without retraining the VAE, sidestepping data scarcity and latent pattern mixing issues.
Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.
Reconstructing realistic 3D human crowds from a single image is now possible, thanks to a new method that cleverly handles occlusions and appearance variations.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
By cleverly using readily available video segmentation masks, this method boosts DINOv2's point tracking performance by over 14% – a surprisingly effective way to inject temporal awareness into static image-pretrained models.
VLN agents can navigate more effectively by predicting their future states and proactively planning based on forecasted semantic map cues, rather than relying solely on historical context.
Forget training wheels: GoalVLM lets multi-agent robots navigate to any object you describe, no pre-programmed categories needed.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Overcome scarce data and boost material classification accuracy by generating synthetic training data and distilling knowledge from vision-language foundation models.
Instead of forcing modalities to imitate each other, IIBalance lets each modality contribute according to its intrinsic information budget, leading to better multimodal fusion.
Even when visual data is missing or noisy, EgoAdapt accurately determines who is talking to the camera wearer by adaptively integrating head orientation, lip movement, and robust audio features.
Image editing models leak fascinating hints about their world knowledge through "edit spillover"—unintended changes to semantically related regions—and this paper turns that leakage into a probe.
VLMs don't fail to *recognize* harmful intent when jailbroken; instead, visual inputs *shift* their internal representations into a distinct "jailbreak state," opening a new avenue for defense.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
Quantizing large vision-language models just got a whole lot better: a new token-level sensitivity metric closes the accuracy gap with full-precision models by up to 1.6% in 3-bit weight-only quantization.
CLIP struggles with fine-grained details in cross-domain few-shot learning, but a cycle-consistency method can fix its vision-language alignment and boost performance.
Synthesizing realistic intermediate video frames just got a whole lot better, thanks to a novel attention mechanism that anchors to keyframes and text prompts for improved consistency and semantic alignment.
Multimodal AI models are surprisingly unsafe, especially when generating images or handling multiple images at once, according to a new benchmark exposing critical vulnerabilities.
By probabilistically fusing visual context into text prompts, VirPro closes the semantic gap in weakly-supervised 3D detection, boosting performance by nearly 5% on KITTI.
Unlock the power of MLLMs for structured data like human skeletons with a differentiable rendering approach that allows end-to-end training.
By fusing IMU-derived egomotion with visual data, Motion-MLLM lets MLLMs achieve SOTA 3D scene understanding with 40% less compute.
By unifying layout-to-image generation and image grounding with a novel cycle-consistent learning approach, EchoGen achieves state-of-the-art results in both tasks, proving that solving two problems at once can be better than solving them separately.
Forget finetuning: DynaEdit unlocks complex video edits like action modification and object insertion, all without training, using clever manipulation of pretrained text-to-video models.
Forget waiting minutes for iterative optimization – Omni-3DEdit performs diverse 3D editing tasks in a single forward pass.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
By adaptively calibrating facts and augmenting emotions, FACE-net overcomes the factual-emotional bias that plagues emotional video captioning.
A new prompt-free medical image segmentation model achieves impressive zero-shot and cross-modal transfer performance by explicitly disentangling geometric and semantic anatomical knowledge.
Skip the costly training and go straight to open-vocabulary 3D reasoning with ReLaGS, which builds a 3D semantic scene graph from language-distilled Gaussians.
Overcome weather limitations in remote sensing with MM-OVSeg, a multimodal Optical-SAR fusion framework that enables robust open-vocabulary segmentation even under cloudy conditions.
Grabbing two keyframes per shot – one for the gist, one for the glitch – lets you compress videos for VLMs without missing critical anomalies.
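One plausible reading of the gist/glitch heuristic, sketched over precomputed frame embeddings; the distance-to-centroid rule here is an assumption for illustration, not the paper's stated selection criterion.

```python
import numpy as np

def pick_keyframes(frame_embeddings: np.ndarray,
                   shot_boundaries: list[tuple[int, int]]):
    """For each shot, return two frame indices: the most typical frame
    ("gist") and the most atypical one ("glitch")."""
    keyframes = []
    for start, end in shot_boundaries:
        shot = frame_embeddings[start:end]
        centroid = shot.mean(axis=0)
        dists = np.linalg.norm(shot - centroid, axis=1)
        gist = start + int(dists.argmin())     # closest to the shot's average content
        glitch = start + int(dists.argmax())   # farthest from it, likely the anomaly
        keyframes.append((gist, glitch))
    return keyframes

# Example: 300 frames with 512-d embeddings, split into three shots.
emb = np.random.randn(300, 512)
print(pick_keyframes(emb, [(0, 100), (100, 200), (200, 300)]))
```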
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
RIS models struggle with motion-based queries, but a new data augmentation and contrastive learning approach closes the gap without sacrificing performance on appearance-based descriptions.
Robot control gets a whole lot faster: ProbeFlow slashes action decoding latency by 14.8x in Vision-Language-Action models, all without retraining.
Surprisingly, you can achieve smooth, controllable image editing in text-to-image models without any training, just by intelligently nudging the text embeddings.
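A minimal sketch of the general idea of nudging text embeddings, assuming a CLIP text encoder and a frozen diffusion pipeline that accepts precomputed prompt embeddings; the linear blending rule is illustrative rather than the paper's method.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative text encoder; the paper's actual backbone is not specified here.
repo = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(repo)
text_encoder = CLIPTextModel.from_pretrained(repo)

def embed(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state  # (1, 77, 768)

src = embed("a photo of a cat sitting on a sofa")
tgt = embed("a photo of a dog sitting on a sofa")

# "Nudging": blend the two prompt embeddings and hand the result to a frozen
# text-to-image pipeline (e.g. via a prompt_embeds argument) to sweep smoothly
# from the source image toward the edit, with no training involved.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = (1 - alpha) * src + alpha * tgt
    # pipe(prompt_embeds=blended, ...)  # generation step omitted in this sketch
```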
Gesture-aware pretraining unlocks significant improvements in 3D hand pose estimation, proving that semantic gesture information acts as a powerful inductive bias.
Differential attention and asymmetric loss functions can significantly improve the performance of BiomedCLIP on highly imbalanced video classification tasks like identifying rare pathologies in video capsule endoscopy.
Reconstructing complete, animatable 3D avatars from heavily occluded YouTube videos is now possible, thanks to a hallucination-as-supervision pipeline using diffusion models.
Medical vision-language models perform better when the modality gap is tuned to an intermediate level, challenging the assumption that minimizing it is always optimal.
Turning past programming failures into reusable knowledge boosts automated repair performance by 3.7% on a multimodal benchmark.
By focusing on semantic differences between scans, DiffVP lets LLMs generate more accurate CT reports without needing explicit lesion localization.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.
Symphony's cognitively-inspired multi-agent system significantly boosts long-form video understanding by mimicking human reasoning, achieving state-of-the-art results on multiple benchmarks.
Forget collapsing videos into text – this hierarchical grid lets you zoom into any moment with lossless visual fidelity, unlocking logarithmic compute scaling for long-form video understanding.
Video fine-tuning boosts MLLMs' video smarts, but surprisingly dumbs them down on static images – a trade-off you can't simply brute-force away with more frames.
Achieve more precise robot control by explicitly disentangling high-level goals from low-level kinematic instructions.
Concept erasure in text-to-image models is mostly smoke and mirrors: a text-free attack can still regenerate "forgotten" concepts by exploiting the model's latent visual knowledge.
VLMs struggle to reason about visual scenes in adverse weather, losing significant segmentation accuracy as rain, snow, or fog intensifies.
Achieve state-of-the-art performance in multimodal remote sensing semantic segmentation with significantly fewer trainable parameters by using a novel parameter-efficient and modality-balanced symmetric fusion framework.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
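A toy illustration of why a hierarchical scene graph is token-efficient for this kind of navigation: the node types and the compact serialization below are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Object:
    name: str
    position: tuple[float, float, float]

@dataclass
class Room:
    name: str
    objects: list[Object] = field(default_factory=list)

@dataclass
class Floor:
    level: int
    rooms: list[Room] = field(default_factory=list)

def to_prompt(floors: list[Floor]) -> str:
    """Serialize the hierarchy into a short, token-efficient description
    suitable for an LLM navigation prompt."""
    lines = []
    for f in floors:
        lines.append(f"floor {f.level}:")
        for r in f.rooms:
            objs = ", ".join(o.name for o in r.objects) or "empty"
            lines.append(f"  {r.name}: {objs}")
    return "\n".join(lines)

scene = [Floor(0, [Room("kitchen", [Object("mug", (1.2, 0.3, 0.9))]),
                   Room("hallway")])]
print(to_prompt(scene))
```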
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.
Policies trained on DexViTac's multimodal dataset achieve over 85% success in real-world dexterous manipulation, proving that high-fidelity tactile data unlocks a new level of robotic dexterity.
AdaZoom-GUI achieves SOTA GUI grounding by adaptively zooming in on small elements and refining ambiguous instructions, outperforming even larger models.
VLMs can now drive embodied agents to navigate complex environments with unprecedented efficiency, thanks to a novel framework that bridges the gap between 2D semantic understanding and 3D spatial reasoning.
Don't let your robot's brief moment of panic get lost in the noise – this new uncertainty method spotlights those critical spikes to predict failures before they happen.
Robots can think (and act) twice as fast: HeiSD's hybrid speculative decoding turbocharges embodied agents by intelligently switching between draft and retrieval strategies.
Human-robot teams can get a boost: eye-tracking data alone can predict when a human teammate is struggling to understand the robot's situation with nearly 90% recall.
Current multimodal browsing agents are surprisingly bad at using visual information on webpages, with even top models scoring below 50% accuracy on a new visual-native search benchmark.
Achieve object-level motion control in image-to-video generation without any training by cleverly exploiting attention maps and first-last-frame priors.
Normalizing flows can flag anomalous relationships in scene graphs with 10% better accuracy and 5x faster speed than existing methods, while also exhibiting superior robustness to semantic variations.
Compressing images into 1D token sequences can yield state-of-the-art reconstruction fidelity, challenging the necessity of 2D spatial grids for visual tokenization.
Text-heavy fine-tuning is blinding your MLLM to crucial 3D spatial information, but GAP-MLLM's geometry-aligned pre-training can restore its sight.
Stop blindly steering all layers of your LVLM - this new method uses attribution to apply targeted interventions only where hallucinations originate, preserving performance on general tasks.
Diffusion models can now capture nuanced semantic and material details in image stylization, moving beyond simple color-driven transformations, thanks to a Mixture of Experts architecture.
Fine-tuning Vision-Language Model planners for robotic manipulation is now significantly more efficient and safer thanks to a novel framework that leverages video world models to simulate real-world physics.
Autonomous vehicles can now see through the storm: a new Mixture of Experts approach boosts 3D object detection accuracy by 15% in adverse weather, without slowing things down.
Open-source VLMs can be easily fooled by simple gradient-based attacks, but the degree of vulnerability varies drastically across architectures.
A multimodal pipeline integrating vision, OCR, and LLMs can achieve state-of-the-art content moderation performance at significantly lower latency than existing methods, especially for threats embedded in text.