100 papers published across 3 labs.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Forget training data – Extend3D generates impressive town-scale 3D scenes from a single image by cleverly extending and patching the latent space of an object-centric 3D generative model.
By tightly coupling reasoning, searching, and generation, Unify-Agent achieves state-of-the-art world-grounded image synthesis, rivaling closed-source models and opening new avenues for agent-based multimodal generation.
Forget tedious manual editing: CutClaw's multi-agent system can automatically transform hours of raw footage into engaging, rhythm-aligned short videos.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
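A minimal sketch of what importance-plus-coverage token pruning could look like; the greedy selection rule, the mixing weight `alpha`, and all function names below are illustrative assumptions rather than the paper's actual method:

```python
import numpy as np

def prune_tokens(features, coords, scores, keep_ratio=0.09, alpha=0.5):
    """Keep a fraction of 3D tokens by mixing semantic importance with
    geometric coverage (farthest-point style). Names and the mixing rule
    are illustrative, not the paper's algorithm."""
    n = len(scores)
    k = max(1, int(n * keep_ratio))
    s = (scores - scores.min()) / (scores.ptp() + 1e-8)   # normalize importance to [0, 1]
    kept = [int(np.argmax(s))]                             # start from the most salient token
    for _ in range(k - 1):
        # distance of every token to the nearest already-kept token
        d = np.min(np.linalg.norm(coords[:, None] - coords[kept][None], axis=-1), axis=1)
        d = d / (d.max() + 1e-8)
        combined = alpha * s + (1 - alpha) * d              # balance importance vs. coverage
        combined[kept] = -np.inf                            # never re-select a kept token
        kept.append(int(np.argmax(combined)))
    return features[kept], coords[kept]
```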
Multimodal deep learning models for cancer prognosis may not be synergizing information across modalities as much as we think; better performance seems to come from simply adding complementary signals.
Adding MRI data to histopathology and gene expression modestly improves glioma survival prediction, but only when combined effectively in a trimodal deep learning model.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Achieve superhuman robot dexterity with 10x fewer demonstrations by decoupling intent and action through latent world modeling.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
Image generation models can now achieve state-of-the-art fidelity with up to 64x fewer tokens, thanks to a novel masking strategy that prevents latent space collapse.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
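A rough sketch of runtime adapter swapping under this idea: fold one low-rank delta out of the shared base weights and fold the next one in, so only the tiny adapters are stored per task. The dictionary layout, scaling, and function names are assumptions for illustration, not the paper's on-device mechanism:

```python
import torch

def apply_lora(base_weight, lora_A, lora_B, scale=1.0):
    """Fold a LoRA delta (B @ A) into a frozen base weight."""
    return base_weight + scale * (lora_B @ lora_A)

def swap_adapter(model_weights, old_adapter, new_adapter, scale=1.0):
    """Switch tasks without reloading the base model: subtract the old
    low-rank delta from each weight, then add the new one."""
    swapped = {}
    for name, W in model_weights.items():
        W = W - scale * (old_adapter[name]["B"] @ old_adapter[name]["A"])
        W = W + scale * (new_adapter[name]["B"] @ new_adapter[name]["A"])
        swapped[name] = W
    return swapped
```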
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
MLLMs are more vulnerable than we thought: imperceptible visual prompts can effectively hijack their behavior.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
Adversarial training doesn't have to destroy VLMs' zero-shot abilities: aligning adversarial visual features with textual embeddings using the original model's probabilistic predictions can actually *improve* robustness.
AI-generated image forgery detection gets a major boost with PromptForge-350k, a dataset so large and well-annotated it pushes IoU scores 5% higher and generalizes to unseen models.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
Physical AI systems struggle not with visual recognition, but with understanding space, physics, and action – and PRISM, a new retail video dataset, dramatically closes this gap.
Diffusion-based denoising can significantly improve composed image retrieval by making similarity scores more robust to hard negative samples.
Throw out your full images: focusing on pathology-relevant visual patches in radiology reports dramatically outperforms using the entire image for summarization.
All that glitters isn't gold for LVLMs: a new information-theoretic analysis reveals that some lean heavily on language priors while others genuinely fuse vision and language.
Radiology report generation models can now verbalize calibrated confidence estimates, enabling targeted radiologist review of potentially hallucinated findings.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
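A minimal sketch of a DPO step where the rejected item is drawn at random from a candidate pool rather than fixed in advance; the sampling rule, `beta`, and the argument names are illustrative guesses, not the paper's code:

```python
import random
import torch.nn.functional as F

def dpo_loss_with_sampled_negative(logp_chosen, ref_chosen,
                                    logp_negatives, ref_negatives, beta=0.1):
    """Standard DPO objective, except the rejected example is sampled
    stochastically from a pool of candidate negatives."""
    idx = random.randrange(len(logp_negatives))            # stochastic negative choice
    chosen_margin = logp_chosen - ref_chosen               # policy vs. reference, preferred item
    rejected_margin = logp_negatives[idx] - ref_negatives[idx]
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))
```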
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
Multimodal repair isn't always better: selectively escalating to multimodal prompting based on runtime signals yields a superior success-cost-energy tradeoff for Scratch program repair compared to uniformly applied multimodal approaches.
Surgical VQA gets a major upgrade: SurgTEMP's hierarchical visual memory and competency-based training leapfrog existing models in understanding complex, time-sensitive surgical procedures.
Forget generating uncanny-valley characters: Gloria lets you create consistent, expressive digital characters in videos exceeding 10 minutes, a leap towards believable virtual actors.
Even state-of-the-art VLMs exhibit systematic failures in reasoning about the physical feasibility of actions in 3D environments, despite high semantic confidence.
Achieve fine-grained, six-degrees-of-freedom camera control in dynamic scenes with a generalizable model that outperforms scene-specific and diffusion-based approaches.
Fusing low-level statistical anomalies, high-level semantic coherence, and mid-level texture patterns makes AI-generated image detection far more reliable across diverse generative models.
Achieve massive gains in few-shot hierarchical multi-label classification (+42%) by adaptively balancing semantic priors and visual evidence using level-aware embeddings.
By injecting LLM-derived contextual cues into skeleton representations, SkeletonContext achieves state-of-the-art zero-shot action recognition, even distinguishing visually similar actions without explicit object interactions.
Radio astronomy-aware self-supervised pre-training beats out-of-the-box Vision Transformers for transfer learning on radio astronomy morphology tasks.
Masked motion generators struggle with complex movements because they treat all frames the same – until now.
Edge cameras can achieve a 45% improvement in cross-modal retrieval accuracy by ditching redundant frames and focusing only on what's new.
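One way to read "focusing only on what's new" is novelty-gated frame selection before indexing; the per-pixel L1 novelty score and threshold below are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def select_novel_frames(frames, threshold=0.15):
    """Drop near-duplicate frames before cross-modal retrieval: a frame is
    kept only if it differs enough from the last kept frame."""
    kept = [0]
    for i in range(1, len(frames)):
        novelty = np.mean(np.abs(frames[i].astype(np.float32) -
                                 frames[kept[-1]].astype(np.float32))) / 255.0
        if novelty > threshold:
            kept.append(i)
    return kept
```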
VLMs struggle with Earth observation tasks involving complex land use, but a new dataset with nearly 10 million text annotations could change that.
Querying satellite imagery just got easier: EarthEmbeddingExplorer lets you find images using text, visuals, or location, unlocking insights previously trapped in research papers.
Finally, a blind face restoration method that doesn't just hallucinate details, but lets you precisely control facial attributes via text prompts while maintaining high fidelity.
Multimodal models surprisingly falter when applied to presentation attack detection on ID documents, challenging the assumption that combining visual and textual data inherently improves security.
Ditching depth map projections for camera-LiDAR calibration unlocks significant gains in accuracy and robustness, especially when starting from poor initial extrinsic estimates.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can generate more accurate motion trajectories by clustering them into geometrically consistent families, even without retraining.
Gaze, often overlooked, reveals deepfake origins with surprising accuracy, enabling a new CLIP-based approach that significantly boosts deepfake attribution and detection.
Stop segmenting remote sensing images in isolation: modeling inter-unit dependencies boosts open-vocabulary segmentation accuracy by up to 6%.
Negation, a known weakness in VLMs like CLIP, can be dramatically improved by strategically fine-tuning only the *front* layers of the text encoder with a modified contrastive loss.
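A small sketch of restricting updates to the front of the text encoder, assuming an OpenAI-style CLIP module layout (`transformer.resblocks.i`); the attribute paths and the cutoff `k` are assumptions that will differ across codebases, and the modified contrastive loss itself is not shown:

```python
def unfreeze_front_text_layers(model, k=3):
    """Freeze everything, then re-enable gradients only for the first k
    transformer blocks of the text encoder (torch-style nn.Module)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        # e.g. "transformer.resblocks.2.attn.in_proj_weight" in OpenAI-style CLIP
        for i in range(k):
            if f"transformer.resblocks.{i}." in name:
                p.requires_grad = True
```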
Forget expensive training: FlexMem unlocks SOTA long-video MLLM performance on a single GPU by cleverly mimicking human memory recall.
Forget tedious optimization – LightHarmony3D generates realistic lighting and shadows for inserted 3D objects in a single pass, making scene augmentation feel truly real.
Diffusion-based image editing's impressive flexibility comes with fundamental trade-offs between controllability, faithfulness, consistency, locality, and quality, which this paper exposes with clear theoretical bounds.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
Humanoids can now nimbly navigate real-world clutter and complex terrain using only raw depth data, ditching hand-engineered geometric representations.
Achieve state-of-the-art robotic manipulation with a model orders of magnitude smaller than VLAs by explicitly aligning kinematic and semantic transitions.
VLN agents can now "dream ahead" by learning action-conditioned visual dynamics in a latent space, leading to SOTA results and improved real-world navigation.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Giving VLMs access to basic image manipulation tools and a strategic routing system dramatically improves their ability to "see through" visual illusions, even generalizing to unseen illusion types.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Style transfer can now capture the essence of artistic abstraction, not just surface-level appearance, by explicitly reinterpreting image structure.
Finally, a video generation model lets you roam through a scene with long-term spatial and temporal consistency, opening up new possibilities for virtual exploration.
Unleashing creative potential in text-to-image models just got easier: on-the-fly repulsion in the contextual space lets you steer diffusion transformers towards richer diversity without sacrificing image quality or blowing your compute budget.
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Image generation takes a leap towards real-world knowledge by training an agent that actively searches for and integrates external information, substantially boosting performance on knowledge-intensive tasks.
Zero-shot Vision-Language Models can now guide chip floorplanning, beating specialized ML methods by up to 32% without any fine-tuning.
Current vision-language benchmarks miss the mark: AMIGO reveals how hard it is for agents to ground visual information across multiple images and turns.
Anticancer drugs, whether organic or inorganic, can now be understood through a single unified representation, unlocking knowledge transfer between previously siloed chemical domains.
VLMs can appear to gain up to 58% F1 on clinical tasks simply by *mentioning* MRI data in the prompt, even when the data is uninformative, revealing a "scaffold effect" that inflates performance metrics.
Multi-view learning with prototype-based correction significantly boosts the robustness of thyroid nodule ultrasound classification across different ultrasound devices and clinical environments.
VLA models are brittle: even simple synonym substitutions in instructions cause a 22-52% performance drop in robotic manipulation tasks.
Finally, a framework that unifies dynamic graph models, topological learning, and multimodal fusion to decompose health risk into interpretable components.
Get 80% of your oracle feedback for free: ROVED leverages vision-language embeddings to drastically reduce the need for human preferences in reinforcement learning.
Disentangling perception and reasoning with role-specific rewards in multimodal LLMs boosts accuracy by 7 points, revealing a critical bottleneck in existing joint optimization approaches.
Adversarial training unlocks domain-invariant prompts for CLIP, boosting zero-shot generalization beyond standard prompt tuning.
A 7B model trained on a new dataset of Chinese porcelain outperforms GPT-4 by 12% on expert connoisseurship tasks, demonstrating the power of domain-specific training and tool integration.
MLLMs can now guide visual generative models to imagine what's hidden behind objects, significantly boosting amodal completion performance.
Diffusion models can now predict driver attention with state-of-the-art accuracy by incorporating LLM-enhanced semantic reasoning.
PReD leaps ahead by creating the first foundation model to close the loop on perception, recognition, and decision-making for electromagnetic signals.
Open-source document parsing models are shockingly brittle, losing nearly 18% accuracy on real-world photos and 14% on non-Latin scripts compared to their closed-source counterparts.
VLMs can unlock insights from troves of historical documents previously inaccessible due to OCR limitations, achieving state-of-the-art transcription and speaker tagging of Italian parliamentary speeches.
Scientific figure QA models are often fooled by the answer choices themselves, but a simple decoding strategy that contrasts image-grounded scores with text-only scores can significantly improve accuracy.
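A minimal sketch of that contrastive scoring idea: penalize answer options the model already likes without looking at the figure. The subtraction form and the weight `alpha` are illustrative, not the paper's exact decoding rule:

```python
import numpy as np

def contrastive_choice(logits_with_image, logits_text_only, alpha=1.0):
    """Pick the option whose image-grounded score most exceeds its
    text-only score, discounting language-prior-driven choices."""
    adjusted = np.asarray(logits_with_image) - alpha * np.asarray(logits_text_only)
    return int(np.argmax(adjusted))
```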
Even state-of-the-art vision-language models still struggle to reconcile visual evidence with commonsense, often hallucinating based on prior knowledge instead of what they actually see.
LVLM inference is ripe for optimization, but current acceleration techniques only scratch the surface.
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Forget expensive, low-realism 3D renders: diffusion models can now generate photorealistic human datasets that boost model performance beyond real-world data.
Achieve state-of-the-art image similarity generalization with a surprisingly simple, efficient, and interpretable model that operates on local descriptor correspondences.
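In the spirit of that descriptor-correspondence idea, here is a simple interpretable baseline: score similarity by mutual nearest-neighbour matches between local descriptors. It is a sketch of the general technique, not the paper's model:

```python
import numpy as np

def correspondence_similarity(desc_a, desc_b):
    """Score image similarity from local descriptors alone: count mutual
    nearest-neighbour matches and average their cosine similarity."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sims = a @ b.T                               # pairwise cosine similarities
    nn_ab = sims.argmax(axis=1)                  # best match in B for each A descriptor
    nn_ba = sims.argmax(axis=0)                  # best match in A for each B descriptor
    mutual = [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
    if not mutual:
        return 0.0
    return float(np.mean([sims[i, j] for i, j in mutual]))
```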
VLMs can be devastatingly fooled by modifying less than 2% of image pixels in a fixed, X-shaped pattern, causing them to fail spectacularly across diverse tasks like classification, captioning, and question answering.
End-to-end recognition of complex chemical structures from documents is now possible, thanks to a new model and dataset that leapfrog existing methods.
Fuzzy logic bridges the gap between LLM reasoning and low-level artifact detection, creating a surprisingly effective AI-generated image detector.
Achieve robust maritime scene segmentation by jointly learning restoration, fusion, and segmentation, outperforming prior methods in complex, degraded conditions.
Achieve significantly better structure preservation in text-guided image editing by injecting structure-related features into visual autoregressive models, guided by reinforcement learning.
Forget disjointed workflows: AutoCut's unified token space for video, audio, and text slashes ad production costs while boosting consistency.
Finally, a way to measure how efficiently a sketch conveys meaning, moving beyond simple recognition accuracy.
Finally, you can precisely control specific objects in long, consistent driving videos, even those pesky long-tail objects.
Unlock CLIP's black box: EZPC reveals the "why" behind zero-shot image recognition by grounding predictions in human-understandable concepts, without sacrificing accuracy.
ColorFLUX achieves superior old photo colorization by cleverly disentangling structure and color, outperforming even closed-source commercial models.
Bypass the need for predicate annotations in 3D scene graph pretraining: a novel topological layout learning approach drives predicate relation learning without explicit labels.
Forget clunky 3D modeling software – ObjectMorpher lets you intuitively reshape objects in images with simple drags, yielding photorealistic results that outperform both 2D and previous 3D-aware methods.
Training remote sensing image-text retrieval models on real-world noisy data can be significantly improved by a self-paced learning strategy that mimics human cognitive learning patterns.