LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, He Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

Yi Wang +17·3w ago

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.

Yi Wang, Xincheng Li, Pengwei Xie +15

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

Stanford HAI·3w ago·also Tsinghua AI, Beihang, CUHK, HKUST +1

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.

Houyuan Chen, Hong Li, Xianghao Kong +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

All Papers (100)

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Minghui Chen +7·3w ago

Online Self-Calibration Against Hallucination in Vision-Language Models

LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, He Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

Microsoft Research·3w ago·also SNU

Map2World: Segment Map Conditioned Text to 3D World Generation

Forget grid layouts: Map2World lets you generate consistent 3D worlds from arbitrary segment maps, offering unprecedented control and scalability.

Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang +2

Computer Vision Multimodal Models World Models & Planning

Yan Fang +9·3w ago

Let ViT Speak: Generative Language-Image Pre-training

Ditch the complex multimodal pre-training pipelines: GenLIP proves a simple language modeling objective can effectively align vision encoders with LLMs, achieving strong performance with less data.

Yan Fang, Mengcheng Lan, Zilong Huang +7

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Siyuan Huang +8·3w ago

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

LVLMs can maintain sharper visual focus during long-form generation by adding a lightweight, learnable memory module that bypasses attention dilution.

Siyuan Huang, Xiaoye Qu, Yafu Li +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Massimo Rondelli +2·3w ago

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

LLMs can now generate 70% syntactically correct and geometrically consistent 3D objects from text, thanks to retrieval-augmented code synthesis.

Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval

Apr 30, 2026

Clemson University·3w ago

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Architectural diversity offers surprisingly little defense against adversarial attacks on VLMs for autonomous driving, with physical patches transferring effectively across different models.

David Fernandez, Pedram MohajerAnsari, Amir Salarpour +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Jialun Shen +8·3w ago·also DP Technology

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Current multimodal LLMs struggle to understand scientific spectra, but a new benchmark and data processing technique could change that.

Jialun Shen, Jialu Shen, Han Lyu +6

Eval Frameworks & Benchmarks Multimodal Models Scientific Discovery & Drug Design

Habtom Kahsay Gidey +2·3w ago

A Pattern Language for Resilient Visual Agents

Enterprise AI doesn't have to be a latency nightmare: this pattern language offers a blueprint for integrating VLAs with deterministic control loops.

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Tool Use & Agents

Tsinghua AI·3w ago

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.

Xupeng Chen, Binbin Shi, Chenqian Le +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Nhi Ngoc-Yen Nguyen +5·3w ago

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Ignoring language-specific structure in scene-text captioning is a recipe for disaster in tonal languages like Vietnamese, but a new graph framework leveraging phonological attention can help.

Nhi Ngoc-Yen Nguyen, Anh-Duc Nguyen, Anh Nguyen +3

Computer Vision Multimodal Models Natural Language Processing

Tsinghua AI·3w ago·also MiniCPM-o Team, Tencent AI

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real-time, all on a device with only 12GB of RAM.

Junbo Cui, Bokai Xu, Chongyi Wang +36

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Open-Source Models & Weights

Junyoung Lee +13·3w ago

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

A 48-camera system finally unlocks real-time, room-scale multi-human, multi-robot interaction research in realistic home environments.

Junyoung Lee, Junyoung Lee, Sookwan Han +11

Computer Vision Multimodal Models Robotics & Embodied AI

Jing Zhang +10·3w ago

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

By unifying specialized detectors with MLLMs in an agentic framework, Echo-α achieves state-of-the-art ultrasound interpretation, suggesting a path to more accurate, interpretable, and transferable medical AI.

Jing Zhang, Wentao Jiang, Tao Huang +8

Computer Vision Multimodal Models Tool Use & Agents

Zujin Guo +6·3w ago

Generate Your Talking Avatar from Video Reference

Ditch the static image: this method generates realistic talking avatars by learning from *videos* of the subject in completely different scenes.

Zujin Guo, Zhenhui Ye, Yi Ren +4

Computer Vision Multimodal Models Speech & Audio

Tsinghua AI·3w ago·also BUPT

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.

Junpeng Ding, Zichen Tang, Zichen Tang +21

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago·also CUHK

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation.

By explicitly aligning image features with the hierarchical structure of radiology reports, RIHA generates more clinically accurate and coherent reports than models that treat reports as flat sequences.

Yucheng Chen, Yang Yu, Yufei Shi +3

Computer Vision Multimodal Models Natural Language Processing

3w ago·also PKU

Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

Forget task-specific architectures: Uni-HOI uses a unified framework with LLMs to jointly model text, human motion, and object motion, enabling strong performance across diverse HOI tasks.

Mengfei Zhang, Jinlu Zhang, Zhigang Tu

Computer Vision Multimodal Models Natural Language Processing

Mengling Deng +19·3w ago·also Fudan, RUYi Dynamics Co

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM delivers production-grade VLM/LLM inference performance on edge devices, outperforming vendor-specific toolchains by up to 49% while remaining open-source and cross-platform.

Mengling Deng, Menglin Deng, Yuanpeng Chen +17

Computer Vision Inference & Quantization Multimodal Models

Ajou University·3w ago·also GenGenAI, SNU, UT Austin

Sparse-View 3D Gaussian Splatting in the Wild

Achieve high-fidelity 3D rendering from sparse, unconstrained real-world images by intelligently synthesizing novel views with diffusion models and Gaussian replication.

Wongi Park, Jordan A. James, Myeongseok Nam +4

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

3w ago·also Mississippi State University, PolyU

Representative Spectral Correlation Network for Multisource Remote Sensing Image Classification

Ditching PCA for spectral reduction can yield state-of-the-art performance in multisource remote sensing image classification while slashing computational costs.

Chuanzheng Gong, Feng Gao, Junyan Lin +2

Computer Vision Multimodal Models

Doyeop Kwak +3·3w ago

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.

Doyeop Kwak, Jeongsoo Choi, Suyeon Lee +1

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Francisco M. López +12·3w ago

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

Unlock a baby's-eye view: Reconstruct and replay infant movements on robots to simulate their sensory experiences, offering unprecedented insights into early development.

Francisco M. López, Francisco M. L'opez, Hoshinori Kanazawa +10

Computer Vision Multimodal Models Robotics & Embodied AI

Dominik Klement +5·3w ago·also Brno University of Technology

BUT System Description for CHiME-9 MCoRec Challenge

Integrating visual cues into a long-context ASR system slashes word error rate by 16% in multi-talker conversational recordings, proving the power of AV fusion.

Dominik Klement, Alexander Polok, Nguyen Hai Phong +3

Multimodal Models Natural Language Processing Speech & Audio

3w ago·also Macquarie, Meituan, UNSW

Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

Stop drowning your MLLMs in irrelevant document noise: FES-RAG shows that carefully selecting multimodal fragments as evidence boosts performance by up to 27% while shrinking context length.

Xihang Wang, Zihan Wang, Chengkai Huang +4

Multimodal Models Natural Language Processing Recommendation & Information Retrieval

Pengna Li +9·3w ago

SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Teaching VLMs to "look back" and "look ahead" with lightweight spatial reasoning tasks unlocks surprisingly strong navigation performance.

Pengna Li, Kangyi Wu, Shaoqing Xu +7

Computer Vision Multimodal Models Robotics & Embodied AI

3w ago·also HIT

Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection

Simple frequency masking and gated injection can dramatically improve the generalization of AI-generated image detectors, even against unseen generative models.

Shuchang Zhou, Shangkun Wu, Shang Wu +3

Computer Vision Multimodal Models

Ali Shibli +3·3w ago·also KTH

Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection

Ditch the costly sampling: Noise2Map turns diffusion models into fast, end-to-end semantic segmentation and change detection machines by directly predicting maps from noise.

Ali Shibli, A. Nascetti, Andrea Nascetti +1

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Mingliang Liang +3·3w ago·also Radboud

Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

VLMs can get a boost in long-tail performance and train more efficiently by dynamically upsampling underrepresented data clusters each epoch.

Mingliang Liang, Zhuoran Liu, Arjen P. de Vries +1

Data Curation & Synthetic Data Multimodal Models Training Efficiency & Optimization

3w ago

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.

Fengxian Ji, Jingpu Yang, Zirui Song +5

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Yujin Han +14·3w ago

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Expert-level video aesthetics can be captured and improved using a hierarchical rubric and reward models trained with a progressive learning scheme.

Yujin Han, Yujie Wei, Yefei He +12

Computer Vision Multimodal Models

3w ago

LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

Forget static imitation learning: LaST-R1 unlocks near-perfect robotic manipulation (99.8% success) by adaptively reasoning about physical dynamics *before* acting, then refining with RL.

Hao Chen, Jiaming Liu, Jiaming Liu +19

Multimodal Models Reasoning & Chain-of-Thought Robotics & Embodied AI

DAMO·3w ago·also NTU

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Today's visual generation models are often evaluated on the wrong things, leading to inflated performance claims that mask critical failures in spatial reasoning, temporal consistency, and causal understanding.

Keming Wu, Zuhao Yang, Kaichen Zhang +28

Computer Vision Multimodal Models World Models & Planning

Yujin Jeong +4·3w ago

When Do Diffusion Models learn to Generate Multiple Objects?

Diffusion models struggle with multi-object generation not because of imbalanced concept representation, but primarily due to scene complexity and a surprising difficulty in counting, especially when training data is limited.

Yujin Jeong, Arnas Uselis, Iro Laina +2

Computer Vision Data Curation & Synthetic Data Multimodal Models

Qiyao Wang +7·3w ago

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.

Qiyao Wang, Haoran Hu, Longze Chen +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Multimodal Models +1

Tsinghua AI·3w ago

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.

Sudong Wang, Weiquan Huang, Xiaomin Yu +10

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI +1

Yan Cui +8·3w ago·also Enable Medicine

Linking spatial biology and clinical histology via Haiku

By jointly embedding spatial biology, histology, and clinical data, Haiku lets you ask "what if" questions about disease progression, revealing molecular shifts linked to clinical outcomes.

Yan Cui, Jacob S. Leiby, Wenhui Lei +6

Computer Vision Multimodal Models Scientific Discovery & Drug Design

MotuBrain Team +20·3w ago·also Tsinghua AI

MotuBrain: An Advanced World Action Model for Robot Control

Real-time robot control just got a 50x speed boost thanks to MotuBrain's efficient world action model.

MotuBrain Team, Chendong Xiang, Fan Bao +18

Multimodal Models Robotics & Embodied AI World Models & Planning

Hanzhong Guo +10·3w ago·also ByteDance

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.

Hanzhong Guo, Jie Wu, Jie Wu +8

Computer Vision Multimodal Models RLHF & Preference Learning

Utrecht University·3w ago

From LLM-Driven Trading Card Generation to Procedural Relatedness: A Pokémon Case Study

Imagine a Pokémon TCG where every card is uniquely yours, dynamically generated by AI to reflect your playstyle and preferences.

Johannes Pfau, Panagiotis Vrettis

Computer Vision Multimodal Models Natural Language Processing

Kenneth J. K. Ong·3w ago

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

VLMs playing the Prisoner's Dilemma can be manipulated into selfish behavior simply by showing them images of aggression or reward matrices with specific color schemes.

Kenneth J. K. Ong

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·3w ago·also Telecom

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.

Yang Zhang, Jiangyuan Zhao, Chenyou Fan +11

Multimodal Models Robotics & Embodied AI World Models & Planning

Guang Yang +3·3w ago

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

MLLMs can ace circuit-to-code generation by cheating with identifier semantics, even when the circuit diagram is blank.

Guang Yang, Xing Hu, Xiang Chen +1

Code Generation & Program Synthesis Computer Vision Multimodal Models

Ce Chen +8·3w ago·also HeyGen Research

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.

Ce Chen, Yi Ren, Yuanming Li +6

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago·also Baidu, Brown

MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection

Achieve state-of-the-art multimodal stance detection by having multiple AI agents debate each other, complete with retrieval-augmented context and self-reflection.

Weihai Lu, Zhejun Zhao, Yanshu Li +1

Multimodal Models Natural Language Processing Recommendation & Information Retrieval

NVIDIA·3w ago·also National Center for Childhood Diabetes, Pheno.AI, Schneider Children's Medical Center of Israel, TAU +3

Simulating clinical interventions with a generative multimodal model of human physiology

A generative model of human physiology not only beats existing clinical risk scores at predicting disease, but also accurately simulates the effects of clinical interventions, paving the way for personalized medicine.

Guy Lutsker, Gal Sapir, G. Sapir +12

Multimodal Models Scientific Discovery & Drug Design World Models & Planning

Xupeng Chen +8·3w ago

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Ditching text chunks for full document page images in medical RAG boosts QA accuracy by a full percentage point, proving that visual context matters.

Xupeng Chen, Binbin Shi, Chenqian Le +6

Multimodal Models Recommendation & Information Retrieval Scientific Discovery & Drug Design

Hiroyuki Deguchi +2·3w ago

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

A single, optimized text snippet can fool CLIP into thinking it's a good caption for almost any image, revealing a surprising vulnerability in cross-modal understanding.

Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai

Multimodal Models Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Ke Xu·3w ago

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

A carefully crafted synthetic data pipeline and rubric-guided RL lets a 4B parameter model nearly match Gemini-3-Flash on wafer defect analysis, suggesting that data quality and targeted training can trump sheer model size.

Ke Xu

Computer Vision Data Curation & Synthetic Data Multimodal Models

Neemias B da Silva +3·3w ago

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.

Neemias B da Silva, Rodrigo Minetto, Daniel Silver +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Meta AI·3w ago·also Oxford

3D-ReGen: A Unified 3D Geometry Regeneration Framework

Controllable 3D generation takes a leap forward with 3D-ReGen, a framework that leverages an initial 3D shape for tasks like enhancement and editing, outperforming existing methods.

Geon Yeong Park, Geon Yeong Park, Roman Shapovalov +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

3w ago·also JIUTIAN Research

TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Ditch the garment masks: a simple human mask is all you need to nail video virtual try-on in the wild.

Dingbao Shao, Di Shao, Songhan Wu +13

Computer Vision Data Curation & Synthetic Data Multimodal Models

Shiqi Xu +5·3w ago

ClimateVID -- Social Media Videos Analysis and Challenges Involved

Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.

Shiqi Xu, Moritz Burmester, Katharina Prasse +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Initializing prompts in flatter regions of the loss landscape dramatically improves calibration and performance in test-time prompt tuning for vision-language models.

Hyeonseo Jang, Hyeon-Gi Jang, Jaebyeong Jeon +3

Computer Vision Multimodal Models Training Efficiency & Optimization

Ji-Hyeon Kim +2·3w ago

ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

By explicitly modeling relationships between multiple relevant video segments, ClipTBP significantly improves video moment retrieval, especially when queries are ambiguous.

Ji-Hyeon Kim, Ho-Joong Kim, Seong-Whan Lee

Computer Vision Multimodal Models Recommendation & Information Retrieval

3w ago·also XJTU

Revealing the Impact of Visual Text Style on Attribute-based Descriptions Produced by Large Visual Language Models

LVLMs leak visual text style into semantic inference, meaning the font of a word can change the attributes a model associates with the concept it represents.

Xiaomeng Wang, Martha Larson, Zhengyu Zhao

Computer Vision Multimodal Models

3w ago

REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

Flat 2D images can now be turned into voluminous 3D assets with state-of-the-art fidelity, thanks to a clever inflated-prior and latent-refinement pipeline.

Hankyeol Lee, H. Lee, Wooyeol Baek +3

Computer Vision Multimodal Models

Zhengqing Wang +7·3w ago·also SFU, Wayve

LA-Pose: Latent Action Pretraining Meets Pose Estimation

Self-supervised learning from driving videos can beat fully supervised methods for camera pose estimation, even with orders of magnitude less labeled data.

Zhengqing Wang, Saurabh Nair, S. Nair +5

Computer Vision Multimodal Models Robotics & Embodied AI

3w ago·also Shanghai AI Lab

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Current MLLMs still struggle to connect the dots between images and text when they're interleaved, highlighting a critical gap in real-world multimodal understanding.

Bingli Wang, Huanze Tang, Haijun Lv +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Lijin Yang +5·3w ago

Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

Autonomous driving gets a 30% performance boost in challenging scenarios by having VLAs critique and refine their own driving plans.

Lijin Yang, Jianing Huang, Jian-Zhang Huang +3

Computer Vision Multimodal Models Robotics & Embodied AI

3w ago·also Taipei Medical University

JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

Fusing dermoscopic images, clinical photos, and patient metadata with adaptive weighting dramatically improves skin lesion classification, even in imbalanced, real-world clinical datasets.

Phan Nguyen, P. Nguyen, Dat Cao +9

Computer Vision Multimodal Models Scientific Discovery & Drug Design

Apr 29, 2026

Pokuang Zhou +9·3w ago

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Quadruped robots can now perform contact-rich manipulation with significantly improved dexterity by learning to "feel" their way through tasks.

Pokuang Zhou, Yuhao Zhou, Quan Luu +7

Multimodal Models Robotics & Embodied AI Tool Use & Agents

3w ago·also Open-EP Community

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Achieve faster VLM inference in bandwidth-constrained edge environments by adaptively compressing visual data, outperforming full-edge and full-cloud solutions without sacrificing semantic accuracy.

Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni

Computer Vision Inference & Quantization Multimodal Models

3w ago

LLM-Enhanced Topical Trend Detection at Snapchat

Snapchat's new trend detection system proves that LLMs can successfully consolidate multimodal signals at scale to surface emerging topics from short-form video, boosting content freshness and user engagement.

Hangqi Zhao, Jay Li, Abhiruchi Bhattacharya +6

Multimodal Models Natural Language Processing Recommendation & Information Retrieval

3w ago·also Southwestern University of Finance and Economics

CARD: Non-Uniform Quantization of Visual Semantic Unit for Generative Recommendation

Skewed item distributions in recommendation systems can be tamed with a learnable non-uniform quantization, leading to better codebook utilization and more accurate generative recommendations.

Yibiao Wei, Jie Zou, Pengfei Zhang +4

Inference & Quantization Multimodal Models Recommendation & Information Retrieval

3w ago

TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation

Forget static graphs: TimeMM dynamically reweights user-item interactions based on recency and modality, adapting to evolving user preferences in multimodal recommendations.

Wei Yang, Rui Zhong, Xiaodan Wang +3

Multimodal Models Recommendation & Information Retrieval

Tsinghua AI·3w ago·also Fudan

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.

GLM-V Team, Wenyi Hong, Xiaotao Gu +88

Computer Vision Multimodal Models Tool Use & Agents

Howard University·3w ago·also Adobe Research

FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

Texture, not color, is the secret sauce behind fashion house identity, revealed by probing a multimodal CNN trained on decades of Vogue runway images.

Morayo Danielle Adeyemi, Ryan A. Rossi, Ryan A. Rossi +2

Computer Vision Interpretability & Mechanistic Interp Multimodal Models

Tsinghua AI·3w ago·also Xiaomi Robotics

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.

Jun Guo, Qiwei Li, Peiyan Li +8

Computer Vision Multimodal Models Robotics & Embodied AI +1

3w ago·also NVIDIA

Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation

VLN agents can navigate more accurately in zero-shot settings by "looking forward, now, and backward," mimicking human navigational strategies.

Wanrong Zheng, Yunhao Ge, Laurent Itti

Multimodal Models Robotics & Embodied AI World Models & Planning

Ryan Allen +1·3w ago

Lights Out: A Nighttime UAV Localization Framework Using Thermal Imagery and Semantic 3D Maps

Nighttime UAVs can navigate using only thermal cameras and semantic maps, achieving meter-level accuracy without GPS.

Ryan Allen, Melissa Greeff

Computer Vision Multimodal Models Robotics & Embodied AI

Aditya Ukarande +7·3w ago

Efficient, VRAM-Constrained xLM Inference on Clients

Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.

Aditya Ukarande, Aditya Ukarande, Deep Shekhar +5

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

Saurabh K. Singh +2·3w ago

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

Document AI pipelines don't work the way you think: quality bottlenecks aren't where you expect, and components don't cascade quality.

Saurabh K. Singh, S. Raj, Sachin Raj

Eval Frameworks & Benchmarks Multimodal Models Recommendation & Information Retrieval

Mingze Li +19·3w ago

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

An AI agent autonomously discovered four new superconductors, shrinking the discovery timeline from years to GPU hours.

Mingze Li, Yu Rong, Songyou Li +17

Multimodal Models Scientific Discovery & Drug Design Tool Use & Agents

3w ago·also UChicago, UT Austin

Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

Despite recent advances, sign language translation models still struggle to leverage the full range of linguistic cues, especially non-manual signals like facial expressions.

Serpil Karabüklü, Kanishka Misra, Shester Gueuwou +3

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

3w ago·also HKU, Tsukuba, University of North Texas, Yonsei

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

LLM agents can now remember far more, far more accurately, by "seeing" their past experiences instead of just reading about them.

Jinze Li, Yang Zhang, Jiayi Qu +3

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

NUS·3w ago·also NTU, UNSW

Membership Inference Attacks Against Video Large Language Models

VideoLLMs leak training data: a novel black-box attack recovers membership with surprisingly high accuracy (AUC=0.68) by probing generation brittleness across temperatures.

Wei Song, Yuxin Cao, Ziqi Ding +3

Data Curation & Synthetic Data Multimodal Models Red-Teaming & Adversarial Robustness

3w ago

VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection

Code stylometry, often overlooked, can significantly boost vulnerability detection, improving F1 scores by up to 48% on key benchmarks.

Chidera Biringa, Ajmal Abbas, Vishnu Selvaraj +1

Code Generation & Program Synthesis Multimodal Models

3w ago

VTBench: A Multimodal Framework for Time-Series Classification with Chart-Based Representations

Time-series classification gets a visual upgrade: fusing raw data with intuitive charts like line, bar, and scatter plots can boost accuracy, especially on smaller datasets.

Madhumitha Venkatesan, Xuyang Chen, Dongyu Liu

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Apr 28, 2026

Yafeng Wu +5·3w ago

HotComment: A Benchmark for Evaluating Popularity of Online Comments

Predicting comment popularity is more than just content quality – stylistic resonance with the platform's user base is a key ingredient, and this benchmark helps you measure it.

Yafeng Wu, Yunyao Zhang, Liliang Ye +3

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

Yuxin Zhang +21·3w ago

Step-Audio-R1.5 Technical Report

RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.

Yuxin Zhang, Xiangyu Zhang, Xiangyu Tony Zhang +19

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning+1

Zaid Nasser +8·3w ago·also ITMO

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Semantic SLAM can now understand free-form language queries and ground them in 3D space using only a monocular video feed, opening the door to robots that truly understand and interact with the world around them.

Zaid Nasser, Mikhail Iumanov, Tianhao Li +6

Computer Vision Multimodal Models Robotics & Embodied AI

Tsinghua AI·3w ago

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.

Jiayi Guo, Linqing Wang, Jiangshan Wang +11

Computer Vision Multimodal Models

Meta AI·3w ago·also Brown

IAM: Identity-Aware Human Motion and Shape Joint Generation

Human motion generation gets a dose of reality: IAM shows that explicitly modeling body morphology and identity leads to more realistic and consistent movements.

Wenqi Jia, Wenqi Jia, Zekun Li +14

Computer Vision Multimodal Models Natural Language Processing+1

3w ago·also Nankai University, NJUST, Tongyi Lab

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Skip the bulky bidirectional teacher: this new method trains a fast, causal audio-video generator directly, slashing sampling steps while maintaining top-tier quality.

Yupeng Zhou, Yupeng Zhou, Lianghua Huang +17

Computer Vision Multimodal Models Speech & Audio

3w ago·also Cornell, Stony Brook

GraphPL: Leveraging GNN for Efficient and Robust Modalities Imputation in Patchwork Learning

Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.

Xingjian Hu, Zuoyu Yan, Jianhua Zhu +3

Distributed Systems & Hardware Multimodal Models Training Efficiency & Optimization

Divake Kumar +4·3w ago

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

VLMs can ace the ranking but bomb the scoring, revealing a critical flaw in how we evaluate multimodal systems.

Divake Kumar, Sina Tayebati, Devashri Naik +2

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Hector G. Rodriguez +1·3w ago

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Achieve 3x better coverage on out-of-distribution visual question answering by explicitly scoring the quality of visual evidence, even when using black-box models like Gemini-3-Pro.

Hector G. Rodriguez, Marcus Rohrbach

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

3w ago·also Luxembourg Institute of Science and Technology, TJU

Learning Generalizable Multimodal Representations for Software Vulnerability Detection

Software vulnerability detection gets a serious upgrade: aligning code with developer comments boosts F1 scores by up to 27% compared to traditional code-only methods.

Zeming Dong, Yuejun Guo, Qiang Hu +2

Code Generation & Program Synthesis Multimodal Models Natural Language Processing

3w ago

RADD: Retrieval-Augmented Discrete Diffusion for Multi-Modal Knowledge Graph Completion

Decoupling retrieval and reranking with a discrete diffusion model leaps ahead of monolithic embedding scorers for multi-modal knowledge graph completion.

Guanglin Niu

Multimodal Models Natural Language Processing Recommendation & Information Retrieval

3w ago·also CAS

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

LVLMs hallucinate less when you intervene *before* they start generating, by cleaning up the initial Key-Value cache with modality-aware steering vectors.

Chengsheng Zhang, Chenghao Sun, Xinyan Jiang +1

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Saarland Informatics Campus·3w ago

DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding

Multimodal language models are fluent liars: they produce convincing procedural video captions that are often factually incomplete, with systematic omissions and role-level inconsistencies exposed by video-grounded verification.

Cennet Oguz, Yasser Hamidullah, Simon Ostermann

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Hanqing Yang +8·3w ago

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Decoupling the "Thinker" from the "Editor" in image editing allows targeted optimization of reasoning, leading to performance competitive with strong proprietary models using a fixed generative model.

Hanqing Yang, Qiang Zhou, Yongchao Du +6

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

A. Iyengar +6·3w ago·also Adobe Research

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Current VLMs ace diagram question answering, but DRAGON reveals they often fake it, failing to ground their answers in the actual visual evidence.

A. Iyengar, Tampu Ravi Kumar, Gaurav Najpande +4

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Lanshan He +20·3w ago·also ASU

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

Forget tedious manual workflows: LLMs can now autonomously generate editable, engine-native 3D cutscenes by intelligently orchestrating animation, cinematography, and sound design.

Lanshan He, Haozhou Pang, Haozhou Pang +18

Computer Vision Multimodal Models Tool Use & Agents

Shanghai Academy of AI for Science·3w ago·also Beijing University of Posts

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

Diffusion models can now reason recursively over visual tokens, achieving state-of-the-art image generation performance by dynamically selecting specialized neural modules at each diffusion step.

Yuwei Sun, Yuxuan Yao, Hui Li +1

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Ke Wang +3·3w ago

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

Emotion recognition can be significantly improved by adapting to individual expressive traits, with ML-SAN outperforming static models in capturing nuanced emotional expressions.

Ke Wang, Kexue Wang, Yinfeng Yu +1

Multimodal Models Natural Language Processing Speech & Audio

Kidus Zewde +10·3w ago

GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment

Twitter strips C2PA provenance data from AI-generated images, making it impossible to cryptographically verify their origin on the platform.

Kidus Zewde, Kidus Zewde, Simiao Ren +8

Computer Vision Data Curation & Synthetic Data Multimodal Models

Multimodal Models - Weekly Roundup
