Tsinghua AI

Multimodal Models

81 papers from Tsinghua AI on Multimodal Models

May 6, 2026

Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.

Xiaopei Zhu, Guanning Zeng, Zhanhao Hu +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

May 1, 2026

Stanford HAI·3w ago·also Tsinghua AI, Beihang, CUHK, HKUST +1

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.

Houyuan Chen, Hong Li, Xianghao Kong +8

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 30, 2026

Tsinghua AI·3w ago

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.

Xupeng Chen, Binbin Shi, Chenqian Le +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·3w ago·also MiniCPM-o Team, Tencent AI

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real-time, all on a device with only 12GB of RAM.

Junbo Cui, Bokai Xu, Chongyi Wang +36

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Open-Source Models & Weights

Tsinghua AI·3w ago·also BUPT

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.

Junpeng Ding, Zichen Tang +21

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·3w ago

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.

Sudong Wang, Weiquan Huang, Xiaomin Yu +10

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI +1

Tsinghua AI·3w ago·also Telecom

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.

Yang Zhang, Jiangyuan Zhao, Chenyou Fan +11

Multimodal Models Robotics & Embodied AI World Models & Planning

Apr 29, 2026

Tsinghua AI·3w ago·also Fudan

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.

GLM-V Team, Wenyi Hong +88

Computer Vision Multimodal Models Tool Use & Agents

Tsinghua AI·3w ago·also Xiaomi Robotics

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.

Jun Guo, Qiwei Li, Peiyan Li +8

Computer Vision Multimodal Models Robotics & Embodied AI +1

Apr 28, 2026

Tsinghua AI·3w ago

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.

Jiayi Guo, Linqing Wang, Jiangshan Wang +11

Computer Vision Multimodal Models

Tsinghua AI·3w ago·also Edinburgh, UBC

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.

Yi Yang, Yijing Cui, Alla Sheffer +1

Computer Vision Multimodal Models Robotics & Embodied AI

3w ago·also Tsinghua AI, Huawei

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.

Minghang Zheng, Zihao Yin, Yi Yang +3

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Apr 23, 2026

Tsinghua AI·Apr 23, 2026

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.

Chentao Li, Zirui Gao, Mingze Gao +3

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Apr 23, 2026·also Tsinghua AI, Westlake

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.

Zeyu Cai, Yuliang Xiu, Renke Wang +8

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 23, 2026·also Tsinghua AI, Hengqin Laboratory, Sheffield

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.

Jingkun Chen, Ru Xu, Mingqi Gao +2

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 22, 2026

Tsinghua AI·Apr 22, 2026·also Imperial, State Key Laboratory of CAD & CG, ZJU

Exploring Spatial Intelligence from a Generative Perspective

Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.

Muzhi Zhu, Shunyao Jiang, Huanyi Zheng +11

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·Apr 22, 2026

From Scene to Object: Text-Guided Dual-Gaze Prediction

LLMs can now predict where drivers look with uncanny human-like accuracy, thanks to a new dataset and architecture that grounds attention in objects, not just scenes.

Zehong Ke, Yanbo Jiang, Jinhao Li +4

Computer Vision Multimodal Models Natural Language Processing

School of Computer Science and Software Engineering·Apr 22, 2026·also Tsinghua AI, University of Nottingham, Wenzhou Medical University

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

MLLMs still struggle to integrate diverse data for clinical reasoning, as evidenced by their poor performance on a new ophthalmology benchmark spanning image quality assessment to diagnosis.

Gui Wang, Zehao Zhong, YongSong Zhou +6

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 22, 2026·also NUS, Tsinghua AI, CAS +2

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Pocket-sized VLA models can now achieve state-of-the-art robot manipulation performance by pre-training on a curated multimodal dataset and injecting manipulation-relevant representations into the action space.

Yupeng Zheng, Songen Gu, Yuhang Zheng +10

Multimodal Models Robotics & Embodied AI

Apr 21, 2026

Apr 21, 2026·also Tsinghua AI, Sen University

Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

Stop fragmented land cover predictions: SSDM leverages global geospatial embeddings to guide local feature extraction, achieving state-of-the-art performance in high-resolution remote sensing mapping.

Jienan Lyu, Miao Yang, Jinchen Cai +4

Computer Vision Multimodal Models

Tsinghua AI·Apr 21, 2026

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Freezing a Stable Diffusion backbone and injecting CLIP and BLIP features lets you beat the state-of-the-art in zero-shot sketch-based 3D shape retrieval, without any costly retraining.

Hang Cheng, Fanhe Dong +1

Computer Vision Multimodal Models Recommendation & Information Retrieval

Apr 20, 2026

Tsinghua AI·Apr 20, 2026·also HIT

Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval

MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.

Hang Cheng, Muyan He, Mingyu Fan +3

Computer Vision Multimodal Models Recommendation & Information Retrieval

Tsinghua AI·Apr 20, 2026·also PKU

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.

Haiweng Xu, Sipeng Zheng, Hao Luo +4

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Apr 20, 2026·also Tsinghua AI, PKU

Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models

VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.

Zehua Zang, Fuchun Sun, Xiao Xu +3

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 18, 2026

Apr 18, 2026·also Tsinghua AI, Baidu, SJTU, TJU

MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

Targeted neuron fine-tuning can unlock superior image translation capabilities in multimodal large language models, outperforming traditional methods by preserving pre-trained knowledge.

Ningyuan Deng, Tianyu Dong, Shaobo Wang +2

Computer Vision Multimodal Models

Apr 17, 2026

Tsinghua AI·Apr 17, 2026·also Beihang, HKU, PKU, Tencent AI

Repurposing 3D Generative Model for Autoregressive Layout Generation

Autoregressive 3D layout generation can be both more physically plausible and significantly faster by repurposing existing 3D generative models.

Haoran Feng, Yifan Niu, Zehuan Huang +3

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 16, 2026

Tsinghua AI·Apr 16, 2026

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

Forget relying on fickle visuals: this new ReID method uses language to describe *who* a person is, not just what they look like, and it crushes existing benchmarks.

Jiaxuan Li, Xin Wen, Zhihang Li

Computer Vision Multimodal Models Recommendation & Information Retrieval

Apr 15, 2026

Tsinghua AI·Apr 15, 2026·also DUT, Tencent AI

Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

MLLMs still struggle to reason about everyday situations when they require identifying and using visual clues, despite excelling at tasks relying on pre-existing knowledge.

Xiaomin Li, Tala Wang, Zichen Zhong +6

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 15, 2026·also Tsinghua AI, State Key Laboratory of Complex &

MAny: Merge Anything for Multimodal Continual Instruction Tuning

MLLMs don't just forget language, they also suffer from perceptual drift in cross-modal spaces, but MAny offers a training-free merging strategy to fix both.

Zijian Gao, Wangwang Jia, Xingxing Zhang +5

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Training Efficiency & Optimization

Tsinghua AI·Apr 15, 2026·also Kuaishou

DiffMagicFace: Identity Consistent Facial Editing of Real Videos

Achieve photorealistic, identity-consistent facial video edits from text prompts without video training data, rivaling traditional rendering software.

Huanghao Yin, Shenkun Xu, Kanle Shi +1

Computer Vision Multimodal Models

Tsinghua AI·Apr 15, 2026·also AI Laboratory, Tencent AI, USTC

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Imagine creating high-fidelity, navigable 3D worlds from just a text prompt or a single image – HY-World 2.0 makes it a reality.

Team HY-World, Chenjie Cao, Xuhui Zuo +42

Computer Vision Multimodal Models World Models & Planning

Apr 14, 2026

Apr 14, 2026·also Tsinghua AI, CAU, Northeastern, Southwest Jiaotong University

GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.

Zhiwei Zhang, Xingyuan Zeng, Xinkai Kong +6

Computer Vision Data Curation & Synthetic Data Multimodal Models

Apr 13, 2026

Apr 13, 2026·also Tsinghua AI, ZJU

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

By explicitly modeling both consensus and discrepancy between RGB and IR data, this text-guided multispectral object detector significantly boosts performance on multispectral benchmarks.

Zhen Wang, Enhao Huang, Kangqing Shen +1

Computer Vision Multimodal Models Natural Language Processing

Tsinghua AI·Apr 13, 2026·also Guangming Laboratory, NJU, PolyU

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Finally, a model that speaks fluent Lottie: LottieGPT generates editable vector animations directly from text or images, opening up a new frontier for resolution-independent, compact, and semantically structured multimedia creation.

Junhao Chen, Kejun Gao, Yuehan Cui +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Tsinghua AI·Apr 13, 2026·also China University of Petroleum (Beijing), Heavy Oil Processing, Key Laboratory, School of Software

Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

Achieve state-of-the-art object detection accuracy and efficiency by fusing RGB frames and event streams with a sparse hypergraph and a fine-grained mixture of experts, enabling real-time edge deployment.

Wei Bao, Yuehan Wang, Tianhang Zhou +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Multimodal Models

Apr 10, 2026

Tsinghua AI·Apr 10, 2026

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Achieve real-time (40 FPS at 720p) interactive video generation with minute-long memory consistency using a 5B parameter world model.

Zile Wang, Zexiang Liu, Jaixing Li +17

Computer Vision Multimodal Models World Models & Planning

Apr 9, 2026

Tsinghua AI·Apr 9, 2026·also GigaAI

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Robots can now better assemble boxes in the real world thanks to a video-generative value model that anticipates future states, moving beyond static snapshots for more reliable task progress assessment.

Jindi Lv, Hao Li, Jie Li +10

Multimodal Models Robotics & Embodied AI World Models & Planning

Tsinghua AI·Apr 9, 2026·also BUPT, School of Information Science and Technology

Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Medical MLLMs, despite their size and training data, stumble on basic image classification due to four key failure modes, revealing a disconnect between hype and clinical readiness.

Xun Zhu, Fanbin Mo, Kaili Zheng +6

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Tsinghua AI·Apr 9, 2026

EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Turns out, you can cut critical errors in VLM-generated image editing instructions in half with a clever two-stage training pipeline, leading to SOTA editing performance.

Xiangyuan Wang, Honghao Cai, Yunhao Bai +6

Computer Vision Data Curation & Synthetic Data Multimodal Models

Apr 9, 2026·also Tsinghua AI, PKU

GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair

LLMs can now leverage visual structure, not just text, to pinpoint bugs in multimodal programs, thanks to a novel graph alignment approach that bridges the gap between GUI screenshots and code.

Zhuoyao Liu, Z. Zeng, Zhengran Zeng +3

Code Generation & Program Synthesis Computer Vision Multimodal Models

Tsinghua AI·Apr 9, 2026·also SDU

WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

World models are more valuable for synthesizing structured supervision for navigation learning than for directly providing action-ready imagined evidence.

Hongjin Chen, Shan Jiang, Shangyun Jiang +5

Multimodal Models Robotics & Embodied AI World Models & Planning

Apr 8, 2026

Tsinghua AI·Apr 8, 2026·also Arizona, HFUT

Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Forget fixed pipelines: training an agent to *learn* when and how to search for knowledge dramatically improves performance on knowledge-based visual question answering.

Zhuohong Chen, Zhenxian Wu, Yunyao Yu +6

Multimodal Models Recommendation & Information Retrieval Tool Use & Agents

Beijing Univ. Posts & Telecommun.·Apr 8, 2026·also Tsinghua AI, UNC

LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

Current multimodal LLMs struggle with guideline-constrained clinical reasoning, but a simple multi-agent framework can significantly boost their performance on real-world lung cancer diagnosis and treatment.

Fangyu Hao, Jiayu Yang +16

Eval Frameworks & Benchmarks Multimodal Models Scientific Discovery & Drug Design

Apr 8, 2026·also DAMO, Tsinghua AI, CAS

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.

Yifan Ji, Yukun Yan, Shuo Wang +2

Computer Vision Multimodal Models Recommendation & Information Retrieval

Apr 7, 2026

Apr 7, 2026·also Tsinghua AI, Aarhus University

QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

Existing multimodal sentiment analysis models crumble under real-world noise, but QA-MoE leverages uncertainty to dynamically route inputs, achieving robust performance across a continuous spectrum of data quality.

Yitong Zhu, Yuxuan Jiang, Guanxuan Jiang +3

Architecture Design (Transformers, SSMs, MoE) Multimodal Models Natural Language Processing

Tsinghua AI·Apr 7, 2026·also HKUST, SYSU

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

VLA models, seemingly robust, crumble when faced with diverse linguistic variations, as a new red-teaming approach reveals a staggering drop in task success from 93% to just 6%.

Baoshun Tong, Haoran He, Ling Pan +2

Multimodal Models Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Tsinghua AI·Apr 7, 2026·also CAS, College of Computer and Data Science

Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.

Hongsheng Li, Zexian Yang, Rong Yin

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 6, 2026

Tsinghua AI·Apr 6, 2026·also Beijing Sport University

BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing

Current multimodal models can't handle the rapid-fire tactical analysis required for boxing commentary, as revealed by a new dataset and evaluation framework.

Kaiwen Wang, Rongrong Deng, Yiming Shi +2

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

Mar 31, 2026

Tsinghua AI·Mar 31, 2026·also ByteDance, Rice

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

Current multimodal dialogue systems can't capture the subtle expressiveness of human interaction, as revealed by a new benchmark dataset of movie and TV dialogues.

Zeyu Jin, Songtao Zhou, Ming Tian +4

Multimodal Models Natural Language Processing Speech & Audio

Mar 19, 2026

Mar 19, 2026·also Tsinghua AI, HKU

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.

Xinyao Zhang, Wenkai Dong, Yuxin Song +11

Computer Vision Multimodal Models Natural Language Processing