Tsinghua AI

58 papers from Tsinghua AI on Robotics & Embodied AI

May 5, 2026

SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

Quadrupedal robots can now perform dynamic loco-manipulation in the real world, matching human teleoperation, using only onboard ego-centric vision and a low-frequency (5Hz) open-vocabulary detector.

Shiyi Chen, Haiyi Liu, Ming Yang +2

Computer Vision Robotics & Embodied AI World Models & Planning

Apr 30, 2026

Tsinghua AI·3w ago·also Northeastern, State Key Laboratory of General

Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents

Embodied agents can now exhibit coherent, long-horizon, self-directed behavior by reasoning about abstract value trade-offs, a capability previously absent in instruction-following or needs-driven approaches.

Chunhui Zhang, Yuxuan Wang, Aoyang Qin +5

Constitutional AI & AI Ethics Robotics & Embodied AI Tool Use & Agents

Tsinghua AI·3w ago

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.

Sudong Wang, Weiquan Huang, Xiaomin Yu +10

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI +1

Tsinghua AI·3w ago·also Telecom

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

By pretraining a VLA model with goal-conditioned RL, PRTS learns to reason about goal reachability, leading to substantial gains in long-horizon robotic tasks and zero-shot generalization.

Yang Zhang, Jiangyuan Zhao, Chenyou Fan +11

Multimodal Models Robotics & Embodied AI World Models & Planning

Apr 29, 2026

Tsinghua AI·3w ago·also Xiaomi Robotics

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Achieve real-time robotic action with 79-91% success while generating high-fidelity 4D reconstructions, all within a single unified world model.

Jun Guo, Qiwei Li, Peiyan Li +8

Computer Vision Multimodal Models Robotics & Embodied AI +1

Tsinghua AI·3w ago·also CAS, Fudan, HFUT, Pengcheng Laboratory +2

Walk With Me: Long-Horizon Social Navigation for Human-Centric Outdoor Assistance

Robots can now navigate complex outdoor environments using only high-level human instructions and readily available GPS/map data, bypassing the need for expensive HD maps or limited short-horizon policies.

Lingfeng Zhang, Xiaoshuai Hao, Xizhou Bu +10

Natural Language Processing Robotics & Embodied AI

Apr 28, 2026

Tsinghua AI·3w ago·also Edinburgh, UBC

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.

Yi Yang, Yijing Cui, Alla Sheffer +1

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 27, 2026

3w ago·also Tsinghua AI, The Key Laboratory of Road and Traffic Engineering, UCF

Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

LLMs can now generate driving rules from traffic laws with significantly improved accuracy by grounding their reasoning in structured traffic scenarios.

Bowen Jian, Rongjie Yu, Hong Wang +2

Constitutional AI & AI Ethics Natural Language Processing Robotics & Embodied AI

Apr 23, 2026

Tsinghua AI·Apr 23, 2026

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

Autonomous vehicles can now plan trajectories 10x faster without sacrificing performance, thanks to a novel architecture that learns complex driving behaviors in latent space during training.

Yining Xing, Zehong Ke, Yiqian Tu +3

Architecture Design (Transformers, SSMs, MoE) Robotics & Embodied AI World Models & Planning

Tsinghua AI·Apr 23, 2026

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.

Chentao Li, Zirui Gao, Mingze Gao +3

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Apr 23, 2026·also Tsinghua AI, Westlake

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.

Zeyu Cai, Yuliang Xiu, Renke Wang +8

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 23, 2026·also Tsinghua AI, Hengqin Laboratory, Sheffield

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.

Jingkun Chen, Ru Xu, Mingqi Gao +2

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 22, 2026

Apr 22, 2026·also Tsinghua AI, Fudan, Hamburg, Hubei University of Chinese Medicine +1

ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement

Achieve superhuman dexterity: ALAS unlocks robust long-horizon task completion by decoupling environment understanding from motor control, enabling generalization across diverse human-scene interaction scenarios.

Yutong Shen, Hangxu Liu, Lei Zhang +4

Robotics & Embodied AI World Models & Planning

Apr 22, 2026·also NUS, Tsinghua AI, CAS +2

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Pocket-sized VLA models can now achieve state-of-the-art robot manipulation performance by pre-training on a curated multimodal dataset and injecting manipulation-relevant representations into the action space.

Yupeng Zheng, Songen Gu, Yuhang Zheng +10

Multimodal Models Robotics & Embodied AI

Apr 20, 2026

Tsinghua AI·Apr 20, 2026·also PKU

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.

Haiweng Xu, Sipeng Zheng, Haoming Luo +4

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Tsinghua AI·Apr 20, 2026

Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist

A custom-designed tendon-driven wrist, combined with a particle-spring model, enables precise and robust control of highly flexible objects like spinning handkerchiefs.

Lei Liu, Haonan Zhang, Huahang Xu +9

Architecture Design (Transformers, SSMs, MoE) Robotics & Embodied AI Training Efficiency & Optimization

Tsinghua AI·Apr 20, 2026·also CHD, School of Traffic & Transportation Engineering, State Key Laboratory of Intelligent Green Vehicle and Mobility

Driving risk emerges from the required two-dimensional joint evasive acceleration

Time-to-collision metrics miss critical collision risk information, but a new 2D acceleration-based metric anticipates collisions far better.

Hao Cheng, Yanbo Jiang, Rui Zhou +8

Robotics & Embodied AI

Apr 20, 2026·also Tsinghua AI, PKU

Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models

VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.

Zehua Zang, Fuchun Sun, Xiao Xu +3

Computer Vision Multimodal Models Robotics & Embodied AI

Apr 15, 2026

Tsinghua AI·Apr 15, 2026·also HUST

EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

Imagine automating the tedious engineering tasks in embodied AI development with a conversational agent, freeing researchers to focus on core algorithmic innovation.

Xueyang Zhou, Yihang Sun, Yihan Sun +6

Code Generation & Program Synthesis Robotics & Embodied AI Tool Use & Agents

Apr 13, 2026

Tsinghua AI·Apr 13, 2026·also HKU

Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

Achieve superior 3D scene reconstruction from aerial images with significantly reduced transmission overhead by directly optimizing communication for rendering quality.

Zeyi Ren, Jialin Dong, Wei Zuo +4

Computer Vision Robotics & Embodied AI Training Efficiency & Optimization

Apr 12, 2026

Tsinghua AI·Apr 12, 2026·also BAIR, Fudan, Shanghai Qi Zhi Institute

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.

Kaizhe Hu, Yingqian Huang, Yuanchen Ju +2

Computer Vision Data Curation & Synthetic Data Robotics & Embodied AI

Apr 10, 2026

Apr 10, 2026·also Tsinghua AI, Futian Laboratory

TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

Robots can now focus on the *right* body parts for interaction, thanks to a new vision-language model that understands human motion commands and precisely localizes task-relevant 3D keypoints.

Yonggen Ling, Yiyang Lin, Yuji Wang

Computer Vision Robotics & Embodied AI

Apr 9, 2026

Tsinghua AI·Apr 9, 2026·also GigaAI

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Robots can now better assemble boxes in the real world thanks to a video-generative value model that anticipates future states, moving beyond static snapshots for more reliable task progress assessment.

Jindi Lv, Hao Li, Jie Li +10

Multimodal Models Robotics & Embodied AI World Models & Planning

Tsinghua AI·Apr 9, 2026·also SDU

WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

World models are more valuable for synthesizing structured supervision for navigation learning than for directly providing action-ready imagined evidence.

Hongjin Chen, Shan Jiang, Shangyun Jiang +5

Multimodal Models Robotics & Embodied AI World Models & Planning

Tsinghua AI·Apr 9, 2026

SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

Ditch the slow per-scene optimization: SurfelSplat reconstructs surfaces from sparse views in under a second, matching state-of-the-art accuracy with a 100x speedup.

Chensheng Dai, Shengjun Zhang, Yueqi Duan

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI +1

Apr 8, 2026

Tsinghua AI·Apr 8, 2026·also Beihang, Central South University, College of Information and Control Engineering, First Aircraft Institute of Aviation +1

Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization

Hierarchical RL can tame the curse of dimensionality in fleet management, enabling superior maintenance and logistics decisions compared to monolithic approaches.

Yong Si, Mingfei Lu, Yang Hu +3

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Tsinghua AI·Apr 8, 2026·also Chongqing Changan Automobile Co.

Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Synthesizing novel views from extrapolated poses no longer requires dense supervision, thanks to a geometry-conditioned diffusion model that explicitly learns to handle out-of-trajectory artifacts.

Yatong Lan, Rongkui Tang, Lei He

Computer Vision Robotics & Embodied AI World Models & Planning

Apr 8, 2026·also Tsinghua AI, PKU

BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes

Generating coordinated bimanual grasps on diverse objects is now possible thanks to a dataset of nearly 10 million grasps and a model that adapts to object geometry and size.

Mu Lin, Yi-Lin Wei, Jiaxuan Chen +6

Data Curation & Synthetic Data Robotics & Embodied AI

Tsinghua AI·Apr 8, 2026

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection

Legged robots can now recover from sensor noise and crazy user commands with 10x greater reliability, thanks to a new method that respects the robot's competence boundaries.

Ziyang Cheng, Haoyu Wei, Hang Yin +2

Robotics & Embodied AI

Shenzhen Institute of Artificial·Apr 8, 2026·also Tsinghua AI, British University in Egypt (BUE), National Institute of Clean-and-Low-Carbon, National Research Centre (NRC) +1

Infrastructure First: Enabling Embodied AI for Science in the Global South

Overcoming infrastructure limitations, not algorithmic capability, is the key to unlocking the potential of Embodied AI for Science in the Global South.

Shaoshan Liu, Marwa S. Hassan, Mohamed H. Sharkawy +6

Distributed Systems & Hardware Robotics & Embodied AI Scientific Discovery & Drug Design

Apr 7, 2026

Tsinghua AI·Apr 7, 2026·also HKUST, SYSU

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

VLA models, seemingly robust, crumble when faced with diverse linguistic variations, as a new red-teaming approach reveals a staggering drop in task success from 93% to just 6%.

Baoshun Tong, Haoran He, Ling Pan +2

Multimodal Models Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Tsinghua AI·Apr 7, 2026·also CAS, College of Computer and Data Science

Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.

Hongsheng Li, Zexian Yang, Rong Yin

Computer Vision Multimodal Models Robotics & Embodied AI

University of Sannio·Apr 7, 2026·also Tsinghua AI, School of Computer Science and Technology, Veermata Jijabai Technological Institute

A Novel PID Design Method via Model-Based Reinforcement Learning Algorithms

Unlock the power of RL for PID control: this method automatically translates complex RL policies into simple, robust PID gains, offering a plug-and-play upgrade for existing automation systems.

Hozefa Jesawada, A. Yerudkar, Yang Liu +2

RLHF & Preference Learning Robotics & Embodied AI Training Efficiency & Optimization +1

Apr 6, 2026

Tsinghua AI·Apr 6, 2026

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.

Zhongru Zhang, Cheng‐Chuan Yang, Chenghan Yang +4

Computer Vision Robotics & Embodied AI World Models & Planning

Tsinghua AI·Apr 6, 2026·also Key Laboratory of Marine Robotics

WaterSplat-SLAM: Photorealistic Monocular SLAM in Underwater Environment

Finally, underwater SLAM can produce photorealistic maps thanks to a novel medium-aware Gaussian map representation.

Kangxu Wang, Shaofeng Zou, Chenxing Jiang +4

Computer Vision Robotics & Embodied AI

Mar 19, 2026

Tsinghua AI·Mar 19, 2026·also NVIDIA

PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors

Humanoid robots can now traverse complex terrains with human-like gaits, thanks to a surprisingly simple and efficient framework that eschews adversarial training.

Chenxi Han, Shilu He, Yixiao Cheng +2

Robotics & Embodied AI Training Efficiency & Optimization World Models & Planning

Mar 18, 2026

Mar 18, 2026·also Tsinghua AI, INRIA

DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.

Haocheng Yuan, Adrien Bousseau, Lei Zhong +1

Computer Vision Robotics & Embodied AI

Mar 17, 2026

Tsinghua AI·Mar 17, 2026·also HIT, HKU

$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.

Ruishan Guo, Ciyu Ruan, Haoyang Wang +2

Computer Vision Multimodal Models Robotics & Embodied AI

Mar 15, 2026

Tsinghua AI·Mar 15, 2026·also CUHK, Shanghai AI Lab, Shanghai Qi Zhi Institute, USTC

One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation

Forget training separate policies for every robot hand – this method learns one policy to control them all, slashing data needs and boosting performance by 50% in cross-embodiment manipulation.

Juncheng Mu, Sizhe Yang, Hojin Bae +1

Robotics & Embodied AI Training Efficiency & Optimization

Mar 12, 2026

Department of Computer Science and Technology·Mar 12, 2026·also Tsinghua AI, BUPT, College of AI, HKU

HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI

Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.

Chengwen Zhang, Chun Yu, Borong Zhuang +9

Computer Vision Multimodal Models Robotics & Embodied AI

Mila·Mar 12, 2026·also Tsinghua AI, AgiBot, CUHK, McGill

MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Current embodied AI agents falter when faced with the multi-floor complexity of MANSION, a new language-driven framework for generating realistic, building-scale 3D environments.

Lirong Che, Shuo Wen, Shan Huang +4

Computer Vision Multimodal Models Robotics & Embodied AI

Mar 9, 2026

Tsinghua AI·Mar 9, 2026·also UChicago

Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

RAMBO's instability got you down? ROMI offers a robust, value-aware model learning approach with implicitly differentiable adaptive weighting that outperforms RAMBO and other SOTA methods in offline RL benchmarks.

Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu +3

Robotics & Embodied AI World Models & Planning

Mar 5, 2026

Tsinghua AI·Mar 5, 2026·also ByteDance

MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.

Juntong Fang, Zequn Chen, Weiqi Zhang +6

Architecture Design (Transformers, SSMs, MoE) Computer Vision Robotics & Embodied AI

Mar 4, 2026

CMU ML·Mar 4, 2026·also BAIR, MIT CSAIL, NVIDIA, Tsinghua AI +11

ManipulationNet: An Infrastructure for Benchmarking Real-World Robot Manipulation with Physical Skill Challenges and Embodied Multimodal Reasoning

Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.

Kenny Kimble, Kenneth Kimble, Edward H. Adelson +23

Eval Frameworks & Benchmarks Multimodal Models Robotics & Embodied AI

Mar 3, 2026

Tsinghua AI·Mar 3, 2026·also CMU ML, Microsoft Research

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Diffusion planners get a boost in robustness and performance thanks to SAGE, a self-supervised method that weeds out dynamically inconsistent plans using a learned latent consistency signal.

Dongqi Han, Yansen Wang, Dongsheng Li

Robotics & Embodied AI World Models & Planning

Mar 2, 2026

Tsinghua AI·Mar 2, 2026

Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment

Achieve state-of-the-art monocular re-localization in OpenStreetMap by cleverly aligning image semantics with map data, enabling faster and more accurate localization than dense matching approaches.

Yuchen Zou, Dexing Zhong

Computer Vision Multimodal Models Robotics & Embodied AI

Tsinghua AI·Mar 2, 2026

OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution

Achieve real-time, drift-free online 3D reconstruction by decoupling memory into actively refreshed local geometry and a stable, persistent global structure.

Yule Wang, Yize Pang

Computer Vision Multimodal Models Robotics & Embodied AI

Tsinghua AI·Mar 2, 2026·also CUHK, Galbot

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

Achieve more realistic and physically plausible scene reconstructions from video by explicitly optimizing viewpoints for object generation and synthesizing scene graphs within a 3D simulator.

Chong Xia, Kai Zhu +6

Computer Vision Robotics & Embodied AI World Models & Planning

Mar 1, 2026

Lanzhou University·Mar 1, 2026·also Tsinghua AI, State Key Laboratory of Intelligent Green Vehicle and Mobility

DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving

LLMs can now handle autonomous driving tasks with greater precision and efficiency thanks to DriveCode, which replaces discrete number tokens with continuous embeddings.

Zhiye Wang, Yanbo Jiang, Fang Zhang

Architecture Design (Transformers, SSMs, MoE) Robotics & Embodied AI Training Efficiency & Optimization

Feb 26, 2026

DAMO·Feb 26, 2026·also Tsinghua AI, USTC

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Achieve both long-term scene consistency and precise camera control in world models with UCM, a novel framework sidestepping explicit 3D reconstruction.

Tianxing Xu, Zixuan Wang, Guangyuan Wang +5

Computer Vision Robotics & Embodied AI World Models & Planning
