Wenxuan Song

Publication activity (papers/week, last 8 weeks)

Research focus

Multimodal Models (6), Robotics & Embodied AI (4), Eval Frameworks & Benchmarks (3), Reasoning & Chain-of-Thought (1)

Frequent co-authors

Han Zhao (2), Pengxiang Ding (2), Donglin Wang (2), Haoang Li (2)

Papers (6)

Jul 16, 2026

Xiaomi Robotics Team Jun Guo +33 · 1w ago · also Tsinghua AI, CAS

Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

Achieving a 57.6% success rate on RoboCasa365, Xiaomi-Robotics-1 sets a new standard for vision-language-action models in real-world robotic manipulation.

Xiaomi Robotics Team Jun Guo, Piaopiao Jin, Jason Li +31

Multimodal Models, Robotics & Embodied AI

Jun 22, 2026

Tsinghua AI · Jun 22, 2026 · also sen University, Shanghai Qi Zhi Institute, UMich

PIVOTSBench: Evaluating Fine-Grained Interpersonal Relationship Reasoning in Multimodal Large Language Models

MLLMs falter in fine-grained interpersonal reasoning, but integrating visual cues and social roles can dramatically boost their performance.

Shuxiang Zhang, Yiting Yin, Wenxuan Song

Eval Frameworks & Benchmarks, Multimodal Models, Reasoning & Chain-of-Thought

May 26, 2026

Tsinghua AI · May 26, 2026

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

MLLMs struggle to juggle proactive tasks and reactive queries in dynamic video streams, but a simple agentic framework can significantly improve their coordination without any training.

Jinzhao Li, Yinuo Chen, Wenxuan Song +3

Eval Frameworks & Benchmarks, Multimodal Models, Tool Use & Agents

Mar 26, 2026

Tsinghua AI · Mar 26, 2026 · also ZJU

MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Ditch the clunky architectures: a single diffusion model can now handle vision, language, and robot control to achieve SOTA manipulation performance.

Yang Liu, Pengxiang Ding, Teng-Long Jiang +10

Computer Vision, Multimodal Models, Robotics & Embodied AI

Feb 26, 2026

Feb 26, 2026 · also Galbot, TU Munich, Xidian

Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

A practical VLA model, LLaVA-VLA, achieves strong generalization and versatility on a new benchmark, CEBench, while running on consumer-grade GPUs, eliminating the need for costly pre-training.

Wenxuan Song, Jiayi Chen, Xiaoquan Sun +11

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Feb 19, 2026

Tsinghua AI · Feb 19, 2026 · also NTU, PKU, The Fin AI, TU Munich +2

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

By aligning latent representations with multiple visual foundation models, FRAPPE offers a more scalable and data-efficient way to imbue generalist robotic policies with robust world-awareness.

Han Zhao, Jingbo Wang, Wenxuan Song +9

Multimodal Models, Robotics & Embodied AI, World Models & Planning
