Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.
NPM malware detection tools often fail because they struggle to distinguish between innocuous code behavior and malicious intent, a problem addressable by analyzing behavioral chains.
Stop burying your agent harness logic in code: NLAHs let you express it in natural language, making it portable, editable, and analyzable.
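A toy sketch of the contrast (the harness policy text and prompt wiring below are invented for illustration, not the paper's actual interface): the harness logic lives in a plain-language string rather than in control flow, so it can be edited, diffed, and analyzed without code changes.

```python
# NLAH-style sketch: the harness policy is data, not control flow.
# The policy text and prompt format are hypothetical examples.
HARNESS_POLICY = """
On each turn: call the search tool first; retry a failed tool call once;
after three turns without progress, ask the user for clarification.
"""

def build_prompt(task: str, history: list[str]) -> str:
    """Prepend the natural-language harness so the model enforces the policy."""
    return f"{HARNESS_POLICY}\nTask: {task}\nHistory: {history}\nNext action:"

print(build_prompt("find the cheapest flight", []))
```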
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
LLMs can exhibit surprising "strategic realism" when analyzing an ongoing geopolitical conflict, but their reasoning falters in politically ambiguous situations, revealing critical domain-specific limitations.
LLMs struggle to effectively use private library APIs even when provided with the correct documentation, but PriCoder can boost their performance by over 20% through targeted training data synthesis.
MLLMs still can't handle time-sensitive multimodal reasoning, often failing to integrate auditory and visual cues effectively in dynamic environments like a 4D escape room.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
LLMs struggle with low-resource general-purpose programming languages, and surprisingly, translating code *to* a low-resource language is harder than generating it from text.
Scaling up LLMs boosts combinatorial creativity in code generation, but plateaus on exploratory tasks, revealing a "convergence-by-scaling" effect where larger models become less divergent.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
Fisheye cameras can now see the world in 4D, thanks to a new benchmark and method that tackle the unique distortions of spherical projection for improved occupancy tracking.
Current language agents are still far from matching human expert performance when faced with real-world professional tasks requiring complex reasoning, authoritative source retrieval, and domain-specific knowledge, as revealed by the new $OneMillion-Bench benchmark.
LLMs can automate and improve thematic analysis of qualitative data, achieving expert-level alignment in clinical domains through iterative codebook refinement.
Current LLM safety measures are critically vulnerable to attacks grounded in Thai cultural nuances, as demonstrated by a new benchmark showing higher attack success rates compared to general Thai-language attacks.
Interpolating latent representations before decoding yields a reconstruction FID (iFID) that finally aligns with the generation FID of latent diffusion models, achieving ~0.85 correlation where standard rFID fails.
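A minimal sketch of the iFID recipe under stated assumptions (a hypothetical `decode` callable standing in for the latent decoder, and torchmetrics for the FID computation; the paper's exact pipeline may differ): interpolate random latent pairs, decode them, and score the results against real images.

```python
# Sketch of iFID: FID over decoded *interpolations* of latents, not
# reconstructions. `decode` is a placeholder for a pretrained latent decoder.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def ifid(latents, real_images, decode, alpha: float = 0.5) -> float:
    """Interpolate random latent pairs, decode, and score against real images."""
    perm = torch.randperm(latents.size(0))
    mixed = alpha * latents + (1 - alpha) * latents[perm]  # linear interpolation
    decoded = decode(mixed)  # expected shape (N, 3, H, W), uint8
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)   # real_images: (N, 3, H, W), uint8
    fid.update(decoded, real=False)
    return fid.compute().item()
```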
Current judge models for instruction-following are surprisingly unreliable, but a new benchmark exposes their flaws and offers a path to better alignment.
LLMs can synthesize verifiable discrete-event world models from natural language, bridging the gap between hand-engineered simulators and unconstrained neural models.
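For context, a discrete-event world model is essentially an executable event-queue simulator, which is what makes the synthesized output checkable. A minimal hand-written skeleton of that target artifact (event types and handlers invented for illustration):

```python
# Minimal discrete-event simulator skeleton: a clock plus a priority queue
# of (time, seq, handler) events. The example events are illustrative only.
import heapq
from typing import Callable

class Simulator:
    def __init__(self) -> None:
        self.clock = 0.0
        self.queue: list[tuple[float, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so simultaneous events stay ordered

    def schedule(self, delay: float, handler: Callable[[], None]) -> None:
        heapq.heappush(self.queue, (self.clock + delay, self._seq, handler))
        self._seq += 1

    def run(self) -> None:
        while self.queue:
            self.clock, _, handler = heapq.heappop(self.queue)
            handler()

sim = Simulator()
sim.schedule(2.0, lambda: print(f"t={sim.clock}: job arrives"))
sim.schedule(5.0, lambda: print(f"t={sim.clock}: job completes"))
sim.run()
```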
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
MiroFlow leapfrogs existing LLM agent frameworks with its agent graph architecture, delivering state-of-the-art performance and robust execution across a diverse range of benchmarks.
Even the best vision-language models struggle to diagnose brain tumors from MRI scans, but a new dataset and benchmark reveal a path to significant accuracy gains through instruction tuning.
Uncovered: mental health chatbots can fall into dangerous "validation spirals" or "empathy fatigue" patterns, revealing critical relational safety flaws missed by current single-turn evaluations.
VLMs can get a +39% boost in downstream reasoning by using translator-guided reinforcement learning to improve geometric perception, a far better result than standard supervised fine-tuning.
AI agents can now learn durable skills instead of constantly "reinventing the wheel," thanks to SkillNet's infrastructure for creating, evaluating, and connecting AI skills at scale.
LLMs scrub away up to 20% of culturally specific language, even while preserving the core meaning, revealing a "Semantic Preservation Paradox" that threatens linguistic diversity.
Current video benchmarks are too simple; UniVBench offers the first unified framework to measure the integrated capabilities of video foundation models using complex, multi-shot videos and a standardized evaluation system.
LLM agent frameworks are riddled with bugs stemming from API misuse and documentation issues, leading to crashes and functional errors that current agent-level evaluations miss.
Achieve zero package hallucinations from LLMs in dependency recommendation by monitoring the decoding process and intervening with an authoritative package list.
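The mechanism reduces to a simple idea: watch the decoded text for package names and block any that are missing from an authoritative registry list. A toy sketch (the allowlist, regex, and replacement string are illustrative simplifications; the paper intervenes inside the decoding loop itself):

```python
# Illustrative decode-time guard against package hallucination: scan
# generated text and block package names absent from an authoritative list.
import re

AUTHORITATIVE_PACKAGES = {"numpy", "requests", "pandas"}  # e.g. a registry snapshot

def intervene(partial_output: str) -> str:
    """Rewrite `pip install <pkg>` mentions whose package is not registered."""
    def check(match: re.Match) -> str:
        pkg = match.group(1)
        if pkg.lower() in AUTHORITATIVE_PACKAGES:
            return match.group(0)
        return f"pip install <UNKNOWN PACKAGE: {pkg}>"  # block hallucinated name
    return re.sub(r"pip install ([A-Za-z0-9_.-]+)", check, partial_output)

print(intervene("Try: pip install requests and pip install reqeusts-pro"))
```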
VLMs still can't reason about spatial logic in real-world scenes, but a new benchmark and scene graph method show how to make progress.
LLM-powered pentesting agents fail not because of model limitations, but because they can't estimate task difficulty, leading to wasted effort and premature context exhaustion.
LLM code copilots are put to the test with SecCodeBench-V2, a new benchmark revealing their security vulnerabilities across 22 CWE categories and five programming languages.
A new family of GUI agents, GUI-Owl-1.5, leapfrogs existing open-source models on 20+ GUI benchmarks, proving that multi-platform, real-time GUI automation is now within reach.
Forget monolithic models: a mixture-of-experts approach using clustered semantic domains boosts definition modeling by 7% BLEU, proving that specialization wins.
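The routing step can be sketched in a few lines, assuming placeholder embeddings and a k-means clustering over them (cluster count and experts are invented for illustration): each headword is dispatched to the expert for its semantic domain.

```python
# Sketch of domain routing for a definition-modeling MoE: cluster headword
# embeddings into semantic domains, then route each word to its domain expert.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))        # stand-in for word embeddings
domains = KMeans(n_clusters=8, n_init=10).fit(embeddings)

def route(word_embedding: np.ndarray) -> int:
    """Return the index of the domain expert responsible for this word."""
    return int(domains.predict(word_embedding.reshape(1, -1))[0])

expert_id = route(embeddings[0])  # definition is generated by experts[expert_id]
```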
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.
Retrieval models, even large ones, struggle under realistic acoustic noise, as revealed by the new SQuTR benchmark.
PatientHub finally offers a standardized, reproducible framework for patient simulation, streamlining development and benchmarking across diverse methods and models.
Current verifiers often reward correct answers derived from flawed reasoning, but PRIME offers a benchmark to identify and select verifiers that actually penalize incorrect derivations.
MLLMs can ace the high-level strategy for two-handed robot tasks, but still fumble basic coordination, like keeping the robot's arms from smashing into each other.
GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.
Despite progress in AI safety, how well current safeguards actually prevent AI harms remains largely unknown, and their effectiveness varies wildly.
LLMs still can't convincingly mimic human personas, especially when it comes to syntactic style and memory, despite advancements in other areas.
LLMs still struggle to learn effectively from user feedback during service, as revealed by a new benchmark spanning multiple domains and languages.
LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.