Tsinghua AI

×Tool Use & Agents

54 papers from Tsinghua AI on Tool Use & Agents

Mar 31, 2026

1d ago·also Tsinghua AI, PKU

PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent

Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.

Hongyi Nie, Xunyuan Liu, Yudong Bai +4

Eval Frameworks & Benchmarks Tool Use & Agents

Mar 30, 2026

Tsinghua AI2d ago

CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.

Siyuan Ma, Bo Gao, Zikai Xiao +5

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

2d ago·also Tsinghua AI, XJU

Evaluating Privilege Usage of Agents on Real-World Tools

LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.

Quan Zhang, Lianhang Fu, Li Fu +7

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Mar 27, 2026

Tsinghua AI5d ago

Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

LLMs can learn to generate more "organic" pull requests by distilling coding style, API usage, and architectural invariants from a project's commit history, leading to better acceptance rates.

Mo Li, L. H. Xu, Qitai Tan +2

Code Generation & Program Synthesis Tool Use & Agents

Mar 26, 2026

Tsinghua AI6d ago

Natural-Language Agent Harnesses

Stop burying your agent harness logic in code: NLAHs let you express it in natural language, making it portable, editable, and analyzable.

Lin Pan, Lexiao Zou, Shuo Guo +2

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Mar 19, 2026

Tsinghua AI1w ago·also OPPO, Shenzhen Institutes of Advanced

Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

AI can now handle the tedious copywriting and real-time Q&A for live-streaming commerce, freeing up human streamers to focus on engagement.

Ruizhi Yu, Keyang Zhong, Peng Liu +5

Multimodal Models Natural Language Processing Tool Use & Agents

Mar 18, 2026

Tsinghua AI2w ago

VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation

LLMs can now generate Verilog code that's not just correct, but also optimized for real-world hardware constraints like power, performance, and area, thanks to a novel multi-agent system with evolving memory.

Yaoxiang Wang, Qi Shi, Qiaolin Shi +8

Code Generation & Program Synthesis Tool Use & Agents

Mar 17, 2026

Tsinghua AI2w ago·also Tencent AI

AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

LLM agents can now leverage a unified memory framework that dynamically adapts to different question types, enabling more coherent and user-centric long-horizon dialogues.

Shannan Yan, Jingchen Ni, Leqi Zheng +7

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

Mar 16, 2026

Tsinghua AI2w ago·also AI Chip Center for Emerging Smart, HKUST, UMacau

Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems

Scaling LLM-based multi-agent systems doesn't just need better prompts or models, but a whole new software engineering approach focused on managing runtime entropy.

Weihao Zhang, Yitong Zhou, Huanyu Qu +1

Architecture Design (Transformers, SSMs, MoE)Code Generation & Program Synthesis Tool Use & Agents

Tsinghua AI2w ago·also Beihang, BIT, NJU, Proxseer Inc.

To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

LLMs struggle to effectively use private library APIs even when provided with the correct documentation, but PriCoder can boost their performance by over 20% through targeted training data synthesis.

Yitong Zhang, Chengze Li, Ruize Chen +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Mar 15, 2026

2w ago·also Tsinghua AI, BJTU

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.

Shengda Fan, Xuyan Ye, Yupeng Huo +9

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Mar 12, 2026

Tsinghua AI2w ago·also Ant Group

Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats

Autonomous LLM agents are riddled with vulnerabilities, as point defenses fail to address cross-temporal and multi-stage systemic risks like memory poisoning and intent drift.

Xinhao Deng, Yixiang Zhang, Jiaqing Wu +16

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago·also Tsinghua AI

QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Forget brittle retrieval: QChunker uses a question-aware multi-agent debate to restructure RAG from retrieval-augmentation to *understanding*-retrieval-augmentation, boosting performance across diverse domains.

Jihao Zhao, Daixuan Li, Pengfei Li +3

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

Mar 10, 2026

Tsinghua AI3w ago·also Beihang, York

An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.

Binquan Zhang, Li Zhang, Lin Shi +6

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Tsinghua AI3w ago·also PKU

Video-Based Reward Modeling for Computer-Use Agents

A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.

Linxin Song, Jieyu Zhang, Huanxin Sheng +6

Computer Vision RLHF & Preference Learning Tool Use & Agents

Tsinghua AI3w ago·also NJU

TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control

Forget tweaking knobs – this new Gram-matrix-based audio representation lets you *retrieve* the perfect, editable audio effect preset, outperforming standard methods.

Shihao He, Yihan Xia, Fang Liu +2

Recommendation & Information Retrieval Speech & Audio Tool Use & Agents

Mar 9, 2026

Tsinghua AI3w ago·also School of Mechanical Engineering

\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Current language agents are still far from matching human expert performance when faced with real-world professional tasks requiring complex reasoning, authoritative source retrieval, and domain-specific knowledge, as revealed by the new \$OneMillion-Bench benchmark.

Qianyu Yang, Jiaqi Li, Jun Bai +16

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Mar 5, 2026

Tsinghua AI3w ago

U-Parking: Distributed UWB-Assisted Autonomous Parking System with Robust Localization and Intelligent Planning

LLMs can now parallel park your car: U-Parking uses them for intelligent planning in a distributed UWB-assisted autonomous system.

Yi'an Wu, Yiang Wu, Qiong Wu +6

Robotics & Embodied AI Tool Use & Agents World Models & Planning

3w ago·also Tsinghua AI, USTC, Xiaohongshu

GCAgent: Enhancing Group Chat Communication through Dialogue Agents System

Group chats can be revitalized with LLM-powered agents, boosting message volume by nearly 30% in real-world deployments.

Zijie Meng, Zheyong Xie, Zheyu Ye +4

Natural Language Processing Tool Use & Agents

Tsinghua AI3w ago

Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

LLMs under pressure to survive exhibit surprisingly frequent and diverse risky behaviors, from financial fraud to misinformation, highlighting a critical safety gap in agentic AI.

Yida Lu, J. Fang, Jianwei Fang +8

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Mar 4, 2026

Tsinghua AIMar 4, 2026·also Department of Industrial Engineering, Zhili College

Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

LLMs can synthesize verifiable discrete-event world models from natural language, bridging the gap between hand-engineered simulators and unconstrained neural models.

Zhuohuan Li

Eval Frameworks & Benchmarks Tool Use & Agents World Models & Planning

Mar 3, 2026

Tsinghua AIMar 3, 2026·also Donghua University

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

By normalizing rewards across groups of sampled communication graphs, Graph-GRPO stabilizes multi-agent topology learning and uncovers critical communication pathways obscured by noisy, absolute rewards.

Yueyang Cang, Xiaoteng Zhang, Erlu Zhao +9

Distributed Systems & Hardware Tool Use & Agents Training Efficiency & Optimization

Tsinghua AIMar 3, 2026

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Multimodal jailbreaks, meet your match: SaFeR-ToolKit's virtual tool-calling protocol boosts VL model safety by up to 55% without sacrificing general capabilities.

Zixuan Xu, Tiancheng He, Huahui Yi +3

Multimodal Models Red-Teaming & Adversarial Robustness Tool Use & Agents

Tsinghua AIMar 3, 2026

Agentic Self-Evolutionary Replanning for Embodied Navigation

Robots that learn from their mistakes *while* navigating? SERP unlocks this by evolving the action model in-context during replanning, boosting success rates and cutting token costs.

Guoliang Li, Ruihua Han, Chengyang Li +4

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Mar 2, 2026

Tsinghua AIMar 2, 2026

What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction

Automating paper reproduction isn't about finding code, it's about filling in the "missing manual" of tacit knowledge, and this graph-based agent closes the gap by 24.68%.

Lehui Li, Ruining Wang, Haochen Song +8

Code Generation & Program Synthesis Open-Source Models & Weights Tool Use & Agents

Mar 1, 2026

Mar 1, 2026·also Tsinghua AI, Hebei University of Science and Technology

GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant

Get 3x more bang for your buck in multi-user LLM chat applications with GroupGPT, a framework that slashes token usage while preserving privacy.

Zhuokang Shen, Yifan Wang, Hanyuan Chen +2

Architecture Design (Transformers, SSMs, MoE)Natural Language Processing Tool Use & Agents

Feb 28, 2026

Tsinghua AIFeb 28, 2026·also DAMO, CAS, PolyU, ZJU

Qwen3-Coder-Next Technical Report

An 80B model that runs like a 3B? Qwen3-Coder-Next shows you can get competitive coding agent performance with a fraction of the active parameters, thanks to smart training.

Ruisheng Cao, Mouxiang Chen, Jiawei Chen +17

Code Generation & Program Synthesis Inference & Quantization Tool Use & Agents

Feb 27, 2026

Tsinghua AIFeb 27, 2026·also IEEE

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Agentic RL can now beat proprietary LLMs and torch.compile in the challenging domain of CUDA kernel generation, achieving up to 40% speedups on hard tasks.

Weinan Dai, Hanlin Wu, Qiying Yu +12

Code Generation & Program Synthesis Distributed Systems & Hardware Tool Use & Agents

Feb 26, 2026

Tsinghua AIFeb 26, 2026·also CUHK, HKUST, MiroMind AI, PKU +1

MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks

MiroFlow leapfrogs existing LLM agent frameworks with its agent graph architecture, delivering state-of-the-art performance and robust execution across a diverse range of benchmarks.

Shiqian Su, Shiqian Su, Sen Xing +22

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Tsinghua AIFeb 26, 2026·also Microsoft Research, Beihang

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

LLM agents can learn to explore novel states and generalize to new tasks with a hybrid on- and off-policy RL framework that leverages memory.

Zeyuan Liu, Zeyuan Liu, Jeonghye Kim +5

RLHF & Preference Learning Tool Use & Agents World Models & Planning

Tsinghua AIFeb 26, 2026·also OpenAI, CAS, China Academy of Space Technology, Kuaishou +3

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Context-augmented RL lets smaller MLLMs punch *way* above their weight, rivaling much larger models on reasoning tasks while dodging reward hacking.

Xingyu Lu, Jinpeng Wang, Jinpeng Wang +21

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Feb 26, 2026·also DAMO, Meta AI, Tsinghua AI, Corresponding author +3

SkillNet: Create, Evaluate, and Connect AI Skills

AI agents can now learn durable skills instead of constantly "reinventing the wheel," thanks to SkillNet's infrastructure for creating, evaluating, and connecting AI skills at scale.

Yuanying Liang, R. Zhong, Haoming Xu +46

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Tsinghua AIFeb 26, 2026·also CUHK, Xidian

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Achieve real-time, high-precision GUI navigation with minimal resources by pruning redundant visual tokens *without* retraining.

Zhou Xu, Zhou Xu, Bowen Zhou +5

Computer Vision Inference & Quantization Tool Use & Agents

Tsinghua AIFeb 26, 2026·also HIT, SYSU

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Multi-agent systems get a 6.3% accuracy boost on math problems thanks to a new "rectify-or-reject" pruning method that dynamically filters out bad information at test time.

Yutong Wang, Yutong Wang, Siyuan Xiong +10

Inference & Quantization Tool Use & Agents

Feb 25, 2026

Nantong UniversityFeb 25, 2026·also Tsinghua AI, Beihang, Georgia Tech, NJU +5

An Empirical Study of Bugs in Modern LLM Agent Frameworks

LLM agent frameworks are riddled with bugs stemming from API misuse and documentation issues, leading to crashes and functional errors that current agent-level evaluations miss.

Xinxue Zhu, Xinxue Zhu, Jiacong Wu +10

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Feb 24, 2026

Tsinghua AIFeb 24, 2026

Grounding LLMs in Scientific Discovery via Embodied Actions

LLMs can now actively perceive and react to anomalies during scientific simulations, leading to more reliable and accurate results in complex engineering and modeling tasks.

Jinfeng Zhou, Yuxuan Chen, Jianing Yin +1

Robotics & Embodied AI Scientific Discovery & Drug Design Tool Use & Agents

Tsinghua AIFeb 24, 2026·also Shanghai AI Lab

PyVision-RL: Forging Open Agentic Vision Models via RL

Reinforcement learning for multimodal agents doesn't have to collapse into uselessness: PyVision-RL shows how to stabilize training and encourage multi-turn tool use.

Shitian Zhao, Shitian Zhao, Shaoheng Lin +5

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Tsinghua AIFeb 24, 2026

How Foundational Skills Influence VLM-based Embodied Agents:A Native Perspective

Current VLM-driven embodied agents struggle with fundamental skills like navigation and object manipulation when evaluated in realistic, low-level action spaces, severely hindering their performance on complex tasks.

Bo Peng, Pi Bu, Keyu Pan +5

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Feb 24, 2026·also Tsinghua AI, ByteDance, NJU

KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning

Unlock richer time series analysis by injecting semantic understanding, enabling models to reason beyond raw numbers.

Haotian Si, Changhua Pei, Xiao He +9

Multimodal Models Reasoning & Chain-of-Thought Tool Use & Agents

Feb 23, 2026

Tsinghua AIFeb 23, 2026·also CAS

SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

LLMs can now capture an author's unique voice in translations, thanks to a multi-agent system guided by a "Stylistic Feature Spectrum" derived from wavelet transforms.

Jingzhuo Wu, Jiajun Zhang, Jiajun Zhang +5

Natural Language Processing Tool Use & Agents

Feb 19, 2026

Feb 19, 2026·also Tsinghua AI, Baidu, SMU, UNSW

What Makes a Good LLM Agent for Real-world Penetration Testing?

LLM-powered pentesting agents fail not because of model limitations, but because they can't estimate task difficulty, leading to wasted effort and premature context exhaustion.

Gelei Deng, Yi Liu, Yuekang Li +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Feb 17, 2026

Tsinghua AIFeb 17, 2026·also NVIDIA, ByteDance, Department of Industrial and Management, SUCCESS Lab +4

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 doesn't just code; it engineers, showcasing unprecedented capability in tackling end-to-end software engineering challenges.

GLM-5 Team, Zhenyu Hou, Qinkai Zheng +145

Code Generation & Program Synthesis Tool Use & Agents Training Efficiency & Optimization

Feb 16, 2026

Tsinghua AIFeb 16, 2026

WebWorld: A Large-Scale World Model for Web Agent Training

Training web agents in a simulator can now match real-world performance: Qwen3-14B, fine-tuned with WebWorld-synthesized trajectories, rivals GPT-4o on WebArena.

Zikai Xiao, Jianhong Tu, Chuhang Zou +2

Data Curation & Synthetic Data Tool Use & Agents World Models & Planning

Tsinghua AIFeb 16, 2026·also DAMO, Beihang, Fudan, NTU +1

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Frontier AI is getting sneakier: this report details how LLMs are now capable of emergent misalignment, LLM-to-LLM persuasion, and autonomous mis-evolution, demanding robust mitigation strategies.

Dongrui Liu, Dongrui Liu, Yi Yu +32

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Feb 15, 2026

Tsinghua AIFeb 15, 2026·also DAMO, BUPT

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

A new family of GUI agents, GUI-Owl-1.5, leapfrogs existing open-source models on 20+ GUI benchmarks, proving that multi-platform, real-time GUI automation is now within reach.

Haiyang Xu, Haiyang Xu, Xi Zhang +36

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Google ResearchFeb 15, 2026·also BraneMatrix AI, Chongqing, Northeast University, SYSU

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Coding agents are vulnerable to a new class of stealthy, automated prompt injection attacks via poisoned skills, achieving high success rates even in realistic software engineering tasks.

Xiaojun Jia, Jie Liao, Simeng Qin +3

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness Tool Use & Agents

Tsinghua AIFeb 15, 2026·also M steps for a fair comparison.

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

By strategically resampling from deep, recoverable states ("pivots") within unsuccessful trajectories, DDE drastically improves LLM reinforcement learning compared to methods that oversample from the root or blindly disperse budgets.

Yiran Guo, Zhongjian Qiao, Yingqi Xie +3

RLHF & Preference Learning Tool Use & Agents

Feb 12, 2026

Feb 12, 2026·also Tsinghua AI, Nankai University, USTC

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Generate a million educational videos a day at 5% of the cost using a novel LLM-based multi-agent system that orchestrates problem-solving, visualization, and narration.

Jiulong Wu, Dong Xie, Deguo Xia +1

Computer Vision Multimodal Models Tool Use & Agents

Dec 19, 2025

OpenAIDec 19, 2025·also Anthropic, BAIR, Mila, MIT CSAIL +5

OpenAI GPT-5 System Card

GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.

Aaditya K. Singh, Adam Fry, Adam Perelman +30562

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Nov 24, 2025

Tsinghua AINov 24, 2025·also B-Ins)

HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

LLM agents can now navigate the vast model zoo of HuggingFace with 6.9x less token consumption and 33% better reasoning, thanks to a new iterative selection framework.

Shaoyin Ma, Chenggong Hu, Huiqiong Wang +3

Open-Source Models & Weights Reasoning & Chain-of-Thought Tool Use & Agents

Search

Tsinghua AI