Chao Peng

LLM agents can now achieve a 92% success rate in complex code repository setup by learning from past failures, a 19% improvement over existing methods.

Zihang Zhou, Ziqian Ren, Yukai Wu +4

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

May 21, 2026 · also NJU, Tencent AI

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Today's agents are surprisingly bad at real-world terminal tasks, with even frontier models failing nearly 40% of the time on everyday workflows.

Zhaoyang Chu, Jiarui Hu, Pengyu Zou +6

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Apr 8, 2026 · also Tencent AI

Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

Current LLMs struggle to leverage software documentation for repository-level comprehension, but high-quality documentation can boost agent performance by 20%.

Xinchen Wang, Ruida Hu, Pengfei Gao +1

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Natural Language Processing

Apr 8, 2026 · also SMU, Tencent AI

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

LLMs struggle to build complete software from scratch, with even the best models failing more than half the time on a new CLI tool generation benchmark.

Ruida Hu, Xinchen Wang, Chao Peng +2

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Papers (5)