LLMs choke on long numerical sequences, but a simple separator-token trick can boost accuracy by 35% and cut token costs by 16%, with no training required.
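A rough sketch of the separator idea, assuming the trick amounts to breaking a long run of numbers into small groups with an explicit separator token before prompting the model. The `add_separators` helper, the `" | "` separator, and the group size of 3 are illustrative assumptions, not the article's exact recipe:

```python
def add_separators(values, group_size=3, sep=" | "):
    """Join numbers into space-separated groups, inserting a separator
    token between groups so the model can anchor positions in the
    sequence instead of seeing one undifferentiated stream of digits."""
    groups = [
        " ".join(str(v) for v in values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]
    return sep.join(groups)

# Hypothetical sensor readings fed into a prompt.
readings = [12.4, 12.7, 13.1, 13.0, 12.8, 12.9, 13.4]
prompt = add_separators(readings)
# -> "12.4 12.7 13.1 | 13.0 12.8 12.9 | 13.4"
```

The separator costs a few extra characters per group, but if it lets the model answer correctly in fewer retries or with a shorter chain of reasoning, the net token cost can still drop.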
Today's best AI agents fail at realistic software engineering tasks, stalling before reaching even 30% completion, which underscores the need for better long-horizon planning and human-AI collaboration.
LLMs are surprisingly bad at keeping up with how people's minds change over time, lagging humans by 45% on a new benchmark designed to test this crucial social skill.