Qingyao Ai

LexRubric reveals that even state-of-the-art LLMs struggle with open-ended legal tasks, exposing critical gaps in their contextual understanding and reasoning abilities.

Yifan Chen, Kaisong Song, Jun Lin +3

Eval Frameworks & Benchmarks

Tsinghua AI · Jun 8, 2026 · also BUPT, University of California

Civil Court Simulation with Large Language Models

Reliable civil court judgments can now be simulated with a framework that adapts to the complexities of legal claims and remedies.

Yifan Chen, Kaiyuan Zhang, Yueyue Wu +2

Natural Language Processing · World Models & Planning

Tsinghua AI · Apr 29, 2026

Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation

Untangling task-solving skills from factual knowledge in PRAG adapters makes them play better together, boosting performance when you combine multiple documents.

Weihang Su, Qingyao Ai

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Recommendation & Information Retrieval

Tsinghua AI · Apr 27, 2026

Skill Retrieval Augmentation for Agentic AI

Explicitly enumerating skills in-context doesn't scale for agentic LLMs, but retrieving skills on demand can substantially improve performance – if the LLM can figure out when and which skill to load.

Weihang Su, Jianming Long, Qingyao Ai +4

Recommendation & Information Retrieval · Tool Use & Agents

Apr 8, 2026 · also Tsinghua AI

TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Humans are still way better than LLMs at trial-and-error problem solving, and this new dataset of human problem-solving trajectories shows us why.

Xinkai Zhang, Jingtao Zhan +1

Data Curation & Synthetic Data · Reasoning & Chain-of-Thought · Tool Use & Agents

Tsinghua AI · Mar 17, 2026

Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Injecting demographic attributes directly into LLM hidden states can drastically improve the diversity and realism of public opinion simulations.

Hexi Wang, Yujia Zhou, Bangde Du +2

Constitutional AI & AI Ethics · Natural Language Processing · World Models & Planning

Tsinghua AI · Feb 12, 2026

Analytical Search

Current search paradigms fall short for analytical tasks, motivating a new "analytical search" framework that treats search as an evidence-driven, multi-step reasoning process.

Shuo Miao, Yiqun Liu, Qingyao Ai

Natural Language Processing · Recommendation & Information Retrieval

Tsinghua AI · Oct 29, 2025 · also Fudan, Rutgers

TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

LLMs still can't convincingly mimic human personas, especially when it comes to syntactic style and memory, despite advancements in other areas.

Bangde Du, Minghao Guo, Songming He +8

Eval Frameworks & Benchmarks · Natural Language Processing · Speech & Audio

Tsinghua AI · Oct 20, 2025

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

LLMs still struggle to learn effectively from user feedback during service, as revealed by a new benchmark spanning multiple domains and languages.

Qingyao Ai, Yichen Tang, Changyue Wang +311

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing

Tsinghua AI · Aug 21, 2025 · also Fudan

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.

Weihang Su, Anzhe Xie, Qingyao Ai +5

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing