Stanford HAI

25 papers from Stanford HAI on Tool Use & Agents

May 6, 2026

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.

Zhaorun Chen, Xun Liu, Haibo Tong +14

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

May 4, 2026

Stanford HAI · 2w ago

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler +10

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Apr 28, 2026

3w ago · also Stanford HAI

Automated Adversarial Collaboration for Advancing Theory Building in the Cognitive Sciences

LLMs can now automatically design and execute experiments to resolve debates between cognitive science theories, even discovering the models and experiments themselves.

Suyog Chandramouli, George Kachergis, Akshay Jagadish

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Stanford HAI · 3w ago · also NVIDIA, University of Illinois Urbana-Champaign

Recursive Multi-Agent Systems

Looping language models isn't just for single agents anymore: Recursive Multi-Agent Systems (RecursiveMAS) show that agent collaboration itself can be scaled through recursion, yielding faster and more efficient problem-solving.

Xiyuan Yang, Jiaru Zou, Rui Pan +8

Architecture Design (Transformers, SSMs, MoE) Reasoning & Chain-of-Thought Tool Use & Agents

Apr 22, 2026

Stanford HAI · Apr 22, 2026

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Turns out, coding agents in the wild are only writing useful code 44% of the time, and are introducing more security vulnerabilities than human developers.

Joachim Baumann, Vishakh Padmakumar, John Yang +2

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents

Apr 20, 2026

University of Science and Technology · Apr 20, 2026 · also Stanford HAI, Chicago University, DUT, Harvard +1

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

Achieve real-time video understanding with transparent reasoning: the proposed model aligns response timing with visual evidence, offering a breakthrough for online video LLMs.

Kecheng Zhang, Zongxin Yang, Mingfei Han +6

Computer Vision Multimodal Models Tool Use & Agents

Apr 16, 2026

ETH · Apr 16, 2026 · also Stanford HAI, Heidelberg, Institute of Computer Science, UZH

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent doesn't just give you the answer; it shows its work, offering clinicians a transparent, step-by-step reasoning trace for AI-generated CT reports.

Jean-Benoit Delbrouck, Christian Bluethgen, Bjoern Menze +1

Computer Vision Multimodal Models Tool Use & Agents

Apr 13, 2026

Apr 13, 2026 · also Stanford HAI

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

A lightweight, RL-trained context curator can match GPT-4o's context management abilities, slashing token consumption by 8x and opening the door to efficient long-horizon LLM agents.

Xiaozhe Li, Tianyi Lyu, Yizhao Yang +6

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Ludwig-Maximilians-Universität München · Apr 13, 2026 · also DeepMind, Google Research, Stanford HAI, Munich Center for Machine Learning +1

Epistemic Trust as a Mechanism for Ethics Integration: Failure Modes and Design Principles from 70 Moral Imagination Workshops

Ethics interventions in AI development often fail because practitioners don't trust them – here's a breakdown of why, and how to fix it.

Benjamin Lange, Geoff Keeling, Kyle Pedersen +4

Constitutional AI & AI Ethics Natural Language Processing Tool Use & Agents

Apr 9, 2026

Stanford HAI · Apr 9, 2026

Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

LLMs can decide when they need more "thinking time" – and boost their accuracy while slashing compute costs by up to 65% – simply by checking if they agree with themselves.

Khushal Sethi

Inference & Quantization Reasoning & Chain-of-Thought Tool Use & Agents

Apr 5, 2026

Stanford HAI · Apr 5, 2026 · also Amazon Science, BAIR

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Scaling prompt learning by 17x without sacrificing accuracy is now possible, unlocking efficient self-improvement for LLM agents.

Hanchen Li, Runyuan He, Qizheng Zhang +13

Distributed Systems & Hardware Scaling Laws & Emergent Abilities Tool Use & Agents +1

Apr 2, 2026

Stanford HAI · Apr 2, 2026 · also MIT CSAIL, McGill, SambaNova

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

LLM agents can autonomously outperform fixed evolutionary search by 3-10x on open-ended discovery tasks when given persistent memory, asynchronous collaboration, and heartbeat-based interventions.

Ao Qu, Handi Zheng, Zijian Zhou +14

Natural Language Processing Scientific Discovery & Drug Design Tool Use & Agents

Mar 30, 2026

Stanford HAI · Mar 30, 2026

Synonymix: Unified Group Personas for Generative Simulations

Unlock richer, more realistic agent simulations by moving beyond individual personas to unified group representations that capture collective behavior.

Huanxing Chen, Aditesh Kumar

Natural Language Processing Tool Use & Agents World Models & Planning

Stanford HAI · Mar 30, 2026 · also CUHK, Lehigh

Towards a Medical AI Scientist

Medical AI Scientist leapfrogs generic LLMs in clinical research, generating higher-quality, evidence-backed hypotheses and manuscripts that rival top-tier medical publications.

Hongtao Wu, Boyun Zheng, Dingjie Song +2

Natural Language Processing Scientific Discovery & Drug Design Tool Use & Agents

Stanford HAI · Mar 30, 2026 · also KRAFTON

Meta-Harness: End-to-End Optimization of Model Harnesses

LLM performance hinges on the code around the model, and Meta-Harness proves that automating the design of this "harness" can significantly boost results across diverse tasks.

Yoonho Lee, Roshen Nair, Qizhen Zhang +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Mar 29, 2026

UW · Mar 29, 2026 · also AI2, Microsoft Research, Stanford HAI, Bake AI +5

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

Generative multi-agent systems spontaneously exhibit collusion and conformity, mirroring societal pathologies, even without explicit programming and bypassing individual agent safeguards.

Wenjie Wang, Yuchen Ma, Zichen Chen +4

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Mar 19, 2026

Mar 19, 2026 · also Stanford HAI, Data61, UofT

Multi-User Large Language Model Agents

LLMs, impressive as they are, can't juggle multiple users' conflicting needs without dropping balls on privacy, prioritization, and efficiency.

Shu Yang, Shenzhe Zhu, Hao Zhu +4

Constitutional AI & AI Ethics Tool Use & Agents

Mar 13, 2026

Stanford HAI · Mar 13, 2026 · also OpenHands, UCR, UCSD, USC

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, dropping from 80% to 38% success rates in a new benchmark.

Gangda Deng, Zhaoling Chen, Zhongming Yu +6

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Mar 12, 2026

Stanford HAI · Mar 12, 2026

EducaSim: Interactive Simulacra for CS1 Instructional Practice

Imagine a flight simulator, but for teaching: EducaSim lets CS1 instructors hone their skills in a realistic, scalable environment powered by generative agents.

Cameron Mohne, Nicholas Vo, Dora Demszky +1

Tool Use & Agents World Models & Planning

Mar 10, 2026

AnsibleHealth Inc. · Mar 10, 2026 · also Stanford HAI, George Washington University

From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring

An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.

SeungHwan Kim, Tiffany H. Kung, H. Verma +15

Reasoning & Chain-of-Thought Scientific Discovery & Drug Design Tool Use & Agents

Mar 4, 2026

Mar 4, 2026 · also Stanford HAI

HDLFORGE: A Two-Stage Multi-Agent Framework for Efficient Verilog Code Generation with Adaptive Model Escalation

Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.

Armin Abdollahi, Saeid Shokoufa, Negin Ashrafi +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Feb 25, 2026

BAIR · Feb 25, 2026 · also Stanford HAI

Power and Limitations of Aggregation in Compound AI Systems

Aggregating responses from multiple copies of the same model expands the range of achievable outputs in compound AI systems through three key mechanisms, offering a path to overcome individual model limitations.

Nivasini Ananthakrishnan, Meena Jagadeesan

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory Tool Use & Agents

Feb 22, 2026

Feb 22, 2026 · also BAIR, Stanford HAI, ETH Zurich, FieldAI Inc. 3 Morgan

WildOS: Open-Vocabulary Object Search in the Wild

Robots can now navigate complex outdoor environments and find objects using natural language queries, even without prior maps or precise depth sensing.

Hardik Shah, Erica Tevere, Deegan Atha +4

Computer Vision Robotics & Embodied AI Tool Use & Agents

Feb 18, 2026

Stanford HAI · Feb 18, 2026

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

A single RL policy trained on procedurally generated tools in simulation can achieve zero-shot dexterous manipulation of diverse real-world tools, rivaling task-specific policies.

Kushal Kedia, Tyler Ga Wei Lum +5

Robotics & Embodied AI Tool Use & Agents World Models & Planning

Nov 19, 2025

Stanford HAI · Nov 19, 2025 · also Amazon Science, Google Research, Microsoft Research, UW +5

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Open-source LLMs can now autonomously optimize AI accelerator kernels, matching the performance of proprietary models at a fraction of the cost.

Genghan Zhang, Shaowei Zhu +15

Code Generation & Program Synthesis Distributed Systems & Hardware Tool Use & Agents +1