Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than the theoretical optimum in a new diagnostic environment.
LLMs in embodied environments get a massive boost from structured rules, with rule retrieval alone contributing +14.9 pp to single-trial success.
Forget specialized tools: a standard Unix terminal and clever RL are all you need to beat much larger LLMs at code search.
LLMs can navigate complex 3D environments more effectively and with far fewer tokens by using a hierarchical scene graph representation derived from omnidirectional sensor data.
Learn a critic for coding agents from human-in-the-loop interaction traces alone, sidestepping the need for dense, verifiable rewards.
HALyPO stabilizes human-robot collaboration by directly certifying the convergence of decentralized policy learning in parameter space, sidestepping the oscillations that plague standard MARL approaches.
Forget hand-engineering world models – this work proves that competent agents *must* internally represent the world in a structured, predictive way to minimize regret under uncertainty.
AI tools are surprisingly bad at classifying the cognitive demand of math problems, with accuracy barely above chance and a systematic bias towards average difficulty, raising concerns about their utility in supporting teachers.
Today's frontier LLMs can't autonomously patch critical zero-day vulnerabilities, revealing a significant gap in their cyberdefense capabilities.
Injecting knowledge graphs into LLMs boosts medical question generation by 8%, suggesting a simple way to patch up LLM knowledge gaps.
By decomposing long-horizon manipulation into transport and object-centric interaction, LiLo-VLA achieves state-of-the-art zero-shot generalization and robustness, outperforming end-to-end VLA models by a large margin.
Injecting LLMs into rule-based dialogue systems for learner reflection can boost the depth of insights, but risks disengagement due to repetitiveness and misalignment.
Modularity in HRI isn't just about interchangeable parts; it's a powerful design medium for fostering long-term, evolving relationships between humans and robots.
General-purpose LLM agents stumble badly when faced with the messy reality of diverse, multi-domain tasks, and simply scaling interactions or parallel sampling doesn't fix it.
LLMs can turn sparse rewards into dense training signals for RL agents, achieving comparable performance with significantly higher sample efficiency.
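One standard way to instantiate the sparse-to-dense idea (not necessarily this paper's method) is potential-based reward shaping with an LLM-scored progress potential; in this minimal sketch, `llm_progress` is a stub standing in for an actual model call, and all names are illustrative.

```python
# Sketch: densify a sparse reward via potential-based shaping, using an
# (stubbed) LLM progress score as the potential phi. The shaping term
# gamma * phi(s') - phi(s) adds dense signal without changing the optimal
# policy, a known property of potential-based shaping.

def llm_progress(state: str) -> float:
    # Stub: a real system would prompt an LLM to rate progress in [0, 1].
    return min(1.0, state.count("subgoal_done") / 3.0)

def shaped_reward(sparse_r: float, state: str, next_state: str,
                  gamma: float = 0.99) -> float:
    """Dense reward = sparse env reward + gamma * phi(s') - phi(s)."""
    return sparse_r + gamma * llm_progress(next_state) - llm_progress(state)

# A transition completing one of three subgoals gets a positive shaping
# bonus even though the environment reward is zero.
bonus = shaped_reward(0.0, "", "subgoal_done")
```

The potential-based form is chosen here because it preserves the optimal policy while still giving the agent per-step feedback.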
Stop guessing when humans want to take over: modeling user intervention styles in web agents boosts their usefulness by 26.5%.
Forget training on narrow GitHub issues – Hybrid-Gym unlocks surprisingly broad coding skills by teaching agents to explore codebases and design architectures in synthetic environments.
An educational RAG system achieves 84% accuracy in answering student questions with minimal human editing, suggesting a practical path towards scalable AI-assisted teaching.
Forget slow text-based communication: Vision Wormhole unlocks faster multi-agent reasoning by turning VLMs into telepathic hubs, slashing runtime without sacrificing fidelity.
Multimodal agents still struggle with game development, solving only ~50% of tasks in a new benchmark, GameDevBench, highlighting the need for better multimodal reasoning in complex software environments.
Forget context window limits: this RL method uses LLM-generated summaries to train agents for long-horizon tasks, achieving higher success rates with less context.
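A minimal, framework-agnostic sketch of the summaries-for-context pattern: keep the most recent turns verbatim and compress everything older into a summary. `summarize` is a stub for the LLM call; function names are illustrative, not from the paper.

```python
# Sketch: bound the agent's context by replacing older transcript turns
# with an (stubbed) LLM-generated summary, so long-horizon tasks fit in a
# fixed window.

def summarize(turns: list[str]) -> str:
    # Stub: a real system would ask an LLM to compress these turns.
    return "SUMMARY: " + "; ".join(t.split(":")[0] for t in turns)

def compact_context(turns: list[str], budget: int) -> list[str]:
    """Keep the last `budget` turns verbatim; summarize the rest."""
    if len(turns) <= budget:
        return turns
    old, recent = turns[:-budget], turns[-budget:]
    return [summarize(old)] + recent
```

Keeping recent turns verbatim matters because the next action usually depends on fine-grained details of the latest observations, while older steps can tolerate lossy compression.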
LLMs struggle to track state across multiple tool-use steps, but a surprisingly simple fix—restating prior variable values—yields substantial performance gains.
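The restatement fix can be sketched as a prompt-construction helper that explicitly re-lists known variable bindings before each tool-use step, so the model need not track state implicitly across a long transcript. All names here (`restate_state`, `build_step_prompt`) are illustrative, not the paper's API.

```python
# Sketch: before each tool-use step, append an explicit restatement of the
# variable values produced by earlier steps.

def restate_state(variables: dict) -> str:
    """Render known variable bindings as an explicit reminder block."""
    if not variables:
        return ""
    lines = [f"- {name} = {value!r}" for name, value in variables.items()]
    return "Current variable state:\n" + "\n".join(lines)

def build_step_prompt(task: str, history: list[str], variables: dict) -> str:
    """Compose the prompt for the next tool call, restating prior state."""
    parts = [task, *history]
    reminder = restate_state(variables)
    if reminder:
        parts.append(reminder)  # the simple fix: explicit restatement
    return "\n\n".join(parts)

# Example: after two tool calls, the agent's known bindings are restated.
prompt = build_step_prompt(
    "Refund the user's most recent order.",
    ["tool: lookup_user -> user_id=42", "tool: get_order -> status='shipped'"],
    {"user_id": 42, "order_status": "shipped"},
)
```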