MIT CSAIL

Tool Use & Agents

25 papers from MIT CSAIL on Tool Use & Agents

Jul 8, 2026

Independent Researcher · 5d ago · also MIT CSAIL, IIT

Reason Less, Verify More: Deterministic Gates Recover a Silent Policy-Violation Failure Mode in Tool-Using LLM Agents

Silent policy violations in tool-using LLMs can be mitigated by deterministic gates, improving success rates by over 12 percentage points in critical tasks.

Vikas Reddy, Vikas D. Reddy, Sumanth Reddy Challaram +2

Constitutional AI & AI Ethics · Tool Use & Agents

Jul 6, 2026

MIT CSAIL · 1w ago · also Stanford HAI, University of California

EEG-SpikeAgent: Agentic Closed-Loop Program Synthesis for Automated EEG Spike Detection

LLM-driven program synthesis can automate EEG feature engineering while ensuring interpretability and high detection accuracy.

Sonali Santhosh, Kelly Shuhong Yu, Eugene Chang +3

Code Generation & Program Synthesis · Tool Use & Agents

1w ago · also MIT CSAIL, DevRev AI LLC

AI Agent Pull Requests on GitHub: Frequency, Structure, and Merge Conflict Rates

Nearly 80% of AI-generated pull requests are submitted concurrently, raising critical questions about collaboration efficiency and merge conflicts in AI coding agents.

George Xu, Arjun Subramanian, Nithilan Karthik

Code Generation & Program Synthesis · Tool Use & Agents

MIT CSAIL · 1w ago

PatchOptic for Shared-State LLM Workflows with Projected Views and Verified Structured Updates

Projected reads in PatchOptic not only cut token costs but also ensure that local updates remain valid in the context of shared-state workflows.

Zhaoyu Bai, Jiaqi Cai

Recommendation & Information Retrieval · Tool Use & Agents

Jun 18, 2026

MIT CSAIL · 3w ago

Optimal Order of Multi-Agent and General Many-Body Systems

Balancing productivity and stability reveals that stronger synchronization can paradoxically increase systemic fragility in multi-agent systems.

Jake J. Xia

Tool Use & Agents · World Models & Planning

MIT CSAIL · 3w ago · also Harvard

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Trajectory mining reveals skill structures but fails to translate these insights into meaningful performance gains for downstream policies.

Yuexing Hao, Xiaomin Li

Code Generation & Program Synthesis · Tool Use & Agents

Jun 16, 2026

MIT CSAIL · 3w ago · also UC Santa Barbara

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Retaining visual figures in skill artifacts boosts CUA performance by over 23 points, proving that seeing is believing in agent training.

Ziyan Jiang, Li An, Yujian Liu +4

Multimodal Models · Tool Use & Agents

Jun 10, 2026

MIT CSAIL · Jun 10, 2026 · also BAIR, Microsoft Research, Stanford HAI +8

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Current AI agents excel in structured tasks but falter at generating novel insights and tackling open-ended scientific challenges.

Tianyu Liu, Allen Wang, Antonia Panescu +30

Eval Frameworks & Benchmarks · Scientific Discovery & Drug Design · Tool Use & Agents

Jun 10, 2026 · also MIT CSAIL

MedCTA: A Benchmark for Clinical Tool Agents

Even state-of-the-art multimodal models struggle with reliability in clinical tool use, revealing critical gaps in AI agent performance.

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker

Eval Frameworks & Benchmarks · Tool Use & Agents

Jun 1, 2026

Microsoft Research · Jun 1, 2026 · also MIT CSAIL, Ant Group, BJUT, Digital Technologies +11

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw reveals that existing benchmarks fall short in capturing the complexities of agent behavior, enabling a more nuanced evaluation of security risks in autonomous systems.

Hao Cheng, Changtao Miao, Tianle Song +15

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness · Tool Use & Agents

May 21, 2026

MIT CSAIL · May 21, 2026

Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

Coordinating AI agents across scientific disciplines boosts performance only when each discipline captures a unique piece of the puzzle; otherwise, simpler combined summaries often suffice.

Fiona Y. Wong

Eval Frameworks & Benchmarks · Scientific Discovery & Drug Design · Tool Use & Agents

Improbable AI Lab · May 21, 2026 · also MIT CSAIL, KU

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

LLMs trained with Vector Policy Optimization (VPO) learn to produce diverse solutions that unlock previously unsolvable problems in evolutionary search, outperforming models optimized for single scalar rewards.

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld +6

Natural Language Processing · RLHF & Preference Learning · Tool Use & Agents

Apr 28, 2026

Apr 28, 2026 · also MIT CSAIL, TWT GmbH Science & Innovation

Emotive Architectures: The Role of LLMs in Adjusting Work Environments

Imagine a workspace that subtly shifts lighting and sound to match your mood, all powered by an LLM that understands your needs – this paper explores the potential and pitfalls of that reality.

Lara Vartziotis, Tina Vartziotis, Frank Beutenmueller +4

Natural Language Processing · Tool Use & Agents

Apr 22, 2026

MIT CSAIL · Apr 22, 2026 · also Istituto Italiano di Tecnologia, Perseus Labs

pAI/MSc: ML Theory Research with Humans on the Loop

Imagine slashing the human effort needed to go from hypothesis to submission-ready ML theory paper by orders of magnitude.

Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio

Open-Source Models & Weights · Scientific Discovery & Drug Design · Tool Use & Agents

Apr 20, 2026

MIT CSAIL · Apr 20, 2026 · also UW Allen School of CSE, UW Department of Philosophy

Navigating the Conceptual Multiverse

Uncover the hidden assumptions baked into LLM responses with a new interactive system that lets you explore alternative conceptual framings and values.

Andre Ye, Jenny Y. Huang, Alicia Guo +3

Constitutional AI & AI Ethics · Interpretability & Mechanistic Interp · Tool Use & Agents

Apr 14, 2026

MIT CSAIL · Apr 14, 2026

A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

Multi-agent systems can find 5x more real-world events in satellite imagery than traditional methods, unlocking a wealth of training data for multi-temporal change detection.

Madeline Anderson, Mikhail Klassen, Ash Hoover +1

Computer Vision · Multimodal Models · Tool Use & Agents

Apr 13, 2026

Kakashi Ventures Accelerator (KVA) · Apr 13, 2026 · also MIT CSAIL

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

Organizational AI's biggest bottleneck isn't finding the right information, but knowing what's actually true, agreed upon, or even known at all.

Federico Bottino, Carlo Ferrero, Nicholas Dosio +1

Natural Language Processing · Recommendation & Information Retrieval · Tool Use & Agents

Apr 9, 2026

Microsoft Research · Apr 9, 2026 · also MIT CSAIL

From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants

Gaze-tracking unlocks a new level of personalized AI assistance, enabling LLMs to infer user cognitive states and boost recall performance.

Valdemar Danry, Javier Hernandez, Andrew D Wilson +3

Eval Frameworks & Benchmarks · Multimodal Models · Natural Language Processing +1

Apr 6, 2026

MIT CSAIL · Apr 6, 2026 · also Stanford HAI, Improbable AI Lab, UIUC, University of California

Decocted Experience Improves Test-Time Inference in LLM Agents

Forget brute-force scaling: crafting the *right* context from past experiences unlocks surprisingly large gains in LLM agent performance.

Maohao Shen, Kaiwen Zha, Zexue He +6

Inference & Quantization · Reasoning & Chain-of-Thought · Tool Use & Agents

Apr 6, 2026 · also MIT CSAIL, UC Santa Barbara

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

LLM agent skills, despite their promise, often fail in realistic settings, with performance plummeting to no-skill baselines when agents must retrieve skills from a large, uncurated collection.

Yujian Liu, Jiabao Ji, Li An +4

Eval Frameworks & Benchmarks · Tool Use & Agents

Apr 2, 2026

Stanford HAI · Apr 2, 2026 · also MIT CSAIL, McGill, SambaNova

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

LLM agents can autonomously outperform fixed evolutionary search by 3-10x on open-ended discovery tasks when given persistent memory, asynchronous collaboration, and heartbeat-based interventions.

Ao Qu, Handi Zheng, Zijian Zhou +14

Natural Language Processing · Scientific Discovery & Drug Design · Tool Use & Agents

Mar 31, 2026

MIT CSAIL · Mar 31, 2026 · also Caltech, Department of Civil and Environmental, Department of Computing and Mathematical, Georgia Tech +7

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Stop rewarding all LLM-generated candidates equally: ShapE-GRPO uses Shapley values to fairly distribute credit within sets, leading to better training and faster convergence.

Rui Ai, David Simchi-Levi, Chonghuan Wang

Recommendation & Information Retrieval · RLHF & Preference Learning · Tool Use & Agents

Mar 4, 2026

Vals AI · Mar 4, 2026 · also MIT CSAIL

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Building a complete web application from scratch remains a surprisingly hard task for even the best AI models, with top performance at only 58% accuracy on a new end-to-end benchmark.

Hung Tran, Langston Nashold, Rayan Krishnan +1

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Mar 3, 2026

MIT CSAIL · Mar 3, 2026

NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind

NeuroSkill(tm) offers real-time, edge-based human-AI interaction by directly modeling human state of mind from BCI data, enabling more nuanced and empathetic agentic responses.

Eugene Hauptmann

Natural Language Processing · Robotics & Embodied AI · Tool Use & Agents

Feb 19, 2026

MIT CSAIL · Feb 19, 2026 · also BAIR, Princeton

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

VLMs are nowhere near human-level general intelligence: they score less than 10% of human performance across a diverse set of human-designed games, especially struggling with world-model learning, memory, and planning.

Lance Ying, Ryan Truong +18

Eval Frameworks & Benchmarks · Scalable Oversight & Alignment Theory · Tool Use & Agents