TU Munich · May 28, 2026 · arXiv:2605.29668

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, M. Hadamitzky, Daniel Rueckert, Lisa Adams, Keno K. Bressem

AI Summary

GRASP addresses regression in self-improving LLM agents with a gate that validates each candidate skill against a held-out probe and admits it only if it yields a net improvement under a hard regression budget. Agent improvement is treated as a sequence of edits to a bounded skill library, so a candidate that fixes some behavior while breaking too much existing behavior is rejected. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, 21.0 points above the strongest of five self-improvement baselines, and improves the other four base models by 17.2 to 40.3 points. Ablations indicate that skill writing without validation is no better than using no skills.

Key Contribution

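On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8% by admitting new skills only when they pass a validation gate with a hard regression budget. Ablations attribute the gain to the proposal and gating mechanism rather than to skill writing itself, suggesting that *how* skills are admitted matters more than *what* is written.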

Abstract

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
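The acceptance rule described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `gate`, the `GateDecision` type, the per-task boolean outcomes, and the choice to count the budget in whole probe tasks are all assumptions made for illustration. The sketch only shows the two conditions the abstract names, a net improvement on the probe and a hard cap on regressions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GateDecision:
    accepted: bool
    fixes: int        # probe tasks that went from fail to pass
    regressions: int  # probe tasks that went from pass to fail


def gate(before: Sequence[bool], after: Sequence[bool],
         regression_budget: int = 0) -> GateDecision:
    """Hypothetical acceptance gate for one candidate skill edit.

    `before` and `after` hold per-task pass/fail outcomes on the same
    held-out probe, without and with the candidate edit applied.
    Assumption: the budget is a maximum number of regressed probe tasks.
    """
    if len(before) != len(after):
        raise ValueError("probe outcomes must cover the same tasks")
    fixes = sum(1 for b, a in zip(before, after) if not b and a)
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    # Accept only on a net improvement AND within the hard regression budget.
    accepted = (fixes - regressions) > 0 and regressions <= regression_budget
    return GateDecision(accepted, fixes, regressions)


# Illustrative probe outcomes (made-up data, not results from the paper).
baseline = [True, True, False, False, False]
clean_candidate = [True, True, True, True, False]   # 2 fixes, 0 regressions
risky_candidate = [False, True, True, True, True]   # 3 fixes, 1 regression

clean = gate(baseline, clean_candidate)
risky_strict = gate(baseline, risky_candidate, regression_budget=0)
risky_loose = gate(baseline, risky_candidate, regression_budget=1)
```

Under these assumptions, `risky_candidate` has a positive net gain but is still rejected with a budget of zero, which is the case the abstract describes as a note that fixes one trajectory while silently regressing another.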

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
