This paper introduces Bayesian-Agent, a framework for evolving the skills and standard operating procedures (SOPs) of large language model (LLM) agents by treating each one as a hypothesis to be tested against verified trajectory evidence. Bayesian-Agent maintains a feature-conditioned categorical posterior over each skill and maps the posterior state to inspectable actions such as patch, split, compress, retire, and explore, so that changes to agent behavior can be audited. With deepseek-v4-flash, incremental repair raises task success from 80% to 95% on SOP-Bench, from 90% to 100% on Lifelong AgentBench, and from 45% to 65% on RealFin-Bench. The evaluation also covers negative and saturated settings, which the authors take as support for posterior-guided optimization over uncalibrated prompt accumulation.
Bayesian-Agent treats LLM agent skills as hypotheses and revises them through posterior-guided optimization, reaching 100% on Lifelong AgentBench (up from 90%).
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
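To make the idea of a feature-conditioned categorical posterior concrete, the following is a minimal sketch and not the released implementation. It assumes a symmetric Dirichlet prior over three outcome categories, one posterior per feature key, and fixed thresholds for choosing an action. The class `SkillPosterior`, the function `choose_action`, the outcome labels, the threshold values, and the `keep` action are all hypothetical names introduced for illustration. Of the actions listed in the abstract, only `patch`, `retire`, and `explore` are modeled here.

```python
from collections import defaultdict

# Hypothetical outcome categories for a verified trajectory (assumption).
OUTCOMES = ("success", "failure", "error")


class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, kept per feature key."""

    def __init__(self, prior=1.0):
        # Symmetric Dirichlet pseudo-count per outcome (assumed, not from the paper).
        self.prior = prior
        self.counts = defaultdict(lambda: {o: 0 for o in OUTCOMES})

    def record(self, features, outcome):
        """Add one verified trajectory outcome under a hashable feature key."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[features][outcome] += 1

    def evidence(self, features):
        """Number of verified trajectories observed for this feature key."""
        return sum(self.counts.get(features, {}).values())

    def mean(self, features):
        """Posterior mean probability of each outcome for this feature key."""
        observed = self.counts.get(features, {})
        total = self.evidence(features) + self.prior * len(OUTCOMES)
        return {o: (observed.get(o, 0) + self.prior) / total for o in OUTCOMES}


def choose_action(posterior, features, min_evidence=5, low=0.4, high=0.8):
    """Map posterior state to an action using illustrative thresholds."""
    if posterior.evidence(features) < min_evidence:
        return "explore"  # too little verified evidence to judge the skill
    p_success = posterior.mean(features)["success"]
    if p_success < low:
        return "retire"   # the skill rarely succeeds in this context
    if p_success < high:
        return "patch"    # the skill works sometimes; add a failure-mode patch
    return "keep"         # hypothetical no-op; not one of the paper's actions


if __name__ == "__main__":
    post = SkillPosterior()
    key = ("sql_task", "long_context")  # hypothetical feature key
    for outcome in ["success"] * 8 + ["failure"] * 2:
        post.record(key, outcome)
    print(post.mean(key), choose_action(post, key))
```

The prior is what separates this from treating raw counts as belief: with only a handful of trajectories the posterior mean stays close to uniform, and the `min_evidence` check routes the skill to `explore` rather than to a premature `patch` or `retire`.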