CASHKUUniversity of LondonZJUJun 4, 2026arXiv:2606.06324

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, Qing Wang

AI Summary

This paper introduces HarnessFix, a trace-guided framework designed to diagnose and repair failures in LLM agent harnesses by compiling execution traces into a Harness-aware Trace Intermediate Representation (HTIR). By effectively attributing failures to specific trajectory steps and harness layers, HarnessFix enables targeted repairs that improve agent performance without introducing regressions. Evaluations across multiple benchmarks demonstrate that HarnessFix enhances performance by 15.2% to 50% compared to initial harnesses and outperforms existing baselines, revealing systematic harness-flaw patterns.

Key Contribution

HarnessFix reveals that targeted, trace-guided repairs can significantly enhance LLM agent performance by addressing specific harness flaws rather than relying on broad modifications.

Abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%--50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

Related Papers