This paper investigates the relationship between behavioral consistency and accuracy in LLM-based agents on the SWE-bench software engineering benchmark, comparing Claude, GPT-5, and Llama. The study finds a correlation between higher consistency and higher accuracy across different models, but reveals that within a model, consistency amplifies both correct and incorrect interpretations. Specifically, the analysis shows that a significant portion of Claude's failures result from consistently making the same incorrect assumption, highlighting the importance of interpretation accuracy over execution consistency for reliable agent performance.
High consistency in LLM agents doesn't guarantee correctness; it just means they'll fail the same way every time.
As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2\%) and highest accuracy (58\%), GPT-5 is intermediate (CV: 32.2\%, accuracy: 32\%), and Llama shows the highest variance (CV: 47.0\%) with the lowest accuracy (4\%). However, within a model, our analysis reveals a critical nuance: \textbf{consistency amplifies outcomes rather than guaranteeing correctness}. 71\% of Claude's failures stem from ``consistent wrong interpretation'': making the same incorrect assumption across all runs. Interestingly, GPT-5 reaches early strategic agreement at nearly the same point as Claude (diverging at step 3.4 vs.\ 3.2) yet exhibits 2.1$\times$ higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
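The variance figures above use the coefficient of variation (CV), the standard deviation of per-run scores divided by their mean. A minimal sketch of that computation, using hypothetical per-run accuracy values (the paper's actual run-level data is not reproduced here):

```python
import statistics

def coefficient_of_variation(scores):
    """CV = sample standard deviation / mean, expressed as a percentage."""
    mean = statistics.mean(scores)
    return statistics.stdev(scores) / mean * 100

# Hypothetical accuracies for one model across 5 runs (illustrative only)
runs = [0.55, 0.60, 0.58, 0.62, 0.55]
cv = coefficient_of_variation(runs)
```

A low CV means the model's score barely moves between runs, which, as the abstract notes, says nothing about whether the shared behavior is correct.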