Apr 27, 2026 arXiv:2604.24579

Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

AI Summary

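This paper introduces TraceToChain, a pipeline that fits LLM agent execution traces to an absorbing discrete-time Markov chain (DTMC), so that reliability is reported as a full success-time distribution rather than a single scalar. Transitions are estimated with Laplace-smoothed maximum-likelihood estimation (MLE), fit is checked with a composite Akaike information criterion (AIC) and Kolmogorov-Smirnov (KS) goodness-of-fit certificate, and uncertainty is reported as Dirichlet-posterior credible intervals alongside non-parametric bootstrap intervals. On seven controlled MAST-style frameworks with a 50/50 fit/test split, held-out empirical reliability decay curves (RDCs) stay within 0.053 of their analytic counterparts in the $L_\infty$ norm, and a two-sample KS test accepts the fitted chain on all seven. In this first-passage view, pass@k, pass^k, and the RDC are projections of one success-time distribution.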
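The estimation step can be illustrated with a minimal sketch. It is not the paper's implementation: the state labels, the toy traces, the helper names `fit_chain` and `success_time_cdf`, and the choice to smooth every possible target (including self-loops) with the same pseudo-count `alpha` are all assumptions made for illustration. The sketch estimates Laplace-smoothed transition probabilities from labeled traces and then propagates probability mass step by step to obtain the cumulative probability of reaching the success state within t steps.

```python
from collections import defaultdict

# Hypothetical labels for the two absorbing states (not from the paper).
SUCCESS, FAILURE = "SUCCESS", "FAILURE"


def fit_chain(traces, alpha=1.0):
    """Laplace-smoothed MLE of transition probabilities.

    Each trace is a list of state labels ending in SUCCESS or FAILURE.
    Returns (transient_states, P) with P[src][dst] the smoothed estimate.
    """
    counts = defaultdict(lambda: defaultdict(int))
    states = set()
    for trace in traces:
        states.update(trace)
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    transient = sorted(states - {SUCCESS, FAILURE})
    targets = transient + [SUCCESS, FAILURE]
    P = {}
    for src in transient:
        # Add alpha pseudo-counts to every target, then normalize the row.
        total = sum(counts[src].values()) + alpha * len(targets)
        P[src] = {dst: (counts[src][dst] + alpha) / total for dst in targets}
    return transient, P


def success_time_cdf(transient, P, start, horizon):
    """cdf[t-1] = probability of absorption in SUCCESS within t steps."""
    mass = {s: 0.0 for s in transient}
    mass[start] = 1.0
    cdf, absorbed = [], 0.0
    for _ in range(horizon):
        # Mass entering SUCCESS on this step, then one transient-to-transient step.
        absorbed += sum(mass[s] * P[s][SUCCESS] for s in transient)
        mass = {d: sum(mass[s] * P[s][d] for s in transient) for d in transient}
        cdf.append(absorbed)
    return cdf


# Toy traces, invented for illustration only.
traces = [
    ["plan", "act", SUCCESS],
    ["plan", "act", "plan", "act", FAILURE],
    ["plan", "act", SUCCESS],
    ["plan", FAILURE],
]
transient, P = fit_chain(traces)
cdf = success_time_cdf(transient, P, start="plan", horizon=50)
```

The last entry of `cdf` approximates the overall probability of eventual success from the chosen start state. A real pipeline would also need the state taxonomy that maps raw trace events to labels, which this sketch takes as given.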

Key Contribution

Scalar reliability metrics for LLM agents discard information: modeling execution traces as an absorbing Markov chain recovers the underlying success-time distribution and quantifies finite-trace uncertainty, giving a fuller account of agent behavior.
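The success-time distribution can be written out using standard absorbing-chain identities in the style of Kemeny and Snell. The following is a sketch under stated assumptions rather than the paper's own derivation: $\pi_0$ (the initial distribution over transient states) and $T_\oplus$ (the step at which the success state is entered) are symbols introduced here, and $\hat R_\oplus$ is read as the column of one-step probabilities into the success state.

```latex
\[
\Pr(T_\oplus = t) = \pi_0^\top \hat Q^{\,t-1} \hat R_\oplus ,
\qquad
F_\oplus(t) = \sum_{s=1}^{t} \pi_0^\top \hat Q^{\,s-1} \hat R_\oplus
            = \pi_0^\top \bigl(I - \hat Q^{\,t}\bigr)\bigl(I - \hat Q\bigr)^{-1} \hat R_\oplus ,
\]
\[
p_\oplus = \lim_{t \to \infty} F_\oplus(t) = \pi_0^\top \bigl(I - \hat Q\bigr)^{-1} \hat R_\oplus .
\]
```

If, as a further assumption, $k$ attempts are independent and each succeeds with probability $p_\oplus$, the usual definitions give pass$@k = 1 - (1 - p_\oplus)^k$ and pass$^k = p_\oplus^{\,k}$. How the paper itself defines these projections and the RDC is not spelled out in this summary.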

Abstract

Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass$@k$, pass$^k$, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present TraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC), $\hat M=(\hat Q,\hat R_\oplus,\hat R_\ominus)$, with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov--Smirnov (KS) goodness-of-fit certificate, and reports Dirichlet-posterior credible intervals and non-parametric bootstrap intervals. We adapt classical reliability mathematics (Kemeny--Snell~\cite{kemenysnell}, Cheung~\cite{cheung1980}, Goel--Okumoto~\cite{goelokt}) to agent traces. The resulting first-passage view reconciles metrics usually reported separately: pass$@k$, pass$^k$, and the RDC are projections of one success-time distribution. On seven controlled MAST-style frameworks with a strict 50/50 fit/test protocol, held-out empirical RDCs overlay their analytic counterparts with max $L_\infty^{\mathrm{RDC}} = 0.053$ (median $0.048$). A two-sample KS test on the first-passage cumulative distribution function (CDF) accepts the fitted chain with $p>0.05$ on $7/7$ frameworks (min $p = 0.78$), and per-entry $95\%$ posterior and bootstrap intervals agree to $\approx 0.01$ at the median.
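The per-entry posterior intervals mentioned above can be sketched as follows. This is an illustrative example, not the paper's code: the function name `dirichlet_credible_intervals`, the symmetric prior `alpha=1.0`, the number of draws, and the example counts are all assumptions. For one row of the transition matrix, the posterior under a symmetric Dirichlet prior is Dirichlet(counts + alpha), which can be sampled with the standard library by normalizing independent Gamma draws.

```python
import random


def dirichlet_credible_intervals(counts, alpha=1.0, draws=4000, level=0.95, seed=0):
    """Equal-tailed credible interval for each entry of one transition row.

    counts maps a next-state label to its observed transition count.
    The posterior is Dirichlet(counts + alpha), sampled as normalized Gammas.
    """
    rng = random.Random(seed)
    labels = list(counts)
    samples = {label: [] for label in labels}
    for _ in range(draws):
        gammas = [rng.gammavariate(counts[label] + alpha, 1.0) for label in labels]
        total = sum(gammas)
        for label, g in zip(labels, gammas):
            samples[label].append(g / total)
    # Indices of the empirical lower and upper quantiles of the sorted draws.
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    intervals = {}
    for label in labels:
        xs = sorted(samples[label])
        intervals[label] = (xs[lo_idx], xs[hi_idx])
    return intervals


# Example row: transitions observed out of a hypothetical "plan" state.
row_counts = {"act": 4, "FAILURE": 1, "SUCCESS": 0}
intervals = dirichlet_credible_intervals(row_counts)
```

A non-parametric bootstrap counterpart would resample whole traces with replacement, refit the chain on each resample, and take the same quantiles of each entry, which allows the two interval types to be compared entry by entry.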

Eval Frameworks & Benchmarks, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
