This paper introduces a four-axis framework (FRP, RCS, CRR, CAR) for evaluating long-horizon enterprise AI agents, arguing that single-scalar metrics obscure critical failure modes in factual precision, reasoning coherence, compliance, and calibrated abstention. The authors evaluate six memory architectures on LongHorizon-Bench, a new benchmark covering loan qualification and insurance claims adjudication. Results show that retrieval struggles with factual precision, schema-anchored architectures incur a scaffolding tax, and none of the six architectures ever abstains, which highlights the importance of calibrated abstention.
Aggregate accuracy metrics hide critical failures in long-horizon AI agents, such as retrieval's weak factual precision and a universal failure to abstain, which argues for a shift to multi-axis evaluation.
Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
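To illustrate what "separating coverage from accuracy" can mean in practice, here is a minimal Python sketch. The abstract does not give CAR's exact formula, so this is not the paper's metric: it only shows the general idea, under the assumption that an abstention is recorded as a missing prediction. The names `Case`, `coverage`, `selective_accuracy`, and `aggregate_accuracy` are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Case:
    """One benchmark case (hypothetical structure, not from the paper)."""
    truth: str                 # ground-truth decision, e.g. "approve" or "deny"
    prediction: Optional[str]  # agent's decision; None means the agent abstained


def coverage(cases: Sequence[Case]) -> float:
    """Fraction of cases on which the agent committed to a decision."""
    return sum(c.prediction is not None for c in cases) / len(cases)


def selective_accuracy(cases: Sequence[Case]) -> Optional[float]:
    """Accuracy over committed cases only; None if the agent never committed."""
    committed = [c for c in cases if c.prediction is not None]
    if not committed:
        return None
    return sum(c.prediction == c.truth for c in committed) / len(committed)


def aggregate_accuracy(cases: Sequence[Case]) -> float:
    """Single scalar: correct decisions over all cases (abstentions count as misses)."""
    return sum(c.prediction == c.truth for c in cases) / len(cases)


# Agent A commits on every case and gets half of them right.
agent_a = [
    Case("approve", "approve"),
    Case("deny", "deny"),
    Case("approve", "deny"),
    Case("deny", "approve"),
]

# Agent B abstains on the two cases it would have gotten wrong.
agent_b = [
    Case("approve", "approve"),
    Case("deny", "deny"),
    Case("approve", None),
    Case("deny", None),
]

for name, cases in [("A", agent_a), ("B", agent_b)]:
    print(name, aggregate_accuracy(cases), coverage(cases), selective_accuracy(cases))
```

Both toy agents score the same aggregate accuracy, but agent A commits on every case while agent B commits on half and is always right when it does. An architecture that "commits on every case", as the abstract reports for all six, has coverage 1.0 in this sketch no matter how accurate it is.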