This paper introduces a black-box reliability certification method for AI systems, using self-consistency sampling to reduce uncertainty and conformal calibration to guarantee correctness bounds. The method provides a single reliability score for a system-task pair with finite-sample guarantees, serving as a deployment gate. Experiments across benchmarks and models show that stronger models achieve higher reliability levels, and the method maintains high conditional coverage on solvable items while reducing API costs through sequential stopping.
Trust your AI's output less: this method gives you a single, guaranteed reliability score for any black-box AI system on any task.
Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors, which are made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
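To make the calibration step concrete, here is a minimal sketch of generic split conformal prediction applied to self-consistency votes. It is not the paper's implementation: the rank-based nonconformity score and the helper names (`rank_answers`, `nonconformity`, `conformal_threshold`, `answer_set`) are assumptions chosen for illustration. Under exchangeability of calibration and test items, taking the k-th smallest calibration score with k = ceil((n+1)(1-alpha)) gives coverage of at least 1-alpha; the matching upper bound of 1-alpha+1/(n+1) additionally requires tie-free scores, which integer ranks do not provide.

```python
import math
from collections import Counter


def rank_answers(samples):
    """Order distinct sampled answers by vote count, most frequent first."""
    return [answer for answer, _ in Counter(samples).most_common()]


def nonconformity(samples, truth):
    """Assumed score: 1-based rank of the true answer; inf if never sampled."""
    ranked = rank_answers(samples)
    return ranked.index(truth) + 1 if truth in ranked else math.inf


def conformal_threshold(cal_scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration items for this alpha
    return sorted(cal_scores)[k - 1]


def answer_set(samples, threshold):
    """Keep the top-ranked answers up to the calibrated rank threshold."""
    ranked = rank_answers(samples)
    if math.isinf(threshold):
        return ranked  # fall back to every answer that was sampled
    return ranked[: int(threshold)]


# Hypothetical calibration scores: rank of the true answer on 7 labelled items.
cal_scores = [1, 1, 2, 1, 3, 1, 1]
threshold = conformal_threshold(cal_scores, alpha=0.25)  # k = 6, so threshold is 2

# Six hypothetical samples for a new question: the set keeps the top two answers.
print(answer_set(["42", "42", "42", "41", "41", "40"], threshold))  # ['42', '41']
```

A larger threshold yields larger answer sets, which is how a harder task or a weaker system shows up in this kind of scheme. An infinite score marks an item whose correct answer was never sampled, so no answer set built from the samples can cover it.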