May 28, 2026

arXiv:2605.30218

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

Kexin Chu, Yang Zhou, Wei Zhang

AI Summary

The paper introduces MarginGate, a method for deterministic BF16 LLM inference that verifies and corrects a token only when a low top-1/top-2 logit margin signals a possible batch-induced flip. Because such flips are sparse during batched inference, MarginGate avoids most of the overhead of always-on verification approaches such as LLM-42. Experiments across multiple models and datasets show that it restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with verifier trigger rates of 18.56% and 15.05%, reducing LLM-42's latency increment by 2.23x and 1.99x.

Key Contribution

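Deterministic LLM inference roughly halves the latency that LLM-42's always-on verification adds, by verifying only low-margin tokens (15-19% of decode steps on Llama-3.1-8B and Qwen2.5-14B), since only about 1% of tokens actually flip.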

Abstract

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, ShareGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
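The verifier policy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `verify` callback (a stand-in for whatever batch-invariant recomputation the real system uses), and the threshold of 0.5 are all hypothetical, and the K/V repair step is only indicated in a comment rather than performed.

```python
from typing import Callable, List, Tuple


def argmax(values: List[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top2_margin(logits: List[float]) -> float:
    """Gap between the largest and second-largest logit."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def margin_gated_step(
    fast_logits: List[float],
    verify: Callable[[], List[float]],  # hypothetical reference recomputation
    threshold: float,                   # hypothetical; the paper calibrates on MATH500
) -> Tuple[int, bool, bool]:
    """Return (token, verifier_triggered, repaired) for one decode step."""
    fast_token = argmax(fast_logits)
    if top2_margin(fast_logits) >= threshold:
        # High-margin step: keep the fast BF16 result and skip the verifier.
        return fast_token, False, False
    # Low-margin step: consult the assumed batch-invariant reference.
    ref_token = argmax(verify())
    if ref_token != fast_token:
        # Confirmed mismatch: emit the reference token. A real system would
        # also replace this step's K/V entries, which this sketch omits.
        return ref_token, True, True
    return fast_token, True, False


# Toy decode steps: (fast batched logits, reference logits). Values are made up.
steps = [
    ([5.0, 1.0, 0.5], [5.0, 1.0, 0.5]),      # confident: verifier skipped
    ([2.00, 2.05, 0.1], [2.05, 2.00, 0.1]),  # near-tie whose top-1 flips
    ([3.0, 2.9, 0.0], [3.0, 2.9, 0.0]),      # near-tie, verified, no flip
]
results = [
    margin_gated_step(fast, lambda ref=ref: ref, threshold=0.5)
    for fast, ref in steps
]
trigger_rate = sum(triggered for _, triggered, _ in results) / len(results)
```

In this toy run the verifier fires on two of the three steps and repairs one of them. The gap between the trigger rate and the repair rate mirrors the abstract's observation that triggers (15-19%) far exceed actual flips (about 1%): a low margin flags risk, not a confirmed flip.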

Inference & Quantization, Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
