This paper adapts the Reliable Change Index (RCI) from clinical psychology to evaluate item-level performance changes between LLM versions, using 2,000 MMLU-Pro items with 10 samples each. Applying RCI to Llama 3 → 3.1 and Qwen 2.5 → 3 reveals that aggregate accuracy gains mask substantial bidirectional item-level churn: among analysable items, 34% improved and 28% deteriorated for Llama, and 47% improved and 39% deteriorated for Qwen. The study finds that greedy single-shot evaluation misses 42% of reliably changed items and falsely flags 25% of unchanged ones, highlighting the need to report churn rate alongside aggregate accuracy.
LLM upgrades mix progress and decay: despite overall gains, up to 39% of analysable questions get *worse* after an update, and greedy single-shot evals miss 42% of reliably changed items.
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were at floor or ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, while high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost ground in physics while Qwen lost ground in law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
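To make the item-level procedure concrete, here is a minimal sketch of how an RCI-style classification and a churn rate could be computed from per-item sample counts. It is not the paper's implementation. The abstract does not give the exact error term, the floor/ceiling rule, or the definition of churn rate, so the sketch assumes a binomial standard error for the difference, treats an item as floor/ceiling when both versions score identically at 0 or 1, and defines churn rate as the share of analysable items with a reliable change in either direction. The names `rci`, `classify`, and `churn_summary` are hypothetical, and the counts in the usage lines are toy values, not results from the paper.

```python
import math

Z_CRIT = 1.96  # two-sided 95% cutoff used in Jacobson and Truax (1991)


def rci(k_old, k_new, n_samples):
    """RCI-style score for one item, or None if the item is not analysable.

    k_old, k_new: correct answers out of n_samples for the old/new version.
    ASSUMPTION: binomial standard error of the difference between two
    proportions; the paper's exact error term is not stated in the abstract.
    """
    p_old = k_old / n_samples
    p_new = k_new / n_samples
    var = (p_old * (1 - p_old) + p_new * (1 - p_new)) / n_samples
    if var == 0:
        if p_old == p_new:
            # ASSUMPTION: both versions at the same floor or ceiling.
            return None
        # Complete flip (0 -> 1 or 1 -> 0): no sampling noise observed.
        return math.copysign(math.inf, p_new - p_old)
    return (p_new - p_old) / math.sqrt(var)


def classify(k_old, k_new, n_samples=10):
    """Label one item by comparing its score against the cutoff."""
    score = rci(k_old, k_new, n_samples)
    if score is None:
        return "floor_ceiling"
    if score >= Z_CRIT:
        return "improved"
    if score <= -Z_CRIT:
        return "deteriorated"
    return "no_reliable_change"


def churn_summary(pairs, n_samples=10):
    """Count labels and compute churn rate and net accuracy change.

    ASSUMPTION: churn rate = (improved + deteriorated) / analysable items.
    """
    counts = {"improved": 0, "deteriorated": 0,
              "no_reliable_change": 0, "floor_ceiling": 0}
    for k_old, k_new in pairs:
        counts[classify(k_old, k_new, n_samples)] += 1
    analysable = len(pairs) - counts["floor_ceiling"]
    changed = counts["improved"] + counts["deteriorated"]
    total_samples = len(pairs) * n_samples
    net_delta = (sum(k_new for _, k_new in pairs)
                 - sum(k_old for k_old, _ in pairs)) / total_samples
    return {
        "counts": counts,
        "churn_rate": changed / analysable if analysable else 0.0,
        "net_accuracy_delta": net_delta,
    }


# Toy (old, new) correct counts out of K=10 samples per item.
toy_pairs = [(10, 10), (0, 0), (2, 9), (9, 2), (5, 6), (0, 10)]
summary = churn_summary(toy_pairs)
print(summary)
```

In the toy data, one item improves and another deteriorates by the same amount, so their contributions to aggregate accuracy cancel while both count toward churn. This is the sense in which the aggregate gain is a net residual of opposing item-level movements.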