May 27, 2026 | arXiv:2605.28179

SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

AI Summary

The paper introduces SuperValid, a framework for synthesizing out-of-distribution validation data aligned with specific capabilities, aiming to improve the prediction of downstream performance of LLMs. SuperValid distills core concepts from benchmarks within a capability domain and expands them into diverse, knowledge-rich texts to create a more robust validation set. Experiments across 17 benchmarks and 6 capability domains demonstrate that SuperValid loss correlates strongly with downstream performance, enabling better model selection and scaling decisions compared to IID validation.

Key Contribution

SuperValid's capability-aligned validation loss tracks downstream LLM performance across architectures, scales, and training data distributions, and can be computed during training without running benchmark evaluations.

Abstract

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.
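The central claim above is a correlation between a validation loss and capability-level downstream performance. The following is a minimal sketch of how such a check could be set up, not the authors' procedure: it assumes a capability score is the plain mean of benchmark accuracies within a domain, uses Spearman rank correlation as the statistic, and runs on invented checkpoint numbers. The names `capability_score`, `bench_a`, and `bench_b` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def capability_score(benchmark_scores):
    """Assumption: a capability score is the mean accuracy over its benchmarks."""
    return mean(benchmark_scores.values())


# Hypothetical checkpoints: invented numbers for illustration only.
checkpoints = [
    {"val_loss": 2.90, "benchmarks": {"bench_a": 0.31, "bench_b": 0.28}},
    {"val_loss": 2.70, "benchmarks": {"bench_a": 0.36, "bench_b": 0.30}},
    {"val_loss": 2.55, "benchmarks": {"bench_a": 0.41, "bench_b": 0.37}},
    {"val_loss": 2.40, "benchmarks": {"bench_a": 0.47, "bench_b": 0.40}},
]

losses = [c["val_loss"] for c in checkpoints]
scores = [capability_score(c["benchmarks"]) for c in checkpoints]
rho = spearman(losses, scores)
print(f"Spearman correlation (loss vs. capability score): {rho:.3f}")
```

A strongly negative value means lower validation loss goes with higher capability scores, which is the pattern a useful validation set would need to show. Rank correlation is used here only because it does not presume a linear relationship between loss and accuracy.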

Eval Frameworks & Benchmarks | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
