This paper introduces a framework that uses Confirmatory Factor Analysis (CFA) and Generalizability Theory to decompose variance in AI benchmark rankings, applied to more than 4,000 models from the Open LLM Leaderboard. The analysis finds that structures assumed in current reporting practice underestimate the strength of relationships between benchmarks, that local dependence among leaderboard items undermines the use of benchmarks as measurement instruments under current scoring systems, and that contributor metadata explains more rank-relevant variance (about 9%) than architecture or deployment categories. The study also finds that the latent general-factor size slope is far more stable than the manifest-score scaling slope, and it offers diagnostics for judging when benchmark rankings can be trusted and how benchmark design can be improved.
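Leaderboard rankings carry substantial measurement noise: contributor metadata explains more rank-relevant variance than architecture, and manifest-score scaling-law slopes have low reliability, while latent general-factor slopes remain stable.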
While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx 9\%$) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability ($R_\beta=0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g=0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.
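To make the variance-decomposition idea concrete, the following is a minimal sketch of a one-facet generalizability study in which models are crossed with benchmarks. It is not the paper's pipeline: the function name `g_study`, the synthetic scores, and the single-facet design are illustrative assumptions, and the analysis described above involves further facets (for example contributor metadata) that this sketch omits.

```python
import numpy as np

def g_study(scores):
    """One-facet crossed G-study (models x benchmarks), one score per cell.

    Hypothetical helper for illustration; returns ANOVA-based variance
    component estimates and the relative generalizability coefficient.
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape                      # n_p models, n_i benchmarks
    grand = x.mean()
    row = x.mean(axis=1, keepdims=True)     # per-model means
    col = x.mean(axis=0, keepdims=True)     # per-benchmark means

    # Mean squares from the two-way ANOVA without replication.
    ms_p = n_i * ((row - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((col - grand) ** 2).sum() / (n_i - 1)
    ms_res = ((x - row - col + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))

    # Expected-mean-square estimators; negative estimates are clipped to 0.
    var_p = max((ms_p - ms_res) / n_i, 0.0)   # model (object of measurement)
    var_i = max((ms_i - ms_res) / n_p, 0.0)   # benchmark difficulty
    var_res = ms_res                          # interaction + error, confounded

    # Relative G coefficient: reliability of rankings based on the
    # mean over n_i benchmarks.
    denom = var_p + var_res / n_i
    g_rel = var_p / denom if denom > 0 else 0.0
    return {"model": var_p, "benchmark": var_i,
            "residual": var_res, "g_relative": g_rel}

# Synthetic data (assumed, not from the leaderboard): 200 models, 6 benchmarks.
rng = np.random.default_rng(0)
ability = rng.normal(0.0, 1.0, size=(200, 1))
difficulty = rng.normal(0.0, 0.5, size=(1, 6))
noise = rng.normal(0.0, 0.7, size=(200, 6))
result = g_study(ability + difficulty + noise)
print(result)
```

In this simplified design the relative coefficient plays a role analogous to the reliabilities quoted above: values near 1 mean model rankings would be stable across comparable benchmark sets, while lower values indicate that much of the ranking variance is attributable to noise.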