Imperial, LMU, Munich Center for Machine Learning, UofT, William & Mary | May 25, 2026 | arXiv:2605.25998

Causal methods for LLM development and evaluation

Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel

AI Summary

This paper argues that many questions in LLM development, such as the impact of pretraining data or prompt routing strategies, are inherently causal and thus well-suited to causal inference methods. It identifies opportunities for applying causal methods across the LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation, where observational data and biased judges can confound purely predictive approaches. The authors advocate for greater adoption of causal methods to ensure more reliable and scientifically grounded LLM design.

Key Contribution

The paper argues that LLM development underuses causal inference, which leaves pipelines exposed to confounding and distribution shifts throughout pretraining, alignment, and evaluation.

Abstract

Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings, where interventions change outcomes, but they are surprisingly underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can improve modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods across the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized in the LLM development and evaluation pipeline, even though such methods can support a reliable and scientifically grounded design.
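The abstract's point about confounded logged data can be made concrete with a small sketch. The following is an illustration only and is not taken from the paper: it simulates hypothetical routing logs in which hard prompts are mostly sent to a larger model, then compares a naive difference in means with a self-normalized inverse propensity weighting (IPW) estimate. All names (`simulate_logs`, `naive_difference`, `ipw_effect`), the quality scores, the routing probabilities, and the assumption that the logging propensities are known and recorded are made up for this example.

```python
import random


def simulate_logs(n, seed=0):
    """Simulate hypothetical routing logs (all numbers are illustrative).

    Each record is (hard, large, p_large, quality), where p_large is the
    logged probability that the prompt was routed to the larger model.
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        hard = rng.random() < 0.5              # prompt difficulty (the confounder)
        p_large = 0.9 if hard else 0.1         # logging policy favors the large model on hard prompts
        large = rng.random() < p_large         # routing decision actually taken
        base = 0.4 if hard else 0.8            # hard prompts score lower regardless of model
        effect = 0.1 if large else 0.0         # assumed true gain from the larger model
        quality = base + effect + rng.gauss(0, 0.05)
        logs.append((hard, large, p_large, quality))
    return logs


def naive_difference(logs):
    """Difference in mean quality between large-model and small-model records."""
    treated = [q for _, t, _, q in logs if t]
    control = [q for _, t, _, q in logs if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_effect(logs):
    """Self-normalized IPW estimate of the average effect of routing to the large model."""
    num1 = sum(q / p for _, t, p, q in logs if t)
    den1 = sum(1 / p for _, t, p, _ in logs if t)
    num0 = sum(q / (1 - p) for _, t, p, q in logs if not t)
    den0 = sum(1 / (1 - p) for _, t, p, _ in logs if not t)
    return num1 / den1 - num0 / den0


if __name__ == "__main__":
    logs = simulate_logs(20000)
    # The naive comparison mixes the model effect with prompt difficulty,
    # so it should come out negative even though the simulated effect is +0.1.
    print("naive difference:", round(naive_difference(logs), 3))
    print("IPW estimate:    ", round(ipw_effect(logs), 3))
```

In this toy setup the reweighting recovers the simulated effect only because the propensities are known and every prompt type has a nonzero chance of reaching either model; with real logs both conditions would need to be argued or estimated.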

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
