Feb 17, 2026 · arXiv:2602.15785

This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw

AI Summary

This paper investigates the validity of using LLMs as synthetic participants in social science experiments, contrasting heuristic approaches that aim for behavioral interchangeability with statistical calibration methods that adjust for discrepancies using auxiliary human data. It argues that while heuristic approaches are suitable for exploratory research, statistical calibration offers formal statistical guarantees for confirmatory research under explicit assumptions. The study highlights that statistical calibration can provide more precise and cost-effective causal effect estimates compared to relying solely on human participants, while also emphasizing the importance of considering how well LLMs approximate the relevant populations.

Key Contribution

LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.

Abstract

A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
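The abstract does not say which estimator statistical calibration uses. As a rough illustration only, the sketch below assumes a simple difference-style correction: average the LLM responses over a large simulated sample, then shift that average by the mean human-minus-simulation gap measured on a small paired sample. The function names, argument names, and toy numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def calibrated_mean(sim_large, sim_paired, human_paired):
    """Mean of a large simulated sample, corrected by the average
    human-minus-simulation gap observed on a small paired sample.

    Assumed form (not from the paper): mean(sim_large) + mean(human - sim).
    """
    if len(sim_paired) != len(human_paired):
        raise ValueError("paired samples must have the same length")
    # Average discrepancy between human and simulated responses
    # on the items where both are available.
    correction = mean(h - s for h, s in zip(human_paired, sim_paired))
    return mean(sim_large) + correction


def calibrated_effect(treatment, control):
    """Difference in calibrated means between two experimental arms.

    Each arm is a dict with keys sim_large, sim_paired, human_paired.
    """
    return calibrated_mean(**treatment) - calibrated_mean(**control)


# Toy, made-up numbers purely for illustration.
treatment_arm = {
    "sim_large": [4, 4, 4, 4],   # many cheap simulated responses
    "sim_paired": [4, 4],        # simulated responses with human counterparts
    "human_paired": [5, 5],      # the small auxiliary human sample
}
control_arm = {
    "sim_large": [2, 2, 2, 2],
    "sim_paired": [2, 2],
    "human_paired": [2, 2],
}

effect = calibrated_effect(treatment_arm, control_arm)
```

In this toy case the simulation understates the treatment arm by one unit, so the correction moves the estimated effect from 2 (simulation only) to 3. A real analysis would also need a variance estimate and the explicit assumptions the abstract refers to, which this sketch leaves out.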

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
