Feb 22, 2026

arXiv:2602.19177

Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

Simon Münker, Kai Kugler, Michael Heseltine

AI Summary

The paper introduces a history-conditioned reply prediction task built on real-world X (formerly Twitter) data. The task targets the linguistic discrepancies that arise when LLMs are used as proxies for human participants in social science research. The authors compare LLM-generated and human-written replies using stylistic and content-based metrics. The analysis reveals significant discrepancies, pointing to the need for better prompting techniques and specialized datasets before LLM-generated content can be treated as valid in computational social science.

Key Contribution

LLM-generated social media content is measurably different from human-generated content, raising serious questions about using LLMs as proxies in social science research.

Abstract

The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
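To make the idea of a stylistic comparison concrete, here is a minimal sketch in Python. The abstract does not name the metrics the authors use, so the features below (token count, type-token ratio, hashtag and mention counts) and the function names `stylistic_features` and `mean_feature_gap` are illustrative assumptions, not the paper's method.

```python
import re
from statistics import mean

# Assumption: simple word tokenization; the paper's tokenizer is not specified.
TOKEN_RE = re.compile(r"\w+")


def stylistic_features(text: str) -> dict:
    """Compute a few surface-level style features for one reply (hypothetical feature set)."""
    tokens = TOKEN_RE.findall(text.lower())
    n = len(tokens)
    return {
        "n_tokens": n,
        # Share of distinct tokens; 0.0 for empty replies to avoid division by zero.
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_mentions": len(re.findall(r"@\w+", text)),
    }


def mean_feature_gap(human: list[str], generated: list[str]) -> dict:
    """Mean feature value of generated replies minus that of human replies.

    Both lists must be non-empty; a positive gap means the generated
    replies score higher on that feature on average.
    """
    h = [stylistic_features(t) for t in human]
    g = [stylistic_features(t) for t in generated]
    return {k: mean(f[k] for f in g) - mean(f[k] for f in h) for k in h[0]}
```

Calling `mean_feature_gap(human_replies, llm_replies)` on two lists of reply strings returns one gap per feature. A real evaluation would pair each generated reply with the human reply to the same conversation history and add content-based measures, which this sketch leaves out.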

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
