The paper introduces InterviewSim, a large-scale framework for evaluating personality simulation in LLMs. It grounds generation in a dataset of more than 671,000 question-answer pairs extracted from 23,000 interview transcripts across 1,000 public figures. It proposes a multi-dimensional evaluation framework with four metrics (content similarity, factual consistency, personality alignment, and factual knowledge retention), so that simulated personalities can be assessed directly against what the individuals actually said. The authors demonstrate that interview-grounded methods outperform those relying on biographical profiles or parametric knowledge. They also reveal a trade-off in how the interview data is used: retrieval-augmented methods better capture personality style, while chronological-based methods better preserve factual consistency.
LLMs simulate personalities better when grounded in real interview data, but how that data is used matters: retrieval-augmented methods favor stylistic accuracy, while chronological methods favor factual consistency.
Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.
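The contrast between the two ways of using interview data can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how either method selects context, so the `QAPair` record, the word-overlap similarity, the "most recent k pairs" reading of the chronological approach, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Hypothetical record for one extracted question-answer pair."""
    date: str      # ISO date of the interview, so string order is time order
    question: str
    answer: str


def _tokens(text):
    # Naive lowercase whitespace tokenizer; a real system would likely
    # use embeddings instead (assumption).
    return set(text.lower().split())


def select_by_retrieval(pairs, query, k):
    """Pick the k pairs whose questions overlap most with the query
    (Jaccard similarity over word sets)."""
    q = _tokens(query)

    def score(pair):
        t = _tokens(pair.question)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(pairs, key=score, reverse=True)[:k]


def select_chronologically(pairs, k):
    """Pick the k most recent pairs, returned in time order,
    regardless of the query."""
    ordered = sorted(pairs, key=lambda pair: pair.date)
    return ordered[max(len(ordered) - k, 0):]


# Invented sample data, for illustration only.
SAMPLE = [
    QAPair("2015-03-10", "What inspired your first album?", "My hometown."),
    QAPair("2018-07-22", "How do you prepare for a tour?", "Lots of rehearsal."),
    QAPair("2021-11-05", "What is next for you?", "A long break."),
]

relevant = select_by_retrieval(SAMPLE, "what inspired the album", k=1)
recent = select_chronologically(SAMPLE, k=2)
```

In this sketch, retrieval picks context by relevance to the question being asked, while the chronological selection ignores the question and keeps the interview record in time order. That difference is one plausible reading of why the two families trade off style against factual consistency.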