This paper introduces MyScholarQA (MySQA), a personalized deep research tool that infers user research interests, proposes personalized actions for queries, and generates multi-section reports. The authors initially evaluate MySQA using a synthetic user benchmark with LLM judges, demonstrating superior performance compared to baselines in citation metrics and action-following. However, through user interviews with a live version of MySQA, they uncover nine nuanced errors undetectable by LLM judges, highlighting the limitations of synthetic evaluation and the necessity of real-user feedback for personalization in deep research tools.
Synthetic benchmarks can't catch the nuances of personalized deep research, as real users revealed nine critical errors that LLM judges missed entirely.
Deep Research (DR) tools (e.g., OpenAI DR) help researchers cope with ballooning publication counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that with MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR that users value, so we interview users of an online version of MySQA to surface those aspects. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.
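The three-stage pipeline the abstract describes (infer a profile, propose actions, write an action-following report) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the function names, the keyword-based profile, and the placeholder report text are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    # Hypothetical profile: research interests inferred from the user's history.
    interests: list

def infer_profile(history: list) -> UserProfile:
    # Stage 1 (sketch): infer interests; here naively deduplicated keywords.
    return UserProfile(interests=sorted(set(history)))

def propose_actions(profile: UserProfile, query: str) -> list:
    # Stage 2 (sketch): one personalized action per inferred interest.
    return [f"Relate '{query}' to {topic}" for topic in profile.interests]

def write_report(query: str, approved_actions: list) -> dict:
    # Stage 3 (sketch): a multi-section report, one section per user-approved action.
    sections = [{"action": a, "text": "..."} for a in approved_actions]
    return {"query": query, "sections": sections}

profile = infer_profile(["RLHF", "evaluation"])
actions = propose_actions(profile, "LLM judges")
report = write_report("LLM judges", actions[:1])  # the user approves a subset
```

The key design point the abstract emphasizes is the human-in-the-loop step between stages 2 and 3: the report follows only the actions the user approved, not everything the system proposed.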