GISTBench, a new benchmark, evaluates LLMs on their ability to extract and verify user interests from interaction histories, moving beyond traditional item prediction accuracy in RecSys. It introduces two metric families, Interest Groundedness (IG) and Interest Specificity (IS), to assess the accuracy and distinctiveness of LLM-predicted user profiles. Experiments on a synthetic dataset built from real user interactions show that current LLMs struggle to accurately count and attribute engagement signals across heterogeneous interaction types.
LLMs still struggle to accurately infer user interests from interaction histories, especially when interpreting diverse engagement signals, which is a critical gap for effective personalization.
We introduce GISTBench, a benchmark for evaluating large language models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed from real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals as well as rich textual descriptions. We validate our dataset's fidelity against user surveys and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
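The abstract's decomposition of Interest Groundedness into precision and recall can be illustrated with a minimal set-based sketch. This is an assumption about the metric's shape, not the paper's exact formulation: it treats predicted and ground-truth user interests as sets of category labels, where precision penalizes hallucinated categories and recall rewards coverage. The function names and example labels below are hypothetical.

```python
# Hedged sketch of IG precision/recall over interest-category sets.
# Assumes a simple set-overlap formulation; the paper's actual metric
# may weight categories by engagement strength or verification status.

def ig_precision(predicted: set, ground_truth: set) -> float:
    """Fraction of predicted interest categories that are grounded
    in the user's true interests (penalizes hallucinations)."""
    if not predicted:
        return 0.0
    return len(predicted & ground_truth) / len(predicted)

def ig_recall(predicted: set, ground_truth: set) -> float:
    """Fraction of true interest categories the LLM profile covers
    (rewards coverage)."""
    if not ground_truth:
        return 0.0
    return len(predicted & ground_truth) / len(ground_truth)

# Hypothetical example: one hallucinated and one missed category.
pred = {"cooking", "travel", "astrology"}   # "astrology" is hallucinated
truth = {"cooking", "travel", "fitness"}    # "fitness" is missed
print(round(ig_precision(pred, truth), 3))  # 0.667
print(round(ig_recall(pred, truth), 3))     # 0.667
```

Under this reading, a profile listing every plausible category scores high recall but low precision, while an overly cautious profile does the reverse, which is why the two components are reported separately.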