This paper presents a systematic reproducibility study of generative recommendation models, focusing on user and item cold-start scenarios. The authors disentangle the effects of model scale, identifier design, and training strategy on cold-start performance, three factors that existing studies often change together and therefore confound. Their analysis reveals that reported gains in cold-start performance are often overstated and highly sensitive to specific design choices.
Generative recommendation's touted cold-start abilities often vanish under rigorous testing, revealing a sensitivity to design choices that current benchmarks fail to capture.
Cold-start recommendation remains a central challenge in dynamic, open-world platforms, requiring models to recommend for newly registered users (user cold-start) and to recommend newly introduced items to existing users (item cold-start) under sparse or missing interaction signals. Recent generative recommenders built on pre-trained language models (PLMs) are often expected to mitigate cold-start by using item semantic information (e.g., titles and descriptions) and test-time conditioning on limited user context. However, cold-start is rarely treated as a primary evaluation setting in existing studies, and reported gains are difficult to interpret because key design choices, such as model scale, identifier design, and training strategy, are frequently changed together. In this work, we present a systematic reproducibility study of generative recommendation under a unified suite of cold-start protocols.
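To make the two settings concrete, the following is a minimal sketch of one way to derive user and item cold-start test sets from an interaction log. It is not the protocol used in the study, which the abstract does not specify. The time-based cutoff, the `Interaction` record, and the `cold_start_split` function are all illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple


class Interaction(NamedTuple):
    # Hypothetical record type; field names are assumptions for illustration.
    user: str
    item: str
    timestamp: int


def cold_start_split(
    interactions: List[Interaction], cutoff: int
) -> Tuple[List[Interaction], List[Interaction], List[Interaction]]:
    """Split a log into training data plus user and item cold-start test sets.

    Assumes a simple time-based cutoff: everything before `cutoff` is
    training data, everything at or after it is a test candidate.
    """
    train = [x for x in interactions if x.timestamp < cutoff]
    test = [x for x in interactions if x.timestamp >= cutoff]

    # Users and items seen during training are treated as "warm".
    warm_users = {x.user for x in train}
    warm_items = {x.item for x in train}

    # User cold-start: the user has no training interactions.
    user_cold = [x for x in test if x.user not in warm_users]

    # Item cold-start: a new item is shown to an existing (warm) user.
    item_cold = [
        x for x in test
        if x.item not in warm_items and x.user in warm_users
    ]
    return train, user_cold, item_cold
```

In this sketch, an interaction between a new user and a new item is counted only as user cold-start. Whether such cases should be counted separately is a design choice that an actual protocol would need to state explicitly.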