This paper introduces BLADE, a Bayesian list-wise alignment method for LLM-based recommendation that addresses limitations of existing Best-of-N (BoN) alignment techniques. BLADE uses a Bayesian framework to dynamically update the target distribution used for distillation, combining historical priors with evidence from the model's current rollouts, so that the training signal evolves with the model and stays informative. Experiments on three real-world datasets show that BLADE outperforms state-of-the-art baselines and breaks the performance upper bound imposed by a static reference, achieving sustained gains in ranking accuracy and complex list-wise metrics.
LLMs for recommendation can now surpass the limitations of static training signals, achieving sustained improvements in ranking accuracy, fairness, and diversity through a dynamically updated Bayesian distillation target.
Large Language Models (LLMs) have revolutionized recommender systems (LLM4Rec) by leveraging their generative capabilities to model complex user preferences. However, existing LLM4Rec methods primarily rely on token-level objectives, making it difficult to optimize the list-level, non-differentiable metrics (e.g., NDCG, fairness) that define actual recommendation quality. While Best-of-N (BoN) directly optimizes these metrics during inference, its high computational cost hinders real-world deployment. To address this, BoN Alignment aims to distill the search capability into the model itself, yet current approaches suffer from two critical limitations: (1) Indiscriminate Supervision, where the static reference fails to distinguish the relative quality of candidates exceeding its empirical range, leading to a loss of ranking guidance; and (2) Gradient Decay, where the effective supervision signal rapidly diminishes as the evolving policy improves, resulting in inefficient optimization. To overcome these challenges, we propose BLADE (Bayesian List-wise Alignment via Dynamic Estimation). Unlike static approaches, BLADE introduces a Bayesian framework that continuously updates the target distribution by fusing historical priors with dynamic evidence from the model's current rollouts. This mechanism constructs a self-evolving target that adapts to the model's growing capabilities, ensuring the training signal remains informative throughout the learning process. Extensive experiments on three real-world datasets demonstrate that BLADE significantly outperforms state-of-the-art baselines. Crucially, it breaks the static performance upper bound, achieving sustained gains in both ranking accuracy (Recall, NDCG) and complex list-wise metrics (Fairness, Diversity). The code is available at https://github.com/RegionCh/BLADE.
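To make the inference-time baseline concrete, the following is a minimal sketch of Best-of-N selection under a list-level metric: draw N candidate recommendation lists and keep the one with the highest NDCG. It is not taken from the BLADE repository. The helper names (`dcg`, `ndcg_at_k`, `best_of_n`), the toy catalog, and the random sampler standing in for an LLM are all illustrative assumptions.

```python
import math
import random


def dcg(gains):
    """Discounted cumulative gain: the gain at 0-based rank r is divided by log2(r + 2)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_items, relevance, k):
    """NDCG@k of a ranked list, given a dict mapping item -> graded relevance."""
    gains = [relevance.get(item, 0.0) for item in ranked_items[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


def best_of_n(sample_list, score, n):
    """Draw n candidate lists and return the one with the highest score.

    The cost grows linearly with n, since every candidate requires one
    full generation from the underlying model.
    """
    candidates = [sample_list() for _ in range(n)]
    return max(candidates, key=score)


# Toy setup (assumed for illustration): a five-item catalog, graded relevance
# for one user, and a random sampler in place of an LLM recommender.
catalog = ["a", "b", "c", "d", "e"]
relevance = {"a": 3.0, "b": 2.0, "c": 1.0}
rng = random.Random(0)


def sample_list():
    return rng.sample(catalog, 3)


def score(ranked_items):
    return ndcg_at_k(ranked_items, relevance, 3)


best = best_of_n(sample_list, score, n=8)
print(best, round(score(best), 3))
```

Because the selection goes through `max` over a non-differentiable score, nothing here can be backpropagated into the model, and the N-fold generation cost is paid on every request. BoN Alignment, as described in the abstract, aims to move this search into training instead.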