This paper investigates whether LLMs can control the distribution of generated attributes (gender, race, sentiment) in multi-round generation, and finds that existing methods such as prompt engineering and DPO fail to do so reliably. To address this, the authors introduce a fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. The method uses a hybrid objective: a KL divergence term anchors the probabilities of the steering tokens, and Kahneman-Tversky Optimization ties those tokens to semantically consistent responses. Across six datasets, it outperforms the baselines on distributional control.
LLMs struggle to reliably control the distribution of attributes like gender and race in multi-round generation, but a novel fine-tuning approach can precisely steer these distributions.
While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
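To make the evaluation setting concrete, the following is a minimal sketch of how distribution alignment could be measured: sample an attribute over many rounds, build the empirical distribution, and compare it with the target using KL divergence. It is not the authors' code. The function `mock_generate`, the `model_bias` values, the category names, and the choice of KL(target || observed) as the direction are all assumptions made for illustration; a real setup would replace `mock_generate` with an LLM call plus attribute extraction.

```python
import math
import random
from collections import Counter


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats over the keys of p; q is floored at eps to avoid log(0)."""
    return sum(
        pv * math.log(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0
    )


def empirical_distribution(samples, categories):
    """Relative frequency of each listed category among the samples."""
    counts = Counter(samples)
    return {c: counts[c] / len(samples) for c in categories}


def mock_generate(rng, bias):
    """Hypothetical stand-in for one LLM call followed by attribute extraction."""
    cats = list(bias)
    return rng.choices(cats, weights=[bias[c] for c in cats], k=1)[0]


# Assumed target: a uniform distribution over two attribute values.
target = {"female": 0.5, "male": 0.5}
# Assumed skew of the mock model (illustrative numbers, not from the paper).
model_bias = {"female": 0.2, "male": 0.8}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [mock_generate(rng, model_bias) for _ in range(1000)]
observed = empirical_distribution(samples, list(target))

print("observed:", observed)
print("KL(target || observed):", kl_divergence(target, observed))
```

A value near zero means the repeated generations follow the target closely; a larger value signals the kind of skew that the abstract attributes to off-the-shelf models.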