The paper investigates bias spillover in LLM alignment, where targeted gender alignment affects fairness across other sensitive attributes. Using Direct Preference Optimization and the BBQ benchmark, the authors aligned Mistral 7B, Llama 3.1 8B, and Qwen 2.5 7B for gender fairness. Results show that while aggregate fairness metrics improve, context-aware analysis reveals significant fairness degradations in ambiguous contexts for attributes like physical appearance, sexual orientation, and disability status, demonstrating the risk of single-attribute alignment.
Aligning LLMs for gender fairness can inadvertently worsen biases related to physical appearance, sexual orientation, and disability status, especially when the context is ambiguous.
Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguated contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p < 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
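The contrast between aggregate and context-aware evaluation can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout (dicts with `attribute`, `context`, and `correct` fields) and the function names `accuracy_by_group` and `spillover_deltas` are assumptions made for illustration, and plain accuracy stands in for whatever fairness metric the paper actually reports.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return accuracy per group, where a group is the tuple of values under `keys`.

    Assumed (hypothetical) record layout: each record is a dict with
    'attribute' (e.g. 'physical_appearance'), 'context' ('ambiguous' or
    'disambiguated'), and 'correct' (bool).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group][0] += int(record["correct"])
        totals[group][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


def spillover_deltas(before, after, keys=("attribute", "context")):
    """Accuracy change (after minus before) for each group present in both runs.

    With keys=("attribute",) this gives the aggregate view; with the default
    keys it gives the context-aware view, split by ambiguous/disambiguated.
    """
    acc_before = accuracy_by_group(before, keys)
    acc_after = accuracy_by_group(after, keys)
    return {g: acc_after[g] - acc_before[g] for g in acc_before if g in acc_after}
```

Calling `spillover_deltas(before, after, keys=("attribute",))` gives one change per attribute, while the default keys split each attribute by context. A negative change for an ambiguous-context group alongside a positive aggregate change is the pattern the abstract describes as bias spillover hidden by aggregate results.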