This study systematically evaluates conditioning methods for Large Language Models (LLMs) to characterize the trade-off between conditioning effectiveness and output fluency. It finds that efficient steering methods can condition outputs effectively but often at a cost to fluency, and that activation steering is far less effective on instruction-tuned models than on their base counterparts. Simple prompting and supervised fine-tuning work well for concept injection but less well for removal, and cheaply computed textual metrics correlate strongly with more expensive LLM-as-judge scores.
Efficient conditioning methods for LLMs often sacrifice fluency, a trade-off that matters when choosing how to control deployed models.
Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are less effective at concept removal. Finally, cheaply computed textual metrics correlate strongly with costly LLM-as-judge scores and provide insight into the behavior of conditioning methods.
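The last claim, that cheap textual metrics track LLM-as-judge scores, can be illustrated with a minimal sketch. The abstract does not say which textual metrics or which correlation statistic the authors use, so the choices here are assumptions: distinct-2 (the share of unique bigrams) stands in for a cheap fluency signal, and Spearman rank correlation measures agreement. The generations and judge scores are made-up placeholders, not results from the paper.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Share of unique n-grams among all n-grams (1.0 means no repetition).

    Assumed stand-in for a cheap textual metric; not taken from the paper.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical generations from a conditioned model (placeholders).
generations = [
    "the cat sat on the mat and looked out of the window",
    "the cat the cat the cat sat on the mat",
    "cat cat cat cat cat cat cat cat",
]
# Hypothetical LLM-as-judge fluency scores for the same outputs (placeholders).
judge_scores = [4.5, 3.0, 1.0]

metric_scores = [distinct_n(g) for g in generations]
rho = spearman(metric_scores, judge_scores)
print(f"distinct-2: {metric_scores}, Spearman rho: {rho:.3f}")
```

In this toy data the repetitive outputs score lower on both measures, so the rank correlation is high. On real outputs, a check of this kind would show whether a cheap metric can stand in for judge calls before committing to the more expensive evaluation.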