PowerFlow reframes unsupervised LLM fine-tuning as a distribution matching problem, using GFlowNets to sample from $\alpha$-power distributions. This allows for directional control over LLM behavior, either sharpening the distribution for reasoning or flattening it for creativity. Experiments show PowerFlow outperforms existing Reinforcement Learning from Internal Feedback (RLIF) methods and achieves simultaneous gains in diversity and quality by mitigating over-sharpening.
Unleashing an LLM's inner creativity or laser-sharp logic is now as simple as turning a knob, thanks to a new distribution-matching method that avoids heuristic rewards.
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNets as amortized variational samplers for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha>1$) to intensify logical reasoning, or flattening it ($\alpha<1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
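For intuition, the target is presumably the $\alpha$-power of the reference model's own distribution, $\pi^*_\alpha(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)^{\alpha}$, and a Trajectory-Balance loss drives the fine-tuned policy toward it. The sketch below shows how such a loss could be computed from sequence log-probabilities; the function name, tensor layout, and the plain (non-length-corrected) Trajectory-Balance form are illustrative assumptions rather than the paper's actual implementation, which additionally applies the length-aware correction described in the abstract.

```python
import torch

def alpha_power_tb_loss(logp_policy, logp_ref, log_z, alpha=2.0):
    """Illustrative Trajectory-Balance loss targeting an alpha-power distribution.

    logp_policy: (batch, T) per-token log-probs of sampled responses under the
                 policy being fine-tuned (zeros at padding positions).
    logp_ref:    (batch, T) per-token log-probs of the same tokens under the
                 frozen reference model.
    log_z:       learned scalar estimating the log partition function of the
                 unnormalized target.
    alpha:       > 1 sharpens the target toward high-likelihood completions,
                 < 1 flattens it toward more diverse ones.
    """
    # Sequence-level log-probability of each sampled trajectory under the policy.
    logp_pi = logp_policy.sum(dim=-1)

    # Log-reward of a trajectory: the alpha-power of the reference likelihood,
    # log R(y) = alpha * log pi_ref(y | x).
    log_reward = alpha * logp_ref.sum(dim=-1)

    # Trajectory Balance residual: (log Z + log pi_theta(y) - log R(y))^2.
    # Minimizing it pushes pi_theta toward R(y) / Z, i.e. pi_ref(y | x)^alpha
    # up to normalization.
    residual = log_z + logp_pi - log_reward
    return (residual ** 2).mean()
```

In a training loop, the policy's own samples (or an exploratory mixture) would supply the trajectories, and `log_z` could be parameterized per prompt; the length-aware variant would further correct the residual for response length, with its exact form given in the paper.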