PowerFlow reframes unsupervised LLM fine-tuning as a distribution matching problem, using GFlowNets to sample from $\alpha$-power distributions. This allows for directional control over LLM behavior, either sharpening the distribution for reasoning or flattening it for creativity. Experiments show PowerFlow outperforms existing Reinforcement Learning from Internal Feedback (RLIF) methods and achieves simultaneous gains in diversity and quality by mitigating over-sharpening.
Unleashing an LLM's inner creativity or laser-sharp logic is now as simple as turning a knob, thanks to a new distribution-matching method that avoids heuristic rewards.
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNets as amortized variational samplers for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha>1$) to intensify logical reasoning, or flattening it ($\alpha<1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
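For intuition, the target is presumably the $\alpha$-power of the reference model's own distribution, $\pi^*_\alpha(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)^{\alpha}$, and a Trajectory-Balance loss drives the fine-tuned policy toward it. The sketch below shows how such a loss could be computed from sequence log-probabilities; the function name, tensor layout, and the plain (non-length-corrected) Trajectory-Balance form are illustrative assumptions rather than the paper's actual implementation, which additionally applies the length-aware correction described in the abstract.

```python
import torch

def alpha_power_tb_loss(logp_policy, logp_ref, log_z, alpha=2.0):
    """Illustrative Trajectory-Balance loss targeting an alpha-power distribution.

    logp_policy: (batch, T) per-token log-probs of sampled responses under the
                 policy being fine-tuned (zeros at padding positions).
    logp_ref:    (batch, T) per-token log-probs of the same tokens under the
                 frozen reference model.
    log_z:       learned scalar estimating the log partition function of the
                 unnormalized target.
    alpha:       > 1 sharpens the target toward high-likelihood completions,
                 < 1 flattens it toward more diverse ones.
    """
    # Sequence-level log-probability of each sampled trajectory under the policy.
    logp_pi = logp_policy.sum(dim=-1)

    # Log-reward of a trajectory: the alpha-power of the reference likelihood,
    # log R(y) = alpha * log pi_ref(y | x).
    log_reward = alpha * logp_ref.sum(dim=-1)

    # Trajectory Balance residual: (log Z + log pi_theta(y) - log R(y))^2.
    # Minimizing it pushes pi_theta toward R(y) / Z, i.e. pi_ref(y | x)^alpha
    # up to normalization.
    residual = log_z + logp_pi - log_reward
    return (residual ** 2).mean()
```

In a training loop, the policy's own samples (or an exploratory mixture) would supply the trajectories, and `log_z` could be parameterized per prompt; the length-aware variant would further correct the residual for response length, with its exact form given in the paper.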