Yu Chen

\[
\mathcal{L}_{\text{TB}}(\theta,\phi;q,y)=\left(\log Z_{\phi}(q)+\sum_{t=1}^{T}\log\pi_{\theta}(y_{t}|y_{<t},q)-\log\tilde{p}_{\text{target}}(y|q)\right)^{2}. \tag{5}
\]

This formulation transforms the distribution matching problem into an RL-style on-policy optimization task:

\[
\min_{\theta,\phi}\mathcal{J}(\theta,\phi)=\mathbb{E}_{q\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi_{\theta}(\cdot|q)}\left[\mathcal{L}_{\text{TB}}(\theta,\phi;q,y)\right]\right]. \tag{6}
\]

By optimizing Eq. (6), $\pi_{\theta}$ learns to align its sequence-level probabilities with $\tilde{p}_{\text{target}}$, while $Z_{\phi}(q)$ amortizes the estimation of the normalization constant to reduce gradient variance. Further implementation and training details are provided in Appendix A.1.

3.3. Length-Aware PowerFlow Objective

Figure 3: Stability analysis of distribution matching strategies. Matching the trajectory-level $\alpha$-power distribution via standard TB or RL objectives (-traj) leads to rapid length collapse. Token-level normalization (-token) initially improves performance but eventually decays due to the exploitation of repetitive tokens. PowerFlow maintains both stable response length and superior reasoning accuracy (pass@1 on MATH) throughout training.

Autoregressive generation in LLMs is inherently plagued by structural length bias. Specifically, the log-probability of a trajectory, $\log p(y|q)=\sum_{t=1}^{|y|}\log p(y_{t}|y_{<t},q)$, is approximately negatively linear with respect to the sequence length $|y|$. Consequently, a naive distribution matching objective is often dominated by sequence length rather than semantic density. For instance, when targeting an $\alpha$-power distribution with $\alpha>1$ (sharpening), the model tends to exploit the path probability by producing excessively short, trivial sequences.
Conversely, when $\alpha<1$ (flattening), the model is prone to generating repetitive, deterministic long sequences to accumulate probability mass. Furthermore, the extreme sensitivity of path probabilities to $|y|$ causes the gradient of the partition function $Z_{\phi}$ to exhibit massive variance, severely destabilizing the optimization process.

As illustrated in Figure 3, directly matching the $\alpha$-power distribution ($\alpha>1$) using either the Trajectory Balance (TB-traj) or RL-based KL-regularized (RL-traj) objectives leads to an immediate and pathological collapse of response length. We include the RL-based formulation as a baseline because the standard RL objective with KL regularization, $\max_{\pi}\mathbb{E}_{y\sim\pi}[r(y)]-\beta\,\mathbb{D}_{\text{KL}}(\pi\|\pi_{\text{base}})$, theoretically yields an optimal policy $\pi^{*}(y|q)\propto\pi_{\text{base}}(y|q)\exp(r(y)/\beta)$. By setting the intrinsic reward to the base model's log-probability, $r(y)=\log p_{\text{base}}(y|q)$, the target becomes an $\alpha$-power distribution where $\alpha=1+1/\beta$.

To counteract length bias, a common heuristic is to optimize the average token log-probability, $\frac{1}{|y|}\log p_{\text{base}}$. While these token-level variants, TB-token and RL-token, exhibit initial performance gains, Figure 3 reveals a subsequent decay. This failure stems from a fundamental distortion of the target distribution's structural integrity. By optimizing for average token-level confidence, these methods reshape the energy surface such that local probability mass is decoupled from global semantic coherence. Consequently, the model exploits repetitive and meaningless tokens to artificially lower the average energy, effectively eroding the learned semantic structure to inflate likelihood metrics through repetitive generation.
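The relation $\alpha=1+1/\beta$ used for the RL baseline follows in one substitution step. The short derivation below spells it out, under the assumption (not stated explicitly in the text) that $\pi_{\text{base}}$ and $p_{\text{base}}$ denote the same base-model distribution:

```latex
\begin{align*}
\pi^{*}(y|q)
  &\propto \pi_{\text{base}}(y|q)\,\exp\!\big(r(y)/\beta\big) \\
  &= p_{\text{base}}(y|q)\,\exp\!\Big(\tfrac{1}{\beta}\log p_{\text{base}}(y|q)\Big) \\
  &= p_{\text{base}}(y|q)^{\,1+1/\beta}
   \;=\; p_{\text{base}}(y|q)^{\alpha},
  \qquad \alpha = 1+\tfrac{1}{\beta}.
\end{align*}
```

For example, $\beta=1$ gives $\alpha=2$, and $\beta\to\infty$ gives $\alpha\to 1$. Since $\beta>0$, this construction only reaches the sharpening regime $\alpha>1$.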
These observations underscore that both naive RL and standard GFlowNet objectives are profoundly susceptible to reward and structural biases, necessitating a more principled approach to distribution alignment.

To bridge the gap between principled distribution matching and the non-stationary length distributions of LLMs, we introduce a structural reparameterization of the Trajectory-Balance objective. Standard GFlowNets (Malkin et al., 2022) typically treat the partition function $Z_{\phi}(q)$ as a prompt-dependent scalar, a formulation that is ill-conditioned for autoregressive sequences where probabilities decay exponentially with length. We instead reformulate the normalization constant as a length-aware energy term, $Z_{\phi}(q,y)=(Z^{\prime}_{\phi}(q))^{|y|}$, which effectively projects the optimization onto a length-normalized energy surface. By further normalizing the log-trajectory mismatch by $|y|$, our LA-TB objective ensures that the optimization gradient remains scale-invariant across varying sequence lengths:

\[
\mathcal{L}_{\text{LA-TB}}(\theta,\phi;q,y)=\left(\log Z^{\prime}_{\phi}(q)+\frac{1}{|y|}\log\frac{\pi_{\theta}(y|q)}{\tilde{p}_{\text{target}}(y|q)}\right)^{2}. \tag{7}
\]

This objective converges to a length-normalized target distribution:

\[
\pi^{*}(y|q)\propto\frac{\tilde{p}_{\text{target}}(y|q)}{Z^{\prime}_{\phi}(q)^{|y|}}. \tag{8}
\]

While this formulation does not strictly maintain relative mode rankings across sequences of varying lengths, it achieves a robust balance by neutralizing structural biases while fundamentally preserving the target distribution's semantic essence. By operating within the space of amortized geometric mean probabilities, PowerFlow prioritizes semantic quality over sequence brevity or redundancy, effectively shifting the optimization focus toward the model's true latent capability space.
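To make the scale-invariance claim concrete, the following is a minimal sketch of the per-sample LA-TB loss of Eq. (7) next to a trajectory-level TB loss in the form of Eq. (5). The function names `la_tb_loss` and `tb_loss`, the plain-float arguments, and the toy per-token log-probabilities are illustrative assumptions, not the authors' implementation; in practice $\log Z^{\prime}_{\phi}(q)$ would be a learned, prompt-dependent quantity.

```python
import math


def tb_loss(log_z, policy_token_logps, log_p_target):
    """Trajectory-level TB loss, Eq. (5), for one sampled response.

    log_z              -- scalar stand-in for log Z_phi(q) (assumption: plain float)
    policy_token_logps -- list of log pi_theta(y_t | y_<t, q), one per token
    log_p_target       -- unnormalized log p~_target(y | q) for the whole response
    """
    return (log_z + sum(policy_token_logps) - log_p_target) ** 2


def la_tb_loss(log_z_prime, policy_token_logps, log_p_target):
    """Length-aware TB loss, Eq. (7): the log-ratio is divided by |y|."""
    length = len(policy_token_logps)
    if length == 0:
        raise ValueError("response must contain at least one token")
    log_ratio = sum(policy_token_logps) - log_p_target
    return (log_z_prime + log_ratio / length) ** 2


if __name__ == "__main__":
    # Toy setting: the same per-token mismatch (-0.5 vs. -0.2) at two lengths.
    for length in (4, 40):
        token_logps = [-0.5] * length
        log_target = -0.2 * length
        print(length,
              tb_loss(0.0, token_logps, log_target),
              la_tb_loss(0.3, token_logps, log_target))
```

In this toy setting the per-token mismatch is identical at both lengths, so the LA-TB residual is the same for the 4-token and the 40-token response, whereas the trajectory-level residual grows linearly with length. The LA-TB residual vanishes exactly when $\log\pi_{\theta}(y|q)=\log\tilde{p}_{\text{target}}(y|q)-|y|\log Z^{\prime}_{\phi}(q)$, which matches Eq. (8).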
Finally, we instantiate the target as the $\alpha$-power distribution, $p_{\text{base}}(y|q)^{\alpha}$. To ensure instruction-following integrity and logical structure, we incorporate a format penalty $\psi(y)$. Specifically, $\psi(y)$ is set to a negative constant (e.g., $-0.5$) if the output fails to match a predefined pattern (e.g., the absence of \boxed{}), and $0$ otherwise. This yields the final PowerFlow objective:

\[
\mathcal{L}_{\text{PowerFlow}}=w\cdot\Bigg(\log Z^{\prime}_{\phi}(q)+\frac{1}{|y|}\log\pi_{\theta}(y|q)-\alpha\left[\frac{1}{|y|}\log p_{\text{base}}(y|q)+\psi(y)\right]\Bigg)^{2} \tag{9}
\]

where $w$ is the importance sampling ratio defined as:

\[
w=\text{clip}\left(\frac{\pi_{\theta}(y|q)}{\pi_{\text{old}}(y|q)},1-\epsilon,1+\epsilon\right)^{\text{detach}}. \tag{10}
\]

The inclusion of $w$ ensures compatibility with off-policy fine-tuning, where trajectories are sampled from a behavior policy $\pi_{\text{old}}$. Following Zhu et al. (2025), we apply clipping to maintain training stability and prevent gradient collapse during iterative optimization. As evidenced by the robust training dynamics in Figure 3, the PowerFlow objective effectively circumvents these structural distortions, achieving sustained length stability and monotonic performance gains by preserving the principled $\alpha$-power density on a length-normalized surface.

4. Experiments

In this section, we evaluate the effectiveness of PowerFlow across two primary domains: complex logical reasoning and diverse creative writing. Following the experimental setup detailed in Section 4.1, we present our findings in two parts. First, Section 4.2 demonstrates that distribution sharpening ($\alpha>1$) effectively intensifies reasoning performance across various model variants.
Subsequently, Section 4.3 reveals that distribution flattening ($\alpha<1$) restores the generative diversity typically suppressed in aligned models while simultaneously improving output quality. Together, these experiments illustrate that principled distribution matching serves as a robust mechanism for the directional elicitation of latent LLM capabilities without external supervision.

4.1. Experimental Setup

Data and Training Configuration. For reasoning tasks, we follow standard practices in the community (Hugging Face, 2025; Zhang et al., 2025a) by utilizing questions from the NuminaMath-CoT dataset (LI et al., 2024) for unsupervised training. Specifically, we employ a subset of 18,000 queries filtered by Zhang et al. (2025c) to exclude instances with excessive response length or potential answer leakage. Each query is appended with a prompt instructing the model to "think step by step" and provide the final answer within a \boxed{} environment. For creative writing tasks covering poem continuation, story generation, and joke writing (Lu et al., 2025), we select a training set of 300 prompts, drawn from the 500-prompt collection curated by Zhang et al. (2025b) and sourced from PoemHunter.com, BookMIA (Shi et al., 2024), and Reddit r/DadJokes (Reddit, 2023). All inputs are formatted using the models' official chat templates; detailed prompts and hyperparameters are provided in Appendix B and Appendix A.2, respectively.

Models and Baselines. To evaluate the generalizability of PowerFlow, we conduct experiments across several representative model families and scales, including the 1.
