CUHK | HKUST | Sydney | Tencent AI | Westlake | Jun 9, 2026 | arXiv:2606.11075

Exploring the Design Space of Reward Backpropagation for Flow Matching

Ruoyu Wang, Boye Niu, Xiangxin Zhou, Yushi Huang, Tongliang Liu, Chi Zhang

AI Summary

This paper introduces FlowBP, a framework for reward backpropagation in text-to-image flow matching models that treats the backward trajectory itself as the design object. FlowBP samples with a no-gradient cached rollout, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This bounds memory by the active-set size and limits gradient chaining to at most one Jacobian factor. Its three variants (FlowBP-Sparse, FlowBP-Bridge, and FlowBP-Lagrange) improve over direct-gradient baselines on most preference, quality, and compositional metrics across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base.

Key Contribution

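FlowBP decouples sampling from optimization in reward backpropagation: a surrogate backward trajectory bounds memory by the active-set size and limits gradient chaining to at most one Jacobian factor, and its three variants improve over direct-gradient baselines on most reported alignment metrics.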

Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.
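The contrast between full backpropagation and a sparse surrogate can be illustrated on a toy scalar problem. The sketch below is not the paper's implementation: the linear velocity field, the function names, and the choice to hold inactive steps at their cached velocities are all assumptions made for illustration, and derivatives are written by hand instead of using an autodiff framework. Bridge coupling and higher-order quadrature are not modeled, so this toy surrogate carries no Jacobian factor at all, whereas the abstract allows up to one.

```python
# Toy scalar sketch; all names are hypothetical and not taken from the paper.

def velocity(x, t, theta):
    """Toy velocity field v_theta(x, t) = theta * x (t is unused here)."""
    return theta * x

def dvelocity_dx(x, t, theta):
    """Jacobian of the toy velocity with respect to the state x."""
    return theta

def dvelocity_dtheta(x, t, theta):
    """Derivative of the toy velocity with respect to theta, with x held fixed."""
    return x

def cached_rollout(x0, theta, num_steps):
    """Euler sampling with no gradient tracking; caches states and velocities."""
    h = 1.0 / num_steps
    states, velocities = [x0], []
    for k in range(num_steps):
        v = velocity(states[-1], k * h, theta)
        velocities.append(v)
        states.append(states[-1] + h * v)
    return states, velocities

def full_backprop_grad(states, theta, num_steps):
    """d x_N / d theta through every step: one Jacobian factor per step is chained."""
    h = 1.0 / num_steps
    grad = 0.0  # d x_0 / d theta
    for k in range(num_steps):
        x, t = states[k], k * h
        # x_{k+1} = x_k + h * v(x_k), differentiated with the chain rule.
        grad = (1.0 + h * dvelocity_dx(x, t, theta)) * grad \
            + h * dvelocity_dtheta(x, t, theta)
    return grad

def sparse_surrogate_grad(states, theta, num_steps, active):
    """Gradient of the surrogate x_0 + sum_k h * v_k with respect to theta.

    Assumption: only steps in `active` are re-forwarded, each at its cached
    (detached) state, so inactive steps are constants and no Jacobian with
    respect to x is chained across steps.
    """
    h = 1.0 / num_steps
    return sum(h * dvelocity_dtheta(states[k], k * h, theta) for k in active)

states, velocities = cached_rollout(x0=1.0, theta=0.5, num_steps=4)
full_grad = full_backprop_grad(states, theta=0.5, num_steps=4)        # 1.423828125
sparse_grad = sparse_surrogate_grad(states, 0.5, 4, active=[1, 3])    # 0.63720703125
```

In this toy setting the surrogate re-forwards only `len(active)` steps, which is the sense in which memory scales with the active set, and its terminal value `x0 + sum(h * v_k)` matches the cached rollout's final state. The surrogate gradient differs from the full one, which is the approximation trade-off the abstract describes.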

RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
