Apr 20, 2026 · arXiv:2604.18327

PARM: Pipeline-Adapted Reward Model

Xingyu Fan, Jiacheng Liu, Linqi Song, Pheng Ann Heng

AI Summary

The paper introduces Pipeline-Adapted Reward Model (PARM) to address the inconsistency between reward model predictions and actual pipeline execution outcomes in multi-stage LLM pipelines. PARM leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. Experiments on code generation for combinatorial optimization and GSM8K demonstrate that PARM improves pipeline output quality and stability compared to baselines.

Key Contribution

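Reward models optimized for single-step generation can produce scores that disagree with actual execution outcomes once they are integrated into multi-stage LLM pipelines; pipeline-aware training with direct preference optimization realigns the rewards with downstream feedback.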

Abstract

Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.
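The abstract does not spell out how pipeline feedback becomes training signal, so the following is only a minimal sketch of one plausible reading: candidates sampled at a stage are ranked by what happened downstream (did the code execute, was the answer correct), turned into chosen/rejected pairs, and scored with the standard DPO loss. The `Candidate` fields, the `outcome_score` ranking, and the pairing rule are illustrative assumptions, not the authors' procedure; only the DPO loss formula itself is standard.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    """Hypothetical record for one sampled stage output."""
    text: str       # a sampled formulation or program
    executed: bool  # did the downstream pipeline run without error?
    correct: bool   # did the final answer match the reference?


def outcome_score(c: Candidate) -> int:
    # Assumed ranking: correct > executed-but-wrong > failed to execute.
    return int(c.executed) + int(c.correct)


def build_preference_pairs(candidates):
    """Return (chosen, rejected) text pairs ordered by downstream outcome."""
    pairs = []
    for a, b in product(candidates, repeat=2):
        if outcome_score(a) > outcome_score(b):
            pairs.append((a.text, b.text))
    return pairs


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


candidates = [
    Candidate("A", executed=True, correct=True),
    Candidate("B", executed=True, correct=False),
    Candidate("C", executed=False, correct=False),
]
pairs = build_preference_pairs(candidates)
```

In this sketch, a policy that assigns the chosen sample a higher log-probability than the reference does (relative to the rejected sample) gets a loss below log 2, which is the value at zero margin.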

Code Generation & Program Synthesis · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
