The paper introduces GFRIEND, a data augmentation and expansion framework to improve the data efficiency of generative reward models trained with Direct Preference Optimization (DPO) in few-shot Reinforcement Learning from Human Feedback (RLHF). GFRIEND leverages Chain-of-Thought (CoT) sampling for preference refinement to generate diverse, high-quality preference relationships and employs a perplexity-based scoring mechanism for nuanced preference level assignment. Experiments show that GFRIEND enables reward models trained on small datasets to achieve performance comparable to those trained on large-scale datasets, demonstrating significant improvements in data efficiency and model performance.
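The perplexity-based scoring step can be illustrated concretely. The page does not give GFRIEND's exact formulation, so the following is a minimal sketch under assumptions: the judge model returns per-token log-probabilities for each CoT judgment, and candidates are bucketed into preference levels by ranking on perplexity (lower perplexity read as higher model confidence). The function names and the bucketing rule are illustrative, not the paper's.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-probability) of a sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def assign_preference_levels(candidates, num_levels=4):
    """Rank candidates by the perplexity of their CoT judgments and
    bucket them into num_levels preference levels (1 = most preferred).

    Each candidate is a dict with an "id" and the "logprobs" of its
    judgment tokens. The even bucketing below is a hypothetical choice,
    not GFRIEND's published rule.
    """
    ranked = sorted(candidates, key=lambda c: perplexity(c["logprobs"]))
    levels = {}
    for i, cand in enumerate(ranked):
        levels[cand["id"]] = min(num_levels, 1 + i * num_levels // len(ranked))
    return levels
```

Graded levels like these, rather than a single chosen/rejected split, are what allow the M-DPO objective described below to exploit finer-grained preference differences.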
Forget massive datasets: GFRIEND lets you train reward models in a few-shot setting that perform on par with models trained on large-scale data.
The ability to train high-performing reward models with few-shot data is critical for the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to match the performance of those trained on large-scale datasets. Traditional approaches to training generative reward models, such as Direct Preference Optimization (DPO), are constrained by inefficient sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse, high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and uses Multi-level Direct Preference Optimization (M-DPO) so the model can capture finer-grained preference differences between samples. Experiments show that the proposed method significantly improves data efficiency and model performance, with reward models trained in a few-shot setting achieving results on par with those trained at scale. These findings underscore the potential of data-efficient strategies for reward model optimization and offer a robust solution for low-resource RLHF applications.
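To make the M-DPO idea concrete: standard DPO optimizes a log-sigmoid of the policy-vs-reference log-ratio difference between a chosen and a rejected sample. A multi-level variant can additionally require a larger margin when the two samples sit further apart on the preference scale. The abstract does not spell out GFRIEND's exact loss, so this is a sketch under that margin-based assumption; `beta`, `margin_scale`, and the linear margin are illustrative hyperparameters.

```python
import math

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def m_dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    level_gap, beta=0.1, margin_scale=0.5):
    """Per-pair multi-level DPO loss (hypothetical formulation).

    level_gap is the difference in assigned preference levels between the
    chosen and rejected samples; a larger gap demands a larger reward
    margin before the loss flattens out.
    """
    logits = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    return -logsigmoid(logits - margin_scale * level_gap)
```

With `level_gap=0` this reduces to the standard DPO pairwise loss; pairs separated by more levels contribute a strictly larger loss at the same logits, pushing the model to separate them more strongly.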