CohereTAU · May 19, 2025 · arXiv:2505.12845

Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks

Ruopei Sun, Jianfeng Cai, Jinhua Zhu, Kangwen Zhao, Dongyun Xue, Weng Zhou, Li Li, Houqiang Li

AI Summary

This paper addresses the limitations of Reinforcement Learning from Human Feedback (RLHF) in handling complex multi-instruction tasks by proposing a Multi-level Aware Preference Learning (MAPL) framework. MAPL constructs new training objectives by generating varied prompts with preference relations to capture intra-sample preference disparities and synthesizing multi-instruction preference pairs to capture inter-sample preference discrepancies. The framework integrates into both Reward Modeling and Direct Preference Optimization paradigms, demonstrating improved performance on multiple benchmarks.

Key Contribution

RLHF can be significantly improved for complex tasks by explicitly modeling preference relationships both within and between training examples, unlocking better instruction following without relying on expensive human annotation or biased LLM-generated data.

Abstract

RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model's semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.
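As a rough illustration of the intra-sample idea in the abstract (one fixed response, several prompts, and a preference relation between those prompts), the sketch below scores a single response under two prompts and applies the standard Bradley-Terry pairwise loss. The abstract does not give the paper's objective functions, so this is not the MAPL loss. The `toy_reward` scorer, the three instruction checks, and the example response are hypothetical stand-ins for a learned reward model and real data.

```python
import math
from typing import Callable, List

# An "instruction" is modeled here as a predicate over the response text.
# This representation is an assumption made for illustration only.
Instruction = Callable[[str], bool]


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Standard pairwise loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # log(1 + exp(-m)) equals -log(sigmoid(m)).
    return math.log1p(math.exp(-margin))


def toy_reward(instructions: List[Instruction], response: str) -> float:
    """Hypothetical reward: fraction of the prompt's instructions the response satisfies."""
    return sum(check(response) for check in instructions) / len(instructions)


def under_50_words(response: str) -> bool:
    return len(response.split()) < 50


def mentions_rlhf(response: str) -> bool:
    return "RLHF" in response


def all_uppercase(response: str) -> bool:
    return response.isupper()


# One fixed response, two prompts. The response satisfies every instruction
# in the first prompt but violates the extra instruction in the second.
response = "RLHF aligns models with human preferences."
prompt_matched = [under_50_words, mentions_rlhf]
prompt_mismatched = [under_50_words, mentions_rlhf, all_uppercase]

score_matched = toy_reward(prompt_matched, response)
score_mismatched = toy_reward(prompt_mismatched, response)

# Prefer the (prompt, response) pairing in which the response complies fully.
loss = bradley_terry_loss(score_matched, score_mismatched)
reversed_loss = bradley_terry_loss(score_mismatched, score_matched)
print(f"loss={loss:.4f} reversed_loss={reversed_loss:.4f}")
```

The loss is lower when the fully satisfied prompt is treated as preferred, which is the ordering a reward model would be trained toward. The inter-sample objective is not sketched here, since the abstract does not say how the multi-instruction pairs are synthesized.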

Natural Language Processing · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2025

Venue: arXiv.org
