Feb 24, 2026 · arXiv:2602.20759

Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu, Seongho Son, Ilija Bogunovic

AI Summary

The paper introduces Overton Pluralistic Group Relative Policy Optimization (OP-GRPO), a reinforcement learning framework that enables a single large language model (LLM) to generate responses covering diverse human perspectives without explicit prompting or modular orchestration. OP-GRPO fine-tunes a Sentence Transformer as a similarity estimator for more accurate coverage evaluation of generated responses, then incorporates it into a dual-reward system that promotes both broad coverage and uniqueness of perspectives. Experiments show that a Qwen2.5-3B-Instruct model trained with OP-GRPO outperforms a 20B GPT-OSS baseline and a modular architecture baseline on a Natural Language Inference (NLI) benchmark, demonstrating a "small models, big perspective coverage" effect.

Key Contribution

A 3B model, guided by a novel RL framework, can outperform a 20B model in capturing diverse human perspectives, challenging the assumption that larger models inherently possess better alignment.

Abstract

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
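
The abstract does not give the reward formulas, so the following is only a minimal sketch of how a dual reward (coverage plus uniqueness) could feed a group-relative advantage of the kind GRPO uses. Everything here is an assumption made for illustration: the token-overlap `toy_similarity` stands in for the paper's fine-tuned Sentence Transformer, and the `threshold` value, the unweighted sum of the two rewards, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def toy_similarity(a, b):
    """Hypothetical stand-in for the fine-tuned similarity estimator:
    Jaccard overlap of lowercase tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def coverage_reward(perspectives, references, sim=toy_similarity, threshold=0.5):
    """Fraction of reference perspectives matched by at least one generated one."""
    if not references:
        return 0.0
    covered = sum(
        any(sim(p, r) >= threshold for p in perspectives) for r in references
    )
    return covered / len(references)


def uniqueness_reward(perspectives, sim=toy_similarity, threshold=0.5):
    """Fraction of generated perspectives that do not duplicate an earlier one."""
    if not perspectives:
        return 0.0
    unique = sum(
        all(sim(p, q) < threshold for q in perspectives[:i])
        for i, p in enumerate(perspectives)
    )
    return unique / len(perspectives)


def dual_reward(perspectives, references):
    """Assumed combination: plain sum of coverage and uniqueness."""
    return coverage_reward(perspectives, references) + uniqueness_reward(perspectives)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to the same query, each split into perspectives.
references = ["taxes fund public services", "taxes reduce individual freedom"]
repetitive = ["taxes fund public services", "taxes fund public services"]
pluralistic = ["taxes fund public services", "taxes reduce individual freedom"]

rewards = [dual_reward(r, references) for r in (repetitive, pluralistic)]
advantages = group_relative_advantages(rewards)
```

In this toy group the repetitive response covers one of two reference perspectives and repeats itself, so it receives a below-average advantage, while the response that covers both perspectives without duplication receives an above-average one.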

Constitutional AI & AI Ethics · Natural Language Processing · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 70

Year: 2026

Venue: N/A
