NTU | May 6, 2026 | arXiv:2605.04920

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

AI Summary

This paper investigates whether reinforcement learning (RL) can improve compositional generalization, addressing the limitations of token-level supervised fine-tuning. The authors apply Group Relative Policy Optimization (GRPO) with both binary and composite rewards to optimize models based on their final outputs. Experiments on compositional benchmarks show that RL outperforms supervised fine-tuning, particularly on complex compositions, by reshaping the output distribution and mitigating overfitting to frequent training compositions.

Key Contribution

RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.

Abstract

Compositional generalization refers to the ability to correctly interpret novel combinations of known primitives, and it remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.
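
The outcome-level setup described in the abstract can be illustrated with a minimal Python sketch. It shows a binary exact-match reward, one possible composite reward, and the group-relative advantage normalization commonly associated with GRPO. The function names, the token-overlap partial credit, the `partial_weight` parameter, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify how the paper's composite reward is defined.

```python
from statistics import mean, pstdev


def binary_reward(prediction: str, target: str) -> float:
    """Outcome-level reward: 1.0 if the final output matches exactly, else 0.0."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def composite_reward(prediction: str, target: str, partial_weight: float = 0.5) -> float:
    """Hypothetical composite reward: exact match plus token-overlap partial credit.

    The overlap term is a stand-in for "additional composition feedback";
    it is not the reward defined in the paper.
    """
    exact = binary_reward(prediction, target)
    pred_tokens = set(prediction.split())
    target_tokens = set(target.split())
    overlap = len(pred_tokens & target_tokens) / len(target_tokens) if target_tokens else 0.0
    return exact + partial_weight * overlap


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each sampled output's reward against its own group.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    are scored relative to the other samples drawn for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = "jump twice"
    samples = ["jump twice", "jump left", "walk left", "jump twice"]
    rewards = [binary_reward(s, target) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

With a binary reward, a group in which every sample is wrong (or every sample is right) yields zero advantages for all samples, which is one reason a composite reward with partial credit may provide a more informative signal.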

Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
