Search papers, labs, and topics across Lattice.
This paper investigates using reinforcement learning (RL) to improve compositional generalization in models, addressing the limitations of token-level supervised fine-tuning. They apply Group Relative Policy Optimization (GRPO) with both binary and composite rewards to optimize models based on final outputs. Experiments on compositional benchmarks demonstrate that RL outperforms supervised fine-tuning, particularly for complex compositions, by reshaping the output distribution and mitigating overfitting to frequent training compositions.
RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.
Compositional generalization refers to correctly interpret novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.