CUHK, NJIT | Jun 8, 2026 | arXiv:2606.09043

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

AI Summary

This paper introduces DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward models trained from pairwise preferences. During optimization, DynaCF applies semantics-preserving counterfactual perturbations, tracks the resulting margin shifts and preference flips under the current model, and downweights samples with higher shortcut sensitivity so that training relies more on task-relevant preference signals. The authors report that DynaCF consistently improves robustness in preference modeling.

Key Contribution

DynaCF measures shortcut sensitivity online and downweights highly sensitive samples in the Bradley-Terry objective, which the authors report makes reward models more robust to superficial cues.

Abstract

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
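
The reweighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the sensitivity score (absolute margin shift plus a fixed penalty when the preference flips), the exponential mapping from sensitivity to weight, the mean-one normalization, and the names `shortcut_sensitivity`, `dynamic_weights`, `flip_penalty`, and `temperature`. Margins are taken to be the reward of the chosen response minus the reward of the rejected response under the current model, and `cf_margins` are the same quantities after a semantics-preserving perturbation.

```python
import math


def bt_loss(margin):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def shortcut_sensitivity(margin, cf_margin, flip_penalty=1.0):
    """Hypothetical score: size of the margin shift, plus a penalty if the preference flips."""
    shift = abs(margin - cf_margin)
    flipped = (margin > 0) != (cf_margin > 0)
    return shift + (flip_penalty if flipped else 0.0)


def dynamic_weights(sensitivities, temperature=1.0):
    """Assumed mapping: higher sensitivity gives a lower weight; weights average to 1."""
    raw = [math.exp(-s / temperature) for s in sensitivities]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def reweighted_bt_loss(margins, cf_margins, temperature=1.0, flip_penalty=1.0):
    """Weighted mean of per-sample Bradley-Terry losses.

    In a real training loop the weights would be recomputed each step from the
    current model and treated as constants (no gradient flows through them).
    """
    sens = [
        shortcut_sensitivity(m, c, flip_penalty)
        for m, c in zip(margins, cf_margins)
    ]
    weights = dynamic_weights(sens, temperature)
    losses = [bt_loss(m) for m in margins]
    loss = sum(w * l for w, l in zip(weights, losses)) / len(margins)
    return loss, weights


if __name__ == "__main__":
    # First pair is stable under perturbation; second pair flips its preference.
    loss, weights = reweighted_bt_loss([2.0, 2.0], [2.0, -1.0])
    print(loss, weights)
```

In this sketch the second pair receives a much smaller weight than the first because its margin changes sign under the perturbation. Since the weights depend on the current margins, they change as the model trains, which is the sense in which the reweighting is dynamic rather than a static heuristic.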

RLHF & Preference Learning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
