The paper addresses the challenge of improving mathematical reasoning in small language models by focusing on structured errors in chain-of-thought reasoning. The authors introduce a MathVerifier that decomposes errors into a six-dimensional profile, producing wrongness and absurdity scores used to mine hard negatives and define per-sample importance weights. By integrating these verifier signals into a weighted Direct Preference Optimization (DPO) objective, they achieve targeted improvements in a 1.5B-parameter Qwen2.5 model, outperforming vanilla SFT and unweighted DPO.
Forget expensive reward models: this work shows how a compact MathVerifier can guide DPO to significantly improve mathematical reasoning in small language models by mining hard negatives and weighting preference pairs.
Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
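To make the pipeline concrete, below is a minimal sketch in PyTorch of how verifier-derived wrongness and absurdity scores might drive both hard-negative mining and the weighted DPO objective. The weighting function `pair_weight`, the thresholds in `mine_hard_negatives`, and all hyperparameter values are illustrative assumptions; the abstract does not specify the exact aggregation or weighting scheme.

```python
# Sketch of verifier-guided weighted DPO (illustrative; not the paper's exact code).
import torch
import torch.nn.functional as F

def mine_hard_negatives(candidates, wrong_thresh=0.5, absurd_thresh=0.3):
    """Keep rejected solutions that are structurally flawed yet plausible:
    wrong enough to be true negatives, but not so absurd as to be trivial.
    Thresholds are hypothetical."""
    return [c for c in candidates
            if c["wrongness"] > wrong_thresh and c["absurdity"] < absurd_thresh]

def pair_weight(wrongness, absurdity, alpha=1.0, eps=1e-6):
    """Hypothetical per-pair importance weight: emphasize near-correct,
    structurally flawed negatives (high wrongness, low absurdity)."""
    return (wrongness * (1.0 - absurdity) + eps) ** alpha

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      weights, beta=0.1):
    """Standard offline DPO logits, scaled per pair by verifier weights.
    All *_logps are shape-(batch,) tensors of sequence log-probabilities."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    per_pair = -F.logsigmoid(logits)             # vanilla DPO loss per pair
    weights = weights / (weights.mean() + 1e-8)  # normalize to unit mean
    return (weights * per_pair).mean()
```

One sensible design choice in such a sketch is to normalize the weights to unit mean, which keeps the effective learning rate comparable to unweighted DPO, so any gains come from reallocating gradient mass toward informative pairs rather than from a larger step size.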