The paper addresses the challenge of improving mathematical reasoning in small language models by focusing on structured errors in chain-of-thought reasoning. The authors introduce a MathVerifier that decomposes errors into a six-dimensional profile, producing wrongness and absurdity scores used to mine hard negatives and define per-sample importance weights. By integrating these verifier signals into a weighted Direct Preference Optimization (DPO) objective, they achieve targeted improvements in a 1.5B-parameter Qwen2.5 model, outperforming vanilla SFT and unweighted DPO.
Forget expensive reward models: this work shows how a compact MathVerifier can guide DPO to significantly improve mathematical reasoning in small language models by mining hard negatives and weighting preference pairs.
Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
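To make the pipeline concrete, below is a minimal sketch in PyTorch of how verifier-derived wrongness and absurdity scores might drive both hard-negative mining and the weighted DPO objective. The weighting function `pair_weight`, the thresholds in `mine_hard_negatives`, and all hyperparameter values are illustrative assumptions; the abstract does not specify the exact aggregation or weighting scheme.

```python
# Sketch of verifier-guided weighted DPO (illustrative; not the paper's exact code).
import torch
import torch.nn.functional as F

def mine_hard_negatives(candidates, wrong_thresh=0.5, absurd_thresh=0.3):
    """Keep rejected solutions that are structurally flawed yet plausible:
    wrong enough to be true negatives, but not so absurd as to be trivial.
    Thresholds are hypothetical."""
    return [c for c in candidates
            if c["wrongness"] > wrong_thresh and c["absurdity"] < absurd_thresh]

def pair_weight(wrongness, absurdity, alpha=1.0, eps=1e-6):
    """Hypothetical per-pair importance weight: emphasize near-correct,
    structurally flawed negatives (high wrongness, low absurdity)."""
    return (wrongness * (1.0 - absurdity) + eps) ** alpha

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      weights, beta=0.1):
    """Standard offline DPO logits, scaled per pair by verifier weights.
    All *_logps are shape-(batch,) tensors of sequence log-probabilities."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    per_pair = -F.logsigmoid(logits)             # vanilla DPO loss per pair
    weights = weights / (weights.mean() + 1e-8)  # normalize to unit mean
    return (weights * per_pair).mean()
```

One sensible design choice in such a sketch is to normalize the weights to unit mean, which keeps the effective learning rate comparable to unweighted DPO, so any gains come from reallocating gradient mass toward informative pairs rather than from a larger step size.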