ThinkTwice is introduced as a two-phase framework that uses Group Relative Policy Optimization (GRPO) to jointly train LLMs for reasoning and self-refinement. The method alternates between optimizing the model to solve reasoning problems and optimizing it to refine its own solutions, using the same binary correctness reward in both phases. Experiments on mathematical reasoning benchmarks with Qwen3-4B and Olmo3-7B show that ThinkTwice significantly improves both reasoning and refinement performance over online policy optimization baselines, and its training dynamics reveal an implicit "rectify-then-fortify" curriculum.
Jointly training LLMs to reason and to refine their own answers unlocks significant performance gains, outperforming standard GRPO by up to 11.5 pass@4 points on AIME.
We introduce ThinkTwice, a simple two-phase framework built on Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases and requiring no additional correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families, Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. On Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of ThinkTwice's training dynamics reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a progressively rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for reinforcement learning with verifiable rewards (RLVR).
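To make the alternating scheme concrete, here is a minimal Python sketch of one ThinkTwice step pair. Only the two-phase structure, the shared binary reward, and the group-relative advantage follow the abstract; the `policy.sample`/`policy.update` interface, the refinement prompt, and the choice of which draft to refine are illustrative assumptions, not the paper's implementation.

```python
import statistics
from typing import Callable, Dict, List


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: center rewards on the group mean and
    scale by the group standard deviation (1.0 if the group is uniform)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]


def think_twice_step(policy, problems: List[str],
                     is_correct: Callable[[str, str], bool],
                     group_size: int = 8) -> None:
    """One ThinkTwice step pair: a solve phase followed by a refine phase.
    `policy.sample(prompt)` and `policy.update(prompt, completions, advantages)`
    are hypothetical interfaces standing in for a GRPO training stack."""
    drafts: Dict[str, str] = {}

    # Phase 1 (solve): sample a group of solutions per problem, reward each
    # with binary correctness, and take a GRPO step on the solve prompt.
    for problem in problems:
        solutions = [policy.sample(problem) for _ in range(group_size)]
        rewards = [1.0 if is_correct(problem, s) else 0.0 for s in solutions]
        policy.update(problem, solutions, grpo_advantages(rewards))
        drafts[problem] = solutions[0]  # keep one of the model's own solutions

    # Phase 2 (refine): re-prompt the model with its own draft and score the
    # refinements with the same binary reward; nothing tells the model whether
    # the draft was already correct.
    for problem, draft in drafts.items():
        prompt = f"{problem}\n\nPrevious attempt:\n{draft}\n\nRevise your answer."
        refinements = [policy.sample(prompt) for _ in range(group_size)]
        rewards = [1.0 if is_correct(problem, r) else 0.0 for r in refinements]
        policy.update(prompt, refinements, grpo_advantages(rewards))
```

Because the refine phase receives no indication of whether the draft is correct, the same binary reward implicitly teaches both behaviors the abstract describes: fixing wrong drafts early in training and preserving correct ones later.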