The paper introduces Disciplined Chain-of-Thought (D-CoT), a framework that trains small language models (SLMs) to reason more efficiently and accurately by structuring the CoT process with control tags during training. By optimizing the reasoning trajectory, D-CoT mitigates reasoning drift and reduces token consumption. On Qwen3-8B, D-CoT trained on only 5,000 samples improves accuracy on GPQA-diamond by 9.9% and on MMLU-Pro (0-shot) by 9.1% while reducing computational cost. The model retains this performance even without control tags at inference.
Small language models can achieve surprisingly large gains in reasoning accuracy and efficiency by learning to follow explicit, structured reasoning paths guided by simple control tags during training.
Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
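To make the idea of control tags as training scaffolding concrete, here is a minimal sketch of how a tagged reasoning target could be assembled and how the tags could later be stripped to obtain tag-free text. Only the tag names `<TEMP_LOW>` and `<TEMP_HIGH>` come from the abstract. The sample layout, the `Answer:` suffix, and the helper names `build_training_target` and `strip_control_tags` are assumptions made for illustration, not the authors' data format or code.

```python
import re

# Tag names are taken from the abstract; the layout below is hypothetical.
CONTROL_TAGS = ("<TEMP_LOW>", "<TEMP_HIGH>")


def build_training_target(steps, answer):
    """Join (tag, text) reasoning steps into one tagged training string."""
    for tag, _ in steps:
        if tag not in CONTROL_TAGS:
            raise ValueError(f"unknown control tag: {tag}")
    body = "\n".join(f"{tag} {text}" for tag, text in steps)
    # The "Answer:" suffix is an assumed convention, not from the paper.
    return f"{body}\nAnswer: {answer}"


def strip_control_tags(text):
    """Remove control tags and the whitespace that follows them."""
    pattern = "|".join(re.escape(tag) for tag in CONTROL_TAGS)
    return re.sub(rf"(?:{pattern})\s*", "", text)


# Illustrative steps only: fact-checking under <TEMP_LOW>,
# multi-perspective exploration under <TEMP_HIGH>.
steps = [
    ("<TEMP_LOW>", "Recall the relevant fact and check it."),
    ("<TEMP_HIGH>", "Consider alternative readings of the question."),
    ("<TEMP_LOW>", "Verify the chosen option against the fact."),
]
target = build_training_target(steps, "B")
plain = strip_control_tags(target)
```

Here `target` is the tagged string that could serve as a fine-tuning target, while `plain` is the same reasoning with the scaffolding removed, which is one way to inspect what a tag-free trace would look like.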