The authors introduce AM-DeepSeek-R1-Distilled, a 1.4-million-example dataset of reasoning problems collected from open-source datasets, deduplicated, cleaned, and filtered for test-set contamination. Responses were distilled from reasoning models, primarily DeepSeek-R1, and verified against reference answers, test cases, or a reward model. Training the AM-Distill-Qwen-32B model on this dataset via supervised fine-tuning yielded better performance than DeepSeek-R1-Distill-Qwen-32B on the AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench benchmarks, demonstrating the dataset's utility for enhancing LLM reasoning.
Forget brute scaling: a meticulously curated and distilled dataset of 1.4M reasoning problems lets plain SFT outperform comparably sized, less carefully trained distilled models.
AM-DeepSeek-R1-Distilled is a large-scale dataset of thinking traces for general reasoning tasks, composed of high-quality, challenging reasoning problems. The problems are collected from numerous open-source datasets, semantically deduplicated, and meticulously cleaned to eliminate test-set contamination. All responses in the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and undergo rigorous verification: mathematical problems are validated against reference answers, code problems are checked with test cases, and other tasks are judged by a reward model. The AM-Distill-Qwen-32B model, trained with nothing more than simple supervised fine-tuning (SFT) on this data, outperformed DeepSeek-R1-Distill-Qwen-32B on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Likewise, the AM-Distill-Qwen-72B model surpassed DeepSeek-R1-Distill-Llama-70B on all benchmarks. We are releasing the 1.4 million problems and their corresponding responses to the research community, with the goal of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
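As a rough illustration of the per-task verification routing the abstract describes, here is a minimal Python sketch. The schema field names (task_type, model_answer, test_cases, etc.), the helper callables, and the reward-model threshold are all assumptions made for illustration; they are not the authors' actual pipeline or the released dataset's schema.

```python
from typing import Callable

def normalize(answer: str) -> str:
    """Crude normalization before comparing answers; hypothetical helper.
    Real math verification would need more robust equivalence checking."""
    return answer.strip().lower().replace(" ", "")

def verify_sample(
    sample: dict,
    run_tests: Callable[[str, list], bool],     # executes code against test cases
    reward_score: Callable[[str, str], float],  # reward model score for (prompt, response)
    threshold: float = 0.5,                     # acceptance cutoff: an assumption
) -> bool:
    """Route each distilled sample to the verifier the abstract describes:
    reference answers for math, test cases for code, a reward model otherwise."""
    task = sample["task_type"]  # assumed field name
    if task == "math":
        return normalize(sample["model_answer"]) == normalize(sample["reference_answer"])
    if task == "code":
        return run_tests(sample["code"], sample["test_cases"])
    return reward_score(sample["prompt"], sample["response"]) >= threshold
```

The released data itself can be pulled straight from the Hugging Face Hub; the repo id comes from the URL above, while the split and column names are whatever the authors published:

```python
from datasets import load_dataset

# Streaming avoids downloading all 1.4M examples up front.
ds = load_dataset("a-m-team/AM-DeepSeek-R1-Distilled-1.4M", streaming=True)
```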