SNU | Yonsei | May 27, 2026 | arXiv:2605.28003

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

AI Summary

The authors introduce ResearchMath-14k, a dataset of 14,056 research-level math problems curated from academic sources with a multi-agent pipeline to address the lack of training data in this domain. They also generate 220K reasoning trajectories from two open models and observe recurring avoidance behaviors such as non-attempts and fabricated references; across eight open-weight models, newer generations produce more references and more fake references per trace. Fine-tuning Qwen3 models (4B to 30B parameters) on the filtered trajectories improves over the base models by 9.2 points on average, suggesting that even imperfect reasoning attempts can provide useful supervision.

Key Contribution

Across eight open-weight models, newer generations produce 5.0× *more* fake references per trace when attempting research-level math problems, highlighting a critical challenge for trustworthy AI-driven research.

Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of $14{,}056$ problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, $220$K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce $5.6\times$ more references and $5.0\times$ more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by $9.2$ points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.

Data Curation & Synthetic Data | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
