CMU ML · HKUST · IQuest Research · SJTU · May 28, 2026 · arXiv:2605.30288

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, T. Zheng, Mingxun Zhou, Xianglong Liu

AI Summary

This paper introduces MIRA, a framework for mid-training data selection that makes source-aware rubric discovery part of the filtering process. MIRA first discovers evaluation rubrics tailored to each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training, MIRA outperforms existing selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

Key Contribution

MIRA makes rubric construction part of mid-training data selection: it discovers source-group-specific evaluation rubrics, outperforms selection baselines across nine code benchmarks, and matches the full-corpus run while using only half the tokens.

Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
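
The filtering step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each document already carries a score from a per-group student scorer (higher is better), and that each source group keeps its highest-scoring documents up to a fixed fraction of that group's tokens. The `Doc` type, the `select_by_group` function, the group labels, and the per-group 50% budget are all hypothetical choices made for illustration; the abstract does not specify how the token budget is split across groups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    group: str    # source group label (hypothetical)
    tokens: int   # token count of the document
    score: float  # assumed output of that group's student scorer


def select_by_group(docs, keep_fraction=0.5):
    """Keep the top-scoring docs of each group within a per-group token budget.

    Assumption: the budget is keep_fraction of each group's own tokens.
    """
    by_group = defaultdict(list)
    for doc in docs:
        by_group[doc.group].append(doc)

    selected = []
    for members in by_group.values():
        budget = keep_fraction * sum(d.tokens for d in members)
        used = 0
        # Visit documents from highest to lowest score within the group.
        for doc in sorted(members, key=lambda d: d.score, reverse=True):
            if used + doc.tokens > budget:
                continue  # would overshoot; a smaller later doc may still fit
            selected.append(doc)
            used += doc.tokens
    return selected
```

Calling `select_by_group(docs)` on a list of `Doc` records returns the retained subset. Scores are compared only within a group, which reflects the idea that different source groups are judged by different rubrics and their scores need not be comparable.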

Data Curation & Synthetic Data · Scalable Oversight & Alignment Theory · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
