The paper introduces MIRA, a source-aware data selection framework for LLM mid-training, where data come from heterogeneous sources and selection needs semantic criteria that adapt to each source. MIRA discovers the relevant rubrics for each source group and distills the resulting judgments into scalable student scorers for efficient full-corpus filtering. In experiments on code-oriented mid-training, MIRA matches full-corpus performance while using only half the tokens and outperforms existing selection baselines across nine code benchmarks.
Achieve the same performance with half the data: MIRA distills source-specific rubrics into scalable data scorers, enabling efficient and effective data selection for LLM mid-training.
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
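The final filtering stage described above (one scorer per source group, applied over the full corpus) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Doc` type, the `select_by_group` function, the per-group token budget, and the greedy top-score rule are all assumptions made for illustration, and the scorers stand in for the distilled student scorers, whose actual form the abstract does not specify.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical document record; field names are illustrative."""
    text: str
    source_group: str
    n_tokens: int


# A scorer maps document text to a quality score (higher is better).
# In this sketch it is a stand-in for a distilled student scorer.
Scorer = Callable[[str], float]


def select_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_fraction: float = 0.5,
) -> List[Doc]:
    """Keep the highest-scoring docs in each source group, up to a token budget.

    Assumption: the budget is applied per group as a fraction of that
    group's tokens. The abstract only says half the tokens are used overall.
    """
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    selected: List[Doc] = []
    for group, group_docs in by_group.items():
        scorer = scorers[group]  # each group is judged by its own scorer
        budget = keep_fraction * sum(d.n_tokens for d in group_docs)
        ranked = sorted(group_docs, key=lambda d: scorer(d.text), reverse=True)
        used = 0
        for d in ranked:
            if used + d.n_tokens > budget:
                break  # stop at the first doc that would exceed the budget
            selected.append(d)
            used += d.n_tokens
    return selected
```

The point of the sketch is only that scores are compared within a source group, never across groups, so that sources with different formats are not ranked against each other on a single scale.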