Jannis Vamvas

University of Zurich

Let $\mathcal{D}=\{(x_{i},a_{i}^{*})\}_{i=1}^{N}$ denote the full training set of problems with ground-truth answers. Our method proceeds in three stages:

1. Difficulty Estimation: Assign a difficulty score $d_{i}\in[1,5]$ to each problem $x_{i}$ (described below).
2. Data Partitioning: Based on the difficulty scores, partition $\mathcal{D}$ into an SFT subset $\mathcal{D}_{\text{SFT}}$ (easier, broader) and an RL subset $\mathcal{D}_{\text{RL}}$ (harder, focused):

$$\mathcal{D}_{\text{SFT}}=\{(x_{i},a_{i}^{*})\in\mathcal{D}\mid d_{i}\leq\tau\},\quad\mathcal{D}_{\text{RL}}=\{(x_{i},a_{i}^{*})\in\mathcal{D}\mid d_{i}>\tau\}, \tag{3}$$

where $\tau$ is a difficulty threshold. For $\mathcal{D}_{\text{SFT}}$, we generate reference responses $y_{i}$ using a moderate teacher model (e.g., Qwen3-
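
The partitioning rule in Eq. (3) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Problem` class, the `partition_by_difficulty` function, the toy problems, their difficulty scores, and the threshold value `tau=3.0` are all hypothetical and chosen only for illustration. The excerpt does not state the value of $\tau$ or how the scores are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """One training example (x_i, a_i*) with a difficulty score d_i."""
    question: str      # x_i
    answer: str        # a_i*, the ground-truth answer
    difficulty: float  # d_i, assumed to lie in [1, 5]


def partition_by_difficulty(problems, tau):
    """Split problems into (SFT subset, RL subset) following Eq. (3).

    Problems with d_i <= tau go to the SFT subset; problems with
    d_i > tau go to the RL subset.
    """
    for p in problems:
        if not 1 <= p.difficulty <= 5:
            raise ValueError(f"difficulty out of range [1, 5]: {p.difficulty}")
    sft_subset = [p for p in problems if p.difficulty <= tau]
    rl_subset = [p for p in problems if p.difficulty > tau]
    return sft_subset, rl_subset


# Toy data; the scores below are made up for the example.
problems = [
    Problem("2 + 2 = ?", "4", 1.0),
    Problem("Solve x^2 - 5x + 6 = 0.", "x = 2 or x = 3", 3.0),
    Problem("Sum of the first 100 positive integers?", "5050", 3.5),
    Problem("Number of primes below 100?", "25", 4.5),
]

# tau = 3.0 is an assumed threshold, not a value taken from the source.
sft_subset, rl_subset = partition_by_difficulty(problems, tau=3.0)
```

Because the two conditions $d_{i}\leq\tau$ and $d_{i}>\tau$ are complementary, every problem lands in exactly one subset, so the two lists together always cover the full training set.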
