The paper introduces Predictive Batch Scheduling (PBS), a training optimization technique that prioritizes high-loss samples during batch construction to speed up language model convergence. PBS trains a lightweight linear predictor online on static token-level features to estimate sample difficulty, which avoids the overhead of per-sample loss tracking. In experiments on a 130M-parameter transformer, PBS converges 6-13% faster as measured by evaluation loss, suggesting that token frequency statistics carry usable information about sample difficulty.
Skip the expensive per-sample loss tracking: a simple linear predictor using only token frequency statistics can accelerate language model training by up to 13%.
We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard example mining methods that demand expensive per-sample loss tracking, PBS employs a lightweight linear predictor trained online to estimate sample difficulty from static token-level features. Our predictor achieves 0.44 correlation with actual loss using only four simple features: token frequency, sequence length, vocabulary diversity, and rare token ratio. Experiments on a 130M parameter transformer demonstrate that PBS achieves 6-13% faster convergence measured by evaluation loss across training checkpoints, with the predictor's correlation improving from 0.14 to 0.44 over 10,000 training steps. These results validate that token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead.
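To make the idea concrete, here is a minimal Python sketch of what such a pipeline could look like. It is not the authors' implementation. The abstract names the four features but does not define them, so the exact formulas below (mean log frequency, log length, type/token ratio, share of tokens under a frequency threshold) are assumptions, as are the squared-error SGD update, the top-k selection rule, and every name in the code (`token_features`, `OnlineLinearPredictor`, `select_batch`, `rare_threshold`).

```python
import math


def token_features(tokens, corpus_counts, total_tokens, rare_threshold=1e-4):
    """Compute four static features for one non-empty tokenized sample.

    All four definitions are illustrative assumptions, not the paper's.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("sample must contain at least one token")
    # Relative corpus frequency of each token (0.0 for unseen tokens).
    freqs = [corpus_counts.get(t, 0) / total_tokens for t in tokens]
    mean_log_freq = sum(math.log(f + 1e-9) for f in freqs) / n  # token frequency
    log_length = math.log(n)                                    # sequence length
    diversity = len(set(tokens)) / n                            # vocabulary diversity
    rare_ratio = sum(f < rare_threshold for f in freqs) / n     # rare token ratio
    return [mean_log_freq, log_length, diversity, rare_ratio]


class OnlineLinearPredictor:
    """Linear model of per-sample loss, updated by one SGD step per observation."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(w * xi for w, xi in zip(self.w, x))

    def update(self, x, observed_loss):
        # Gradient step on 0.5 * (prediction - observed_loss) ** 2.
        err = self.predict(x) - observed_loss
        self.b -= self.lr * err
        self.w = [w - self.lr * err * xi for w, xi in zip(self.w, x)]
        return err


def select_batch(candidates, predictor, batch_size):
    """Return the batch_size candidates with the highest predicted loss.

    Each candidate is a dict with a "features" entry. Plain top-k selection
    is an assumed prioritization rule.
    """
    ranked = sorted(
        candidates,
        key=lambda c: predictor.predict(c["features"]),
        reverse=True,
    )
    return ranked[:batch_size]
```

In a training loop under these assumptions, features would be computed once per sample, `select_batch` would draw each batch from a candidate pool, and `update` would be called with the losses that the forward pass already produces for the selected samples, so no extra loss evaluation is needed. Because the features are static, the only recurring cost is one dot product per candidate.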