The paper introduces RoPE-Perturbed Self-Distillation, a training regularizer to improve positional robustness in long-context LLMs. This method perturbs RoPE indices during fine-tuning to create alternative views of the training sequence, then uses self-distillation to enforce consistent predictions across these views. Experiments on Llama-3-8B and Qwen-3-4B show significant gains on long-context benchmarks, with improvements up to 12.04% on RULER-64K and improved length extrapolation.
LLMs can be made far more robust to the position of information in long contexts by perturbing RoPE position indices during fine-tuning and training for consistent predictions across the perturbed views.
Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
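To make the mechanism concrete, here is a minimal sketch of the consistency regularizer, assuming a Hugging Face-style causal LM that accepts an explicit `position_ids` argument (so RoPE is applied at the indices we supply). The perturbation scheme (`jitter_positions`), the KL-based distillation term, and hyperparameters such as `max_shift` and `alpha` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def jitter_positions(seq_len: int, max_shift: int, device) -> torch.Tensor:
    """Build a perturbed, monotonically increasing RoPE index sequence by
    inserting random gaps, which effectively relocates spans of the context
    to different absolute positions (hypothetical perturbation scheme)."""
    gaps = torch.randint(0, max_shift + 1, (seq_len,), device=device)
    return torch.arange(seq_len, device=device) + torch.cumsum(gaps, dim=0)

def rope_perturbed_self_distillation_loss(model, input_ids, labels,
                                          max_shift: int = 64,
                                          alpha: float = 1.0):
    seq_len = input_ids.shape[1]
    base_pos = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)

    # Standard view: ordinary language-modeling loss at canonical positions.
    base_out = model(input_ids, position_ids=base_pos, labels=labels)

    # Alternative view: identical tokens, perturbed RoPE indices.
    pert_pos = jitter_positions(seq_len, max_shift,
                                input_ids.device).unsqueeze(0)
    pert_out = model(input_ids, position_ids=pert_pos)

    # Self-distillation: match the perturbed view's predictions to the
    # (detached) standard view's predictions via KL divergence.
    teacher = F.log_softmax(base_out.logits.detach(), dim=-1)
    student = F.log_softmax(pert_out.logits, dim=-1)
    kl = F.kl_div(student, teacher, log_target=True, reduction="batchmean")

    # Total objective: LM loss on the standard view plus the consistency term.
    return base_out.loss + alpha * kl
```

Detaching the standard view's logits treats it as the teacher, so gradients from the consistency term only push the perturbed view toward the canonical predictions, which is what discourages the model from keying on absolute position.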