The paper introduces RoPE-Perturbed Self-Distillation, a training regularizer to improve positional robustness in long-context LLMs. This method perturbs RoPE indices during fine-tuning to create alternative views of the training sequence, then uses self-distillation to enforce consistent predictions across these views. Experiments on Llama-3-8B and Qwen-3-4B show significant gains on long-context benchmarks, with improvements up to 12.04% on RULER-64K and improved length extrapolation.
LLMs can be made far more robust to the position of information in long contexts by perturbing RoPE position indices during fine-tuning and training for consistent predictions across the perturbed views.
Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
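To make the mechanism concrete, here is a minimal sketch of the consistency regularizer, assuming a Hugging Face-style causal LM that accepts an explicit `position_ids` argument (so RoPE is applied at the indices we supply). The perturbation scheme (`jitter_positions`), the KL-based distillation term, and hyperparameters such as `max_shift` and `alpha` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def jitter_positions(seq_len: int, max_shift: int, device) -> torch.Tensor:
    """Build a perturbed, monotonically increasing RoPE index sequence by
    inserting random gaps, which effectively relocates spans of the context
    to different absolute positions (hypothetical perturbation scheme)."""
    gaps = torch.randint(0, max_shift + 1, (seq_len,), device=device)
    return torch.arange(seq_len, device=device) + torch.cumsum(gaps, dim=0)

def rope_perturbed_self_distillation_loss(model, input_ids, labels,
                                          max_shift: int = 64,
                                          alpha: float = 1.0):
    seq_len = input_ids.shape[1]
    base_pos = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)

    # Standard view: ordinary language-modeling loss at canonical positions.
    base_out = model(input_ids, position_ids=base_pos, labels=labels)

    # Alternative view: identical tokens, perturbed RoPE indices.
    pert_pos = jitter_positions(seq_len, max_shift,
                                input_ids.device).unsqueeze(0)
    pert_out = model(input_ids, position_ids=pert_pos)

    # Self-distillation: match the perturbed view's predictions to the
    # (detached) standard view's predictions via KL divergence.
    teacher = F.log_softmax(base_out.logits.detach(), dim=-1)
    student = F.log_softmax(pert_out.logits, dim=-1)
    kl = F.kl_div(student, teacher, log_target=True, reduction="batchmean")

    # Total objective: LM loss on the standard view plus the consistency term.
    return base_out.loss + alpha * kl
```

Detaching the standard view's logits treats it as the teacher, so gradients from the consistency term only push the perturbed view toward the canonical predictions, which is what discourages the model from keying on absolute position.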