Apr 9, 2026 | arXiv:2604.07809

PolicyLong: Towards On-Policy Context Extension

Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Tinghao Yu, Ting-Ting Yu, Feng Zhang

AI Summary

PolicyLong addresses the scarcity of high-quality long-context training data by generating that data on-policy. At each iteration it re-runs data screening (entropy computation, retrieval, and verification) with the current model, so the training distribution stays aligned with the model's evolving capabilities. Experiments with Qwen2.5-3B on RULER, HELMET, and LongBench-v2 show that PolicyLong outperforms the off-policy baselines EntropyLong and NExtLong, with larger gains at longer context lengths, which points to the value of dynamic data generation.

Key Contribution

On-policy data generation narrows the gap between the training distribution and the model's current capabilities, with a reported gain of +2.54 on RULER at 128K context length. The result suggests that long-context training benefits from data that evolves with the model.

Abstract

Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.
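The screening loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Model` callable interface, the `min_drop` threshold, the `train_step` hook, and the function names are all hypothetical, retrieval is replaced by a fixed candidate list, and hard-negative mining is omitted. It only shows the general idea of keeping contexts that reduce the current model's predictive entropy and repeating that screening after each update.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (context, target) -> predictive distribution over a toy vocabulary.
Model = Callable[[str, str], Sequence[float]]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def screen(model: Model, target: str, candidates: Sequence[str],
           min_drop: float) -> List[Tuple[str, float]]:
    """Keep candidate contexts that lower the model's entropy on `target`
    by at least `min_drop` (an assumed verification threshold)."""
    base = entropy(model("", target))  # entropy with no added context
    kept = []
    for ctx in candidates:
        drop = base - entropy(model(ctx, target))
        if drop >= min_drop:
            kept.append((ctx, drop))
    return kept


def on_policy_rounds(model: Model, train_step: Callable, targets: Sequence[str],
                     candidates: Sequence[str], rounds: int, min_drop: float) -> Model:
    """Re-run screening with the *current* model before every update,
    rather than screening once with a fixed model (the off-policy setup)."""
    for _ in range(rounds):
        data = [(t, screen(model, t, candidates, min_drop)) for t in targets]
        model = train_step(model, data)  # hypothetical update returning the new model
    return model
```

In an off-policy pipeline, `screen` would be called once with the base model and its output reused for all training; the only structural change in this sketch is that the call sits inside the loop and sees the updated `model`.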

Architecture Design (Transformers, SSMs, MoE) | Data Curation & Synthetic Data | Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 26
Year: 2026
Venue: N/A
