This paper presents a large-scale study on training long-context vision language models (VLMs) up to 344K context tokens for long-document visual question answering and transfer to long-context text. The authors systematically investigate continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, addressing the lack of reproducibility in existing open-weight VLMs. Key findings include the importance of matching training and evaluation context lengths, the benefits of page indices, the effectiveness of synthetic data pipelines for self-improvement, and the transferability of visual long context training to long-context text tasks, achieving SOTA on MMLongBenchDoc.
Training VLMs on context lengths that match the evaluation context lengths yields better performance than training on even longer contexts, challenging the assumption that longer training contexts are always better.
We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong such models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long-context transfer to the reverse direction, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.
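As an illustration of finding (ii), the sketch below shows one way page indices could be attached to a multi-page document prompt: a short text label is placed before each page image. The abstract does not specify the authors' actual format, so the label string `Page N:`, the helper name `build_page_indexed_prompt`, and the message-part dictionary layout are all assumptions made for this example.

```python
from typing import Dict, List


def build_page_indexed_prompt(page_images: List[str], question: str) -> List[Dict[str, str]]:
    """Interleave a 1-based text page label before each page image.

    Hypothetical helper: the label format and part layout are assumptions,
    not the format used in the paper.
    """
    parts: List[Dict[str, str]] = []
    for page_number, image_ref in enumerate(page_images, start=1):
        # Text label carrying the page index, placed before the page image.
        parts.append({"type": "text", "text": f"Page {page_number}:"})
        parts.append({"type": "image", "image": image_ref})
    # The question comes last, after all labelled pages.
    parts.append({"type": "text", "text": question})
    return parts


prompt = build_page_indexed_prompt(
    ["p1.png", "p2.png"],
    "Which page shows the revenue table?",
)
```

Under this assumed layout, the same labelling would be applied to both training and evaluation prompts, so the model can refer to a page by its number when answering.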