AI Laboratory, Fudan, HKU, NJU, Shanghai AI Lab, University of Electronic Science and Technology, University of Science and Technology, ZJU
Jun 4, 2026
arXiv:2606.05769

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

AI Summary

This paper introduces Future-L1, an interleaved latent visual reasoning framework for video event prediction (VEP) that lets a model alternate between language tokens and continuous latent visual spans during autoregressive decoding. The model is trained on Future-L1-50K, a dataset of examples selected because future visual hints help prediction, and is further optimized with LA-DAPO, a latent-aware reinforcement learning objective. Future-L1 sets new state-of-the-art results on FutureBench and TwiFF-Bench: on FutureBench it improves Qwen3-VL-8B from 61.0 to 85.4, and on TwiFF-Bench it raises the average score from 2.44 to 3.04. The findings suggest that keeping intermediate visual semantics in latent space yields more accurate predictions and reduces the visually ungrounded hallucinations that can arise from purely text-based reasoning.
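To make the idea of alternating between language tokens and latent spans concrete, here is a minimal toy sketch of such a decoding loop. It is not the paper's implementation: the marker tokens `<latent>` and `</latent>`, the fixed span length, the `step_fn` interface, and the scripted `toy_step` model are all assumptions made for illustration. The only point it shows is the control flow: in text mode the next token is appended, while in latent mode the hidden state is appended directly, without being turned into a token.

```python
from typing import Callable, List, Tuple, Union

Vector = List[float]
Step = Union[str, Vector]  # a text token or a continuous latent vector

LATENT_START = "<latent>"   # hypothetical marker token (assumption)
LATENT_END = "</latent>"    # hypothetical marker token (assumption)
EOS = "<eos>"


def interleaved_decode(
    step_fn: Callable[[List[Step]], Tuple[str, Vector]],
    prompt: List[Step],
    latent_span_len: int = 2,   # fixed span length is an assumption
    max_steps: int = 16,
) -> List[Step]:
    """Decode while alternating between text tokens and latent vectors.

    step_fn stands in for one forward pass: given the sequence so far,
    it returns the next text token and the last hidden state.
    """
    seq: List[Step] = list(prompt)
    latent_left = 0
    for _ in range(max_steps):
        token, hidden = step_fn(seq)
        if latent_left > 0:
            # Latent mode: feed the hidden state back without discretizing it.
            seq.append(hidden)
            latent_left -= 1
            if latent_left == 0:
                seq.append(LATENT_END)
            continue
        # Text mode: append the predicted token.
        seq.append(token)
        if token == LATENT_START:
            latent_left = latent_span_len
        elif token == EOS:
            break
    return seq


def toy_step(seq: List[Step]) -> Tuple[str, Vector]:
    """Scripted stand-in for a model, keyed on the current sequence length."""
    script = {1: "The", 2: LATENT_START, 6: "cup", 7: "falls", 8: EOS}
    hidden = [float(len(seq)), 0.5]
    return script.get(len(seq), "<unk>"), hidden


decoded = interleaved_decode(toy_step, ["<video>"])
```

With the scripted model, `decoded` contains two text tokens, one latent span of two vectors wrapped in the marker tokens, and then the remaining text tokens. A real system would replace `toy_step` with a model forward pass and would likely let the model decide when a latent span ends.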

Key Contribution

Future-L1 suggests that preserving intermediate visual semantics in latent space improves video event prediction: it exceeds the previous best model, Video-CoE, by 10.4 points on FutureBench and raises the TwiFF-Bench average score from 2.44 to 3.04.

Abstract

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
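The margins implied by the reported scores can be worked out directly. The short calculation below uses only the numbers quoted in the abstract; the variable names are illustrative, and the Video-CoE score is derived from the stated 10.4-point margin rather than quoted from the source.

```python
# Scores quoted in the abstract.
futurebench_base = 61.0        # Qwen3-VL-8B on FutureBench
futurebench_future_l1 = 85.4   # Future-L1 on FutureBench
margin_over_video_coe = 10.4   # Future-L1 minus Video-CoE on FutureBench
twiff_before = 2.44            # TwiFF-Bench average before
twiff_after = 3.04             # TwiFF-Bench average with Future-L1

# Absolute gain over the base model on FutureBench.
futurebench_gain = round(futurebench_future_l1 - futurebench_base, 1)

# Video-CoE score implied by the stated margin (derived, not quoted).
implied_video_coe = round(futurebench_future_l1 - margin_over_video_coe, 1)

# Absolute gain in the TwiFF-Bench average score.
twiff_gain = round(twiff_after - twiff_before, 2)

print(futurebench_gain, implied_video_coe, twiff_gain)
```

So the FutureBench gain over Qwen3-VL-8B is 24.4 points, the implied Video-CoE score is 75.0, and the TwiFF-Bench average rises by 0.60.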

Multimodal Models, Reasoning & Chain-of-Thought, World Models & Planning

