ShanghaiTech | SYSU | Feb 25, 2026 | arXiv:2602.22142

WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

AI Summary

The paper introduces WeaveTime, a framework to address time-agnosticism in Video-LLMs operating in streaming settings, where frames arrive sequentially. WeaveTime employs a Temporal Reconstruction objective to instill order-aware representations and a Past-Current Dynamic Focus Cache for uncertainty-triggered history retrieval. Experiments on streaming benchmarks demonstrate that WeaveTime improves accuracy and reduces latency without architectural changes to existing Video-LLMs.

Key Contribution

Video-LLMs can now stream more effectively: WeaveTime teaches them to perceive temporal order and focus dynamically on relevant history, boosting accuracy and cutting latency without requiring architectural changes.

Abstract

Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence. This yields two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into an existing Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
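To make the phrase "uncertainty-triggered, coarse-to-fine retrieval" concrete, here is a minimal sketch of one way such a cache policy could look. It is not the authors' implementation: the abstract does not specify the uncertainty measure, the chunking scheme, or the similarity score, so the entropy threshold, fixed-size chunks, cosine similarity, and every name below (`select_context`, `recent`, `chunk`, `threshold`, and so on) are assumptions chosen purely for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def mean_vec(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_context(
    history: Sequence[Vector],     # per-frame features, oldest first (assumed)
    query: Vector,                 # feature of the current question (assumed)
    answer_probs: Sequence[float], # model's distribution using recent frames only
    recent: int = 2,               # size of the "current" window (hypothetical)
    chunk: int = 4,                # frames per coarse chunk (hypothetical)
    top_chunks: int = 1,
    top_frames: int = 2,
    threshold: float = 0.5,        # entropy trigger (hypothetical)
) -> List[int]:
    """Return the frame indices the model should attend to."""
    n = len(history)
    recent_idx = list(range(max(0, n - recent), n))

    # Confident on the current window: do not expand history.
    if entropy(answer_probs) <= threshold:
        return recent_idx

    past = list(range(0, max(0, n - recent)))
    if not past:
        return recent_idx

    # Coarse stage: rank fixed-size chunks of past frames by mean feature.
    chunks = [past[i:i + chunk] for i in range(0, len(past), chunk)]
    ranked = sorted(
        chunks,
        key=lambda c: cosine(mean_vec([history[i] for i in c]), query),
        reverse=True,
    )
    candidates = [i for c in ranked[:top_chunks] for i in c]

    # Fine stage: keep only the best-matching frames inside those chunks.
    best = sorted(candidates, key=lambda i: cosine(history[i], query),
                  reverse=True)[:top_frames]

    # Restore chronological order so the retrieved context stays time-causal.
    return sorted(best) + recent_idx
```

In this sketch, a low-entropy answer distribution keeps the context at the most recent frames, while a high-entropy one triggers the two-stage lookup; returning indices in chronological order is a design choice meant to mirror the abstract's emphasis on order, not a detail taken from the paper.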

Architecture Design (Transformers, SSMs, MoE) | Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
