This paper investigates how video-based supervised fine-tuning (Video-SFT) affects the spatial and temporal understanding of MLLMs, and finds a trade-off: improvements in video understanding often come with limited gains or even degradation on static image benchmarks. The authors show that increasing the number of sampled frames generally improves video performance but does not reliably improve static image performance, and they study an instruction-aware Hybrid-Frame strategy that partially mitigates the trade-off. The study concludes that Video-SFT is not universally beneficial and highlights the challenge of balancing spatial and temporal understanding in MLLMs.
Video fine-tuning boosts MLLMs' video smarts but can dumb them down on static images, a trade-off you can't simply brute-force away with more frames.
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
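The abstract does not spell out how the Hybrid-Frame strategy allocates frames, so the following is only an illustrative sketch of what instruction-aware allocation could look like, not the authors' method. It assumes a simple keyword heuristic: instructions with temporal cues get a large frame budget, and all others get a small one. The names `TEMPORAL_CUES`, `allocate_frames`, and `uniform_frame_indices`, along with the budgets of 4 and 32 frames, are hypothetical choices made for this example.

```python
import re

# Hypothetical cue list: words suggesting the question needs temporal reasoning.
TEMPORAL_CUES = frozenset({
    "before", "after", "then", "during", "while", "first", "next",
    "finally", "order", "sequence", "motion", "moving",
})


def allocate_frames(instruction: str, min_frames: int = 4, max_frames: int = 32) -> int:
    """Pick a frame budget from the instruction text (assumed heuristic).

    Returns max_frames if any temporal cue word appears, else min_frames.
    """
    tokens = re.findall(r"[a-z]+", instruction.lower())
    has_temporal_cue = any(tok in TEMPORAL_CUES for tok in tokens)
    return max_frames if has_temporal_cue else min_frames


def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return evenly spaced frame indices, taking the midpoint of each segment."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames available than requested: use every frame once.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int((i + 0.5) * segment) for i in range(num_frames)]


def sample_for_instruction(instruction: str, total_frames: int) -> list[int]:
    """Combine the two steps: choose a budget, then sample that many frames."""
    return uniform_frame_indices(total_frames, allocate_frames(instruction))
```

A spatial-style question such as "What color is the car?" would receive the small budget under this heuristic, while "What happens after the man opens the door?" would receive the large one. A learned or model-based classifier could replace the keyword check without changing the sampling step.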