Microsoft Research | BU | May 21, 2026 | arXiv:2605.22678

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

AI Summary

Swift Sampling, a training-free frame selection algorithm, uses a Taylor series expansion to predict the trajectory of visual features in a video's latent space. It selects the frames whose actual visual features deviate most from this predicted trajectory, treating them as temporally surprising and information-rich. Across three long-video question answering benchmarks and 10 downstream tasks, the method outperforms uniform sampling and prior query-agnostic baselines, especially under limited frame budgets, while adding only 0.02x computational cost over the baseline.

Key Contribution

Skip the training and the hyperparameter tuning: Swift Sampling uses a Taylor series to find the most informative frames in a video, outperforming uniform sampling and prior query-agnostic baselines with 30x less overhead than leading baselines.

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02x additional computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
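
The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how velocity, acceleration, or surprise are computed, so this version assumes backward finite differences over per-frame feature vectors, a second-order Taylor prediction of the next frame, the Euclidean distance as the surprise score, and a plain top-k selection. The function names `taylor_surprise_scores` and `select_surprising_frames` and the `budget` parameter are hypothetical.

```python
import numpy as np


def taylor_surprise_scores(features: np.ndarray) -> np.ndarray:
    """Score each frame by how far it lands from a Taylor-predicted position.

    `features` has shape (num_frames, dim), one feature vector per frame.
    Assumption: derivatives are approximated by backward finite differences;
    the paper's exact formulation may differ.
    """
    features = np.asarray(features, dtype=np.float64)
    num_frames = features.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        # Three past frames are needed to predict a fourth one.
        return scores

    # velocity[i] = f[i+1] - f[i]
    velocity = features[1:] - features[:-1]
    # acceleration[i] = f[i+2] - 2*f[i+1] + f[i]
    acceleration = velocity[1:] - velocity[:-1]

    # Second-order Taylor step from frame t to frame t+1, for t = 2..T-2:
    #   f_pred[t+1] = f[t] + v[t] + 0.5 * a[t]
    predicted = features[2:-1] + velocity[1:-1] + 0.5 * acceleration[:-1]

    # Surprise = distance between the actual and the predicted features.
    scores[3:] = np.linalg.norm(features[3:] - predicted, axis=1)
    return scores


def select_surprising_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` highest-scoring frames, in temporal order."""
    scores = taylor_surprise_scores(features)
    budget = min(budget, len(scores))
    # Stable sort on negated scores: ties go to the earlier frame.
    top = np.argsort(-scores, kind="stable")[:budget]
    return np.sort(top)
```

In this sketch a single abrupt change also raises the scores of the next two frames, because the finite differences carry the jump forward. The first three frames always score zero, since no prediction exists for them.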

Computer Vision | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
