This paper introduces TTA-Vid, a test-time adaptation approach for video reasoning that leverages reinforcement learning to adapt a pre-trained model to new video samples without labels. TTA-Vid uses step-by-step reasoning on frame subsets and a batch-aware frequency-based reward as pseudo-ground truth to update the model during inference. The results demonstrate that TTA-Vid generalizes well across datasets and outperforms state-of-the-art methods trained on large-scale labeled data, highlighting the potential of test-time RL for temporal multimodal understanding.
Forget finetuning: TTA-Vid adapts video reasoning models to new datasets *during inference* using test-time reinforcement learning, achieving state-of-the-art results without any labels.
Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video (TTA-Vid) combines two components that work simultaneously: (1) step-by-step reasoning performed at inference time on multiple frame subsets, and (2) a batch-aware frequency-based reward, computed across these frame subsets, that serves as pseudo ground truth for updating the model. We show that a model adapted on a single batch, or even a single sample, from a dataset generalizes at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms current state-of-the-art methods trained on large-scale data, highlighting the potential of test-time reinforcement learning for temporal multimodal understanding.
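To make the reward concrete, here is a minimal sketch of the frequency-based pseudo-labeling step, simplified to a single video rather than the batch-aware formulation described in the abstract. The helper `answer_fn`, the subset sizes, and the binary agreement reward are illustrative assumptions, not the paper's exact design.

```python
from collections import Counter
from typing import Callable, Sequence
import random

def frequency_reward(
    answer_fn: Callable[[Sequence[int]], str],
    num_frames: int,
    subset_size: int = 8,
    num_subsets: int = 16,
) -> tuple[str, list[tuple[list[int], float]]]:
    """Majority-vote pseudo-labeling over frame subsets (single-video sketch).

    `answer_fn` stands in for the model's step-by-step reasoning on a chosen
    subset of frame indices; it returns a final answer string. The most
    frequent answer across subsets acts as pseudo ground truth, and each
    rollout is rewarded 1.0 if it agrees with it, 0.0 otherwise.
    """
    rollouts = []
    for _ in range(num_subsets):
        # Sample a random subset of frame indices and run one reasoning rollout.
        subset = sorted(random.sample(range(num_frames), subset_size))
        rollouts.append((subset, answer_fn(subset)))

    # The answer appearing most often across rollouts becomes the pseudo label.
    pseudo_label, _ = Counter(ans for _, ans in rollouts).most_common(1)[0]
    rewarded = [(subset, 1.0 if ans == pseudo_label else 0.0)
                for subset, ans in rollouts]
    return pseudo_label, rewarded
```

In the paper, the rewarded rollouts would then drive the reinforcement-learning update of the model at inference time; the specific RL algorithm is not stated in the abstract, so the update step is omitted here.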
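The abstract likewise does not spell out the bandit formulation for adaptive frame selection, so the following sketch uses UCB1, a standard multi-armed bandit strategy, as one plausible way to prioritize informative frames with the same agreement reward. The class name `FrameBandit` and the per-frame credit-assignment scheme are assumptions.

```python
import math

class FrameBandit:
    """UCB1 bandit over frame indices: frames whose inclusion tends to yield
    rewarded (majority-agreeing) answers get selected more often.
    """

    def __init__(self, num_frames: int, c: float = 1.4):
        self.c = c
        self.counts = [0] * num_frames    # times each frame was selected
        self.values = [0.0] * num_frames  # running mean reward per frame

    def select(self, subset_size: int) -> list[int]:
        """Pick the subset_size frames with the highest UCB scores."""
        total = sum(self.counts) + 1

        def ucb(i: int) -> float:
            if self.counts[i] == 0:
                return float("inf")  # try unseen frames first
            return self.values[i] + self.c * math.sqrt(
                math.log(total) / self.counts[i]
            )

        return sorted(range(len(self.counts)), key=ucb, reverse=True)[:subset_size]

    def update(self, subset: list[int], reward: float) -> None:
        # Credit the rollout's shared reward to every frame in the subset
        # via an incremental mean.
        for i in subset:
            self.counts[i] += 1
            self.values[i] += (reward - self.values[i]) / self.counts[i]
```

In this pairing, `select()` would propose the frame subsets fed to the rollouts sketched above, and each rollout's agreement reward would be passed back through `update()`, so frame selection and model adaptation are guided by the same reward formulation.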