Feb 18, 2026 | arXiv:2602.16160

Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

AI Summary

The paper introduces UncL-STARK, a method that adapts the inference depth of transformer-based visual trackers at runtime using an uncertainty estimate derived from corner localization heatmaps. The motivation is that running the full encoder-decoder stack on every frame wastes computation when most frames in a video are temporally coherent. The model is first fine-tuned with random-depth training and knowledge distillation so that its predictions stay reliable when the stack is truncated. On GOT-10k and LaSOT, the resulting tracker reduces GFLOPs by up to 12%, latency by 8.9%, and energy by 10.8% while keeping tracking accuracy within 0.2% of the full-depth baseline.
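The fine-tuning step described above can be illustrated with a minimal sketch. This page does not give the paper's loss or sampling scheme, so everything here is an assumption made for illustration: the names `sample_depths` and `distillation_loss`, the 6-layer encoder and decoder limits, the uniform depth sampling, the L1 box distance, and the `alpha` weighting. The idea shown is only the general pattern: pick a random truncation depth per training step, then pull the truncated model's prediction toward both the full-depth (teacher) prediction and the ground truth.

```python
import random


def sample_depths(max_enc=6, max_dec=6, min_depth=1, rng=random):
    """Draw a random (encoder_depth, decoder_depth) pair for one training step.

    The layer counts and the uniform sampling are assumptions, not values
    taken from the paper.
    """
    return rng.randint(min_depth, max_enc), rng.randint(min_depth, max_dec)


def l1_distance(box_a, box_b):
    """Mean absolute difference between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b)) / len(box_a)


def distillation_loss(student_box, teacher_box, target_box, alpha=0.5):
    """Blend a distillation term and a supervised term (hypothetical form).

    student_box: prediction of the truncated model
    teacher_box: prediction of the full-depth model on the same input
    target_box:  ground-truth box
    alpha:       weight on the distillation term (assumed hyperparameter)
    """
    distill = l1_distance(student_box, teacher_box)
    supervised = l1_distance(student_box, target_box)
    return alpha * distill + (1.0 - alpha) * supervised
```

In a real training loop the sampled depths would decide how many encoder and decoder layers run for that step, and the loss would be backpropagated through the truncated forward pass.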

Key Contribution

Transformer-based visual trackers can cut GFLOPs by up to 12% while staying within 0.2% of full-depth tracking accuracy by adjusting their encoder and decoder depth per frame according to prediction uncertainty.

Abstract

Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
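The runtime loop in the abstract (estimate uncertainty from the corner heatmaps, then choose the depth for the next frame) can be sketched as follows. The abstract does not say how the uncertainty is computed or how the policy maps it to a depth, so this sketch substitutes its own choices: normalized Shannon entropy of each heatmap as the uncertainty measure, and a two-threshold rule as the policy. The function names, the thresholds `low` and `high`, and the depth range 2 to 6 are all hypothetical.

```python
import math


def heatmap_entropy(heatmap):
    """Normalized Shannon entropy of a 2D map of non-negative scores.

    Returns a value in [0, 1]: 0 for a single sharp peak, 1 for a flat map.
    Entropy is an assumed stand-in for the paper's uncertainty estimate.
    """
    flat = [v for row in heatmap for v in row]
    total = sum(flat)
    if total <= 0 or len(flat) < 2:
        return 1.0  # degenerate map: treat as maximally uncertain
    probs = [v / total for v in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(flat))


def corner_uncertainty(top_left_map, bottom_right_map):
    """Combine both corner heatmaps by taking the less confident one."""
    return max(heatmap_entropy(top_left_map), heatmap_entropy(bottom_right_map))


def next_depth(uncertainty, current_depth, min_depth=2, max_depth=6,
               low=0.3, high=0.6):
    """Pick the depth for the NEXT frame from this frame's uncertainty.

    High uncertainty: jump back to full depth.
    Low uncertainty:  drop one layer, never below min_depth.
    Otherwise:        keep the current depth.
    """
    if uncertainty > high:
        return max_depth
    if uncertainty < low:
        return max(min_depth, current_depth - 1)
    return current_depth


# Example: a sharply peaked pair of heatmaps lets the tracker go shallower.
peaked = [[0.0, 0.0], [0.0, 1.0]]
depth = next_depth(corner_uncertainty(peaked, peaked), current_depth=6)
```

Resetting straight to full depth on high uncertainty while stepping down only one layer at a time is a deliberately conservative design choice in this sketch: a wrong shallow prediction costs accuracy on later frames, whereas a wrong deep prediction only costs some compute.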

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
