The paper introduces UncL-STARK, a method that dynamically adapts the inference depth of transformer-based visual trackers using an uncertainty estimate derived from corner localization heatmaps. The motivation is that full-depth inference is often unnecessary on temporally coherent frames, so running the full stack on every frame wastes computation. By fine-tuning the model with random-depth training and knowledge distillation, UncL-STARK reduces GFLOPs by up to 12%, latency by 8.9%, and energy consumption by 10.8% while keeping tracking accuracy within 0.2% of the full-depth baseline.
Transformer-based visual trackers can cut GFLOPs by up to 12% while staying within 0.2% of full-depth accuracy by dynamically adjusting their depth based on uncertainty.
Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference: they execute the full encoder-decoder stack for every frame regardless of visual complexity, incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame from the current prediction confidence, exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
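The feedback-driven policy described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how uncertainty is computed from the corner heatmaps or how depth is chosen, so everything concrete here is an assumption made for illustration: normalized Shannon entropy as the uncertainty measure, the hypothetical function names `heatmap_entropy` and `next_depth`, the thresholds `low` and `high`, the depth range of 2 to 6 layers, and the use of a single shared depth rather than separate encoder and decoder depths.

```python
import math
from typing import Sequence


def heatmap_entropy(heatmap: Sequence[Sequence[float]]) -> float:
    """Normalized Shannon entropy in [0, 1] of a non-negative score map.

    Assumption: entropy stands in for the paper's uncertainty estimate.
    A sharply peaked heatmap gives a value near 0 (confident corner),
    a flat heatmap gives a value near 1 (uncertain corner).
    """
    flat = [max(v, 0.0) for row in heatmap for v in row]
    total = sum(flat)
    n = len(flat)
    if total <= 0.0 or n < 2:
        return 1.0  # no usable evidence: treat as maximally uncertain
    entropy = -sum((v / total) * math.log(v / total) for v in flat if v > 0.0)
    return min(1.0, max(0.0, entropy / math.log(n)))


def next_depth(
    current: int,
    uncertainty: float,
    *,
    min_depth: int = 2,   # hypothetical lower bound on layers
    max_depth: int = 6,   # hypothetical full depth
    low: float = 0.3,     # hypothetical "confident" threshold
    high: float = 0.6,    # hypothetical "uncertain" threshold
) -> int:
    """Pick the depth for the NEXT frame from the current frame's uncertainty."""
    if uncertainty > high:
        return max_depth                    # fall back to full depth
    if uncertainty < low:
        return max(min_depth, current - 1)  # confident: try one layer fewer
    return current                          # in between: keep the current depth


# Usage: a peaked heatmap lets the tracker go shallower on the next frame,
# while a flat heatmap sends it back to full depth.
peaked = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat = [[1.0, 1.0, 1.0] for _ in range(3)]
depth_after_peaked = next_depth(6, heatmap_entropy(peaked))
depth_after_flat = next_depth(3, heatmap_entropy(flat))
```

The asymmetry in this sketch (step down one layer at a time, jump straight back to full depth) is a conservative design choice for illustration, not something the abstract states: a wrong shallow prediction costs accuracy, whereas an unnecessary full-depth frame only costs compute.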