Mar 2, 2026

arXiv:2603.01706

Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking

AI Summary

This paper addresses the accuracy-efficiency trade-off in Siamese visual trackers by introducing a Multilayer Perceptron (MLP)-based fusion module for pixel-level interaction. Because the computational cost of naively stacked MLP blocks can scale quadratically with channel width, the authors construct a hierarchical search space and employ differentiable neural architecture search (DNAS) with a customized relaxation strategy that decouples channel-width optimization from other architectural choices. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs, running in real time on resource-constrained GPUs and NPUs while ranking among the top performers on four general-purpose and three aerial tracking benchmarks.

Key Contribution

Replacing convolutional and Transformer-based fusion in the Siamese neck with an MLP-based fusion module, and tuning that module with differentiable neural architecture search, yields a tracker with state-of-the-art accuracy-efficiency trade-offs, even on resource-constrained hardware.

Abstract

Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
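The quadratic-cost remark and the idea of relaxing the width choice can be illustrated with a small sketch. This is not the authors' method: the abstract does not specify the block design or the customized relaxation, so the code below assumes a generic two-layer channel-mixing MLP (C -> e*C -> C) and a plain softmax relaxation over candidate widths, as commonly used in DNAS. All names (`mlp_block_macs`, `expected_macs`, the candidate widths, the 16x16 token grid) are hypothetical.

```python
import math

def mlp_block_macs(channels: int, tokens: int, expansion: int = 4) -> int:
    """Multiply-accumulates for an assumed two-layer channel MLP
    (C -> expansion*C -> C) applied independently to each of `tokens` positions."""
    hidden = expansion * channels
    # Two dense layers per position: C*hidden + hidden*C = 2*expansion*C^2
    return tokens * (channels * hidden + hidden * channels)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_macs(width_logits, candidate_widths, tokens, expansion=4):
    """Softmax-weighted cost over candidate widths: a differentiable proxy
    that a generic DNAS setup could penalize (assumption, not the paper's loss)."""
    probs = softmax(width_logits)
    return sum(p * mlp_block_macs(c, tokens, expansion)
               for p, c in zip(probs, candidate_widths))

tokens = 16 * 16                      # hypothetical 16x16 feature map
widths = [64, 128, 256]               # hypothetical candidate channel widths

for c in widths:
    print(c, mlp_block_macs(c, tokens))

# Doubling the width quadruples the cost of the block.
print(mlp_block_macs(128, tokens) / mlp_block_macs(64, tokens))

# With uniform logits the expected cost is the mean over candidates.
print(expected_macs([0.0, 0.0, 0.0], widths, tokens))
```

Under these assumptions the per-block cost is 2·e·C² per position, which is why width, rather than depth, dominates the budget and why a search that treats width separately from other choices is a natural fit.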

Architecture Design (Transformers, SSMs, MoE)

Computer Vision Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
