This paper addresses the accuracy-efficiency trade-off in Siamese visual trackers by introducing a Multilayer Perceptron (MLP)-based fusion module for pixel-level interaction. To mitigate the quadratic computational cost associated with stacking MLP blocks, the authors construct a hierarchical search space and employ differentiable neural architecture search (DNAS) with a customized relaxation strategy to decouple channel-width optimization. The resulting tracker achieves a state-of-the-art accuracy-efficiency trade-off, running in real time on both GPUs and NPUs while performing competitively on multiple tracking benchmarks.
Ditching convolutions and transformers, a new Siamese tracker uses MLPs and neural architecture search to achieve state-of-the-art accuracy-efficiency trade-offs in visual tracking, even on resource-constrained hardware.
Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
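The quadratic-cost remark in the abstract can be illustrated with a small back-of-the-envelope sketch. It assumes a channel-mixing linear layer that maps `width` channels to `width` channels at every pixel, so one layer costs `width * width` multiply-accumulates (MACs) per pixel. The function names, the pixel count, and the candidate widths are hypothetical, and the softmax weighting is the generic DARTS-style relaxation, not the customized relaxation strategy the paper describes.

```python
import math


def mlp_macs(num_pixels: int, width: int, depth: int) -> int:
    """MACs for `depth` stacked width->width linear layers applied at every pixel.

    Assumption: each layer is a plain channel-mixing linear map, so its
    per-pixel cost is width * width. Biases and activations are ignored.
    """
    return num_pixels * depth * width * width


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_macs(num_pixels, depth, candidate_widths, logits):
    """Expected cost under a softmax relaxation over candidate channel widths.

    Generic DNAS-style idea: instead of picking one width, weight every
    candidate by a probability derived from learnable logits, which makes
    the cost a smooth function of those logits.
    """
    probs = softmax(logits)
    return sum(
        p * mlp_macs(num_pixels, w, depth)
        for p, w in zip(probs, candidate_widths)
    )
```

Under these assumptions, `mlp_macs(256, 128, 2)` is four times `mlp_macs(256, 64, 2)`: doubling the width quadruples the cost, while doubling the depth only doubles it. This asymmetry is why a search that trades width against depth can lower complexity.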