TAPFormer is a transformer-based framework for tracking arbitrary points that asynchronously fuses frames and event streams. It uses a Transient Asynchronous Fusion (TAF) mechanism to model the temporal evolution between frames through continuous event updates, and a Cross-modal Locally Weighted Fusion (CLWF) module to adaptively adjust spatial attention according to modality reliability. Experiments on a new real-world frame-event TAP dataset and on standard benchmarks show that TAPFormer outperforms existing point trackers, with a reported 28.2% improvement in average pixel error within threshold.
TAPFormer reports a 28.2% improvement in average pixel error within threshold over existing point trackers by asynchronously fusing frames and events in a transformer architecture that bridges the gap between low-rate frames and high-rate events.
Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
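The abstract does not specify how CLWF computes its weights, so the following is only a minimal sketch of the general idea of reliability-weighted fusion, not the authors' module. It assumes each modality supplies a feature map plus a per-location reliability score, and blends the two feature maps with a softmax over those scores. The function name `locally_weighted_fuse`, the score inputs, and the use of plain nested lists are all hypothetical choices made for illustration.

```python
import math


def locally_weighted_fuse(frame_feat, event_feat, frame_score, event_score):
    """Blend two feature maps with per-location softmax weights.

    Hypothetical illustration only: all four arguments are 2D lists of
    floats with identical shape. A higher score means the modality is
    assumed to be more reliable at that location.
    """
    fused = []
    for f_row, e_row, fs_row, es_row in zip(
        frame_feat, event_feat, frame_score, event_score
    ):
        out_row = []
        for f, e, fs, es in zip(f_row, e_row, fs_row, es_row):
            # Subtract the larger score before exponentiating for
            # numerical stability (standard softmax trick).
            m = max(fs, es)
            w_frame = math.exp(fs - m)
            w_event = math.exp(es - m)
            out_row.append((w_frame * f + w_event * e) / (w_frame + w_event))
        fused.append(out_row)
    return fused


# Toy 1x2 feature maps: equal scores at the first location, and a
# strongly frame-favoring score at the second location.
frame_feat = [[1.0, 2.0]]
event_feat = [[3.0, 4.0]]
frame_score = [[0.0, 50.0]]
event_score = [[0.0, -50.0]]

fused = locally_weighted_fuse(frame_feat, event_feat, frame_score, event_score)
```

With equal scores the result is the plain average of the two modalities, and when one score dominates the output falls back to that modality's feature. This mirrors the behavior the abstract describes for degraded inputs such as blurred or low-light frames, under the assumptions stated above.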