This paper introduces a novel spatio-temporal attention network (STARK) for continuous sign language recognition (CSLR) that jointly models spatial relationships between keypoints and temporal dynamics within local windows using a unified attention mechanism. By computing attention scores both spatially and temporally, STARK aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder achieves comparable performance to existing state-of-the-art keypoint-based methods on the Phoenix-14T dataset while significantly reducing the number of parameters by 70-80%.
Match the performance of state-of-the-art keypoint-based methods for continuous sign language recognition with 70-80% fewer parameters by unifying spatial and temporal attention.
Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
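To make the core idea concrete, the following is a minimal NumPy sketch of a unified spatio-temporal attention step: for each frame, the queries are that frame's keypoints, while the keys and values are all keypoints from a local temporal window, so a single attention map covers both spatial and temporal relations. This is an illustrative assumption of how such a layer could work, not the paper's actual implementation; the function names, window handling, and single-head design are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_st_attention(x, w_q, w_k, w_v, window=3):
    """Hypothetical unified spatio-temporal attention.

    x: (T, K, D) array of T frames, K keypoints, D-dim features.
    w_q, w_k, w_v: (D, D) projection matrices.
    window: size of the local temporal window (odd).
    Returns a (T, K, D) context-aware representation.
    """
    T, K, D = x.shape
    out = np.empty_like(x)
    for t in range(T):
        lo = max(0, t - window // 2)
        hi = min(T, t + window // 2 + 1)
        # Tokens = every keypoint in the local window, flattened,
        # so one attention map spans space AND time.
        tokens = x[lo:hi].reshape(-1, D)                # (N, D), N = (hi-lo)*K
        q = x[t] @ w_q                                  # (K, D)
        k = tokens @ w_k                                # (N, D)
        v = tokens @ w_v                                # (N, D)
        attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (K, N)
        out[t] = attn @ v                               # (K, D)
    return out
```

Because one projection set serves both axes, the layer avoids the separate spatial-attention and 1D temporal-convolution stacks that the abstract identifies as the main source of parameter overhead.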