This paper introduces a novel spatio-temporal attention network (STARK) for continuous sign language recognition (CSLR) that jointly models spatial relationships between keypoints and temporal dynamics within local windows using a unified attention mechanism. By computing attention scores both spatially and temporally, STARK aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder achieves comparable performance to existing state-of-the-art keypoint-based methods on the Phoenix-14T dataset while significantly reducing the number of parameters by 70-80%.
Match the performance of state-of-the-art keypoint-based methods for continuous sign language recognition with 70-80% fewer parameters by unifying spatial and temporal attention.
Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
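To make the core idea concrete, the following is a minimal NumPy sketch of a unified spatio-temporal attention step: for each frame, the queries are that frame's keypoints, while the keys and values are all keypoints from a local temporal window, so a single attention map covers both spatial and temporal relations. This is an illustrative assumption of how such a layer could work, not the paper's actual implementation; the function names, window handling, and single-head design are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_st_attention(x, w_q, w_k, w_v, window=3):
    """Hypothetical unified spatio-temporal attention.

    x: (T, K, D) array of T frames, K keypoints, D-dim features.
    w_q, w_k, w_v: (D, D) projection matrices.
    window: size of the local temporal window (odd).
    Returns a (T, K, D) context-aware representation.
    """
    T, K, D = x.shape
    out = np.empty_like(x)
    for t in range(T):
        lo = max(0, t - window // 2)
        hi = min(T, t + window // 2 + 1)
        # Tokens = every keypoint in the local window, flattened,
        # so one attention map spans space AND time.
        tokens = x[lo:hi].reshape(-1, D)                # (N, D), N = (hi-lo)*K
        q = x[t] @ w_q                                  # (K, D)
        k = tokens @ w_k                                # (N, D)
        v = tokens @ w_v                                # (N, D)
        attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (K, N)
        out[t] = attn @ v                               # (K, D)
    return out
```

Because one projection set serves both axes, the layer avoids the separate spatial-attention and 1D temporal-convolution stacks that the abstract identifies as the main source of parameter overhead.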