Mar 2, 2026

arXiv:2603.01743

Action-Guided Attention for Video Action Anticipation

Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz

AI Summary

This paper introduces Action-Guided Attention (AGA), an attention mechanism for video action anticipation that uses predicted action sequences as queries and keys to guide sequence modeling. Existing transformer-based methods tend to overfit to explicit visual cues in past frames; AGA addresses this by incorporating high-level semantics through predicted actions. Experiments on EPIC-Kitchens-100 show that AGA generalizes well from validation to unseen test sets, and its design enables post-training analysis of the learned action dependencies.

Key Contribution

By guiding attention with predicted action sequences rather than relying on dot-product attention over pixel representations alone, AGA targets better generalization and interpretability in video action anticipation.

Abstract

Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and to combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
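The abstract describes the mechanism only at a high level, so the following is a minimal sketch rather than the authors' implementation. It assumes that queries and keys are linear projections of per-step predicted action distributions, that values are projections of the frame embeddings, that attention is causal (each step attends only to itself and the past), and that the gate is a sigmoid over the concatenated frame embedding and attended context. The function name `action_guided_attention`, the weight names `W_q`, `W_k`, `W_v`, `W_g`, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def action_guided_attention(frames, action_probs, W_q, W_k, W_v, W_g):
    """Illustrative sketch only; every design detail here is an assumption.

    frames:       (T, D) frame embeddings
    action_probs: (T, C) predicted action distribution at each step
    """
    T = frames.shape[0]

    # Queries and keys come from predicted actions, not from pixel features.
    q = action_probs @ W_q            # (T, d)
    k = action_probs @ W_k            # (T, d)
    # Values carry the visual content of the frames.
    v = frames @ W_v                  # (T, D)

    scores = (q @ k.T) / np.sqrt(q.shape[-1])       # (T, T)
    # Assumed causal mask: step t may attend only to steps <= t.
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    attn = softmax(scores, axis=-1)                 # rows sum to 1

    context = attn @ v                # (T, D) summary of relevant past moments

    # Assumed gate: blends attended context with the current frame embedding.
    gate = sigmoid(np.concatenate([frames, context], axis=-1) @ W_g)  # (T, D)
    out = gate * context + (1.0 - gate) * frames
    return out, attn


# Toy usage with random inputs (T steps, D-dim frames, C action classes).
rng = np.random.default_rng(0)
T, D, C, d = 6, 8, 5, 4
frames = rng.normal(size=(T, D))
action_probs = softmax(rng.normal(size=(T, C)))
W_q = rng.normal(size=(C, d))
W_k = rng.normal(size=(C, d))
W_v = rng.normal(size=(D, D))
W_g = rng.normal(size=(2 * D, D))

out, attn = action_guided_attention(frames, action_probs, W_q, W_k, W_v, W_g)
```

Because the attention weights in this sketch depend only on predicted actions, the `attn` matrix can be read as a map of which past action predictions influence each step, which is one plausible way the post-training analysis of action dependencies mentioned in the abstract could be approached.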

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
