The paper introduces EAST, a framework for early action prediction that handles incomplete observations through a randomized training strategy: during training, it samples a time step that separates observed from unobserved video frames. This lets a single model generalize across observation ratios at test time. The paper also shows that joint learning on both observed and future (oracle) representations significantly improves performance. Combined with a forecasting decoder, EAST achieves state-of-the-art results on NTU60, SSv2, and UCF101, surpassing the previous best work by 10.1, 7.7, and 3.9 percentage points, respectively. A token masking procedure further halves memory usage and speeds up training by 2x with negligible accuracy loss.
EAST's randomized sampling strategy lets a single model handle early action prediction across observation ratios, setting a new state of the art on NTU60, SSv2, and UCF101.
Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
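The two training-time ideas in the abstract, sampling a split point between observed and unobserved frames and masking tokens to save memory, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`sample_split`, `split_clip`, `mask_tokens`), the uniform sampling distribution, and the `keep_ratio` parameter are all assumptions made for illustration; the abstract does not specify how the time step is drawn or how many tokens are masked.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_split(num_frames: int, rng: random.Random) -> int:
    """Pick how many leading frames count as observed.

    Assumption: the split is drawn uniformly; the source does not
    state the distribution.
    """
    if num_frames < 2:
        raise ValueError("need at least 2 frames to split a clip")
    # Guarantee at least one observed and one unobserved frame.
    return rng.randint(1, num_frames - 1)


def split_clip(frames: Sequence[T], rng: random.Random) -> Tuple[List[T], List[T]]:
    """Split a clip into an observed prefix and an unobserved (future) suffix."""
    t = sample_split(len(frames), rng)
    return list(frames[:t]), list(frames[t:])


def mask_tokens(tokens: Sequence[T], keep_ratio: float, rng: random.Random) -> List[T]:
    """Keep a random subset of tokens, preserving their original order.

    Assumption: masking is uniform at random and `keep_ratio` is a
    hypothetical knob, not a parameter named in the source.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    num_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    rng = random.Random(0)
    clip = [f"frame_{i}" for i in range(16)]  # stand-in for real video frames
    observed, future = split_clip(clip, rng)
    print(len(observed), "observed /", len(future), "unobserved")
    print(mask_tokens(observed, keep_ratio=0.5, rng=rng))
```

Because a fresh split is drawn for every training sample, the model sees many observation ratios over the course of training, which is the intuition behind a single model serving all test-time ratios. In a real pipeline the frames would be tensors and the masking would act on patch tokens inside the encoder; plain lists are used here only to keep the sketch dependency-free.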