The paper introduces SemanticMoments, a training-free approach to motion-based video retrieval that addresses the static-appearance bias of existing methods. The authors demonstrate this bias with the SimMotion benchmarks, which combine synthetic and real-world datasets on which existing models struggle to disentangle motion from appearance. SemanticMoments computes temporal statistics (higher-order moments) over features from pre-trained semantic models, and it outperforms RGB, flow, and text-supervised methods on the SimMotion benchmarks.
Forget training: SemanticMoments achieves state-of-the-art motion-based video retrieval by simply computing temporal statistics over features from pre-trained semantic models.
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
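The core idea lends itself to a very short sketch. Below is a minimal illustration, assuming per-frame embeddings from a pre-trained semantic encoder (e.g., per-frame CLIP or DINO features) are already available as a `(T, D)` array; the specific moment orders, standardization, and L2 normalization shown here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def temporal_moment_descriptor(frame_features: np.ndarray, max_order: int = 4) -> np.ndarray:
    """Collapse per-frame semantic features of shape (T, D) into one video
    descriptor by stacking per-dimension temporal moments: mean, variance,
    and standardized higher-order moments (skewness, kurtosis, ...).
    Sketch only; the exact moments used are an assumption."""
    mu = frame_features.mean(axis=0)            # first moment (per feature dim)
    centered = frame_features - mu
    sigma = centered.std(axis=0) + 1e-8         # guard against zero variance
    moments = [mu, sigma ** 2]                  # mean and variance
    for k in range(3, max_order + 1):
        # standardized k-th central moment, e.g. skewness (k=3), kurtosis (k=4)
        moments.append(((centered / sigma) ** k).mean(axis=0))
    desc = np.concatenate(moments)
    return desc / (np.linalg.norm(desc) + 1e-8)  # unit norm for cosine retrieval

def rank_gallery(query_feats: np.ndarray, gallery_feats: list[np.ndarray]) -> np.ndarray:
    """Rank gallery videos by cosine similarity of their moment descriptors
    to the query video's descriptor (highest score first)."""
    q = temporal_moment_descriptor(query_feats)
    G = np.stack([temporal_moment_descriptor(g) for g in gallery_feats])
    scores = G @ q  # dot product equals cosine similarity on unit-norm vectors
    return np.argsort(-scores)
```

Because each video collapses to a fixed-length vector, retrieval reduces to nearest-neighbor search over precomputed descriptors, which is what keeps the approach training-free and scalable.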