The paper introduces SemanticMoments, a training-free approach to motion-based video retrieval that addresses the static-appearance bias of existing methods. The authors demonstrate this bias with the SimMotion benchmarks, which combine synthetic and real-world datasets on which existing models struggle to disentangle motion from appearance. SemanticMoments computes temporal statistics (higher-order moments) over features from pre-trained semantic models, and it outperforms RGB, flow, and text-supervised methods on the SimMotion benchmarks.
Forget training: SemanticMoments achieves state-of-the-art motion-based video retrieval by simply computing temporal statistics over features from pre-trained semantic models.
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
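The core idea lends itself to a very short sketch. Below is a minimal illustration, assuming per-frame embeddings from a pre-trained semantic encoder (e.g., per-frame CLIP or DINO features) are already available as a `(T, D)` array; the specific moment orders, standardization, and L2 normalization shown here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def temporal_moment_descriptor(frame_features: np.ndarray, max_order: int = 4) -> np.ndarray:
    """Collapse per-frame semantic features of shape (T, D) into one video
    descriptor by stacking per-dimension temporal moments: mean, variance,
    and standardized higher-order moments (skewness, kurtosis, ...).
    Sketch only; the exact moments used are an assumption."""
    mu = frame_features.mean(axis=0)            # first moment (per feature dim)
    centered = frame_features - mu
    sigma = centered.std(axis=0) + 1e-8         # guard against zero variance
    moments = [mu, sigma ** 2]                  # mean and variance
    for k in range(3, max_order + 1):
        # standardized k-th central moment, e.g. skewness (k=3), kurtosis (k=4)
        moments.append(((centered / sigma) ** k).mean(axis=0))
    desc = np.concatenate(moments)
    return desc / (np.linalg.norm(desc) + 1e-8)  # unit norm for cosine retrieval

def rank_gallery(query_feats: np.ndarray, gallery_feats: list[np.ndarray]) -> np.ndarray:
    """Rank gallery videos by cosine similarity of their moment descriptors
    to the query video's descriptor (highest score first)."""
    q = temporal_moment_descriptor(query_feats)
    G = np.stack([temporal_moment_descriptor(g) for g in gallery_feats])
    scores = G @ q  # dot product equals cosine similarity on unit-norm vectors
    return np.argsort(-scores)
```

Because each video collapses to a fixed-length vector, retrieval reduces to nearest-neighbor search over precomputed descriptors, which is what keeps the approach training-free and scalable.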