This paper investigates the utility of long-term motion representations derived from point tracks for a range of perceptual tasks, comparing them to image-based representations. The authors demonstrate that long-term motion representations encode information about actions, objects, materials, and spatial relationships, often outperforming image representations, particularly in low-data and zero-shot scenarios. They also find that motion representations offer a better trade-off between computational cost (GFLOPs) and accuracy than standard video representations, and that combining motion and video representations outperforms video representations alone.
Long-term motion understanding can outperform image-based perception, offering surprising generalization and efficiency gains.
Temporal information has long been considered essential for perception. While there is extensive research on the role of image information in perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations, and experiment on a variety of perceptual tasks. We draw three clear lessons: 1) Long-term motion representations contain the information needed to understand not only actions but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information gives motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and the two used together achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.