The paper introduces dense point trajectories as a mid-level representation for modeling motion and behavior in visual data, disentangling motion from appearance. It then proposes a diffusion transformer that operates on sets of these trajectories and explicitly handles occlusion to forecast complex motion patterns. Evaluated on a newly curated 300-hour dataset of unconstrained animal videos, the method delivers category-agnostic, data-efficient motion prediction, outperforms existing baselines, and generalizes to rare species and morphologies.
Forget bounding boxes – predicting the future of animal movement in the wild is now possible with interpretable trajectory tokens and a diffusion transformer.
Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in the wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.
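To make the idea of trajectory tokens more concrete, here is a minimal sketch of how a set of point tracks with occlusion flags might be represented and noised for diffusion training. It is an illustration only, not the authors' implementation: the `Track` class, the helper names, the use of frame-to-frame displacements, and the choice to handle occlusion by masking the loss are all assumptions. The only standard ingredient is the DDPM forward step, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical container for one point trajectory."""
    xy: list        # per-frame (x, y) positions
    visible: list   # per-frame flag, False where the point is occluded


def to_displacements(track):
    """Frame-to-frame motion; it depends only on positions, not on pixels."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(track.xy, track.xy[1:])]


def step_mask(track):
    """A step counts as observed only if both of its endpoints are visible."""
    return [a and b for a, b in zip(track.visible, track.visible[1:])]


def add_noise(displacements, alpha_bar, rng):
    """Standard DDPM forward step applied independently to each coordinate."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(s * dx + n * rng.gauss(0.0, 1.0), s * dy + n * rng.gauss(0.0, 1.0))
            for dx, dy in displacements]


def masked_mse(pred, target, mask):
    """Squared error averaged over observed steps (assumed occlusion handling)."""
    errs = [(px - tx) ** 2 + (py - ty) ** 2
            for (px, py), (tx, ty), m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


# A "scene" is an unordered collection of tracks; list order carries no meaning.
scene = [
    Track(xy=[(0.0, 0.0), (1.0, 2.0), (3.0, 3.0)], visible=[True, True, False]),
    Track(xy=[(5.0, 5.0), (5.0, 6.0), (5.0, 8.0)], visible=[True, True, True]),
]
rng = random.Random(0)
noisy = [add_noise(to_displacements(t), alpha_bar=0.5, rng=rng) for t in scene]
```

In a real system a denoising network would consume `noisy` together with the observed past and be trained to recover the clean displacements. Because the tracks form a set, such a network would typically be built so that reordering the tracks does not change its output.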