The paper introduces Robot-DIFT, a framework that distills geometric priors from a frozen diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN) to improve visuomotor control. This distillation process aims to address the structural mismatch between vision encoders optimized for semantic invariance and the geometric sensitivity required for precise manipulation. Robot-DIFT, pretrained on the DROID dataset, achieves superior geometric consistency and control performance compared to discriminative baselines by leveraging the geometric dependencies encoded within diffusion model latent manifolds.
By distilling a frozen diffusion model's geometric understanding into a fast, deterministic network, Robot-DIFT unlocks more precise robot control compared to standard vision encoders.
We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying diffusion features for control is hindered by sampling stochasticity, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.
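The teacher-to-student setup described above can be sketched in miniature. The following is an illustrative toy, not the paper's implementation: a frozen linear map stands in for the diffusion teacher's feature extractor at one pyramid scale, and a learnable linear map stands in for the deterministic S2-FPN student, trained by plain feature-matching (MSE) gradient descent. All names, shapes, and the loss choice are assumptions for the sketch.

```python
import numpy as np

# Toy sketch of manifold distillation (illustrative only): a deterministic
# student is trained to reproduce the features of a frozen teacher, so that
# inference needs no stochastic sampling from the teacher.
rng = np.random.default_rng(0)

# Frozen "teacher": a fixed random projection standing in for the diffusion
# model's feature extractor at a single pyramid scale. Never updated.
W_teacher = rng.normal(size=(16, 8))

def teacher(x):
    # x: (batch, 16) observations -> (batch, 8) target features
    return x @ W_teacher

# Deterministic "student": a learnable linear map (stand-in for S2-FPN).
W_student = rng.normal(size=(16, 8)) * 0.01

lr = 0.05
for step in range(500):
    x = rng.normal(size=(32, 16))        # batch of synthetic observations
    t = teacher(x)                       # teacher features (targets)
    s = x @ W_student                    # student prediction
    grad = x.T @ (s - t) / len(x)        # gradient of 0.5 * ||s - t||^2
    W_student -= lr * grad               # distillation step

# After distillation the student closely matches the frozen teacher, while
# running as a single deterministic forward pass.
x_test = rng.normal(size=(4, 16))
err = float(np.mean((x_test @ W_student - teacher(x_test)) ** 2))
```

In a full pipeline the same matching loss would be applied at each pyramid scale, which is what lets the student inherit the teacher's multi-scale spatial structure without inheriting its sampling cost.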