The paper introduces Robot-DIFT, a framework that distills geometric priors from a frozen diffusion model into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN) to improve visuomotor control. This distillation process aims to address the structural mismatch between vision encoders optimized for semantic invariance and the geometric sensitivity required for precise manipulation. Robot-DIFT, pretrained on the DROID dataset, achieves superior geometric consistency and control performance compared to discriminative baselines by leveraging the geometric dependencies encoded within diffusion model latent manifolds.
By distilling a frozen diffusion model's geometric understanding into a fast, deterministic network, Robot-DIFT unlocks more precise robot control compared to standard vision encoders.
We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying diffusion features for control is hindered by sampling stochasticity, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.
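The teacher-to-student setup described above can be sketched in miniature. The following is an illustrative toy, not the paper's implementation: a frozen linear map stands in for the diffusion teacher's feature extractor at one pyramid scale, and a learnable linear map stands in for the deterministic S2-FPN student, trained by plain feature-matching (MSE) gradient descent. All names, shapes, and the loss choice are assumptions for the sketch.

```python
import numpy as np

# Toy sketch of manifold distillation (illustrative only): a deterministic
# student is trained to reproduce the features of a frozen teacher, so that
# inference needs no stochastic sampling from the teacher.
rng = np.random.default_rng(0)

# Frozen "teacher": a fixed random projection standing in for the diffusion
# model's feature extractor at a single pyramid scale. Never updated.
W_teacher = rng.normal(size=(16, 8))

def teacher(x):
    # x: (batch, 16) observations -> (batch, 8) target features
    return x @ W_teacher

# Deterministic "student": a learnable linear map (stand-in for S2-FPN).
W_student = rng.normal(size=(16, 8)) * 0.01

lr = 0.05
for step in range(500):
    x = rng.normal(size=(32, 16))        # batch of synthetic observations
    t = teacher(x)                       # teacher features (targets)
    s = x @ W_student                    # student prediction
    grad = x.T @ (s - t) / len(x)        # gradient of 0.5 * ||s - t||^2
    W_student -= lr * grad               # distillation step

# After distillation the student closely matches the frozen teacher, while
# running as a single deterministic forward pass.
x_test = rng.normal(size=(4, 16))
err = float(np.mean((x_test @ W_student - teacher(x_test)) ** 2))
```

In a full pipeline the same matching loss would be applied at each pyramid scale, which is what lets the student inherit the teacher's multi-scale spatial structure without inheriting its sampling cost.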