3PoinTr is introduced as a method for pretraining robot policies from unconstrained human videos by predicting 3D point tracks, an embodiment-agnostic representation of goals, scene geometry, and spatiotemporal relationships. A transformer architecture predicts these 3D point tracks, which are then used with a Perceiver IO architecture for sample-efficient behavior cloning. Experiments show that 3PoinTr achieves robust spatial generalization on manipulation tasks with only 20 labeled robot demonstrations, outperforming behavior cloning and prior pretraining methods.
Robots can now learn manipulation skills from ordinary human videos, thanks to a 3D point tracking method that bridges the embodiment gap and requires only 20 robot demonstrations.
Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct a thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods and prior methods for pretraining from human videos. We also evaluate 3PoinTr's 3D point track predictions against an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher-quality point tracks, owing to a lightweight yet expressive architecture built on a single transformer and to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
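The abstract does not spell out how supervision of partially occluded points is preserved. One common way to do this, shown here only as a minimal sketch and not as the authors' formulation, is to mask the loss per point and per timestep, so that a point hidden in some frames still contributes wherever it is observed. The function name `masked_track_loss`, the array shapes, and the choice of squared error are all assumptions made for illustration.

```python
import numpy as np


def masked_track_loss(pred, target, visible):
    """Mean squared 3D error over visible (point, timestep) entries only.

    Hypothetical helper for illustration; not taken from the 3PoinTr code.

    pred, target: arrays of shape (num_points, num_steps, 3).
    visible: boolean array of shape (num_points, num_steps), True where
        the target position was actually observed.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(visible, dtype=bool)

    # Squared Euclidean error per point and timestep.
    sq_err = np.sum((pred - target) ** 2, axis=-1)

    # Zero out occluded entries; np.where also discards NaN targets there.
    sq_err = np.where(mask, sq_err, 0.0)

    n_visible = int(mask.sum())
    if n_visible == 0:
        return 0.0
    return float(sq_err.sum() / n_visible)


# Two points tracked over three timesteps; point 1 is occluded at step 2.
pred = np.zeros((2, 3, 3))
target = np.ones((2, 3, 3))
target[1, 2] = np.nan  # no ground truth where the point is hidden
visible = np.array([[True, True, True],
                    [True, True, False]])

loss = masked_track_loss(pred, target, visible)
```

In this sketch the partially occluded point still supplies two supervised timesteps. Discarding every track that has any occluded frame would lose them.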