AffordSim is introduced as a novel simulation framework that integrates open-vocabulary 3D affordance prediction into robotic manipulation data generation, enabling the creation of semantically correct trajectories for tasks requiring precise interaction with object affordances. The framework leverages VoxAfford, a 3D affordance detector, to guide grasp pose estimation toward task-relevant functional regions within NVIDIA Isaac Sim. Experiments across 50 tasks reveal that affordance-demanding tasks remain challenging for imitation learning, while zero-shot sim-to-real experiments validate the transferability of the generated data.
Robotic manipulation data that respects object affordances can now be generated at scale, yet current imitation learning methods still struggle with affordance-demanding tasks such as pouring and mug hanging.
Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions--grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook--cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.
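The idea of steering grasp selection toward task-relevant functional regions can be illustrated with a minimal sketch. This is not AffordSim's or VoxAfford's actual implementation: the function name `affordance_weighted_grasp`, the fixed neighbourhood `radius`, and the linear blend `weight` are all hypothetical choices made for illustration. It assumes a per-point affordance score is already available for the object point cloud and that a separate grasp sampler has produced candidate grasp centres with a quality score.

```python
import math


def affordance_weighted_grasp(points, affordance, candidates, radius=0.02, weight=0.5):
    """Return the candidate grasp with the highest blended score.

    points:     list of (x, y, z) coordinates of the object point cloud
    affordance: list of per-point scores in [0, 1], same length as points
    candidates: list of ((x, y, z), quality) pairs, quality in [0, 1]
    radius:     neighbourhood size in metres (hypothetical default)
    weight:     share of the score taken from affordance (hypothetical default)
    """
    best, best_score = None, -math.inf
    for centre, quality in candidates:
        # Collect affordance scores of points near the grasp centre.
        nearby = [a for p, a in zip(points, affordance) if math.dist(p, centre) <= radius]
        local_aff = sum(nearby) / len(nearby) if nearby else 0.0
        # Blend geometric grasp quality with local affordance.
        score = weight * local_aff + (1.0 - weight) * quality
        if score > best_score:
            best, best_score = centre, score
    return best, best_score
```

With a toy mug whose handle points score high for a "grasp by the handle" query, a handle grasp with moderate geometric quality can outrank a body grasp with higher geometric quality, which is the behaviour the abstract describes at a high level.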