The paper introduces Spatial Surgical Transformer (SST), an end-to-end visuomotor policy for surgical robots that leverages 3D spatial cues directly from stereo endoscopic images. To address the lack of training data, the authors created Surgical3D, a large-scale photorealistic dataset of 30K stereo endoscopic image pairs with accurate 3D geometry. SST finetunes a geometric transformer on Surgical3D to extract 3D latent representations, aligns them with the robot's action space using a multi-level spatial feature connector, and achieves state-of-the-art performance on real-robot surgical tasks.
Surgical robots can perform complex tasks such as knot tying and ex-vivo organ dissection with state-of-the-art performance, using a new method that infers 3D spatial awareness directly from standard stereo endoscopic images.
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscope images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
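The abstract states that features and actions share an endoscope-centric coordinate frame but does not say how that frame is defined or applied. As a minimal sketch of what expressing a quantity in a camera-centric frame means, the snippet below converts a point from a robot base frame into a camera frame with the standard rigid-transform identity p_cam = R^T (p_base - t). The function names, the pose values, and the choice of a base frame are hypothetical illustrations and are not taken from the SST implementation.

```python
import math


def rot_z(theta):
    """Return a 3x3 rotation about the z axis by theta radians (row-major tuples)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def base_to_endoscope(p_base, R_base_cam, t_base_cam):
    """Express a base-frame point in the camera frame.

    R_base_cam and t_base_cam give the camera's orientation and origin
    in the base frame, so the inverse mapping is p_cam = R^T (p_base - t).
    """
    d = [p_base[i] - t_base_cam[i] for i in range(3)]
    # Multiplying by R^T: output component j is the dot product of d with column j of R.
    return tuple(sum(R_base_cam[i][j] * d[i] for i in range(3)) for j in range(3))


# Hypothetical camera pose: rotated 90 degrees about z, origin 0.1 m along base x.
R = rot_z(math.pi / 2)
t = (0.1, 0.0, 0.0)

# Hypothetical tool-tip position in the base frame, in metres.
p_cam = base_to_endoscope((0.1, 0.2, 0.05), R, t)
```

In this example the tool tip sits 0.2 m along the base y axis from the camera origin, and the camera x axis points along base y, so the point lands on the camera x axis. A policy that predicts actions in such a frame sees targets and motions in the same coordinates as the images, which is one plausible reading of why the authors emphasize the endoscope-centric choice.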