AffordTissue, a novel multimodal framework, predicts tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy by combining temporal vision encoding of tool motion and tissue dynamics with language conditioning for instrument-action generalization and a DiT-style decoder. The framework is trained and evaluated on a newly curated dataset of 15,638 video clips from 103 cholecystectomy procedures, establishing the first tissue affordance benchmark. Results demonstrate a significant improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), highlighting the efficacy of the task-specific architecture for dense surgical affordance prediction.
Task-specific architectures still beat large vision-language models at predicting where surgical instruments should interact with tissue.
Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these approaches have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability about where instruments will interact with tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action-specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially enabling explicit policy guidance toward appropriate tissue regions and early safe stops when instruments deviate from predicted safe zones.
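To make the architecture description concrete, the following is a minimal sketch of how such a pipeline could be wired together: a temporal vision encoder over video frames, a projected text embedding for the tool-action pair, and a transformer decoder that emits a dense per-pixel affordance heatmap. All module names, dimensions, and the patch-based decoding below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an AffordTissue-style pipeline:
# temporal vision encoding of video frames, language conditioning on the
# tool-action pair, and a transformer decoder producing a dense heatmap.
import torch
import torch.nn as nn


class AffordanceSketch(nn.Module):
    def __init__(self, d_model: int = 256, patch: int = 16, img: int = 224):
        super().__init__()
        self.patch, self.img = patch, img
        n_patches = (img // patch) ** 2
        # Per-frame patch embedding; temporal self-attention over frame tokens
        # stands in for the "temporal vision encoder" of the abstract.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # Language conditioning: a pooled text embedding (e.g. from a frozen
        # text encoder) projected into the model dimension.
        self.text_proj = nn.Linear(512, d_model)
        # Stand-in for the DiT-style decoder: the condition is added to learned
        # patch tokens, refined by a transformer, then unpatchified into a heatmap.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        self.to_heatmap = nn.Linear(d_model, patch * patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))

    def forward(self, frames: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W), text_emb: (B, 512)
        b, t, c, h, w = frames.shape
        x = self.patch_embed(frames.reshape(b * t, c, h, w))   # (B*T, D, h', w')
        x = x.flatten(2).transpose(1, 2)                       # (B*T, N, D)
        x = x.reshape(b, t, x.shape[1], x.shape[2]).mean(2)    # pool patches -> (B, T, D)
        x = self.temporal(x).mean(1)                           # temporal context -> (B, D)
        cond = x + self.text_proj(text_emb)                    # fuse vision and language
        tokens = self.decoder(self.pos + cond[:, None, :])     # (B, N, D)
        heat = self.to_heatmap(tokens)                         # per-patch pixel logits
        side = self.img // self.patch
        heat = heat.reshape(b, side, side, self.patch, self.patch)
        heat = heat.permute(0, 1, 3, 2, 4).reshape(b, self.img, self.img)
        return torch.sigmoid(heat)                             # dense affordance heatmap


if __name__ == "__main__":
    model = AffordanceSketch()
    out = model(torch.randn(2, 8, 3, 224, 224), torch.randn(2, 512))
    print(out.shape)  # torch.Size([2, 224, 224])
```

A real system would presumably replace the toy patch embedding and text projection with pretrained video and text encoders and use an actual diffusion-transformer decoder; the sketch only illustrates the data flow from frames plus a tool-action prompt to a per-pixel heatmap.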
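The headline metric is ASSD (average symmetric surface distance), reported in pixels between predicted and ground-truth affordance regions. The sketch below shows one common way to compute it from two binary masks; the boundary extraction via binary erosion and the use of distance transforms are assumptions about the evaluation protocol, not details taken from the paper.

```python
# Minimal sketch of an ASSD (average symmetric surface distance) computation
# in pixels between a thresholded predicted mask and a ground-truth mask.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def surface(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels of a binary mask: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)


def assd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average symmetric surface distance in pixels between two binary masks."""
    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    # Distance from every pixel to the nearest boundary pixel of the other mask.
    d_to_gt = distance_transform_edt(~sg)
    d_to_pred = distance_transform_edt(~sp)
    # Symmetric average: pred-boundary -> gt-boundary and gt-boundary -> pred-boundary.
    dists = np.concatenate([d_to_gt[sp], d_to_pred[sg]])
    return float(dists.mean())


if __name__ == "__main__":
    gt = np.zeros((64, 64), dtype=bool)
    gt[20:40, 20:40] = True
    pred = np.zeros((64, 64), dtype=bool)
    pred[22:42, 22:42] = True
    print(f"ASSD: {assd(pred, gt):.2f} px")
```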