TU Munich | Mar 10, 2026 | arXiv:2603.09466

TopoOR: A Unified Topological Scene Representation for the Operating Room

Tony Danjun Wang, Ka Young Kim, Tolga Birdal, Nassir Navab, Lennart Bastian

AI Summary

The paper introduces TopoOR, a novel approach to represent surgical operating rooms (OR) as higher-order topological structures, capturing both pairwise and group relationships between entities. This representation overcomes the limitations of traditional scene graphs that rely on dyadic relationships and flatten the underlying manifold geometry. TopoOR also incorporates a higher-order attention mechanism to preserve manifold structure and modality-specific features, avoiding the need for a single joint latent representation. Experiments demonstrate that TopoOR outperforms graph and LLM-based baselines in tasks like sterility breach detection, robot phase prediction, and next-action anticipation.

Key Contribution

Ditch the flat scene graphs: TopoOR models surgical environments as higher-order topological structures, unlocking superior performance in safety-critical tasks by preserving complex relationships and multimodal data.

Abstract

Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
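To make the idea of "lifting" a scene graph concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a toy representation in which cells are stored in plain dictionaries keyed by rank. The entity names, relation labels, and the helpers `lift_scene_graph` and `incident_groups` are all hypothetical. It shows only that the pairwise relations of a scene graph survive unchanged as rank-1 cells, while a group interaction gets its own rank-2 cell.

```python
# Toy sketch (assumption: cells stored as plain dicts keyed by rank).
# All entity names, labels, and helper names below are hypothetical.

entities = ["surgeon", "nurse", "patient", "robot_arm", "instrument"]

# A conventional scene graph: labeled, directed pairwise relations.
scene_graph = {
    ("surgeon", "instrument"): "holding",
    ("nurse", "instrument"): "passing",
    ("robot_arm", "patient"): "operating_on",
}

# Group relations that a dyadic graph cannot express as a single unit.
group_relations = {
    ("surgeon", "nurse", "instrument"): "instrument_handover",
}


def lift_scene_graph(entities, pairwise, groups):
    """Return {rank: {cell: label}} with entities, pairs, and groups."""
    cells = {0: {}, 1: {}, 2: {}}
    for name in entities:
        cells[0][name] = name  # rank 0: individual entities
    for pair, label in pairwise.items():
        cells[1][pair] = label  # rank 1: the original scene-graph edges
    for members, label in groups.items():
        cells[2][frozenset(members)] = label  # rank 2: group cells
    return cells


def incident_groups(cells, entity):
    """List the labels of rank-2 cells that contain the given entity."""
    return sorted(
        label for members, label in cells[2].items() if entity in members
    )


cells = lift_scene_graph(entities, scene_graph, group_relations)

# The rank-1 cells are exactly the original scene graph.
print(cells[1] == scene_graph)
# Group membership can be queried directly per entity.
print(incident_groups(cells, "nurse"))
```

In this sketch, dropping the rank-2 cells recovers the ordinary scene graph, which is the sense in which a higher-order representation can contain a dyadic one as a special case.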

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
