PanopticQuery answers natural language queries about dynamic 4D scenes by combining 4D Gaussian Splatting with a novel multi-view semantic consensus mechanism. This mechanism aggregates 2D semantic predictions across views and time to enforce geometric consistency, lifting semantics into a structured 4D representation via neural field optimization. The approach achieves state-of-the-art performance on a new benchmark, Panoptic-L4D, which evaluates complex language queries involving attributes, actions, spatial relationships, and multi-object interactions.
Answering complex questions about 4D scenes just got a whole lot better: PanopticQuery leverages multi-view semantic consensus to transform noisy, view-dependent predictions into globally consistent 4D interpretations.
Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.
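The abstract's multi-view consensus idea, aggregating noisy per-view semantic predictions and filtering the inconsistent ones, can be illustrated with a toy sketch. This is not the paper's method: the confidence-weighted softmax fusion, the `consensus_labels` function, and the agreement threshold are all illustrative assumptions standing in for the actual neural field optimization.

```python
import numpy as np

def consensus_labels(view_logits, view_confidence, threshold=0.5):
    """Toy multi-view semantic consensus (illustrative, not the paper's method).

    view_logits:      (V, N, C) per-view class scores for N scene points.
    view_confidence:  (V,) scalar reliability weight per view (assumed given).
    Returns per-point consensus labels; points whose views disagree too much
    (fused confidence below `threshold`) are marked -1, i.e. filtered out.
    """
    # Convert each view's logits to probabilities (softmax per point).
    shifted = view_logits - view_logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)            # (V, N, C)

    # Confidence-weighted average across views -> fused distribution per point.
    w = view_confidence / view_confidence.sum()              # (V,)
    fused = np.einsum("v,vnc->nc", w, probs)                 # (N, C)

    labels = fused.argmax(axis=-1)
    agreement = fused.max(axis=-1)
    labels[agreement < threshold] = -1   # views conflict: no consensus label
    return labels
```

In this simplified picture, a point seen consistently across views keeps its fused label, while a point whose views vote for different classes falls below the agreement threshold and is discarded, mirroring the "filters inconsistent outputs" step described above.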