This paper introduces VISA, a training-time semantic auditing approach that adds supervision from an offline Vision-Language Model (VLM) to existing 3D occupancy world models. VISA queries the VLM for a structured audit of each physical object instance and distills that audit into the occupancy model's semantic logits, targeting object and rare-class errors. On nuScenes, it raises mean Intersection over Union (mIoU) for both OccWorld and GaussianWorld, with the largest reported gain on GaussianWorld rare classes (15.60 to 16.79). The results suggest that, for closed-set occupancy, VLMs work better as reliability-aware semantic auditors than as caption-embedding targets.
Using a VLM as a reliability-aware semantic auditor, rather than as a caption-embedding target, improves closed-set occupancy mIoU, especially for object and rare classes.
Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
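The abstract names reliability-weighted distillation into semantic logits but gives no formulas, so the following is only a minimal sketch of the general idea, not the paper's losses. It assumes a hypothetical `ObjectAudit` record, an illustrative way of turning class hypotheses and plausible confusions into a soft target (`confusion_mass` is an invented parameter), and a plain soft cross-entropy scaled by the audit's reliability. The taxonomy, attribute-factor, and scene-level audit graph terms are not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectAudit:
    """Hypothetical container for one VLM audit of an object instance."""
    class_hypotheses: dict[int, float]  # class index -> VLM score (assumed format)
    reliability: float                  # assumed to lie in [0, 1]
    confusions: list[int] = field(default_factory=list)  # plausible confused classes


def audit_to_soft_target(audit: ObjectAudit, num_classes: int,
                         confusion_mass: float = 0.1) -> np.ndarray:
    """Build a soft class distribution from an audit (illustrative choice)."""
    target = np.zeros(num_classes)
    for cls, score in audit.class_hypotheses.items():
        target[cls] += score
    total = target.sum()
    if total <= 0:
        raise ValueError("audit needs at least one positive class hypothesis")
    target /= total
    if audit.confusions:
        # Reserve a small share of probability for plausible confusions.
        target *= 1.0 - confusion_mass
        for cls in audit.confusions:
            target[cls] += confusion_mass / len(audit.confusions)
    return target


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reliability_weighted_loss(voxel_logits: np.ndarray,
                              audit: ObjectAudit) -> float:
    """Soft cross-entropy over an object's voxels, scaled by audit reliability.

    voxel_logits has shape (num_voxels, num_classes) and holds the semantic
    logits of the 3D voxels matched to the audited object.
    """
    target = audit_to_soft_target(audit, voxel_logits.shape[-1])
    per_voxel_ce = -(target * log_softmax(voxel_logits)).sum(axis=-1)
    return audit.reliability * per_voxel_ce.mean()
```

Under these assumptions, an audit with reliability 0 contributes nothing to training, while a confident audit pulls every matched voxel toward the audited class distribution. Because the loss applies only at training time, the occupancy model runs unchanged at inference, consistent with the abstract's statement that inference requires no VLM.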