Tsinghua AI | Mar 17, 2026 | arXiv:2603.16781

IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Zuozhu Liu

AI Summary

The authors introduce IOSVLM, a 3D vision-language model (VLM) designed for unified dental diagnosis directly from intraoral scans (IOS) represented as point clouds. To address challenges like heterogeneous scan forms and limited paired data, they propose a geometry-to-chromatic proxy to bridge the gap between color-free IOS data and color-dependent 3D pre-training, along with a two-stage curriculum training strategy. Evaluated on a newly created large-scale IOS diagnosis VQA dataset (IOSVQA), IOSVLM demonstrates significant performance gains over existing methods, achieving at least +9.58% macro accuracy and +1.46% macro F1.

Key Contribution

Directly modeling native 3D geometry in intraoral scans, rather than relying on 2D or multi-view rendered images, yields gains of at least +9.58% macro accuracy and +1.46% macro F1 over strong baselines in multi-disease diagnosis.

Abstract

3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
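The reported gains are stated in macro accuracy and macro F1, which the abstract does not define. The sketch below assumes the common convention in which each disease is scored as a separate binary label and the per-disease scores are averaged with equal weight, so that rare diseases count as much as frequent ones. The function names, the disease names, and the toy labels are hypothetical placeholders, not taken from the paper or from IOSVQA.

```python
from typing import Dict, List, Tuple


def binary_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for one binary disease label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def macro_metrics(
    per_disease: Dict[str, Tuple[List[int], List[int]]]
) -> Tuple[float, float]:
    """Unweighted mean of per-disease accuracy and per-disease F1.

    Assumption: "macro" means averaging over diseases with equal weight;
    the paper may use a different definition.
    """
    accuracies, f1_scores = [], []
    for y_true, y_pred in per_disease.values():
        tp, fp, fn, tn = binary_counts(y_true, y_pred)
        accuracies.append((tp + tn) / len(y_true))
        denom = 2 * tp + fp + fn
        # F1 is set to 0 when a disease has no positives and no predictions.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(accuracies) / len(accuracies), sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Hypothetical toy labels: (ground truth, prediction) per disease.
    toy = {
        "disease_a": ([1, 0, 1, 0], [1, 0, 0, 0]),
        "disease_b": ([0, 0, 0, 1], [0, 1, 0, 1]),
    }
    macro_acc, macro_f1 = macro_metrics(toy)
    print(f"macro accuracy = {macro_acc:.4f}, macro F1 = {macro_f1:.4f}")
```

Under class imbalance the two numbers can diverge: a model that rarely predicts a rare disease can keep per-disease accuracy high while its F1 for that disease drops. That may be one reason the reported macro-accuracy gain (+9.58%) is larger than the macro-F1 gain (+1.46%), though the abstract does not say so.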

Computer Vision | Multimodal Models | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
