The paper introduces MDF-MLLM, a novel multimodal deep learning architecture that integrates fine-grained image features from a U-Net encoder with global textual context within a LLaMA 3.2 11B MLLM to improve retinal disease classification from fundus images. MDF-MLLM routes skip connections from four U-Net encoder layers into cross-attention blocks in the MLLM, fusing vision features patch-wise via scaled cross-attention with FiLM-based modulation of the U-Net features. Evaluated on 1,305 fundus image-text pairs, MDF-MLLM achieved 94% accuracy, a 56% relative improvement over the baseline MLLM's 60%, demonstrating enhanced spatial reasoning and classification performance, especially for inherited diseases.
Fusing multi-scale U-Net features into a large language model lifts fundus image classification accuracy from 60% to 94%, unlocking a new level of diagnostic precision.
This study aimed to enhance disease classification accuracy from retinal fundus images by integrating fine-grained image features with global textual context in a novel multimodal deep learning architecture. Existing multimodal large language models (MLLMs) often struggle to capture the low-level spatial details critical for diagnosing retinal diseases such as glaucoma, diabetic retinopathy, and retinitis pigmentosa. This model development and validation study was conducted on 1,305 fundus image-text pairs compiled from three public datasets (FIVES, HRF, and StoneRounds), covering acquired and inherited retinal diseases, and evaluated using classification accuracy and F1-score. MDF-MLLM integrates skip features from four U-Net encoder layers into cross-attention blocks within a LLaMA 3.2 11B MLLM; vision features are projected patch-wise and fused using scaled cross-attention and FiLM-based U-Net modulation. The baseline MLLM achieved 60% accuracy on the dual-type disease classification task. MDF-MLLM, with both the U-Net and MLLM components fully fine-tuned during training, achieved a significantly higher accuracy of 94%, a 56% relative improvement. Recall and F1-score improved over baseline by as much as 67% and 35%, respectively. Ablation studies confirmed that the multi-depth fusion approach drove substantial gains in spatial reasoning and classification, particularly for inherited diseases with rich clinical text. MDF-MLLM offers a generalizable, interpretable, and modular framework for fundus image classification that outperforms traditional MLLM baselines through multi-scale feature fusion, and the architecture holds promise for real-world deployment in clinical decision support systems. Future work will explore synchronized training techniques, a larger pool of diseases for greater generalizability, and extensions to segmentation tasks.
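The fusion step described in the abstract can be sketched numerically: U-Net skip features are modulated by FiLM (a learned per-channel affine transform), projected patch-wise, and injected into the language model's hidden states through scaled dot-product cross-attention with a residual connection. This is a minimal single-level NumPy illustration; the dimensions, random weights, and single skip level are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def film(features, gamma, beta):
    # FiLM: feature-wise linear modulation, gamma * x + beta per channel
    return gamma * features + beta

def cross_attention_fusion(text_hidden, vision_patches, Wq, Wk, Wv):
    # text_hidden: (T, d) MLLM hidden states; vision_patches: (P, d) projected U-Net features
    Q = text_hidden @ Wq                       # queries from text, (T, d_k)
    K = vision_patches @ Wk                    # keys from vision, (P, d_k)
    V = vision_patches @ Wv                    # values from vision, (P, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # scaled dot-product attention
    attn = softmax(scores, axis=-1)            # each text token attends over patches
    return text_hidden + attn @ V              # residual fusion into the text stream

rng = np.random.default_rng(0)
d, d_k, T, P = 16, 8, 4, 9                     # illustrative sizes (hidden dim, key dim, tokens, patches)
skip = rng.normal(size=(P, d))                 # one U-Net skip level, flattened patch-wise
gamma, beta = rng.normal(size=d), rng.normal(size=d)
modulated = film(skip, gamma, beta)            # FiLM-modulated U-Net features
Wq = rng.normal(size=(d, d_k))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d))
hidden = rng.normal(size=(T, d))
fused = cross_attention_fusion(hidden, modulated, Wq, Wk, Wv)
print(fused.shape)                             # fused hidden states, same shape as the text stream
```

In the full architecture this fusion would be repeated at each of the four U-Net skip depths, with learned projections in place of the random matrices above.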