Feb 25, 2026 | arXiv:2602.22098

Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Antonino Ferraro, Vincenzo Moscato

AI Summary

The paper introduces Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI that addresses the limitations of 2D slice-based approaches. Brain3D inflates a pretrained 2D medical encoder into a 3D architecture and aligns it with a causal language model in three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Evaluated on a dataset of 468 subjects, Brain3D achieves a Clinical Pathology F1 score of 0.951, compared with 0.413 for a 2D baseline, while maintaining perfect specificity on healthy scans.

Key Contribution

Brain3D processes volumetric brain MRI directly instead of slice by slice, reaching a Clinical Pathology F1 of 0.951 against 0.413 for a 2D slice-based baseline.

Abstract

Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, Brain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.
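The abstract does not say how the 2D encoder is inflated. As a rough illustration only, the sketch below shows one common inflation scheme (repeating a 2D kernel along a new depth axis and rescaling by the depth), which may differ from what Brain3D actually does. The function name `inflate_2d_kernel`, the kernel shapes, and the use of NumPy are assumptions made for this example.

```python
import numpy as np


def inflate_2d_kernel(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D conv kernel (out, in, kh, kw) to 3D (out, in, depth, kh, kw).

    The kernel is copied along the new depth axis and divided by `depth`,
    so a volume that is constant along depth yields the same response
    as the original 2D kernel applied to a single slice.
    NOTE: hypothetical helper, not taken from the Brain3D code.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a kernel of shape (out, in, kh, kw)")
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], depth, axis=2)
    return w3d / depth


# Toy patch-embedding kernel: 8 output channels, 1 input channel, 4x4 patches.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 1, 4, 4))
w3d = inflate_2d_kernel(w2d, depth=4)

# A 2D patch and a 3D patch that repeats it along the depth axis.
patch2d = rng.standard_normal((1, 4, 4))
patch3d = np.repeat(patch2d[:, np.newaxis, :, :], 4, axis=1)

# Response of each kernel to its patch (a single convolution position).
resp2d = np.einsum("oihw,ihw->o", w2d, patch2d)
resp3d = np.einsum("oidhw,idhw->o", w3d, patch3d)
```

Under this scheme the inflated kernel starts out reproducing the pretrained 2D responses on depth-constant inputs, which is why it is a popular initialization before fine-tuning on volumetric data.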

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
