Friedrich-Alexander University Erlangen-Nürnberg · International School of Medicine · UZH · Feb 25, 2026 · arXiv:2602.21735

SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning

Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz

AI Summary

The paper introduces SigVLP, a novel vision-language pre-training approach for CT volumes that addresses the challenge of variable volume sizes by treating the z-axis as an unconstrained temporal dimension using Rotary Position Embeddings. SigVLP implements Rotary Position Embedding within the attention operation to generate input-conditioned sine and cosine weights, enabling adaptation to variable input sizes and consistent query-key alignment. By training with chunkwise volume-text pairs and the Muon optimizer, SigVLP achieves finer-grained supervision and demonstrates improved performance on zero-shot abnormality/organ classification, segmentation, and retrieval tasks.

Key Contribution

Ditch fixed-size 3D blocks: SigVLP uses rotary embeddings to let vision-language models handle CT volumes with variable slice counts, unlocking better pre-training.

Abstract

Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval.
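To make the rotary-embedding idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`rope_angles`, `apply_rope`, `attention_scores`), the frequency base of 10000 (the common RoPE default), and the use of the chunk index along the z-axis as the position are all assumptions made for illustration. The sketch shows how sine and cosine weights can be computed on the fly for however many chunks the input has, and why the resulting query-key score depends only on the relative offset between chunks.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Rotation angles for one position: one angle per pair of features.

    `base` is the usual RoPE default; the paper's value is not stated here.
    """
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "feature dimension must be even"
    out = []
    for i, theta in enumerate(rope_angles(position, dim, base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_scores(queries, keys):
    """Unscaled dot-product scores between rotated queries and keys.

    Assumption: position = index of the chunk along the z-axis. No learned
    position table is needed, so any number of chunks can be passed in.
    """
    q_rot = [apply_rope(q, z) for z, q in enumerate(queries)]
    k_rot = [apply_rope(k, z) for z, k in enumerate(keys)]
    return [[sum(a * b for a, b in zip(q, k)) for k in k_rot] for q in q_rot]
```

Because the same rotation is applied to queries and keys, the score between chunk m and chunk n depends only on n - m, so a volume with 3 chunks and one with 300 chunks go through the same code path without cropping or interpolation.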

Architecture Design (Transformers, SSMs, MoE) · Computer Vision Multimodal Models

Year: 2026

Venue: N/A
