May 6, 2026 · arXiv:2605.04943

DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

AI Summary

DART is a vision-language foundation model that addresses the full rope inspection workflow. It extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Key innovations include HD-MASK for saliency-guided masking, per-class learnable severity gates, and a Contrastive Damage Disentanglement (CDD) loss. Trained once on 4,270 images, the frozen DART backbone performs strongly without task-specific fine-tuning on damage classification (93.22 % accuracy), severity regression (Spearman rho = 0.94), and few-shot recognition (89.2 % macro-F1 at 20 shots), demonstrating its potential as a general-purpose condition monitoring backbone.
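To make the idea of saliency-guided masking concrete, the following is a minimal sketch of saliency-weighted patch selection. It is not the paper's HD-MASK implementation: the function name `saliency_guided_mask`, the `mask_ratio` parameter, and the sampling scheme (weighted sampling without replacement via Efraimidis-Spirakis keys) are all assumptions chosen for illustration, and the per-patch saliency scores are taken as given.

```python
import random
from typing import Sequence


def saliency_guided_mask(
    saliency: Sequence[float], mask_ratio: float, seed: int = 0
) -> list[int]:
    """Pick patch indices to mask, favouring high-saliency patches.

    Hypothetical helper, not the paper's HD-MASK. Each patch i gets the key
    u_i ** (1 / w_i) with u_i uniform in [0, 1) and w_i its saliency; the
    patches with the largest keys are masked (Efraimidis-Spirakis sampling).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    if any(s < 0 for s in saliency):
        raise ValueError("saliency scores must be non-negative")

    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(saliency))
    eps = 1e-8  # keeps zero-saliency patches selectable, but very unlikely

    keys = [(rng.random() ** (1.0 / (s + eps)), i) for i, s in enumerate(saliency)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:n_mask])


# Toy example: 16 patches, the last 4 are "damage-dense".
scores = [0.0] * 12 + [1.0] * 4
masked = saliency_guided_mask(scores, mask_ratio=0.25)
```

In this toy setup the masked set concentrates on the four high-saliency patches, so a reconstruction objective applied to the masked positions would spend its capacity on the damaged region rather than on background.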

Key Contribution

A single vision-language foundation model, DART, can perform a full rope inspection workflow, including damage classification, severity estimation, and few-shot recognition, all without task-specific fine-tuning.

Abstract

The condition monitoring (CM) of synthetic fibre ropes (SFRs) used in offshore, maritime, and industrial settings demands more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. We present DART (Damage Assessment via Rope Transformer), a vision-language foundation model that addresses the full rope inspection workflow through a unified multi-task architecture. DART extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three architectural innovations drive the model's versatility: (1) HD-MASK, a saliency-guided masking strategy that focuses self-supervised reconstruction on damage-dense patches; (2) per-class learnable severity gates that adaptively weight language grounding by damage category; and (3) a Contrastive Damage Disentanglement (CDD) loss that shapes the embedding space to simultaneously encode damage type, severity ordering, and cross-modal semantics. Trained once on 4,270 images spanning 14 fine-grained rope damage classes, the frozen DART backbone supports downstream tasks without any task-specific fine-tuning: damage classification (93.22 % accuracy, 91.04 % macro-F1, +38.5 pp over a vision-only baseline), continuous severity regression (Spearman rho = 0.94, within-1-ordinal accuracy 99.6 %), and few-shot recognition (89.2 % macro-F1 at 20 shots). These results demonstrate that DART functions as a general-purpose CM backbone that goes well beyond classification, providing actionable inspection intelligence from a single shared representation.
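The two severity-regression metrics quoted above can be illustrated with a short, dependency-free sketch. Spearman's rho is computed in the standard way, as the Pearson correlation of average ranks. The abstract does not define within-1-ordinal accuracy, so the definition used here (round the continuous prediction to the nearest level and count it correct if it is at most one level from the label) is an assumption, as are the function names and the toy data.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(pred: Sequence[float], true: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rp, rt = average_ranks(pred), average_ranks(true)
    n = len(rp)
    mean_p, mean_t = sum(rp) / n, sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5


def within_one_accuracy(pred: Sequence[float], true: Sequence[int]) -> float:
    """Fraction of predictions whose rounded level is within 1 of the label.

    Assumed definition; the source does not spell out how this is computed.
    """
    hits = sum(abs(round(p) - t) <= 1 for p, t in zip(pred, true))
    return hits / len(true)


# Toy data (made up for illustration, not from the paper).
true_levels = [0, 1, 2, 3, 4]
predicted = [0.2, 0.9, 2.4, 2.6, 4.1]
rho = spearman_rho(predicted, true_levels)
acc = within_one_accuracy(predicted, true_levels)
```

Because Spearman's rho depends only on rank order, it rewards a model that orders ropes correctly by severity even when the predicted values are offset from the labels; the within-1 measure complements it by checking how close the predictions are to the labelled level.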

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models

