This paper introduces a multimodal large language model (MLLM) framework for road traffic accident analysis, integrating remote sensing imagery with structured accident data. The authors fine-tuned three open-source vision-language models using LoRA through a three-stage pipeline to enhance visual description, multi-task classification, and chain-of-thought reasoning. Experiments demonstrate improved CIDEr scores for image description and enhanced accuracy and F1-scores in accident severity and duration classification, alongside significant gains in CoT reasoning metrics.
MLLMs can now reason about road traffic accidents by fusing remote sensing imagery and structured data, unlocking interpretable insights previously inaccessible to traditional methods.
Traditional road traffic accident analysis has long relied on structured data, making it difficult to integrate high-dimensional heterogeneous information such as remote sensing imagery and leading to an incomplete understanding of accident scene environments. This study proposes a road traffic accident analysis framework based on Multimodal Large Language Models. The approach integrates high-resolution remote sensing imagery with structured accident data through a three-stage progressive training pipeline. Specifically, we fine-tune three open-source vision–language models using Low-Rank Adaptation (LoRA) to sequentially optimize the model's capabilities in visual environmental description, multi-task accident classification, and Chain-of-Thought (CoT) driven causal reasoning. A multimodal dataset was constructed containing remote sensing image descriptions, accident classification labels, and interpretable reasoning chains. Experimental results show that the fine-tuned models achieved their largest gains in CIDEr score on the image description task. In the joint classification task of accident severity and duration, the model achieved an accuracy of 71.61% and an F1-score of 0.8473. In the CoT reasoning task, both METEOR and CIDEr scores improved significantly. These results validate the effectiveness of structured reasoning mechanisms in multimodal fusion for transportation applications, providing a feasible path toward interpretable and intelligent analysis for real-world traffic management.
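The LoRA technique the paper uses to fine-tune its vision–language models can be sketched with a minimal NumPy example of the low-rank update applied to a frozen weight matrix. All shapes, rank, and scaling values below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """LoRA-adapted linear layer: y = x W^T + (alpha / r) * x A^T B^T.

    W (d_out, d_in) is the frozen pretrained weight; only the low-rank
    factors A (r, d_in) and B (d_out, r) are trained, so the number of
    trainable parameters scales with r rather than d_out * d_in.
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 8, 16   # illustrative dimensions and rank

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor A
B = np.zeros((d_out, r))                 # B starts at zero, so training begins
                                         # from the unmodified pretrained model
x = rng.normal(size=(4, d_in))           # a batch of 4 input vectors

y = lora_forward(x, W, A, B, alpha, r)

# With B = 0 the adapted output equals the frozen model's output exactly.
assert np.allclose(y, x @ W.T)
```

Zero-initializing `B` is the standard LoRA choice: it guarantees the adapted model is identical to the pretrained one at the start of each fine-tuning stage, which fits the paper's sequential three-stage pipeline where each stage builds on the previous model's behavior.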