Mar 16, 2026arXiv:2603.15237

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

AI Summary

This paper introduces a physics-informed instruction tuning framework to improve VLMs' performance on physics-grounded anomaly detection. The framework encodes object properties, motion paradigms, and dynamic constraints into structured prompts delivered through multi-turn dialogues, which decomposes causal reasoning into incremental steps. Results on the Phys-AD benchmark show the approach achieves 96.7% AUROC in video-level detection, significantly outperforming the previous state-of-the-art.

Key Contribution

VLMs can achieve near-perfect anomaly detection in physical systems by incorporating structured physics priors into multi-turn dialogues, a massive leap from previous methods.

Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Related Papers