IST Lisbon · Sword Health · Apr 9, 2026 · arXiv:2604.08294

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, P. Martins, Pedro Henrique Martins

AI Summary

This paper benchmarks the performance of state-of-the-art Vision Language Models (VLMs) on Action Quality Assessment (AQA) across diverse activities and tasks. The study finds that models like Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 perform only slightly better than random chance, even with enhancements like skeleton data and in-context learning. Analysis reveals biases towards predicting correct execution and sensitivity to linguistic framing, highlighting fundamental challenges in using VLMs for fine-grained movement quality assessment.

Key Contribution

Despite their impressive capabilities, today's VLMs struggle to judge action quality, performing barely above chance even with tailored prompts and visual cues.

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.
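To make the two headline measurements concrete (accuracy relative to chance, and a skew towards predicting correct execution), the following is a minimal sketch of how such a check could be computed from a list of predictions. It is not the authors' evaluation code: the function name `bias_report`, the label strings, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def bias_report(labels, preds, positive="correct"):
    """Summarise accuracy against simple baselines and the skew towards one class.

    `labels` and `preds` are equally long lists of class names.
    `positive` is the class whose over-prediction we want to measure
    (assumed here to be the "correct execution" label).
    """
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and equally long")

    n = len(labels)
    accuracy = sum(l == p for l, p in zip(labels, preds)) / n

    # Uniform random guessing over the classes that actually occur.
    classes = set(labels) | set(preds)
    uniform_chance = 1 / len(classes)

    # Always predicting the most frequent ground-truth class.
    majority_baseline = Counter(labels).most_common(1)[0][1] / n

    # How often the positive class is predicted versus how often it is true.
    predicted_rate = preds.count(positive) / n
    true_rate = labels.count(positive) / n

    return {
        "accuracy": accuracy,
        "uniform_chance": uniform_chance,
        "majority_baseline": majority_baseline,
        "predicted_positive_rate": predicted_rate,
        "true_positive_class_rate": true_rate,
        "bias_gap": predicted_rate - true_rate,
    }


# Toy data (hypothetical, not taken from the paper).
labels = ["correct", "incorrect", "correct", "incorrect",
          "incorrect", "correct", "incorrect", "incorrect"]
preds = ["correct", "correct", "correct", "correct",
         "incorrect", "correct", "correct", "incorrect"]

report = bias_report(labels, preds)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the predictions look better than a coin flip but only match the majority-class baseline, while the positive class is predicted twice as often as it occurs. A positive `bias_gap` of this kind is one simple way to surface a tendency to predict correct execution regardless of the evidence.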

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
