CMU MLIIT | Feb 16, 2026 | arXiv:2602.14989

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra

AI Summary

The paper introduces ThermEval-B, a new benchmark comprising approximately 55,000 thermal visual question answering pairs, to evaluate vision-language models (VLMs) on thermal imagery. The benchmark includes a new dataset, ThermEval-D, providing dense per-pixel temperature maps with semantic body-part annotations. Experiments on 25 VLMs reveal consistent failures in temperature-grounded reasoning, sensitivity to colormap transformations, and reliance on language priors, highlighting the need for specialized evaluation beyond RGB-centric benchmarks.

Key Contribution

VLMs that perform strongly on RGB images do not generalize to thermal imagery, revealing a gap in their ability to reason about temperature and other physical properties.

Abstract

Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
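The colormap sensitivity mentioned in the abstract can be illustrated with a minimal sketch. The code below is not taken from the paper: the temperature values, the function names, and the three colormaps are all hypothetical, chosen only to show how one per-pixel temperature map can produce very different images depending on the rendering.

```python
def normalize(temps, t_min, t_max):
    """Scale temperatures (deg C) to [0, 1], clipping values outside the range."""
    span = t_max - t_min
    return [[min(1.0, max(0.0, (t - t_min) / span)) for t in row] for row in temps]


def gray(v):
    """Grayscale ("white-hot") colormap: hotter pixels are brighter."""
    c = round(255 * v)
    return (c, c, c)


def gray_inverted(v):
    """Inverted grayscale ("black-hot") colormap: hotter pixels are darker."""
    c = round(255 * (1.0 - v))
    return (c, c, c)


def red_blue(v):
    """Simple two-color ramp: cold is blue, hot is red."""
    return (round(255 * v), 0, round(255 * (1.0 - v)))


def render(temps, colormap, t_min, t_max):
    """Turn a temperature map into RGB pixels using the given colormap."""
    return [[colormap(v) for v in row] for row in normalize(temps, t_min, t_max)]


# Hypothetical 2x3 temperature map in degrees Celsius (not from ThermEval-D).
temps = [[20.0, 28.0, 36.0],
         [22.0, 30.0, 36.0]]

white_hot = render(temps, gray, 20.0, 36.0)
black_hot = render(temps, gray_inverted, 20.0, 36.0)
ramp = render(temps, red_blue, 20.0, 36.0)
```

In this sketch the same 36 °C pixel is white in `white_hot`, black in `black_hot`, and pure red in `ramp`. A model that treats brightness or hue as a direct proxy for heat would therefore give inconsistent answers across renderings of identical temperature data.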

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
