ByteDance, HKUST | Apr 22, 2026 | arXiv:2604.20544

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

AI Summary

The paper introduces EVIAN, a framework for auditing visual instruction-tuning data by decomposing model responses into visual description, subjective inference, and factual claims. EVIAN evaluates these components along axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy, enabling targeted analysis of data quality. Experiments demonstrate that fine-tuning LVLMs on EVIAN-curated, high-quality subsets outperforms training on much larger, uncurated datasets, highlighting the importance of data quality over quantity.

Key Contribution

A model fine-tuned on a compact, EVIAN-curated subset of visual instruction data consistently outperforms models trained on datasets orders of magnitude larger, challenging the prevailing scale-centric paradigm.

Abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.
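To make the "Decomposition-then-Evaluation" idea concrete, here is a minimal Python sketch of how a decomposed sample and its per-axis scores might be represented and filtered. It is an illustration under stated assumptions, not the authors' implementation: the class names, the `[0, 1]` score range, the `0.8` threshold, and the all-axes-must-pass rule are all hypothetical, and the scores are assumed to come from some upstream auditor that is not shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ComponentType(Enum):
    # The three cognitive components named in the abstract.
    VISUAL_DESCRIPTION = "visual_description"
    SUBJECTIVE_INFERENCE = "subjective_inference"
    FACTUAL_CLAIM = "factual_claim"


class Axis(Enum):
    # The three evaluation axes named in the abstract.
    IMAGE_TEXT_CONSISTENCY = "image_text_consistency"
    LOGICAL_COHERENCE = "logical_coherence"
    FACTUAL_ACCURACY = "factual_accuracy"


@dataclass
class Component:
    kind: ComponentType
    text: str


@dataclass
class AuditedSample:
    sample_id: str
    components: List[Component]
    # Hypothetical per-axis scores in [0, 1], produced by an auditor not shown here.
    scores: Dict[Axis, float]


def failed_axes(sample: AuditedSample, threshold: float = 0.8) -> List[Axis]:
    """Return the axes on which the sample falls below the (assumed) threshold."""
    return [axis for axis in Axis if sample.scores.get(axis, 0.0) < threshold]


def curate(samples: List[AuditedSample], threshold: float = 0.8) -> List[AuditedSample]:
    """Keep only samples that clear the threshold on every axis (assumed rule)."""
    return [s for s in samples if not failed_axes(s, threshold)]
```

Reporting the failing axes rather than a single scalar is what gives a sketch like this its explainable flavor: a rejected sample carries the reason it was rejected, for example a low Logical Coherence score.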

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
