The paper introduces EVIAN, a framework for auditing visual instruction-tuning data by decomposing model responses into visual descriptions, subjective inferences, and factual claims. EVIAN evaluates these components along three axes (Image-Text Consistency, Logical Coherence, and Factual Accuracy), enabling targeted analysis of data quality. Experiments show that Large Vision-Language Models (LVLMs) fine-tuned on compact, high-quality subsets curated by EVIAN outperform models trained on datasets orders of magnitude larger, highlighting the importance of data quality over quantity.
Forget scaling laws: a model trained on a carefully curated subset of visual instruction data can beat models trained on datasets orders of magnitude larger.
The efficacy of Large Vision-Language Models (LVLMs) depends critically on the quality of their training data, which requires a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws such as logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components (visual description, subjective inference, and factual claim), enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpasses models trained on orders-of-magnitude larger datasets. We also show that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.
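The "Decomposition-then-Evaluation" idea can be illustrated with a minimal sketch. This is not EVIAN's implementation: the names `Component`, `AXIS_FOR_KIND`, `audit`, and `keep` are hypothetical, the one-to-one pairing of component type to audit axis is an assumption the abstract does not confirm, and the scorers are hand-assigned toy values standing in for whatever judge a real auditor would use. Taking the minimum score per axis and applying a 0.5 threshold are likewise illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# ASSUMPTION: each component type is checked on exactly one axis.
# The abstract names the three components and three axes but not this pairing.
AXIS_FOR_KIND = {
    "visual_description": "image_text_consistency",
    "subjective_inference": "logical_coherence",
    "factual_claim": "factual_accuracy",
}


@dataclass
class Component:
    kind: str  # one of the keys of AXIS_FOR_KIND
    text: str  # the span of the response belonging to this component


# A scorer maps a component to a quality score in [0, 1] (hypothetical interface).
Scorer = Callable[[Component], float]


def audit(components: List[Component], scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Score every component on its axis and keep the worst score per axis."""
    worst: Dict[str, float] = {}
    for comp in components:
        axis = AXIS_FOR_KIND[comp.kind]
        score = scorers[axis](comp)
        worst[axis] = min(score, worst.get(axis, 1.0))
    return worst


def keep(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Keep a sample only if every audited axis meets the threshold."""
    return all(s >= threshold for s in scores.values())


# Toy example: hand-assigned scores replace a real judge model.
toy_scores = {
    "A dog runs along a beach.": 0.9,
    "Therefore it must be winter.": 0.2,  # weak inference
    "Dogs are mammals.": 0.8,
}
lookup: Scorer = lambda comp: toy_scores[comp.text]
scorers = {axis: lookup for axis in AXIS_FOR_KIND.values()}

sample = [
    Component("visual_description", "A dog runs along a beach."),
    Component("subjective_inference", "Therefore it must be winter."),
    Component("factual_claim", "Dogs are mammals."),
]

scores = audit(sample, scorers)
print(scores)
print(keep(scores))  # False: the logical_coherence score of 0.2 is below 0.5
```

Because each axis is scored separately, a rejected sample carries the reason for its rejection (here, a weak inference), which a single coarse-grained score would hide.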