Mar 16, 2026 · arXiv:2603.15118

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels

AI Summary

The VAREX benchmark is introduced to evaluate multimodal foundation models on structured data extraction from government forms using a reverse annotation pipeline to generate synthetic data with deterministic ground truth. The benchmark includes 1,777 documents with varied schemas and provides four input modalities (plain text, layout-preserving text, document image, and combined text/image) to enable ablation studies on the impact of input format. Evaluation of 20 models reveals that structured output compliance is a major bottleneck for models below 4B parameters, and that layout-preserving text provides the largest accuracy gain.

Key Contribution

Layout-preserving text beats pixel-level visual cues for structured data extraction from documents, according to a new multimodal benchmark.
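To make "layout-preserving text" concrete, here is a minimal sketch of one way such a rendering could be produced. It assumes word boxes are already available as `(text, x, y)` tuples in page points from some PDF extractor. The function name `render_layout_text` and the `char_width` and `line_height` defaults are illustrative assumptions, not part of the benchmark's published code.

```python
from collections import defaultdict


def render_layout_text(words, char_width=6.0, line_height=12.0):
    """Render (text, x, y) word boxes as whitespace-aligned text.

    Hypothetical helper: x and y are page coordinates in points, and
    char_width / line_height are assumed average glyph metrics.
    """
    # Group words into text rows by quantizing the vertical position.
    rows = defaultdict(list)
    for text, x, y in words:
        rows[round(y / line_height)].append((x, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for x, text in sorted(rows[row]):
            col = int(x / char_width)  # target character column
            # Pad up to the target column; keep at least one space
            # between neighbouring words on the same row.
            pad = max(col - len(line), 1) if line else col
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


words = [
    ("Name:", 0, 0), ("Jane", 120, 0),
    ("Date:", 0, 12), ("2026-01-01", 120, 12),
]
layout = render_layout_text(words)
print(layout)
```

In this toy input both values start at the same character column, so a model reading the text can still see that they belong to one form column, which a plain whitespace-collapsed dump would lose.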

Abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, and text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX's four controlled modalities per document enable systematic ablation of how input format affects extraction accuracy. We evaluate 20 models, ranging from frontier proprietary models to small open models, with particular attention to models of at most 4B parameters, which suit cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance, not extraction capability, is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
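The schema-echo failure mode can be illustrated with a small sketch. It assumes model outputs are parsed JSON objects and that schemas follow JSON Schema conventions. Both `looks_like_schema_echo` and `exact_match_accuracy` are hypothetical helpers written for this illustration; the benchmark's actual detection logic and scoring metric are not specified in the abstract.

```python
# JSON Schema primitive type names, used to spot echoed schema fragments.
JSON_SCHEMA_TYPES = {
    "string", "number", "integer", "boolean", "object", "array", "null",
}


def looks_like_schema_echo(output):
    """Heuristic (assumed, not the paper's): does the output mirror a schema?"""
    if not isinstance(output, dict):
        return False
    # A top-level "properties" mapping is a strong sign of an echoed schema.
    if isinstance(output.get("properties"), dict):
        return True
    # Otherwise flag outputs whose every field is a {"type": <schema type>} stub.
    values = list(output.values())
    return bool(values) and all(
        isinstance(v, dict) and v.get("type") in JSON_SCHEMA_TYPES
        for v in values
    )


def exact_match_accuracy(pred, truth):
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not truth or not isinstance(pred, dict):
        return 0.0
    hits = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return hits / len(truth)


truth = {"name": "Jane Doe", "year": "2026"}
echoed = {"name": {"type": "string"}, "year": {"type": "string"}}
extracted = {"name": "Jane Doe", "year": "2025"}

# Gap in percentage points (pp) between a real extraction and an echo.
gap_pp = 100 * (
    exact_match_accuracy(extracted, truth) - exact_match_accuracy(echoed, truth)
)
print(looks_like_schema_echo(echoed), looks_like_schema_echo(extracted), gap_pp)
```

An echoed output is valid JSON and matches the expected keys, yet it scores zero on every field, which is why the effect shows up as a compliance problem and not as an extraction error. The heuristic would misfire on a form that legitimately contains nested `type` fields with values such as `string`, so it should be read as a sketch only.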

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
