Feb 16, 2026 · arXiv:2602.14743

LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner

AI Summary

The paper introduces LLMStructBench, a new benchmark dataset for evaluating LLMs on structured data extraction and JSON generation from natural language text, comprising diverse parsing scenarios. The authors systematically tested 22 LLMs using five prompting strategies and introduced new metrics for token-level accuracy and document-level validity. Their key finding is that the choice of prompting strategy significantly impacts parsing reliability and structural validity, often outweighing the influence of model size.

Key Contribution

The choice of prompting strategy can improve the reliability of LLM JSON output more than an increase in model size does.

Abstract

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. The right strategy ensures structural validity, especially for smaller or less reliable models, but increases the number of semantic errors. Our benchmark suite is a step towards future research in the area of LLMs applied to parsing or Extract, Transform and Load (ETL) applications.
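The distinction the abstract draws between document-level validity and finer-grained accuracy can be illustrated with a minimal sketch. This is not the benchmark's scoring code: the function names `flatten` and `score_output`, the sample record, and the use of field-level exact match as a stand-in for token-level accuracy are all assumptions made for illustration.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested JSON into a {path: leaf_value} mapping (illustrative helper)."""
    if isinstance(value, dict):
        items = {}
        for key, child in value.items():
            items.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return items
    if isinstance(value, list):
        items = {}
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}[{index}]"))
        return items
    return {prefix: value}


def score_output(raw_output, expected):
    """Score one model output against the expected record.

    'valid' is a document-level check: does the output parse as JSON at all?
    'field_accuracy' is an assumed proxy for finer-grained accuracy: the share
    of expected leaf fields reproduced exactly.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid": False, "field_accuracy": 0.0}

    expected_fields = flatten(expected)
    predicted_fields = flatten(parsed)
    correct = sum(
        1
        for path, value in expected_fields.items()
        if path in predicted_fields and predicted_fields[path] == value
    )
    accuracy = correct / len(expected_fields) if expected_fields else 1.0
    return {"valid": True, "field_accuracy": accuracy}


# Hypothetical sample record and model outputs.
expected = {"name": "Ada", "order": {"id": 7, "items": ["pen", "ink"]}}
exact = '{"name": "Ada", "order": {"id": 7, "items": ["pen", "ink"]}}'
wrong_id = '{"name": "Ada", "order": {"id": 8, "items": ["pen", "ink"]}}'
truncated = '{"name": "Ada", '

for output in (exact, wrong_id, truncated):
    print(score_output(output, expected))
```

The second output shows why the two kinds of metric are complementary: it parses as valid JSON yet carries a wrong value, which is the kind of semantic error the abstract says can increase even when structural validity is ensured.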

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
