The paper introduces ChartGen, a fully automated pipeline for generating synthetic chart image-code pairs to improve chart understanding in vision-language models (VLMs). ChartGen leverages a VLM to reconstruct seed chart images into Python scripts and then uses a code-oriented LLM to iteratively augment these scripts, creating a diverse dataset. The authors generated 222.5K unique chart image-code pairs and used a held-out evaluation set to benchmark six open-weight VLMs, demonstrating significant room for improvement in chart-to-code reconstruction.
Forget hand-annotated data: ChartGen automatically generates 222.5K chart-image/code pairs, exposing surprising weaknesses in today's VLMs at reconstructing plotting scripts.
Chart-to-code reconstruction -- the task of recovering executable plotting scripts from chart images -- provides important insights into a model's ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a Python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart image-code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B to 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: https://github.com/SD122025/ChartGen/
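
The two-stage loop described above (reconstruct each seed image into a script, then iteratively augment that script) can be sketched roughly as follows. This is a minimal illustration, not the released ChartGen code: `reconstruct` and `augment` are hypothetical placeholders for the VLM and code-LLM calls, and the syntax check and hash-based deduplication are assumptions added here, since the abstract does not say how validity or uniqueness is enforced. Rendering each kept script back into a chart image is omitted.

```python
import hashlib
from typing import Callable, Iterable, List


def script_key(script: str) -> str:
    """Hash a script after trimming whitespace, for exact-duplicate detection."""
    normalized = "\n".join(line.rstrip() for line in script.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_valid_python(script: str) -> bool:
    """Return True if the script at least compiles (it is not executed)."""
    try:
        compile(script, "<chart_script>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def generate_scripts(
    seed_images: Iterable[str],
    reconstruct: Callable[[str], str],  # hypothetical VLM call: image path -> script
    augment: Callable[[str], str],      # hypothetical code-LLM call: script -> variant
    rounds: int = 2,                    # assumed number of augmentation rounds
) -> List[str]:
    """Collect unique, syntactically valid scripts from seeds and their variants."""
    seen = set()
    kept = []
    for image in seed_images:
        script = reconstruct(image)
        # One pass for the reconstructed script, then `rounds` augmented variants.
        for _ in range(rounds + 1):
            if not is_valid_python(script):
                break  # stop augmenting a chain once it produces broken code
            key = script_key(script)
            if key not in seen:
                seen.add(key)
                kept.append(script)
            script = augment(script)
    return kept
```

In a real pipeline each kept script would then be executed in a sandbox to render its chart, yielding the image half of every image-code pair.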