AI2 | HUJI | Apr 10, 2026 | arXiv:2604.09237

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

Shahar Levy, Reshef Mintz, Barak Raveh, Renana Keydar

AI Summary

ScheMatiQ automates the creation of structured databases from unstructured text corpora given a natural language research question, using LLMs to generate an extraction schema and populate a grounded database. This avoids the manual schema design and annotation process, which is slow and error-prone. Experiments with domain experts in law and computational biology demonstrate that ScheMatiQ produces outputs suitable for real-world analysis.

Key Contribution

Skip the annotation bottleneck: ScheMatiQ lets you turn research questions and text corpora into structured databases with LLMs, guided by a simple web interface.

Abstract

Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus and produce a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
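The question-to-database flow described in the abstract can be illustrated with a minimal sketch. This is not ScheMatiQ's actual code or API: every name here (`Cell`, `discover_schema`, `extract_row`, `build_database`, `fake_llm`), the prompt wording, and the substring-based grounding check are assumptions made for illustration, and the backbone LLM is replaced by a canned stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; none are taken from the ScheMatiQ source.
LLM = Callable[[str], str]


@dataclass
class Cell:
    value: str
    evidence: str  # verbatim span from the document that supports the value


def discover_schema(question: str, corpus: List[str], llm: LLM) -> List[str]:
    """Ask the LLM for column names relevant to the research question."""
    sample = "\n---\n".join(corpus[:3])  # show the model only a small sample
    reply = llm(f"QUESTION: {question}\nSAMPLE:\n{sample}\nOne column name per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def extract_row(doc: str, schema: List[str], llm: LLM) -> Dict[str, Cell]:
    """Fill one row, keeping only values whose evidence appears in the document."""
    row: Dict[str, Cell] = {}
    for column in schema:
        reply = llm(f"COLUMN: {column}\nDOCUMENT:\n{doc}\nAnswer as 'value | evidence'.")
        value, _, evidence = reply.partition("|")
        value, evidence = value.strip(), evidence.strip()
        # Assumed grounding rule: the quoted evidence must occur verbatim in doc.
        if evidence and evidence in doc:
            row[column] = Cell(value, evidence)
    return row


def build_database(
    question: str, corpus: List[str], llm: LLM
) -> Tuple[List[str], List[Dict[str, Cell]]]:
    schema = discover_schema(question, corpus, llm)
    return schema, [extract_row(doc, schema, llm) for doc in corpus]


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a backbone LLM so the sketch runs without a network."""
    if prompt.startswith("QUESTION:"):
        return "court\noutcome"
    if prompt.startswith("COLUMN: court"):
        return "Supreme Court | the Supreme Court"
    if prompt.startswith("COLUMN: outcome"):
        return "granted | appeal was granted"  # evidence is not in the document
    return ""


corpus = ["In 2019 the Supreme Court ruled that the appeal was dismissed."]
schema, rows = build_database("How do courts rule on appeals?", corpus, fake_llm)
print(schema)
print(rows[0])
```

In this sketch the stub's ungrounded "outcome" answer is discarded because its quoted evidence does not occur in the document, which is one simple way to read "grounded"; the interactive steering and revision that the web interface provides is not modeled here.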

Data Curation & Synthetic Data | Natural Language Processing | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
