Apr 30, 2026 | arXiv:2604.28076

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

An-Yang Ji, Anya Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

AI Summary

TopBench is a new benchmark for evaluating LLMs on implicitly predictive tabular question answering tasks, which require inferring unobserved answers from historical patterns. The benchmark comprises 779 samples across four sub-tasks: single-point prediction, decision making, treatment effect analysis, and complex filtering. Experiments on TopBench reveal that current LLMs struggle with intent recognition and predictive reasoning, often defaulting to simple lookups, and that accurate intent disambiguation is crucial for improved performance.

Key Contribution

LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data are required.

Abstract

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and performing reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to simple lookups. Deeper analysis shows that accurate intent disambiguation is a prerequisite for eliciting these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.
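To make the distinction between retrieval and implicit prediction concrete, the following is a minimal sketch and is not taken from TopBench itself. The toy sales table, the function names (`lookup`, `predict_linear`, `answer`), and the choice of a least-squares linear trend are all assumptions made for illustration; the benchmark's actual tasks and any models used on them may look quite different.

```python
# Hypothetical toy table (NOT benchmark data): monthly sales for months 1..6.
table = [
    {"month": 1, "sales": 100.0},
    {"month": 2, "sales": 110.0},
    {"month": 3, "sales": 120.0},
    {"month": 4, "sales": 130.0},
    {"month": 5, "sales": 140.0},
    {"month": 6, "sales": 150.0},
]


def lookup(rows, month):
    """Retrieval-style answer: the recorded value, or None if the month is unobserved."""
    for row in rows:
        if row["month"] == month:
            return row["sales"]
    return None


def predict_linear(rows, month):
    """Predictive answer: fit sales = intercept + slope * month by ordinary least squares."""
    xs = [row["month"] for row in rows]
    ys = [row["sales"] for row in rows]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * month


def answer(rows, month):
    """Use a lookup when the value is in the table; otherwise fall back to prediction."""
    observed = lookup(rows, month)
    if observed is not None:
        return ("lookup", observed)
    return ("prediction", predict_linear(rows, month))
```

Under these assumptions, a question such as "What were sales in month 3?" is a plain lookup, whereas "What will sales be in month 8?" has no matching row, so a system that only retrieves would return nothing and the answer has to be inferred from the historical pattern.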

Eval Frameworks & Benchmarks, Natural Language Processing, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
