The authors introduce SciTaRC, a new QA benchmark over scientific tables that demands both deep language reasoning and complex computation. They find that even highly capable open-weight models like Llama-3.3-70B-Instruct struggle, failing on 65.5% of the questions. Error analysis reveals a key "execution bottleneck": models struggle to faithfully execute correct plans, whether reasoning in code or in natural language.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers that require both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code-based and natural-language approaches struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural-language reasoning fails primarily due to initial comprehension errors and calculation mistakes.