Mar 2, 2026 · arXiv:2603.02119

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

AI Summary

The paper introduces Pencil Puzzle Bench, a benchmark of 300 puzzles across 20 varieties derived from a larger dataset of 62,231 pencil puzzles, designed to evaluate LLM reasoning capabilities through step-level verifiable constraint satisfaction. The benchmark enables precise error localization by checking intermediate board states against variety-specific constraints, facilitating dense reward signals for process supervision. Evaluation of 51 models reveals significant performance gains from increased reasoning effort (GPT-5.2 improving 81x) and agentic iteration (Claude Opus 4.6 rising from 0.3% to 30.0%), highlighting the importance of these factors in solving complex reasoning tasks.

Key Contribution

Increasing reasoning effort alone improves GPT-5.2's pencil puzzle performance 81x, and iterative checking in agentic mode raises Claude Opus 4.6 from 0.3% to 30.0% (a gain of 29.7 percentage points), showing that reasoning effort and agentic iteration are two distinct routes to stronger reasoning performance.
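The headline gains mix relative multipliers with absolute percentages, so it can help to compute both for the agentic results. The following is a minimal sketch that uses only the before/after figures quoted on this page; the helper name `gain` is hypothetical and not part of the benchmark.

```python
def gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier)."""
    return after_pct - before_pct, after_pct / before_pct


# Figures quoted in the summary and abstract (direct ask -> agentic).
results = {
    "Claude Opus 4.6": (0.3, 30.0),
    "GPT-5.2@xhigh": (20.2, 56.0),
}

for model, (before, after) in results.items():
    points, ratio = gain(before, after)
    print(f"{model}: +{points:.1f} points, {ratio:.1f}x")
# Claude Opus 4.6: +29.7 points, 100.0x
# GPT-5.2@xhigh: +35.8 points, 2.8x
```

Read this way, agentic iteration adds more absolute points for GPT-5.2@xhigh, while the relative jump is far larger for Claude Opus 4.6 because its direct-ask baseline is close to zero.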

Abstract

We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours, making the benchmark a demanding test of long-context utilization, not just reasoning.
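To make the idea of step-level verification concrete, here is a minimal sketch on a toy puzzle. It is not the benchmark's code: the puzzle (a Latin-square-style grid with a "no repeated value in any row or column" rule), the names `violated_rule` and `verify_steps`, and the +1/-1 reward values are all assumptions chosen for illustration. The sketch replays a sequence of moves, checks every intermediate board, and reports the first move that breaks a rule together with the rule it broke.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]      # 0 marks an empty cell
Move = Tuple[int, int, int]  # (row, col, value)


def violated_rule(board: Board) -> Optional[str]:
    """Return a description of the first broken rule, or None if the board is valid."""
    n = len(board)
    for i in range(n):
        row = [v for v in board[i] if v != 0]
        if len(row) != len(set(row)):
            return f"row {i}: duplicate value"
        col = [board[r][i] for r in range(n) if board[r][i] != 0]
        if len(col) != len(set(col)):
            return f"column {i}: duplicate value"
    return None


def verify_steps(
    board: Board, moves: List[Move]
) -> Tuple[Optional[int], Optional[str], List[float]]:
    """Replay moves on a copy of the board, checking the state after each one.

    Returns (index of the first bad move, rule it violated, per-move rewards).
    The reward scheme (+1.0 valid, -1.0 invalid) is a hypothetical choice.
    """
    board = [row[:] for row in board]  # leave the caller's board untouched
    rewards: List[float] = []
    for step, (r, c, v) in enumerate(moves):
        board[r][c] = v
        rule = violated_rule(board)
        if rule is not None:
            rewards.append(-1.0)
            return step, rule, rewards
        rewards.append(1.0)
    return None, None, rewards


start = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
moves = [(0, 1, 2), (1, 0, 2), (1, 1, 2)]
print(verify_steps(start, moves))
# (2, 'row 1: duplicate value', [1.0, 1.0, -1.0])
```

Because each intermediate state is checked, the error is localized to move 2 and to a specific rule rather than being discovered only when the final answer fails. In an agentic setting, feedback of this kind is what the model would receive between turns.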

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
