This paper introduces a benchmark based on GF(2) circuit reconstruction to measure the step-success probability (γ) of LLMs on out-of-distribution logical inference, the quantity the Diligent Learner framework treats as the condition for achieving superintelligence via test-time search. The tasks grow harder with each reasoning step and are information-theoretically impossible to solve reliably unless the model integrates all of the information provided. Results show that γ declines superlinearly with task depth for smaller LLMs, while frontier models exhibit partial robustness, and that successful reasoning at scale depends on precise tool calls.
Tool design, not just model size, is a critical factor in whether LLMs can achieve "superintelligence" via the Diligent Learner framework.
The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $γ$. In this work, we design a benchmark to measure $γ$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $γ$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
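As a rough illustration of why a GF(2) task can force a solver to use every piece of information, the sketch below evaluates a small straight-line circuit over GF(2), where addition is XOR and multiplication is AND. The gate encoding, the `PARITY4` circuit, and the helper names are hypothetical and chosen only for this example; they are not the benchmark's actual construction, which the abstract does not specify.

```python
from itertools import product


def eval_circuit(gates, inputs):
    """Evaluate a straight-line circuit over GF(2).

    Each gate is (op, a, b), where a and b index earlier wires
    (inputs first, then gate outputs in order). "add" is XOR and
    "mul" is AND, the two field operations of GF(2).
    Assumed encoding, for illustration only.
    """
    wires = list(inputs)
    for op, a, b in gates:
        if op == "add":
            wires.append(wires[a] ^ wires[b])
        elif op == "mul":
            wires.append(wires[a] & wires[b])
        else:
            raise ValueError(f"unknown gate: {op}")
    return wires[-1]


# Hypothetical toy circuit: x0 + x1 + x2 + x3 over GF(2) (a parity chain).
PARITY4 = [("add", 0, 1), ("add", 4, 2), ("add", 5, 3)]


def depends_on_every_input(gates, n):
    """Return True if flipping any single input always flips the output."""
    for bits in product((0, 1), repeat=n):
        base = eval_circuit(gates, bits)
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if eval_circuit(gates, flipped) == base:
                return False
    return True


if __name__ == "__main__":
    print(eval_circuit(PARITY4, (1, 0, 1, 1)))
    print(depends_on_every_input(PARITY4, 4))
```

For the parity chain, every input bit changes the output, so a solver that ignores even one bit cannot do better than guessing on that output. A single `mul` gate does not have this property, which is one reason a task designer might lean on additive structure when the goal is to penalize partial use of the evidence.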