The paper investigates whether large language models (LLMs) can reason about computational processes, using causal discovery as a testbed. Evaluating eight frontier LLMs against ground truth from large-scale algorithm executions, the study finds systematic failure: the models' predicted ranges are far wider than the true confidence intervals, yet in most instances they still miss the true algorithmic mean. Most models perform worse than random guessing, and the best model's marginal above-random performance is attributed to benchmark memorization rather than genuine algorithmic reasoning.
LLMs exhibit "algorithmic blindness," failing to accurately predict algorithmic outcomes and performing worse than random guessing in causal discovery tasks, despite possessing broad declarative knowledge.
Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed. We evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances. Most perform worse than random guessing, and the marginal above-random performance of the best model is most consistent with benchmark memorization, not principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
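The two quantities the abstract contrasts, how often a predicted range contains the true mean and how wide that range is relative to the true confidence interval, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Instance` record, the function names, and the toy numbers are all hypothetical, chosen only to show how a range can be both too wide and wrong.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """One hypothetical evaluation instance (field names are illustrative)."""
    true_mean: float                 # mean metric from repeated algorithm runs
    true_ci: Tuple[float, float]     # confidence interval around that mean
    predicted: Tuple[float, float]   # range stated by the model


def coverage(instances: List[Instance]) -> float:
    """Fraction of instances whose predicted range contains the true mean."""
    hits = sum(
        1 for i in instances
        if i.predicted[0] <= i.true_mean <= i.predicted[1]
    )
    return hits / len(instances)


def mean_width_ratio(instances: List[Instance]) -> float:
    """Average of (predicted range width) / (true CI width)."""
    ratios = [
        (i.predicted[1] - i.predicted[0]) / (i.true_ci[1] - i.true_ci[0])
        for i in instances
    ]
    return sum(ratios) / len(ratios)


# Toy data, invented for illustration only (not results from the paper).
instances = [
    Instance(0.40, (0.38, 0.42), (0.50, 0.90)),  # wide range, misses the mean
    Instance(0.70, (0.65, 0.75), (0.10, 0.60)),  # wide range, misses the mean
    Instance(0.25, (0.20, 0.30), (0.00, 0.50)),  # wide range, contains the mean
    Instance(0.55, (0.50, 0.60), (0.60, 1.00)),  # wide range, misses the mean
]

print(f"coverage:         {coverage(instances):.2f}")
print(f"mean width ratio: {mean_width_ratio(instances):.1f}")
```

In this toy set the predicted ranges are several times wider than the true intervals, yet only one in four contains the true mean, which is the pattern the abstract describes as overly wide but still miscalibrated.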