Search papers, labs, and topics across Lattice.
EsoLang-Bench is introduced as a benchmark to evaluate genuine reasoning in LLMs, using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) to minimize memorization effects. Evaluating five frontier models reveals a significant performance drop compared to standard benchmarks, with accuracies plummeting from 85-95% to 0-11% on esoteric tasks. The benchmark's design facilitates mimicking human learning through documentation and interpreter feedback, providing a more robust measure of transferable reasoning skills.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.