This paper reproduces and extends Recursive Language Models (RLMs), a framework enabling LLMs to handle near-infinite contexts via external REPL environments. The study investigates the impact of varying recursion depth in RLMs using DeepSeek v3.2 and Kimi K2 on S-NIAH and OOLONG benchmarks. Results demonstrate that while depth-1 RLMs improve accuracy on complex reasoning, depth-2 RLMs degrade performance on both complex and simple tasks while significantly increasing execution time and token costs, suggesting an "overthinking" phenomenon.
Deeper recursion in Recursive Language Models can backfire, causing LLMs to "overthink" and paradoxically degrade performance on both complex reasoning and simple retrieval tasks.
This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). The framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated a pure LLM, an RLM (depth=1), and an RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and sharply inflates execution time (e.g., from 3.6s to 344.5s, roughly a 96-fold increase) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
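To make the notion of recursion depth concrete, the following is a minimal toy sketch, not the implementation from the paper or the linked repository. It assumes a fixed fan-out at every level, whereas in an actual RLM the model itself decides inside the REPL how to partition the context. The names `rlm`, `toy_llm`, `fanout`, and `stats` are hypothetical, and `toy_llm` is a deterministic stand-in for a real model call.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a model call: (query, text) -> answer.
LLM = Callable[[str, str], str]


def toy_llm(query: str, text: str) -> str:
    """Toy 'model': return the first line that mentions the query, else ''."""
    for line in text.splitlines():
        if query in line:
            return line
    return ""


def rlm(query: str, context: str, llm: LLM, depth: int,
        fanout: int = 4, stats: Optional[Dict[str, int]] = None) -> str:
    """Depth-bounded recursive call over a context held outside the prompt.

    depth=0 is a plain model call. At depth>0 the context is split into
    at most `fanout` chunks, each chunk is handled by a sub-call with
    depth-1, and one final call aggregates the partial answers.
    """
    if stats is not None:
        stats["calls"] += 1  # count every model invocation

    lines = context.splitlines()
    if depth == 0 or len(lines) <= 1:
        return llm(query, context)

    step = -(-len(lines) // fanout)  # ceiling division
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    partials = [rlm(query, c, llm, depth - 1, fanout, stats) for c in chunks]
    # Aggregate the non-empty partial answers with one more model call.
    return llm(query, "\n".join(p for p in partials if p))


if __name__ == "__main__":
    ctx = "\n".join(
        "the needle is 42" if i == 37 else f"filler {i}" for i in range(64)
    )
    for d in (0, 1, 2):
        s = {"calls": 0}
        answer = rlm("needle", ctx, toy_llm, d, stats=s)
        print(d, s["calls"], answer)
```

In this toy setup the number of model calls grows as 1 + f + f^2 + ... with fan-out f, which gives one intuition for why deeper recursion raises time and token cost. The call counts here are a property of the sketch only and say nothing about the measurements reported in the study.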