The paper introduces CodeGlance, a benchmark designed to evaluate LLMs' code reasoning abilities across three realistic scenarios: intrinsic logic, API interaction, and unseen function reasoning. Through evaluation of 7 LLMs, the study reveals that reasoning about unseen functions is particularly challenging, especially for smaller models. The authors identify code complexity features like execution trace length and API invocation count as key factors impacting reasoning difficulty, and analyze the effectiveness of augmentation strategies like CoT and code search.
LLMs struggle to reason about unseen functions in code, smaller models most of all: among the 7 state-of-the-art models evaluated, Qwen2.5-3b reaches only 6.0% accuracy on unseen functions compared to 37.5% on familiar APIs, highlighting a critical gap in code understanding capabilities.
In modern software development, developers frequently need to understand code behavior at a glance, whether reviewing pull requests, debugging issues, or navigating unfamiliar codebases. This ability to reason about dynamic program behavior is fundamental to effective software engineering and increasingly supported by Large Language Models (LLMs). However, existing studies on code reasoning focus primarily on isolated code snippets, overlooking the complexity of real-world scenarios involving external API interactions and unfamiliar functions. This gap hinders our understanding of what truly makes code reasoning challenging for LLMs across diverse programming contexts. We present CodeGlance, a multi-dimensional benchmark investigating code reasoning challenges across three realistic scenarios: intrinsic logic reasoning, API interaction reasoning, and unseen function reasoning. Through systematic evaluation of 7 state-of-the-art LLMs, we reveal that unseen function reasoning poses significant challenges especially for smaller models, with Qwen2.5-3b achieving only 6.0% accuracy on unseen functions compared to 37.5% on familiar APIs. We identify critical code complexity features, including execution trace length, API invocation count, and control flow complexity, that significantly impact code reasoning difficulty across scenarios. We further investigate how common augmentation strategies, including CoT, document retrieval, and code search, can improve reasoning performance, finding that their effectiveness varies substantially depending on whether challenges stem from logical complexity or knowledge gaps. These findings provide actionable guidance for developing more capable code reasoning systems and deploying LLM-based programming assistants in real-world software development.
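To make two of the complexity features named above concrete, the following is a minimal sketch of how execution trace length and API invocation count could be approximated for a Python snippet. It is not the authors' measurement procedure, which the abstract does not describe. The helper names `trace_length`, `call_count`, and the sample function `sum_to` are hypothetical. The sketch assumes that trace length means the number of line events executed and that invocation count means the number of call expressions in the source.

```python
import ast
import sys


def trace_length(func, *args):
    """Count the line events executed while running func(*args).

    Assumption: "execution trace length" is approximated by the number
    of executed source lines, including lines in any Python functions
    that func calls.
    """
    count = 0

    def tracer(frame, event, arg):
        nonlocal count
        if event == "line":
            count += 1
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return count


def call_count(source):
    """Count call expressions in source code.

    Assumption: a static count of ast.Call nodes stands in for
    "API invocation count"; it does not distinguish library calls
    from calls to local functions.
    """
    tree = ast.parse(source)
    return sum(isinstance(node, ast.Call) for node in ast.walk(tree))


def sum_to(n):
    """Hypothetical sample function whose trace grows with n."""
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    print(trace_length(sum_to, 1), trace_length(sum_to, 10))
    print(call_count("print(len('abc'))"))
```

Under these assumptions, a longer loop yields a longer trace for the same source text, which illustrates why a dynamic feature such as trace length can vary independently of a static one such as the number of calls.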