Feb 19, 2026 | arXiv:2602.17183

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

AI Summary

This paper investigates the robustness of LLMs in long-context code question answering across Python, COBOL, and Java, using ablations to test sensitivity to answer format, distractors, and context scale. The study extends the LongCodeBench dataset and evaluates models under shuffled multiple-choice options, open-ended questions, and needle-in-a-haystack contexts. The results demonstrate significant performance degradation under these conditions, revealing limitations in current long-context evaluations and the brittleness of LLMs to irrelevant information.

Key Contribution

LLMs struggle with long-context code QA, losing significant accuracy when answer formats are changed or irrelevant information is added, revealing a brittleness masked by standard benchmarks.

Abstract

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops under both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
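To make settings (i) and (iii) concrete, the following is a minimal sketch of how such ablations could be constructed. It is not the authors' implementation: the `MCQ` record, `shuffle_options`, and `build_haystack` are hypothetical names introduced here for illustration, and the abstract does not specify how options are permuted or how distractor snippets are chosen and placed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """Hypothetical record for one multiple-choice code question."""
    question: str
    options: tuple[str, ...]  # answer texts, in presentation order
    answer_index: int         # index of the correct option in `options`


def shuffle_options(item: MCQ, seed: int) -> MCQ:
    """Return a copy of `item` with options permuted and the answer key remapped."""
    rng = random.Random(seed)  # local RNG so each permutation is reproducible
    order = list(range(len(item.options)))
    rng.shuffle(order)
    return MCQ(
        question=item.question,
        options=tuple(item.options[i] for i in order),
        # The correct option moved to wherever its old index landed in `order`.
        answer_index=order.index(item.answer_index),
    )


def build_haystack(needle: str, distractors: list[str], position: int) -> str:
    """Place the relevant snippet among irrelevant ones at the given position."""
    chunks = list(distractors)  # copy so the caller's list is not mutated
    chunks.insert(position, needle)
    return "\n\n".join(chunks)
```

Under these assumptions, a model whose accuracy changes between the original and shuffled versions of the same item is reacting to option position rather than content, which is the kind of sensitivity the shuffled setting is meant to expose.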

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
