Feb 25, 2026 | arXiv:2602.21611

Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue

AI Summary

This paper addresses a granularity mismatch in memory mechanisms for LLM-based software engineering agents: instance-level memory retrieves misleading experience when tasks share similar surface descriptions but require distinct reasoning logic at specific stages. The authors introduce Structurally Aligned Subtask-Level Memory, which aligns memory storage, retrieval, and updating with the agent's functional decomposition into subtasks. Experiments on SWE-bench Verified show consistent improvements over vanilla agents and instance-level memory baselines, with mean Pass@1 rising by +4.7 pp on average over the vanilla agent and gains growing as the number of interaction steps increases.

Key Contribution

LLM-based software engineering agents that store memory per instance retrieve misleading experience during long-horizon reasoning. Aligning memory with subtasks instead raises mean Pass@1 on SWE-bench Verified by 4.7 pp on average over the vanilla agent.

Abstract

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.
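The contrast between instance-level and subtask-level retrieval can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the subtask labels ("localize", "verify"), the `SubtaskMemory` class, the `flat_retrieve` baseline, and the token-overlap similarity standing in for a real embedding model. It only shows the general idea of scoping storage, retrieval, and updating to the current subtask.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (toy stand-in for an embedding model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class SubtaskMemory:
    # Hypothetical layout: one list of (context, lesson) pairs per subtask type.
    buckets: dict = field(default_factory=dict)

    def store(self, subtask: str, context: str, lesson: str) -> None:
        self.buckets.setdefault(subtask, []).append((context, lesson))

    def retrieve(self, subtask: str, context: str, k: int = 1) -> list:
        # Only entries recorded under the same subtask type are candidates.
        candidates = self.buckets.get(subtask, [])
        ranked = sorted(candidates, key=lambda e: jaccard(e[0], context), reverse=True)
        return [lesson for _, lesson in ranked[:k]]

    def update(self, subtask: str, context: str, lesson: str) -> None:
        # Overwrite the lesson for an identical context, else store a new entry.
        entries = self.buckets.setdefault(subtask, [])
        for i, (ctx, _) in enumerate(entries):
            if ctx == context:
                entries[i] = (context, lesson)
                return
        entries.append((context, lesson))


def flat_retrieve(memory: SubtaskMemory, context: str, k: int = 1) -> list:
    """Instance-level style baseline: rank every entry, ignoring subtask type."""
    everything = [e for entries in memory.buckets.values() for e in entries]
    ranked = sorted(everything, key=lambda e: jaccard(e[0], context), reverse=True)
    return [lesson for _, lesson in ranked[:k]]


mem = SubtaskMemory()
mem.store("localize", "django queryset filter returns wrong rows",
          "grep the ORM lookup code first")
mem.store("verify", "run test suite after patch",
          "run the failing test module before the full suite")

# The agent is in its (hypothetical) "verify" subtask, but the query text
# still resembles the original issue description.
query = "django queryset filter returns wrong rows after patch"
flat = flat_retrieve(mem, query)          # picks the surface-similar entry
scoped = mem.retrieve("verify", query)    # restricted to the current subtask
```

In this toy setup the flat baseline returns the localization lesson because its context shares more tokens with the query, while the scoped lookup returns the verification lesson. How the paper actually defines subtasks, similarity, and updates is not specified in the abstract.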

Code Generation & Program Synthesis, Reasoning & Chain-of-Thought, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 48

Year: 2026

Venue: N/A
