The paper introduces R2ABench, a new benchmark dataset for requirement-to-architecture generation consisting of real-world software projects, PRDs, and PlantUML diagrams. The authors propose a hybrid evaluation framework that combines structural graph metrics, multi-dimensional scoring, and anti-pattern detection to assess generated architectures. Experiments using this framework reveal that LLMs exhibit strong syntactic validity and entity extraction but struggle with relational reasoning, which results in fragmented architectures, and that agentic workflows introduce instability.
LLMs can generate syntactically valid software architectures from requirements, but their struggle with relational reasoning leads to structurally unsound designs.
Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirement documents is a crucial step in software development. However, there is currently a lack of functional datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating state-of-the-art models and agentic workflows. Our study shows that LLMs achieve strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.
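To make the structural-graph layer concrete, the following is a minimal sketch of how a generated PlantUML diagram might be compared against a reference as a graph. It is an illustration only, not the R2ABench implementation: the regular expressions cover a small, simplified subset of PlantUML, the helper names `parse_plantuml` and `prf` are hypothetical, relations are treated as undirected pairs, and the two toy diagrams are invented for the example rather than taken from the benchmark.

```python
import re

# Assumption: a simplified subset of PlantUML, one declaration or relation per line.
ENTITY = re.compile(r"^\s*(?:class|component|interface|database|actor)\s+(\w+)")
RELATION = re.compile(r"^\s*(\w+)\s+[<|*o]*[-.]+[>|*o]*\s+(\w+)")


def parse_plantuml(text):
    """Return (entities, relations) found in a simplified PlantUML snippet."""
    entities, relations = set(), set()
    for line in text.splitlines():
        match = ENTITY.match(line)
        if match:
            entities.add(match.group(1))
            continue
        match = RELATION.match(line)
        if match:
            src, dst = match.group(1), match.group(2)
            entities.update((src, dst))
            # Direction is ignored in this sketch: store an unordered pair.
            relations.add(frozenset((src, dst)))
    return entities, relations


def prf(predicted, reference):
    """Precision, recall, and F1 of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy diagrams, invented for illustration.
REFERENCE = """
component Web
component Api
database Db
Web --> Api
Api --> Db
"""

GENERATED = """
component Web
component Api
database Db
Web --> Api
"""

ref_entities, ref_relations = parse_plantuml(REFERENCE)
gen_entities, gen_relations = parse_plantuml(GENERATED)

print("entities ", prf(gen_entities, ref_entities))
print("relations", prf(gen_relations, ref_relations))
```

In this toy case the generated diagram recovers every entity but misses one of the two relations, so entity recall is 1.0 while relation recall is 0.5. That is the kind of gap between entity extraction and relational reasoning the abstract describes, although the numbers here carry no empirical meaning.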