Tsinghua AI | Apr 8, 2026 | arXiv:2604.06683

Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation

AI Summary

The paper introduces R2ABench, a benchmark dataset for requirement-to-architecture generation that pairs real-world software projects with PRDs and PlantUML reference diagrams. The authors propose a hybrid evaluation framework that combines structural graph metrics, multi-dimensional scoring, and anti-pattern detection to assess generated architectures. Experiments with this framework show that LLMs achieve strong syntactic validity and entity extraction but struggle with relational reasoning, which yields fragmented architectures, and that agentic workflows introduce instability.

Key Contribution

LLMs can generate syntactically valid software architectures from requirements, but their struggle with relational reasoning leads to structurally unsound designs.

Abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirement documents is a crucial step in software development. However, there is currently a lack of functional datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating state-of-the-art models and agentic workflows. Our study shows that LLMs exhibit strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.
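To make the "Structural Graph Metrics" layer concrete, the following is a minimal sketch of how a generated PlantUML diagram might be compared against a reference as a graph. It is not the paper's implementation: the regular expressions cover only a small, assumed subset of PlantUML (bracketed or bare component names and simple arrows), and the function names `parse_plantuml`, `f1`, and `count_components` are hypothetical. The example diagrams are invented for illustration.

```python
import re

# Assumed, simplified patterns (not the full PlantUML grammar):
#   edge:  [A] --> [B]   A -> B   A ..> B : label
#   node:  component A   class A   [A]
EDGE_RE = re.compile(r'^\s*\[?(\w+)\]?\s*[-.]+>\s*\[?(\w+)\]?\s*(?::.*)?$')
NODE_RE = re.compile(
    r'^\s*(?:component|class|database|interface)\s+(\w+)\s*$|^\s*\[(\w+)\]\s*$'
)

def parse_plantuml(text):
    """Return (nodes, edges) extracted from a simplified PlantUML diagram."""
    nodes, edges = set(), set()
    for line in text.splitlines():
        m = EDGE_RE.match(line)
        if m:
            src, dst = m.groups()
            nodes.update((src, dst))
            edges.add((src, dst))
            continue
        m = NODE_RE.match(line)
        if m:
            nodes.add(m.group(1) or m.group(2))
    return nodes, edges

def f1(predicted, reference):
    """F1 overlap between two sets; 0.0 if either set is empty."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def count_components(nodes, edges):
    """Count weakly connected components (edge direction ignored)."""
    adjacency = {n: set() for n in nodes}
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return components

# Invented example: the generated diagram names every entity
# but recovers only one of the three reference relations.
REFERENCE = """@startuml
[Web] --> [API]
[API] --> [DB]
[API] --> [Cache]
@enduml"""

GENERATED = """@startuml
[Web] --> [API]
component DB
component Cache
@enduml"""

ref_nodes, ref_edges = parse_plantuml(REFERENCE)
gen_nodes, gen_edges = parse_plantuml(GENERATED)

entity_f1 = f1(gen_nodes, ref_nodes)
relation_f1 = f1(gen_edges, ref_edges)
fragments = count_components(gen_nodes, gen_edges)
```

In this invented case the entity score is perfect while the relation score is low and the generated graph splits into several disconnected pieces, which is the kind of gap between entity extraction and relational reasoning that the abstract describes. Exact string matching on names is a deliberate simplification; a real evaluator would likely need some form of name normalization or semantic matching.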

Architecture Design (Transformers, SSMs, MoE) | Code Generation & Program Synthesis | Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
