PKU | Apr 16, 2026 | arXiv:2604.14709

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

Fan Cui, Hongyuan Hou, Zizhang Luo, Chenyun Yin, Yun Liang

AI Summary

HWE-Bench, a new benchmark, evaluates LLM agents on real-world hardware bug repair using 417 tasks from six open-source projects (Verilog/SystemVerilog and Chisel). The benchmark uses containerized environments and each project's native simulation and regression flows for validation. Experiments with seven LLMs and four agent frameworks show that the best agent resolves 70.7% of tasks. Performance varies significantly with project complexity and bug type, and the failures point to challenges in fault localization, hardware-semantic reasoning, and cross-artifact coordination.

Key Contribution

LLM agents repairing hardware bugs struggle with complex SoC-level projects: the best agent resolves more than 90% of tasks on smaller cores but fewer than 65% on SoCs.

Abstract

Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
