CI-Repair-Bench is a new benchmark for evaluating automated program repair in Continuous Integration (CI) workflows, built from real GitHub Actions executions. It comprises 567 CI failure instances across 103 repositories, categorized into 12 CI error types, and judges repair correctness through full CI re-execution. Results show that automated repair struggles with environment, dependency, and configuration-related failures: the best LLM achieves only an 18.9% repair success rate, which highlights the need for CI-native repair techniques.
Automated program repair still struggles in real-world CI environments: even the best-performing LLM repairs fewer than 20% of failures.
Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full CI re-execution under original workflows. Failures are categorized into 12 CI error types, enabling fine-grained, error-type-aware evaluation. To demonstrate benchmark usage, we include a reference CI repair workflow that analyzes CI logs to localize faults and generate candidate patches. Empirical results show that automated repair is most effective for localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration-related failures remain challenging; the best-performing LLM achieves an 18.9% repair success rate. CI-Repair-Bench provides a realistic evaluation foundation for advancing research on CI-native automated program repair.
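The error-type-aware evaluation described above can be illustrated with a minimal sketch. It assumes each repair attempt is recorded with an error-type label and a flag saying whether the full CI re-execution passed. The `RepairOutcome` record, the `success_rates` function, the instance identifiers, and the error-type labels are all hypothetical names chosen for illustration; they are not taken from the benchmark's actual tooling, and the toy data is not real benchmark output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    """One repair attempt on one CI failure instance (hypothetical schema)."""
    instance_id: str   # made-up identifier for the failure instance
    error_type: str    # CI error category; the labels used below are illustrative
    ci_passed: bool    # True only if the full CI workflow passed after the patch


def success_rates(outcomes):
    """Return (overall_rate, {error_type: rate}) for a list of outcomes."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.error_type] += 1
        passed[outcome.error_type] += int(outcome.ci_passed)

    per_type = {etype: passed[etype] / totals[etype] for etype in totals}
    overall = sum(passed.values()) / len(outcomes) if outcomes else 0.0
    return overall, per_type


# Toy data for illustration only; not drawn from CI-Repair-Bench.
toy = [
    RepairOutcome("inst-1", "formatting", True),
    RepairOutcome("inst-2", "formatting", False),
    RepairOutcome("inst-3", "dependency", False),
    RepairOutcome("inst-4", "dependency", False),
]

overall, per_type = success_rates(toy)
print(f"overall: {overall:.1%}")
for etype, rate in sorted(per_type.items()):
    print(f"{etype}: {rate:.1%}")
```

Reporting the per-type breakdown alongside the overall rate matters because a single aggregate number can hide the gap the abstract describes between localized, tool-enforced failures and environment, dependency, or configuration failures.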