CUHK · SMU · Apr 19, 2026 · arXiv:2604.17529

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

Renyi Zhong, Yulun Wu, Jinxi Kuang, Yintong Huo

AI Summary

This study introduces MultiLogBench, a comprehensive multilingual benchmark that evaluates automated logging across six programming languages, addressing the limitations of existing Java-centric datasets. By analyzing 63,965 production-code instances and 744 revision-history cases, the authors reveal significant cross-language variations in logging-site localization and framework-anchor matching, highlighting the challenges posed by complex structural contexts. The findings emphasize that robust conclusions about automated logging systems necessitate a multilingual approach and maintenance-oriented validation to ensure generalizability across diverse programming environments.

Key Contribution

Automated logging systems may perform unpredictably across programming languages, with framework-anchor matching proving particularly sensitive to language differences.

Abstract

Logging statements are central to debugging, failure diagnosis, and production observability, yet writing them requires developers to decide where to place a logging statement, which API and severity level to use, and what runtime information to expose. Automated logging aims to reduce this burden, but existing evidence remains dominated by Java-centric repository-snapshot datasets. It is therefore unclear whether conclusions about model behavior and model selection generalize across programming-language ecosystems or realistic code evolution. This paper presents MultiLogBench, a multilingual benchmark and empirical study spanning six programming language ecosystems. MultiLogBench contains 63,965 production-code repository-snapshot instances, 744 revision-history cases where developers introduce logging statements during maintenance, and a paired transformed revision-history branch for robustness analysis. Using seven contemporary large language models under a unified protocol, we evaluate logging-site localization, framework-anchor matching, severity prediction, message generation, variable recovery, and cascaded overall quality. Results show clear cross-language variation: framework-anchor matching is the most language-sensitive component, loop and nested-callable sites are the hardest structural contexts, and model rankings are stable only at the top tier. These patterns persist at a coarse level on revision-history data, while transformed inputs do not cause a broad same-direction performance collapse. Overall, MultiLogBench shows that robust claims about automated logging require multilingual evaluation and maintenance-oriented validation.

Code Generation & Program Synthesis · Data Curation & Synthetic Data · Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
