Corresponding authors · SKKU · Apr 12, 2026 · arXiv:2604.10520

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

Suyoung Bae, CheolWon Na, Jaehoon Lee, Yumin Lee, YunSeok Choi

AI Summary

The paper introduces ReFEree, a reference-free method for evaluating factual consistency in real-world code summaries by defining code-specific inconsistency criteria and leveraging dependency information. ReFEree operates at the segment level, enabling fine-grained evaluation of multi-sentence functionalities and dependency contexts, which are often missed by existing methods. Experiments on a new human-annotated benchmark demonstrate that ReFEree achieves a 15-18% improvement over the previous state-of-the-art in correlation with human judgment.

Key Contribution

Existing factual consistency metrics fall short on real-world code, but ReFEree closes the gap with a reference-free, fine-grained approach that better aligns with human judgment.

Abstract

As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess the dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate each summary at the segment level against these criteria, together with dependency information. These segment-level results are then aggregated into a fine-grained score. We also construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving by 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
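The segment-level scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not state how segments are formed or how the aggregation is computed, so everything below is an assumption made for illustration: the sentence-based splitter, the `SegmentVerdict` type, and the choice of a simple mean over consistent segments are hypothetical and are not taken from the ReFEree code base.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentVerdict:
    """Hypothetical container for one segment and its consistency judgment."""
    segment: str
    consistent: bool


def split_into_segments(summary: str) -> List[str]:
    # Assumption: one segment per sentence, split after ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def aggregate(verdicts: List[SegmentVerdict]) -> float:
    # Assumption: the summary score is the fraction of consistent segments.
    if not verdicts:
        raise ValueError("cannot score a summary with no segments")
    return sum(v.consistent for v in verdicts) / len(verdicts)


summary = "Parses the config file. Retries three times on failure. Logs to stdout."
segments = split_into_segments(summary)

# Placeholder judgments; in practice each would come from checking the segment
# against the code, its dependency context, and the inconsistency criteria.
judgments = [True, False, True]
verdicts = [SegmentVerdict(s, j) for s, j in zip(segments, judgments)]
score = aggregate(verdicts)
```

Under these assumptions, a single inconsistent segment lowers the score proportionally instead of failing the whole summary, which is one way a segment-level score can be finer-grained than a single summary-level label.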

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
