The paper introduces ReFEree, a reference-free method for evaluating factual consistency in real-world code summaries by defining code-specific inconsistency criteria and leveraging dependency information. ReFEree operates at the segment level, enabling fine-grained evaluation of multi-sentence functionalities and dependency contexts, which are often missed by existing methods. Experiments on a new human-annotated benchmark demonstrate that ReFEree achieves a 15-18% improvement over the previous state-of-the-art in correlation with human judgment.
Existing factual consistency metrics fall short on real-world code, but ReFEree closes the gap with a reference-free, fine-grained approach that better aligns with human judgment.
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess the dependency context commonly found in real-world code summaries. To address these limitations, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate each summary at the segment level against these criteria, using dependency information as additional context. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving correlation by 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
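The segment-level pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the segmentation rule, the inconsistency criteria, or the aggregation function, so the sentence-based splitting, the toy judge, and the "fraction of consistent segments" score below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import re
from typing import Callable, List

# A judge takes (segment, code, dependencies) and returns True when the
# segment is factually consistent. In practice this would likely be an
# LLM call; here it is just a pluggable callable (an assumption).
Judge = Callable[[str, str, List[str]], bool]


def split_into_segments(summary: str) -> List[str]:
    """Hypothetical segmentation: split after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def score_summary(summary: str, code: str, dependencies: List[str],
                  judge: Judge) -> float:
    """Aggregate per-segment verdicts into one score in [0, 1].

    Assumed aggregation: the fraction of segments judged consistent.
    """
    segments = split_into_segments(summary)
    if not segments:
        return 0.0
    verdicts = [judge(seg, code, dependencies) for seg in segments]
    return sum(verdicts) / len(verdicts)


def toy_judge(segment: str, code: str, dependencies: List[str]) -> bool:
    """Toy stand-in for a real judge: every function written as `name()`
    in the segment must appear in the code or in its dependencies."""
    context = code + "\n" + "\n".join(dependencies)
    mentioned = re.findall(r"(\w+)\(\)", segment)
    return all(name in context for name in mentioned)


code = "def total(xs):\n    return sum(clean(xs))"
dependencies = ["def clean(xs):\n    return [x for x in xs if x is not None]"]
summary = "Calls clean() to drop missing values. Then calls sort() on the result."

score = score_summary(summary, code, dependencies, toy_judge)
print(score)  # 0.5: the second segment mentions sort(), which is absent
```

Passing the dependency source alongside the code is what lets the first segment be judged consistent here: `clean` is defined only in the dependency, so a judge that sees the isolated snippet alone could not verify its described behavior.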