DAMO, CAS, CUHK, Fudan, Pengcheng Laboratory | Apr 13, 2026 | arXiv:2604.11246

Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Xiang Ao, Wao Lam, Yue Yu

AI Summary

This paper introduces WIMPE, a framework for evaluating long-form generative task outputs by decomposing reference answers into weighted, context-bound scoring points, so that scoring better reflects the nuances of human evaluation. WIMPE uses two metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), to measure alignment and contradiction between model responses and references. Experiments across 10 generative tasks show that WIMPE achieves higher correlations with human annotations than existing methods.

Key Contribution

WIMPE evaluates long-form generative outputs more like a human examiner does, by breaking each reference answer into weighted, context-bound scoring points.

Abstract

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in the provided contexts and 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted, context-bound scoring points. Two complementary metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.
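
To make the idea of weighted scoring points concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not give formulas for WPA or PCP, so the normalized weighted sums below are an assumption, and the names `ScoringPoint`, `Verdict`, `weighted_alignment`, and `conflict_penalty` are hypothetical. The per-point verdicts are assumed to come from some external judge (human or model) and are simply passed in.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Hypothetical per-point judgment of a response against one scoring point."""
    ALIGNED = "aligned"    # response covers the point
    CONFLICT = "conflict"  # response contradicts the point
    MISSING = "missing"    # response does not address the point


@dataclass(frozen=True)
class ScoringPoint:
    """One weighted scoring point factorized from a reference answer (assumed shape)."""
    text: str
    weight: float


def _weighted_share(points, verdicts, target):
    """Share of total weight carried by points whose verdict equals `target`."""
    if len(points) != len(verdicts):
        raise ValueError("points and verdicts must have the same length")
    total = sum(p.weight for p in points)
    if total <= 0:
        raise ValueError("total weight must be positive")
    matched = sum(p.weight for p, v in zip(points, verdicts) if v is target)
    return matched / total


def weighted_alignment(points, verdicts):
    """Assumed WPA-style score: weight share of points the response aligns with."""
    return _weighted_share(points, verdicts, Verdict.ALIGNED)


def conflict_penalty(points, verdicts):
    """Assumed PCP-style score: weight share of points the response contradicts."""
    return _weighted_share(points, verdicts, Verdict.CONFLICT)


# Illustrative data only; the points and weights are made up for this sketch.
points = [
    ScoringPoint("states the main cause", 0.5),
    ScoringPoint("cites the supporting passage", 0.25),
    ScoringPoint("notes the exception", 0.25),
]
verdicts = [Verdict.ALIGNED, Verdict.CONFLICT, Verdict.MISSING]

wpa = weighted_alignment(points, verdicts)
pcp = conflict_penalty(points, verdicts)
```

Under these assumptions, a response that covers only the heaviest point earns half the alignment score, while contradicting a lighter point is reported separately as a penalty instead of being folded into the same number.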

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
