Tsinghua AI | Apr 21, 2026 | arXiv:2604.19071

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Andrew Feng, Cunxiang Wang, Yu-Wei Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu

AI Summary

The paper introduces Tree-of-Writing (ToW), a tree-structured approach that mitigates inconsistencies in LLM-as-a-judge methods by explicitly modeling the aggregation weights of sub-features in text evaluation. To support comprehensive evaluation, the authors also present HowToBench, a large-scale Chinese writing benchmark with 1302 instructions across 12 genres and 3 task categories. Experiments show that ToW achieves a 0.93 Pearson correlation with human judgments and is robust to textual disturbances, unlike overlap-based metrics and standard LLM-as-a-judge methods.

Key Contribution

LLM-as-a-judge can be made far more reliable by explicitly modeling the aggregation weights of sub-features in a tree structure, achieving near-human agreement on complex writing tasks.
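The idea of aggregating sub-feature scores through explicit weights in a tree can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the feature names, the weights, and the leaf scores below are all hypothetical, and the sketch assumes a plain weighted average at each internal node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One sub-feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0             # weight relative to sibling nodes (assumed)
    score: Optional[float] = None   # leaf score, e.g. assigned by a judge model
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the node's score as a weighted average of its children.

    Leaves return their own score; internal nodes normalize by the
    sum of child weights, so weights need not sum to 1.
    """
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"node {node.name!r} has non-positive total weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Hypothetical tree: feature names, weights, and scores are made up.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, score=9.0),
])

overall = aggregate(tree)
print(f"overall score: {overall:.2f}")
```

Making the weights explicit means the final score can be traced back to each sub-feature, rather than relying on a judge model to combine them implicitly in a single pass.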

Abstract

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance in thousand-word-level and open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates these biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that these scores cannot be improved simply by piling on input-side information.
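Agreement with human judgments is reported above as a Pearson correlation. As a rough illustration of how such a figure is computed, the sketch below implements the standard Pearson formula in plain Python. The function name `pearson` and both score lists are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five essays; not taken from the paper.
judge_scores = [7.5, 6.0, 8.2, 5.1, 9.0]
human_scores = [7.0, 6.5, 8.0, 5.0, 9.5]

print(f"Pearson r = {pearson(judge_scores, human_scores):.3f}")
```

A value near 1 indicates that the automatic scores rise and fall together with the human scores. It does not by itself show that the two sets of scores agree in absolute terms.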

Eval Frameworks & Benchmarks | Natural Language Processing | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
