Luxembourg | Mar 30, 2026 | arXiv:2603.28304

The Necessity of Setting Temperature in LLM-as-a-Judge

Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jérôme François, Radu State

AI Summary

This paper investigates the impact of temperature settings on the performance of LLM-as-a-Judge, a common paradigm for evaluating text quality and factual correctness. Through controlled experiments and causal inference, the study reveals that temperature significantly influences judge performance, challenging the common practice of using fixed temperature values like 0.1 or 1.0. The findings demonstrate that optimal temperature settings are task-dependent, providing actionable insights for designing more effective LLM-centric evaluation pipelines.

Key Contribution

LLM-as-a-Judge accuracy hinges on temperature settings, revealing a task-dependent sweet spot that defies the common practice of fixed values like 0.1 or 1.0.

Abstract

LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during evaluation (0.1 and 1.0 being the most prevalent values), a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments. We further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
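
For readers unfamiliar with the parameter under study, the following is a minimal sketch of the standard temperature-scaled softmax used in sampling. It is not taken from the paper: the logits and the three verdict labels are hypothetical values chosen for illustration, and the function name is our own. It only shows why 0.1 and 1.0 can lead a judge to behave differently, since a low temperature concentrates almost all probability on the top verdict while a temperature of 1.0 leaves the distribution unchanged.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits a judge might assign to the verdicts "A", "B", "tie".
verdict_logits = [2.0, 1.0, 0.5]

for t in (0.1, 1.0):
    probs = softmax_with_temperature(verdict_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Under these assumed logits, the judge at temperature 0.1 is close to deterministic, whereas at 1.0 it picks a non-top verdict a substantial fraction of the time, which is the kind of behavioral difference the controlled experiments set out to measure.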

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
