LLM judges, widely used in AI benchmarks, can be surprisingly unreliable: simple changes to text formatting, or paraphrasing, can lead to inconsistent judgments.