University of Science and Technology Beijing
Current LLM judges show a troubling reliability gap in long-form evaluations, raising questions about their effectiveness in real-world applications.