Subtle wording changes in benchmark rubrics can swing model performance by over 13%, revealing a hidden subjectivity in "objective" gold labels.