Subtle wording changes in benchmark rubrics can swing model performance by over 13%, revealing a hidden subjectivity in "objective" gold labels.
LLM-judged investment rationales reward verbosity and confidence over actual financial insight, penalizing concise, correct reasoning by nearly 3 points.