LLMs' apparent superhuman performance on benchmarks may be a mirage: contamination inflates scores by up to 20% in some domains, revealing a critical flaw in current evaluation practices.