The Hebrew University of Jerusalem, IBM Research
Using a top- or bottom-performing LLM as the anchor in "LLM-as-a-judge" benchmarks can dramatically skew results, so choosing a mediocre anchor is key to reliable evaluation.
General-purpose agents can match the performance of specialized agents across diverse environments without any environment-specific tuning, challenging the need for task-specific engineering.