Forget hand-crafted benchmarks: CUA-Gym's auto-generated training data lets computer-use agents crush existing open-source models on real-world tasks.
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.