Alibaba Group
LLMs still struggle with PhD-level scanning probe microscopy tasks, but SPM-Bench offers a new automated pipeline to generate challenging scientific benchmarks and quantify model "personalities" like "Conservative" or "Gambler."
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.