Search papers, labs, and topics across Lattice.
Alibaba Group
2
0
5
2
Despite achieving comparable overall scores, top-performing medical LLMs exhibit surprising differences in reasoning, evidence use, and longitudinal follow-up when evaluated on a new Chinese medical benchmark, revealing critical gaps in clinically actionable treatment planning.
LLMs still struggle with PhD-level scanning probe microscopy tasks, but SPM-Bench offers a new automated pipeline to generate challenging scientific benchmarks and quantify model "personalities" like "Conservative" or "Gambler."