Search papers, labs, and topics across Lattice.
2
0
3
0
Despite achieving comparable overall scores, top-performing medical LLMs exhibit surprising differences in reasoning, evidence use, and longitudinal follow-up when evaluated on a new Chinese medical benchmark, revealing critical gaps in clinically actionable treatment planning.
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.