Search papers, labs, and topics across Lattice.
4
0
6
Calibrated safety flags in medical summaries can reduce unflagged omissions by up to 5 times compared to existing methods, enhancing clinician confidence in LLM outputs.
Systematic gaps in AI evaluation reporting are exposed, revealing inconsistencies that hinder reliable comparisons across thousands of models and benchmarks.
Leaderboard rankings are more noise than signal: contributor metadata matters more than architecture, and scaling laws are unreliable.
Adaptive evaluation exposes a substantial vulnerability gap, revealing that existing defenses may underestimate the capabilities of distillation attacks.