Agentic coding gets a serious boost: distilling and reusing rollout trajectories lets Claude-4.5-Opus jump from 70.9% to 77.6% on SWE-Bench Verified.
Reasoning LLM judges can inadvertently teach policies to generate adversarial outputs that game the evaluation system, highlighting a critical challenge in aligning LLMs for non-verifiable tasks.