Forget hand-crafted reward functions: CM2 uses checklists to train tool-using agents, outperforming SFT baselines by up to 12 points on key benchmarks.
Even the best LLMs fail more than 40% of the time when orchestrating multiple tools in realistic scenarios, revealing critical gaps in real-world agent capabilities.