Despite advances in multimodal agents, even the best models struggle with interactive spatial reasoning, achieving only a 17.4% success rate on complex real-world tasks.
Code agents that ace software engineering benchmarks often fail when faced with slight repository perturbations, suggesting they lack genuine reasoning over repository context.
LLM agents in high-stakes domains can be verified more reliably by accumulating evidence grounded in expert guidelines, achieving a 12% AUROC improvement and 50% Brier score reduction over existing methods.