Even the best search agents struggle to exceed 35% accuracy on a benchmark designed to push the limits of long-horizon reasoning.
Current LLM agents still struggle to infer and leverage user preferences from fragmented, real-world interactions, revealing a substantial gap between their capabilities and the demands of personalized decision-making.
Agent-as-a-Judge can outperform LLM-as-a-Judge in complex environments, but still struggles to reliably verify agent behavior, revealing a critical gap in current LLM-based agent evaluation.