A unified assessment framework reveals hidden insights about agent performance, transforming how we evaluate AI systems.
Even frontier models with high reasoning budgets fail to effectively navigate densely interlinked knowledge bases and complex policies in realistic fintech customer-support scenarios, achieving a pass rate of only ~25.5%.