LLMs can be made to think more like us: prompting them with human values yields behavior surprisingly aligned with real-world psychological patterns.
Agents that excel on traditional benchmarks may crumble under the pressure of newly synthesized tasks, revealing the limitations of current evaluation methods.
Current LLM agent evaluation tools are stuck in the Stone Age, but Agentic CLEAR automates dynamic, multi-level analysis, finally offering insights that adapt to the rapidly evolving agent landscape.
Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.
Using a top- or bottom-performing LLM as the anchor in "LLM-as-a-judge" benchmarks can dramatically skew results, so choosing a mid-performing anchor is key to reliable evaluation.
General-purpose agents can match the performance of specialized agents across diverse environments without any environment-specific tuning, challenging the need for task-specific engineering.