Agent-as-a-Judge can outperform LLM-as-a-Judge in complex environments, but still struggles to reliably verify agent behavior, revealing a critical gap in current LLM-based agent evaluation.
LLM agents can internalize skills via in-context RL, achieving zero-shot autonomous behavior without the token overhead and retrieval noise of traditional memory- and retrieval-based methods.
Forget hand-tuning rollout budgets: $V_{0.5}$ dynamically allocates compute across sparse RL rollouts using a real-time statistical test against a generalist value model's prior, cutting variance and improving performance.