11 papers from Meta AI (FAIR) on Eval Frameworks & Benchmarks
On-policy reward modeling with LLM judges not only unlocks significant gains on complex mathematical reasoning tasks but also generalizes to simpler numerical and multiple-choice benchmarks.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Forget scaling laws: a specialized 8B-parameter translation model can outperform a 70B general-purpose LLM across 1,600 languages.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
LLMs can ace math problems while reasoning erratically under the hood, with 82% of correct answers arising from unstable, inconsistent logic.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
Real-world social chat deployments reveal that iterative refinement using CharacterFlywheel can boost LLM engagement by nearly 20% and dramatically improve steerability.
Safety classifiers for LLMs can fail catastrophically under even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
AI agents can now learn durable skills instead of constantly "reinventing the wheel," thanks to SkillNet's infrastructure for creating, evaluating, and connecting AI skills at scale.
Autonomous coding agents derail 30% of the time, but a lightweight intervention system can recover 90% of those misbehaviors with a single nudge.