7 papers from Meta AI (FAIR) on Eval Frameworks & Benchmarks
LLM hallucinations in enterprise workflows can be tamed by a new Hybrid Utility Minimum Bayes Risk (HUMBR) framework that combines semantic and lexical signals to reach consensus without ground truth.
Current egocentric video benchmarks miss the mark: EgoEverything uses human gaze to create questions that actually reflect how people behave, not just what they see.
Real-world coding benchmarks reveal that AI coding agents succeed more often when they iteratively validate their work with tests and static analysis, suggesting a path to better agents in unfamiliar codebases.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Autonomous coding agents derail 30% of the time, but a lightweight intervention system can recover from 90% of those misbehaviors with a single nudge.