11 papers from Meta AI (FAIR) on Eval Frameworks & Benchmarks
On-policy reward modeling with LLM judges not only unlocks significant gains on complex mathematical reasoning tasks but also generalizes to simpler numerical and multiple-choice benchmarks.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Forget scaling laws: a specialized 8B-parameter translation model can outperform a 70B general-purpose LLM across 1,600 languages.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
LLMs can ace math problems while reasoning erratically under the hood, with 82% of correct answers arising from unstable, inconsistent logic.
Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.
Real-world social chat deployments reveal that iterative refinement using CharacterFlywheel can boost LLM engagement by nearly 20% and dramatically improve steerability.
Safety classifiers for LLMs can fail catastrophically under even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
AI agents can now learn durable skills instead of constantly "reinventing the wheel," thanks to SkillNet's infrastructure for creating, evaluating, and connecting AI skills at scale.
Autonomous coding agents derail 30% of the time, but a lightweight intervention system can recover 90% of those misbehaviors with a single nudge.