15 papers from Berkeley AI Research (BAIR) on Eval Frameworks & Benchmarks
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
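A minimal sketch of what such a ground-then-refine loop could look like; `lvlm`, `detect_objects`, and `verify_claims` are illustrative stand-ins, not the paper's API:

```python
def grounded_answer(lvlm, image, question, detect_objects, verify_claims,
                    max_rounds=3):
    """Iteratively refine an answer using only verified visual evidence."""
    evidence = detect_objects(image)             # e.g. open-vocabulary detections
    answer = lvlm.generate(image, question)
    for _ in range(max_rounds):
        unsupported = verify_claims(answer, evidence)
        if not unsupported:                      # every claim is grounded; done
            return answer
        # Ask the model to revise, citing only the verified evidence.
        answer = lvlm.generate(
            image,
            f"{question}\nRevise your answer. Unsupported claims: "
            f"{unsupported}. Verified evidence: {evidence}."
        )
    return answer
```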
LLM agents can now learn on the fly and adapt to evolving user needs without disruptive downtime, thanks to a novel meta-learning framework that synthesizes new skills from failure trajectories and optimizes the base policy during inactive periods.
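Roughly, the downtime loop such a framework implies; every name here is an illustrative stand-in, not the paper's API:

```python
def idle_period_update(agent, failure_log, skill_library):
    """Distill logged failures into new skills, then refresh the base policy."""
    for trajectory in failure_log:
        skill = agent.synthesize_skill(trajectory)  # e.g. distill a reusable fix
        if skill is not None:
            skill_library.append(skill)
    agent.optimize(skill_library)    # meta-update while no users are active
    failure_log.clear()
```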
Current ML benchmarks may be gameable even in theory, as they can lack a stable equilibrium where developers are incentivized to improve true model quality rather than just leaderboard scores.
Existing QA benchmarks are too easy for LLMs, so iAgentBench offers a more realistic challenge by requiring agents to synthesize information from multiple sources on high-traffic topics.
Models are substantially better at pairwise self-verification than at independent scoring, unlocking a more efficient and accurate approach to test-time scaling for complex reasoning.
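A hedged sketch of how pairwise verification can drive test-time scaling: instead of scoring each sampled solution independently, run a knockout tournament of pairwise comparisons (`model.compare` is a hypothetical judge call, not the paper's API):

```python
import random

def tournament_select(model, question, candidates):
    """Pick a solution via a knockout tournament of pairwise verifications."""
    pool = list(candidates)
    random.shuffle(pool)
    while len(pool) > 1:
        winners = []
        for a, b in zip(pool[::2], pool[1::2]):
            # Ask which of two candidate solutions is more likely correct.
            winners.append(a if model.compare(question, a, b) == "A" else b)
        if len(pool) % 2:            # odd candidate out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```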
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Human-written solutions can actually *hurt* model performance on math problems, highlighting a critical gap between strategy usage and executability that Selective Strategy Retrieval (SSR) effectively bridges.
Now you can audit black-box LLM APIs for cheating (model substitution, overbilling) with <1% overhead, using verifiable computation.
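The protocol shape only, not the paper's construction: the provider returns each response with a succinct proof tied to a public commitment to the model, and the client checks it locally (`provider.query` and `verify` are stand-ins for a verifiable-computation backend):

```python
from dataclasses import dataclass

@dataclass
class AuditedResponse:
    output: str
    proof: bytes   # succinct proof tied to a public commitment to the model

def audited_query(provider, verify, verifier_key, model_commitment, prompt):
    """Query a black-box API and check the returned proof before trusting it."""
    resp: AuditedResponse = provider.query(prompt)  # hypothetical client call
    # Local proof verification is cheap relative to inference, which is where
    # a sub-1% overhead claim would come from.
    if not verify(verifier_key, model_commitment, prompt,
                  resp.output, resp.proof):
        raise RuntimeError("proof failed: possible substitution or overbilling")
    return resp.output
```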
A global consensus on AI safety risks and capabilities has emerged from a panel of 100+ independent experts, representing a landmark effort in international collaboration.
Forget temperature scaling: JUCAL calibrates aleatoric and epistemic uncertainty in classifier ensembles, achieving SOTA results with significantly smaller ensembles and lower inference costs.
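JUCAL's internals aren't reproduced here, but the standard ensemble decomposition such a calibrator operates on is: total predictive entropy splits into an aleatoric term (expected member entropy) and an epistemic term (the mutual information between prediction and member):

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """member_probs: (n_members, n_classes) array of softmax outputs."""
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()       # H[E[p]]
    aleatoric = -(member_probs * np.log(member_probs + eps)) \
        .sum(axis=1).mean()                              # E[H[p]]
    epistemic = total - aleatoric                        # mutual information
    return total, aleatoric, epistemic
```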
Detect LLM watermarks 13-15% more efficiently by using e-values for anytime-valid inference, enabling early stopping without sacrificing statistical guarantees.
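A minimal sketch of the anytime-valid recipe, assuming a per-token e-value with expectation at most 1 under the no-watermark null (`evalue_fn` is a stand-in for the paper's construction). By Ville's inequality, the running product exceeds 1/alpha with probability at most alpha under the null, so detection may stop the moment it does:

```python
def detect_watermark(tokens, evalue_fn, alpha=0.01):
    """Return (watermarked?, tokens consumed); stops early on strong evidence."""
    wealth = 1.0
    for i, tok in enumerate(tokens, start=1):
        wealth *= evalue_fn(tok)          # multiply per-token e-values
        if wealth >= 1.0 / alpha:
            return True, i                # anytime-valid rejection of the null
    return False, len(tokens)
```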
Autonomous driving benchmarks get a reality check: ScenicRules exposes failures by combining prioritized, multi-objective rules with formally modeled, stochastic scenarios.
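Illustrative only, not ScenicRules' API: under prioritized, multi-objective rules, a trajectory is judged by the highest-priority rule it violates, so safety rules dominate comfort rules:

```python
RULES = [  # (priority, name, predicate); lower number = higher priority
    (0, "no_collision",   lambda traj: all(s["clearance_m"] > 0.0 for s in traj)),
    (1, "keep_in_lane",   lambda traj: all(abs(s["lane_offset_m"]) < 1.5 for s in traj)),
    (2, "smooth_braking", lambda traj: all(s["decel_mps2"] < 6.0 for s in traj)),
]

def worst_violation(traj):
    """Return the highest-priority violated rule, or None if all pass."""
    violated = [(p, name) for p, name, ok in RULES if not ok(traj)]
    return min(violated) if violated else None
```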
LLMs can't reliably generate the very skills that boost their performance, and smaller models equipped with expert-crafted skills can rival larger, skill-less models.
Despite progress in AI safety, it's still largely unknown how well current safeguards prevent AI harms, and their effectiveness varies wildly.
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates' scores by 25.6% on average, even when the content is equivalent.
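A minimal sketch of the paired-evaluation protocol behind a finding like this: score content-equivalent hedged and unhedged texts and measure the relative drop (`judge_score` is a hypothetical LLM-judge call returning a numeric rating):

```python
def hedging_penalty(judge_score, pairs):
    """pairs: (hedged_text, content-equivalent unhedged_text) tuples.
    Returns the mean relative score drop attributable to hedging alone."""
    drops = []
    for hedged, plain in pairs:
        s_hedged, s_plain = judge_score(hedged), judge_score(plain)
        drops.append((s_plain - s_hedged) / s_plain)
    return sum(drops) / len(drops)   # ~0.256 would match the reported 25.6%
```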