18 papers from CMU Machine Learning on Eval Frameworks & Benchmarks
Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than theoretically optimal in a new diagnostic environment.
On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.
Strategic recovery from failures is key to deploying robots for complex assembly tasks in the real world.
Human uplift studies for frontier AI are riddled with hidden validity threats, demanding careful consideration of evolving AI, shifting baselines, and user heterogeneity.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
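To make the calibration claim concrete, here is a small, generic expected-calibration-error (ECE) computation — not the paper's code, and the numbers are made up — showing how overconfidence on wrong answers inflates the metric:

```python
import numpy as np

# Illustrative only: a tiny expected-calibration-error (ECE) computation.
# ECE measures the gap between stated confidence and actual accuracy, binned by confidence.
def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# An overconfident model: high stated confidence, mediocre accuracy -> large ECE.
print(expected_calibration_error([0.95, 0.9, 0.92, 0.88], [1, 0, 0, 1]))
```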
Finally, a standardized benchmark for heterogeneous treatment effect (HTE) estimation in survival analysis lets you rigorously compare methods across synthetic, semi-synthetic, and real-world datasets.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
AI tools are surprisingly bad at classifying the cognitive demand of math problems, with accuracy barely above chance and a systematic bias towards average difficulty, raising concerns about their utility in supporting teachers.
Today's frontier LLMs can't autonomously patch critical zero-day vulnerabilities, revealing a significant gap in their cyberdefense capabilities.
General-purpose LLM agents stumble badly when faced with the messy reality of diverse, multi-domain tasks, and simply scaling interactions or parallel sampling doesn't fix it.
Forget training on narrow GitHub issues – Hybrid-Gym unlocks surprisingly broad coding skills by teaching agents to explore codebases and design architectures in synthetic environments.
Want to boost student performance in the age of GenAI? This RCT shows that scalable prompting interventions, grounded in the ICAP framework, significantly improve students' prompting skills and, ultimately, their exam scores.
MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.
VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.
Regret matching, the unsung hero of two-player zero-sum games, now dominates first-order optimizers in broader imperfect-recall decision problems, opening new avenues for AI safety and privacy.
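For readers unfamiliar with the algorithm behind that result, here is a minimal regret-matching sketch via self-play on rock-paper-scissors — a toy two-player zero-sum game, not the paper's imperfect-recall setting, and not the paper's implementation:

```python
import numpy as np

# Payoff to player 0: rows = player 0's action, columns = player 1's action.
PAYOFF = np.array([[0, -1, 1],    # rock
                   [1, 0, -1],    # paper
                   [-1, 1, 0]])   # scissors

def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret (uniform if none)."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def regret_matching_self_play(iterations=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n = PAYOFF.shape[0]
    regrets = [np.zeros(n), np.zeros(n)]
    strategy_sums = [np.zeros(n), np.zeros(n)]
    for _ in range(iterations):
        strats = [strategy_from_regrets(r) for r in regrets]
        actions = [rng.choice(n, p=s) for s in strats]
        for p in range(2):
            opp = actions[1 - p]
            # Payoff each of this player's actions would have earned against the opponent's move.
            payoffs = PAYOFF[:, opp] if p == 0 else -PAYOFF[opp, :]
            regrets[p] += payoffs - payoffs[actions[p]]
            strategy_sums[p] += strats[p]
    # The *average* strategy converges toward equilibrium (roughly uniform for this game).
    return [s / s.sum() for s in strategy_sums]

print(regret_matching_self_play())
```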
Multimodal agents still struggle with game development, solving only ~50% of tasks in a new benchmark, GameDevBench, highlighting the need for better multimodal reasoning in complex software environments.
LLMs' impressive general knowledge evaporates when faced with African economic data, as even advanced RAG pipelines struggle to answer questions based on World Bank reports, revealing a stark domain-specific knowledge gap.
LLMs struggle to track state across multiple tool-use steps, but a surprisingly simple fix—restating prior variable values—yields substantial performance gains.
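A hypothetical sketch of that "restate prior state" fix: before each new tool-use step, prepend a summary of the variables produced by earlier tool calls so the model need not recall them from a long transcript. Function names, tool names, and the example history below are illustrative, not from the paper:

```python
# Sketch only: restate prior variable values ahead of each tool-use step.
def build_step_prompt(task, tool_history, next_instruction):
    state_lines = [
        f"{name} = {value!r}  # from {tool}"
        for name, (value, tool) in tool_history.items()
    ]
    state_block = "Known values from earlier steps:\n" + "\n".join(state_lines)
    return f"{task}\n\n{state_block}\n\nNext step: {next_instruction}"

# Example usage with a made-up tool-call history.
history = {
    "order_id": ("A-1042", "lookup_order"),
    "refund_amount": (37.50, "compute_refund"),
}
print(build_step_prompt(
    "Process the customer's refund request.",
    history,
    "Call issue_refund with the order id and amount above.",
))
```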