28 papers from CMU Machine Learning on Eval Frameworks & Benchmarks
Existing robotic methods falter in tackling fundamental physical reasoning challenges, as evidenced by KinDER's rigorous benchmark evaluation.
Today's best web agents are shockingly inefficient, achieving only 1.15% trajectory efficiency on realistic long-horizon tasks, revealing a critical need to move beyond simple success rates.
Continual learning for LLM agents hits a wall: scaling models doesn't reliably improve skill generation, and self-feedback can lead to recursive drift.
Current user modeling benchmarks are child's play compared to the real-world challenges exposed by HORIZON, a massive new dataset spanning 54M users and diverse domains.
Frontier LLMs are surprisingly vulnerable to a wide range of task-specific exploits, from simple output spoofing to rootkit-style binary hijacking, even in seemingly well-defined environments.
Forget carrots and sticks: contracts and mediation are the surprisingly effective keys to unlocking cooperation between LLMs, even when individual incentives push toward defection.
Agentic coding gets a serious boost: distilling and reusing rollout trajectories lets Claude-4.5-Opus jump from 70.9% to 77.6% on SWE-Bench Verified.
Stop evaluating AI systems in isolation: marketplace dynamics like user switching and early-adoption advantages critically shape real-world success.
LLMs can mimic human writing, but not as well as you think: genre matters more than the source (human vs. LLM), and model choice trumps decoding strategy when it comes to style.
Data augmentation with LLMs can tank your NER performance even when it boosts POS tagging, proving task structure matters more than synthetic data quality.
DPO might not be the only game in town: a decision-directed approach to reward modeling can outperform it in pairwise preference optimization.
Interpretability methods often fail to improve over black-box prompting when models are uncooperative, suggesting current techniques may be more about elicitation than revealing internal mechanisms.
SAM models exhibit surprisingly divergent behaviors under occlusion, with some prioritizing visible tissue and others confidently hallucinating hidden anatomy.
Stop reimplementing multimodal models: TorchUMM offers a unified codebase for evaluation, analysis, and post-training, streamlining research across diverse architectures and tasks.
Today's best AI agents can only complete 33% of common online tasks like booking appointments or filling out job applications, revealing a significant gap between current capabilities and real-world utility.
LLMs struggle to synthesize scientific conclusions from structured biomedical evidence, and current metrics fail to capture nuanced differences in their reasoning abilities.
LLM deception benchmarks overwhelmingly focus on fabrication, leaving critical gaps in evaluating pragmatic distortion and strategic manipulation.
Just 10 minutes of AI assistance can measurably degrade your ability to solve problems on your own.
LLM-powered forums may generate norm-aware language, but they fail to foster the crucial back-and-forth needed for communities to teach, enforce, and revise those norms.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
Finally, a standardized benchmark for survival analysis HTE estimation lets you rigorously compare methods across synthetic, semi-synthetic, and real-world datasets.
Forget simulated manipulation: ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Today's frontier LLMs can't autonomously patch critical zero-day vulnerabilities, revealing a significant gap in their cyberdefense capabilities.
Want to boost student performance in the age of GenAI? This RCT shows that scalable prompting interventions, grounded in the ICAP framework, can significantly improve student prompting skills and, ultimately, exam scores.
MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.
VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.
Multimodal agents still struggle with game development, solving only ~50% of tasks in a new benchmark, GameDevBench, highlighting the need for better multimodal reasoning in complex software environments.
LLMs' impressive general knowledge evaporates when faced with African economic data, as even advanced RAG pipelines struggle to answer questions based on World Bank reports, revealing a stark domain-specific knowledge gap.