Search papers, labs, and topics across Lattice.
Evaluation methodology for AI systems, benchmark design, capability measurement, and safety evaluations.
#5 of 24
3
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Hallucination detection can be reframed as a dynamical systems problem, enabling a surprisingly effective and efficient black-box approach that avoids expensive sampling or external knowledge retrieval.
LLMs harbor easily discoverable "natural backdoors"—token sequences that trigger harmful outputs without any semantic instruction, revealing a concerning vulnerability beyond traditional prompt-based jailbreaks.
Regularizing model sensitivity along the expected covariate drift directions, rather than isotropically, significantly improves the robustness of frozen models deployed in non-stationary environments.
LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.
Unstable BO leaderboard rankings? They're likely due to ignoring the budget ratio (B/|A|) and prior rank correlation, which this paper elegantly captures with the Portable Regime Score (PRS) to predict performance reversals.
Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.
Attention-based models for programming knowledge tracing might not be as effective as previously thought; careful experimental design reveals that their gains over simpler models are often overstated.
Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.
Hallucination detection can be nearly as effective with a single forward pass as with expensive multi-sample methods.
Interventions on LLMs, like knowledge editing or unlearning, can have surprising side effects that this automated pipeline can now surface and validate.
Current LLM jailbreak evaluations are inadequate, often relying on narrow metrics, necessitating a multi-dimensional framework like Security Cube for comprehensive security assessment.
Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.
Frontier LLMs are leaving 70% of relevant pharmaceutical assets undiscovered, a gap that can be largely closed by swapping generic web search for a curated index.
Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.
AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.
Automating rubric-based feedback on presentation slides is now feasible and perceived as useful, thanks to LLMs and learning analytics dashboards.
LLM-guided code evolution, when combined with runtime feedback and MCTS, can reliably achieve 15x speedups on real-world Java code, surpassing naive LLM-based optimization.
LLM uncertainty can be efficiently estimated *without* sampling by measuring the stability of output distributions under semantically equivalent input perturbations.
Agent-repair leaderboards are more fragile than we thought: methods that peek at the evaluator's signals to guide internal repair choices can cause drastic reordering when the evaluator changes.
Developer-style keyword searches completely nullify the advantage of even the best code embedding models, highlighting a critical gap in current code search techniques.
Seemingly harmless fine-tuning data can stealthily nudge LLMs toward unsafe behavior by subtly shifting model parameters in "danger-aligned" directions.
LLMs can leapfrog current network troubleshooting benchmarks by explicitly encoding structured diagnostic policies, rather than relying on free-form deliberation.
A judge-orchestrated ensemble of diverse LLMs trounces single models in multi-turn response generation, proving that strategic model selection beats brute force scaling.
LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.
Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
LMs encode grammaticality as a distinct feature in their hidden representations, separable from raw string probability and generalizable across languages.
LLMs ace MRI multiple-choice tests, but can't actually recall basic facts about GE scanners, revealing a dangerous gap between perceived and actual competence.
Overconfident predictions plague mental health prediction models, but this new framework leverages evidential learning to provide more trustworthy uncertainty estimates and human-understandable reasoning signals.
LLMs differ most not in personality, but in how they represent themselves as having (or not having) rich internal experience.
Attention heads hold the key to detecting LLM hallucinations, offering a lightweight, white-box alternative to expensive sampling or external models.
Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.
TabEmbed leapfrogs existing text embedding models to achieve SOTA performance on tabular data by reformulating tasks as semantic matching problems and using contrastive learning.
Small LLMs paired with symbolic solvers can outperform larger zero-shot LLMs on formal reasoning tasks, but still struggle with multilingual inputs.
LLM benchmarks are missing a critical ingredient: social science data, which could significantly improve model generalization and robustness across a wide range of disciplines.
Ditch the black box: This unsupervised semantic projection method rivals supervised models in psychological assessment, offering interpretability and generalizability that supervised methods lack.
LLM surrogates in low-data optimization are far more sensitive to prompt engineering and query protocols than previously appreciated, fundamentally altering their beliefs and downstream performance.
LLMs can be surprisingly brittle: simply rephrasing a prompt, even while preserving its meaning, can cause them to completely abandon the requested output format.
Even state-of-the-art multilingual models struggle to tag parts-of-speech in Tajik when trained on isolated words, highlighting the critical role of syntactic context.
Stop hand-crafting QA datasets for evaluating RAG systems: DoGMaTiQ automates the process with surprisingly high correlation to human judgment, even across languages.
Stop reinventing the wheel (or worse, comparing apples to oranges) in XAI evaluation: a standardized "XAI Evaluation Card" could finally bring clarity and rigor to a fragmented field.
Roblox's chat moderation misses a disturbing amount of grooming, bullying, and other harmful content, despite its reliance on automated systems.
Despite achieving high accuracy on individual datasets, machine learning models for intrusion detection exhibit a significant generalization gap, with performance dropping drastically when tested on unseen network environments.
LLM agents that autonomously explore code repositories can match the classification accuracy of simpler LLMs with hand-crafted context, hinting at a future where agents surpass human-labeled data in complex software understanding tasks.
Developers overwhelmingly trust and directly apply LLM-generated code refactoring suggestions, but when they don't, the changes are surprisingly drastic and predictable.
GenAI coding assistants boost developer productivity, but the gains shrink outside the lab and don't translate to better learning.
Turns out, chunking code by function is the *worst* way to do retrieval-augmented code completion.
"Vibe coding" platforms promise effortless app creation, but SWE-WebDevBench reveals they often deliver visually appealing frontends with broken backends, struggle with security, and require significant human effort to reach production readiness.
Current alignment benchmarks are misleading: even if a model aces them, its real-world alignment could be totally different depending on the specific deployment context.
Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.