100 papers published across 6 labs.
Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.
Hallucination detection can be reframed as a dynamical systems problem, enabling a surprisingly effective and efficient black-box approach that avoids expensive sampling or external knowledge retrieval.
LLMs harbor easily discoverable "natural backdoors"—token sequences that trigger harmful outputs without any semantic instruction, revealing a concerning vulnerability beyond traditional prompt-based jailbreaks.
Regularizing model sensitivity along the expected covariate drift directions, rather than isotropically, significantly improves the robustness of frozen models deployed in non-stationary environments.
LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.
Unstable BO leaderboard rankings? They're likely due to ignoring the budget ratio (B/|A|) and prior rank correlation, which this paper elegantly captures with the Portable Regime Score (PRS) to predict performance reversals.
Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.
Attention-based models for programming knowledge tracing might not be as effective as previously thought; careful experimental design reveals that their gains over simpler models are often overstated.
Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.
Hallucination detection can be nearly as effective with a single forward pass as with expensive multi-sample methods.
Interventions on LLMs, like knowledge editing or unlearning, can have surprising side effects that this automated pipeline can now surface and validate.
Current LLM jailbreak evaluations are inadequate because they often rely on narrow metrics; a multi-dimensional framework like Security Cube is needed for comprehensive security assessment.
Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.
Frontier LLMs are leaving 70% of relevant pharmaceutical assets undiscovered, a gap that can be largely closed by swapping generic web search for a curated index.
Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.
AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.
Automating rubric-based feedback on presentation slides is now feasible and perceived as useful, thanks to LLMs and learning analytics dashboards.
LLM-guided code evolution, when combined with runtime feedback and MCTS, can reliably achieve 15x speedups on real-world Java code, surpassing naive LLM-based optimization.
LLM uncertainty can be efficiently estimated *without* sampling by measuring the stability of output distributions under semantically equivalent input perturbations.
Agent-repair leaderboards are more fragile than we thought: methods that peek at the evaluator's signals to guide internal repair choices can cause drastic reordering when the evaluator changes.
Developer-style keyword searches completely nullify the advantage of even the best code embedding models, highlighting a critical gap in current code search techniques.
Seemingly harmless fine-tuning data can stealthily nudge LLMs toward unsafe behavior by subtly shifting model parameters in "danger-aligned" directions.
LLMs can leapfrog current network troubleshooting benchmarks by explicitly encoding structured diagnostic policies, rather than relying on free-form deliberation.
A judge-orchestrated ensemble of diverse LLMs trounces single models in multi-turn response generation, proving that strategic model selection beats brute force scaling.
LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.
Current image difference captioning benchmarks fail to capture semantic consistency or to penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.
LMs encode grammaticality as a distinct feature in their hidden representations, separable from raw string probability and generalizable across languages.
LLMs ace MRI multiple-choice tests, but can't actually recall basic facts about GE scanners, revealing a dangerous gap between perceived and actual competence.
Overconfident predictions plague mental health prediction models, but this new framework leverages evidential learning to provide more trustworthy uncertainty estimates and human-understandable reasoning signals.
LLMs differ most not in personality, but in how they represent themselves as having (or not having) rich internal experience.
Attention heads hold the key to detecting LLM hallucinations, offering a lightweight, white-box alternative to expensive sampling or external models.
Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.
TabEmbed leapfrogs existing text embedding models to achieve SOTA performance on tabular data by reformulating tasks as semantic matching problems and using contrastive learning.
Small LLMs paired with symbolic solvers can outperform larger zero-shot LLMs on formal reasoning tasks, but still struggle with multilingual inputs.
LLM benchmarks are missing a critical ingredient: social science data, which could significantly improve model generalization and robustness across a wide range of disciplines.
Ditch the black box: this unsupervised semantic projection method rivals supervised models in psychological assessment, offering interpretability and generalizability that supervised methods lack.
LLM surrogates in low-data optimization are far more sensitive to prompt engineering and query protocols than previously appreciated, fundamentally altering their beliefs and downstream performance.
LLMs can be surprisingly brittle: simply rephrasing a prompt, even while preserving its meaning, can cause them to completely abandon the requested output format.
Even state-of-the-art multilingual models struggle to tag parts of speech in Tajik when trained on isolated words, highlighting the critical role of syntactic context.
Stop hand-crafting QA datasets for evaluating RAG systems: DoGMaTiQ automates the process with surprisingly high correlation to human judgment, even across languages.
Stop reinventing the wheel (or worse, comparing apples to oranges) in XAI evaluation: a standardized "XAI Evaluation Card" could finally bring clarity and rigor to a fragmented field.
Roblox's chat moderation misses a disturbing amount of grooming, bullying, and other harmful content, despite its reliance on automated systems.
Despite achieving high accuracy on individual datasets, machine learning models for intrusion detection exhibit a significant generalization gap, with performance dropping drastically when tested on unseen network environments.
LLM agents that autonomously explore code repositories can match the classification accuracy of simpler LLMs with hand-crafted context, hinting at a future where agents surpass human-labeled data in complex software understanding tasks.
Developers overwhelmingly trust and directly apply LLM-generated code refactoring suggestions, but when they don't, the changes are surprisingly drastic and predictable.
GenAI coding assistants boost developer productivity, but the gains shrink outside the lab and don't translate to better learning.
Turns out, chunking code by function is the *worst* way to do retrieval-augmented code completion.
"Vibe coding" platforms promise effortless app creation, but SWE-WebDevBench reveals they often deliver visually appealing frontends with broken backends, struggle with security, and require significant human effort to reach production readiness.
Current alignment benchmarks are misleading: even if a model aces them, its real-world alignment could be totally different depending on the specific deployment context.
Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.
Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.
Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.
Video-LLMs aren't failing at perception; they're being tricked by their own assumptions, and a new dataset and reasoning chain can fix that.
Existing restoration methods crumble when faced with the extreme geometric distortions caused by strong refractive warping, highlighting the need for robust new approaches benchmarked on this challenging dataset.
Current video generation benchmarks overlook crucial aspects of physical plausibility and temporal coherence, highlighting the need for holistic evaluation metrics like PhyScore.
LLMs struggle to navigate the complex, multi-turn justification and response dynamics of real-world patent examination, revealing critical gaps in legal reasoning and technical novelty judgment.
Current world models struggle with basic physical interaction tasks like distance perception and trajectory following, highlighting a critical gap in their ability to simulate realistic environments.
Today's AI agents are surprisingly inept at navigating the messy reality of digital workspaces, failing to reach even 70% accuracy on tasks that require understanding file dependencies.
Forget resource-intensive pipelines: a purely academic team achieves SOTA search agent performance with just 10.6k SFT data points, outperforming models trained with CPT+SFT+RL.
LLMs beat doctors at everyday symptom diagnosis, but only when they proactively interview patients instead of passively answering questions.
LLMs struggle with causal reasoning when noise is introduced, but explicitly modeling causal graphs can dramatically improve performance and generalization.
LLMs are surprisingly good at pinpointing what's *wrong* with student writing, even outperforming human graders in identifying relative weaknesses.
Existing hallucination detection methods are missing subtle, word-level medical errors, but a new data-centric pipeline and detector close the gap by 15%.
Despite impressive multilingual capabilities, today's LLMs still can't reliably translate between English and Ghanaian languages at scale.
LLMs exhibit a surprising "False Illegitimation bias," systematically misclassifying legitimate battles as violence against civilians, highlighting a critical flaw for conflict monitoring applications.
LLMs may sound convincing when writing academic content, but they can still confidently fabricate facts and references at surprisingly high rates.
Forget the heavy transformers: surprisingly effective LLM-generated code detection can be achieved with lightweight stylometric features and decision trees, offering near-instant inference.
LLM benchmarks are increasingly measuring the capabilities of yesterday's models, not today's frontier, creating a widening gap that misrepresents the state of AI.
Scaling clinical LLMs doesn't guarantee safety: high-risk errors persist even with advanced RAG and max-context prompting, highlighting the critical role of evidence quality and deployment strategy.
LLMs can exhibit gender bias in emergency triage even when well-calibrated, and interventions effective for one model may backfire on another.
LLMs' own self-judgments, when logically linked to their response features, can significantly improve hallucination detection.
Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.
Naive application of transformer-based AI-text detectors can be brittle under distribution shift, but attention-based fusion of readability and vocabulary features can significantly improve robustness.
Language models can play the counterexample game, but their philosophical reasoning hits diminishing returns fast, and they're far more lenient judges than humans.
Clinicians trust AI recommendations nearly 3x more when those recommendations are broken down into verifiable facts linked to source guidelines, blowing traditional explainability out of the water.
Even top LLM judges struggle to reliably detect violations of specific constraints in complex instructions, especially when violations are partial or absent, revealing critical blind spots in current evaluation methods.
Separating LLMs into a deliberate validation layer, rather than making them an architectural default, can improve trustworthiness and efficiency in agentic AI systems.
Neural retrievers, despite their success on standard benchmarks, fail spectacularly when forced to reason about set-theoretic constraints, revealing a reliance on spurious correlations rather than true compositional understanding.
LLMs in Korean judicial workflows are surprisingly prone to hallucination, bias, and inconsistency, especially when retrieving precedents and summarizing jurisprudence.
Forget scaling laws: QLoRA-tuned Mistral 7B crushes other architectures for low-resource Tajik text generation, highlighting the importance of architecture choice in PEFT.
Pinpointing exactly where humans end and LLMs begin in co-authored text is now possible, thanks to a clever adaptation of time-series change point detection.
LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.
Innocuous-looking coding tasks, when chained together, trick even the best coding agents into creating exploitable code with alarming frequency.
LLM safety filters, which rely on semantic pattern matching, can be bypassed at scale by encoding harmful prompts as coherent mathematical problems, revealing a fundamental vulnerability.
Existing defenses crumble when LLM agents face prompt injections that adapt to dynamic context, but ARGUS offers a robust solution by tracking the provenance of agent decisions.
LLM agent skills are needlessly brittle and insecure: SkCC compiles them into a portable, hardened format that boosts performance by 50% and proactively blocks attacks.
Java developers drowning in unfixed bugs, rejoice: automated reproduction test generation is now a viable option, thanks to a new benchmark and adapted generator.
LLMs spontaneously exhibit collaborative behaviors like perspective-taking and theory of mind in embodied settings, suggesting a surprising capacity for modeling human collaborators without explicit training.
Forget running the full gauntlet: just 4-5 workloads from SPEC CPU2026 can accurately mirror the entire suite, slashing evaluation costs without sacrificing fidelity.
Public antiviral drug discovery datasets are riddled with errors that can be fixed with careful polyprotein splitting, unlocking significant performance gains in binding affinity prediction.
Modern speech models struggle to generalize to noisy, domain-specific African speech, highlighting a critical gap for localized voice AI.
RAG systems can now reduce unsafe answers by 37% using SURE-RAG, a transparent evidence verification method that outperforms even GPT-4o in controlled sufficiency tasks.
LLMs can generate formally correct postconditions for code, but they often miss crucial details, especially in complex, real-world scenarios.
LLMs can't rebuild software from scratch, even for widely used programs like FFmpeg and SQLite, revealing a critical gap in their ability to make high-level software architecture decisions.
Today's best AI agents can only solve 55% of real-world academic tasks that university students find challenging, revealing a significant gap between current AI capabilities and the demands of academic workflows.
Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.
HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.
Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.
Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.
Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.