100 papers published across 6 labs.
WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.
Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.
Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.
Today's best AI agents can only solve 55% of real-world academic tasks that university students find challenging, revealing a significant gap between current AI capabilities and the demands of academic workflows.
Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.
HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.
Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.
Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.
Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.
Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.
MLLMs hallucinate less when you nudge them to pay more attention to non-text inputs during inference, without any training.
Expressive piano performance rendering is improving, but RenCon 2025 reveals we're still far from replicating human musicality.
Current audio-visual models nail unimodal quality but still struggle to make music and dance move together rhythmically, highlighting a key gap TMD-Bench is designed to address.
LLMs can't reliably count beyond a small number of steps, revealing a surprising brittleness in their ability to execute seemingly simple procedures despite fluent performance on complex tasks.
Current MLLM-driven UAV agents still struggle with spatial memory and aerial adaptation when tasked with autonomously exploring and reasoning about victim locations in realistic search and rescue scenarios.
LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
Multi-turn medical AI agents trained with RL tend to collapse into verbose, single-turn monologues, but a novel self-distillation method can restore multi-turn tool use and improve performance.
Hyperbolic embeddings are powerful, but a fragmented ecosystem makes them hard to use—this framework finally puts them all in one place.
Training on D3-Gym, a new dataset of real-world scientific tasks with verifiable environments, closes the gap between open-source and proprietary models on ScienceAgentBench by 7.8 points.
See how LLMs' stances on vaccines, disinformation, and gender equality shift when they "become" different people, thanks to a new dataset of 190,000 persona-driven debates.
Transformers struggle to extrapolate to syntactically novel programs in program synthesis, even with significant compute scaling, suggesting current approaches are bottlenecked by a lack of training diversity.
Stop wasting compute on fine-tuning datasets with hidden capability gaps: GoalCover lets you diagnose and fix them *before* training.
Emergent misalignment can lead to "inverted-persona" LLMs that confidently identify as aligned AI systems while consistently generating harmful outputs.
Current multimodal LLMs struggle to understand scientific spectra, but a new benchmark and data processing technique could change that.
Even with emotion-aware prompting, today's best small language models still struggle to preserve subtle emotional nuances when translating between languages.
Even the most advanced language models still lose money and demonstrate unsophisticated strategies when tasked with maximizing long-term bankroll growth in a realistic sports betting simulation, highlighting a significant gap in their sequential decision-making capabilities.
Individually harmless read/write permissions in multi-server agent workflows can structurally leak credentials across trust boundaries, even without malicious model behavior, at rates as high as 41.3%.
Google's AI Overviews favor Google-owned content and penalize sites blocking its AI crawler, raising serious questions about fairness and bias in the emerging generative search landscape.
LLMs struggle to complete RTL code, and their performance hinges on the grammatical structure of the missing code and the prompting strategy used.
Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.
LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.
LLMs can accurately recall constraints while simultaneously violating them, with "knows-but-violates" rates ranging from 8% to 99%, revealing a fundamental flaw in multi-turn ideation.
LLMs reveal that research data is being reused far more often than previously thought, suggesting open science's impact is bigger than we realized.
Forget training LLMs to understand privacy policies – a specialized, expert-annotated dataset and hybrid framework can do it better, achieving superior readability and reliability.
ChatGPT for Clinicians, not human doctors, currently achieves the highest scores on a new benchmark of real-world clinical LLM tasks.
Even GPT-5.1 struggles to distinguish AI-generated academic images from real ones, achieving only 48.8% accuracy, revealing a significant gap between generative and forensic AI capabilities.
Your AI chatbot conversations aren't as private as you think: most leak conversation content and user identity to third-party trackers.
LLM reading assistants don't need to hallucinate to be harmful; they can subtly steal the user's interpretive labor, even when designed with "epistemic guardrails."
Instruction tuning on a new dataset, SecGoal, allows smaller 7B/9B parameter models to outperform much larger LLMs in extracting and formalizing security goals from protocol documents.
LLMs still can't reliably reverse engineer stripped binaries, and REBench offers a standardized, fair-by-construction benchmark to finally measure progress.
Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.
Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.
The standard "human-likeness" test for user simulators is essentially useless for predicting whether they produce valid system rankings.
Stop retrieving passages in your RAG system: NuggetIndex shows that retrieving and filtering atomic "nuggets" of information yields substantial gains in recall, temporal correctness, and reduced conflicts.
Current image forensics fall flat when faced with the subtle manipulations now possible in 3D Gaussian Splatting scenes, highlighting a critical gap in content authenticity assessment.
Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.
Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.
Forget Shakespeare, LLMs can now sling verses in Arabic dialects, thanks to a new dataset for instruction-guided poetry generation.
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Today's best GUI agents choke on real-world, multi-application workflows, achieving less than 21% success rate, revealing a critical gap in their ability to coordinate across applications and perform conditional reasoning.
LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data are required.
Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.
LMs can now selectively abstain from answering with provable guarantees, thanks to a new method that uses representation geometry to better gauge when they're out of their depth.
LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.
VLMs playing the Prisoner's Dilemma can be manipulated into selfish behavior simply by showing them images of aggression or reward matrices with specific color schemes.
Real-world Text-to-SQL systems can now be continuously evaluated and improved in production, even without access to database schemas or ground-truth queries.
Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.
LLMs are rapidly transforming peer review, but critical gaps remain in ensuring quality, fairness, and ethical considerations across the entire workflow.
General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.
Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.
Claims of human-like cognition in models like CENTAUR crumble under LAPITHS, a framework that reveals these models' performance can be replicated by systems lacking cognitive plausibility.
Retrieval improvements don't always boost reasoning in RAG systems, but NeocorRAG's evidence chains can fix that, achieving SOTA with 20% fewer tokens.
Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.
Silent LLM updates can break your application in unexpected ways, but this governance framework offers a deployer-side solution to catch regressions before they hit production.
LLMs can now reliably generate IC verification testbenches, not by writing HDL directly, but by orchestrating a novel hybrid approach that combines LLM-driven planning with template-based HDL generation.
Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.
Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.
LLMs beat word counts for predicting mental health from therapeutic writing, proving that *how* you tell a story matters more than *what* words you use.
Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.
General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.
LLM-powered query reformulation, a hot topic in IR, often fails to translate gains from lexical to neural retrieval, and bigger models don't always help.
LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
Subtle wording changes in benchmark rubrics can swing model performance by over 13%, revealing a hidden subjectivity in "objective" gold labels.
LLM upgrades are a chaotic mix of progress and decay: despite overall gains, up to 47% of questions get *worse* after an update, and single-shot evals miss almost half of these critical regressions.
LLMs reliably capture emotions with explicit lexical markers, but systematically fail on pragmatically complex emotions requiring contextual inference, revealing a critical limitation in their ability to understand nuanced human emotion.
Expect pretrial risk assessment tools to be wrong more often than right when flagging someone as "high risk" for rare violent re-offense, regardless of recalibration efforts.
LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.
LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.
Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.
Seemingly strong segmentation models can fail at clinically critical tumor-vessel interfaces, highlighting the need for uncertainty-aware AI in pancreatic cancer staging.
Current MLLMs still struggle to connect the dots between images and text when they're interleaved, highlighting a critical gap in real-world multimodal understanding.
Even state-of-the-art models like Gemini and Claude can completely miss critical user information when it's buried in semantically unrelated past interactions, tanking personalization performance.
LLMs don't automatically win at study screening for software engineering SLRs: their performance is highly variable, sensitive to input data, and not consistently better than classical models.
Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.
Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.
LLMs fail over half the time when asked to perform harmful actions in a simulated robotic health attendant setting, even when fine-tuned on medical data.
LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.
Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.
Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.
Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.
Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.
Document AI pipelines don't work the way you think: quality bottlenecks aren't where you expect, and components don't cascade quality.
LLMs exhibit surprising cross-lingual inconsistencies beyond simple translation errors, revealing divergences in cultural calibration, pragmatic disambiguation, and even institutional referral behavior.
Despite recent advances, sign language translation models still struggle to leverage the full range of linguistic cues, especially non-manual signals like facial expressions.
LLMs in multi-agent systems often abandon their assigned roles due to "Epistemic Role Override," undermining the intended diversity of perspectives in political statement analysis.
Today's best language models can barely make sense of your messy group chats and fragmented digital life, achieving only 19% accuracy on a new benchmark of real-world reasoning.
Complex, multi-step instructions can cause LLMs to completely ignore question content and instead rely on positional shortcuts when asked to underperform, revealing a critical vulnerability in adversarial evaluation.
LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.