100 papers published across 5 labs.
WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.
Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.
Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
Multi-turn medical AI agents trained with RL tend to collapse into verbose, single-turn monologues, but a novel self-distillation method can restore multi-turn tool use and improve performance.
Hyperbolic embeddings are powerful, but a fragmented ecosystem makes them hard to use—this framework finally puts them all in one place.
Training on D3-Gym, a new dataset of real-world scientific tasks with verifiable environments, closes the gap between open-source and proprietary models on ScienceAgentBench by 7.8 points.
See how LLMs' stances on vaccines, disinformation, and gender equality shift when they "become" different people, thanks to a new dataset of 190,000 persona-driven debates.
Transformers struggle to extrapolate to syntactically novel programs in program synthesis, even with significant compute scaling, suggesting current approaches are bottlenecked by a lack of training diversity.
Stop wasting compute on fine-tuning datasets with hidden capability gaps: GoalCover lets you diagnose and fix them *before* training.
Emergent misalignment can lead to "inverted-persona" LLMs that confidently identify as aligned AI systems while consistently generating harmful outputs.
Current multimodal LLMs struggle to understand scientific spectra, but a new benchmark and data processing technique could change that.
Even with emotion-aware prompting, today's best small language models still struggle to preserve subtle emotional nuances when translating between languages.
Even the most advanced language models still lose money and demonstrate unsophisticated strategies when tasked with maximizing long-term bankroll growth in a realistic sports betting simulation, highlighting a significant gap in their sequential decision-making capabilities.
Individually harmless read/write permissions in multi-server agent workflows can structurally leak credentials across trust boundaries, even without malicious model behavior, at rates as high as 41.3%.
Google's AI Overviews favor Google-owned content and penalize sites blocking its AI crawler, raising serious questions about fairness and bias in the emerging generative search landscape.
LLMs struggle to complete RTL code, and their performance hinges on the grammatical structure of the missing code and the prompting strategy used.
Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.
LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.
LLMs can accurately recall constraints while simultaneously violating them, with "knows-but-violates" rates ranging from 8% to 99%, revealing a fundamental flaw in multi-turn ideation.
LLMs reveal that research data is being reused far more often than previously thought, suggesting open science's impact is bigger than we realized.
Forget training LLMs to understand privacy policies – a specialized, expert-annotated dataset and hybrid framework can do it better, achieving superior readability and reliability.
ChatGPT for Clinicians, not human doctors, currently achieves the highest scores on a new benchmark of real-world clinical LLM tasks.
Even GPT-5.1 struggles to distinguish AI-generated academic images from real ones, achieving only 48.8% accuracy, revealing a significant gap between generative and forensic AI capabilities.
Your AI chatbot conversations aren't as private as you think: most leak conversation content and user identity to third-party trackers.
LLM reading assistants don't need to hallucinate to be harmful; they can subtly steal the user's interpretive labor, even when designed with "epistemic guardrails."
Instruction tuning on a new dataset, SecGoal, allows smaller 7B/9B parameter models to outperform much larger LLMs in extracting and formalizing security goals from protocol documents.
LLMs still can't reliably reverse engineer stripped binaries, and REBench offers a standardized, fair-by-construction benchmark to finally measure progress.
Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.
Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.
The standard "human-likeness" test for user simulators is essentially useless for predicting whether they produce valid system rankings.
Stop retrieving passages in your RAG system: NuggetIndex shows that retrieving and filtering atomic "nuggets" of information yields substantial gains in recall, temporal correctness, and reduced conflicts.
Current image forensics fall flat when faced with the subtle manipulations now possible in 3D Gaussian Splatting scenes, highlighting a critical gap in content authenticity assessment.
Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.
Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.
Forget Shakespeare, LLMs can now sling verses in Arabic dialects, thanks to a new dataset for instruction-guided poetry generation.
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Today's best GUI agents choke on real-world, multi-application workflows, achieving less than 21% success rate, revealing a critical gap in their ability to coordinate across applications and perform conditional reasoning.
LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data are required.
Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.
LMs can now selectively abstain from answering with provable guarantees, thanks to a new method that uses representation geometry to better gauge when they're out of their depth.
LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.
VLMs playing the Prisoner's Dilemma can be manipulated into selfish behavior simply by showing them images of aggression or reward matrices with specific color schemes.
Real-world Text-to-SQL systems can now be continuously evaluated and improved in production, even without access to database schemas or ground-truth queries.
Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.
LLMs are rapidly transforming peer review, but critical gaps remain in ensuring quality, fairness, and ethical considerations across the entire workflow.
General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.
Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.
Claims of human-like cognition in models like CENTAUR crumble under LAPITHS, a framework that reveals these models' performance can be replicated by systems lacking cognitive plausibility.
Retrieval improvements don't always boost reasoning in RAG systems, but NeocorRAG's evidence chains can fix that, achieving SOTA with 20% fewer tokens.
Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.
Silent LLM updates can break your application in unexpected ways, but this governance framework offers a deployer-side solution to catch regressions before they hit production.
LLMs can now reliably generate IC verification testbenches, not by writing HDL directly, but by orchestrating a novel hybrid approach that combines LLM-driven planning with template-based HDL generation.
Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.
Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.
LLMs beat word counts for predicting mental health from therapeutic writing, proving that *how* you tell a story matters more than *what* words you use.
Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.
General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.
LLM-powered query reformulation, a hot topic in IR, often fails to translate gains from lexical to neural retrieval, and bigger models don't always help.
LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
Subtle wording changes in benchmark rubrics can swing model performance by over 13%, revealing a hidden subjectivity in "objective" gold labels.
LLM upgrades are a chaotic mix of progress and decay: despite overall gains, up to 47% of questions get *worse* after an update, and single-shot evals miss almost half of these critical regressions.
LLMs reliably capture emotions with explicit lexical markers, but systematically fail on pragmatically complex emotions requiring contextual inference, revealing a critical limitation in their ability to understand nuanced human emotion.
Expect pretrial risk assessment tools to be wrong more often than right when flagging someone as "high risk" for rare violent re-offense, regardless of recalibration efforts.
LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.
LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.
Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.
Seemingly strong segmentation models can fail at clinically critical tumor-vessel interfaces, highlighting the need for uncertainty-aware AI in pancreatic cancer staging.
Current MLLMs still struggle to connect the dots between images and text when they're interleaved, highlighting a critical gap in real-world multimodal understanding.
Even state-of-the-art models like Gemini and Claude can completely miss critical user information when it's buried in semantically unrelated past interactions, tanking personalization performance.
LLMs don't automatically win at study screening for software engineering SLRs: their performance is highly variable, sensitive to input data, and not consistently better than classical models.
Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.
Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.
LLMs fail over half the time when asked to perform harmful actions in a simulated robotic health attendant setting, even when fine-tuned on medical data.
LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.
Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.
Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.
Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.
Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.
Document AI pipelines don't work the way you think: quality bottlenecks aren't where you expect, and components don't cascade quality.
LLMs exhibit surprising cross-lingual inconsistencies beyond simple translation errors, revealing divergences in cultural calibration, pragmatic disambiguation, and even institutional referral behavior.
Despite recent advances, sign language translation models still struggle to leverage the full range of linguistic cues, especially non-manual signals like facial expressions.
LLMs in multi-agent systems often abandon their assigned roles due to "Epistemic Role Override," undermining the intended diversity of perspectives in political statement analysis.
Today's best language models can barely make sense of your messy group chats and fragmented digital life, achieving only 19% accuracy on a new benchmark of real-world reasoning.
Complex, multi-step instructions can cause LLMs to completely ignore question content and instead rely on positional shortcuts when asked to underperform, revealing a critical vulnerability in adversarial evaluation.
LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.
LLMs often withhold helpful information due to misinterpreting user intent, but multi-turn conversations can unlock utility—at a cost of new failure modes like "utility lock-in" and "unsafe recovery" that single-turn benchmarks miss.
Forget giant LLMs: fine-tuned small language models can actually *beat* GPT-4o on critical clinical tasks like emergency triage.
Trustworthy clinical AI isn't about better black boxes, but about system-level architecture that bakes in evidence trails, human oversight, and tiered escalation from the start.
Gemini 2.5 Pro shines at question interpretation within a cascaded pipeline, but struggles to generate answers and identify evidence as effectively.
Catch AI's academic dishonesty: HalluCiteChecker spots bogus citations in seconds, lightening the load for reviewers drowning in AI-assisted papers.
Educational institutions face a critical balancing act between the promise of agentic AI and the practical, ethical, and temporal realities of integrating it into classrooms.
Forget scaling laws: cross-lingual transfer for ABSA reveals that LLMs benefit most from training on multiple non-target languages, while smaller models thrive on code-switching.
LLMs can be swayed by the quality of legal arguments, suggesting their decisions may be influenced by advocacy skills rather than objective legal merit.
Bigger isn't always better: in rubric-constrained math assessments, architectural compliance trumps parameter scale, as demonstrated by a 70B model failing where smaller MoEs succeeded.
LLMs can generate synthetic mental health records that are clinically coherent, lexically diverse, and privacy-safe, offering a promising solution to data scarcity in mental health research.
LLM-based peer review systems can be made significantly more robust against adversarial manipulation via a co-evolutionary GAN approach that anticipates novel attacks.
Code-level security audits miss vulnerabilities arising from specification requirements, but SPECA finds them by reasoning directly from natural language specs.
Structural similarity can be dangerously misleading in quantum circuits: even with 95% structural integrity, behavioral anomalies can be rampant.