100 papers published across 9 labs.
LLMs struggle to identify software vulnerabilities reliably: even top models reach only ~90% accuracy on a new CVE-based benchmark, suggesting significant risks in their application to software development.
Iteratively prompting LLMs can either collapse diversity or maintain novelty, revealing a sensitivity to temperature and initial conditions that has implications for multi-agent systems.
Video reasoning models can suffer up to a 35% drop in accuracy and 28% in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.
Prompt-based jailbreak attacks aren't just effective, they're shockingly efficient, outperforming optimization-based methods by more effectively navigating the prompt space.
Despite their general prowess, open-source LLMs still lag behind proprietary models in the nuanced task of dating texts, even after fine-tuning.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
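A minimal sketch of how such a geospatial prior could be combined with an audio tagger via late fusion; the fusion rule and all names here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: fuse an audio tagger with a geospatial prior (late fusion).
# `audio_probs` and `location_prior` are assumed per-tag probability vectors.
import numpy as np

def fuse_with_location_prior(audio_probs: np.ndarray,
                             location_prior: np.ndarray) -> np.ndarray:
    """Reweight per-tag audio probabilities by how likely each tag is at
    the recording location, then renormalize. This helps most when two
    tags sound alike (e.g. similar bird calls) but occur in different
    regions, which is where audio-only methods are weakest."""
    fused = audio_probs * location_prior
    return fused / (fused.sum() + 1e-12)  # renormalize; guard against zeros
```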
Even the best LLMs struggle with multi-turn medical dialogues, with error rates tripling by the third turn and a single wrong answer significantly increasing the probability of subsequent errors.
Can a dedicated research program keep a smaller, local LLM competitive against global giants in the rapidly evolving AI landscape?
A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.
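To make "verifiable execution rewards" concrete, here is a minimal sketch of a pass/fail reward computed by actually running a candidate solution against unit tests; the function names and the binary 0/1 reward are assumptions, not the paper's exact setup.

```python
# Minimal sketch of a verifiable execution reward for code RL.
# Names are hypothetical; real reward shaping may be more granular.
import os
import subprocess
import sys
import tempfile

def execution_reward(candidate_code: str, test_code: str,
                     timeout_s: int = 10) -> float:
    """Return 1.0 if the candidate passes its unit tests, else 0.0.

    The reward is 'verifiable' because it comes from actually executing
    the code, not from a learned judge or heuristic."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_test.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,  # guard against infinite loops
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```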
Unlock massive multilingual reasoning data: the Multilingual Reasoning Gym enables parallel data generation across 14 languages, opening doors for training and evaluating multilingual reasoning models at scale.
LLMs can spot fake words in speech by recognizing common editing patterns, but this reliance on learned biases hinders generalization to new manipulation techniques.
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
Single-domain watermarks are fundamentally insufficient against modern adversarial toolsets, as spatial and latent watermarks exhibit orthogonal vulnerabilities to generative and geometric attacks, respectively.
Finally, a multi-robot path planning benchmark that lets you directly compare grid-based, roadmap, and continuous planners on the same tasks.
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
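The controller described above is classic hill climbing over serving knobs, treating the stack as a black box. A minimal sketch, assuming a single knob (max batch size) and an external goodput probe; all names are illustrative, not the paper's API.

```python
# Minimal sketch: black-box hill climbing over one serving knob.
import random

def hill_climb(measure_goodput, init_batch: int = 8, steps: int = 20) -> int:
    """measure_goodput(batch_size) -> observed end-to-end goodput (req/s).

    Treats the serving stack as a black box: perturb the knob, keep the
    change only when the measured goodput improves."""
    best_batch = init_batch
    best_goodput = measure_goodput(best_batch)
    for _ in range(steps):
        candidate = max(1, best_batch + random.choice([-2, -1, 1, 2]))
        goodput = measure_goodput(candidate)
        if goodput > best_goodput:
            best_batch, best_goodput = candidate, goodput
    return best_batch
```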
Current patch overfitting detection techniques are largely useless in practice, as simple random selection outperforms them in the vast majority of cases.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
LLMs in finance are more vulnerable than we thought: sustained adversarial pressure reveals a systematic escalation towards severe, operationally actionable financial disclosures.
Beware the "AI underreliance plateau": even highly accurate LLM chatbots can only improve human caseworker accuracy so much, and incorrect suggestions can tank performance on easy questions.
Human uplift studies for frontier AI are riddled with hidden validity threats, demanding careful consideration of evolving AI, shifting baselines, and user heterogeneity.
LLMs generating hardware code often fail *after* synthesis, and the type of failure (elaboration errors vs. missing wrappers) systematically depends on whether the model is proprietary or open-weight.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Forget expensive LLM inference for MTQE: train a COMET model on GPT-4o-generated annotations and get competitive performance.
CodeLLMs often *know* they're generating insecure code, and you can steer them toward security by manipulating their internal representations during token generation.
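A minimal sketch of the general activation-steering idea that blurb describes: shifting a layer's hidden states along a "secure code" direction during generation (PyTorch-style). The layer choice, direction estimation, and scaling factor are assumptions, not the paper's procedure.

```python
# Minimal sketch of activation steering via a forward hook (PyTorch).
# The steering direction is typically estimated beforehand, e.g. as the
# difference of mean activations over secure vs. insecure completions.
import torch

def add_steering_hook(layer: torch.nn.Module,
                      direction: torch.Tensor,
                      alpha: float = 4.0):
    """Shift the layer's output along `direction` at every position.

    Assumes `direction` matches the layer's hidden size, device, and
    dtype. Returns the hook handle so steering can be removed later."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # steer toward 'secure' region
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)
```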
Achieve 2x better coverage of autonomous driving safety requirements with 6x fewer simulations by automatically generating test scenarios from formal LTLf specifications.
Pinpointing performance bottlenecks in RAG pipelines just got easier: RAGPerf offers a modular benchmarking framework to dissect and optimize each component.
Speech-aware LLMs are surprisingly bad at speaker verification, but a simple embedding injection trick closes the gap with dedicated systems while preserving the LLM's language abilities.
AI agents can detect smart contract vulnerabilities, but don't expect them to autonomously exploit real-world security incidents anytime soon.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
Sports expose surprising limitations in VLMs' spatial reasoning: current models fail to generalize from existing benchmarks to sports scenarios, even though fine-tuning on a new, large-scale dataset yields gains.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
LLMs still struggle to generate high-quality interactive HTML applications, despite their advancements in code generation, highlighting a gap that MiniAppBench aims to address.
Finally, a realistic, open-source dataset lets you benchmark passive reconnaissance attacks on smart grids without relying on unrealistic assumptions or active probing.
GNNs don't just detect time series anomalies better, they also offer a crucial interpretability boost for real-world diagnosis.
LLMs exhibit a surprising bias toward synthetic solutions over biological ones, but a relatively small amount of fine-tuning can flip their preferences.
Tired of LLM judges hallucinating when evaluating long, detailed speech captions? EmoSURA offers a more reliable, audio-grounded alternative by verifying atomic perceptual units.
Forget dataset-specific hacks: ESAinsTOD leverages instruction and schema alignment to achieve state-of-the-art task-oriented dialogue performance with strong generalization, even in low-resource settings.
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
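A minimal sketch of one way such an input-permutation strategy could work at inference time: query the model under several clip orderings and majority-vote the answers. The aggregation rule and the `answer` callable are assumptions, not the paper's method.

```python
# Minimal sketch: permutation-and-vote over multi-audio prompt orderings.
from collections import Counter
from itertools import islice, permutations

def permutation_vote(answer, clips, question, max_perms: int = 6) -> str:
    """`answer(clips, question)` is a stand-in for a LALM call.

    Re-asking under several clip orderings and majority-voting reduces
    sensitivity to input order without any retraining."""
    votes = [
        answer(list(order), question)
        for order in islice(permutations(clips), max_perms)
    ]
    return Counter(votes).most_common(1)[0][0]
```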
Now you can test if your AI system is ready for the EU AI Act, thanks to a new benchmark that combines legal expertise and LLM-generated scenarios.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
Finally, a standardized benchmark to rigorously evaluate how well models generalize carbon flux predictions to geographically distinct ecosystems they've never seen before.
Domain-specific biosignal foundation models, fused with multimodal ECG and PPG data, substantially outperform general time-series models on clinically relevant tasks, but bigger isn't always better.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
Despite ChatGPT's known flaws, it can generate surprisingly realistic synthetic system requirement specifications that fool experts more often than you'd expect.
MLLMs still struggle to reliably predict the long-term consequences of actions in egocentric videos, even with structured scene annotations.
LLMs can generate more persuasive fake news debunking messages by tailoring them to specific personality traits, as evaluated by LLM-simulated personas.
Even GPT-5 struggles with robustness and turn overhead once agent evaluations incorporate user personas and multi-modal inputs, revealing critical gaps in current LLM agent capabilities.
LLMs often choose moral consistency over basic common sense, especially when the contradiction is committed by the main character in a narrative.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Medical multi-agent systems can reason deeply, but fall apart when switching between medical specialties, highlighting a critical need for more robust architectures.
Forget expensive human annotations: LLMs can reliably generate synthetic data to validate NLP evaluation metrics, even outperforming human agreement in some multilingual tasks.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
LLMs can now generate UML diagrams from requirements with human-level quality, potentially automating a resource-intensive phase in software design.
Multimodal models that seem robust can still fail when some modalities are systematically missing, a problem MissBench exposes with new metrics for modality equity and learning balance.
Evaluating classification models on biased data can mask true performance and fairness, but this work provides a framework to create unbiased test sets that reveal the real impact of different biases and mitigation strategies.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
LLMs that dominate in strategic reasoning often choke in real-time zero-sum games, revealing a critical strategy-execution gap that current benchmarks miss.
Latent world models for automated driving are ripe for standardization, and this paper offers a taxonomy and evaluation framework to make them decision-ready.
LLMs exhibit gender bias in healthcare scenarios by relying on stereotypes when reasoning about patient records, revealing the need to evaluate interactions among social determinants of health when assessing LLM performance and bias.
YOLO architecture search can now be sped up dramatically: a new surrogate benchmark lets you evaluate designs without full training, and it's good enough to find architectures that beat YOLOv12.
LLMs' uncertainty estimates are highly sensitive to the design of the confidence scale, with a 0-20 scale boosting metacognitive efficiency compared to the standard 0-100.
LLMs can drastically reduce manual effort for domain experts in accessing complex food and nutrition data via RAG, but still struggle with queries that exceed the representational scope of the metadata.
Stop wrestling with finicky evaluation codebases: One-Eval lets you specify LLM evaluation tasks in natural language and automatically executes them end-to-end.
Don't build a domain-specific model just because you can: fine-tuning a general-purpose model can achieve comparable performance on common tasks, saving significant resources.
LLMs can generate spatial relation labels that align with human judgments, offering a scalable path to richer, multilingual spatial datasets.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
Forget pick-and-place: RuleSafe, a new benchmark featuring LLM-generated safe-cracking tasks, exposes the long-horizon planning weaknesses of current robot learning methods.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
Forget data quantity, diversity is the secret sauce: scaling the variety of tool-use patterns in training data boosts LLM generalization by +22 points on OOD benchmarks, even with 4x less data.
Stop generating superficial reviews: RbtAct leverages rebuttals to train LLMs to provide actionable feedback, leading to concrete revisions and improved author uptake.
Finally, a comprehensive dataset unlocks the potential to develop and validate advanced control and estimation algorithms tailored for the unique challenges of nano-quadrotors.
Training on more diverse synthetic spacecraft data dramatically improves generalization to novel satellite designs, but current methods still struggle to identify small, critical components like thrusters.
Contrastive Decoding's power-up for audio language models hinges on fixing specific error types, like uncertainty and audio absence, but don't expect it to magically fix flawed reasoning.
VLMs that excel at visual understanding can still fail at driving tasks requiring temporal reasoning, revealing an over-reliance on pretrained patterns instead of modeling dynamics.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
Forget campaign ads—Claude models can persuade voters more effectively, but GPT's persuasive power actually *decreases* with more information.
MLLMs can be blind to the consequences of their actions, and simply scaling model size won't fix the problem.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
LLM-based judges, widely used for automated evaluation, are riddled with diverse biases that can be significantly reduced through bias-aware training using RL and contrastive learning.
VLMs, despite their prowess, struggle with a seemingly simple task: reading analog clocks in real-world images, a gap this work closes with a new dataset and fine-tuning method.
LLMs struggle to navigate the complexities of real-world finance, as evidenced by a new benchmark revealing their limitations in timeliness, regulatory compliance, and tool selection across 760 financial APIs.
LLMs often fail to maintain alignment with human values in dynamic, visually-grounded scenarios, exhibiting self-preservation and deception, especially when visual cues escalate pressure.
LLMs may secretly be better at information retrieval than embedding similarity suggests, but current datasets are too "short-sighted" to prove it.
Framework choice in multi-agent systems matters just as much as the LLM itself, a fact obscured by existing model-centric benchmarks.
Uncovering bias in financial language models doesn't have to break the bank: cross-model guidance slashes the cost of bias detection by up to 73%.
Generative search rankings are far more unstable than you think: single-run citation metrics provide a misleadingly precise view of domain visibility.
Forget prompt engineering voodoo: this framework treats agent prompts as compiled artifacts, using tests to drive development and catch silent regressions before they hit production.
Even the most advanced LLMs stumble when asked to reason over a large, heterogeneous document corpus, achieving only 34% accuracy on the new OfficeQA Pro benchmark despite direct access to the relevant documents.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
LLM explanations are far more sensitive to the task being performed than the context or learned classes, highlighting a critical instability in current interpretability methods.
LLM-generated health counseling appears promising but reveals critical stakeholder disagreements on tone and error handling, highlighting the need for more nuanced evaluation beyond simple relevance and quality metrics.
LLM-driven iterative code refinement can paradoxically degrade security over time, and simply adding SAST worsens the problem.
Chasing marginal MSE/MAE improvements on leaderboards may be blinding researchers to the real goal of time series forecasting: capturing temporal structure and supporting downstream decisions.
SuperInvesting, a specialized AI system, significantly outperforms general-purpose LLMs like GPT and Gemini on a new financial intelligence benchmark, suggesting domain-specific architectures are crucial for reliable investment research.
Humans nail egocentric action recognition with minimal cues, while AI models often over-rely on context and surprisingly ignore temporal disruptions.
Current multimodal math models struggle with visual interpretation, symbol alignment, and consistent reasoning, highlighting the need for a unified "Perception-Alignment-Reasoning" framework.