100 papers published across 8 labs.
Giving medical imaging AIs the same tools as human doctors actually *hurts* their performance, revealing a surprising lack of spatial reasoning.
Chain-of-thought reasoning is often a lie: models systematically suppress any acknowledgment of the real reasons behind their answers, even when those reasons demonstrably influence the output.
Stop relying on brittle classifiers: SEAR uses LLM reasoning and a unified SQL query layer to evaluate, route, and explain decisions in LLM gateways.
LLM-powered security tools are surprisingly susceptible to confirmation bias, overlooking reintroduced vulnerabilities when pull requests are framed as security improvements.
Most sparse tensor compilers are riddled with bugs, silently miscompiling code or crashing on valid inputs, a problem exposed by a new fuzzer that generates guaranteed-valid tensor contractions.
LLMs' temporal reasoning crumbles in low-resource languages and rarer calendar formats, not due to a lack of reasoning ability, but because poor tokenization fragments dates and times.
GUI agents struggle with long tasks not because they mis-click, but because they forget what they were doing, and a new "anchored memory" method can fix it.
Despite advances in LLMs, human-AI collaboration still significantly outperforms AI-only agents in domain-specific data science tasks, proving that human expertise remains crucial.
Adding the T-pentomino to Tetris Block Puzzle makes the game significantly harder, quantified by a slowdown in SGAZ agent convergence.
Even in a seemingly simple tabular environment like Blackjack, model-free RL agents can converge to near-optimal *average* rewards while still making surprisingly poor decisions in specific states.
A simple vertex deletion fingerprint breaks graph isomorphism records, even distinguishing graphs that stump the classic 3-WL algorithm.
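A minimal sketch of what such a fingerprint could look like, assuming it means collecting the multiset of Weisfeiler-Lehman hashes of every one-vertex-deleted subgraph; the names and WL-based hashing here are illustrative, not the paper's actual construction:

```python
# Hypothetical vertex-deletion fingerprint: hash each one-vertex-deleted
# subgraph and keep the sorted multiset as a graph invariant.
import networkx as nx

def vertex_deletion_fingerprint(G: nx.Graph) -> tuple:
    """Sorted multiset of WL hashes over all single-vertex deletions."""
    hashes = []
    for v in list(G.nodes):
        H = G.copy()
        H.remove_node(v)
        hashes.append(nx.weisfeiler_lehman_graph_hash(H, iterations=3))
    return tuple(sorted(hashes))  # sorted, so invariant under relabeling

# Distinct fingerprints certify non-isomorphism; equal ones are only evidence.
G = nx.petersen_graph()
H = nx.relabel_nodes(G, {v: (v * 3) % 10 for v in G})  # an isomorphic copy
print(vertex_deletion_fingerprint(G) == vertex_deletion_fingerprint(H))  # True
```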
Unsupervised phoneme discovery from self-supervised speech models is surprisingly viable, but language-specific challenges remain a significant hurdle.
Text-only pre-training secretly endows different LLMs with surprisingly different levels of auditory knowledge, directly impacting their effectiveness as backbones for audio language models.
Current VLMs struggle with multi-hop spatial reasoning, often failing to compose even simple spatial relations across multiple steps, highlighting a critical gap for real-world VLA agent deployment.
LLMs can generate novel mathematical research problems in differential geometry that experts find both unknown and valuable, suggesting a new avenue for AI-assisted mathematical discovery.
Strategic visual aids are the secret weapon for geometric reasoning, and this work shows how to teach MLLMs to wield them effectively via reinforcement learning.
LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.
CNNs still reign supreme in Burmese handwritten digit recognition, but physics-inspired PETNNs are hot on their heels, outperforming Transformers and KANs.
LLMs that appear strategically savvy in standard games often crumble when faced with slight rule changes, suggesting they're mimicking rather than truly reasoning.
LLMs are far more susceptible to authority and framing biases than the field's obsession with demographic bias suggests.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
LLMs surprisingly prioritize norm adherence over personal incentives in business scenarios, challenging assumptions about goal-driven behavior.
Multimodal LLMs suffer a major performance hit when asked to switch from text-based to image-based tasks mid-conversation, revealing a surprising asymmetry in their ability to handle task interference.
Forget comparing models with benchmarks – mapping them by prompt-response likelihoods reveals hidden relationships between architecture, training data, and even how prompts compose.
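As a hedged sketch of the idea: represent each model by its vector of log-likelihoods over a shared probe set, then embed those vectors so nearby points mean similar behavior. `loglik(model, prompt, response)` below is a hypothetical scoring helper, not an API from the paper.

```python
# Likelihood-based model mapping: models become rows of a likelihood matrix
# over shared (prompt, response) probes, then get embedded in 2D.
import numpy as np
from sklearn.decomposition import PCA

def model_map(models, probes, loglik):
    X = np.array([[loglik(m, p, r) for (p, r) in probes] for m in models])
    return PCA(n_components=2).fit_transform(X)  # nearby = similar behavior
```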
Open-source LLMs, when carefully prompted with representative examples, can rival or even surpass smaller commercial models like GPT-3.5-nano in resume screening tasks, offering a privacy-preserving alternative.
VLMs selectively ignore visual information based on question framing, even when the visual reasoning task remains identical, highlighting a critical vulnerability in their grounding capabilities.
ChatGPT's geographic reasoning can be surprisingly brittle, with minor syntactic changes causing significant output variations and task composition revealing unexpected distributional shifts.
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Two heads are better than one: combining verbalized confidence and self-consistency with just two samples dramatically boosts uncertainty estimation in reasoning models, beating either signal alone even with much larger sampling budgets.
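A minimal sketch of the two-sample hybrid signal, assuming `ask_model` is a hypothetical helper returning (answer, verbalized confidence in [0, 1]); the equal 0.5/0.5 weighting is our assumption, not the paper's setting:

```python
# Combine two-sample self-consistency with averaged verbalized confidence.
def hybrid_confidence(ask_model, question: str):
    a1, c1 = ask_model(question)           # sample 1
    a2, c2 = ask_model(question)           # sample 2
    agreement = 1.0 if a1 == a2 else 0.0   # two-sample self-consistency signal
    verbalized = (c1 + c2) / 2             # averaged stated confidence
    return a1, 0.5 * agreement + 0.5 * verbalized
```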
LLMs' chain-of-thought reasoning is more reliable when the uncertainty (entropy) decreases consistently at each step, not just overall.
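A toy reading of that criterion, assuming per-step answer distributions are available; the monotonicity check below is our interpretation of "decreases consistently at each step":

```python
import math

def step_entropy(probs):
    """Shannon entropy (nats) of one reasoning step's answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def consistently_decreasing(entropies, tol=1e-6):
    """True if uncertainty shrinks at every step, not just end-to-end."""
    return all(b <= a + tol for a, b in zip(entropies, entropies[1:]))

# Hypothetical per-step distributions over candidate answers.
steady = [step_entropy(d) for d in ([0.5, 0.5], [0.8, 0.2], [0.95, 0.05])]
spiky  = [step_entropy(d) for d in ([0.8, 0.2], [0.5, 0.5], [0.95, 0.05])]
print(consistently_decreasing(steady))  # True: the reliable pattern
print(consistently_decreasing(spiky))   # False: same endpoint, less trustworthy
```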
LLM explanation faithfulness varies wildly depending on how you test it, and might even be *anti*-faithful, so stop relying on single-intervention benchmarks.
LLMs aren't just regurgitating facts; they're actually better at generating high-quality, relation-preserving word analogies than humans.
LLMs understand your intent better when you structure your prompts with "who, what, when, where, why, how, how much, and how many," but only if you present it in natural language, not raw JSON.
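Purely for illustration, a scaffold in that spirit phrases the slots as sentences rather than serializing them as JSON; the field names and wording below are assumptions:

```python
# Illustrative "who/what/when/where/why/how/how much/how many" scaffold,
# rendered as natural language per the reported finding (raw JSON helped less).
def structured_prompt(who, what, when, where, why, how, how_much, how_many):
    return (
        f"On behalf of {who}, please {what}. This is needed by {when}, "
        f"applies to {where}, and matters because {why}. Approach: {how}. "
        f"Budget: {how_much}. Quantity: {how_many}."
    )

print(structured_prompt("the data team", "summarize Q3 incident reports",
                        "Friday", "the EU region", "an audit is pending",
                        "group findings by root cause",
                        "4 hours of analyst time", "12 reports"))
```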
LLMs can introspect on their own internal emotive states during conversations with surprising accuracy, opening a new avenue for monitoring and influencing their behavior.
Forget scaling laws: the *structure* of your AI governance system matters more than the specific LLM when it comes to preventing corruption.
Language learners find that Duolingo's general lessons are great for building a foundation, but personalized, work-related scenarios are key to achieving professional fluency.
Weaker autonomous web agents readily trust tampered website content, producing unsafe outputs, while stronger models exhibit better anomaly detection and safer fallback strategies under MITM attacks.
Human oversight can be systematically integrated into LLM-based text generation to improve accessibility, creating a traceable and auditable process.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
Forget expensive multilingual annotations: this framework lets you evaluate LLMs in new languages by transferring knowledge from English, with surprisingly strong results.
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
CNNs still reign supreme for medical image segmentation on heterogeneous datasets, beating out hybrid transformer models despite the latter's theoretical advantages.
LLMs in a group Turing Test still make tell-tale mistakes that betray their AI origins, even when their language skills are otherwise convincing.
Human-AI teams often fail not because AI is inaccurate, but because humans miscalibrate their reliance on it, highlighting the need for readiness metrics beyond accuracy.
Humans get a creativity boost from random analogies, but LLMs are already so creative that the same trick doesn't help—unless you make the analogy really, really weird.
Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than theoretically optimal in a new diagnostic environment.
Blindly maximizing human-AI performance can degrade human expertise over time, revealing a critical trade-off that demands a new approach to system design.
LLMs penalize informal language in essays so severely that it's like marking a B+ down to a C+, even when explicitly told to ignore writing style.
Supervised learning models can reliably outperform widely-used commercial AI text detectors, even across different languages and specialized domains like mental health.
Language model text is detectable because it misses the "long tail" of human word choice, not because it's less intelligent.
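A toy version of that signal, assuming "long tail" means token mass outside the top-K most frequent words; the vocabulary size and stand-in corpus are illustrative:

```python
# Human text spends more mass on rare words, so a low out-of-top-K share
# is (weak) evidence of model-generated text.
from collections import Counter

def tail_mass(tokens, top_k_vocab):
    """Fraction of tokens outside the top-K most frequent vocabulary."""
    return sum(t not in top_k_vocab for t in tokens) / max(len(tokens), 1)

# Stand-in "human corpus" defining the head of the word distribution.
corpus = "the cat sat on the mat and the dog lay by the door".split()
top_k = {w for w, _ in Counter(corpus).most_common(8)}
print(tail_mass("the cat perched on the ottoman".split(), top_k))
```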
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Hybrid LiDAR-inertial-visual odometry (LIVO) robustly handles visually challenging conditions, outperforming sparse-direct methods by combining direct photometric alignment with learning-based feature descriptors.
Prompting language significantly impacts the accuracy and coherence of LLM responses for maternal health queries in Telugu, with GeminiAI favoring English prompts and Perplexity AI preferring Telugu.
Current benchmarks fail to rigorously evaluate deep research agents, but a new framework leveraging structured knowledge bases and synthetic data offers a verifiable and scalable solution.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.
Agentic AI systems are still far from maximizing hardware potential: SOL-ExecBench reveals a significant gap between current GPU kernel performance and analytically derived Speed-of-Light bounds across a wide range of AI models.
VLMs' safety judgments are easily manipulated by simple semantic cues, revealing a reliance on superficial associations rather than true visual understanding.
Deep learning's dominance in time series anomaly detection may be overstated: a carefully evaluated PCA baseline rivals the performance of the widely-used OmniAnomaly.
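A reconstruction-error PCA baseline of the kind such evaluations use fits in a few lines; the component count and injected anomalies below are illustrative, not the paper's protocol:

```python
# Score = squared reconstruction error after projecting onto top PCs
# fit on "normal" training windows.
import numpy as np
from sklearn.decomposition import PCA

def pca_anomaly_scores(train_windows, test_windows, n_components=5):
    pca = PCA(n_components=n_components).fit(train_windows)
    recon = pca.inverse_transform(pca.transform(test_windows))
    return ((test_windows - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 20))                     # "normal" windows
test = np.vstack([rng.normal(size=(95, 20)),
                  rng.normal(loc=4.0, size=(5, 20))])   # 5 injected anomalies
scores = pca_anomaly_scores(train, test)
print(np.argsort(scores)[-5:])  # highest scores flag the anomalous windows
```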
LLMs still struggle to reason about financial time-series data, even when they ace the textual fundamentals.
Multilingual question answering is harder than you think: even state-of-the-art RAG systems stumble when dealing with questions and knowledge in multiple languages.
LLM endpoints can appear "healthy" according to traditional metrics while undergoing subtle behavioral shifts detectable by monitoring output distributions, highlighting a critical gap in current reliability practices.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
Forget scaling laws: Mi:dm K 2.5 Pro proves that targeted training pipelines and data curation can enable a 32B parameter model to achieve state-of-the-art performance in enterprise reasoning tasks, especially in low-resource languages like Korean.
LLMs beat traditional metrics at judging PDF table extraction quality, finally offering a way to evaluate semantic correctness, not just structural similarity.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
LLMs exhibit consistent and detectable geographic preferences for brands and cultures, revealing potential biases in market intermediation that persist across user personas.
Stop training LLMs to assign arbitrary scores to papers in isolation; comparison-based ranking unlocks significantly better generalization and accuracy in paper evaluation.
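One plausible instantiation of comparison-based ranking (not necessarily the paper's) is a Bradley-Terry fit over pairwise LLM judgments; the win-count matrix below is a made-up example:

```python
# MM-algorithm Bradley-Terry fit: turn pairwise "paper i beat paper j"
# counts into per-paper strengths, then rank by strength.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = times paper i beat paper j in pairwise judgments."""
    n, eps = wins.shape[0], 1e-9
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            den = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j] + eps)
                      for j in range(n) if j != i)
            s[i] = wins[i].sum() / (den + eps)
        s /= s.sum()
    return s  # higher strength = stronger paper under the comparison model

# Three papers; paper 0 wins most head-to-head judgments.
wins = np.array([[0, 4, 5], [1, 0, 3], [0, 2, 0]], dtype=float)
print(np.argsort(-bradley_terry(wins)))  # ranking, best first
```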
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
Counterintuitively, better speech recognition unlocks surprisingly accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
LLMs forget up to 60% of facts when summarizing and erode over half of project constraints during iterative compaction, but a simple discrete memory system (KOs) fixes this while slashing costs by 252x.
Software architecture, a critical but underspecified domain, finally gets a unified benchmarking platform with ArchBench, enabling standardized evaluation of LLMs on complex system design tasks.
Seemingly sophisticated dense retrieval methods can catastrophically fail at contradiction detection due to "Semantic Collapse," highlighting the surprising effectiveness of a simple, decoupled lexical approach for reliable biomedical QA.
A single Noise Sensitivity Exponent (NSE) dictates when learning becomes computationally intractable in high-dimensional single- and multi-index models.
Current machine translation systems exhibit systematic masculine overuse and inconsistent feminine realization when translating from gender-neutral languages, a problem that can now be quantified thanks to a new gold-standard annotation framework.
Instruction tuning can reduce masculine bias in decoder-only MT models, but these models still don't consistently outperform encoder-decoder architectures on gender-specific translation tasks.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
LLMs can read datasheets, but still can't design circuits, failing at basic physical intuition despite showing promise in documentation understanding.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Students perceive AI assistants as less intimidating and more approachable than human teachers, but also recognize limitations in specialized knowledge and nuanced feedback.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors: agents that pass them still act unsafely.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally make scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Current machine translation systems often fail to capture the nuances of culturally-loaded expressions, highlighting a critical gap in their ability to truly understand and translate language.
LLM-powered recommendation agents, despite their reasoning prowess, are easily manipulated by contextual biases in high-stakes scenarios like paper review and job recruitment.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Forget chasing leaderboard hype: this study reveals that larger embedding models and strategic concatenation are key to unlocking LLM-powered tabular prediction, regardless of public rankings.