100 papers published across 8 labs.
LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
LLMs exhibit consistent and detectable geographic preferences for brands and cultures, revealing potential biases in market intermediation that persist across user personas.
Stop training LLMs to assign arbitrary scores to papers in isolation; comparison-based ranking unlocks significantly better generalization and accuracy in paper evaluation.
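As a generic illustration only (a minimal sketch, not this paper's specific method), comparison-based evaluation replaces isolated absolute scores with pairwise judgments that are aggregated into a ranking; the `judge` callable below is a hypothetical stand-in for an LLM pairwise-comparison call:

```python
from itertools import combinations

def rank_by_pairwise_wins(papers, judge):
    # judge(a, b) returns whichever paper the comparator prefers;
    # here it stands in for an LLM pairwise-comparison call.
    wins = {p: 0 for p in papers}
    for a, b in combinations(papers, 2):
        wins[judge(a, b)] += 1
    # Rank by number of pairwise wins (highest first).
    return sorted(papers, key=wins.get, reverse=True)
```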
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
Counterintuitively, better speech recognition unlocks remarkably accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
LLMs forget up to 60% of facts when summarizing and erode over half of project constraints during iterative compaction, but a simple discrete memory system (KOs) fixes this while slashing costs by 252x.
Software architecture, a critical but underspecified domain, finally gets a unified benchmarking platform with ArchBench, enabling standardized evaluation of LLMs on complex system design tasks.
Seemingly sophisticated dense retrieval methods can catastrophically fail at contradiction detection due to "Semantic Collapse," highlighting the surprising effectiveness of a simple, decoupled lexical approach for reliable biomedical QA.
A single Noise Sensitivity Exponent (NSE) dictates when learning becomes computationally intractable in high-dimensional single- and multi-index models.
Current machine translation systems exhibit systematic masculine overuse and inconsistent feminine realization when translating from gender-neutral languages, a problem that can now be quantified thanks to a new gold-standard annotation framework.
Instruction tuning can reduce masculine bias in decoder-only MT models, but these models still don't consistently outperform encoder-decoder architectures on gender-specific translation tasks.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
LLMs can read datasheets, but still can't design circuits, failing at basic physical intuition despite showing promise in documentation understanding.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Students perceive AI assistants as less intimidating and more approachable than human teachers, but also recognize limitations in specialized knowledge and nuanced feedback.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors, which agents continue to exhibit even after passing the benchmark.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally makes scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Current machine translation systems often fail to capture the nuances of culturally-loaded expressions, highlighting a critical gap in their ability to truly understand and translate language.
LLM-powered recommendation agents, despite their reasoning prowess, are easily manipulated by contextual biases in high-stakes scenarios like paper review and job recruitment.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Forget chasing leaderboard hype: this study reveals that larger embedding models and strategic concatenation are key to unlocking LLM-powered tabular prediction, regardless of public rankings.
LLMs can't reason their way through Rust verification, struggling to complete proofs even with substantial hints, revealing a critical gap in their ability to handle the rigorous demands of secure software development.
LLM-powered trading agents can still achieve a Sharpe ratio of 1.40 even when completely blindfolded to ticker symbols and company names, suggesting genuine understanding of market dynamics.
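For context on the metric cited above, here is a minimal sketch of the standard annualized Sharpe ratio (assuming daily returns and a constant risk-free rate; this is the textbook formula, not the paper's evaluation pipeline):

```python
import numpy as np

def annualized_sharpe(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    # Sharpe ratio: mean excess return divided by its volatility,
    # scaled by sqrt(periods per year) to annualize daily figures.
    excess = np.asarray(daily_returns, dtype=float) - daily_risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)
```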
Finally, a rigorous RL benchmark: generate environments with *provably* optimal policies, enabling controlled algorithm evaluation against ground truth.
LLMs don't just change *how* we write, they subtly distort *what* we mean, leading to blander, less insightful, and potentially biased communication.
LLMs can mimic human lexical patterns, but larger models act like stereotypical humans, sacrificing diversity for typicality in word associations, a trade-off tunable by temperature.
Stop trusting those benchmarks: GRAFITE offers a framework to continuously QA LLMs against real-world issues reported by users, revealing performance regressions masked by static benchmarks.
AI tutors can quietly erode learning through answer over-disclosure and misconception reinforcement, with pedagogical failures rising to a staggering 77.8% in multi-turn dialogues.
AI-generated text detectors that seem perfect in the lab fall apart in the real world, with no single method generalizing across domains or even different LLMs.
Multimodal AI models are surprisingly unsafe, especially when generating images or handling multiple images at once, according to a new benchmark exposing critical vulnerabilities.
Stop chasing leaderboard gains on generic benchmarks: PJB reveals that domain-specific weaknesses in person-job retrieval far outweigh the benefits of general model upgrades, and that query understanding modules can actually hurt performance.
Training LLMs to reconstruct arguments boosts their critical thinking abilities across diverse tasks, suggesting a promising new direction for imbuing reasoning skills.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Grey-box fuzzing of LLM agents, guided by tool invocation sequences, reveals significantly more prompt injection vulnerabilities and malicious behaviors than black-box testing alone.
Video fine-tuning boosts MLLMs' video smarts, but surprisingly dumbs them down on static images – a trade-off you can't simply brute-force away with more frames.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.
VLMs struggle to reason about visual scenes in adverse weather, losing significant segmentation accuracy as rain, snow, or fog intensifies.
Don't let your robot's brief moment of panic get lost in the noise – this new uncertainty method spotlights those critical spikes to predict failures before they happen.
Temporal CNNs and LSTMs can slash inventory costs and boost fill rates compared to traditional forecasting methods, offering a tangible advantage for supply chain optimization.
Current multimodal browsing agents are surprisingly bad at using visual information on webpages, with even top models scoring below 50% accuracy on a new visual-native search benchmark.
Even when given identical data and research questions, autonomous AI coding agents exhibit surprisingly high variability in their empirical findings, raising concerns about the reliability of AI-driven research.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
Real-world images plagued by both raindrops and reflections finally get a dedicated benchmark dataset (RDRF) and a diffusion-based model (DiffUR³) that actually works.
Instruction-tuned LLMs can nearly match supervised baselines on complex Arabic morphosyntactic tagging and dependency parsing, but only with careful prompt engineering and retrieval-based in-context learning.
LLMs can guess a singer's ethnicity from their lyrics, but they're biased: most default to North American, while DeepSeek-1.5B leans Asian.
This Italian LLM punches way above its weight, matching the performance of models trained on 6-10x more data while using only 3B active parameters during inference.
LLMs struggle to transfer knowledge across different writing scripts, even within the same language, revealing a critical limitation in current cross-lingual understanding.
LLM benchmarks for complex tasks often produce scores that are meaningless and misleading, masking distinct failure modes and hindering progress.
LLMs struggle with questions requiring up-to-date information, especially when the recency requirement is context-dependent, highlighting a critical gap in temporal reasoning.
Multi-turn review actually *worsens* LLM verification compared to single-pass review, as reviewers fabricate findings and critique the conversation itself rather than the artifact.
LLMs often fail to update their final predictions after interventions on intermediate reasoning steps, suggesting that these structures function more as influential context than stable causal mediators.
Off-the-shelf foundation models struggle with instance-level visual product search in industrial settings, often falling short compared to domain-specific models.
Most scientific claims in NLP die in obscurity, and even the survivors are more likely to be subtly reshaped than outright validated or debunked.
SER models, often assumed to generalize well to synthesized speech, actually fail miserably, revealing their reliance on spurious correlations rather than genuine emotional understanding.
LLMs beat rule-based systems at understanding nuanced grammar in language learners, but good old-fashioned rules still win on pure syntax.
Coding agents struggle to maintain faithfulness to specifications that emerge gradually over long interactions, losing significant implementation fidelity compared to single-shot specifications.
Current Omni-modal LLMs can ace perception tasks but still fail at basic social interactions like knowing when and how to jump into a conversation.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
CodeScan achieves 97%+ accuracy in detecting data poisoning attacks in code generation LLMs by identifying structural similarities across generations, even when semantics are expressed in diverse syntactic forms.
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
Mental health disclosures in user profiles can *increase* LLM agent refusal rates on both harmful and benign tasks, revealing a fragile safety-utility trade-off easily overridden by jailbreaks.
Using a top- or bottom-performing LLM as an anchor in "LLM-as-a-judge" benchmarks can dramatically skew results, making the choice of a mid-performing anchor key to reliable evaluation.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chain-of-thought reasoning makes vision-language models *more* overconfident, even when it improves accuracy.
Current time series foundation models struggle with millisecond-resolution 5G network data, revealing a critical gap in their ability to generalize to high-frequency real-world applications.
LRMs can often recover from injected errors in their reasoning steps, revealing a hidden "critique" ability that can be harnessed to improve performance without additional training.
Lightweight LLMs like Gemini 2.0 and GPT-3.5 can extract key metadata from cloud incident reports with surprisingly high accuracy (75-95%), offering a cost-effective alternative to larger models.
Forget one-size-fits-all power caps: the optimal energy efficiency for AI workloads on GPUs varies wildly by application and architecture.
Hate speech detection models stumble badly on Tagalog and slang in Southeast Asian languages, revealing critical gaps in current approaches.
Generative search engines create "answer bubbles" by selectively citing and framing information, leading to divergent information realities compared to traditional search.
Visual inputs can hijack the moral compass of VLMs, causing them to abandon carefully tuned text-based safety protocols and make surprisingly unethical decisions.
Transformer language models stumble on complex syntactic structures, failing to mimic human-like error patterns in agreement attraction, suggesting current architectures lack crucial aspects of human morphosyntactic processing.
Forget scaling laws: a specialized 8B parameter translation model can outperform a 70B general-purpose LLM on 1,600 languages.
Open-source LLMs can grade UML diagrams with near-human accuracy on individual criteria, paving the way for AI-assisted teaching without relying on proprietary models.
Forget RLHF alchemy – this study shows that *what* you teach your LLM *before* RLHF is the real secret to unlocking reasoning abilities.
LLM benchmarks in low-resource languages are likely garbage, with synthetic or machine-translated data introducing severe flaws that skew results.
Benchmarking complex systems just got a geometric upgrade: GeMA learns latent manifold frontiers to reveal hidden inefficiencies and technological structures, outperforming traditional methods when heterogeneity and scale bias muddy the waters.
LLMs' apparent superhuman performance on benchmarks may be a mirage: contamination inflates scores by up to 20% in some domains, revealing a critical flaw in current evaluation practices.
A hybrid cuVSLAM-based visual SLAM system achieves superior mapping accuracy in real-world logistics environments, outperforming other VO/VSLAM approaches.
LLMs struggle to selectively apply user preferences stored in memory, often misapplying them even when social norms dictate otherwise, revealing a critical gap in context-aware personalization.
LLM-assisted scientific writing is producing more confident but homogenized prose, as evidenced by a 23% decline in hedging in the post-LLM era.
Synthetic benchmarks can't catch the nuances of personalized deep research, as real users revealed nine critical errors that LLM judges missed entirely.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Language models can get a 12% boost in multi-turn conversation quality from just 10k examples of multi-turn training data, highlighting the critical gap between single-turn and multi-turn capabilities.