Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
Even Gemini can understand you if you speak its language: structured intent prompting slashes cross-language performance variance and boosts weaker models more than stronger ones.
Tabular foundation model performance hinges on the evaluation metric, revealing that no single pretraining objective is universally optimal across different risk profiles.
Forget complex LLMs: a small, fine-tuned transformer surprisingly nails readability scoring for German ESG reports.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Stop guessing which layers to edit in your LLM – KEditVis reveals the inner workings of knowledge editing, letting you pinpoint the most effective interventions.
LLMs can be rigorously evaluated for metacognitive abilities like confidence assessment and risk-aware decision-making using psychophysical frameworks borrowed from human cognition research.
LLMs don't just make people confidently wrong; they create a dangerous illusion of competence by decoupling performance from actual understanding.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Multi-agent systems for automated research face a fundamental trade-off: parallel exploration offers speed and stability, while expert teams unlock deeper reasoning at the cost of increased fragility.
Training language models on individual children's language reveals that distributional and interactional linguistic features, not just dataset size, are key to efficient learning, mirroring factors that drive child language acquisition.
Enriching meaning representations with task demonstrations can significantly boost dialogue generation, especially in challenging scenarios, revealing a simple yet effective strategy for improving NLG performance.
Forget IoU: measuring the structural compactness of attribution maps with Minimum Spanning Trees reveals fundamental differences in how models explain themselves.
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
LLM-as-a-Judge, while improving evaluation scalability, introduces critical security vulnerabilities that can compromise the trustworthiness of entire evaluation pipelines.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.
LLMs can nail the clinical content of prior authorization letters, but consistently fumble the administrative details that actually get them approved.
AI benchmarks may be giving you a false sense of comprehensive evaluation: the six scores on the Open LLM Leaderboard effectively boil down to just two independent measurements.
Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
NeuralUCB can slash LLM inference costs while maintaining quality, offering a practical alternative to always using the biggest, most expensive models.
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
LLMs are surprisingly bad at strategic communication, leaking sensitive information even when trying to be secretive.
Current evaluation methods miss 8-17% of agentic workflow failures because they only check final outcomes, overlooking cases where agents bypass policy checks but still reach the right answer.
LLM-generated authorial impersonations, despite their sophistication, are surprisingly detectable by existing authorship verification methods, which flag them even more reliably than some genuine negative samples.
Forget fancy ensembling – simply asking an LLM how confident it is in its grading is the most reliable way to predict its accuracy, and it's far cheaper than self-consistency voting.
LLMs may ace English, but LLM Probe reveals surprising performance disparities in low-resource languages, with sequence-to-sequence models unexpectedly leading in morphosyntax.
Mental-health support chatbots get a much-needed reality check with CounselReflect, a toolkit that exposes their strengths and weaknesses through transparent, multi-dimensional audits.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
LLMs struggle to handle common, challenging patient behaviors like contradictory statements and inaccurate medical information, revealing critical safety gaps in medical consultation applications.
Despite Esperanto's simple grammar, translating it still poses challenges for LLMs, with NLLB models preferred in only about half of human evaluations.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
Forget clunky prompt engineering: distilling user history into a learned preference memory boosts LLM-based product reranking by over 10%.
Forget slow, bloated LLMs – this work shows you can get GPT-4o quality on long-document QA with a 3B model and a clever structure-first distillation approach.
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
LLMs still struggle to accurately infer user interests from interaction histories, especially when dealing with diverse engagement signals – a critical gap for effective personalization.
LLMs can mimic legislative reasoning, but their performance hinges on the proposal's idiosyncrasy, revealing a susceptibility to plausible-sounding confabulation that could mislead policymakers.
Forget resource-intensive workshops – AI can now simulate entire expert panels to generate and stress-test socio-technical scenarios, opening doors to rapid policy exploration.
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can maintain conversational stability and improve retrieval accuracy in long-running interactions by adaptively compressing context, leading to reduced token usage and faster inference.
Dialogue agents can now remember what you told them six turns ago with 57% accuracy, thanks to a new memory architecture that selectively forgets less important details.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.
State-of-the-art Large Audio Language Models are surprisingly vulnerable to hallucination attacks, with success rates as high as 95%, revealing a critical reliability gap masked by standard benchmarks.
Arabic mispronunciation detection just got a whole lot better: F1-scores jumped by 0.28 thanks to novel architectures and a new dataset of authentic mispronunciations.
Generative recommendation's touted cold-start abilities often vanish under rigorous testing, revealing a sensitivity to design choices that current benchmarks fail to capture.
Single-vector embeddings' retrieval failures aren't just about dimensionality; they're fundamentally hobbled by domain shift, relevance misalignment, and a "drowning" effect that multi-vector models handle far better.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Current multimodal LLMs struggle to count objects and ground evidence in videos longer than 30 minutes, achieving only ~25% accuracy compared to human performance on a new benchmark.
Dummy Class defenses, which appear robust under standard adversarial attacks, crumble when attacked with a novel DAWA method that targets both the true and dummy labels.
Aggregate accuracy can be dangerously misleading when evaluating facial recognition systems for law enforcement, obscuring significant disparities in error rates across demographic subgroups.
Current vision-language benchmarks miss the mark: AMIGO reveals how hard it is for agents to ground visual information across multiple images and turns.
VLMs can appear to gain up to 58% F1 on clinical tasks simply by *mentioning* MRI data in the prompt, even when the data is uninformative, revealing a "scaffold effect" that inflates performance metrics.
VLA models are brittle: even simple synonym substitutions in instructions cause a 22-52% performance drop in robotic manipulation tasks.
LLMs can strategically obfuscate their reasoning, with chain-of-thought monitorability dropping by up to 30% under stress tests, particularly when tasks don't demand explicit reasoning.
Choosing the right fuzzy logic operator for AI compliance can mean the difference between accurate risk assessment and costly false positives, but the completeness of the rule base matters more.
Semantic disagreement between LLMs reveals crucial uncertainty that single-model metrics miss, and Collaborative Entropy (CoE) captures it.
Gemini 3 Flash can answer introductory programming questions better than typical educators, suggesting a path to scalable, personalized feedback in CS1 courses.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
Open-source document parsing models are shockingly brittle, losing nearly 18% accuracy on real-world photos and 14% on non-Latin scripts compared to their closed-source counterparts.
Scientific figure QA models are often fooled by the answer choices themselves, but a simple decoding strategy that contrasts image-grounded scores with text-only scores can significantly improve accuracy.
LLM tutors can become significantly more personalized, emotionally sensitive, and clear by explicitly separating learner-state inference from instructional action selection.
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
LLMs can now reliably transform messy app store reviews into well-formatted user stories, but still fall short of creating truly independent and unique requirements for agile development.
Atomic decomposition, a popular technique for LLM judges, may not be superior to holistic evaluation when prompts are carefully controlled, challenging the assumption that breaking down answers into claims is always beneficial.
You can now unmask LLM ghostwriters with a lightweight fingerprinting method that works even when they try to hide in new domains or use unseen models.
Even state-of-the-art vision-language models still struggle to reconcile visual evidence with commonsense, often hallucinating based on prior knowledge instead of what they actually see.
A novel ensemble method substantially improves the reliability of detecting Chinese LLM-generated text, even against adversarial examples.
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
LLMs can generate better code by treating tests as noisy signals to be refined, rather than ground truth, unlocking performance gains even with smaller models.
REST API fuzzing, a critical component of modern software development, suffers from significant flakiness issues that can now be reliably detected and mitigated.
AI coding assistants are racking up technical debt in real-world projects, with nearly a quarter of the code quality issues they introduce sticking around long-term.
Current robot manipulation benchmarks fail to capture the messy reality of real-world deployment, so this work introduces a new benchmark, ManipArena, to close the sim2real gap.
Finally, a way to measure how efficiently a sketch conveys meaning, moving beyond simple recognition accuracy.
DINOv3, a vision foundation model trained on general images, surprisingly excels at dental image analysis, especially for the notoriously difficult task of intraoral image understanding.
VLMs struggle to create logically consistent academic illustrations, with performance gaps between models being far wider than on general image generation tasks.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
Image editing benchmarks are broken: even GPT-4 is worse than the new PVC-Judge model at assessing visual consistency in edited images.
Verification is the secret sauce: an 8B parameter research agent, fortified with verification mechanisms, can now rival or surpass the performance of 30B parameter agents while drastically reducing computational cost.
LLMs struggle to attribute emotions across cultures, and where an emotion *originates* matters more than where it's *interpreted*.
Sentiment models often disagree on Holocaust oral histories, not on the presence of positive or negative sentiment, but on the boundary of neutrality, revealing a critical gap in their ability to handle nuanced historical narratives.
LLMs are surprisingly bad at reasoning about everyday scenarios, consistently choosing nonsensical actions (like walking to a car wash) because they're overly influenced by simple heuristics like distance, even when doing so violates obvious constraints.
Safety fine-tuning might inadvertently be stripping LLMs of their ability to understand non-human minds and entertain spiritual beliefs, even while preserving Theory of Mind.
Simple factorization beats BERT at generalizing to unseen combinations of intents, but only if you evaluate it the right way.
Generating synthetic training data from limited confidential datasets can yield training sets that are superficially similar to the reference data and still improve model training for short answer grading.
Current research agent benchmarks miss crucial aspects of real-world research, like multimodal reasoning and iterative refinement, which MiroEval now captures.
Current NLP evaluations miss crucial aspects of subjectivity, potentially leading to models that fail to represent diverse perspectives effectively.
LLM-as-a-Judge accuracy hinges on temperature settings, revealing a task-dependent sweet spot that defies the common practice of fixed values like 0.1 or 1.0.
LLMs can be confidently wrong about *why* they succeed, and accurately explain failures they can't fix, revealing a fundamental disconnect between explanation and competence.
Claude's Constitution doesn't create a neutral AI, but instead bakes in the values of Northern European and Anglophone cultures, creating a value floor that's hard to shift.
Securing LLM supply chains requires cryptographically binding training and release claims to artifacts, enabling verifiable enforcement of security policies across teams and stages.
KANs, by replacing static weights with learnable splines, achieve superior cybersecurity threat detection in IoT networks compared to MLPs, while using significantly fewer parameters.