100 papers published across 4 labs.
Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.
Forget fine-tuning: detecting AI-generated text is possible zero-shot, simply by comparing probabilities from instruction-tuned and base LLMs.
Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.
Seemingly innocuous choices in table serialization format (CSV vs. HTML) can drastically alter retrieval performance, but a simple centroid-based correction can restore semantic consistency.
Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.
Turns out, you don't need Borel measurability for symmetrization in VC learning; null measurability is sufficient.
LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.
Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.
Today's best web agents are shockingly inefficient, achieving only 1.15% trajectory efficiency on realistic long-horizon tasks, revealing a critical need to move beyond simple success rates.
LLM benchmarks are riddled with hidden flaws that even human experts miss, but can be caught with an automated LLM auditor for under $15 per benchmark.
LLMs exhibit Pareto-like tradeoffs in medical diagnosis, where neutralizing user prompts to improve plausibility and conciseness can simultaneously reduce coverage of critical conditions.
LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.
Machine translation alone ruins agent benchmark validity across languages, but careful functional and cultural alignment can close the performance gap by up to 30%.
LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.
LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.
LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.
A single, tuning-free "health signal" derived from layer activations can catch backdoors, jailbreaks, and prompt injections in LLMs, even without a clean reference model.
A BiLSTM with a custom slang dictionary rivals AutoML in classifying the sentiment and emotion of messy, real-world Indonesian e-commerce reviews.
LLMs can evaluate clinical AI as well as human experts, but at 1/1000th the cost, unlocking scalable and continuous monitoring.
Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.
Your sign language translation model's performance could be bottlenecked by your choice of pose estimator: switching from MediaPipe to SDPose or Sapiens could boost BLEU score by 1.5 points.
LLMs that nail individual personas can still fail spectacularly at generating diverse populations, instead defaulting to coarse stereotypes.
Frontier AI companies need a standardized risk reporting framework for internal model use, and this paper provides one structured around autonomous AI misbehavior and insider threats.
LLMs can learn to generate better compromises by iteratively incorporating feedback on how empathically similar a compromise is to each viewpoint, opening the door to more socially intelligent AI.
Forget painstakingly curating datasets – STELLAR-E auto-generates high-quality, domain-specific LLM benchmarks, rivaling real-world data in evaluation quality.
LLM-based tutors can accumulate more data about students than instructors can access, creating a "Blind Instructor Problem" that this multi-agent system tackles head-on.
DKnownAI Guard blows away AWS, Azure, and Lakera in head-to-head security tests for AI agents.
Learned indexes, despite their promise, can suffer up to 2.8x lookup slowdowns under targeted dynamic attacks, but only if the data distribution isn't too dense.
C2PA, the leading standard for verifying digital media provenance, fails to meet its security goals, potentially misleading users in critical applications like journalism and legal evidence.
LLM multi-agent systems can substantially reduce operational costs by using effective attack remediation to facilitate early consensus and cut off token generation by adversarial agents, as shown by GAMMAF.
Forget static defenses: LLM-powered "Defender" agents can dynamically harden cyber ranges, slashing attacker success rates and leveling the playing field as AI-driven threats evolve.
LLM stability under uncertainty isn't just about accuracy – a new information-geometric framework reveals how internal model structure non-linearly attenuates the impact of disorder.
Under-specifying prompts can *improve* LLM code generation correctness by breaking misleading cues that trigger incorrect retrieval-based solutions.
Turns out, a tiny fine-tuned model can spot flaws in coding instructions that trip up even the biggest LLMs, suggesting we're over-relying on brute force for code generation.
LLM agent reliability metrics hide a wealth of information: modeling execution traces as Markov chains reveals the underlying success-time distribution and quantifies uncertainty, offering a richer understanding of agent behavior.
Automated evaluations of code review bots disagree with developer feedback nearly 40% of the time, revealing that developer actions are driven by workflow pressures, not just code quality.
Even the largest language models still struggle to connect information across dispersed code segments, achieving only 74% accuracy on a new benchmark designed to test multi-hop code comprehension.
Benchmarks alone don't tell the whole story: AgentPulse reveals that real-world adoption signals often diverge significantly from static performance metrics, especially for closed-source, high-capability agents.
Even the best vision models make shockingly bad shape recognition errors, like confusing a car with a chair, when evaluated on a new viewpoint-invariant shape recognition benchmark.
Scaling up pathology foundation models doesn't guarantee better survival prediction: a distilled model with 8% of the parameters can outperform its larger teacher.
Current event-based SLAM algorithms falter when faced with the full complexity of high-speed, 6-DoF maneuvers, highlighting a gap between current capabilities and the promise of event cameras.
Ditch expensive robot trials: a novel "betting" framework lets you accurately predict real-world robot performance using only cheap simulations.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.
Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.
Many recommender system fairness metrics are flawed, producing scores that are uninterpretable, inexpressive, or even incalculable in common scenarios.
Audio-Language models are cheating on benchmarks, acing tests even when they barely listen.
Existing GUI agents can parrot actions, but AutoGUI-v2 reveals they still lack a deep understanding of GUI functionality and struggle to predict the outcomes of even simple interactions.
LLM agents struggle to maintain performance in multi-day collaborative tasks, dropping significantly after just one environmental update, revealing a critical gap in adaptation to evolving real-world conditions.
LLMs' gender biases aren't fixed; they warp and intensify based on the *personality* you give them, especially when those personalities lean toward the "Dark Triad."
Generative AI evaluation can be sped up by 8-65x without sacrificing accuracy by proactively focusing on the most informative test cases using a pre-trained Gaussian Process surrogate model.
Forget slow, expensive real-world trials: dWorldEval's discrete diffusion world model lets you evaluate robot policies across thousands of environments and tasks with unprecedented speed and accuracy.
Semantic similarity is a poor proxy for agent performance: ranking agents based on execution-aware probing beats description-based retrieval by a wide margin.
Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.
VLM evaluators, despite their growing use, can miss over 50% of targeted errors in generated images and text, especially when those errors involve fine-grained details or spatial relationships.
VLAA-GUI's innovative framework allows autonomous agents to not only verify their success but also adaptively recover from failures, achieving human-level performance in GUI tasks.
Stop guessing which interactive video model is best: WorldMark offers the first apples-to-apples comparison across leading models on identical scenes and trajectories.
LLMs can be made 20% more accurate by jointly attributing claims to sources and verifying them, rather than just verifying.
LVLMs are often tripped up not by faulty vision, but by over-trusting the textual prompt, leading to surprisingly easy-to-fix hallucinations.
The best continual learning method for your task might depend more on *how much* of the model you fine-tune than *which* regularization strategy you use.
Multicalibration demands a surprisingly high sample complexity of $\widetilde{\Theta}(\varepsilon^{-3})$, even for randomized predictors, revealing a stark difference from marginal calibration and highlighting its inherent difficulty.
Seemingly innocuous choices about how to split a continuous data stream into discrete tasks can dramatically alter the conclusions of continual learning benchmarks, even before any model is trained.
LLMs are more likely to get economic cause-and-effect wrong when the correct answer favors free markets, revealing a systematic ideological bias that prompting can't fix.
LLMs struggle to answer human-generated questions about multi-chart images, highlighting a critical gap in their ability to reason about real-world data visualizations.
Adapting machine-generated text detection methods to code proves competitive, but current LLMs still struggle to reliably identify AI-generated code, especially when obfuscated.
Even GPT-5 only achieves 63% accuracy on time series anomaly questions from real software incidents, but a model-expert combination reaches 87%, highlighting the potential for hybrid intelligence in incident response.
LLMs aren't just Western-centric; they have a peculiar obsession with Japan, and this bias is amplified by English-language prompting.
Forget guessing games – this framework finally offers a concrete, auditable way to prove your AI system is acceptably safe before deployment, even if it's a black box.
A new synthetic aerial imagery dataset provides pixel-perfect depth, controlled illumination, and multi-scale imagery, unlocking joint research across geometric understanding, domain robustness, and resolution enhancement.
LLM leaderboard rankings are more a reflection of benchmark designer priorities than actual user needs, but a new interactive visualization tool lets you reshape those rankings based on your specific prompt types and goals.
Forget chain-of-thought prompting – iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.
LLMs can debug code *without* human-provided test cases, autonomously generating inputs and execution traces to match the performance of public-test-dependent methods while reducing token consumption.
LLMs' apparent success at program repair crumbles when faced with slightly altered versions of known bugs, revealing a reliance on memorization rather than true understanding.
Unseen token generalization in transformers isn't just about copying; it's fundamentally limited by a representational collapse in the unembedding space.
Guaranteeing safety bounds for neural networks under probabilistic input disturbances is now more tractable thanks to a new approach that efficiently carves out safe and unsafe regions.
Hybrid architectures that combine attention and recurrence can maintain reasoning performance as task complexity increases, while transformers see a sharp performance drop-off.
Existing translation quality estimation models exhibit systematic gender bias, but FairQE shows you can fix this without hurting overall accuracy.
Guarantee that clinical decisions are based on appropriate evidence *before* deployment, not just explained after the fact.
VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.
Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.
GPT-4.1-mini wins on accuracy for meeting summarization, but GPT-5.1 crushes it on completeness and coverage, revealing that the best model depends on the specific metric you care about.
MLLMs struggle to "read" missing text directly from visual context, even when they possess the necessary visual grounding and layout understanding.
MemPalace's impressive memory recall isn't due to its fancy "memory palace" spatial organization, but rather its simple "store everything verbatim" approach combined with a strong embedding model.
LLMs' factual knowledge is surprisingly brittle: simply changing an entity's surface form in a question (e.g., using an abbreviation instead of the full name) can drastically alter the answer.
SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.
LLMs may fail in real-world moral decisions because they rigidly adhere to fairness norms, even when their own internal models predict humans would prioritize loyalty.
LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.
Sentence embeddings can be objectively evaluated for conceptual stability without relying on downstream classifiers, revealing their true capacity to capture meaning.
Mid-sized LLMs can actually be *more* fair in news summarization than their larger counterparts, challenging the common wisdom of "bigger is better."
LLMs are far more likely to parrot your views in a debate than reveal their true opinions, especially when you keep pushing.
Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.
Forget English – this study reveals which TTS systems truly resonate with native speakers across ten diverse Indian languages, pinpointing specific perceptual dimensions that drive preference.
Enterprise LLM agents leak sensitive information in up to 50% of interactions, and surprisingly, performing better at tasks makes the problem *worse*.
Structured graph memory can outperform full-context prompting for cross-session LLM reasoning, but optimizing for specific reasoning skills can hurt overall performance.
LLM agent distillation leads to surprisingly high rates of behavioral mimicry, with some student models exhibiting tool-use habits that are *more* similar to their teacher's than those of other models in the teacher's own family.
LLMs can significantly boost multi-table entity matching by cleverly coordinating attributes, embedding entities, and pruning noise.
Even when you think you've scrubbed 90% of the PII, your anonymized text might still leak two-thirds of a person's identity.
Static analysis tools miss a staggering 87% of real-world Python vulnerabilities when they're introduced across multiple commits, even when the full codebase is available.
LLM agent self-reporting is dangerously unreliable for security assessments, diverging from actual execution traces in up to 100% of critical actions, demanding a shift towards trace-based auditing.
Applying differential privacy to survival analysis can obliterate statistical significance and predictive power, even with relatively large datasets and optimistic clipping bounds.