One of the world's largest corporate research labs, spanning AI, systems, and human-computer interaction.
Generative recommendation models can adapt to evolving user behavior without catastrophic forgetting by selectively updating item tokens based on a novel drift-detection mechanism.
Generative multi-agent systems spontaneously exhibit collusion and conformity that mirror societal pathologies; these behaviors emerge without explicit programming and bypass individual agent safeguards.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
Hypergraph modeling of patient visits, coupled with contrastive pre-training, significantly boosts medication recommendation accuracy and safety by capturing complex relationships missed by traditional graph-based approaches.
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
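A minimal sketch of the pre-execution idea, assuming hypothetical helpers `predict_next_tool_call`, `execute_tool`, and `agent_decide` (none of these names come from the paper): the likely tool call is launched speculatively while the agent finishes deciding, and its result is reused only if the prediction matches.

```python
import asyncio

# Hypothetical stand-ins; the paper's actual interfaces are not specified here.
async def predict_next_tool_call(state):
    return ("search", state["query"])          # cheap heuristic guess

async def execute_tool(call):
    await asyncio.sleep(0.5)                   # stands in for real tool latency
    return f"result of {call}"

async def agent_decide(state):
    await asyncio.sleep(0.5)                   # stands in for LLM decision latency
    return ("search", state["query"])          # the call the agent actually wants

async def step(state):
    # Launch the predicted call speculatively while the agent deliberates.
    predicted = await predict_next_tool_call(state)
    prefetch = asyncio.create_task(execute_tool(predicted))

    actual = await agent_decide(state)
    if actual == predicted:                    # prediction hit: tool latency is hidden
        return await prefetch
    prefetch.cancel()                          # miss: fall back to the real call
    return await execute_tool(actual)

print(asyncio.run(step({"query": "weather in Zurich"})))
```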
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.
SafeFQL achieves state-of-the-art safety in offline RL with significantly lower inference latency than diffusion-based methods, making it suitable for real-time safety-critical applications.
Ditch the task-specific verifier: energy-based fine-tuning (EBFT) lets you directly optimize sequence-level behavior in LMs, beating SFT and matching RLVR in downstream tasks.
Forget brute-force scaling: Tiny Aya proves a 3B parameter model can achieve state-of-the-art multilingual performance with clever training and region-aware specialization.
LLM-generated text alone can be a surprisingly effective and cost-efficient source of feedback for pseudo-relevance feedback, rivaling corpus-derived feedback in low-resource information retrieval tasks.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
Answering at the wrong time can be as bad as answering incorrectly in streaming video, so this work introduces a new framework that learns when to answer based on the availability of supporting visual evidence.
Lockbox offers a practical blueprint for enterprises to adopt cloud-based AI processing on sensitive data without compromising security, by implementing a zero-trust architecture.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
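For readers unfamiliar with the method being referenced, here is a generic sequential importance resampling step (standard particle filtering, not the paper's specific algorithm): candidate continuations are weighted by a reward or likelihood score and resampled in proportion to those weights, and the effective sample size shows how quickly extra samples stop helping once the weights concentrate.

```python
import numpy as np

def resample(particles, log_weights, rng=np.random.default_rng(0)):
    """One generic importance-resampling step over candidate continuations."""
    w = np.exp(log_weights - np.max(log_weights))   # stabilise before normalising
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx], w

# Toy example: three candidate continuations with hypothetical reward scores.
candidates = ["answer A", "answer B", "answer C"]
scores = np.array([1.2, 0.3, 2.5])
survivors, weights = resample(candidates, scores)

# Effective sample size: as weights concentrate, additional samples add little.
ess = 1.0 / np.sum(weights ** 2)
print(survivors, f"ESS = {ess:.2f}")
```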
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
Forget direct prompt editing: this agentic planning framework, powered by offline RL and synthetic data, masters complex image styling by breaking it down into interpretable tool sequences.
LLMs writing long stories frequently contradict themselves on basic facts and timelines, especially in the middle of the narrative, highlighting a critical weakness in long-form generation.
Ditching latent critics in offline RL unlocks state-of-the-art performance by directly backpropagating action-space gradients through a differentiable flow-based policy, enabling robust latent policy steering with minimal tuning.
Forget unimodal tasks—UniM throws down the gauntlet for truly unified multimodal AI, demanding models juggle any combination of text, image, audio, video, code, documents, and 3D inputs and outputs in a single, interleaved stream.
A 4B parameter SLM can now rival frontier agent performance in complex tool-use environments, thanks to a novel reinforcement finetuning framework that teaches it how to strategically acquire context and execute actions.
Despite codebases evolving rapidly, retrieval benchmarks can remain surprisingly robust even when re-judged on newer versions of the corpus.
Forget same-family constraints: you can compress prompts for LLaMA with a Qwen draft model and still get 90-100% of the original performance.
Decoupling confidentiality from trust, Mica lets you build secure TEE pipelines where components don't need to trust each other.
Ditch mean pooling in your geospatial foundation models: richer pooling methods like GeM can boost accuracy by up to 5% and slash the geographic generalization gap by 40%.
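GeM (generalized-mean) pooling itself is a standard operation; the PyTorch sketch below illustrates the kind of pooling the summary refers to, independent of the geospatial models involved. With p = 1 it reduces to mean pooling, and as p grows it approaches max pooling.

```python
import torch
import torch.nn as nn

class GeMPool(nn.Module):
    """Generalized-mean pooling over a (B, C, H, W) feature map."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # p is typically learned
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clamp(min=self.eps).pow(self.p)    # clamp keeps the power well-defined
        x = x.mean(dim=(-2, -1))                 # spatial mean of x^p
        return x.pow(1.0 / self.p)               # (B, C) pooled descriptor

features = torch.rand(2, 64, 16, 16)             # toy feature map
print(GeMPool()(features).shape)                 # torch.Size([2, 64])
```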
LLMs can mimic your style, but your friends can still tell it's not really you, especially when it comes to your opinions.
LLMs can now more accurately answer questions on complex documents thanks to a new system that understands layout and hierarchical relationships between document components.
Achieve state-of-the-art TTS and SLM performance while slashing inference costs and eliminating content hallucinations by synchronizing text and acoustic tokens.
Achieve up to 57% better point cloud compression by combining the generalization of pretrained models with the robustness of implicit neural representations.
LLMs struggle with instruction following in Indic languages despite progress in high-resource languages, as shown by a new benchmark spanning 14 languages.
VisRAG models can now handle real-world image degradations like blur and shadows without sacrificing accuracy, thanks to a new causality-guided architecture that disentangles semantics from visual distortions.
GUI agents can achieve significantly stronger task-solving capabilities through carefully designed post-training and data curation, without relying on costly online data collection.
Forget static rubrics: SibylSense adaptively learns rubrics at inference time, leading to more discriminative rewards and better RL performance in open-ended generation tasks.
Forget slow, reactive GUI agents: ActionEngine uses a state-machine memory to plan actions programmatically, cutting costs 11.8x and doubling speed while boosting task success to 95%.
AgentOS reimagines LLMs as reasoning kernels within a structured OS, offering a blueprint for more robust and scalable AI agents.
NanoKnow reveals that even with external evidence, LLMs are more accurate when answers were seen during pre-training, highlighting the crucial role of parametric knowledge.
ImageNet-pretrained CNNs can spot looted archaeological sites from space with surprising accuracy, leaving traditional methods in the dust.
Imagine a world where web agents don't just click and type, but orchestrate complex tasks with the reliability of APIs – Web Verbs offer a path to that future.
World models can now effectively simulate complex desktop software environments like Microsoft Office, enabling agents to reason about actions before execution and significantly improving performance.
Guaranteeing consistent communication between AI agents is now possible: a new certification protocol slashes disagreement by up to 96% by ensuring agents share a common understanding of terms.
Diffusion models can now efficiently tackle rare event sampling in molecular dynamics, unlocking rapid calculation of folding free energies in minutes to hours on a GPU.
LLMs can reason more causally by simply checking if their counterfactual predictions are consistent, even without any extra training data.
Forget full-cache rollouts: this parameter-efficient fine-tuning method lets large reasoning models maintain accuracy while slashing memory usage during RL training.
Sampling from diffusion models with quadratic rewards can be surprisingly hard: negative-definite tilts are intractable even in simple cases, while a new algorithm makes low-rank positive-definite tilts tractable.
LLMs can't reliably debug code in long contexts (64k-128k tokens) even with perfect information retrieval, despite impressive performance in agentic workflows that decompose the task.
Knowing VM lifetimes in advance doesn't always guarantee better placement, challenging common assumptions about clairvoyance in cloud resource optimization.
Quantum databases are no longer just a theoretical exercise: Qute shows real speedups over classical databases on real quantum hardware.
By predicting tracking models rather than image features, GOT-JEPA unlocks more robust object tracking, even when objects are heavily occluded or the environment is dynamic.
Scaling laws hold for interest modeling: bigger LLMs and more inference-time sampling consistently boost news recommendation quality, and can be distilled into smaller, deployable models.
LLM development teams often resort to workarounds and augmentation strategies when faced with the practical challenges of integrating domain experts, revealing a gap between ideal participatory design and real-world constraints.
On-policy RL (GRPO) makes LLMs significantly better at vulnerability detection than SFT or preference optimization, outperforming even strong zero-shot baselines.
By explicitly prompting for reflection on failure, ERL unlocks up to 81% better performance in complex RL tasks and 11% gains in tool-using reasoning.
Language models can now internalize experiential knowledge and system prompts more effectively through on-policy context distillation, leading to better task accuracy and out-of-distribution generalization.
Speech recognition models stumble badly on real-world street names, especially for non-English speakers, but a simple synthetic data boost can dramatically improve accuracy.
By explicitly detecting and escaping "Forbidden Zones" during training, AMD unlocks significant gains in sample fidelity and training robustness for few-step generative models like SDXL.
Ditch the army of task-specific models: AdNanny shows a single, reasoning-centric LLM can handle diverse offline advertising tasks with improved accuracy and reduced manual effort.
LLMs can get a 12% performance boost in low-resource languages by using a new framework that tailors data refinement, synthetic text generation, and continual pretraining to each language's digital footprint.
Most AI models are failing to disclose critical safety information like deception behaviors and hallucination risks, even from top labs.
Enterprise AI assistants can achieve zero data retention, but the architectural and compliance paths taken by Salesforce and Microsoft reveal significant trade-offs.
Reasoning-based safety guardrails, once thought to be a strong defense against jailbreaks, crumble with just a few strategically placed tokens.
Even the best LLMs fail more than 40% of the time when orchestrating multiple tools in realistic scenarios, revealing critical gaps in real-world agent capabilities.
Open-source biomolecular modeling just got a boost: RF3 closes the gap with AlphaFold3 in structure prediction, thanks to the new AtomWorks data framework.
Ditch the high-fidelity simulator: IRL-VLA uses a lightweight reward world model trained with inverse reinforcement learning to enable efficient and effective closed-loop RL training for autonomous driving.
LLMs can now automate structured reporting from nurse dictations and medical order extraction from doctor-patient consultations, thanks to two new open-source datasets and an agentic pipeline for generating realistic training data.
VLMs can be effectively adapted, even under data and compute constraints, to create a unified evaluator for video world models that rivals task-specific models and aligns well with human judgment.
LLMs and VLLMs can team up to generate synthetic image data so good, it beats state-of-the-art methods and boosts performance on rare classes and open-vocabulary object detection.
A foundation model trained on a million hours of geophysical data crushes operational weather forecasts while slashing compute costs.
A 1-bit LLM can match the performance of full-precision models, promising huge gains in efficiency.
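As a rough illustration of what 1-bit weights mean in practice (not the paper's training recipe), a common binarization scheme keeps only the sign of each weight plus one per-tensor scale, so each weight costs a single bit to store:

```python
import numpy as np

def binarize(weights: np.ndarray):
    """Sign binarization with an absmean scale (illustrative only)."""
    alpha = np.abs(weights).mean()               # one full-precision scale per tensor
    signs = np.where(weights >= 0, 1.0, -1.0)    # 1-bit payload per weight
    return alpha, signs

def dequantize(alpha: float, signs: np.ndarray) -> np.ndarray:
    return alpha * signs                         # reconstruction used at matmul time

w = np.random.randn(4, 4).astype(np.float32)
alpha, signs = binarize(w)
print("mean abs error:", np.abs(w - dequantize(alpha, signs)).mean())
```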
ChatGPT-4 slashes data extraction time in scoping reviews by 66%, but don't ditch the human reviewers just yet.
LLMs can generate plain language summaries of scientific research that are as good as human-written ones, but easier to read.
Forget task-specific models: Magma, a single foundation model, now outperforms them in both UI navigation and robotic manipulation by bridging verbal and action abilities.