One of the world's largest corporate research labs, spanning AI, systems, and human-computer interaction.
Current LLM jailbreak evaluations often rely on narrow metrics and fall short; a multi-dimensional framework like Security Cube is needed for comprehensive security assessment.
Optimizing for runtime in multimodal training can be energy-inefficient: on Grace Hopper chips, data movement and overlap, not raw compute, dominate energy consumption.
Simple, artist-friendly quad meshes can now be automatically generated on 3D shapes using a diffusion model trained on a continuous surface representation, sidestepping the complexity of discrete mesh optimization.
Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.
MEV searchers beware: a new, low-cost DoS attack can cripple transaction bundling services like Flashbots by exploiting inter-transaction dependencies and atomic block inclusion.
Continuous benchmarking of protein function prediction models is now possible, enabling faster iteration and more robust performance tracking as annotations evolve.
Users who actively participate in an AI agent's spreadsheet execution not only improve task outcomes but also gain a deeper understanding of the results and feel more ownership over them.
LLMs are poised to flip the script on personalization, giving users unprecedented control over their data and how it's used across platforms.
Learning user preferences for thousands of items can be achieved with just a handful of evaluations, thanks to a novel approach that leverages effective dimension in graph-based bandit problems.
Surprisingly, a trie-guided decoding framework applied to smaller encoder-decoder models like T5 and BART can outperform much larger instruction-tuned models like LLaMA-3 and Phi-3 on in-document query auto-completion.
TurboQuant's "novel" quantization method is actually a special case of a prior technique (EDEN) with a crucial parameter stuck at a suboptimal value, leading to demonstrably worse performance.
Discrete diffusion models can be sped up by 14x by intelligently choosing which tokens to sample at each step, without sacrificing accuracy.
A new framework reduces false positives in recommendation systems by over 74%, restoring user control and transparency in content curation.
Current user modeling benchmarks are child's play compared to the real-world challenges exposed by HORIZON, a massive new dataset spanning 54M users and diverse domains.
Despite impressive unit test pass rates, today's best LLMs rewrite code instead of precisely debugging it, achieving less than 45% edit precision even when explicitly instructed to minimize changes.
RosettaSearch recovers up to 68% more structural fidelity in protein designs, transforming how we optimize sequences beyond traditional single-pass methods.
Spectral Thompson Sampling offers a computationally tractable alternative for bandit problems on graphs, achieving comparable regret bounds to existing methods while scaling efficiently to large action spaces.
Edit 3D assets with text prompts while actually preserving the original object's unchanged parts, thanks to a new masking strategy and training dataset.
Forget hand-crafted templates: DUET learns to generate user and item profiles jointly, boosting recommendation accuracy by better aligning textual representations.
Imagine software that autonomously evolves and maintains itself – this paper lays out the architectural groundwork for making that a reality.
Real-time, lightweight image compression is now possible with diffusion models, thanks to a novel architecture that swaps transformers for convolutions and prioritizes compression-focused pre-training.
Iterative visual refinement lets agents navigate dense coding IDEs with superhuman precision, outperforming single-shot methods and paving the way for more reliable software engineering agents.
Agentic data science pipelines often reach falsely optimistic conclusions, but two simple sanity checks can expose these unsupported claims by testing if the agent can reliably distinguish signal from noise.
LLMs are twice as likely as humans to repeat the same support tactic in a conversation, but a simple RL reward for tactic novelty can fix it.
GNNs can spot API misuse better than small language models, thanks to a novel graph representation that captures API execution flow.
Gaze-tracking unlocks a new level of personalized AI assistance, enabling LLMs to infer user cognitive states and boost recall performance.
Knowing the *perfect* API to use or *exact* location to edit could drastically improve SWE agent performance, but knowing the perfect regression test result? Not so much.
Unlock interactive digital twins from messy, real-world videos: FunRec automatically turns egocentric RGB-D recordings into simulation-ready 3D scenes.
People aren't as bothered by AI failing at easy tasks as you might think, suggesting our expectations for AI competence are more nuanced than a simple aversion to errors.
Synthetic motion data, when represented as optical flow, unlocks a new level of realism and control in video diffusion models, surpassing the limitations of real-world datasets.
LLMs still fail to grasp research-level mathematics, with top models scoring below random chance when superficial pattern matching is removed, even with access to proof sketches.
GeoAI assistants remain unproductive because they lack a crucial agency layer for iterative human-AI collaboration, a gap this paper addresses with nine core primitives.
Generative recommendation systems can now adapt to evolving user behavior without catastrophic forgetting, thanks to a novel drift-aware tokenization method that selectively updates item representations.
Generative multi-agent systems spontaneously exhibit collusion and conformity, mirroring societal pathologies; these behaviors emerge without explicit programming and bypass individual agent safeguards.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LLM agents can slash task completion time by almost 50% simply by predicting and pre-executing likely tool calls.
Hypergraph modeling of patient visits, coupled with contrastive pre-training, significantly boosts medication recommendation accuracy and safety by capturing complex relationships missed by traditional graph-based approaches.
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.
LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.
SafeFQL achieves state-of-the-art safety in offline RL with significantly lower inference latency than diffusion-based methods, making it suitable for real-time safety-critical applications.
Ditch the task-specific verifier: energy-based fine-tuning (EBFT) lets you directly optimize sequence-level behavior in LMs, beating SFT and matching RLVR in downstream tasks.
Forget brute-force scaling: Tiny Aya proves a 3B parameter model can achieve state-of-the-art multilingual performance with clever training and region-aware specialization.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
Lockbox offers a practical blueprint for enterprises to adopt cloud-based AI processing on sensitive data without compromising security, by implementing a zero-trust architecture.
Achieve near-optimal DLRM inference speedups across diverse hardware (NVIDIA, AMD, TPU) with a single optimization pass, eliminating the need for vendor-specific tuning.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
Forget direct prompt editing: this agentic planning framework, powered by offline RL and synthetic data, masters complex image styling by breaking it down into interpretable tool sequences.
LLMs writing long stories frequently contradict themselves on basic facts and timelines, especially in the middle of the narrative, highlighting a critical weakness in long-form generation.
Forget unimodal tasks—UniM throws down the gauntlet for truly unified multimodal AI, demanding models juggle any combination of text, image, audio, video, code, documents, and 3D inputs and outputs in a single, interleaved stream.
Ditching latent critics in offline RL unlocks state-of-the-art performance by directly backpropagating action-space gradients through a differentiable flow-based policy, enabling robust latent policy steering with minimal tuning.
A 4B parameter SLM can now rival frontier agent performance in complex tool-use environments, thanks to a novel reinforcement finetuning framework that teaches it how to strategically acquire context and execute actions.
Decoupling confidentiality from trust, Mica lets you build secure TEE pipelines where components don't need to trust each other.
Forget same-family constraints: you can compress prompts for LLaMA with a Qwen draft model and still get 90-100% of the original performance.
Ditch mean pooling in your geospatial foundation models: richer pooling methods like GeM can boost accuracy by up to 5% and slash the geographic generalization gap by 40%.
LLMs can mimic your style, but your friends can still tell it's not really you, especially when it comes to your opinions.
LLMs can now more accurately answer questions on complex documents thanks to a new system that understands layout and hierarchical relationships between document components.
Achieve up to 57% better point cloud compression by combining the generalization of pretrained models with the robustness of implicit neural representations.
GUI agents can achieve significantly stronger task-solving capabilities through carefully designed post-training and data curation, without relying on costly online data collection.
Forget slow, reactive GUI agents – ActionEngine uses a state-machine memory to plan actions programmatically, slashing costs by 11.8x and doubling speed while boosting task success to 95%.
AgentOS reimagines LLMs as reasoning kernels within a structured OS, offering a blueprint for more robust and scalable AI agents.
Forget static rubrics: SibylSense adaptively learns rubrics at inference time, leading to more discriminative rewards and better RL performance in open-ended generation tasks.
ImageNet-pretrained CNNs can spot looted archaeological sites from space with surprising accuracy, leaving traditional methods in the dust.
World models can now effectively simulate complex desktop software environments like Microsoft Office, enabling agents to reason about actions before execution and significantly improving performance.
Imagine a world where web agents don't just click and type, but orchestrate complex tasks with the reliability of APIs – Web Verbs offer a path to that future.
Forget full-cache rollouts: this parameter-efficient fine-tuning method lets large reasoning models maintain accuracy while slashing memory usage during RL training.
Sampling from diffusion models with quadratic rewards can be surprisingly hard: negative-definite tilts are intractable even in simple cases, while a new algorithm makes low-rank positive-definite tilts tractable.
Diffusion models can now efficiently tackle rare event sampling in molecular dynamics, unlocking rapid calculation of folding free energies in minutes to hours on a GPU.
LLMs can reason more causally by simply checking if their counterfactual predictions are consistent, even without any extra training data.
Guaranteeing consistent communication between AI agents is now possible: a new certification protocol slashes disagreement by up to 96% by ensuring agents share a common understanding of terms.
LLMs can't reliably debug code in long contexts (64k-128k tokens) even with perfect information retrieval, despite impressive performance in agentic workflows that decompose the task.
Knowing VM lifetimes in advance doesn't always guarantee better placement, challenging common assumptions about clairvoyance in cloud resource optimization.
By predicting tracking models rather than image features, GOT-JEPA unlocks more robust object tracking, even when objects are heavily occluded or the environment is dynamic.
Scaling laws hold for interest modeling: bigger LLMs and more inference-time sampling consistently boost news recommendation quality, and can be distilled into smaller, deployable models.
Quantum databases are no longer just a theoretical exercise: Qute shows real speedups over classical databases on real quantum hardware.
LLM development teams often resort to workarounds and augmentation strategies when faced with the practical challenges of integrating domain experts, revealing a gap between ideal participatory design and real-world constraints.
By explicitly prompting for reflection on failure, ERL unlocks up to 81% better performance in complex RL tasks and 11% gains in tool-using reasoning.
On-policy RL (GRPO) makes LLMs significantly better at vulnerability detection than SFT or preference optimization, outperforming even strong zero-shot baselines.
Language models can now internalize experiential knowledge and system prompts more effectively through on-policy context distillation, leading to better task accuracy and out-of-distribution generalization.
Speech recognition models stumble badly on real-world street names, especially for non-English speakers, but a simple synthetic data boost can dramatically improve accuracy.
By explicitly detecting and escaping "Forbidden Zones" during training, AMD unlocks significant gains in sample fidelity and training robustness for few-step generative models like SDXL.
Ditch the army of task-specific models: AdNanny shows a single, reasoning-centric LLM can handle diverse offline advertising tasks with improved accuracy and reduced manual effort.
LLMs can get a 12% performance boost in low-resource languages by using a new framework that tailors data refinement, synthetic text generation, and continual pretraining to each language's digital footprint.
Most AI models, even those from top labs, fail to disclose critical safety information like deception behaviors and hallucination risks.
Reasoning-based safety guardrails, once thought to be a strong defense against jailbreaks, crumble with just a few strategically placed tokens.
Enterprise AI assistants can achieve zero data retention, but the architectural and compliance paths taken by Salesforce and Microsoft reveal significant trade-offs.
Even the best LLMs fail more than 40% of the time when orchestrating multiple tools in realistic scenarios, revealing critical gaps in real-world agent capabilities.
Open-source biomolecular modeling just got a boost: RF3 closes the gap with AlphaFold3 in structure prediction, thanks to the new AtomWorks data framework.
Ditch the high-fidelity simulator: IRL-VLA uses a lightweight reward world model trained with inverse reinforcement learning to enable efficient and effective closed-loop RL training for autonomous driving.
LLMs can now automate structured reporting from nurse dictations and medical order extraction from doctor-patient consultations, thanks to two new open-source datasets and an agentic pipeline for generating realistic training data.
VLMs can be effectively adapted, even under data and compute constraints, to create a unified evaluator for video world models that rivals task-specific models and aligns well with human judgment.
LLMs and VLLMs can team up to generate synthetic image data so good that it beats state-of-the-art methods and boosts performance on rare classes and open-vocabulary object detection.
A foundation model trained on a million hours of geophysical data crushes operational weather forecasts while slashing compute costs.
A 1-bit LLM can match the performance of full-precision models, promising huge gains in efficiency.
ChatGPT-4 slashes data extraction time in scoping reviews by 66%, but don't ditch the human reviewers just yet.
LLMs can generate plain language summaries of scientific research that are as good as human-written ones, but easier to read.
Forget task-specific models: Magma, a single foundation model, now outperforms them in both UI navigation and robotic manipulation by bridging verbal and action abilities.