Amazon's research arm covering ML, NLP, robotics, and cloud AI. Drives Alexa, AWS AI services, and logistics optimization.
Semantic watermarks, embedded via AMR, survive paraphrasing attacks that obliterate token-level watermarks.
Turns out, the best template for documenting architectural decisions depends on whether you value conciseness (Nygard) or structural detail (MADR).
Multimodal models can now achieve state-of-the-art performance on real-world tasks like document understanding and audio-video comprehension at significantly lower inference latency, thanks to novel token-reduction techniques.
Directly embedding quantile tokens into input sequences leads to sharper and more accurate distribution predictions, outperforming traditional methods by a substantial margin.
Upcycling MoE models can achieve the same performance as larger fixed-size models while cutting GPU costs by 32%.
Current red-teaming efforts miss the forest for the trees: ARES reveals that safety failures often stem from a systemic breakdown between the LLM *and* the reward model, not just the LLM itself.
Targeted neuro-symbolic integration can reduce content bias in syllogistic reasoning, achieving over 94% accuracy while cutting content effects by 16%.
RAG systems are stuck in a factual echo chamber, ignoring the rich tapestry of opinions that shape real-world understanding.
LLMs can now autonomously translate entire C projects to Rust with near-perfect accuracy, thanks to a novel agentic framework that dynamically navigates dependencies and iteratively verifies translations.
Domain-specific fine-tuning can induce "agentic collapse" in LLMs, but a surprisingly small amount of agentic data from *another* domain can bring those general tool-use skills roaring back.
Forget wrestling with language-specific tooling: ReCodeAgent autonomously translates and validates entire code repositories across diverse languages with a 60% boost in test pass rates.
Speculative decoding's speed boost just got a whole lot bigger: DIVERSED dynamically loosens the verification constraints, letting more good tokens through and accelerating inference.
Prime Video's new anomaly detection system spots real incident-related services missed by traditional load testing, proving that synthetic traffic can't always predict live event behavior.
LLMs editing code are far more reliable and efficient when manipulating ASTs instead of raw text, slashing invalid patches and token costs.
LLMs aren't culture-aware reasoners, but biased translators: they generate stereotyped metaphors and default to Western perspectives even when prompted with specific cultural identities.
Ditch the computational bloat: DeltaWorld slashes parameters by 35x and FLOPs by 2000x while generating more realistic video futures.
LLMs can boost code performance by 25%, but only when working *with* compilers in a carefully orchestrated multi-agent system.
LLMs can automatically generate web vulnerability detection rules with surprisingly high accuracy, but only with careful validation and human oversight to mitigate overconfidence.
Recommending popular items isn't always what users want: SPREE steers sequential models to align with individual users' preferences for popular or niche content, improving recommendations.
LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.
Memory-augmented LLMs get a strategic upgrade: MemMA uses multi-agent reasoning to proactively guide memory construction and repair, leading to significant performance gains.
Agentic LLMs are surprisingly vulnerable: a new framework finds successful attacks in 84% of attempts by escalating prompt injection techniques across multiple stages.
Achieve minute-level navigable video world models by combining the strengths of explicit 3D patch memory with implicit generative modeling.
Achieve near-full light throughput in spectral imaging with a novel oscillating dispersion technique and deep unfolding network, enabling high-fidelity reconstruction even under light-starved conditions.
Achieve 50% bitrate savings in ultra-low-bitrate image compression by cleverly turning image decoding into a next-frame prediction problem using video diffusion priors.
LoRA fine-tuning can significantly boost the voice cloning capabilities of LLM-based TTS systems, but only if the training data is acoustically diverse enough.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
MC3D models can now generalize to unseen camera configurations thanks to a new framework that explicitly accounts for spatial prior discrepancies.
Save 20% on LLM costs with <2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM.
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
LLM-based recommender systems can trigger users' personal trauma, phobias, or self-harm history, but a new framework cuts these safety violations by 96.5% while maintaining recommendation quality.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Injecting knowledge graphs into LLMs boosts medical question generation by 8%, suggesting a simple way to patch up LLM knowledge gaps.
Forget Bonferroni: a new sequential testing approach slashes audit times for multi-stream ML systems, especially when anomalies are widespread.
Despite matching or exceeding human expert performance on generating potential diagnoses, current MLLMs struggle to synthesize multimodal clinical evidence for final diagnosis, revealing a critical gap in their clinical reasoning abilities.
Latent reasoning models often take shortcuts to achieve high accuracy, and stronger supervision, while mitigating this, paradoxically restricts the diversity of their latent representations.
Stop training your M3OD models on the same old entangled data: this method decomposes and recomposes objects, scenes, and camera poses to generate diverse training examples on the fly, boosting performance without needing more real-world data.
Soft pseudo-labels, theoretically equivalent to hard labels when perfectly calibrated, tank performance in cross-domain semantic segmentation, motivating a new calibration framework.
Forget fine-tuning: inject targeted time-series insights into general LLMs and watch their reasoning skills skyrocket by up to 26%.
Static benchmarks can be fooled by fluent text and aligned citations, but DREAM uses agentic evaluation to expose this capability mismatch when assessing the temporal validity and factual correctness of research agents.
LLMs may ace the test, but their uncertainty estimates are far from perfect, raising serious concerns about their reliability in high-stakes educational assessments.
An end-to-end system extracts funny scenes from movies with 87% accuracy, opening new avenues for automated content repurposing.
Stop hand-rolling your multi-task learning-to-rank models: DeepMTL2R provides a ready-to-use framework with 21 SOTA algorithms and Pareto-optimal optimization.
Give new e-commerce products a warm start by borrowing behavioral signals from their substitutes, boosting search relevance and product discovery.
Object hallucination in MLLMs can be significantly reduced by simply masking salient visual features during contrastive decoding.
MLLMs can now reason about road traffic accidents by fusing remote sensing imagery and structured data, unlocking interpretable insights previously inaccessible to traditional methods.
Pinpointing the root causes of supply chain anomalies just got easier: a Shapley value-based attribution mechanism rapidly decomposes simulation outputs into individual input effects.
Open-source LLMs can now autonomously optimize AI accelerator kernels, matching the performance of proprietary models at a fraction of the cost.
AI-generated feedback on student portfolios from GPT-4o and Claude-Sonnet-4 shows promise for high-stakes clinical assessments, but careful evaluation is needed to ensure accuracy and educational value.
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates by 25.6% on average, even when the content is equivalent.
Achieve up to 39.6% FLOP reduction in LLM inference without retraining or architectural changes using QuickSilver's dynamic token-level optimizations.
By focusing on the most challenging examples, CRPO significantly boosts machine translation accuracy and data efficiency compared to standard preference optimization techniques.