We track OpenAI, DeepMind, Anthropic, and 17 other labs daily, with AI-powered summaries, trend charts, and a weekly digest.
We read everything so you don't have to. One email, zero noise.
Reward hacking is rampant in agent benchmarks, but a novel hacker-fixer loop can eliminate exploits and ensure robust verifier performance.
VLM agents exhibit vastly different skill evolution patterns, revealing that initial performance scores can be misleading without considering improvement dynamics.
Latent spatial memory can accelerate video generation by over 10 times while dramatically reducing memory usage, revolutionizing how we model dynamic scenes.
Despite the advancements in multimodal agents, even the best models struggle with interactive spatial reasoning, achieving only a 17.4% success rate in complex real-world tasks.
FlashMemory-DeepSeek-V4 slashes GPU memory usage by over 90% for ultra-long contexts while enhancing model accuracy.
Raw context outperforms compact memory designs, revealing that memory structure is crucial for effective video generation in action-conditioned models.
Synthetic data can bootstrap NMT models for low-resource languages, but without authentic inputs, they risk overfitting to rigid structures and losing semantic depth.
Personalization is key: agents struggle with multi-app tasks, achieving only 37% accuracy despite an overall score of 52%.
Merging sparse local telemetry with a high-resolution dataset can dramatically enhance forecasting accuracy in volatile cloud-edge environments.
AdvGRPO enables robust attacker-defender co-training that significantly improves defender performance on safety benchmarks while generating effective attacks.
PRISM reveals the hidden instructions guiding LLM behavior, outperforming traditional methods in security-critical contexts.
Steganographic payloads in LLMs that have evaded traditional detection methods can be detected again by strategically recontextualizing the data.
Delegation intelligence in LLMs can significantly enhance their ability to tackle complex, long-horizon tasks, as demonstrated by SearchSwarm's superior performance on challenging benchmarks.
Agents struggle to orchestrate GUI, CLI, and code operations, with top models only achieving a 41.2% success rate on real-world tasks.
Contextual grounding in defect classification can elevate accuracy to over 98%, transforming a traditionally ambiguous task into a precise science.
Rethinking data work through a reparative justice lens reveals that accountability, not just algorithms, should be at the heart of AI safety efforts.
Achieving a 20.24 percentage-point improvement in task success rates while slashing inference costs by 92% could revolutionize automated documentation verification in cloud environments.
SIFT cuts retrieval costs by 24,000x while boosting response speed by 1.71x, all without sacrificing accuracy.
EgoPressureDiff not only outperforms traditional methods but also effectively resolves visual-physical ambiguities in grasp pressure estimation for complex 3D interactions.
Routine encounter metadata can trigger high rates of verbatim memorization and sensitive diagnosis recovery in medical LMs, raising serious privacy concerns.
No single TTS model excels across all low-resource languages, revealing a critical gap in synthetic speech quality that demands targeted solutions.
Reliable civil court judgments can now be simulated with a framework that adapts to the complexities of legal claims and remedies.
Multiplex semantic networks reveal that creativity is not a one-dimensional trait but a complex interplay of diverse cognitive tasks, with significant implications for how we assess and understand creative potential.
A novel unified energy framework corrects distribution shifts in diffusion models, outperforming traditional auto-regressive methods.
INFUSER outperforms a frozen 32B model with just an 8B co-evolving generator, showcasing the power of adaptive question generation in self-evolution.
Text world models can transform LLM-based agents from reactive responders to proactive planners, fundamentally changing how they interact with complex environments.
Language confusion in LLMs can be effectively reduced without fine-tuning, enhancing multilingual performance while maintaining output quality.
Calibrated safety flags in medical summaries can reduce unflagged omissions by up to 5 times compared to existing methods, enhancing clinician confidence in LLM outputs.
A security-first API development framework can cut security incidents by 30% and post-release vulnerabilities by 40%, transforming how organizations defend against API threats.
HDSL achieves a remarkable reduction in editing token usage by over 5 times while maintaining scene integrity and enhancing generation speed.
Treating raw visual images as action representations, iMac significantly boosts prediction accuracy and task success in robotic manipulation, outperforming traditional action vector methods.
MAVIS redefines video retrieval by enabling agents to collaboratively reason and refine candidate selections, outperforming traditional methods without task-specific tuning.
CP4D achieves photorealistic 4D scene generation by seamlessly integrating static environments with dynamic objects, outperforming existing methods in visual fidelity and physical consistency.
Z-Reward achieves nearly 90% human preference accuracy by transforming subjective visual preferences into nuanced score distributions, outperforming traditional reward models.
Achieving long-range consistency in video generation without excessive computational overhead is now feasible with MilliVid's hierarchical token approach.
An inexpensive time-of-flight camera can achieve reliable stabilization of an inverted pendulum, challenging the assumption that high-resolution sensors are necessary for precise control.
VAIC enables humanoid robots to perform complex object interactions in real-world settings without the need for perfect state information, achieving superior performance across diverse tasks.
Achieving high-quality text-to-speech synthesis without intermediate representations, BareWave shows that direct waveform generation can rival traditional methods in intelligibility and naturalness.
Systematic gaps in AI evaluation reporting are exposed, revealing inconsistencies that hinder reliable comparisons across thousands of models and benchmarks.
Reusing training data during inference can boost imitation learning performance by up to 46%, transforming how we approach generalization in AI systems.
Muon outperforms Adam and SGD by yielding features that are not only more robust but also transfer more effectively across tasks.
MetaSeq achieves a 45% improvement in response accuracy for acoustic metamaterial design by treating structures as sequences, revolutionizing how we approach inverse design in this field.
LexRubric reveals that even state-of-the-art LLMs struggle with open-ended legal tasks, exposing critical gaps in their contextual understanding and reasoning abilities.
RPA is redefined as a closure approximation to the Hessian of an effective functional, revealing deep connections among major quantum many-body theories.
By integrating dynamic meteorological semantics, LangRetrieval transforms satellite-to-radar retrieval accuracy, adapting in real-time to complex weather patterns.
CapRL++ redefines caption quality through utility, enabling models to produce high-fidelity descriptions without the constraints of traditional supervised fine-tuning.
Prefix failure in on-policy distillation can be effectively mitigated by correcting problematic prefixes, leading to significant improvements in reasoning coverage and accuracy.
Token recovery in continuous language diffusion models hinges on navigating a high-margin basin, revealing hidden failures in traditional evaluation metrics.
AI systems currently miss critical temporal and interpretive elements of clinical reasoning, limiting their effectiveness in real-world healthcare settings.
Current clinical AI systems often neglect the temporal dimension of patient care, limiting their effectiveness in longitudinal reasoning.
Distribution shifts in adaptation data can amplify privacy risks in LLMs, challenging the effectiveness of differential privacy guarantees.
Images can serve as a powerful standalone medium for reasoning, achieving nearly double the token efficiency of traditional text methods.
Sparse rewards can be transformed into actionable turn-level feedback, enabling agents to learn from both successful and misleading actions in long-horizon tasks.
SkeMex enables medical agents to evolve their reasoning capabilities by transforming raw experience into structured, reusable skills, outperforming traditional memory systems.
SwiftVR achieves real-time 1080p video restoration on consumer GPUs, a first in the field, while maintaining high perceptual quality and low inference costs.
Transforming uninformative reward signals into actionable insights, Reasoning Arena boosts reasoning performance while slashing training costs.
AHA-WAM achieves a remarkable 4.59x speedup in closed-loop control while maintaining high success rates in complex manipulation tasks, redefining efficiency in robot action execution.
LCLMs redefine the efficiency of long-context inference, achieving superior compression without sacrificing model quality.
TNOs redefine operator learning by leveraging topological structures, achieving superior accuracy in complex PDE problems while maintaining physical fidelity.
Dri-MED adapts to user preferences and context drifts, achieving significantly lower regret than traditional methods in dynamic environments.
PRIME reveals a crucial precursor to reward hacking that can predict and adapt to misalignment before it manifests, offering a new lens on alignment risks in RL systems.
Forward gradient estimators can train quantum neural networks orders of magnitude more efficiently than conventional methods, revolutionizing gradient estimation in PQCs.
Holographic reduced representations enable neural networks to achieve superior disentanglement by treating representations as symbolic structures, outperforming traditional methods in robustness and quality.
Invariance in diffusion models peaks at intermediate noise levels, revealing a critical link between representation quality and classification performance.
Learning can emerge directly from a system's physical responses without the need for explicit backpropagation or centralized processing.
Transformers require a surprisingly high number of examples for effective chain-of-thought learning, challenging assumptions about their efficiency.
Typographic tricks can make harmful content invisible to LLMs while remaining easily recognizable to humans, exposing a major flaw in current moderation systems.
Optimizing concept alignment is a multi-objective challenge, and surprisingly, just 0.1% of paired data can yield strong instance-level alignment when done right.
Transition-based modeling reveals a more data-efficient approach to predicting Alzheimer's disease progression, outperforming traditional sequence models in accuracy.
Achieving a rate-optimal queue length regret of $\widetilde{\mathcal{O}}(T^{-1/2})$ reveals that strategic exploration can significantly enhance learning efficiency in contextual queueing environments.
Video LLMs struggle to correct mistakes in real-time cooking tasks, but a new synthetic dataset can dramatically enhance their performance.
Mapping pure differential privacy to Gaussian differential privacy reveals that a conservative $\mu$ value can significantly enhance privacy without sacrificing performance.
Code generation requires a unique approach to uncertainty estimation, as a single wrong token can disrupt an entire program's functionality.
FMplex achieves up to 80% lower latency while hosting six times more tasks by virtualizing foundation models for efficient resource sharing.
The new REO framework reveals that the true challenge in differential equation discovery lies not just in recovering equations, but in leveraging them to reshape scientific understanding.
Safe-RULE can effectively neutralize the impact of data poisoning in offline Safe RL, enhancing safety without the need for retraining.
The 1-block STGCN variant achieves superior performance with significantly lower computational costs, challenging the assumption that deeper architectures are always better for traffic prediction.
Transitioning from Dirac mass initial conditions, this framework achieves robust approximations of Fokker-Planck solutions, even in the face of numerical instabilities.
Elo rankings can accurately reflect model performance, achieving over 90% correlation with ground-truth accuracy, even amidst stylistic biases.
Soft prompt distillation outperforms traditional safety alignment methods, achieving high safety without the heavy resource burden of dual-model systems.
Finetuning LLMs on narrow safety tasks can induce emergent alignment, revealing significant differences in how well ethical personas project across various alignment strategies.
GraMO achieves unprecedented accuracy in long-horizon predictions by seamlessly integrating spatial and temporal dynamics in a single model.
SAILS reveals the functional forms of feature interactions in machine learning models, transforming how we interpret model behavior beyond mere detection.
Low-KL agreement can trap models in ineffective training regimes, but KAT offers a dynamic solution that boosts accuracy while slashing rollout lengths.
Cross-tokenizer On-Policy Distillation achieves superior efficiency and flexibility, enabling knowledge transfer between diverse model families without the constraints of shared tokenizers.
SG-OPD boosts on-policy distillation performance by leveraging a binary verifier, leading to substantial gains in mathematical reasoning tasks.
BSTabDiff achieves superior synthetic data generation in HDLSS contexts by intelligently leveraging block-subunit structures to capture complex dependencies.
Achieving over 4× acceleration in full-model inference while using 10× fewer measurements could revolutionize tensor program optimization.
T-GAN not only decodes complex in-possession phases in football but also reveals that sequence modeling is crucial for enhancing segmentation quality.
A single outlier constraint can derail the learning process in LLMs, but a new reward structure can turn that weakness into a strength, boosting solving rates dramatically.
A single late fusion layer is enough to maintain strong multimodal performance, drastically reducing unnecessary visual computation in deep language models.
FASE achieves a 25% improvement in correlation with ground-truth test cases while slashing computational costs to a fraction of traditional methods.
Claw-R1 transforms agentic RL by treating interaction data as valuable assets, enabling real-time inspection and curation for optimized training.
RAM achieves an 86% F1-score in pose reachability with nanosecond inference, revolutionizing how robots can adapt to diverse morphologies in real-time.
Quantum policies can be trained to be significantly safer and more efficient than classical ones by minimizing reliance on safety filters, revealing the true source of safety in learned controllers.
Lowering energy barriers in neural training could revolutionize how we approach deep learning efficiency, achieving backpropagation-level performance with significantly reduced energy costs.
Anomaly detection can be dramatically improved by leveraging visual prompting and synthetic data, achieving a notable 3.5% boost in performance on the AeBAD dataset.
AI's energy appetite could surge by up to 723 TWh by 2050, threatening Europe's net-zero ambitions unless strategic policy interventions are implemented.
Next-token prediction not only excels in sleep stage classification but also generalizes to daytime physiology, outperforming specialized models in critical health tasks.
SRT boosts semantic diversity in AI-generated content by up to 167%, challenging the trend of creative homogenization.