Power-law relationships in model scaling, emergent capabilities at scale, and compute-optimal training.
Transformers can be explicitly designed to perform nonlinear regression in-context by leveraging attention as a featurizer, offering a theoretical understanding of how these models learn complex relationships from prompts.
Infinite-width approximations, a cornerstone of neural network theory, crumble much faster in recurrent models than previously thought, failing beyond a depth of order $\sqrt{n}$.
Scaling clinical LLMs doesn't guarantee safety: high-risk errors persist even with advanced RAG and max-context prompting, highlighting the critical role of evidence quality and deployment strategy.
Forget separate structure and fidelity models – Khala shows you can generate high-quality music with text-vocal alignment using a single acoustic-token hierarchy.
Achieve perfect train-test error tracking with a new training algorithm, Decoupled Descent, that eliminates the need for validation sets in certain stylized settings.
Machine learning can turn sparse simulation data into a complete phase diagram for collective motion models, revealing nuanced phase boundaries.
Subword tokenization's secret sauce isn't just vocabulary size – it's the boosted training throughput and the subtle linguistic priors baked into subword boundaries.
Language diffusion models aren't just generative, they're associative memories that reveal a sharp memorization-to-generalization transition detectable via conditional entropy.
Forget scaling laws: this study reveals a detailed empirical map of *when* and *why* transformers succeed or fail at in-context learning, highlighting the crucial interplay of dimensionality, signal strength, and contextual information.
Chain-of-Thought reasoning in Transformers hits a surprising expressivity ceiling when generalizing to longer sequences, unless you let your vocabulary grow with the problem size and use "signpost" tokens.
Unstructured pruning isn't just about shrinking LLMs; it can actually *boost* their reasoning abilities during test-time scaling, outperforming even the full, unpruned models.
LLMs from different vendors and sizes secretly speak the same statistical language, enabling a blazing-fast, model-agnostic output verification method.
Probabilistic Transformers can now scale to 0.4B parameters and beat standard Transformers of the same size, thanks to a hyperparameter transfer trick.
Forget training from scratch: HyLo lets you breathe new (long-context) life into your existing Transformer LLMs, achieving 32x context extension and 90% KV-cache reduction.
Forget pruning by variance: high-variance activations in transformers are surprisingly uncorrelated with predictive power.
Depth in neural networks isn't just about the final output; this work shows how each intermediate layer can be a progressively refined approximation, with error explicitly tied to the layer's geometric scale.
Despite architectural differences, language models exhibit convergent evolution by learning similar periodic features for number representation, but achieving geometric separability depends on subtle training factors.
Image generators aren't just for making pretty pictures; they're secretly state-of-the-art vision learners, rivaling specialized models in tasks from segmentation to depth estimation.
Looping a language model block four times only gives you the effective capacity of 1.4 additional unique blocks, but costs as much to train as 2.4.
Forget training from scratch: Nexusformer lets you scale Transformers by nonlinearly expanding attention, inheriting knowledge and slashing compute by up to 41.5%.
Forget scaling laws: strategically equipping small language models with tools delivers a better performance/cost tradeoff than simply scaling up or deploying multi-agent systems.
Upcycling MoE models can achieve the same performance as larger fixed-size models while cutting GPU costs by 32%.
LLMs waste compute on tokens that have already "figured it out" – DASH selectively skips these tokens during prefill, speeding things up without retraining or sacrificing accuracy.
Unveiling the "topological dual of a dataset" provides a Rosetta Stone for neuro-symbolic AI, promising to unlock mechanistic interpretability and overcome scaling bottlenecks.
Training on mixed-complexity datasets can yield up to 5x sample efficiency in low-data regimes, challenging conventional wisdom about data quantity in LLM fine-tuning.
Forget expensive compression trials – a simple spectral statistic can accurately predict how much your LLM will degrade *before* you even compress it.
Fine-tuned small language models can reliably generalize to larger and structurally distinct graphs, maintaining strong performance in graph property estimation.
TriMix reveals that prioritizing small, specialized models can dramatically improve low-resource language adaptation, overturning the assumption that bigger models always lead the way.
LLMs' surprising grammatical struggles aren't due to inherent limitations, but rather to a lack of exposure to specific linguistic structures in their training data – a problem fixable with just a tiny amount of targeted data augmentation.
LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.
Decomposing LLMs doesn't have to mean sacrificing inference speed: DeInfer unlocks efficient parallel inference for these models.
RankUp tackles representation collapse in deep recommender systems, unlocking significant GMV gains in real-world deployments by strategically boosting the effective rank of token representations.
GSQ closes the accuracy gap in low-precision quantization, achieving results comparable to complex vector methods while remaining easy to implement.
LLMs can achieve up to 2x inference speedup without retraining by intelligently sharing KV cache states during early exit, sidestepping the usual performance bottlenecks.
By embedding attention within a recurrent state, Sessa unlocks power-law memory decay and selective retrieval capabilities previously unattainable by either Transformers or Mamba-style models alone.
Generative AI's "black box" nature isn't a bug, it's a feature stemming from a fundamental mismatch between user expectations and the technology's statistical foundations.
LLM agent systems can achieve up to 76% speedups and significantly reduced hotspot miss rates by intelligently caching logits and scheduling compute resources based on agent behavior.
LLM scaling bottlenecks demand a shift towards cloud-native architectures and distributed systems, unlocking potential gains from serverless inference and quantum computing.
On-policy distillation makes language models more accurate, but also dangerously overconfident, revealing a fundamental tension between capability and calibration.
Output diversity in post-trained models collapses due to training data composition, not just post-training methods, challenging assumptions about inference-time fixes.
Forget static datasets: co-evolving LLMs and tasks unlocks a broader range of expert capabilities than hand-crafted training pipelines, all while using less GPU memory.
LLM agent simulations are no longer black boxes: CAMO reveals the hidden causal pathways from individual agent actions to emergent social behaviors.
Stop retraining from scratch: WeiT lets you initialize models of *any* size with SOTA performance, adapting pre-trained knowledge to your specific compute budget.
Agentic coding gets a serious boost: distilling and reusing rollout trajectories lets Claude-4.5-Opus jump from 70.9% to 77.6% on SWE-Bench Verified.
Learning rate decay, a common optimization technique, might be the culprit behind catastrophic forgetting in LLMs during fine-tuning.
LLMs hit a "reasoning collapse" where accuracy plummets by over 50% on classical reasoning tasks once complexity exceeds a surprisingly low threshold, even with deterministic validation.
Looped language models can now rival Transformers in quality at a fraction of the parameter count, thanks to a new architecture that tames their notorious instability.
Scaling up LLMs doesn't uniformly improve context handling; instead, it paradoxically amplifies the tendency to copy irrelevant tokens while simultaneously improving resistance to misinformation.
LLMs can achieve competitive performance simply by optimizing data mixing strategies as a graph-constrained optimization problem.
LLMs learn skills in a surprisingly consistent order during pretraining, revealing a hidden curriculum that's predictable across models and readable from their internal representations.