Power-law relationships in model scaling, emergent capabilities at scale, and compute-optimal training.
Quantum chemistry's density matrix approach reveals interpretable early warning signals of phase transitions in deep learning, from grokking to emergent misalignment.
LLMs spontaneously organize into brain-like functional units where the whole is greater than the sum of its parts, and destroying these synergistic cores cripples reasoning.
Training language models on individual children's language reveals that distributional and interactional linguistic features, not just dataset size, are key to efficient learning, mirroring factors that drive child language acquisition.
Forget full automation – the sweet spot for AI deployment is often partial automation, where humans and AI collaborate to minimize costs.
Forget painstaking hyperparameter tuning: this hypersphere parameterization lets you transfer a single learning rate across model sizes, depths, and even MoE architectures, slashing compute costs by 1.58x.
Scaling laws work so well because they capture the essence of computation, not the specifics of implementation, leading to a persistent efficiency arms race.
Scientific reasoning gains from prompt engineering are often mirages, driven by model-specific hacks that don't generalize.
LLMs exhibit polarity illusions without rational inference, suggesting that "good enough" processing and partial grammaticalization may suffice to explain these phenomena in both machines and humans.
Two heads are better than one: combining verbalized confidence and self-consistency with just two samples dramatically boosts uncertainty estimation in reasoning models, beating either signal alone even with much larger sampling budgets.
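The blurb above names two signals: the model's verbalized confidence and a two-sample agreement check standing in for self-consistency. A minimal sketch of how such signals could be combined (an illustrative combiner, not the paper's estimator; the 0.5 down-weighting is an arbitrary assumption):

```python
def two_sample_confidence(answer_1, conf_1, answer_2, conf_2):
    """Combine verbalized confidence with a two-sample agreement check.

    conf_1 / conf_2 are the model's self-reported confidences in [0, 1];
    agreement between the two sampled answers is a cheap stand-in for
    full self-consistency voting.
    """
    verbal = (conf_1 + conf_2) / 2.0               # average verbalized confidence
    agree = 1.0 if answer_1 == answer_2 else 0.0   # two-sample consistency
    # Down-weight the verbalized signal when the two samples disagree.
    # The 0.5 floor is an illustrative choice, not from the paper.
    return verbal * (0.5 + 0.5 * agree)
```

When the two samples agree the estimate is simply the mean verbalized confidence; when they disagree it is halved, so either signal alone can be overruled by the other.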
Forget rephrasing: stitching synthetic text into "megadocs" unlocks surprisingly better pre-training, especially for long-context tasks, and keeps improving as you scale.
Forget buying new GPUs – clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
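A hedged sketch of what context-length routing can look like in practice: dispatch each request to the smallest deployment whose context window fits it, so short prompts never pay for long-context overhead. The pool names and limits below are hypothetical, not from the paper:

```python
def route_by_context_length(prompt_tokens, deployments):
    """Send each request to the smallest deployment whose context fits.

    `deployments` is a list of (max_context_tokens, name) pairs sorted by
    ascending context size.
    """
    for max_ctx, name in deployments:
        if prompt_tokens <= max_ctx:
            return name
    raise ValueError("prompt exceeds every deployment's context window")


# Hypothetical fleet: a cheap short-context pool and a long-context pool.
FLEET = [(4096, "short-ctx-pool"), (32768, "long-ctx-pool")]
```

The energy saving comes from the routing policy, not the router code itself: most traffic is short, so most requests land on the cheaper pool.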
Optimizing multilingual training? Shapley values reveal the hidden cross-lingual transfer effects that current scaling laws miss, leading to better language mixture ratios.
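Shapley values here are the standard game-theoretic attribution over coalitions. A self-contained sketch of the exact computation on a toy "language coalition" utility (the toy utility is invented for illustration; it is not the paper's training objective):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley values by enumerating every coalition.

    `utility` maps a frozenset of players to a score. Exhaustive
    enumeration is only feasible for a handful of players, which is
    typically the regime of language-mixture studies.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Toy utility: each language contributes 1, plus a synergy bonus when
# "en" and "de" are trained together (invented for illustration).
def toy_utility(langs):
    return len(langs) + (1.0 if {"en", "de"} <= set(langs) else 0.0)
```

The cross-lingual transfer the blurb mentions shows up as the synergy term: "en" and "de" each absorb half of the bonus, which a per-language scaling law fit in isolation would miss.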
Forget quadratic attention: FEAT achieves state-of-the-art performance on structured data with linear complexity and 40x faster inference.
Masked diffusion language models can now achieve 21.8x better compute efficiency than autoregressive models, thanks to binary encoding and index shuffling.
Mamba-3 delivers a 1.8-point accuracy boost over competing models in downstream language tasks, proving that SSM-inspired techniques can unlock substantial performance gains without sacrificing inference efficiency.
LLMs' true power lies in the "unexplainable" – capabilities that exceed rule-based systems, challenging the pursuit of full interpretability.
Forget trial-and-error: this paper derives hyperparameter scaling laws for modern optimizers directly from convergence bounds, potentially automating and optimizing the hyperparameter tuning process.
Forget scaling laws: smaller, domain-adapted AI systems can mathematically outperform massive generalist models in real-world institutional settings, thanks to a non-monotonic relationship between model size and "institutional fitness."
Forget simple scaling laws: the compute-optimal number of parallel rollouts in LLM RL plateaus, revealing distinct mechanisms for easy vs. hard problems.
Re-training LLMs on their own generated content can fundamentally limit what they can learn, but only under specific, theoretically defined conditions related to generation quality.
Forget brute-force scaling: the secret to better educational AI agents lies in carefully structuring their roles, skills, and tools.
Nanofilaments can paradoxically aggregate due to entropic forces, defying the conventional wisdom that entropy always favors disaggregation at the nanoscale.
Language models seem to prefer truth not because they're seeking it, but because correct information is often easier to compress and more internally consistent.
RAG with small language models (<8B parameters) can be a net negative, as they often ignore retrieved context and even "forget" existing knowledge.
Prompt-based jailbreak attacks aren't just effective; they're shockingly efficient, outperforming optimization-based methods by more effectively navigating the prompt space.
AI electricity demand won't necessarily explode as AI scales – whether it does hinges on sustained efficiency improvements outpacing income-driven demand.
Row-normalized optimizers can match Muon's performance on large language models while being faster in large-token and low-loss regimes, offering a practical alternative for pre-training.
Forget parameter counts – the true memorization capacity of deep ReLU networks is fundamentally bounded by the product of squared width and squared depth, $W^2L^2$, scaling linearly with data size.
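Stated as a formula (a paraphrase of the one-line claim above with constants suppressed, not the paper's exact statement), the bound says a ReLU network of width $W$ and depth $L$ can memorize on the order of

```latex
N_{\text{mem}} \;=\; \Theta\!\left(W^2 L^2\right)
```

samples, so capacity scales linearly with dataset size once $W^2 L^2$ is matched to $N$, regardless of how those same weights are distributed into parameter count.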
Language models often disregard provided context, choosing instead to rely on potentially outdated or conflicting information learned during pre-training, revealing a critical flaw in their knowledge integration.
Chasing marginal MSE/MAE improvements on leaderboards may be blinding researchers to the real goal of time series forecasting: capturing temporal structure and supporting downstream decisions.
Forget elegant compression and unifying principles: AGI might just be a vast, brittle archipelago of specialized modules, mirroring how human experts actually operate.
FineRMoE achieves 6x higher parameter efficiency, 281x lower prefill latency, and 136x higher decoding throughput compared to strong baselines, demonstrating a significant leap in MoE performance.
Protein language models finally scale predictably: Reverse Distillation unlocks consistent gains by distilling large models into nested, Matryoshka-style embeddings guided by smaller, capacity-constrained models.
Multi-task learning's generalization boost comes from implicit regularization, effectively postponing the dreaded double descent.
You can accurately predict the NDCG of a 1B-parameter reranking model by only training models up to 400M parameters, unlocking massive compute savings.
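The usual recipe behind such predictions is a scaling-law fit: train several small models, fit a power law in log-log space, and extrapolate to the target size. Since NDCG is bounded in $[0, 1]$, one would more plausibly fit the error $1 - \text{NDCG}$; the sketch below is a generic least-squares power-law fit (the functional form and all numbers are illustrative, not the paper's):

```python
import math

def fit_power_law(sizes, errors):
    """Ordinary least squares in log-log space: error ≈ a * size**b."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope and intercept of the log-log regression line.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def predict(size, a, b):
    return a * size ** b
```

Fitting on models up to 400M parameters and calling `predict(1e9, a, b)` is the extrapolation step; the compute saving is exactly the cost of the 1B training runs you never launch.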
By strategically warming up residual connections layer-by-layer, ProRes unlocks faster and more stable pretraining for language models.
Overparameterization isn't just a quirk of deep learning; it's provably *necessary* for stable, robust classification, even for discontinuous functions.
Grokking isn't magic: it's all about neural nets learning to exploit the hidden symmetries baked into algorithmic tasks.
In resource-constrained Earth observation, smaller object detection models can outperform larger ones in both efficiency and accuracy, overturning common scaling law assumptions.
SignSGD can outperform SGD in linear regression when noise dominates, thanks to a unique "noise-reshaping" effect that steepens its compute-optimal scaling law.
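For reference, the two update rules being compared, in a dependency-free sketch. The "noise-reshaping" analysis itself is the paper's contribution; this only contrasts the updates that analysis is about:

```python
def sign(x):
    # Returns -1, 0, or +1; Python booleans subtract to ints.
    return (x > 0) - (x < 0)

def sgd_step(weights, grads, lr):
    # Plain SGD: the step is proportional to the gradient, noise and all.
    return [w - lr * g for w, g in zip(weights, grads)]

def signsgd_step(weights, grads, lr):
    # signSGD: discard magnitudes and keep only per-coordinate signs, so
    # a hugely noisy coordinate moves no further than a tiny clean one.
    return [w - lr * sign(g) for w, g in zip(weights, grads)]
```

That magnitude-discarding is why the noise-dominated regime is where the comparison gets interesting: signSGD's step statistics depend only on the probability each gradient coordinate has the right sign.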
LLMs don't just learn from novels; they amplify the unique qualities of fictional discourse, suggesting that the genre of training data significantly shapes AI outputs.
A principled framework for General World Models reveals the limitations of current systems and the architectural requirements for future progress.
LLMs can achieve the same accuracy with 16x less data by constraining their hidden-state trajectories to follow geodesics on a semantic manifold.
Weather models defy language model scaling trends: wider architectures and larger datasets yield bigger gains than deeper networks.
Tri-modal masked diffusion models can now be trained from scratch, achieving strong results in text generation, text-to-image, and text-to-speech, thanks to a systematic exploration of the design space and a novel SDE-based batch size reparameterization.
Protein language models, like LLMs, suffer from a "Curse of Depth," where deeper layers contribute surprisingly little to the final prediction, suggesting opportunities for more efficient architectures.
Robust generalization isn't as hard as you think: it only tweaks, rather than revolutionizes, the Lipschitz constant needed for smooth interpolation.
Forget tedious hyperparameter tuning: this spectral approach lets you transfer learning rates across model sizes for optimizers like AdamW and LAMB, making large-scale training far more efficient.
Don't count on unembedding matrix geometry to predict language model performance—it's more a reflection of training hyperparameters than inherent capabilities.
Encoder-decoder architectures can beat decoder-only transformers in novel view synthesis, overturning conventional wisdom with a compute-optimal design (SVSM) that slashes training costs.