Scaling Laws & Emergent Abilities
Capabilities
Power-law relationships in model scaling, emergent capabilities at scale, and compute-optimal training.
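A common parametric form behind such scaling-law and compute-optimal analyses is the Chinchilla-style loss fit sketched below; the symbols are generic placeholders and are not tied to any specific paper listed here.

```latex
% Parametric scaling law in model size N and training tokens D
% (E, A, B, \alpha, \beta are constants fitted to training runs)
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Minimizing L under a compute budget C \approx 6ND gives the
% compute-optimal allocation of parameters and tokens:
N^{*} \propto C^{\beta/(\alpha+\beta)}, \qquad D^{*} \propto C^{\alpha/(\alpha+\beta)}
```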
Recent Papers
This paper investigates in-context learning in LLMs by framing it as Gaussian Process (GP) regression, using controlled experiments with function samples drawn from known GP priors. The authors compare LLM prediction error against empirical GP-regression (lower bound) and 1-NN (upper bound) baselines, finding that LLM learning curves approach the GP lower bound as the number of demonstrations grows. They also analyze LLM inductive biases via likelihood analysis, revealing a preference for less smooth GP kernels, and demonstrate that post-training can shift these biases to improve sample efficiency on smoother kernels.
Quantifies the extent to which LLMs behave like GP learners and provides methods for steering their inductive biases for continuous function learning tasks.
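A minimal sketch of the evaluation setup described above, assuming scikit-learn for the GP-regression (lower-bound) and 1-NN (upper-bound) baselines; the kernel choice, data split, and comparison against an LLM's predictions are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
kernel = RBF(length_scale=1.0)  # illustrative GP prior kernel

# Draw one function from the known GP prior and split into demonstrations + query.
x = np.sort(rng.uniform(-3, 3, size=33)).reshape(-1, 1)
y = GaussianProcessRegressor(kernel=kernel).sample_y(x, random_state=0).ravel()
x_demo, y_demo, x_query, y_query = x[:-1], y[:-1], x[-1:], y[-1:]

# Lower-bound baseline: GP regression with the true kernel held fixed.
gp_fit = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(x_demo, y_demo)
err_gp = (gp_fit.predict(x_query) - y_query) ** 2

# Upper-bound baseline: 1-nearest-neighbour regression on the same demonstrations.
nn_fit = KNeighborsRegressor(n_neighbors=1).fit(x_demo, y_demo)
err_nn = (nn_fit.predict(x_query) - y_query) ** 2

# err_llm (from prompting an LLM with the same (x, y) demonstrations) would be
# compared against err_gp and err_nn as the number of demonstrations grows.
print(err_gp.item(), err_nn.item())
```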
This paper investigates the impact of model and data scaling on multilingual machine translation (MT) performance using open large language models (LLMs). The authors adapt Gemma3 models via continual pretraining and instruction finetuning, creating MiLMMT-46, a model covering 46 languages. Results demonstrate that MiLMMT-46 surpasses existing open-source SOTA models and rivals proprietary systems like Google Translate and Gemini 3 Pro in multilingual translation quality.
Demonstrates that scaling model size and training data via continual pretraining and instruction finetuning significantly improves the multilingual translation capabilities of open LLMs, achieving performance competitive with proprietary systems.
This paper addresses the limited generalization of diffusion-based policies in semantic manipulation by introducing bounding-box instructions that guide the policy's attention to target objects. The authors develop Label-UMI, a handheld segmentation device with an automated annotation pipeline, to efficiently collect demonstration data with semantic labels. In real-world experiments, they demonstrate improved generalization and adaptability with a semantic-motion-decoupled framework and reveal a power-law relationship between generalization performance and the number of bounding-box objects, achieving 85% success rates across various tasks.
Demonstrates that bounding-box guided diffusion policies, trained on large-scale datasets collected with a novel handheld segmentation device, significantly improve generalization in semantic manipulation tasks and exhibit a power-law scaling relationship.
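A small sketch of how such a power-law relationship between object count and success rate is typically fit, via linear regression in log-log space; the data points below are invented placeholders, not numbers from the paper.

```python
import numpy as np

# Hypothetical (num_bbox_objects, success_rate) pairs; not the paper's data.
n_objects = np.array([8, 16, 32, 64, 128])
success = np.array([0.52, 0.61, 0.70, 0.78, 0.85])

# Fit success ≈ a * n^b by regressing log(success) on log(n).
b, log_a = np.polyfit(np.log(n_objects), np.log(success), deg=1)
a = np.exp(log_a)
print(f"success ≈ {a:.3f} * n^{b:.3f}")
```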
The paper introduces Recursive Self-Aggregation (RSA), a novel test-time scaling method for LLMs that iteratively refines a population of reasoning chains by aggregating subsets of solutions. RSA leverages information from intermediate reasoning steps to bootstrap from partially correct chains of thought, combining parallel and sequential scaling benefits. Empirical results demonstrate that RSA significantly improves performance across various tasks and models, enabling smaller models like Qwen3-4B to compete with larger reasoning models.
Introduces Recursive Self-Aggregation (RSA), a novel inference-time scaling method that recursively aggregates and refines reasoning chains to improve LLM performance.
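A rough sketch of the recursive aggregation loop as described above; the `generate` callable and the aggregation prompt are hypothetical stand-ins for the actual models, prompts, and hyperparameters.

```python
import random

def rsa(problem, generate, population_size=8, subset_size=4, rounds=3):
    """Recursive Self-Aggregation, sketched from the description above.

    `generate(prompt) -> str` is a hypothetical call to an LLM that returns
    one reasoning chain; the real method's prompts and settings differ.
    """
    # Round 0: sample an initial population of reasoning chains in parallel.
    population = [generate(problem) for _ in range(population_size)]

    for _ in range(rounds):
        new_population = []
        for _ in range(population_size):
            # Aggregate a random subset of chains into one refined chain,
            # letting the model reuse partially correct intermediate steps.
            subset = random.sample(population, k=subset_size)
            prompt = (
                f"Problem: {problem}\n\n"
                + "\n\n".join(f"Candidate solution {i + 1}:\n{s}" for i, s in enumerate(subset))
                + "\n\nCombine the useful steps above into a single improved solution."
            )
            new_population.append(generate(prompt))
        population = new_population

    return population  # a final answer can be picked by majority vote or scoring
```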
The paper investigates the data requirements for reasoning in sub-billion parameter language models, challenging the assumption that massive datasets (>10T tokens) are necessary. The authors show that strong reasoning abilities can emerge from roughly 2T tokens of carefully curated and resampled open-source data, far less than is commonly used. The resulting MobileLLM-R1 models achieve state-of-the-art performance among open-source sub-billion parameter models, even surpassing larger models trained on much larger datasets.
Demonstrates that strong reasoning capabilities can emerge in sub-billion parameter language models with significantly less data than previously believed by carefully curating and resampling open-source datasets.
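A minimal sketch of quality-weighted resampling toward a fixed token budget, assuming a per-document quality score; the paper's curation and resampling recipe is more involved than this.

```python
import numpy as np

def resample_to_budget(docs, token_counts, quality_scores, budget_tokens, seed=0):
    """Sample documents (with replacement) in proportion to a quality weight
    until a target token budget (e.g., ~2T tokens) is reached.

    `quality_scores` is a placeholder for whatever curation signal is used.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(quality_scores, dtype=float)
    probs /= probs.sum()

    chosen, total = [], 0
    while total < budget_tokens:
        i = rng.choice(len(docs), p=probs)
        chosen.append(docs[i])
        total += token_counts[i]
    return chosen
```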
This paper benchmarks the energy consumption of 14 LLMs (7B-72B parameters) on the MMLU benchmark across five subjects, measuring CO2 emissions using the Perun framework on NVIDIA A100 GPUs. The study finds a strong positive correlation between model size, reasoning capabilities, token generation, and CO2 emissions, with larger, reasoning-enabled models achieving higher accuracy (up to 84.9%) at the cost of significantly increased energy usage. Subject-level analysis reveals that symbolic domains like Abstract Algebra are particularly computationally expensive and yield lower accuracy, underscoring the need for more efficient reasoning strategies.
Quantifies the relationship between LLM size, reasoning performance, token generation, and CO2 emissions across a range of models on the MMLU benchmark.
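A small sketch of the kind of correlation analysis reported above; the measurements below are placeholders rather than the paper's numbers, and the actual energy readings come from the Perun framework on A100 GPUs.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder per-model measurements: parameters (billions), tokens generated, kg CO2.
params_b = np.array([7, 14, 32, 72])
tokens_gen = np.array([1.2e6, 1.8e6, 2.9e6, 4.1e6])
co2_kg = np.array([0.8, 1.5, 3.2, 6.9])

for name, x in [("model size", params_b), ("tokens generated", tokens_gen)]:
    r, p = pearsonr(x, co2_kg)
    print(f"CO2 vs {name}: r={r:.2f} (p={p:.3f})")
```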
This paper introduces a population-based evolutionary framework for adapting large language models (LLMs) to new tasks, drawing inspiration from natural evolution. The framework evolves a population of LLMs through crossover, mutation, selection, and succession operations, enabling rapid adaptation with limited data (200 samples per task) and without gradient-based optimization. Experiments across 12 datasets demonstrate that the evolutionary approach outperforms existing LLM merging and adaptation techniques, achieving accuracy improvements of up to 54.8% compared to the initial LLM population.
Introduces a novel population-based evolutionary framework for adapting LLMs to new tasks, demonstrating its effectiveness in low-data regimes and its ability to generalize to unseen tasks.
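One way to make the crossover/mutation/selection/succession loop concrete is a generic, gradient-free evolutionary search over candidate parameter vectors (for example, model-merging coefficients); the sketch below is illustrative and does not reproduce the paper's exact operators.

```python
import random

def evolve(init_population, fitness, generations=10, pop_size=8, mut_sigma=0.05):
    """Generic crossover/mutation/selection loop over parameter vectors.

    `fitness(candidate) -> float` would be task accuracy on the small
    adaptation set (e.g., ~200 samples); candidates could be merging
    coefficients over a pool of LLMs. Purely illustrative.
    """
    population = list(init_population)
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]

        # Crossover + mutation: children average two parents, then get Gaussian noise.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 + random.gauss(0.0, mut_sigma) for x, y in zip(a, b)]
            children.append(child)

        # Succession: the next generation replaces the old one.
        population = parents + children
    return max(population, key=fitness)
```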
This paper addresses the scalability limitations of the Muon optimizer for large language model (LLM) training by introducing weight decay and carefully adjusting the per-parameter update scale. The authors demonstrate that these techniques enable Muon to achieve approximately 2x computational efficiency compared to AdamW in compute-optimal training scenarios. They further validate the improved optimizer by training Moonlight, a Mixture-of-Experts (MoE) model with 3B activated and 16B total parameters, achieving state-of-the-art performance with significantly fewer training FLOPs, and release the distributed implementation and model checkpoints.
Demonstrates the scalability of the Muon optimizer to large language models by incorporating weight decay and per-parameter update scale adjustments, achieving superior computational efficiency compared to AdamW.
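A condensed sketch of a Muon-style update with decoupled weight decay and an RMS-matching update scale, written in PyTorch; the Newton-Schulz coefficients and the 0.2*sqrt(max(m, n)) scale follow commonly published descriptions and should be read as an illustration, not the released implementation.

```python
import torch

def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix (the core Muon operation)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly published quintic coefficients
    x = g / (g.norm() + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=0.02, mu=0.95, weight_decay=0.1):
    """One Muon-style step with decoupled weight decay and an update scale
    chosen to roughly match AdamW's update RMS (illustrative constants)."""
    momentum.mul_(mu).add_(grad)
    update = newton_schulz5(momentum)
    scale = 0.2 * max(weight.shape) ** 0.5  # RMS-matching heuristic
    weight.mul_(1 - lr * weight_decay)      # decoupled weight decay
    weight.add_(update, alpha=-lr * scale)
    return weight, momentum
```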
The paper details the training process of LLM360 K2-65B, a 65 billion-parameter language model, emphasizing a 360-degree open-source approach to provide full transparency and access to training resources. K2 DIAMOND, the first model in the K2 project, achieves performance surpassing LLaMA-65B and rivaling LLaMA2-70B with fewer FLOPs and tokens. The work contributes a longitudinal analysis of K2 DIAMOND's capabilities throughout training and outlines future models in the TXT360 series.
Presents a fully transparent, end-to-end account of training a 65B parameter LLM, including implementation details and longitudinal performance analysis, to address the lack of transparency in training large-scale models.

