Scaling Laws & Emergent Abilities
Capabilities
Power-law relationships in model scaling, emergent capabilities at scale, and compute-optimal training.
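A common parametric form behind such scaling-law and compute-optimal analyses is the Chinchilla-style loss fit sketched below; the symbols are generic placeholders and are not tied to any specific paper listed here.

```latex
% Parametric scaling law in model size N and training tokens D
% (E, A, B, \alpha, \beta are constants fitted to training runs)
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Minimizing L under a compute budget C \approx 6ND gives the
% compute-optimal allocation of parameters and tokens:
N^{*} \propto C^{\beta/(\alpha+\beta)}, \qquad D^{*} \propto C^{\alpha/(\alpha+\beta)}
```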
Recent Papers
This paper investigates in-context learning in LLMs by framing it as Gaussian Process (GP) regression, using controlled experiments with function samples drawn from known GP priors. The authors compare LLM prediction error against empirical GP-regression (lower bound) and 1-NN (upper bound) baselines, finding that LLM learning curves approach the GP lower bound as the number of demonstrations grows. They also analyze LLM inductive biases via likelihood analysis, revealing a preference for less smooth GP kernels, and demonstrate that post-training can shift these biases to improve sample efficiency on smoother kernels.
Quantifies the extent to which LLMs behave like GP learners and provides methods for steering their inductive biases for continuous function learning tasks.
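A minimal sketch of the evaluation setup described above, assuming scikit-learn for the GP-regression (lower-bound) and 1-NN (upper-bound) baselines; the kernel choice, data split, and comparison against an LLM's predictions are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
kernel = RBF(length_scale=1.0)  # illustrative GP prior kernel

# Draw one function from the known GP prior and split into demonstrations + query.
x = np.sort(rng.uniform(-3, 3, size=33)).reshape(-1, 1)
y = GaussianProcessRegressor(kernel=kernel).sample_y(x, random_state=0).ravel()
x_demo, y_demo, x_query, y_query = x[:-1], y[:-1], x[-1:], y[-1:]

# Lower-bound baseline: GP regression with the true kernel held fixed.
gp_fit = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(x_demo, y_demo)
err_gp = (gp_fit.predict(x_query) - y_query) ** 2

# Upper-bound baseline: 1-nearest-neighbour regression on the same demonstrations.
nn_fit = KNeighborsRegressor(n_neighbors=1).fit(x_demo, y_demo)
err_nn = (nn_fit.predict(x_query) - y_query) ** 2

# err_llm (from prompting an LLM with the same (x, y) demonstrations) would be
# compared against err_gp and err_nn as the number of demonstrations grows.
print(err_gp.item(), err_nn.item())
```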
This paper investigates the impact of model and data scaling on multilingual machine translation (MT) performance using open large language models (LLMs). The authors adapt Gemma3 models via continual pretraining and instruction finetuning, creating MiLMMT-46, a model covering 46 languages. Results demonstrate that MiLMMT-46 surpasses existing open-source SOTA models and rivals proprietary systems like Google Translate and Gemini 3 Pro in multilingual translation quality.
Demonstrates that scaling model size and training data via continual pretraining and instruction finetuning significantly improves the multilingual translation capabilities of open LLMs, achieving performance competitive with proprietary systems.
This paper addresses the limited generalization of diffusion-based policies in semantic manipulation by introducing bounding-box instructions that guide the policy's attention to target objects. The authors develop Label-UMI, a handheld segmentation device with an automated annotation pipeline, to efficiently collect demonstration data with semantic labels. In real-world experiments, they demonstrate improved generalization and adaptability with a semantic-motion-decoupled framework and reveal a power-law relationship between generalization performance and the number of bounding-box objects, achieving 85% success rates across various tasks.
Demonstrates that bounding-box guided diffusion policies, trained on large-scale datasets collected with a novel handheld segmentation device, significantly improve generalization in semantic manipulation tasks and exhibit a power-law scaling relationship.
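A small sketch of how such a power-law relationship between object count and success rate is typically fit, via linear regression in log-log space; the data points below are invented placeholders, not numbers from the paper.

```python
import numpy as np

# Hypothetical (num_bbox_objects, success_rate) pairs; not the paper's data.
n_objects = np.array([8, 16, 32, 64, 128])
success = np.array([0.52, 0.61, 0.70, 0.78, 0.85])

# Fit success ≈ a * n^b by regressing log(success) on log(n).
b, log_a = np.polyfit(np.log(n_objects), np.log(success), deg=1)
a = np.exp(log_a)
print(f"success ≈ {a:.3f} * n^{b:.3f}")
```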
The paper introduces Recursive Self-Aggregation (RSA), a novel test-time scaling method for LLMs that iteratively refines a population of reasoning chains by aggregating subsets of solutions. RSA leverages information from intermediate reasoning steps to bootstrap from partially correct chains of thought, combining parallel and sequential scaling benefits. Empirical results demonstrate that RSA significantly improves performance across various tasks and models, enabling smaller models like Qwen3-4B to compete with larger reasoning models.
Introduces Recursive Self-Aggregation (RSA), a novel inference-time scaling method that recursively aggregates and refines reasoning chains to improve LLM performance.
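A rough sketch of the recursive aggregation loop as described above; the `generate` callable and the aggregation prompt are hypothetical stand-ins for the actual models, prompts, and hyperparameters.

```python
import random

def rsa(problem, generate, population_size=8, subset_size=4, rounds=3):
    """Recursive Self-Aggregation, sketched from the description above.

    `generate(prompt) -> str` is a hypothetical call to an LLM that returns
    one reasoning chain; the real method's prompts and settings differ.
    """
    # Round 0: sample an initial population of reasoning chains in parallel.
    population = [generate(problem) for _ in range(population_size)]

    for _ in range(rounds):
        new_population = []
        for _ in range(population_size):
            # Aggregate a random subset of chains into one refined chain,
            # letting the model reuse partially correct intermediate steps.
            subset = random.sample(population, k=subset_size)
            prompt = (
                f"Problem: {problem}\n\n"
                + "\n\n".join(f"Candidate solution {i + 1}:\n{s}" for i, s in enumerate(subset))
                + "\n\nCombine the useful steps above into a single improved solution."
            )
            new_population.append(generate(prompt))
        population = new_population

    return population  # a final answer can be picked by majority vote or scoring
```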
The paper investigates the data requirements for reasoning in sub-billion parameter language models, challenging the assumption that massive datasets (>10T tokens) are necessary. The authors show that strong reasoning abilities can emerge from roughly 2T tokens of carefully curated and resampled open-source data, far less than is commonly used. The resulting MobileLLM-R1 models achieve state-of-the-art performance among open-source sub-billion parameter models, even surpassing larger models trained on much larger datasets.
Demonstrates that strong reasoning capabilities can emerge in sub-billion parameter language models with significantly less data than previously believed by carefully curating and resampling open-source datasets.
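A minimal sketch of quality-weighted resampling toward a fixed token budget, assuming a per-document quality score; the paper's curation and resampling recipe is more involved than this.

```python
import numpy as np

def resample_to_budget(docs, token_counts, quality_scores, budget_tokens, seed=0):
    """Sample documents (with replacement) in proportion to a quality weight
    until a target token budget (e.g., ~2T tokens) is reached.

    `quality_scores` is a placeholder for whatever curation signal is used.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(quality_scores, dtype=float)
    probs /= probs.sum()

    chosen, total = [], 0
    while total < budget_tokens:
        i = rng.choice(len(docs), p=probs)
        chosen.append(docs[i])
        total += token_counts[i]
    return chosen
```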
This paper benchmarks the energy consumption of 14 LLMs (7B-72B parameters) on the MMLU benchmark across five subjects, measuring CO2 emissions using the Perun framework on NVIDIA A100 GPUs. The study finds a strong positive correlation between model size, reasoning capabilities, token generation, and CO2 emissions, with larger, reasoning-enabled models achieving higher accuracy (up to 84.9%) at the cost of significantly increased energy usage. Subject-level analysis reveals that symbolic domains like Abstract Algebra are particularly computationally expensive and yield lower accuracy, underscoring the need for more efficient reasoning strategies.
Quantifies the relationship between LLM size, reasoning performance, token generation, and CO2 emissions across a range of models on the MMLU benchmark.
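A small sketch of the kind of correlation analysis reported above; the measurements below are placeholders rather than the paper's numbers, and the actual energy readings come from the Perun framework on A100 GPUs.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder per-model measurements: parameters (billions), tokens generated, kg CO2.
params_b = np.array([7, 14, 32, 72])
tokens_gen = np.array([1.2e6, 1.8e6, 2.9e6, 4.1e6])
co2_kg = np.array([0.8, 1.5, 3.2, 6.9])

for name, x in [("model size", params_b), ("tokens generated", tokens_gen)]:
    r, p = pearsonr(x, co2_kg)
    print(f"CO2 vs {name}: r={r:.2f} (p={p:.3f})")
```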
This paper introduces a population-based evolutionary framework for adapting large language models (LLMs) to new tasks, drawing inspiration from natural evolution. The framework evolves a population of LLMs through crossover, mutation, selection, and succession operations, enabling rapid adaptation with limited data (200 samples per task) and without gradient-based optimization. Experiments across 12 datasets demonstrate that the evolutionary approach outperforms existing LLM merging and adaptation techniques, achieving accuracy improvements of up to 54.8% compared to the initial LLM population.
Introduces a novel population-based evolutionary framework for adapting LLMs to new tasks, demonstrating its effectiveness in low-data regimes and its ability to generalize to unseen tasks.
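One way to make the crossover/mutation/selection/succession loop concrete is a generic, gradient-free evolutionary search over candidate parameter vectors (for example, model-merging coefficients); the sketch below is illustrative and does not reproduce the paper's exact operators.

```python
import random

def evolve(init_population, fitness, generations=10, pop_size=8, mut_sigma=0.05):
    """Generic crossover/mutation/selection loop over parameter vectors.

    `fitness(candidate) -> float` would be task accuracy on the small
    adaptation set (e.g., ~200 samples); candidates could be merging
    coefficients over a pool of LLMs. Purely illustrative.
    """
    population = list(init_population)
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]

        # Crossover + mutation: children average two parents, then get Gaussian noise.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 + random.gauss(0.0, mut_sigma) for x, y in zip(a, b)]
            children.append(child)

        # Succession: the next generation replaces the old one.
        population = parents + children
    return max(population, key=fitness)
```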
This paper addresses the scalability limitations of the Muon optimizer for large language model (LLM) training by introducing weight decay and carefully adjusting the per-parameter update scale. The authors demonstrate that these techniques enable Muon to achieve approximately 2x computational efficiency compared to AdamW in compute-optimal training scenarios. They further validate the improved optimizer by training Moonlight, a Mixture-of-Experts (MoE) model with 3B activated and 16B total parameters, achieving state-of-the-art performance with significantly fewer training FLOPs, and release the distributed implementation and model checkpoints.
Demonstrates the scalability of the Muon optimizer to large language models by incorporating weight decay and per-parameter update scale adjustments, achieving superior computational efficiency compared to AdamW.
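A condensed sketch of a Muon-style update with decoupled weight decay and an RMS-matching update scale, written in PyTorch; the Newton-Schulz coefficients and the 0.2*sqrt(max(m, n)) scale follow commonly published descriptions and should be read as an illustration, not the released implementation.

```python
import torch

def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix (the core Muon operation)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly published quintic coefficients
    x = g / (g.norm() + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=0.02, mu=0.95, weight_decay=0.1):
    """One Muon-style step with decoupled weight decay and an update scale
    chosen to roughly match AdamW's update RMS (illustrative constants)."""
    momentum.mul_(mu).add_(grad)
    update = newton_schulz5(momentum)
    scale = 0.2 * max(weight.shape) ** 0.5  # RMS-matching heuristic
    weight.mul_(1 - lr * weight_decay)      # decoupled weight decay
    weight.add_(update, alpha=-lr * scale)
    return weight, momentum
```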
The paper details the training process of LLM360 K2-65B, a 65 billion-parameter language model, emphasizing a 360-degree open-source approach to provide full transparency and access to training resources. K2 DIAMOND, the first model in the K2 project, achieves performance surpassing LLaMA-65B and rivaling LLaMA2-70B with fewer FLOPs and tokens. The work contributes a longitudinal analysis of K2 DIAMOND's capabilities throughout training and outlines future models in the TXT360 series.
Presents a fully transparent, end-to-end account of training a 65B parameter LLM, including implementation details and longitudinal performance analysis, to address the lack of transparency in training large-scale models.

