Depth in neural networks isn't just about the final output; this work shows how each intermediate layer can be a progressively refined approximation, with error explicitly tied to the layer's geometric scale.

Shijun Zhang, Zuowei Shen, Yuesheng Xu

Architecture Design (Transformers, SSMs, MoE) Scaling Laws & Emergent Abilities

Apr 22, 2026·also UCSD

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Despite architectural differences, language models exhibit convergent evolution by learning similar periodic features for number representation, but achieving geometric separability depends on subtle training factors.

Deqing Fu, Tianyi Zhou, Mikhail Belkin +3

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing +1

Apr 22, 2026

Image Generators are Generalist Vision Learners

Image generators aren't just for making pretty pictures; they're secretly state-of-the-art vision learners, rivaling specialized models in tasks from segmentation to depth estimation.

Valentin Gabeur, Shangbang Long, Songyou Peng +25

Computer Vision Multimodal Models Scaling Laws & Emergent Abilities

All Papers (24)

Apr 22, 2026

Samuel Salfati·Apr 22, 2026

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Forget pruning by variance: high-variance activations in transformers are surprisingly uncorrelated with predictive power.

Samuel Salfati

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Scaling Laws & Emergent Abilities

Shijun Zhang +2·Apr 22, 2026

Geometric Layer-wise Approximation Rates for Deep Networks

Shijun Zhang, Zuowei Shen, Yuesheng Xu

Architecture Design (Transformers, SSMs, MoE) Scaling Laws & Emergent Abilities

Apr 22, 2026·also UCSD

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Deqing Fu, Tianyi Zhou, Mikhail Belkin +3

Architecture Design (Transformers, SSMs, MoE) Interpretability & Mechanistic Interp Natural Language Processing +1

Apr 22, 2026

Image Generators are Generalist Vision Learners

Image generators aren't just for making pretty pictures; they're secretly state-of-the-art vision learners, rivaling specialized models in tasks from segmentation to depth estimation.

Valentin Gabeur, Shangbang Long, Songyou Peng +25

Computer Vision Multimodal Models Scaling Laws & Emergent Abilities

Kristian Schwethelm +2·Apr 22, 2026

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

Looping a language model block four times only gives you the effective capacity of 1.4 additional unique blocks, but costs as much to train as 2.4.

Kristian Schwethelm, D. Rueckert, G. Kaissis

Architecture Design (Transformers, SSMs, MoE) Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Apr 21, 2026

Weijie Zhao +5·Apr 21, 2026

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

Forget training from scratch: Nexusformer lets you scale Transformers by nonlinearly expanding attention, inheriting knowledge and slashing compute by up to 41.5%.

Weijie Zhao, Mingquan Liu, Bolun Wang +3

Architecture Design (Transformers, SSMs, MoE) Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Proximus Luxembourg S.A·Apr 21, 2026·also Luxembourg

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Forget scaling laws: strategically equipping small language models with tools delivers a better performance/cost tradeoff than simply scaling up or deploying multi-agent systems.

Xinlin Wang, Mats Brorsson

Inference & Quantization Scaling Laws & Emergent Abilities Tool Use & Agents

Amazon Science·Apr 21, 2026·also ASU

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Upcycling MoE models can achieve the same performance as larger fixed-size models while cutting GPU costs by 32%.

Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta +3

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Scaling Laws & Emergent Abilities +1

Apr 20, 2026

Apr 20, 2026·also CMU ML, University of California, UT Austin

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

LLMs waste compute on tokens that have already "figured it out" – DASH selectively skips these tokens during prefill, speeding things up without retraining or sacrificing accuracy.

Yujie Chen, Tailai Chen, Yifeng Gao +4

Inference & Quantization Scaling Laws & Emergent Abilities

Apr 20, 2026

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

Unveiling the "topological dual of a dataset" provides a Rosetta Stone for neuro-symbolic AI, promising to unlock mechanistic interpretability and overcome scaling bottlenecks.

Anthony Bordg

Reasoning & Chain-of-Thought Scaling Laws & Emergent Abilities

Apr 20, 2026·also UW-Madison

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Training on mixed complexity datasets can yield up to 5x sample efficiency in low data regimes, challenging conventional wisdom about data quantity in LLM fine-tuning.

Justin Bauer, Thomas Walshe, Derek Pham +3

Data Curation & Synthetic Data Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Mingxue +1·Apr 20, 2026

Predicting LLM Compression Degradation from Spectral Statistics

Forget expensive compression trials – a simple spectral statistic can accurately predict how much your LLM will degrade *before* you even compress it.

Mingxue, Xu

Inference & Quantization Scaling Laws & Emergent Abilities

NASK National Research Institute Warsaw·Apr 20, 2026

Generalization Boundaries of Fine-Tuned Small Language Models for Graph Structural Inference

Fine-tuned small language models can reliably generalize to larger and structurally distinct graphs, maintaining strong performance in graph property estimation.

Michal Podstawski

Natural Language Processing Scaling Laws & Emergent Abilities

Adelaide University·Apr 20, 2026·also PKU

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

TriMix reveals that prioritizing small, specialized models can dramatically improve low-resource language adaptation, overturning the assumption that bigger models always lead the way.

Chen Zhang, Jiuheng Lin, Zhiyuan Liao +1

Natural Language Processing Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Apr 20, 2026

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

LLMs' surprising grammatical struggles aren't due to inherent limitations, but rather a lack of exposure to specific linguistic structures in their training data – a problem fixable with just a tiny amount of targeted data augmentation.

H S V N S Kowndinya Renduchintala, Sumit Bhatia

Data Curation & Synthetic Data Natural Language Processing Scaling Laws & Emergent Abilities

Jiaqi Song +11·Apr 20, 2026·also UC Santa Cruz

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

LLM-based ASR can be shrunk to 2.3B parameters and still beat larger models in real-world scenarios by carefully delineating encoder and LLM roles and using a multi-stage training approach.

Jiaqi Song, Guang Qiu, Guanghui Qiu +9

Inference & Quantization Natural Language Processing Scaling Laws & Emergent Abilities +1

Youzhi Huang +3·Apr 20, 2026

DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

Decomposing LLMs doesn't have to mean sacrificing inference speed: DeInfer unlocks efficient parallel inference for these models.

Youzhi Huang, You-Liang Huang, Xinhao Huang +1

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

Jin Chen +20·Apr 20, 2026·also Tencent AI

RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems

RankUp tackles representation collapse in deep recommender systems, unlocking significant GMV gains in real-world deployments by strategically boosting the effective rank of token representations.

Jin Chen, Shangyu Zhang, Bin Hu +18

Architecture Design (Transformers, SSMs, MoE) Recommendation & Information Retrieval Scaling Laws & Emergent Abilities

ETH·Apr 20, 2026

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

GSQ closes the accuracy gap in low-precision quantization, achieving results comparable to complex vector methods while remaining easy to implement.

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan +2

Inference & Quantization Scaling Laws & Emergent Abilities

Ying-Chi Shen +2·Apr 20, 2026·also SJTU

River-LLM: Large Language Model Seamless Exit Based on KV Share

LLMs can achieve up to 2x inference speedup without retraining by intelligently sharing KV cache states during early exit, sidestepping the usual performance bottlenecks.

Ying-Chi Shen, Yingtao Shen, An Zou

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Scaling Laws & Emergent Abilities

Liubomyr Horbatko·Apr 20, 2026

Sessa: Selective State Space Attention

By embedding attention within a recurrent state, Sessa unlocks power-law memory decay and selective retrieval capabilities previously unattainable by either Transformers or Mamba-style models alone.

Liubomyr Horbatko

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Scaling Laws & Emergent Abilities

Apr 19, 2026

Generative AI Technologies, Techniques & Tensions: A Primer

Generative AI's "black box" nature isn't a bug, it's a feature stemming from a fundamental mismatch between user expectations and the technology's statistical foundations.

John T. Behrens

Constitutional AI & AI Ethics Natural Language Processing Scaling Laws & Emergent Abilities

Apr 19, 2026

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

LLM agent systems can achieve up to 76% speedups and significantly reduced hotspot miss rates by intelligently caching logits and scheduling compute resources based on agent behavior.

Zizhang Luo, Yuhao Luo, Youwei Xiao +3

Scaling Laws & Emergent Abilities Tool Use & Agents

Apr 19, 2026·also Carleton, Newcastle University, UESTC, UMacau +2

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

LLM scaling bottlenecks demand a shift towards cloud-native architectures and distributed systems, unlocking potential gains from serverless inference and quantum computing.

Minxian Xu, Jingfeng Wu, Shengye Song +13

Distributed Systems & Hardware Scaling Laws & Emergent Abilities Training Efficiency & Optimization

Scaling Laws & Emergent Abilities - Weekly Roundup
