May 11 – May 18, 2026

Architecture Design (Transformers, SSMs, MoE) - Weekly Roundup

2 papers published across 1 lab.

3950% acceleration

Selected Labs publishing this week

DAMO (1)

Top Papers

May 16, 2026

DAMO · 1w ago · also NJU

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Full-attention LLMs are intrinsically sparse and can be transformed into highly efficient sparse models with minimal training, sidestepping the need for expensive sparse pre-training.

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Training Efficiency & Optimization

May 13, 2026

1w ago

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Forget static KV cache compression – KVServe dynamically adapts compression strategies to your service context, slashing latency by up to 32.8x in disaggregated LLM serving.

Zedong Liu, Xinyang Ma, Dejun Luo +9

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization
