May 11 – May 18, 2026

Inference & Quantization - Weekly Roundup

4 papers published across 2 labs.

303% acceleration

Selected Labs publishing this week

DAMO (1) · NUS (1)

Top Papers

May 16, 2026

DAMO · 2w ago · also NJU

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Full-attention LLMs are intrinsically sparse and can be transformed into highly efficient sparse models with minimal training, sidestepping the need for expensive sparse pre-training.

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Training Efficiency & Optimization

NUS · 2w ago

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

LLM agents can now maintain long-term memories with 6x higher throughput thanks to a novel hierarchical temporal indexing approach that avoids costly full-state rewrites.

Distributed Systems & Hardware · Inference & Quantization · Tool Use & Agents

May 13, 2026

2w ago·also D Pareto candidate set

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Forget static KV cache compression – KVServe dynamically adapts compression strategies to your service context, slashing latency by up to 32.8x in disaggregated LLM serving.

Zedong Liu, Xinyang Ma, Dejun Luo +9

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Beomjin Ahn +3 · 2w ago

LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters

Stop IP thieves cold: LoREnc lets you lock down your foundation models and LoRA adapters without retraining, crushing model recovery attacks while keeping performance intact for authorized users.

Beomjin Ahn, Jungmin Kwon, Chanyong Jung +1

Inference & Quantization · Open-Source Models & Weights · Training Efficiency & Optimization
