Mar 9, 2026

arXiv:2603.08065

Deterministic Differentiable Structured Pruning for Large Language Models

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

AI Summary

This paper introduces Deterministic Differentiable Pruning (DDP), a mask-only optimization method for structured pruning of LLMs that directly optimizes a deterministic soft surrogate of the discrete ℓ0 objective instead of relying on stochastic relaxations. Compared with prior stochastic approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. Experiments on dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, show a performance loss as small as 1% on downstream tasks at 20% sparsity, along with end-to-end inference speedups with vLLM.

Key Contribution

Ditch the stochasticity: a deterministic soft surrogate of the sparsity objective prunes LLMs to 20% sparsity with a performance loss as small as 1%, outperforming prior stochastic methods and accelerating inference.

Abstract

Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an ℓ0 sparsity constraint. Due to the discreteness of the ℓ0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete ℓ0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
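To make the gating view concrete, the sketch below illustrates the stochastic hard-concrete gate that the abstract attributes to prior work, following the commonly used formulation of Louizos et al. (2018). It is not DDP itself: the abstract does not specify the form of DDP's deterministic surrogate, so none is reproduced here. The function names, the stretch constants, and the toy inputs are assumptions chosen for illustration.

```python
import math
import random

# Stretch interval and temperature commonly used for hard-concrete gates
# (assumed values; not taken from the paper summarized above).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_concrete_sample(log_alpha, rng):
    """Training-time gate: a noisy sample, stretched and clipped to [0, 1]."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def hard_concrete_eval(log_alpha):
    """Deployment-time gate: the noise is dropped, the stretch and clip stay."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def expected_l0(log_alphas):
    """Differentiable penalty: expected number of non-zero gates."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(a - shift) for a in log_alphas)


def apply_gates(component_outputs, gates):
    """Structured pruning as one multiplicative gate per component."""
    return [g * y for g, y in zip(gates, component_outputs)]


rng = random.Random(0)
log_alphas = [3.0, 0.0, -3.0]  # hypothetical learned gate parameters
train_gates = [hard_concrete_sample(a, rng) for a in log_alphas]
eval_gates = [hard_concrete_eval(a) for a in log_alphas]
```

In this sketch the gates seen during training (`train_gates`) depend on sampled noise, while the gates used at deployment (`eval_gates`) do not, and both are clipped to the range [0, 1]. These are the two properties the abstract identifies as sources of train-test mismatch and limited expressiveness in stochastic approaches.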

Architecture Design (Transformers, SSMs, MoE); Inference & Quantization; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
