Mar 9, 2026 | arXiv:2603.08083

High-Fidelity Pruning for Large Language Models

AI Summary

This paper introduces a novel pruning criterion for large language models (LLMs) based on the information entropy of the model's output distribution, aiming to improve the fidelity of Taylor-based pruning. Unlike traditional Taylor pruning that relies on one-hot cross-entropy loss, the proposed method evaluates neuron importance by considering the entire output distribution, thereby capturing a more global perspective. Experiments on LLaMA and Qwen models demonstrate that this entropy-based pruning consistently outperforms existing methods on zero-shot benchmarks.

Key Contribution

LLMs can be pruned more effectively by considering the information entropy of their output distribution, surpassing the limitations of traditional cross-entropy-based Taylor pruning.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common pruning approach uses Taylor expansion on the loss function to estimate neuron importance. However, because it relies on one-hot cross-entropy loss, it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate the importance scores of neurons with Taylor pruning without requiring an additional teacher. Compared to the plain cross-entropy criterion, it provides a more holistic criterion for Taylor pruning, removing the neurons with the least impact on the model's predictions in a global manner and thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at https://github.com/visresearch/HFPrune.
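The contrast drawn in the abstract can be made concrete with a toy example. The sketch below is not the authors' implementation: it assumes a single linear output layer (logits = W h), treats each hidden unit h_j as a "neuron", and scores it with the standard first-order Taylor estimate |h_j * dL/dh_j|. All function and variable names are hypothetical. It computes the score twice, once with one-hot cross-entropy as L and once with the entropy of the output distribution as L, using closed-form gradients so no autograd library is required.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logits_from_hidden(W, h):
    # Toy assumption: logits are a linear map of the hidden units, z = W h.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log p_i.
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

def grad_ce_wrt_logits(p, label):
    # d/dz_k of -log p_label is p_k - 1[k == label].
    return [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]

def grad_entropy_wrt_logits(p):
    # d/dz_k of H(softmax(z)) is -p_k * (log p_k + H).
    H = entropy(p)
    return [-p_k * (math.log(p_k) + H) if p_k > 0.0 else 0.0 for p_k in p]

def taylor_importance(W, h, grad_logits):
    # First-order Taylor score per hidden unit: |h_j * dL/dh_j|,
    # with dL/dh_j = sum_k dL/dz_k * W[k][j] by the chain rule.
    scores = []
    for j, h_j in enumerate(h):
        g_j = sum(grad_logits[k] * W[k][j] for k in range(len(W)))
        scores.append(abs(h_j * g_j))
    return scores

# Illustrative values only: 3-token vocabulary, 2 hidden units.
# Unit 0 feeds the labeled token; unit 1 only shifts mass between the other two.
W = [[2.0, 0.0],
     [0.0, 1.0],
     [0.0, -1.0]]
h = [1.0, 0.5]
label = 0

p = softmax(logits_from_hidden(W, h))
ce_scores = taylor_importance(W, h, grad_ce_wrt_logits(p, label))
entropy_scores = taylor_importance(W, h, grad_entropy_wrt_logits(p))

print("cross-entropy Taylor scores:", ce_scores)
print("entropy Taylor scores:      ", entropy_scores)
```

The cross-entropy gradient depends on the label index, whereas the entropy gradient is a function of the full distribution p alone, which is why the entropy-based score needs neither labels nor a teacher model. How the paper aggregates scores across tokens, layers, and calibration data is not specified in this abstract and is left out here.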

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
