The paper introduces SimDiff, a layer importance criterion for depth pruning of LLMs that combines representational similarity with transformation difference. The difference is measured with two metrics: MSSD, which is sensitive to outliers, and MASD, which robustly measures a layer's average contribution. The authors show that similarity-only methods can perform unpredictably and that SimDiff overcomes this limitation. In experiments on models from 0.5B to 13B parameters, SimDiff outperforms existing methods, retaining over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieving up to a 1.49x inference speedup on LLaMA3.1-8B when 12 layers are pruned.
Similarity alone is a poor guide for LLM depth pruning: jointly considering representational similarity *and* transformation difference unlocks significantly better compression.
Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
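The two perspectives described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define MSSD or MASD, so the code uses a mean squared difference as a stand-in for an outlier-sensitive signal and a mean absolute difference as a stand-in for a robust average signal. The function names, the dictionary keys, and the toy hidden states are all hypothetical. The example shows a layer that rescales one token without changing its direction: cosine similarity reports the layer as perfectly redundant, while both difference signals show that it changed the representation.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def layer_signals(h_in, h_out):
    """Compare the hidden states entering and leaving one layer.

    h_in, h_out: lists of token vectors (lists of floats), same shape.
    The two difference signals are assumed stand-ins, NOT the paper's
    MSSD/MASD definitions, which the abstract does not spell out.
    """
    # Similarity view: average per-token cosine similarity.
    sims = [cosine_similarity(x, y) for x, y in zip(h_in, h_out)]

    # Difference view: element-wise change made by the layer.
    deltas = [b - a for x, y in zip(h_in, h_out) for a, b in zip(x, y)]

    return {
        "similarity": sum(sims) / len(sims),
        # Squaring amplifies large changes (outlier-sensitive).
        "squared_diff": sum(d * d for d in deltas) / len(deltas),
        # Absolute value weights all changes linearly (more robust).
        "abs_diff": sum(abs(d) for d in deltas) / len(deltas),
    }


# Toy hidden states: the layer rescales the second token only.
h_in = [[1.0, 0.0], [0.0, 2.0]]
h_out = [[1.0, 0.0], [0.0, 5.0]]

signals = layer_signals(h_in, h_out)
print(signals)
```

A similarity-only criterion would rank this toy layer as the first to remove, which is the failure mode the abstract attributes to the one-dimensional heuristic. How SimDiff actually combines the similarity and difference scores into a ranking is not stated in the abstract, so the sketch stops at computing the separate signals.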