Apr 28, 2026 | arXiv:2604.25098

Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

AI Summary

This paper investigates the impact of unstructured pruning on the test-time scaling (TTS) performance of reasoning LLMs, contrasting it with the previously observed degradation caused by structured pruning. Through experiments on s1.1-7B and Qwen3-8B across four reasoning benchmarks, the authors demonstrate that unstructured pruning can enhance TTS performance, even surpassing that of unpruned models. They also analyze the influence of different layer-wise sparsity allocation strategies on unstructured pruning effectiveness.

Key Contribution

Unstructured pruning does more than shrink LLMs: it can improve their reasoning performance under test-time scaling relative to structured pruning, and at times it outperforms the full, unpruned models.

Abstract

While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant or detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning improves TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice when instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can improve TTS effectiveness even further.
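
The distinction the abstract draws between structured pruning, unstructured pruning, and layer-wise sparsity allocation can be illustrated with a small sketch. This is not the authors' method: the abstract does not name the pruning criterion used, so simple magnitude pruning is swapped in here as a stand-in, and the function names (`magnitude_prune`, `drop_layers`, `prune_model`), the toy weights, and the sparsity values are all hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def magnitude_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Unstructured pruning (illustrative): zero the `sparsity` fraction
    of individual weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    # Collect (|w|, row, col) for every weight, then sort by magnitude.
    entries = [
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    ]
    entries.sort(key=lambda e: e[0])
    n_prune = int(len(entries) * sparsity)
    pruned = [row[:] for row in weights]  # copy; the input is left untouched
    for _, r, c in entries[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def drop_layers(layers: Dict[str, Matrix], to_drop: List[str]) -> Dict[str, Matrix]:
    """Structured pruning (caricature): remove whole layers by name."""
    return {name: w for name, w in layers.items() if name not in to_drop}


def prune_model(layers: Dict[str, Matrix], sparsity_per_layer: Dict[str, float]) -> Dict[str, Matrix]:
    """Apply unstructured pruning with a per-layer sparsity allocation."""
    return {
        name: magnitude_prune(w, sparsity_per_layer[name])
        for name, w in layers.items()
    }


# Toy two-layer "model" (hypothetical values, for illustration only).
layers = {
    "layer0": [[0.9, -0.1], [0.05, -0.7]],
    "layer1": [[0.2, -0.3], [0.4, 0.01]],
}

# Two allocation strategies with the same overall sparsity (4 of 8 weights).
uniform = prune_model(layers, {"layer0": 0.5, "layer1": 0.5})
nonuniform = prune_model(layers, {"layer0": 0.25, "layer1": 0.75})
```

Both allocations zero half of the weights in total but distribute the zeros differently across layers, which is the kind of parametric choice the abstract refers to; `drop_layers` instead discards an entire layer regardless of which of its weights matter.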

Inference & Quantization | Reasoning & Chain-of-Thought | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
