Aleph Alpha | Bosch Center for Artificial Intelligence | Freiburg | Apr 28, 2026 | arXiv:2604.25530

The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh, Lukas Schott, Niclas Popp, Thomas Brox

AI Summary

This paper investigates the effectiveness of canonical knowledge distillation (KD) methods, specifically logit- and feature-based KD, for semantic segmentation compared to recent, more complex segmentation-specific KD techniques. By comparing performance under matched wall-clock compute, the authors demonstrate that canonical KD outperforms recent methods, suggesting that gains from complex objectives may simply reflect greater compute. Furthermore, with extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance, with a smaller student model closely approaching the performance of a larger teacher model.

Key Contribution

Forget fancy distillation losses: simple feature-based knowledge distillation, given enough compute, lets a ResNet-18 student nearly match a ResNet-101 teacher in semantic segmentation.

Abstract

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
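To make the compute-matching argument concrete, here is a minimal sketch of how a fixed wall-clock budget translates into different iteration counts for methods with different per-iteration cost, together with the student-to-teacher mIoU ratio quoted above. The per-iteration costs and the 40,000-iteration reference schedule are hypothetical placeholders chosen for illustration; the abstract does not report them. Only the Cityscapes mIoU values (79.0 and 79.8) come from the abstract. The function names are likewise illustrative.

```python
def iterations_within_budget(budget_ms: int, ms_per_iteration: int) -> int:
    """Return how many full training iterations fit into a wall-clock budget."""
    if ms_per_iteration <= 0:
        raise ValueError("ms_per_iteration must be positive")
    return budget_ms // ms_per_iteration


def relative_miou(student_miou: float, teacher_miou: float) -> float:
    """Return the student's mIoU as a fraction of the teacher's mIoU."""
    return student_miou / teacher_miou


# HYPOTHETICAL per-iteration costs in milliseconds (not taken from the paper).
HYPOTHETICAL_MS_PER_ITERATION = {
    "canonical_kd": 500,  # assumed cheaper: plain logit or feature matching
    "complex_kd": 800,    # assumed costlier: extra hand-crafted loss terms
}

# HYPOTHETICAL reference schedule: the budget is whatever the costlier
# method consumes in 40,000 iterations.
REFERENCE_ITERATIONS = 40_000
budget_ms = REFERENCE_ITERATIONS * HYPOTHETICAL_MS_PER_ITERATION["complex_kd"]

# Under a matched wall-clock budget, the cheaper method trains for more steps.
matched_iterations = {
    name: iterations_within_budget(budget_ms, cost)
    for name, cost in HYPOTHETICAL_MS_PER_ITERATION.items()
}

# Cityscapes mIoU values as quoted in the abstract (student 79.0, teacher 79.8).
cityscapes_ratio = relative_miou(79.0, 79.8)

if __name__ == "__main__":
    for name, iters in matched_iterations.items():
        print(f"{name}: {iters} iterations within the shared budget")
    print(f"Cityscapes student/teacher mIoU ratio: {cityscapes_ratio:.3f}")
```

With these assumed costs, an equal-iteration comparison would hand the costlier method 60% more wall-clock time than the canonical one, which is the kind of mismatch the abstract argues can be mistaken for a stronger distillation signal.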

Computer Vision | Inference & Quantization | Training Efficiency & Optimization
