May 25, 2026 · arXiv:2605.25561

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?

AI Summary

The paper argues that current semi-supervised learning (SSL) methods for 3D medical image segmentation suffer from two forms of overconfidence: pseudo-labeling frameworks conflate prediction confidence with uncertainty, and some studies use the test set for validation. To address the first, the authors propose TCSeg, a tri-space calibrated segmentation framework that decouples confidence from uncertainty to detect and correct confirmation bias across feature, probability, and image spaces. Experiments on three benchmark datasets show strong performance under existing evaluation protocols, and the authors advocate more rigorous evaluation based on final-checkpoint results reported over multiple runs.

Key Contribution

Apparent SOTA gains in semi-supervised 3D medical image segmentation may be illusory, driven by confirmation bias and test-set overfitting, not genuine progress.
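To make the test-set overfitting concern concrete, the following minimal sketch contrasts two reporting protocols: selecting the best checkpoint by test score versus reporting the final checkpoint, each summarized over several runs. The function names and the Dice values are hypothetical, chosen only for illustration; they are not results from the paper.

```python
import statistics

def best_on_test(dice_per_checkpoint):
    """Optimistic protocol: pick the checkpoint with the highest test Dice."""
    return max(dice_per_checkpoint)

def final_checkpoint(dice_per_checkpoint):
    """Stricter protocol: report the last checkpoint, with no selection on test data."""
    return dice_per_checkpoint[-1]

def summarize(runs, pick):
    """Mean and sample standard deviation of the picked score across runs."""
    scores = [pick(run) for run in runs]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical test Dice (%) per saved checkpoint for three seeds (illustrative only).
runs = [
    [88.0, 90.5, 89.0],
    [87.5, 89.0, 88.5],
    [88.5, 91.0, 89.5],
]

best_mean, best_std = summarize(runs, best_on_test)
final_mean, final_std = summarize(runs, final_checkpoint)
print(f"best-on-test:     {best_mean:.2f} +/- {best_std:.2f}")
print(f"final-checkpoint: {final_mean:.2f} +/- {final_std:.2f}")
```

Because the best-on-test protocol takes a maximum over scores measured on the test set, it can never report a lower number than the final-checkpoint protocol for the same run, which is one way selection on test data inflates reported gains.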

Abstract

Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies also use the test set for validation, leading to inflated performance estimates. Subsequent methods, compelled to adopt the same strategy to surpass the reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. We therefore propose TCSeg, a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous and more realistic benchmarks. Code will be available at github.com/DirkLiii/TCSeg.
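The distinction between confidence and uncertainty can be illustrated with a small sketch. It is not the TCSeg reliability engine, whose details the abstract does not give. As an assumption, confidence is taken as the largest averaged class probability over several stochastic predictions for one voxel, and uncertainty as the standard deviation of that class's probability across the predictions. All function names and thresholds are hypothetical.

```python
import math

def mean_probs(passes):
    """Average class probabilities over several stochastic predictions for one voxel."""
    n = len(passes)
    return [sum(p[c] for p in passes) / n for c in range(len(passes[0]))]

def confidence(passes):
    """Confidence axis (assumed definition): largest averaged class probability."""
    return max(mean_probs(passes))

def uncertainty(passes):
    """Uncertainty axis (assumed definition): std of the winning class's
    probability across the stochastic predictions."""
    avg = mean_probs(passes)
    k = avg.index(max(avg))
    return math.sqrt(sum((p[k] - avg[k]) ** 2 for p in passes) / len(passes))

def is_reliable(passes, conf_min=0.8, unc_max=0.1):
    """Dual-axis check: accept a pseudo-label only if confidence is high
    AND uncertainty is low. Thresholds are illustrative, not from the paper."""
    return confidence(passes) >= conf_min and uncertainty(passes) <= unc_max

# Two hypothetical voxels, each with three predictions over two classes.
stable = [[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]]
unstable = [[0.99, 0.01], [0.60, 0.40], [0.99, 0.01]]

for name, passes in [("stable", stable), ("unstable", unstable)]:
    print(name, round(confidence(passes), 3), round(uncertainty(passes), 3),
          is_reliable(passes))
```

Both voxels pass a confidence-only threshold of 0.8, but only the stable one passes the dual-axis check. This is the kind of case where treating confidence as a proxy for certainty would admit an unreliable pseudo-label.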

Computer Vision; Data Curation & Synthetic Data; Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
