This paper investigates how the validation criterion used for model selection affects the test performance of neural classifiers, focusing on early stopping and post-hoc selection. Through a systematic empirical study using fully connected networks and $k$-fold evaluation on standard benchmarks, the authors compare validation based on accuracy with validation based on one of three loss functions (cross-entropy, C-Loss, PolyLoss). The key finding is that early stopping on validation accuracy consistently yields lower test accuracy than loss-based early stopping and post-hoc selection. In addition, whichever validation rule is used, the selected checkpoint often has lower test accuracy than the test-optimal checkpoint.
Validation accuracy is a surprisingly poor guide for model selection, especially with early stopping, where it often leads to worse test performance than loss-based criteria or post-hoc selection over all epochs.
Despite the extensive literature on training loss functions, the evaluation of generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with particular attention to early stopping. Using fully connected networks on standard benchmarks under $k$-fold evaluation, we compare (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e., no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; model selection on the validation set uses accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield comparable and more stable test accuracy. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, the selected model typically achieves test-set performance statistically lower than the best performance across all epochs, regardless of the validation criterion. Our results suggest avoiding validation accuracy (in particular with early stopping) for parameter selection and favoring loss-based validation criteria.
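To make the two selection protocols concrete, the following is a minimal Python sketch of early stopping with patience and post-hoc selection applied to a recorded per-epoch validation curve. It is not the authors' code: the function names, the patience value, the strict-improvement rule, and the toy validation curves are all assumptions chosen only to exercise the two rules, and the numbers are not results from the paper.

```python
from typing import Sequence


def post_hoc_epoch(scores: Sequence[float], higher_is_better: bool) -> int:
    """Post-hoc selection: best validation score over all epochs (no early stopping)."""
    sign = 1.0 if higher_is_better else -1.0
    # max() returns the first epoch in case of ties.
    return max(range(len(scores)), key=lambda e: sign * scores[e])


def early_stop_epoch(scores: Sequence[float], patience: int,
                     higher_is_better: bool) -> int:
    """Early stopping: stop once `patience` epochs pass without a strict
    improvement, and return the best epoch seen up to that point."""
    sign = 1.0 if higher_is_better else -1.0
    best_epoch = 0
    for epoch in range(1, len(scores)):
        if sign * scores[epoch] > sign * scores[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted; later epochs are never seen
    return best_epoch


# Hypothetical validation curves for one fold (illustrative values only).
val_acc = [0.70, 0.80, 0.80, 0.80, 0.85, 0.84]   # higher is better
val_loss = [0.90, 0.70, 0.65, 0.60, 0.55, 0.58]  # lower is better

selected = {
    "early stopping, accuracy": early_stop_epoch(val_acc, patience=2, higher_is_better=True),
    "early stopping, loss": early_stop_epoch(val_loss, patience=2, higher_is_better=False),
    "post-hoc, accuracy": post_hoc_epoch(val_acc, higher_is_better=True),
    "post-hoc, loss": post_hoc_epoch(val_loss, higher_is_better=False),
}
for rule, epoch in selected.items():
    print(f"{rule}: epoch {epoch}")
```

In this toy run, the accuracy curve plateaus for two epochs and the patience rule halts before the later improvement, while the loss curve keeps decreasing and is followed to a later epoch. This shows only how the two protocols can select different checkpoints from the same training run; it is not offered as the explanation for the paper's findings.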