MIT CSAIL · Feb 19, 2026 · arXiv:2602.17531

Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz

AI Summary

This position paper critiques the current evaluation paradigm for ECG representation learning, which relies too heavily on arrhythmia and waveform-morphology labels from the PTB-XL, CPSC2018, and CSN datasets and therefore cannot assess the broader clinical information encoded in ECGs. The authors show that applying proper evaluation practices for multi-label, imbalanced settings changes the ranking of representations, and that a randomly initialized encoder can match state-of-the-art pre-training on many tasks. They advocate expanding evaluation to include structural heart disease and patient-level forecasting, alongside improved benchmarking practices.

Key Contribution

Randomly initialized encoders can match state-of-the-art pre-trained models on many ECG representation learning tasks, suggesting current benchmarks are misleading.

Abstract

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
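The random-encoder baseline mentioned in the abstract can be illustrated with a minimal sketch: a frozen, randomly initialized encoder produces features, and only a linear classifier with one logistic head per label is trained on top. Everything concrete here is an assumption made for illustration and is not taken from the paper: the encoder is a single random projection with ReLU rather than the authors' architecture, the function names `make_random_encoder` and `fit_linear_probe` are hypothetical, and the data are synthetic vectors rather than ECG recordings.

```python
import math
import random


def make_random_encoder(in_dim, out_dim, seed=0):
    """Return a frozen encoder: a fixed random projection followed by ReLU.

    Illustrative stand-in only; not the encoder architecture used in the paper.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(x):
        # The weights are never updated: the encoder stays at its random init.
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_linear_probe(features, labels, epochs=100, lr=0.05):
    """Train one logistic-regression head per label on frozen features (SGD)."""
    dim, n_labels = len(features[0]), len(labels[0])
    W = [[0.0] * dim for _ in range(n_labels)]
    b = [0.0] * n_labels
    for _ in range(epochs):
        for f, y in zip(features, labels):
            for k in range(n_labels):
                p = sigmoid(b[k] + sum(w * v for w, v in zip(W[k], f)))
                g = p - y[k]  # gradient of binary cross-entropy w.r.t. the logit
                W[k] = [w - lr * g * v for w, v in zip(W[k], f)]
                b[k] -= lr * g

    def predict(f):
        # Independent per-label probabilities (multi-label, not softmax).
        return [sigmoid(b[k] + sum(w * v for w, v in zip(W[k], f)))
                for k in range(n_labels)]

    return predict


# Toy usage with synthetic data (assumption: not real ECG signals).
rng = random.Random(1)
signals = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(20)]
targets = [[int(s[0] > 0), int(s[1] + s[2] > 0)] for s in signals]

encode = make_random_encoder(in_dim=16, out_dim=8, seed=0)
feats = [encode(s) for s in signals]      # encoder is frozen
probe = fit_linear_probe(feats, targets)  # only the linear heads are trained
probs = [probe(f) for f in feats]
```

In a benchmark, the same `fit_linear_probe` step would be run on features from each pre-trained encoder, so the random encoder serves as a floor against which the benefit of pre-training can be measured.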

Eval Frameworks & Benchmarks · Scientific Discovery & Drug Design

Year: 2026

Venue: N/A
