IPAI Foundation gGmbH, KIT, Jun 11, 2026, arXiv:2606.13194

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger

AI Summary

This paper addresses the comparability crisis in Wearable Human Activity Recognition (WHAR) by introducing a large-scale, open-source benchmark that consolidates 30 diverse datasets with standardized processing and evaluation protocols. By evaluating 17 architectures across 4760 training runs, the authors find that while CNN-HAR achieves the highest mean macro-F1 score, the top-performing models are closely clustered, indicating a performance ceiling has been reached. The study highlights that compact models like TinierHAR and classical Random Forests are more efficient for deployment, suggesting that future advancements should focus on optimizing deployment efficiency and adapting to domain shifts rather than solely improving predictive performance.

Key Contribution

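The WHAR state of the art is distributed across architectures rather than dominated by a single one. Once deployment efficiency is taken into account, compact neural models such as TinierHAR and classical Random Forests define the practically relevant Pareto frontier, while larger recurrent and hybrid models incur high hardware costs without corresponding performance gains.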

Abstract

Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
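The Pareto frontier mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes two objectives only (macro-F1 to maximize, on-device latency to minimize), whereas the paper also measures peak memory and model size. The `Result` type, the `dominates` and `pareto_frontier` helpers, the model labels, and all numbers are hypothetical and are not taken from the paper or its framework.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str          # hypothetical model label
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.macro_f1 >= b.macro_f1 and a.latency_ms <= b.latency_ms
    strictly_better = a.macro_f1 > b.macro_f1 or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up numbers for illustration only; these are NOT results from the paper.
results = [
    Result("model_a", 0.80, 5.0),
    Result("model_b", 0.81, 40.0),
    Result("model_c", 0.78, 2.0),
    Result("model_d", 0.79, 60.0),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this toy data, `model_d` drops out because `model_a` is both more accurate and faster. The other three each win on at least one axis against every competitor, so they all stay on the frontier even though their macro-F1 scores are close together.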

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 106

Year: 2026

Venue: N/A
