This paper addresses the comparability crisis in Wearable Human Activity Recognition (WHAR) by introducing a large-scale, open-source benchmark that consolidates 30 diverse datasets under standardized processing and a shared cross-subject evaluation protocol. Evaluating 17 architectures across 4760 training runs, the authors find that CNN-HAR achieves the highest mean macro-F1 score, but the top-performing models cluster tightly, suggesting that contemporary architectures have converged near a predictive performance ceiling. Once on-device latency, peak memory, and model size are taken into account, compact models such as TinierHAR and classical Random Forests define the practically relevant Pareto frontier. The authors conclude that future progress is more likely to come from optimizing deployment efficiency and improving adaptation to domain shifts than from further gains in predictive performance.
The WHAR state of the art is distributed across architectures rather than dominated by a single one, and compact models match larger ones in predictive performance at a much lower deployment cost.
Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
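The Pareto-frontier argument in the abstract can be illustrated with a minimal sketch. It assumes each model is summarized by four numbers (mean macro-F1, latency, peak memory, model size) and keeps every model that no other model beats or ties on all four axes while strictly beating it on at least one. The `ModelResult` type, the model names, and all numeric values below are hypothetical placeholders chosen for illustration. They are not results from the benchmark, and this is not the authors' released framework.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    """Hypothetical per-model summary; field names are assumptions."""
    name: str
    macro_f1: float     # higher is better
    latency_ms: float   # lower is better
    peak_mem_mb: float  # lower is better
    size_mb: float      # lower is better


def dominates(a: ModelResult, b: ModelResult) -> bool:
    """Return True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.macro_f1 >= b.macro_f1
        and a.latency_ms <= b.latency_ms
        and a.peak_mem_mb <= b.peak_mem_mb
        and a.size_mb <= b.size_mb
    )
    strictly_better = (
        a.macro_f1 > b.macro_f1
        or a.latency_ms < b.latency_ms
        or a.peak_mem_mb < b.peak_mem_mb
        or a.size_mb < b.size_mb
    )
    return no_worse and strictly_better


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Keep the models that no other model dominates (input order preserved)."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Placeholder values for illustration only, NOT measurements from the paper.
results = [
    ModelResult("small_net", 0.80, 5.0, 10.0, 0.5),
    ModelResult("big_rnn", 0.80, 50.0, 100.0, 20.0),  # same F1, higher cost
    ModelResult("best_f1", 0.82, 20.0, 40.0, 5.0),
    ModelResult("forest", 0.75, 1.0, 5.0, 2.0),
]

frontier = pareto_frontier(results)
print([r.name for r in frontier])  # ['small_net', 'best_f1', 'forest']
```

In this toy example `big_rnn` drops off the frontier because `small_net` matches its macro-F1 at lower latency, memory, and size, which mirrors the abstract's point that larger models can incur high hardware costs without corresponding performance gains. The pairwise check is quadratic in the number of models, which is negligible for a set of 17 architectures.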