Beijing QBoson Quantum Technology Co.; Hefei Comprehensive National Science; HFUT; HIT; Sydney; TJU
Jun 8, 2026, arXiv:2606.09261

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu

AI Summary

This paper details XInsight Lab's first-place solution for the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, which reaches a new state-of-the-art top-1 accuracy of 74.419% on iMiGUE. The approach combines a self-supervised RGB model, pretrained on 120K unlabeled clips through masked video modeling and fine-tuned on iMiGUE, with existing supervised multi-stream models from previous solutions. On its own, the self-supervised RGB model reaches 69.224% top-1 accuracy on the iMiGUE test set; added as a complementary branch, it lifts the ensemble 1.206 percentage points above the previous state of the art.

Key Contribution

Reaching 74.419% top-1 accuracy on iMiGUE, the ensemble shows that adding a self-supervised RGB branch to supervised multi-stream models improves micro-gesture recognition, exceeding the previous state of the art by 1.206 percentage points.

Abstract

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
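The abstract does not spell out how the branches are combined, so the following is only a minimal sketch of one common option, weighted late fusion of per-stream class probabilities. The stream names, weights, and logits are hypothetical placeholders, not values from the paper, and the authors' actual ensemble strategy may differ.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(
    stream_logits: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-stream softmax scores (late fusion).

    Assumption: every stream scores the same ordered set of classes.
    Weights are normalized so the fused scores still sum to 1.
    """
    total_w = sum(weights[name] for name in stream_logits)
    num_classes = len(next(iter(stream_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in stream_logits.items():
        probs = softmax(logits)
        w = weights[name] / total_w
        for c, p in enumerate(probs):
            fused[c] += w * p
    return fused


def predict(fused: Sequence[float]) -> int:
    """Return the index of the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical logits for one clip over three classes.
logits = {
    "ssl_rgb": [3.0, 0.0, 0.0],                # self-supervised RGB branch
    "supervised_multistream": [0.0, 1.0, 0.0],  # existing supervised branch
}
equal = fuse_scores(logits, {"ssl_rgb": 1.0, "supervised_multistream": 1.0})
skewed = fuse_scores(logits, {"ssl_rgb": 0.1, "supervised_multistream": 0.9})
```

With equal weights the confident RGB branch decides the prediction, while the skewed weights let the supervised branch win. In practice such weights would be tuned on a validation split.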

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
