Free University of Bozen-Bolzano | May 5, 2026 | arXiv:2605.03848

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

AI Summary

This paper discusses three parameter-efficient methods for multi-view proficiency estimation on the Ego-Exo4D dataset: SkillFormer, PATS, and ProfVLM. SkillFormer is a discriminative architecture for selective multi-view fusion, PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements, and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback. Together, the methods achieve state-of-the-art accuracy with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while also enabling interpretable feedback generation.

Key Contribution

ProfVLM recasts proficiency estimation as conditional language generation, so the model returns expert-style feedback alongside a proficiency label rather than a score alone.

Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
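The contrast between uniform frame sampling and sampling that keeps "locally dense excerpts" can be illustrated with a minimal sketch. This is not the PATS algorithm, whose details the abstract does not give; the function names, the evenly spaced window placement, and the frame counts are all assumptions chosen only to show the idea of trading global coverage for short runs of consecutive frames.

```python
# Illustrative only: hypothetical helpers, not the PATS implementation.

def uniform_indices(num_frames: int, num_samples: int) -> list[int]:
    """Spread num_samples frame indices evenly over the whole clip."""
    step = num_frames / num_samples
    # Take the frame at the centre of each equal-length slice.
    return [int(step * i + step / 2) for i in range(num_samples)]


def dense_excerpt_indices(
    num_frames: int, num_excerpts: int, excerpt_len: int
) -> list[int]:
    """Pick num_excerpts windows of excerpt_len consecutive frames.

    Assumption: windows are centred in equal-length segments of the clip.
    """
    if num_excerpts * excerpt_len > num_frames:
        raise ValueError("excerpts do not fit in the clip")
    segment = num_frames / num_excerpts
    indices: list[int] = []
    for k in range(num_excerpts):
        # Centre each window inside its own segment so windows never overlap.
        start = int(k * segment + (segment - excerpt_len) / 2)
        indices.extend(range(start, start + excerpt_len))
    return indices


if __name__ == "__main__":
    # Same frame budget (32) under both strategies for a 320-frame clip.
    print(uniform_indices(320, 32))
    print(dense_excerpt_indices(320, num_excerpts=4, excerpt_len=8))
```

Both calls return 32 indices, but the second keeps consecutive frames within each window, which is the property that would let a model observe short temporal events such as a single fundamental movement at full frame rate.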

Computer Vision | Multimodal Models | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
