This study introduces the Understanding-Enhanced Model Collaboration Method (UE-MCM) for detecting user mistakes in egocentric video data by leveraging both coarse-grained and fine-grained action analysis. The approach employs a dual-branch architecture, where a small model identifies inconsistencies in actions relative to the overall workflow, while a large model focuses on the accuracy of the actions themselves. By optimizing classifiers with complementary objectives and utilizing a lightweight collaboration gate, the system achieves a balance of speed and accuracy, effectively addressing the challenges posed by long-tailed distributions of mistakes.
Fusing coarse-grained workflow analysis with fine-grained action analysis helps detect subtle and rare mistakes in egocentric videos.
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4Clip video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
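The abstract does not specify the form of the collaboration gate or the exact loss definitions, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each branch outputs a scalar mistake probability, that the gate is a sigmoid over a linear combination of those two probabilities, that AUC-oriented learning uses a pairwise squared-hinge surrogate, and that label-aware adjustment takes the form of a class-prior logit shift. All function and parameter names (`gate_fuse`, `reweighted_bce`, `pairwise_auc_loss`, `prior_adjusted_logit`, `pos_weight`, `tau`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(p_small, p_large, w_small, w_large, bias):
    """Hypothetical scalar gate: g weights the large branch, 1 - g the small branch."""
    g = sigmoid(w_small * p_small + w_large * p_large + bias)
    fused = g * p_large + (1.0 - g) * p_small
    return fused, g


def reweighted_bce(p, y, pos_weight):
    """Binary cross-entropy with the rare positive (mistake) class up-weighted."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pairwise_auc_loss(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge AUC surrogate (assumed form): penalize every
    positive/negative pair where the positive does not outscore the
    negative by at least `margin`."""
    total = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            total += max(0.0, margin - (sp - sn)) ** 2
    return total / (len(pos_scores) * len(neg_scores))


def prior_adjusted_logit(logit, pos_prior, tau=1.0):
    """Training-time logit shift by tau times the log-odds of the class prior
    (one common reading of 'label-aware adjustment', assumed here)."""
    return logit + tau * math.log(pos_prior / (1.0 - pos_prior))


if __name__ == "__main__":
    # With all gate parameters at zero, g is 0.5 and the fused score is the
    # plain average of the two branch predictions.
    fused, g = gate_fuse(p_small=0.2, p_large=0.8, w_small=0.0, w_large=0.0, bias=0.0)
    print("gate:", g, "fused:", fused)
    # A missed mistake (y = 1) costs pos_weight times more than a false alarm.
    print("loss on positive:", reweighted_bce(0.5, 1, pos_weight=4.0))
    print("loss on negative:", reweighted_bce(0.5, 0, pos_weight=4.0))
```

Because the fused score is a convex combination, it always lies between the two branch predictions, and the gate value itself indicates which branch the system trusted for a given segment. In a trained system the gate parameters would be learned jointly with, or after, the branch classifiers.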