This paper introduces Dense Point-to-Mask Optimization (DPMO), which combines SAM with a Nearest Neighbor Exclusive Circle (NNEC) constraint to generate instance segmentation masks from point annotations in crowded scenes. The generated masks are then used to train a Reinforced Point Selection (RPS) framework with Group Relative Policy Optimization (GRPO) for predicting instance segmentation. Experiments on four crowd datasets (ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd) report state-of-the-art segmentation performance, and new mask-supervised loss functions improve counting accuracy across different models.
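The summary does not spell out how the NNEC constraint is computed, so the following is only a minimal sketch of one plausible reading: each annotated point gets a circle whose radius is a fraction of the distance to its nearest neighbouring point, so that circles of adjacent people do not overlap. The function names `exclusive_radii` and `inside_circle`, the `scale` parameter, and the choice of 0.5 are assumptions made for illustration, not the paper's definition.

```python
import math


def exclusive_radii(points, scale=0.5):
    """Return, for each point, scale * distance to its nearest other point.

    Assumption: with scale=0.5 the circles of two neighbouring points
    touch at most at a single location, so they never overlap.
    A lone point has no neighbour and gets an infinite radius.
    """
    radii = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            (
                math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points)
                if j != i
            ),
            default=math.inf,
        )
        radii.append(scale * nearest)
    return radii


def inside_circle(pixel, center, radius):
    """Check whether a pixel lies within the circle around a point."""
    return math.hypot(pixel[0] - center[0], pixel[1] - center[1]) <= radius


# Hypothetical head annotations (x, y) in pixel coordinates.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
radii = exclusive_radii(points)
```

In a pipeline of this kind, such circles could be used to reject or clip mask pixels that stray into a neighbour's region. The brute-force nearest-neighbour search here is quadratic in the number of points; a spatial index would be the natural replacement for very dense images.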
Turns out, you can get SOTA crowd instance segmentation by cleverly combining SAM with point supervision and reinforcement learning to select optimal points for mask generation.
Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
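The point-selection step described above can be illustrated with the group-relative advantage that gives GRPO its name: rewards for a group of sampled candidates are normalized by the group's own mean and standard deviation. This is a minimal sketch under stated assumptions, not the paper's training code. The candidate points and reward values are made up, the reward is a stand-in for whatever quality score the framework uses, and the helper names `group_relative_advantages` and `select_best` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Computes (r_i - mean(r)) / (std(r) + eps) for every reward in the
    group. eps avoids division by zero when all rewards are equal.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_best(candidates, rewards):
    """Return the candidate with the highest reward in the group."""
    if not candidates or len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must be non-empty and aligned")
    best_index = max(range(len(rewards)), key=rewards.__getitem__)
    return candidates[best_index]


# Hypothetical group: four candidate points sampled around one initial
# prediction, each with an assumed quality reward.
candidates = [(10.0, 12.0), (11.0, 12.5), (9.5, 11.0), (10.5, 12.2)]
rewards = [0.62, 0.81, 0.40, 0.77]

advantages = group_relative_advantages(rewards)
best_point = select_best(candidates, rewards)
```

Candidates with above-average reward receive positive advantages and the rest negative ones, so a policy update would favour the better points without needing a separate value network.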