Anny-Fit is a multi-person, camera-space optimization framework for all-age 3D human mesh recovery (HMR). It addresses two limitations of existing methods: they assume adult subjects, and they optimize each person independently. The framework jointly optimizes all individuals in the camera coordinate system, enforcing global spatial consistency using metric depth maps, instance segmentation, 2D keypoints, and VLM-derived semantic attributes. Across diverse datasets, it improves 2D reprojection accuracy, relative depth ordering, and shape estimation while lowering 3D estimation error, and it enables zero-shot adaptation of adult-trained HMR pipelines to the full age spectrum.
Adult-trained human mesh recovery models can now handle kids, too, thanks to a new framework that enforces spatial consistency and leverages VLM-derived age and gender cues.
Recovering 3D human pose and shape from a single image remains a cornerstone of human-centric vision, yet most methods assume adult subjects and optimize each person independently. These assumptions fail in real-world, all-age scenes, where body proportions and depth must be resolved jointly. We introduce Anny-Fit, a multi-person, camera-space optimization framework for all-age 3D human mesh recovery (HMR). Unlike existing per-person fitting methods, Anny-Fit jointly optimizes all individuals directly in the camera coordinate system, enforcing global spatial consistency. At the core of our approach is the use of multiple forms of expert knowledge (metric depth maps, instance segmentation, 2D keypoints, and VLM-derived semantic attributes such as age and gender), each obtained from dedicated off-the-shelf networks. These complementary signals jointly guide the optimization, constraining the depth-scale ambiguity characteristic of all-age scenes. Across diverse datasets, Anny-Fit consistently improves 2D reprojection accuracy (+13 to +16), relative depth ordering (+6 to +7), 3D estimation error (-9 to -29), and shape estimation (+25 to +82), producing more coherent scenes. Finally, we show that VLM-based semantic knowledge can be distilled into an HMR model via the pseudo-ground-truth annotations produced by Anny-Fit on training data, enabling it to learn semantically meaningful shape parameters while improving HMR performance. Our approach bridges adult-only and all-age modeling by enabling zero-shot adaptation of adult-trained HMR pipelines to the full age spectrum without retraining. Code is publicly available at https://github.com/naver/anny-fit.
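The depth-scale ambiguity mentioned in the abstract can be illustrated with a simple pinhole camera model. The sketch below is not taken from the Anny-Fit code base: the function names, the focal length, and the example heights and depths are all hypothetical, chosen only to show why a small person close to the camera and a tall person farther away can look identical in 2D, and how a metric depth estimate separates the two cases.

```python
import math

# Hypothetical focal length in pixels (an assumption for this illustration).
FOCAL_PX = 1000.0


def projected_height_px(height_m: float, depth_m: float, focal_px: float = FOCAL_PX) -> float:
    """Pinhole projection: image height in pixels of an upright person
    of metric height `height_m` standing at distance `depth_m`."""
    return focal_px * height_m / depth_m


def height_from_depth(pixel_height: float, depth_m: float, focal_px: float = FOCAL_PX) -> float:
    """Invert the projection once a metric depth estimate is available."""
    return pixel_height * depth_m / focal_px


# A 1.80 m adult at 6 m and a 1.20 m child at 4 m (made-up values)
# have the same height-to-depth ratio, so they project to the same size.
adult_px = projected_height_px(1.80, 6.0)
child_px = projected_height_px(1.20, 4.0)
same_in_2d = math.isclose(adult_px, child_px)

# With a metric depth value per person, the two heights become distinguishable.
adult_height = height_from_depth(adult_px, 6.0)
child_height = height_from_depth(child_px, 4.0)

print(f"adult: {adult_px:.1f} px, child: {child_px:.1f} px, identical in 2D: {same_in_2d}")
print(f"recovered heights: adult {adult_height:.2f} m, child {child_height:.2f} m")
```

In this toy setting, 2D evidence alone cannot tell the two people apart, which is consistent with the abstract's point that complementary signals such as metric depth and semantic attributes are needed to constrain body scale in all-age scenes.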