OmniFit is a method for fitting a 3D body model to clothed human assets from multi-modal inputs (point clouds, depth, images) without requiring a known metric scale. It uses a conditional transformer decoder to map surface points to dense body landmarks, which are then used for SMPL-X parameter fitting, together with an optional image adapter and a scale predictor. OmniFit achieves state-of-the-art results, surpassing multi-view optimization baselines and reaching millimeter-level accuracy on standard benchmarks.
Achieve millimeter-level accuracy in 3D human body fitting from multi-modal inputs, even with scale distortion common in AI-generated assets.
Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches handle only single-modal inputs such as point clouds or multi-view images alone, and often require a known metric scale. This requirement is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that seamlessly handles diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.
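The abstract describes a pipeline in which a conditional transformer decoder maps surface points to dense body landmarks before SMPL-X fitting. The sketch below is a minimal, hypothetical illustration of that landmark-prediction stage only, not the authors' implementation: the point encoder, the `LandmarkDecoder` class, and all sizes (number of landmarks, model width, layer counts) are assumptions made for the example.

```python
# Hypothetical sketch: a conditional transformer decoder that maps an unordered set
# of surface points to a fixed set of dense body landmarks. A downstream optimizer
# (not shown) would then fit SMPL-X parameters to these landmarks.
import torch
import torch.nn as nn


class LandmarkDecoder(nn.Module):
    """Predicts N dense body landmarks from surface points of a clothed human asset."""

    def __init__(self, num_landmarks: int = 512, d_model: int = 256,
                 nhead: int = 8, num_layers: int = 4):
        super().__init__()
        # Simple per-point encoder (assumption: any point-cloud encoder could be used).
        self.point_encoder = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # One learned query per landmark; the decoder cross-attends over encoded points.
        self.landmark_queries = nn.Parameter(torch.randn(num_landmarks, d_model))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Regress a 3D position for each landmark token.
        self.head = nn.Linear(d_model, 3)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, P, 3) surface points from a scan, depth observation, or lifted image.
        memory = self.point_encoder(points)                       # (B, P, d_model)
        queries = self.landmark_queries.unsqueeze(0).expand(
            points.shape[0], -1, -1
        )                                                          # (B, N, d_model)
        tokens = self.decoder(tgt=queries, memory=memory)          # (B, N, d_model)
        return self.head(tokens)                                   # (B, N, 3) landmarks


if __name__ == "__main__":
    model = LandmarkDecoder()
    surface_points = torch.randn(2, 4096, 3)   # two assets, 4096 sampled points each
    landmarks = model(surface_points)
    print(landmarks.shape)                     # torch.Size([2, 512, 3])
```

In this reading, the predicted landmarks would serve as dense correspondences for an SMPL-X parameter optimization (e.g., with the smplx library), and the optional image adapter and scale predictor mentioned in the abstract would condition or rescale the inputs before this stage; neither is shown here.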