CASYNUApr 14, 2026arXiv:2604.12356

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Dongjian Yu, Weiqing Min, Qian Jiang, Xing Lin, Xin Jin, Shuqiang Jiang

AI Summary

The paper introduces OmniFood8K, a new dataset of 8,036 food samples with detailed nutritional annotations and multi-view images, addressing the lack of Chinese cuisine data for nutrition estimation. To improve prediction accuracy from single RGB images, they propose a method that predicts and refines a depth map, then uses a Frequency-Aligned Fusion Module (FAFM) to hierarchically fuse RGB and depth features in the frequency domain. Experiments show their method outperforms existing approaches on multiple datasets.

Key Contribution

You can now estimate the nutrition of your favorite Chinese dishes from a single photo, thanks to a new dataset and frequency-domain fusion technique that beats existing methods.

Abstract

Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models' capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/

Computer Vision Data Curation & Synthetic Data Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Related Papers