American International University–Bangladesh | Feb 24, 2026 | arXiv:2602.20531

A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata

AI Summary

This paper introduces a lightweight vision-language fusion framework for predicting app ratings by jointly leveraging mobile UI layouts and app metadata. The framework uses MobileNetV3 for visual feature extraction and DistilBERT for textual feature extraction, fuses these features with a gated fusion module, and predicts ratings with an MLP regression head. In the reported experiments, the model achieves an R² of 0.8529 and a Pearson correlation of 0.9251, highlighting the benefits of multimodal fusion for app rating prediction.
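The fusion step described above can be illustrated with a minimal, dependency-free sketch. The gating formula, the toy dimensions, the random weights, and every function name here are assumptions for illustration only; the page does not give the paper's exact equations, and the random input vectors merely stand in for MobileNetV3 and DistilBERT outputs.

```python
import math
import random


def swish(x):
    # Swish activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_linear(rng, out_dim, in_dim):
    # Hypothetical initialiser: small uniform weights, zero bias.
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(weights, bias, x):
    # Dense layer: one dot product per output unit, plus bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(visual, text, params):
    # Project each modality into a shared space with a Swish non-linearity.
    v = [swish(z) for z in linear(*params["visual_proj"], visual)]
    t = [swish(z) for z in linear(*params["text_proj"], text)]
    # Assumed gate: a sigmoid over the concatenated projections (v + t is
    # list concatenation), one value per shared dimension.
    gate = [sigmoid(z) for z in linear(*params["gate"], v + t)]
    # Per-dimension convex combination of the two modalities.
    return [g * vi + (1.0 - g) * ti for g, vi, ti in zip(gate, v, t)]


def mlp_head(fused, params):
    # One hidden layer followed by a single regression output.
    hidden = [swish(z) for z in linear(*params["hidden"], fused)]
    return linear(*params["out"], hidden)[0]


rng = random.Random(0)
params = {
    "visual_proj": init_linear(rng, 4, 6),  # toy visual dim 6 -> shared dim 4
    "text_proj": init_linear(rng, 4, 5),    # toy text dim 5 -> shared dim 4
    "gate": init_linear(rng, 4, 8),         # concatenated 4 + 4 -> 4 gates
    "hidden": init_linear(rng, 3, 4),
    "out": init_linear(rng, 1, 3),
}
visual_features = [rng.random() for _ in range(6)]  # stand-in for image encoder output
text_features = [rng.random() for _ in range(5)]    # stand-in for text encoder output

fused = gated_fusion(visual_features, text_features, params)
rating = mlp_head(fused, params)
print(len(fused), rating)
```

Because each gate value lies strictly between 0 and 1, every fused dimension is a weighted mix of the visual and textual projections, which is one common way to let a model decide per feature how much each modality contributes.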

Key Contribution

Rather than relying solely on text or UI features for app rating prediction, a lightweight vision-language framework fuses MobileNetV3 visual features with DistilBERT textual features, reaching an R² of 0.8529 and a Pearson correlation of 0.9251.

Abstract

App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R²), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R² of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
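For readers unfamiliar with the five evaluation metrics named in the abstract, the following sketch computes them with their standard definitions. The function name and the sample ratings are hypothetical and are not taken from the paper's data.

```python
import math


def regression_metrics(y_true, y_pred):
    # Standard definitions of MAE, MSE, RMSE, R^2 and Pearson correlation.
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    ss_true = sum((t - mean_true) ** 2 for t in y_true)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    covariance = sum((t - mean_true) * (p - mean_pred)
                     for t, p in zip(y_true, y_pred))

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    r2 = 1.0 - (mse * n) / ss_true
    pearson = covariance / math.sqrt(ss_true * ss_pred)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "pearson": pearson}


# Made-up ratings, purely for illustration.
true_ratings = [4.0, 3.5, 4.5, 2.0]
predicted_ratings = [3.5, 3.5, 4.5, 2.5]
metrics = regression_metrics(true_ratings, predicted_ratings)
print(metrics)
```

Since RMSE is the square root of MSE, the two reported values can be cross-checked: 0.1433 squared is about 0.0205, which agrees with the reported MSE.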

Computer Vision | Multimodal Models | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
