Mar 31, 2026 | arXiv:2603.29211

Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guang Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao

AI Summary

Xuanwu VL-2B is presented as a case study in developing an industrial-grade multimodal model for content ecosystems, built on a compact InternViT-300M + MLP + Qwen3 1.7B architecture. A progressive three-stage training pipeline (pre-training, mid-training, and post-training), combined with a data iteration and curation mechanism, balances business specialization against retention of general capabilities. Xuanwu VL-2B averages 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), reaches 94.38% average recall over seven business moderation tasks, and attains 82.82% weighted overall recall on policy-violating text in adversarial OCR scenarios (vs. 76.72% for Gemini-2.5-Pro), indicating a practical balance between performance and deployment cost.

Key Contribution

A model of roughly 2B parameters can beat Gemini-2.5-Pro on adversarial OCR content moderation: Xuanwu VL-2B reaches 82.82% weighted overall recall on policy-violating text, versus 76.72% for Gemini-2.5-Pro.

Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
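The abstract reports two differently aggregated recall figures: an average recall over seven tasks and a weighted overall recall for the adversarial OCR setting. The sketch below illustrates how an unweighted mean and a weighted mean can diverge. The abstract does not state how the weights are defined, so weighting by per-category sample count is an assumption here, and the function names and all numbers are hypothetical, not values from the paper.

```python
from typing import Sequence


def macro_average_recall(recalls: Sequence[float]) -> float:
    """Unweighted mean of per-task recall values."""
    if not recalls:
        raise ValueError("need at least one recall value")
    return sum(recalls) / len(recalls)


def weighted_overall_recall(recalls: Sequence[float],
                            weights: Sequence[float]) -> float:
    """Weighted mean of per-category recall.

    Assumption: each weight is the number of positive (violating)
    samples in that category. The paper may weight differently.
    """
    if len(recalls) != len(weights):
        raise ValueError("recalls and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r * w for r, w in zip(recalls, weights)) / total


# Hypothetical per-category recalls and sample counts (not from the paper).
example_recalls = [0.90, 0.80, 0.50]
example_counts = [100, 300, 600]

macro = macro_average_recall(example_recalls)
weighted = weighted_overall_recall(example_recalls, example_counts)
print(f"macro average recall:    {macro:.4f}")
print(f"weighted overall recall: {weighted:.4f}")
```

With these made-up inputs the weighted figure is lower than the unweighted one because the largest category has the lowest recall, which is why the two kinds of number should not be compared directly.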

Topics: Eval Frameworks & Benchmarks, Multimodal Models, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
