The paper introduces Yanyun-3, a Vision-Language Model (VLM) agent designed for cross-platform strategy game automation, addressing the challenges of diverse user interfaces and dynamic environments. The authors propose a novel data organization principle called "combination granularity" to improve multimodal data fusion during fine-tuning, applying QLoRA to a curated dataset from three strategy game platforms. Results show that Yanyun-3 achieves a 12.98x improvement in BLEU-4 score and a 63% reduction in inference time compared to full fusion, and executes cross-platform tasks without platform-specific tuning.
A 12.98x BLEU-4 jump and 63% faster inference show that structured multimodal data organization is a surprisingly effective way to boost VLM performance in complex embodied tasks like cross-platform game automation.
Cross-platform strategy game automation remains a challenge due to diverse user interfaces and dynamic battlefield environments. Existing Vision-Language Models (VLMs) struggle to generalize across heterogeneous platforms and lack precision in interface understanding and action execution. We introduce Yanyun-3, a VLM-based agent that integrates Qwen2.5-VL for visual reasoning and UI-TARS for interface execution. We propose a novel data organization principle, combination granularity, which distinguishes intra-sample fusion from inter-sample mixing of multimodal data (static images, multi-image sequences, and videos). The model is fine-tuned using QLoRA on a curated dataset spanning three strategy game platforms. The optimal strategy (M*V+S) achieves a 12.98x improvement in BLEU-4 score and a 63% reduction in inference time compared to full fusion. Yanyun-3 successfully executes core tasks (e.g., target selection, resource allocation) across platforms without platform-specific tuning. Our findings demonstrate that structured multimodal data organization significantly enhances VLM performance in embodied tasks. Yanyun-3 offers a generalizable framework for GUI automation, with broader implications for robotics and autonomous systems.
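The M*V+S idea can be sketched in code. The snippet below is a minimal, hypothetical illustration of combination granularity, assuming "M*V" means fusing each multi-image sequence with a video clip inside a single training sample, and "+S" means mixing standalone static-image samples in between samples; all function and field names are illustrative, not the paper's actual pipeline.

```python
# Hypothetical sketch of the "combination granularity" principle:
# intra-sample fusion (M*V) pairs multi-image sequences with video
# within one sample; inter-sample mixing (+S) interleaves static-image
# samples across the dataset. Names are assumptions for illustration.
import random

def build_mv_plus_s(multi_image_samples, video_samples, static_samples, seed=0):
    """Return a training list following the sketched M*V+S strategy."""
    fused = []
    # Intra-sample fusion: each multi-image sequence is combined with a
    # video clip inside the same training sample.
    for imgs, vid in zip(multi_image_samples, video_samples):
        fused.append({"images": imgs, "video": vid})
    # Inter-sample mixing: standalone static-image samples are added as
    # separate samples, then the whole set is shuffled.
    mixed = fused + [{"images": [img]} for img in static_samples]
    random.Random(seed).shuffle(mixed)
    return mixed

dataset = build_mv_plus_s(
    multi_image_samples=[["ui_1.png", "ui_2.png"]],
    video_samples=["battle.mp4"],
    static_samples=["menu.png"],
)
print(len(dataset))  # 2: one fused M*V sample plus one static S sample
```

The key design point the sketch captures is that fusion happens at two distinct granularities: within a sample (images and video bound together) and across samples (static images kept separate), rather than concatenating all modalities into every sample as full fusion would.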