The paper introduces Group Competitive Learning (GCL) to improve the performance of lightweight Vision Language Models (VLMs) for socially compliant robot navigation. GCL uses a Group Competitive Objective (GCO) to align global semantics with distributional regularization and Asymmetric Group Optimization (AGO) to maximize model performance. Experiments on social navigation benchmarks show that GCL significantly improves VLM performance, enabling a 3B model to outperform an 8B baseline.
A novel Group Competitive Learning (GCL) strategy allows a 3B vision-language model not only to match but to surpass an 8B model on socially compliant navigation tasks.
Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making for socially compliant navigation, but larger models incur substantial computational overhead, limiting their suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. GCL introduces a Group Competitive Objective (GCO) that harmonizes global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO), which explores the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL raises the Qwen2.5-VL-3B learner and the Qwen3-VL-4B guide to F1 scores of 0.968 and 0.914, improvements of 40% and 12% over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755), yet with GCL the 3B model surpasses the 8B baseline by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
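The abstract does not spell out the GCO loss, but its description, a supervised term harmonized with a distributional regularizer between a learner and a guide model, resembles a distillation-style objective. The sketch below is a minimal, hypothetical rendering of such a combined loss; the function name `group_competitive_loss` and the weighting/temperature parameters `alpha` and `tau` are assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def group_competitive_loss(learner_logits, guide_logits, targets,
                           alpha=0.5, tau=2.0):
    """Hypothetical GCO-style objective (not the paper's exact loss):
    supervised cross-entropy on the learner plus a KL regularizer pulling
    the learner's output distribution toward the temperature-softened
    guide distribution. Logits: (batch, num_classes); targets: (batch,)."""
    # Supervised term: standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(learner_logits, targets)
    # Distributional regularizer: KL between softened learner and guide
    # distributions, scaled by tau**2 as in classic knowledge distillation.
    kl = F.kl_div(
        F.log_softmax(learner_logits / tau, dim=-1),
        F.softmax(guide_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    # alpha balances imitation of the guide against the supervised signal.
    return (1 - alpha) * ce + alpha * kl
```

In a competitive setup, one might also update the guide asymmetrically (e.g. with a weaker or less frequent gradient signal), which is plausibly what AGO refers to, but the abstract gives no detail on this.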