UW; China University of Mining and Technology-Beijing; Apr 6, 2026; arXiv:2604.04929

Rethinking Model Efficiency: Multi-Agent Inference with Large Models

SiXun Dong, Juhua Hu, Steven Li, Wei Wen

AI Summary

This paper analyzes latency bottlenecks in vision-language models (VLMs) and finds that the number of output tokens significantly affects end-to-end latency. The authors show that a larger model producing fewer output tokens can be more efficient than a smaller model producing a long output sequence, while achieving better or comparable performance. To exploit this, they propose a multi-agent inference framework that keeps the large model's responses short and transfers key reasoning tokens from a smaller model when necessary.

Key Contribution

A large VLM that keeps its responses short while reusing a smaller model's reasoning tokens can approach the performance of the same large model performing its own reasoning.

Abstract

Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can match or exceed the performance of a small model while using significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that reusing the reasoning tokens from small models helps the large model approach the performance it achieves with its own reasoning, which confirms the effectiveness of our proposal.
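The latency argument in the abstract can be illustrated with a minimal sketch. It assumes a simplified cost model that is not taken from the paper: one fixed prefill cost plus a constant cost per sequentially decoded output token. The names `ModelProfile`, `end_to_end_latency`, and `break_even_tokens`, and all numbers used with them, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical latency profile; values are illustrative, not measured."""
    name: str
    prefill_s: float           # seconds to encode the image and prompt once
    decode_s_per_token: float  # seconds to generate one output token


def end_to_end_latency(profile: ModelProfile, output_tokens: int) -> float:
    """Assumed cost model: prefill once, then one decode step per output token."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return profile.prefill_s + output_tokens * profile.decode_s_per_token


def break_even_tokens(large: ModelProfile, small: ModelProfile,
                      large_tokens: int) -> float:
    """Small-model output length at which both models have equal latency.

    If the small model needs more output tokens than this, the large model
    with `large_tokens` output tokens finishes first under the assumed model.
    """
    large_latency = end_to_end_latency(large, large_tokens)
    return (large_latency - small.prefill_s) / small.decode_s_per_token


# Illustrative numbers only: the large model is 4x slower per token.
small = ModelProfile("small-vlm", prefill_s=0.1, decode_s_per_token=0.01)
large = ModelProfile("large-vlm", prefill_s=0.4, decode_s_per_token=0.04)

short_answer = end_to_end_latency(large, output_tokens=20)
long_reasoning = end_to_end_latency(small, output_tokens=300)
threshold = break_even_tokens(large, small, large_tokens=20)
```

With these made-up profiles, the large model answering in 20 tokens takes about 1.2 s, the small model emitting 300 tokens takes about 3.1 s, and the break-even point is roughly 110 small-model tokens. Real systems deviate from this linear model (for example through batching and growing attention cost with sequence length), so the sketch only conveys why output length can dominate end-to-end latency.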

Architecture Design (Transformers, SSMs, MoE); Inference & Quantization; Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
