Apr 27, 2026 · arXiv:2604.24110

Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

Iizalaarab Elhaimeur, Nikos Chrisochoides

AI Summary

This paper benchmarks the latency and cost of ITAS, a four-agent LLM tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers and eleven concurrency levels up to 50 simultaneous users. Priority PayGo holds response times under 4 seconds across the full load range, Standard PayGo degrades substantially under classroom-scale concurrency, and Provisioned Throughput delivers the lowest latency at low concurrency but saturates above approximately 20 concurrent users. The cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester, while Provisioned Throughput becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization.

Key Contribution

Multi-agent LLM systems can maintain sub-4-second response times even under classroom-scale concurrency, but only with the right throughput tier.

Abstract

Multi-agent LLM tutoring systems improve response quality through agent specialization, but each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face. We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment. Priority PayGo maintains flat sub-4-second response times across the full load range; Standard PayGo degrades substantially under classroom-scale concurrency; and Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users. Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling. Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization. These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.
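The parallel-phase maximum effect described in the abstract can be illustrated with a minimal sketch. It assumes a query is processed as a sequence of phases, with the agents inside a phase called concurrently, so each phase takes as long as its slowest call. The function names, the phase layout, and the lognormal latency distribution are illustrative assumptions, not details taken from the paper or from ITAS.

```python
import random
import statistics


def response_latency(phases):
    """End-to-end latency of one query (hypothetical model).

    `phases` is a list of phases; each phase is a list of per-agent
    call latencies in seconds. Phases run one after another, and the
    calls inside a phase run concurrently, so a phase costs the
    maximum of its call latencies.
    """
    return sum(max(agent_latencies) for agent_latencies in phases)


def mean_phase_latency(num_agents, trials=10_000, seed=0):
    """Average latency of a single parallel phase with `num_agents` calls.

    Per-call latency is drawn from a lognormal distribution purely
    for illustration; the real distribution is not given here.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        phase = [rng.lognormvariate(0.0, 0.5) for _ in range(num_agents)]
        samples.append(response_latency([phase]))
    return statistics.mean(samples)


if __name__ == "__main__":
    # Two phases: three concurrent calls, then one final call.
    print(response_latency([[1.0, 2.5, 0.5], [1.5]]))  # 4.0

    # The expected maximum grows with the number of concurrent calls,
    # even though each call has the same latency distribution.
    for n in (1, 2, 4):
        print(n, round(mean_phase_latency(n), 3))
```

Under this model, adding agents to a phase never makes it faster: the more concurrent calls a query issues, the more likely at least one lands in the slow tail, which is why a multi-agent system is more sensitive to tail latency than a single-agent one.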

Distributed Systems & Hardware · Inference & Quantization · Tool Use & Agents

Citation Metrics

Citations: 2

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
