Apr 14, 2026 | arXiv:2604.12213

Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

AI Summary

The paper introduces MMA2A, an extension to the Agent-to-Agent (A2A) protocol that routes voice, image, and text parts between agents in their native modality, based on each agent's declared capabilities. The authors report a 20 percentage point improvement in task accuracy on the CrossModal-CS benchmark over a text-bottleneck baseline (52% vs. 32%), but only when the receiving agent can exploit the richer multimodal context. The result indicates that effective multimodal collaboration in multi-agent systems requires both protocol-level routing and capable agent-level reasoning.
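
The routing idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Part` and `AgentCard` classes, the `input_modes` field, and the `to_text` fallback are hypothetical names chosen for illustration, and the real A2A Agent Card schema is richer than this. The sketch assumes only that each agent declares which modalities it accepts, and that a part is forwarded unchanged when its modality is declared and collapsed to text otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One message part (hypothetical, simplified shape)."""
    modality: str  # "text", "image", or "voice"
    payload: str   # placeholder for the actual content or a reference to it


@dataclass(frozen=True)
class AgentCard:
    """Simplified stand-in for an Agent Card capability declaration."""
    name: str
    input_modes: frozenset  # modalities the agent declares it can accept


def to_text(part: Part) -> Part:
    """Stand-in for transcription or captioning: the text-bottleneck path."""
    return Part("text", f"[{part.modality} converted to text] {part.payload}")


def route(parts: list, card: AgentCard) -> list:
    """Forward each part natively if the receiver declares its modality,
    otherwise fall back to a text rendering of that part."""
    routed = []
    for part in parts:
        if part.modality in card.input_modes:
            routed.append(part)           # modality-native routing
        else:
            routed.append(to_text(part))  # text-bottleneck fallback
    return routed


# Usage: the same message routed to two differently capable receivers.
message = [
    Part("text", "The hinge on my unit is cracked."),
    Part("image", "photo_of_hinge"),
]
vision_agent = AgentCard("vision-support", frozenset({"text", "image"}))
text_agent = AgentCard("text-support", frozenset({"text"}))

native = route(message, vision_agent)
bottleneck = route(message, text_agent)
```

With these assumptions, the vision-capable receiver gets the image part untouched, while the text-only receiver gets a text rendering of it. Whether the preserved image actually helps then depends on the receiver's reasoning, which is the second half of the summary's claim.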

Key Contribution

Routing multimodal information between agents in its native form raises task accuracy by 20 percentage points (52% vs. 32%), but only if the receiving agent can reason over it.

Abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $\Delta$TCA: [8, 32] pp; McNemar's exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
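
The headline statistics can be sanity-checked with a short calculation. On 50 tasks, 52% and 32% correspond to 26 and 16 correct tasks, a 20 pp difference. The abstract does not report the discordant-pair counts that McNemar's test uses, so the values `b = 11` and `c = 1` below are an assumption: they are one split whose difference matches the 10-task gap and whose two-sided exact p-value rounds to the reported 0.006. The function is a generic exact McNemar test, not code from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: tasks solved by system A only; c: tasks solved by system B only.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts implied by the abstract: 52% and 32% of 50 tasks.
n_tasks = 50
native_correct = 26
baseline_correct = 16
delta_pp = 100 * (native_correct - baseline_correct) / n_tasks

# ASSUMPTION: discordant counts are not given in the abstract.
# b - c must equal 26 - 16 = 10; (11, 1) is one split satisfying that.
b, c = 11, 1
p_value = mcnemar_exact(b, c)
```

Under this assumed split, `p_value` is 26/4096, about 0.0063. Other splits with the same 10-task difference give different values (for example, 10 vs. 0 gives about 0.002), so the check shows consistency with the reported figure rather than recovering the paper's actual counts.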

Multimodal Models | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 15

Year: 2026

Venue: N/A
