Apr 9, 2026arXiv:2604.08140

Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Longgang Zhang, Xiaowei Fu, Fuxiang Huang

AI Summary

This paper introduces the Byte-Grounded Traffic Description (BGTD) benchmark, a novel dataset combining raw network traffic bytes with structured expert annotations, to address the lack of rich semantic information in existing network traffic datasets. They propose mmTraffic, an end-to-end traffic-language representation framework that uses a jointly-optimized perception-cognition architecture with a traffic encoder and LLM generator for multimodal reasoning. Experiments show mmTraffic generates high-fidelity, human-readable traffic interpretation reports while maintaining competitive classification accuracy.

Key Contribution

LLMs can now autonomously generate human-readable explanations of encrypted network traffic, bridging the gap between raw byte sequences and semantic understanding.

Abstract

Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) They fail to capture multidimensional semantics beyond unimodal sequence patterns. (2) Their black box property, i.e., providing only category labels, lacks an auditable reasoning process. We identify a key factor that existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, failing to generate human-readable evidence report. To address data scarcity, this paper proposes a Byte-Grounded Traffic Description (BGTD) benchmark for the first time, combining raw bytes with structured expert annotations. BGTD provides necessary behavioral features and verifiable chains of evidence for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end-to-end traffic-language representation framework (mmTraffic), a multimodal reasoning architecture bridging physical traffic encoding and semantic interpretation. In order to alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly-optimized perception-cognition architecture. By incorporating a perception-centered traffic encoder and a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports, while maintaining highly competitive classification accuracy comparing to specialized unimodal model (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References39

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Related Papers