Tsinghua AI; BJTU; BUPT; Fudan; HUST; Hygon Information Technology Co.; PKU; Research Institute; Unicom Data Intelligence; Zhejiang Lab

Jun 18, 2026

arXiv:2606.20295

Token-Operations-Oriented Inference Optimization Techniques for Large Models

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

AI Summary

This paper proposes a four-layer technical architecture for token-oriented inference optimization in large models: Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. For each layer, the authors review the key technologies and current industry status and analyze their application value in real-world business scenarios. The paper presents this architecture as a practical path toward lower token production costs, more efficient token service, and a more stable token supply, moving large model services from merely callable to operable.

Key Contribution

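Token-oriented inference optimization, organized into a four-layer architecture, offers a practical path to lower token production costs, higher token service efficiency, and a stable token supply, moving large model services from merely callable to operable.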

Abstract

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

Inference & Quantization; Multimodal Models; Scalable Oversight & Alignment Theory

