This paper introduces YouZhi-LLM, an efficient financial large language model designed to ease the high-concurrency deployment bottleneck caused by KV-cache memory overhead. Its layer-adaptive GQA-to-MLA transition framework compresses the KV cache while limiting the resulting perplexity degradation, reducing that degradation by up to 35% relative to uniform baselines. Compared with their respective base models, YouZhi-7B improves the average financial benchmark score by 12.3% and maximum concurrency by 2.69$\times$, and YouZhi-14B improves accuracy by 7.0% and maximum concurrency by 2.43$\times$.
YouZhi-LLM reduces KV-cache overhead in financial LLMs, raising maximum concurrency by up to 2.69$\times$ over its base models while also improving financial benchmark scores.
Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.
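To make the link between KV-cache size and maximum concurrency concrete, the following is a minimal back-of-the-envelope sketch, not the paper's method. It assumes the standard accounting in which GQA caches full keys and values for each KV head, while MLA caches one compressed latent vector plus a small decoupled RoPE key per layer. The function names, the model shape (28 layers, 4 KV heads, head dimension 128), the per-layer latent widths, the 16 GiB cache budget, and the 4096-token context are all hypothetical values chosen for illustration. They are not taken from YouZhi-LLM, and the per-layer widths are only a stand-in for a layer-adaptive configuration, not a model of FreqFold itself.

```python
def gqa_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes per token under GQA: keys and values for every KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def mla_kv_bytes_per_token(latent_dims, rope_dim, bytes_per_elem=2):
    """KV-cache bytes per token under MLA.

    latent_dims holds one (hypothetical) compressed latent width per layer,
    so a layer-adaptive scheme is expressed as a non-uniform list.
    """
    return sum((d + rope_dim) * bytes_per_elem for d in latent_dims)


def max_concurrency(cache_budget_bytes, bytes_per_token, seq_len):
    """Number of full-length sequences whose KV cache fits in the budget."""
    return cache_budget_bytes // (bytes_per_token * seq_len)


if __name__ == "__main__":
    # All values below are illustrative assumptions, not figures from the paper.
    n_layers, n_kv_heads, head_dim = 28, 4, 128
    latent_dims = [256, 384] * (n_layers // 2)  # non-uniform per-layer widths
    rope_dim = 64
    budget = 16 * 1024**3  # 16 GiB reserved for the KV cache
    seq_len = 4096

    gqa = gqa_kv_bytes_per_token(n_layers, n_kv_heads, head_dim)
    mla = mla_kv_bytes_per_token(latent_dims, rope_dim)

    print("GQA bytes/token:", gqa)
    print("MLA bytes/token:", mla)
    print("GQA max concurrency:", max_concurrency(budget, gqa, seq_len))
    print("MLA max concurrency:", max_concurrency(budget, mla, seq_len))
```

Under these assumed numbers the per-token cache shrinks by roughly 2.7$\times$ and the number of concurrent full-length sequences grows by about the same factor. The resemblance to the reported 2.69$\times$ is incidental to the chosen values. Real serving stacks also spend memory on weights, activations, and allocator overhead, so measured concurrency gains depend on the deployment.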