This paper presents an empirical study of 705 real-world failures reported by users of open-source LLMs (DeepSeek, Llama, and Qwen) to understand reliability challenges in user-managed deployments. The analysis shows that in white-box orchestration, the reliability bottleneck shifts from model defects to the systemic fragility of the deployment stack. Key findings include a diagnostic divergence between runtime crashes (which signal infrastructure friction) and incorrect functionality (which signals internal tokenizer defects), a systemic homogeneity of root causes across different model series, and a lifecycle escalation of barriers from fine-tuning to inference.
Forget algorithmic flaws: the real reliability bottleneck for open-source LLMs lies in the fragile deployment stack, not the model architecture itself.
The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure, but it also exposes them to a "First Mile" deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from algorithmic model defects to the systemic fragility of the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature of internal tokenizer defects. (2) Systemic Homogeneity: root causes converge across divergent model series, confirming that reliability barriers are inherent to the shared ecosystem rather than to specific architectures. (3) Lifecycle Escalation: barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the open-source LLM landscape.