This paper presents an empirical study of 705 real-world failures reported by users of open-source LLMs (DeepSeek, Llama, and Qwen) to understand reliability challenges in user-managed deployments. The analysis shows that in white-box orchestration, the reliability bottleneck shifts from model defects to the systemic fragility of the deployment stack. Key findings include a diagnostic divergence between runtime crashes (which signal infrastructure friction) and incorrect functionality (which signals internal tokenizer defects), a systemic homogeneity of root causes across different model series, and a lifecycle escalation of barriers from fine-tuning to inference.
Forget algorithmic flaws: the real reliability bottleneck for open-source LLMs lies in the fragile deployment stack, not the model architecture itself.
The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure, but it also exposes them to a "First Mile" deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from algorithmic model defects to the systemic fragility of the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature of internal tokenizer defects. (2) Systemic Homogeneity: root causes converge across divergent model series, confirming that reliability barriers are inherent to the shared ecosystem rather than to specific architectures. (3) Lifecycle Escalation: barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the open-source LLM landscape.