This paper introduces a multi-dimensional quality scoring framework for decentralized LLM inference, decomposing quality into dimensions like model priors, structure, semantics, alignment, and agreement. Through systematic auditing of QA and summarization tasks, the authors identify task-dependent and negatively correlated dimensions that degrade performance. By removing unreliable dimensions and recalibrating weights, they achieve a calibrated composite score that outperforms single-evaluator and consensus baselines, and further demonstrate its effectiveness when integrated with Proof of Quality mechanisms under adversarial conditions.
Seemingly intuitive quality metrics for LLM outputs can actually hurt performance in decentralized inference, unless you carefully audit and calibrate them.
Decentralized large language model (LLM) inference networks can pool heterogeneous compute to scale serving, but they require lightweight and incentive-compatible mechanisms to assess output quality. Prior work introduced cost-aware Proof of Quality (PoQ) and adaptive robust PoQ to allocate rewards under evaluator heterogeneity and adversarial behavior. In this paper, we focus on the quality signal itself and propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions, including model and cost priors, structure quality, semantic quality, query-output alignment, and agreement/uncertainty. Using logged outputs from QA and summarization tasks, we systematically audit dimension reliability and show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration. While the default composite underperforms a strong single semantic evaluator, ablations reveal that removing unreliable dimensions and re-normalizing weights yields a calibrated composite that matches or exceeds the best single-evaluator and consensus baselines. Finally, we integrate the composite score as a drop-in quality signal in PoQ and demonstrate complementary benefits with robust aggregation and adaptive trust weighting under adversarial evaluator attacks.
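The audit-then-recalibrate step described above can be illustrated with a minimal sketch. It assumes the composite is a weighted linear sum of per-dimension scores, which the abstract does not state, and it uses Pearson correlation against reference quality as the reliability check. All function names, dimension keys, the threshold, and the weights are hypothetical and chosen only for illustration; they are not the authors' implementation or values.

```python
import math
from typing import Dict, Iterable, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def unreliable_dimensions(
    dim_logs: Dict[str, Sequence[float]],
    reference: Sequence[float],
    threshold: float = 0.0,  # assumed cutoff: drop non-positive correlations
) -> List[str]:
    """Flag dimensions whose logged scores do not track reference quality."""
    return [
        name for name, scores in dim_logs.items()
        if pearson(scores, reference) <= threshold
    ]


def renormalize(weights: Dict[str, float], drop: Iterable[str] = ()) -> Dict[str, float]:
    """Remove dropped dimensions and rescale remaining weights to sum to 1."""
    dropped = set(drop)
    kept = {k: w for k, w in weights.items() if k not in dropped}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no positive weight left after dropping dimensions")
    return {k: w / total for k, w in kept.items()}


def composite_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the dimensions present in `weights` (assumed form)."""
    return sum(w * dim_scores[k] for k, w in weights.items())


# Illustrative (made-up) weights over the dimensions named in the abstract.
default_weights = {
    "priors": 0.2, "structure": 0.2, "semantic": 0.3,
    "alignment": 0.2, "agreement": 0.1,
}
```

In use, one would compute `unreliable_dimensions` per task on logged outputs, pass the result as `drop` to `renormalize`, and score new outputs with `composite_score` using the recalibrated weights. Because the abstract reports that reliability is task-dependent, the audit would likely need to be repeated for each task rather than applied once globally.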