The paper introduces a confidence-driven model selection strategy that dynamically chooses among LLMs of different sizes using two confidence estimates: the likelihood that a model knows the correct answer and the probability that its response is accurate. The approach reduces computational cost by delegating only uncertain or complex tasks to larger models, while smaller, more efficient models handle the simpler ones. On MMLU, the method achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%; applied to GPT-4o API calls, it reduces token usage by approximately 60%.
Slash your LLM inference costs: routing tasks to smaller models based on confidence estimates cuts compute by 20% to 40% on MMLU and GPT-4o token usage by about 60%, with accuracy comparable to the largest model.
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational cost. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. A model's confidence in handling a task and in the accuracy of its response determines the routing: tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
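The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Model` class, the `route` function, the two thresholds, and the toy confidence functions are all hypothetical, and the abstract does not say how the confidence estimates are computed or combined. The sketch assumes models are tried from smallest to largest, that a model is skipped when its estimated likelihood of knowing the answer is low, and that its response is kept only when the estimated probability of being correct is high enough. The cost of producing the confidence estimates themselves is ignored.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Model:
    name: str
    cost: float  # hypothetical relative cost of one answer
    # Estimated likelihood that the model knows the answer (assumed given).
    p_know: Callable[[str], float]
    # Returns (response, estimated probability that the response is correct).
    answer: Callable[[str], Tuple[str, float]]


def route(
    query: str,
    models: List[Model],
    know_threshold: float = 0.5,    # assumed value, not from the paper
    accept_threshold: float = 0.8,  # assumed value, not from the paper
) -> Tuple[str, str, float]:
    """Try models from smallest to largest; return (response, model name, cost spent)."""
    spent = 0.0
    for i, model in enumerate(models):
        is_last = i == len(models) - 1
        # Skip a smaller model that is unlikely to know the answer.
        if not is_last and model.p_know(query) < know_threshold:
            continue
        response, p_correct = model.answer(query)
        spent += model.cost
        # The largest model is the fallback, so its response is always kept.
        if is_last or p_correct >= accept_threshold:
            return response, model.name, spent
    raise ValueError("models must be a non-empty list")


# Toy stand-ins for real models: the small one is only confident on short queries.
small = Model("small", 1.0, lambda q: 0.9 if len(q) < 20 else 0.2, lambda q: ("A", 0.9))
large = Model("large", 10.0, lambda q: 1.0, lambda q: ("B", 0.95))

easy = route("2+2?", [small, large])
hard = route("a long and difficult question", [small, large])
print(easy)
print(hard)
```

In this toy setup the short query stays on the small model and the long one is delegated directly to the large model, which is where the savings described in the abstract would come from. In practice the thresholds would need tuning against a benchmark to trade accuracy for cost.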