This paper introduces a "Privacy Guard" framework that uses a local Small Language Model (SLM) to perform abstractive summarization and Automatic Prompt Optimization (APO), routing high-risk queries to specialized models. This approach aims to reduce both operational costs (by minimizing token payloads) and privacy risks (by eliminating sensitive inference vectors). Experiments on a 1,000-sample dataset demonstrate a 45% reduction in operational expenses, complete redaction of personal secrets, and an 85% preference rate for APO-compressed responses.
Cutting LLM costs and ensuring zero data leakage might be two sides of the same contextual compression coin.
The large-scale adoption of Large Language Models (LLMs) forces a trade-off between operational cost (OpEx) and data privacy. Current routing frameworks reduce costs but ignore prompt sensitivity, exposing users and institutions to leakage risks toward third-party cloud providers. We formalise the "Inseparability Paradigm": advanced context management intrinsically coincides with privacy management. We propose a local "Privacy Guard", a holistic contextual observer powered by an on-premise Small Language Model (SLM), that performs abstractive summarisation and Automatic Prompt Optimisation (APO) to decompose prompts into focused sub-tasks, re-routing high-risk queries to Zero-Trust or NDA-covered models. This dual mechanism simultaneously eliminates sensitive inference vectors (Zero Leakage) and reduces cloud token payloads (OpEx Reduction). A LIFO-based context compacting mechanism further bounds working memory, limiting the emergent leakage surface. We validate the framework through a 2x2 benchmark (Lazy vs. Expert users; Personal vs. Institutional secrets) on a 1,000-sample dataset, achieving a 45% blended OpEx reduction, 100% redaction success on personal secrets, and, via LLM-as-a-Judge evaluation, an 85% preference rate for APO-compressed responses over raw baselines. Our results demonstrate that Token Parsimony and Zero Leakage are mathematically dual projections of the same contextual compression operator.
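The abstract describes a pipeline of sensitivity classification, prompt compression, bounded context, and risk-based routing. The following is a minimal sketch of that control flow, not the paper's implementation: all class and method names (`PrivacyGuard`, `is_high_risk`, `compress`, `route`) are hypothetical, the keyword check stands in for the on-premise SLM classifier, and truncation stands in for abstractive summarisation and APO; the exact LIFO compaction policy may differ from this bounded-deque approximation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PrivacyGuard:
    """Hypothetical local guard sketched from the abstract's description."""
    max_context: int = 4  # bound on working memory (sketch of LIFO compacting)
    context: deque = field(default_factory=deque)

    def is_high_risk(self, prompt: str) -> bool:
        # Stand-in for the on-premise SLM's sensitivity classifier;
        # a real guard would infer risk, not match keywords.
        return any(tok in prompt.lower() for tok in ("ssn", "password", "nda"))

    def compress(self, prompt: str) -> str:
        # Stand-in for abstractive summarisation + APO: truncation here
        # only mimics the token-payload reduction.
        return prompt[:64]

    def route(self, prompt: str) -> tuple[str, str]:
        compressed = self.compress(prompt)
        # Keep only the most recent items of working memory,
        # limiting the context that could leak downstream.
        self.context.append(compressed)
        while len(self.context) > self.max_context:
            self.context.popleft()
        # High-risk queries go to a Zero-Trust / NDA-covered model;
        # everything else goes to the cheaper cloud LLM.
        target = "zero_trust_model" if self.is_high_risk(prompt) else "cloud_llm"
        return target, compressed
```

Under this sketch, the cost and privacy levers are literally the same operator: `compress` shrinks the cloud payload, and the same guard that produces it decides the routing target.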