The paper introduces a framework and benchmark for evaluating the tool-aware planning capabilities of LLMs in contact center scenarios, focusing on decomposing complex business insight queries into executable steps using structured (Text2SQL/Snowflake) and unstructured (RAG/transcripts) tools. The authors present a data curation methodology that uses an evaluator-optimizer loop to generate high-quality plan lineages, and they conduct a large-scale evaluation of 14 LLMs. Results indicate that LLMs struggle with compound queries and with plans longer than 4 steps, exposing gaps in tool understanding and showing that shorter, simpler plans are easier to get right, while plan lineage provides mixed benefits.
Even the strongest of the 14 evaluated LLMs reaches only an 84.8% total metric score when decomposing complex data-analysis queries into executable tool-based plans, highlighting persistent gaps in tool understanding and planning ability for contact center AI.
We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
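To make the role of depends_on concrete, the following is a minimal sketch of how a tool-assigned plan might be represented and how its dependency edges determine which steps can run in parallel. The abstract does not specify the plan schema, so the `Step` fields, the tool labels "T2S" and "RAG", the example prompts, and the `parallel_stages` helper are all illustrative assumptions rather than the authors' format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One plan step (hypothetical schema, not taken from the paper)."""
    id: str
    tool: str      # assumed labels: "T2S" (structured) or "RAG" (unstructured)
    prompt: str    # instruction passed to the tool
    depends_on: List[str] = field(default_factory=list)


def parallel_stages(steps: List[Step]) -> List[List[str]]:
    """Group step ids into stages; every step in a stage depends only on
    steps from earlier stages, so a stage's steps can run concurrently."""
    remaining: Dict[str, Step] = {s.id: s for s in steps}
    for s in steps:
        for dep in s.depends_on:
            if dep not in remaining:
                raise ValueError(f"step {s.id} depends on unknown step {dep}")

    done = set()
    stages: List[List[str]] = []
    while remaining:
        # A step is ready once all of its dependencies have completed.
        ready = [sid for sid, s in remaining.items()
                 if all(dep in done for dep in s.depends_on)]
        if not ready:
            raise ValueError("cyclic depends_on; plan is not executable")
        stages.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return stages


# Illustrative 4-step plan mixing structured and unstructured tools.
plan = [
    Step("s1", "T2S", "Count calls per product line for last quarter"),
    Step("s2", "RAG", "Summarize the main complaint themes in transcripts"),
    Step("s3", "T2S", "Compute average handle time for the busiest product line",
         depends_on=["s1"]),
    Step("s4", "RAG", "Relate complaint themes to handle time",
         depends_on=["s2", "s3"]),
]

stages = parallel_stages(plan)
print(stages)  # [['s1', 's2'], ['s3'], ['s4']]
```

In this sketch, s1 and s2 have no dependencies and can run side by side, while s4 must wait for both branches. A plan with a cycle or a dangling dependency is rejected, which is one simple way a step-executability check could be approximated.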