Feb 16, 2026 | arXiv:2602.14643

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Luís Silva, Diogo Gonçalves, Catarina Farinha, Clara Matos, Luís Ungaro

AI Summary

The paper introduces Arbor, a framework for navigating decision trees with LLMs by decomposing the process into node-level tasks to improve reliability in critical conversation flows. Arbor represents decision trees as edge lists and uses a DAG-based orchestration to retrieve outgoing edges, evaluate transitions, and generate responses in separate steps. Experiments on clinical triage conversations show that Arbor significantly improves turn accuracy, reduces latency, and lowers costs compared to single-prompt baselines across various foundation models.

Key Contribution

Achieves a 29.4 percentage point gain in mean turn accuracy, 57.1% lower per-turn latency, and an average 14.4x reduction in per-turn cost for LLM-driven decision tree navigation by decomposing the process into node-level tasks, relative to single-prompt baselines.

Abstract

Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
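The per-turn loop described in the abstract (retrieve the current node's outgoing edges, evaluate the transition, then generate the response separately) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Edge` fields, the toy triage tree, and the helper names are all hypothetical, and the two LLM calls are replaced by deterministic stand-in functions so the example runs without a model provider.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str      # node the edge leaves
    target: str      # node the edge enters
    condition: str   # natural-language condition for taking the edge


# Hypothetical toy triage tree in edge-list form (not taken from the paper).
EDGES = [
    Edge("start", "ask_severity", "patient reports chest pain"),
    Edge("start", "self_care", "patient reports a mild cold"),
    Edge("ask_severity", "emergency", "pain is severe"),
    Edge("ask_severity", "book_visit", "pain is mild"),
]


def build_index(edges):
    """Group edges by source node so a lookup touches only one node's edges."""
    index = defaultdict(list)
    for edge in edges:
        index[edge.source].append(edge)
    return index


def keyword_judge(edges, user_message):
    """Stand-in for the transition-evaluation LLM call (assumption): returns
    the first edge whose final condition word appears in the message."""
    text = user_message.lower()
    for edge in edges:
        if edge.condition.split()[-1] in text:
            return edge
    return None


def template_responder(node):
    """Stand-in for the separate response-generation LLM call (assumption)."""
    return f"[response for node '{node}']"


def step(index, current, user_message,
         judge=keyword_judge, responder=template_responder):
    """One turn: retrieve outgoing edges, evaluate the transition, respond."""
    candidates = index.get(current, [])   # only the current node's edges
    chosen = judge(candidates, user_message)
    next_node = chosen.target if chosen else current  # stay if nothing matches
    return next_node, responder(next_node)


index = build_index(EDGES)
node, reply = step(index, "start", "I have chest pain")
node, reply = step(index, node, "It is severe")
```

In this sketch the judge only ever sees the handful of edges leaving the current node, so the prompt for a real transition-evaluation call would stay small regardless of the size of the full tree. Swapping `keyword_judge` and `template_responder` for provider-specific calls would leave `step` unchanged.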

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
