This paper addresses the challenge of learning optimal routing policies in multi-layer hierarchical inference systems with partial, policy-dependent feedback that is available only at the terminal layer. The authors show that standard importance-weighted contextual bandit methods become unstable because the probability of receiving feedback decays along the hierarchy. They propose a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization to achieve unbiased loss estimation and stable learning, and they provide regret guarantees relative to the best fixed routing policy.
Naive importance weighting falls apart in deep hierarchical inference systems with sparse feedback, but a variance-reduced EXP4 algorithm can restore stability and near-optimal routing.
Hierarchical inference systems route tasks across multiple computational layers, where each node may either finalize a prediction locally or offload the task to a node in the next layer for further processing. Learning optimal routing policies in such systems is challenging: inference loss is defined recursively across layers, while feedback on prediction error is revealed only at a terminal oracle layer. This induces a partial, policy-dependent feedback structure in which observability probabilities decay with depth, causing importance-weighted estimators to suffer from amplified variance. We study online routing for multi-layer hierarchical inference under long-term resource constraints and terminal-only feedback. We formalize the recursive loss structure and show that naive importance-weighted contextual bandit methods become unstable as feedback probability decays along the hierarchy. To address this, we develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback. We provide regret guarantees relative to the best fixed routing policy in hindsight and establish near-optimality under stochastic arrivals and resource constraints. Experiments on large-scale multi-task workloads demonstrate improved stability and performance compared to standard importance-weighted approaches.
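The variance amplification described above can be illustrated with a minimal sketch. It covers only the naive importance-weighted estimator, not the paper's variance-reduced algorithm, and it rests on assumptions that are not from the source: every layer offloads with the same fixed probability, the loss is a fixed constant, and the function names are hypothetical. Under those assumptions, the probability of observing feedback is the product of the per-layer offload probabilities, and the estimator `loss * 1{observed} / p_obs` stays unbiased while its variance grows as `loss^2 * (1/p_obs - 1)`.

```python
def observation_probability(offload_probs):
    """Probability that a task reaches the terminal layer.

    Assumption: layers offload independently, so the probability is the
    product of the per-layer offload probabilities.
    """
    p = 1.0
    for q in offload_probs:
        p *= q
    return p


def iw_estimator_moments(loss, p_obs):
    """Exact mean and variance of the estimator loss * 1{observed} / p_obs.

    The estimator equals loss / p_obs with probability p_obs and 0 otherwise.
    """
    mean = p_obs * (loss / p_obs)
    second_moment = p_obs * (loss / p_obs) ** 2
    return mean, second_moment - mean ** 2


def moments_by_depth(loss=1.0, offload_prob=0.5, max_depth=4):
    """Tabulate (depth, p_obs, mean, variance) for hierarchies of growing depth."""
    rows = []
    for depth in range(1, max_depth + 1):
        # Illustrative assumption: the same offload probability at every layer.
        p_obs = observation_probability([offload_prob] * depth)
        mean, variance = iw_estimator_moments(loss, p_obs)
        rows.append((depth, p_obs, mean, variance))
    return rows


if __name__ == "__main__":
    for depth, p_obs, mean, variance in moments_by_depth():
        print(f"depth={depth} p_obs={p_obs:.4f} mean={mean:.2f} variance={variance:.2f}")
```

With the default values, the mean stays at the true loss for every depth while the variance roughly doubles with each added layer, which is the instability that motivates a variance-reduced estimator.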