Search papers, labs, and topics across Lattice.
The paper introduces ECO, a novel offline self-play paradigm for Neural Combinatorial Optimization (NCO) that addresses the inefficiency of online methods. ECO employs a two-phase approach involving supervised warm-up followed by iterative Direct Preference Optimization (DPO), coupled with a Mamba-based architecture designed for efficiency. The method also incorporates a heuristic-based bootstrapping mechanism to stabilize policy improvement, achieving competitive performance on TSP and CVRP while significantly improving memory utilization and training throughput.
Ditch online training for neural combinatorial solvers: ECO leverages offline self-play with Mamba to achieve state-of-the-art performance with significantly improved memory utilization and training throughput.
We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further enhance the efficiency in the offline paradigm; and 3) Progressive Bootstrapping: To stabilize training, we employ a heuristic-based bootstrapping mechanism that ensures continuous policy improvement during training. Comparison results on TSP and CVRP highlight that ECO performs competitively with up-to-date baselines, with significant advantage on the efficiency side in terms of memory utilization and training throughput. We provide further in-depth analysis on the efficiency, throughput and memory usage of ECO. Ablation studies show rationale behind our designs.