This paper introduces a distributed Variational Quantum Linear Solver (D-VQLS) framework, built on NVIDIA CUDA-Q, to address the scalability bottleneck of VQLS, which requires O(L^2) circuit evaluations per optimizer iteration, where L is the number of Linear Combination of Unitaries (LCU) terms. The authors combine this with a fast Walsh-Hadamard transform (FWHT)-based Pauli decomposition with coefficient thresholding, which reduces the number of LCU terms from O(2^n) to O(1) for sparse, structured matrices. Results on a 10-qubit tridiagonal Toeplitz system show a 256x reduction in circuits per iteration while maintaining over 99.99% solution fidelity. The framework is validated with ideal state-vector simulations on the NERSC Perlmutter supercomputer, showing near-ideal strong scaling.
Forget intractable quantum linear algebra: this distributed algorithm and circuit compression technique slashes circuit complexity by 256x while preserving solution fidelity, opening the door to practical quantum advantage on near-term hardware.
The Variational Quantum Linear Solver (VQLS), a hybrid quantum-classical algorithm for solving linear systems, faces a practical scalability bottleneck: the Linear Combination of Unitaries (LCU) decomposition requires O(L^2) circuit evaluations per optimizer iteration, where L, the number of LCU terms, can grow as 4^n for n-qubit systems in the worst case. We address this computational bottleneck through two complementary strategies. First, we present a distributed VQLS (D-VQLS) framework, built on NVIDIA CUDA-Q, that enables asynchronous, scalable distribution of the O(L^2) cost-function evaluations. Second, a fast Walsh-Hadamard transform (FWHT)-based Pauli decomposition with 1% coefficient thresholding curbs the exponential growth of LCU terms, reducing L from O(2^n) to O(1) for n > 6 qubits and compressing the per-iteration circuit complexity from O(n * 4^n) to O(n) for sparse, structured matrices. For a 10-qubit tridiagonal Toeplitz system, this yields a 256x reduction, from 23 million to 90,112 circuits per iteration, while preserving over 99.99% solution fidelity. Additionally, to inform feasibility on early fault-tolerant QPUs, we provide resource estimates (gate counts, qubit requirements, and circuit evaluations per iteration) for VQLS applied to arbitrary matrices. The D-VQLS framework is validated on the NERSC Perlmutter supercomputer using multi-node, multi-GPU ideal state-vector simulations, achieving over 99.99% fidelity against classical solutions on tridiagonal Toeplitz and Hele-Shaw flow benchmarks, with near-ideal strong scaling up to 24 GPUs and 95.3% weak scaling efficiency at 96 GPUs processing 360,448 circuits per iteration for a 10-qubit system. Systematic profiling identifies the optimal resource allocation for distributed quantum circuit workloads, yielding a 2.52x speedup for the configurations studied.
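To make the second strategy concrete, here is a minimal NumPy sketch of an FWHT-based Pauli decomposition followed by coefficient thresholding. It is an illustration under stated assumptions, not the authors' implementation. The function names (`fwht`, `pauli_decompose`, `threshold`, `tridiag_toeplitz`), the Toeplitz entries (2 on the diagonal, -1 off it), and the reading of "1% thresholding" as "relative to the largest coefficient magnitude" are all assumptions introduced here. The sketch relies on the identity that, for each bit-flip pattern x, the coefficients of all Pauli strings with that X-pattern are the Walsh-Hadamard transform of the vector A[j XOR x, j], up to a phase.

```python
import numpy as np


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform.

    Returns out[z] = sum_j (-1)^popcount(z & j) * v[j] in O(N log N).
    """
    v = np.array(v, dtype=complex)
    n = v.size
    h = 1
    while h < n:
        v = v.reshape(-1, 2, h)                      # butterfly blocks of size 2h
        v = np.stack([v[:, 0] + v[:, 1], v[:, 0] - v[:, 1]], axis=1).reshape(n)
        h *= 2
    return v


def pauli_decompose(A, tol=1e-12):
    """Return {label: coeff} such that A = sum(coeff * P_label).

    Labels such as 'IXY' list one Pauli per qubit, leftmost = first Kronecker factor.
    Cost: N transforms of length N, i.e. O(n * 4^n) for N = 2^n.
    """
    N = A.shape[0]
    n = N.bit_length() - 1
    assert A.shape == (N, N) and (1 << n) == N, "A must be 2^n x 2^n"
    idx = np.arange(N)
    popcount = np.array([bin(i).count("1") for i in range(N)])
    phases = np.array([1, -1j, -1, 1j])              # (-i)^k for k mod 4
    terms = {}
    for x in range(N):                               # x: which qubits carry X or Y
        w = fwht(A[idx ^ x, idx]) / N                # transform of the x-th "XOR diagonal"
        c = phases[popcount[x & idx] % 4] * w        # one factor of -i per Y (x and z both set)
        for z in np.flatnonzero(np.abs(c) > tol):    # z: which qubits carry Z or Y
            z = int(z)
            label = "".join(
                "IXZY"[((x >> k) & 1) + 2 * ((z >> k) & 1)]
                for k in reversed(range(n))
            )
            terms[label] = c[z]
    return terms


def threshold(terms, rel=0.01):
    """Keep terms with |coeff| >= rel * max|coeff| (assumed thresholding convention)."""
    cutoff = rel * max(abs(c) for c in terms.values())
    return {p: c for p, c in terms.items() if abs(c) >= cutoff}


def tridiag_toeplitz(n, diag=2.0, off=-1.0):
    """Hypothetical 2^n x 2^n tridiagonal Toeplitz test matrix."""
    N = 2 ** n
    return diag * np.eye(N) + off * (np.eye(N, k=1) + np.eye(N, k=-1))


if __name__ == "__main__":
    for n in (4, 6, 8):
        full = pauli_decompose(tridiag_toeplitz(n))
        kept = threshold(full)
        print(f"n={n}: L before thresholding = {len(full)}, after = {len(kept)}")
```

For this toy matrix the untruncated term count grows as 2^n, while the thresholded count stops growing once the smallest coefficients fall below the cutoff. Because the VQLS cost scales as O(L^2), the circuit count shrinks by the square of the reduction in L. As a consistency check on the quoted figures, 256 x 90,112 = 23,068,672, which agrees with the stated "23 million" circuits per iteration.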