This paper re-evaluates the performance of Graph Neural Networks (GNNs) for Bitcoin fraud detection on the Elliptic Bitcoin Dataset under temporal distribution shift. Through a seed-matched inductive-versus-transductive comparison, the authors demonstrate that GNNs underperform a simple Random Forest baseline when evaluated under a strictly inductive, leakage-free protocol. Furthermore, they show that the graph structure itself can be detrimental, as randomly wired graphs outperform the real transaction graph, suggesting that the dataset's topology can be misleading.
The widely held belief that GNNs outperform feature-only methods for Bitcoin fraud detection crumbles under rigorous, leakage-free evaluation, which reveals that the graph structure can actually hurt performance.
The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 ± 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
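To make the two graph manipulations in the abstract concrete, the following is a minimal Python sketch, not the authors' released code. It assumes each node carries an integer time step and that training uses all steps up to a chosen cutoff. The function names, the `last_train_step` parameter, and the toy data are hypothetical. The shuffle shown is one simple rewiring scheme (permuting destination endpoints); the abstract does not specify the exact procedure used in the paper.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def inductive_train_edges(
    edges: List[Edge], time_step: Dict[int, int], last_train_step: int
) -> List[Edge]:
    """Keep only edges whose two endpoints both lie in the training period.

    Under a strictly inductive protocol the model never sees test-period
    nodes or their adjacency while training (hypothetical helper).
    """
    return [
        (u, v)
        for u, v in edges
        if time_step[u] <= last_train_step and time_step[v] <= last_train_step
    ]


def shuffle_edges(edges: List[Edge], seed: int) -> List[Edge]:
    """Randomly rewire the graph by permuting destination endpoints.

    The number of edges and each node's count as a source or as a
    destination are preserved; who is connected to whom is randomized.
    This simple scheme may create self-loops or duplicate edges.
    """
    rng = random.Random(seed)
    sources = [u for u, _ in edges]
    targets = [v for _, v in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


# Toy data (assumption): nodes 0-2 are in time step 1, nodes 3-5 in time step 2.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
toy_time = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}

train_edges = inductive_train_edges(toy_edges, toy_time, last_train_step=1)
rewired = shuffle_edges(toy_edges, seed=0)
```

In a transductive setup the full `toy_edges` list would be available to message passing during training, whereas the inductive variant trains on `train_edges` only. Comparing a model trained on `rewired` against one trained on the real edges is the kind of ablation the abstract describes.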