UvA | Mar 2, 2026 | arXiv:2603.01863

Tide: A Customisable Dataset Generator for Anti-Money Laundering Research

Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu

AI Summary

The paper introduces Tide, an open-source, customizable synthetic dataset generator for anti-money laundering (AML) research that creates graph-based financial networks with both structural and temporal characteristics of money laundering schemes. The lack of realistic and accessible AML datasets hinders research, and Tide addresses this by allowing researchers to generate datasets tailored to specific research needs and illicit ratios. Experiments using two reference datasets with varying illicit ratios (0.10% and 0.19%) demonstrate that different state-of-the-art detection models (LightGBM and XGBoost) perform best under different conditions, highlighting the utility of Tide for benchmarking AML detection methods.

Key Contribution

Tide generates customisable AML datasets with both structural and temporal laundering patterns, and its reference datasets show that the ranking of detection models changes with the prevalence of illicit activity.

Abstract

The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.
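As a rough illustration of the metric quoted above, the sketch below computes PR-AUC as average precision over a ranked list of transactions. It is not taken from the Tide code base: the function name `average_precision` and the toy inputs are assumptions made for this example, and the result is on a 0 to 1 scale, whereas the abstract reports values on a 0 to 100 scale. A scorer that ranks at random has an expected average precision close to the illicit ratio, which is one reason scores obtained at 0.10% and 0.19% are not directly comparable.

```python
def average_precision(labels, scores):
    """Average precision, a common estimate of the area under the PR curve.

    labels: iterable of 0/1 values (1 = illicit), hypothetical input format.
    scores: iterable of model scores, higher = more suspicious.
    Ties in score keep their input order (Python's sort is stable).
    """
    labels = list(labels)
    scores = list(scores)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one illicit example")

    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at the rank of each illicit hit
    return ap / total_pos


if __name__ == "__main__":
    # Toy example: two illicit transactions among four.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    print(f"average precision: {average_precision(toy_labels, toy_scores):.4f}")
```

In practice a library routine such as scikit-learn's `average_precision_score` would normally be used; the hand-rolled version is shown only to make the calculation explicit.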

Data Curation & Synthetic Data | Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
