Carleton, Concordia University | Feb 25, 2026 | arXiv:2602.21826

The Silent Spill: Measuring Sensitive Data Leaks Across Public URL Repositories

Tarek Ramadan, AbdelRahman Abdou, Mohammad Mannan, Amr M. Youssef

AI Summary

The paper introduces an automated system for detecting sensitive information leaked through publicly accessible URLs, combining lexical URL filtering, dynamic rendering, OCR, and content classification. The authors applied the system to a dataset of over 6 million URLs from public scanning platforms, paste sites, and web archives. The study identified 12,331 potential exposures across authentication, financial, personal, and document-related domains, highlighting the prevalence of unintentional data leaks.

Key Contribution

Over 12,000 potential sensitive data exposures are discoverable across public URL repositories, revealing a significant and previously unquantified attack surface.

Abstract

A large number of URLs are made public by various platforms for security analysis, archiving, and paste sharing -- such as VirusTotal, URLScan.io, Hybrid Analysis, the Wayback Machine, and RedHunt. These services may unintentionally expose links containing sensitive information, as reported in some news articles and blog posts. However, no large-scale measurement has quantified the extent of such exposures. We present an automated system that detects and analyzes potential sensitive information leaked through publicly accessible URLs. The system combines lexical URL filtering, dynamic rendering, OCR-based extraction, and content classification to identify potential leaks. We apply it to 6,094,475 URLs collected from public scanning platforms, paste sites, and web archives, identifying 12,331 potential exposures across authentication, financial, personal, and document-related domains. These findings show that sensitive information remains exposed, underscoring the importance of automated detection to identify accidental leaks.
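The abstract names lexical URL filtering as the first stage of the pipeline but does not describe how it works. As an illustration only, the sketch below shows one way such a filter could operate: it tokenizes a URL's path and query parameter names and compares the tokens against per-category keyword lists. The keyword lists, the category names as dictionary keys, and the function name `lexical_categories` are all hypothetical and are not taken from the paper; the authors' filter may differ substantially.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical keyword lists, chosen for illustration only (not from the paper).
SENSITIVE_KEYWORDS = {
    "authentication": {"token", "password", "passwd", "apikey", "session", "reset"},
    "financial": {"invoice", "payment", "receipt", "billing"},
    "personal": {"email", "phone", "passport", "ssn"},
    "document": {"contract", "report", "statement"},
}

# Split on anything that is not a lowercase letter or digit.
TOKEN_SPLIT = re.compile(r"[^a-z0-9]+")


def lexical_categories(url: str) -> set:
    """Return the categories whose keywords appear as tokens in the URL.

    Only the path and the query parameter names are inspected; parameter
    values are ignored so that the check stays purely lexical.
    """
    parts = urlsplit(url)
    tokens = set(TOKEN_SPLIT.split(parts.path.lower()))
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.update(TOKEN_SPLIT.split(key.lower()))
    tokens.discard("")  # splitting can leave empty strings at the edges
    return {
        category
        for category, keywords in SENSITIVE_KEYWORDS.items()
        if tokens & keywords
    }


if __name__ == "__main__":
    for candidate in (
        "https://example.com/reset-password?access_token=abc123",
        "https://example.com/invoice/2024.pdf?email=a%40b.com",
        "https://example.com/blog/hello-world",
    ):
        print(candidate, sorted(lexical_categories(candidate)))
```

A filter of this kind is cheap enough to run over millions of URLs, which is presumably why a lexical pass would come before the more expensive rendering, OCR, and classification stages. Exact token matching keeps false positives lower than substring matching, at the cost of missing keywords embedded in longer words.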

Data Curation & Synthetic Data | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
