The paper introduces SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for the corresponding satellite image sequences, addressing the lack of multi-temporal event captioning datasets in remote sensing. SkyScraper uses agentic feedback to surface new multi-temporal events in satellite imagery and finds 5x more events than traditional geocoding methods. Applying the framework to a large database of global news articles, the authors curate a new multi-temporal captioning dataset with 5,000 sequences.
A multi-agent workflow with agentic feedback finds 5x more real-world events in satellite imagery than traditional geocoding methods, opening up training data for multi-temporal event captioning.
Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.
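The abstract does not spell out how the agentic feedback loop works, so the following is only a minimal sketch of one way such a loop could be structured, not the authors' implementation. It assumes two hypothetical agents: a proposer that suggests coordinates for an article, and a verifier that checks whether the event is visible in imagery at those coordinates and returns a textual hint when it is not. The names `Candidate`, `Verdict`, `find_event`, `toy_propose`, and `toy_verify`, the `max_rounds` parameter, and the coordinates are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """A proposed event location (hypothetical structure)."""
    lat: float
    lon: float


@dataclass
class Verdict:
    """Verifier output: whether the event is visible, plus a hint if not."""
    visible: bool
    feedback: str


def find_event(
    article: str,
    propose: Callable[[str, str], Candidate],
    verify: Callable[[str, Candidate], Verdict],
    max_rounds: int = 3,
) -> Optional[Candidate]:
    """Iterate propose -> verify, feeding the verifier's hint back to the proposer."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(article, feedback)
        verdict = verify(article, candidate)
        if verdict.visible:
            return candidate
        feedback = verdict.feedback  # refine the next proposal with this hint
    return None  # no visible event found within the round budget


# Toy stand-ins for the two agents; the coordinates are made up.
def toy_propose(article: str, feedback: str) -> Candidate:
    if "port" in feedback:
        return Candidate(10.5, 20.5)  # refined guess after feedback
    return Candidate(10.0, 20.0)      # coarse first guess (e.g. a city centroid)


def toy_verify(article: str, candidate: Candidate) -> Verdict:
    if candidate == Candidate(10.5, 20.5):
        return Verdict(True, "")
    return Verdict(False, "no change visible here; the article mentions the port")


result = find_event("Explosion at the port", toy_propose, toy_verify)
single_pass = find_event("Explosion at the port", toy_propose, toy_verify, max_rounds=1)
```

In this toy run, a single pass with no feedback (`max_rounds=1`) fails to locate the event, while the iterative loop succeeds on the second round. This is the kind of gap between one-shot geocoding and feedback-driven search that the abstract describes.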