Apr 8, 2026 · arXiv:2604.06829

WRAP++: Web discoveRy Amplified Pretraining

AI Summary

WRAP++ is a pretraining method that synthesizes QA data by discovering cross-document relationships from web hyperlinks, whereas existing methods rewrite single documents in isolation. It identifies high-confidence relational motifs such as dual-links and co-mentions, then generates QA pairs that require reasoning across both documents, producing relational knowledge absent from either source alone. Instantiated on Wikipedia, WRAP++ amplifies ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches.

Key Contribution

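Instead of rewriting single web pages in isolation, WRAP++ synthesizes QA pairs that require reasoning across pairs of linked documents. On Wikipedia this scales ~8.4B tokens of raw text to 80B tokens of QA data (roughly 10x) and improves SimpleQA results over single-document approaches.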

Abstract

Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
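As a rough illustration of the discovery step, the following Python sketch finds candidate document pairs in a small hyperlink graph. The abstract does not define the motifs precisely, so the definitions here are assumptions: a dual-link is taken to mean two pages that link to each other, and a co-mention is taken to mean two pages that are both linked from the same third page. The function names, the `min_support` threshold, and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def find_dual_links(links):
    """Return unordered page pairs that link to each other.

    `links` maps a page title to the set of titles it links to.
    Assumed definition of "dual-link": A links to B and B links to A.
    """
    pairs = set()
    for src, targets in links.items():
        for dst in targets:
            if dst != src and src in links.get(dst, set()):
                pairs.add(tuple(sorted((src, dst))))
    return pairs


def find_co_mentions(links, min_support=1):
    """Return unordered page pairs linked together from a common page.

    Assumed definition of "co-mention": both pages appear among the
    outgoing links of the same page. `min_support` (hypothetical) is the
    minimum number of such common pages required to keep a pair.
    """
    counts = defaultdict(int)
    for src, targets in links.items():
        others = sorted(t for t in targets if t != src)
        for a, b in combinations(others, 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_support}


# Toy hyperlink graph (illustrative only).
example_links = {
    "Marie Curie": {"Pierre Curie", "Radium"},
    "Pierre Curie": {"Marie Curie", "Radium"},
    "Radium": {"Marie Curie"},
    "Nobel Prize in Physics": {"Marie Curie", "Pierre Curie"},
}

dual = find_dual_links(example_links)
co = find_co_mentions(example_links)
```

Each pair returned by either function would then be a candidate for joint QA synthesis over the two documents; that synthesis step depends on a generator model and is not sketched here. Because co-mention pairs are enumerated per linking page, their number can grow quadratically with the number of outgoing links, which is consistent with the abstract's remark that valid entity pairs grow combinatorially.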

Architecture Design (Transformers, SSMs, MoE) · Data Curation & Synthetic Data · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
