Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Alexander Wettig1,2 Kyle Lo2 Sewon Min3,2 Hannaneh Hajishirzi2,4 Danqi Chen1 Luca Soldaini2
1Princeton University 2Ai2 3UC Berkeley 4University of Washington

Explore the Corpus

Explore different topics and formats and click to see examples! We visualize a pre-training corpus derived from CommonCrawl.
The area of each region corresponds to the number of tokens in that domain. Hover over a topic to see the distribution of formats within it, or vice versa.

Disclaimer: The examples are randomly sampled and not manually curated. The web pages may contain offensive content and the domain classification may be subject to errors.


Downloads

We fine-tune small (140M-parameter) domain classifiers that can be used to annotate data at scale. We train these models on annotations from Llama-3.1-405B-Instruct prompted with detailed descriptions of the domains. We release the models and training data on the HuggingFace Hub; a minimal usage sketch follows the download links below.

Topic Classifier

Format Classifier
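To illustrate how the classifiers can be applied, here is a minimal sketch using the transformers library. The repository id, the trust_remote_code flag, and the URL-plus-text input format are assumptions about the release; refer to the model cards on the HuggingFace Hub for the exact usage.

```python
# A minimal sketch of annotating a single document with one of the released
# classifiers. The repository id "WebOrganizer/TopicClassifier" and the
# "{url}\n\n{text}" input format are assumptions -- check the model cards for
# the exact repository names and expected input.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "WebOrganizer/TopicClassifier"  # assumed id; the format classifier is used analogously
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)

# Concatenate the page URL and text into a single input string (assumed format).
document = "https://example.com/baking\n\nPreheat the oven to 180C and mix the flour..."
inputs = tokenizer(document, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# The predicted domain is the highest-scoring label of the classifier.
predicted = model.config.id2label[logits.argmax(dim=-1).item()]
print(predicted)
```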

Corpus Annotations

Our 200B-token corpus is available on the HuggingFace Hub 🤗 at WebOrganizer/Corpus-200B. It is based on the DataComp-LM 1b-1x pool and was pre-processed with best-practice techniques (RefinedWeb filters and BFF deduplication). We provide the processed documents together with token counts, quality scores, domain annotations, and k-means cluster assignments.
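As a rough sketch, the corpus can be streamed with the datasets library. The split name and field names below are assumptions; consult the dataset card for the actual file layout and schema.

```python
# A minimal sketch for streaming the released corpus and inspecting its
# per-document annotations. The split name and field names are assumptions;
# see the WebOrganizer/Corpus-200B dataset card for the actual schema.
from datasets import load_dataset

corpus = load_dataset("WebOrganizer/Corpus-200B", split="train", streaming=True)

for example in corpus.take(3):
    # Each record carries the document text alongside token counts, quality
    # scores, topic/format annotations, and k-means cluster assignments.
    print(sorted(example.keys()))
```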

Motivation

Modern language models are trained on large, unstructured datasets consisting of trillions of tokens obtained by crawling the web. We introduce WebOrganizer, a framework that organizes this vast amount of information into meaningful topic and format domains.

These domains expose the internal composition of monolithic web corpora, and we demonstrate that they are useful for curating better pre-training data for language models.

Takeaways

Better Data Mixing: Constructing domains lets us systematically study how much each subset of the corpus matters. We adapt the RegMix framework to learn which domains are most useful for improving performance on MMLU and HellaSwag; a schematic sketch of the procedure follows the figure below.

Domain mixture optimization using RegMix
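As a rough, self-contained sketch of this kind of mixture optimization (not our exact implementation): random domain mixtures are sampled, a target metric is measured for a small proxy model trained on each, a regressor is fit from mixture weights to the metric, and the predicted-best mixture is selected. The helper train_proxy_and_evaluate and the use of scikit-learn's GradientBoostingRegressor are illustrative assumptions; RegMix fits a regression model over mixture weights in a similar spirit.

```python
# A schematic, simplified sketch of RegMix-style mixture optimization.
# The proxy-training step is replaced by a simulated helper so the script runs
# end-to-end; in practice it would train a small LM on each sampled mixture.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
num_domains = 24       # e.g. the number of topic or format domains
num_proxy_runs = 512   # number of small proxy models trained on random mixtures

# Hypothetical stand-in for "train a proxy model on this mixture and evaluate
# the target metric (e.g. loss on MMLU)": simulated as a noisy linear response.
true_utility = rng.normal(size=num_domains)
def train_proxy_and_evaluate(weights):
    return float(-weights @ true_utility + rng.normal(scale=0.01))

# 1) Sample random mixtures over domains (points on the probability simplex)
#    and record the metric achieved by a proxy model trained on each mixture.
mixtures = rng.dirichlet(np.ones(num_domains), size=num_proxy_runs)
metric = np.array([train_proxy_and_evaluate(w) for w in mixtures])

# 2) Fit a regressor that predicts the metric from the mixture weights.
regressor = GradientBoostingRegressor().fit(mixtures, metric)

# 3) Score many candidate mixtures with the regressor and keep the one with the
#    lowest predicted loss as the optimized domain mixture.
candidates = rng.dirichlet(np.ones(num_domains), size=100_000)
best_mixture = candidates[int(np.argmin(regressor.predict(candidates)))]
print(best_mixture.round(3))
```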

Our optimized domain mixtures improve downstream performance over the default proportions of the corpus across a range of tasks, and topic and format mixing can be combined to further improve performance.

Main results table

Complementary to Quality Filters: Data curation with domains and quality filters has complementary strengths - quality filtering can discard individual documents, while domain mixing can be finely calibrated towards downstream needs. Using both together achieves the strongest performance overall.

Understanding Data Curation: Our framework also provides insights into existing data curation practices. We study how quality filters implicitly change the domain distribution (sketched below), and observe both similarities and differences in the domain preferences of two quality filters, FineWeb-edu and DCLM-fastText. We then test how closely the performance of these quality filters can be approximated by training on their implicit domain mixtures. While these mixtures perform better than the default corpus proportions, a performance gap remains, suggesting that fine-grained, document-level quality filtering also contributes to performance.
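As a minimal sketch of how such an implicit domain mixture can be measured: keep the top-scoring fraction of documents under a quality filter and compute the token-weighted share of each domain among the survivors. The field names ("topic", "quality_score", "num_tokens") and the toy records are assumptions for illustration, mirroring the corpus annotations described above.

```python
# Sketch: apply a quality filter by keeping the top-scoring fraction of
# documents, then measure the token-weighted share of each domain among the
# survivors. Field names ("topic", "quality_score", "num_tokens") are assumptions.
from collections import Counter

def implicit_domain_mixture(documents, keep_fraction=0.1):
    # Mimic a quality filter: keep the highest-scoring fraction of documents.
    ranked = sorted(documents, key=lambda d: d["quality_score"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]

    # Token-weighted share of each domain among the surviving documents.
    tokens = Counter()
    for doc in kept:
        tokens[doc["topic"]] += doc["num_tokens"]
    total = sum(tokens.values())
    return {domain: count / total for domain, count in tokens.items()}

# Toy example; comparing this mixture to the full-corpus mixture shows which
# domains a given quality filter implicitly up- or down-weights.
docs = [
    {"topic": "Science & Tech.", "quality_score": 0.9, "num_tokens": 1200},
    {"topic": "Entertainment",   "quality_score": 0.2, "num_tokens": 800},
    {"topic": "Education",       "quality_score": 0.8, "num_tokens": 500},
]
print(implicit_domain_mixture(docs, keep_fraction=0.5))
```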

Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}