COMMON CRAWL FOUNDATION

Organization Type: Nonprofit Foundation
Country: United States
www.commoncrawl.org

Start Date: 2008
Archive interface language(s): English
Access methods: bulk download, random access via indexes
Harvesting methods: Monthly sample of the entire web

The Common Crawl Foundation provides an open dataset and archive of the Web dating back to 2008. Our text-focused dataset exceeds 9 petabytes and is accompanied by extensive metadata, such as search-engine-style rankings and host-level rollups of detected languages.

As of 2024, our data has been cited in more than 10,000 research papers. We are part of the Amazon Web Services Open Data Sponsorship Program (AWS ODS), and ours is the most heavily used dataset in that program.