June 2015 Crawl Archive Available

The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-27/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ gives you the S3 or HTTP path, respectively.
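As a rough illustration of that prefixing step, here is a minimal Python sketch. It assumes you have already downloaded one of the gzipped listing files and hold its raw bytes in memory. The function name `expand_paths` and the sample path are hypothetical, made up for this example; only the two prefixes come from the announcement.

```python
import gzip

# The two prefixes named in the announcement.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gzipped_listing: bytes) -> list:
    """Return (s3_url, http_url) pairs, one per non-empty line of a gzipped listing."""
    text = gzip.decompress(gzipped_listing).decode("utf-8")
    pairs = []
    for line in text.splitlines():
        key = line.strip()
        if not key:
            continue  # skip blank lines
        pairs.append((S3_PREFIX + key, HTTP_PREFIX + key))
    return pairs


# Hypothetical listing content, invented for illustration only.
sample = gzip.compress(b"crawl-data/CC-MAIN-2015-27/example-segment/example.warc.gz\n")
for s3_url, http_url in expand_paths(sample):
    print(s3_url)
    print(http_url)
```

The same function should work for the listing files of the other crawls, since only the relative paths inside the files change.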

The release also includes the June 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

May 2015 Crawl Archive Available

The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-22/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ gives you the S3 or HTTP path, respectively.

The release also includes the May 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

April 2015 Crawl Archive Available

The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-18/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ gives you the S3 or HTTP path, respectively.

The release also includes the April 2015 Common Crawl Index, introduced last month by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

March 2015 Crawl Archive Available

The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-14/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ gives you the S3 or HTTP path, respectively.

The release also includes the March 2015 Common Crawl Index, introduced last month by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please Donate To Common Crawl!

Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data. Imagine the questions we could answer and the problems we could solve if talented, creative technologists could freely access more big data.

At Common Crawl, we are passionate about getting big open data into the hands of talented and creative people. Increasing access to data enables everything from business innovation to groundbreaking research.

Common Crawl is proud of what we have accomplished in 2014 thanks to our dedicated team and the support of donors like you.

This year:

  • 19 academic publications using Common Crawl data were published
  • Several Open Educational Resources designed to teach big data tools and methods were created
  • 1.3 petabytes containing 18.5 billion web pages were added to the Common Crawl corpus
  • Numerous presentations and tutorials were given at international conferences, local meetup groups, and academic workshops in six countries

100% of our funding comes from donors like you. Thank you! We accomplish a great deal with a small, dedicated staff on a limited budget, so your investment in us has a big impact.

More resources for Common Crawl mean more access to data for everyone. In 2015 we have big plans: to scale up crawling so that the Common Crawl corpus grows more rapidly, and to expand our educational programs and catalogue of tutorials as an investment in the next generation of data-driven technologists.

Whether it’s $10 or $10,000, Common Crawl needs your donation today.

Donate here!

Thank you very much,
Lisa Green and the Common Crawl team