March 26, 2024

March/April 2024 Newsletter

Note: this post has been marked as obsolete.
We're excited to share an update on some of our recent projects and initiatives in this newsletter!
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Table of Contents

  • Web Graphs
  • AWS Performance Improvements
  • New Collaborators
  • New Staff Members
  • New Board Member
  • Discord Server
  • Updated Legal Information
  • Crawl & Graph Errata
  • Improved Cadence
  • Acknowledgements

Web Graphs

Our Web Graph releases, which allow visualisation of the crawl metrics, are now published after each crawl release rather than after every third crawl, giving fresher ranking information than before. For convenience, we now offer graphinfo.json alongside collinfo.json. Both can be queried programmatically for information about the latest release, for example:

$ curl -s https://index.commoncrawl.org/collinfo.json | jq '.[0].id'
"CC-MAIN-2024-10"
$ curl -s https://index.commoncrawl.org/graphinfo.json | jq '.[0].id'
"cc-main-2023-24-sep-nov-feb"

AWS Performance Improvements

We’re pleased to report that our S3 bucket on AWS has performed stably for the past four consecutive months. Information on our infrastructure’s performance can be seen on our new Status Page.

Figure: CloudFront performance this week

Figure: S3 performance this week

503 Slow Down responses have fallen dramatically, so egress of our data is now much smoother than it was around November 2023. As ever, we appreciate your politeness when querying our APIs!
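One common way to stay polite when a request does fail (for instance with a 503 Slow Down) is to wait before retrying and to double the wait after each failure. The following is only a minimal sketch of that idea, not an official Common Crawl recommendation: the helper name `retry_with_backoff` and its arguments are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# retry_with_backoff is a hypothetical helper, not part of any Common Crawl tooling.
# Usage: retry_with_backoff MAX_TRIES INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  max_tries=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command; stop as soon as it succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    if [ "$attempt" -ge "$max_tries" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # wait twice as long before the next try
    attempt=$((attempt + 1))
  done
}

# Example (requires network access; -f makes curl exit non-zero on HTTP errors):
#   retry_with_backoff 5 2 curl -sf https://index.commoncrawl.org/collinfo.json | jq '.[0].id'
```

With the example arguments, curl would be tried at most five times, pausing 2, 4, 8 and then 16 seconds between attempts. The pause values are assumptions; pick ones that suit your workload.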

New Collaborators

We are proud to have collaborated with a broad range of organisations on various projects and initiatives.  Examples can be found on our Collaborators page.  Want to get involved?  Get in touch!

New Staff Members

Long-term fans of the Common Crawl Foundation know that for quite a few years, Sebastian Nagel was our only engineer. Since May 2023, our team has grown to include many more members from diverse backgrounds. Check out our Team page for biographies and further information.

New Board Member

We are happy to welcome Eva Ho to our Board of Directors. Eva is a serial entrepreneur and founder, and has vast experience in the non-profit sector. Eva has been our advisor since January 2012!

Discord Server

We’re excited to announce our new Discord Server, which complements the discussions on our Mailing List on Google Groups and elsewhere. Jump in and join our discussions on Open Data and the wide world of web crawling!

Updated Legal Information

We’ve improved the legal documentation on our website and in the S3 bucket. For visibility and transparency, we’ve created a public repository on GitHub for tracking changes to our main legal documents, in particular our Terms of Use and Privacy Policy.

Crawl & Graph Errata

Our website now features details of errata which may affect various releases (crawls and graphs).

Each blog post announcing a release now features a list of errata that affect that release. We’re always on the lookout for any extra information that can aid others in using our data; if you have any related information to report, we’d be pleased to hear from you!

Improved Cadence

As of Q1 2024, we are increasing the frequency of our crawls to once per month. As mentioned above, we are now also releasing a Web Graph following each crawl. Our first crawl of this year was CC-MAIN-2024-10, and the resulting Web Graph release was cc-main-2023-24-sep-nov-feb, which spans the previous three crawl releases. We hope that you find this fresher ranking data helpful!

Acknowledgements

Thanks to our community members for their ongoing input and for helping us identify areas for improvement, and to our Collaborators, without whom we could not do what we do. We’d also like to thank our emeritus members for their support and expertise.

This release was authored by:
Thom Vaughan, Principal Technologist at the Common Crawl Foundation
Greg Lindahl, Chief Technology Officer at the Common Crawl Foundation