Search results

Common Crawl - Blog - Introducing the Host Index

Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. Queryable via AWS tools or downloadable. Greg Lindahl.

Common Crawl - Blog - Common Crawl URL Index

Common Crawl URL Index. Note: this post has been marked as obsolete. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool.

Common Crawl - Blog - Announcing the Common Crawl Index!

We are pleased to announce a new index and query api system for Common Crawl. The raw index data is available, per crawl, at: s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/. There is now an index for the Jan 2015 and Feb 2015 crawls.
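As a small illustration of the path template in this snippet, the sketch below fills in the CC-MAIN-YYYY-WW placeholder for one crawl. The helper name `index_prefix`, the validation pattern, and the sample crawl ID are assumptions made for this example; they do not come from the announcement.

```python
import re

# Path template quoted in the announcement, with the crawl ID as a placeholder.
TEMPLATE = "s3://commoncrawl/cc-index/collections/{crawl_id}/indexes/"

# Assumed shape of a crawl ID: "CC-MAIN-" followed by YYYY-WW (four and two digits).
CRAWL_ID_PATTERN = re.compile(r"CC-MAIN-\d{4}-\d{2}")


def index_prefix(crawl_id: str) -> str:
    """Return the raw index location for one crawl (hypothetical helper)."""
    if CRAWL_ID_PATTERN.fullmatch(crawl_id) is None:
        raise ValueError(f"not a CC-MAIN-YYYY-WW crawl ID: {crawl_id!r}")
    return TEMPLATE.format(crawl_id=crawl_id)


if __name__ == "__main__":
    # Sample value only; check which crawls actually exist before relying on it.
    print(index_prefix("CC-MAIN-2015-06"))
```

Rejecting malformed IDs early keeps a typo from silently turning into a path that lists nothing.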

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Analysis of the NCSU Library URLs in the Common Crawl Index. Note: this post has been marked as obsolete. Last week we announced the Common Crawl URL Index.

Common Crawl - Blog - Index to WARC Files and URLs in Columnar Format

Index to WARC Files and URLs in Columnar Format. We're happy to announce the release of an index to WARC files and URLs in a columnar format.

Common Crawl - Erratum - Missing fetch_status fields

In our columnar index for this crawl, the `content_mime_type` is missing and `fetch_status` is always -1. In the cdx index (columnar: `content_mime_type`), fields `mime` and `status` are missing.

Common Crawl - Erratum - Redirect target URL in URL indexes may be a relative URL

Redirect target URL in URL indexes may be a relative URL. Originally reported by Sebastian Nagel.
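One way a consumer of the index might cope with relative redirect targets is to resolve them against the URL of the record that issued the redirect. The sketch below uses the standard library's `urljoin`. The function name `absolute_redirect` and the example URLs are hypothetical, and this is not taken from the erratum itself.

```python
from urllib.parse import urljoin


def absolute_redirect(page_url: str, redirect_target: str) -> str:
    """Resolve a possibly relative redirect target against the page URL.

    Targets that are already absolute are returned unchanged by urljoin.
    """
    return urljoin(page_url, redirect_target)


if __name__ == "__main__":
    base = "https://example.com/a/b"
    for target in ("../c", "/login", "https://other.example/x"):
        print(target, "->", absolute_redirect(base, target))
```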

Common Crawl - Blog - September 2015 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2015-40/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - November 2015 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2015-48/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - November/December 2021 crawl archive now available

The column url_host_name_reversed was added to the columnar index. It holds the host name in reverse domain name notation (com.example.www), which is more efficient to query. To make use of the new column, please use the updated table schema.
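To make the reverse domain name notation concrete, here is a minimal sketch that turns www.example.com into com.example.www and shows why the reversed form is convenient: all hosts under one domain share a common prefix. The function names are hypothetical, and how the column is actually populated (case handling, trailing dots, and so on) is not stated in the snippet.

```python
def reverse_host_name(host: str) -> str:
    """Reverse the dot-separated labels: www.example.com -> com.example.www."""
    return ".".join(reversed(host.split(".")))


def is_under_domain(reversed_host: str, domain: str) -> bool:
    """Check whether a reversed host name belongs to `domain` via prefix match."""
    reversed_domain = reverse_host_name(domain)
    return (reversed_host == reversed_domain
            or reversed_host.startswith(reversed_domain + "."))


if __name__ == "__main__":
    rev = reverse_host_name("www.example.com")
    print(rev)
    print(is_under_domain(rev, "example.com"))
```

Appending the dot before the prefix test keeps com.examplesite from matching com.example.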

Common Crawl - Blog - URL Search Tool!

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

Missing content_truncated flag in URL indexes. The flag in our URL indexes (CDX and columnar) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-36/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - September 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-39/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-40/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - January 2018 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-05/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - September 2018 crawl archive now available

(HTTP status 304) in the URL indexes do not include a field for the payload "digest" anymore.

Common Crawl - Blog - September 2019 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-39/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - May 2022 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-21/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - May 2021 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-21/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - January/February 2023 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-06/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - February/March 2021 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-10/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - June 2018 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-26/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - February 2019 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-09/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-30/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - December 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-50/.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-26/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - January 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-04/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - January 2020 crawl archive now available

Improvements and Fixes. Date-time values in the column "fetch_time" of the columnar index are now stored using the "int64" data type. For details and compatibility issues, please see cc-index-table#7.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-22/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - May 2017 Crawl Archive Now Available

WARC files and the URL index now contain the detected MIME type (based on the actual content) in addition to the "Content-Type" sent in the HTTP response (details).

Common Crawl - Blog - March 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-13/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-07/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - July 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-30/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - June 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-26/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - December 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-51/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - April 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-18/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - April 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-17/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - August 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-34/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - February 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-09/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - October 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-43/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - February 2018 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-09/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2016-44/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - July 2020 crawl archive now available

The URL index fields "redirect" and "mime" haven't been filled if the corresponding HTTP headers Location and Content-Type are written in lower-case letters or any other variant not matching case.
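HTTP header names are case-insensitive, so a lookup that only matches "Location" or "Content-Type" exactly will miss variants such as "location". The sketch below shows one case-insensitive lookup over a plain dictionary of headers. It is an illustration of the general technique, not Common Crawl's code, and the function name `get_header` is hypothetical.

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Return the value of the first header whose name matches `name`, ignoring case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None  # header not present under any capitalisation


if __name__ == "__main__":
    response_headers = {"location": "/new-path", "CONTENT-TYPE": "text/html"}
    print(get_header(response_headers, "Location"))
    print(get_header(response_headers, "Content-Type"))
```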

Common Crawl - Blog - November 2017 Crawl Archive Now Available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-47/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API.

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

ISO-639-3 code are shown in the URL index as a new field, e.g. "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage: On GitHub you'll find the.
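A reader consuming this field might split the comma-separated value into individual codes. The sketch below assumes an index record has already been parsed into a dictionary; the helper name `parse_languages` and the sample record are illustrative only, and records without a "languages" field are treated as having no detected language.

```python
import json
from typing import Dict, List


def parse_languages(record: Dict[str, str]) -> List[str]:
    """Split the comma-separated "languages" field into ISO-639-3 codes."""
    value = record.get("languages", "")  # field may be absent from a record
    return [code for code in value.split(",") if code]


if __name__ == "__main__":
    # Sample record built from the example value quoted in the snippet.
    record = json.loads('{"languages": "zho,eng"}')
    print(parse_languages(record))
```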

Common Crawl - Blog - December 2019 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-51/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - October 2019 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-43/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - November/December 2020 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-50/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - October 2021 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-43/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - September 2020 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-40/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - March/April 2020 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-16/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - March/April 2023 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-14/. Also the columnar index has been updated to contain this crawl.

Common Crawl - Blog - May 2015 Crawl Archive Available

May 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya's guest blog post.

Common Crawl - Blog - August 2020 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-34/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - October 2020 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-45/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - January 2022 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-05/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - August 2022 crawl archive now available

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-33/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!

Common Crawl - Blog - June 2015 Crawl Archive Available

June 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya's guest blog post.

Common Crawl - Blog - March 2015 Crawl Archive Available

March 2015 Common Crawl Index, introduced last month by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya's guest blog post.