Search results
The Columnar Index Is Now the URL Index. We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. Common Crawl Foundation.…
The column url_host_name_reversed was added to the columnar index. It holds the host name in reverse domain name notation (com.example.www) which is more efficient to query. In order to make use of the new column please use the updated table schema.…
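The reverse domain name notation mentioned above can be illustrated with a minimal sketch. The helper names `reverseHostName` and `isUnderDomain` are hypothetical and not part of any Common Crawl tool; the prefix check only suggests one plausible reason the reversed form is convenient to query, namely that a domain and its subdomains share a common prefix.

```javascript
// Hypothetical helper: reverse the dot-separated labels of a host name,
// e.g. "www.example.com" becomes "com.example.www".
function reverseHostName(host) {
  return host.split('.').reverse().join('.');
}

// Hypothetical helper: with reversed names, "host is example.com or one of
// its subdomains" becomes a simple prefix comparison.
function isUnderDomain(reversedHost, reversedDomain) {
  return reversedHost === reversedDomain ||
    reversedHost.startsWith(reversedDomain + '.');
}

console.log(reverseHostName('www.example.com')); // com.example.www
console.log(isUnderDomain('com.example.www', 'com.example')); // true
console.log(isUnderDomain('com.examples.www', 'com.example')); // false
```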
Please note that we previously referred to this as the "Columnar Index". What is the URL Index? The URL Index is one of the indexes available for querying the Common Crawl corpus.…
The following improvements and fixes to the data formats have been made: the columnar index contains the content language of a web page as a new field. Please read the instructions below on how to upgrade your tools to read newly added fields.…
Improvements and Fixes. Date time values in the column "fetch_time" of the columnar index are now stored using the "int64" data type. For details and compatibility issues please see cc-index-table#7.…
Look up a URL in the CDX index. const cdx = await fetch('https://index.commoncrawl.org/CC-MAIN-2026-17-index' + '?url=example.com&output=json&limit=1').then(r => r.text()); const { filename, offset, length } = JSON.parse(cdx.split(…
The URL index fields "redirect" and "mime" haven't been filled if the corresponding HTTP headers Location and Content-Type are written in lower-case letters or any other variant not matching case.…
ISO-639-3 code are shown in the URL index as a new field, e.g. "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage: On GitHub you'll find the…
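As a small illustration of the "languages" field shown above, the following sketch splits the comma-separated value into a list of codes. The function name `parseLanguages` and the shape of the record object are assumptions made for the example; only the field name and the sample value "zho,eng" come from the snippet.

```javascript
// Hypothetical helper: turn the comma-separated "languages" value of an
// index record into an array of ISO-639-3 codes. Records without the field
// yield an empty array.
function parseLanguages(record) {
  const value = record.languages;
  return value ? value.split(',') : [];
}

console.log(parseLanguages({ languages: 'zho,eng' })); // [ 'zho', 'eng' ]
console.log(parseLanguages({})); // []
```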
In our URL Index (previously called the "Columnar Index") for this crawl, the `content_mime_type` is missing and `fetch_status` is always -1. In the cdx index (columnar: `content_mime_type`), fields `mime` and `status` are missing.…
When the HTTP “Location” header includes a relative URL, the corresponding “redirect” field in the CDX Index and “fetch_redirect” field in the URL Index (previously called the "Columnar Index") will also store a relative URL.…
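Since the redirect field may hold a relative URL, a consumer would typically resolve it against the URL of the captured page. The sketch below uses the standard WHATWG `URL` constructor for that; the function name `resolveRedirect` and the example URLs are hypothetical.

```javascript
// Hypothetical helper: resolve a possibly relative redirect target against
// the URL of the page that issued the redirect. Absolute targets are
// returned unchanged (apart from normalization by the URL parser).
function resolveRedirect(redirect, pageUrl) {
  return new URL(redirect, pageUrl).href;
}

console.log(resolveRedirect('/new/path', 'https://example.com/old/page'));
// https://example.com/new/path
console.log(resolveRedirect('https://other.example/x', 'https://example.com/old/page'));
// https://other.example/x
```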
Starting with crawl CC-MAIN-2018-39 we added a language classification field (‘content-languages’) to the URL Index (previously called the "Columnar Index"), WAT files, and WARC metadata for all subsequent crawls.…
The flag in our indexes (CDX Index and URL Index) that indicates whether or not a WARC record payload was truncated was added in CC-MAIN-2019-47. This indicator is missing in our indexes for all previous crawl releases.…
The "length" in the CDX index is the length of the gzip-compressed WARC record. The name in the columnar index, warc_record_length, reflects this better. It is also worth noting that PDFs end with %%EOF, perhaps followed by a linefeed. Affected Crawls.…
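Because the length counts compressed bytes, it can be combined with the record offset to address exactly one gzip member inside a WARC file. The sketch below builds an HTTP Range header value from the two numbers; the function name `warcByteRange` is hypothetical, and the use of a range request is an assumption about how a client would fetch the record, not something stated in the snippet.

```javascript
// Hypothetical helper: build an HTTP Range header value covering one
// gzip-compressed WARC record. Index values may arrive as strings, so they
// are converted to numbers first. The end of an HTTP byte range is
// inclusive, hence the "- 1".
function warcByteRange(offset, length) {
  const start = Number(offset);
  const end = start + Number(length) - 1;
  return `bytes=${start}-${end}`;
}

console.log(warcByteRange('1000', '500')); // bytes=1000-1499
```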
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-43/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-39/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
URL Index Subsets with Fewer than 900 Partitions per Crawl. Originally reported by Sebastian Nagel.…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-51/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-16/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-43/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-50/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-40/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-34/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-05/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-14/. Also the columnar index has been updated to contain this crawl.…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-45/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-17/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-25/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-04/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-21/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-27/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-33/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-21/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-06/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-39/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-49/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-13/. Also the. columnar index. has been updated to contain this crawl.…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-51/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-13/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-22/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-22/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-04/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-47/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-10/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-10/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-30/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-40/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-30/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-26/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-24/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-43/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-18/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-31/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-35/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-26/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-17/. Also the. columnar index. has been updated to contain this crawl. Please. donate. to Common Crawl if you appreciate our free datasets!…
We've added two new fields to the URL indexes (CDX and columnar): the redirect target location is stored in the CDX JSON field "redirect" resp. the column "fetch_redirect".…
Using Common Crawl's Columnar Index. One of the goals of this project was to demonstrate a practical research workflow using Common Crawl's data infrastructure, so it's worth describing the pipeline in some detail.…
The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-09/. Also the columnar index has been updated to contain this crawl. Please donate to Common Crawl if you appreciate our free datasets!…
Monthly release index files in the CDXJ Index and URL Index, the latter (previously called the "Columnar Index") queryable with Amazon Athena. Web Graphs. Common Crawl publishes host-level and domain-level…
Index to WARC Files and URLs in Columnar Format. We're happy to announce the release of an index to WARC files and URLs in a columnar format.…