We’re happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) allows to efficiently query or process the index and saves time and computing resources. Especially, if only few columns are accessed, recent big data tools will run impressively fast. So far, we’ve tested two of them: Apache Spark and AWS Athena. The latter makes it possible to run SQL queries on the columnar data even without launching a server. Below you’ll find examples how to query the data with Athena. Examples and instructions for SparkSQL are in preparation. But you are free to use any other tool: the columnar index is free to access or download for anybody. You’ll find all files on:
Running SQL Queries with Athena
AWS Athena is a serverless service to analyze data on S3 using SQL. With Presto under the hood you even get a long list of extra functions including lambda expressions. Usage of Athena is not free but it has an attractive price model, you pay only for the scanned data (currently $5.0 per TiB). The index table of a single monthly crawl has about 300 GB. That defines the upper bound, but most queries require only part of the data to be scanned.
Let’s start and register the Common Crawl index as database table in Athena:
1. open the Athena query editor. Make sure you’re in the us-east-1 region where all the Common Crawl data is located. You need an AWS account to access Athena, please follow the AWS Athena user guide how to register and set up Athena.
2. to create a database (here called “ccindex”) enter the command
CREATE DATABASE ccindex
and press “Run query”
3. make sure that the database “ccindex” is selected and proceed with “New Query”
4. create the table by executing the following SQL statement:
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
PARTITIONED BY (
STORED AS parquet
It will create a table “ccindex” with a schema that fits the data on S3. The two “PARTITIONED BY” columns are actually subdirectories, one for every monthly crawl and the WARC subset. Partitions allow us to update the table every month and also help to limit the costs to query the data. Please note that the table schema may evolve over time, the most recent schema version is available on github.
5. to make Athena recognize the data partitions on S3, you have to execute the SQL statement:
MSCK REPAIR TABLE ccindex
Note that this command is also necessary to make newer crawls appear in the table. Every month we’ll add a new partition (a “directory”, e.g.,
crawl=CC-MAIN-2018-09/). The new partition is not visible and searchable unless it has been discovered by the repair table command. If you run the command you’ll see which partitions have been newly discovered, e.g.:
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=warc
Now you’re ready to run the first query. We’ll count the number of pages per domain within a single top-level domain. As before press “Run query” after you’ve entered the query into the query editor frame:
SELECT COUNT(*) AS count,
WHERE crawl = 'CC-MAIN-2018-05'
AND subset = 'warc'
AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
The result appears seconds later and only 2.12 MB of data have been scanned! Pretty fine, the query has cost less than one cent. We’ve filtered the data by a partition (a monthly crawl) and selected a small (.no) top-level domain. It’s a good practice to start developing more complex queries with such filters applied to keep the costs for trials low.
But let’s continue with a second example which demonstrates the power of Presto functions – we try to find domains which provide multi-lingual content. On possible way is to look for ISO-639-1 language codes in the URL, e.g.,
https://example.com/about/en/page.html. You can find the full SQL expression on github. For demonstration purposes we restrict the search to a single and small TLD (.va for Vatican State). The magic is done by
UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)
which first extracts all two-letter path elements (e.g.,
/en/) and unrolls the elements into a new column “url_path_lang” (if two or more path elements are found, you get multiple rows). Now we count pages and unique languages and let Presto/Athena also create a histogram of language codes:
You can find more SQL examples and resources on the cc-index-table project page on github. We’ll also working to provide examples to process the table using SparkSQL. First experiments are also promising: you get results within minutes even on a small Spark cluster. That’s not seconds as for Athena but you’re more flexible, esp. regarding the output format – Athena supports only CSV. Please also check the Athena release notes and the current list of limitations to find out which Presto version is used and which functions are supported.
We hope the new data format will help you to get value from the Common Crawl archives, in addition to the existing services.