January 21, 2025

Introducing cc-downloader

Note: this post has been marked as obsolete.
Introducing a command-line tool written in Rust for downloading data from Common Crawl.
Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data over HTTPS. cc-downloader is intended to be a user-friendly and polite downloader. We built it in response to the significant increase in downloads of our data in recent months. The growing interest in our dataset is exciting to see, but the extra traffic also makes it harder for some users to download our data successfully, because of the quirks of downloading from a high-traffic storage bucket.

cc-downloader is our solution to this problem: it lets our users keep downloading our data over HTTPS. It has a polite retry mechanism that retries failed requests, so that every requested file is eventually downloaded. It also uses exponential backoff with jitter to avoid overwhelming our infrastructure.
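To make the idea of exponential backoff with jitter concrete, here is a minimal Python sketch. It is not cc-downloader's actual implementation (the tool is written in Rust), and the function name `retry_with_backoff`, its parameters, and the default values are all assumptions chosen for illustration. The sketch uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling.

```python
import random
import time


def retry_with_backoff(fetch, max_retries=5, base=1.0, cap=60.0,
                       sleep=time.sleep, rng=random):
    """Call fetch() until it succeeds, waiting longer after each failure.

    Hypothetical helper for illustration only; names and defaults are
    assumptions, not cc-downloader's real settings.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            # Exponential backoff: the ceiling doubles on every attempt,
            # but never exceeds `cap` seconds.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients failing at once do not all retry at the same moment.
            sleep(rng.uniform(0, ceiling))
```

The `sleep` and `rng` parameters are injectable only so that the behaviour can be checked without actually waiting; a real downloader would typically retry only on errors that signal overload or a transient failure.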

To install cc-downloader, you can use the pre-compiled binaries that we have released for all major operating systems and architectures. cc-downloader is written in Rust and is also distributed as a “crate”, so if you have cargo installed, you can install it with the following command:

cargo install cc-downloader

Once you have installed cc-downloader, you’ll see that it has two sub-commands, download-paths and download.

First, download-paths downloads the file paths list for a given crawl and subset from our bucket, to a given destination folder path in your file system:

cc-downloader download-paths CC-MAIN-2024-46 wet path/to/folder

In this case, the paths file is saved as path/to/folder/wet.paths.gz.
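If you want to inspect a paths file yourself, the following Python sketch may help. It relies on the paths file being gzip-compressed text with one relative path per line, and on files being served over HTTPS from https://data.commoncrawl.org/. The helper names `read_paths` and `to_url` are hypothetical and are not part of cc-downloader.

```python
import gzip

# Public HTTPS endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def read_paths(paths_file):
    """Return the non-empty lines of a gzip-compressed paths file."""
    with gzip.open(paths_file, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def to_url(path):
    """Turn a relative path from the paths file into a download URL."""
    return BASE_URL + path.lstrip("/")
```

For example, `[to_url(p) for p in read_paths("path/to/folder/wet.paths.gz")]` would list the URL of every WET file in the crawl.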

Next, download reads this file paths list and concurrently downloads the files to a given destination folder in your file system:

cc-downloader download path/to/folder/wet.paths.gz path/to/folder

By default, this preserves the directory tree structure that we use internally.
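As a rough illustration of what preserving the tree structure means, the Python sketch below maps a relative path from the paths file to a local destination. It assumes that the remote directory layout is simply recreated under the destination folder; this is an interpretation, not a description of cc-downloader's code. The function name `local_destination` and the `keep_tree` flag are hypothetical.

```python
from pathlib import Path, PurePosixPath


def local_destination(dest_folder, remote_path, keep_tree=True):
    """Map a remote relative path to a local file path.

    Hypothetical helper: with keep_tree=True the remote directories are
    recreated under dest_folder; otherwise only the file name is kept.
    """
    # Remote paths always use forward slashes, whatever the local OS.
    rel = PurePosixPath(remote_path.lstrip("/"))
    if keep_tree:
        return Path(dest_folder).joinpath(*rel.parts)
    return Path(dest_folder) / rel.name
```

With `keep_tree=True`, a remote path such as `a/b/file.gz` downloaded into `path/to/folder` ends up at `path/to/folder/a/b/file.gz`; with `keep_tree=False` it would end up at `path/to/folder/file.gz`.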

cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit its GitHub repository at https://github.com/commoncrawl/cc-downloader/.

Contributions are always welcome! We hope that with this tool our users will find it easier to download and use our data.

Finally, if you’re encountering problems with cc-downloader that look like they are caused by high traffic, you can check our current traffic levels on our Infrastructure Status page.
