Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via HTTPS. cc-downloader is intended to be a user-friendly and polite downloader. We built it in response to the significant increase in downloads of our data in recent months. That increase was very exciting to see at first, especially as a sign of growing interest in our dataset. But it has also made it harder for some users to download our data successfully, due to quirks of downloading from a high-traffic storage bucket.
cc-downloader is our solution to this problem, enabling our users to continue downloading our data via HTTPS without issues. We have designed cc-downloader with a polite retry mechanism that helps ensure every requested file is downloaded. It also implements jitter and exponential backoff strategies to avoid overwhelming our infrastructure.
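To illustrate what exponential backoff with jitter means in practice, here is a minimal Python sketch. It is not the code used in cc-downloader (which is written in Rust); the function name `backoff_delays` and the parameters `base`, `cap`, and `rng` are hypothetical, and the "full jitter" variant shown here is only one common way to randomize the delays.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper for illustration; not part of cc-downloader.
    """
    for attempt in range(max_retries):
        # Exponential growth: base, 2*base, 4*base, ... limited to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": scale the ceiling by a random factor in [0, 1)
        # so that many clients do not retry at the same instant.
        yield rng() * ceiling


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

The growing ceiling spaces out repeated attempts against a busy server, while the random factor spreads simultaneous clients apart so they do not all retry together.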
To install cc-downloader, you can use the pre-compiled binaries that we have released for all major operating systems and architectures. cc-downloader is written in Rust and is distributed as a “crate”, so if you have cargo installed, you can also install cc-downloader with the following command:
cargo install cc-downloader
Once you have installed cc-downloader, you’ll see that it has two sub-commands:
First, download-paths downloads the list of file paths for a given crawl and subset from our bucket to a given destination folder in your file system:
cc-downloader download-paths CC-MAIN-2024-46 wet path/to/folder
In this case, the paths file will be path/to/folder/wet.paths.gz.
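If you want to inspect the paths file before starting a large download, a short Python sketch like the one below may help. It assumes that the file is gzip-compressed text with one file path per line; the helper name `read_paths` is hypothetical and is not part of cc-downloader.

```python
import gzip


def read_paths(paths_file):
    """Return the non-empty lines of a gzip-compressed paths file.

    Assumes one file path per line (an assumption, not stated in this post).
    """
    with gzip.open(paths_file, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    paths = read_paths("path/to/folder/wet.paths.gz")
    print(f"{len(paths)} files listed")
    for path in paths[:3]:
        print(path)
```

Counting the entries first gives a rough idea of how many files the download step will fetch.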
Next, download reads this list of file paths and concurrently downloads the files to a given destination folder in your file system:
cc-downloader download path/to/folder/wet.paths.gz path/to/folder
By default, this preserves the tree structure that we use internally.
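As a rough illustration of what preserving the tree structure means, the Python sketch below maps a remote path onto a local destination folder by keeping every directory component. The function name `local_destination` and the example path are hypothetical, and the exact layout that cc-downloader writes may differ.

```python
from pathlib import Path, PurePosixPath


def local_destination(remote_path, dest_folder):
    """Join a remote, slash-separated path onto a local destination folder.

    Hypothetical helper for illustration; not part of cc-downloader.
    """
    # Drop any leading slash so the result always stays inside dest_folder.
    parts = PurePosixPath(remote_path.lstrip("/")).parts
    return Path(dest_folder).joinpath(*parts)


if __name__ == "__main__":
    # Made-up path, used only to show the mapping.
    print(local_destination("crawl-data/EXAMPLE/wet/file.warc.wet.gz", "path/to/folder"))
```

Keeping the directory components means that files from different parts of the crawl cannot overwrite each other locally, even if they happen to share a file name.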
cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit its GitHub repository at https://github.com/commoncrawl/cc-downloader/.
Contributions are always welcome! We hope that with this tool our users will find it easier to download and use our data.
Finally, if you’re encountering any problems with cc-downloader that look like they are caused by high traffic, you can check our current traffic levels on our Infrastructure Status Webpage.