Examples Using Our Data
Need More Help?
Take a look at our Getting Started page or connect with others on our Developer List.
A Node.js client for the commoncrawl.org index
Subhash Choudhary
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
Ilya Kreymer
A distributed system for mining Common Crawl using SQS, AWS-EC2 and S3
Akshay Bhat
A free version of Helium Scraper that scrapes data from the Common Crawl database.
Juan Soldi
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Greg Lindahl
Alexandria Search
alexandria.org
All Around The World: The Common Crawl Dataset – Attack Surface Research
Aliz Hammond
Analysing Petabytes of Websites
Mark Litwintschik
Analyze Common Crawl index – http://index.commoncrawl.org/
Tom Morris
Analyzing 4 Billions of Tags with R and Spark
Javier Luraschi
Analyzing Performance and Cost of Large-Scale Data Processing with AWS Lambda
Chris Madden, Aaron Bawcom (Candid Partners)
Analyzing crime reported in the U.S. using data derived from Common Crawl, New York Times API and Twitter data
Sai Saket Regulapati
Analyzing the Common Crawl using Map-Reduce
Stefan Koch
Analyzing “Wait-Delay” Settings in Common Crawl robots.txt Data with R
hrbrmstr
Bill Tracker – Online Sentiment Towards Congressional Bills
Albert Wavering
C4 Dataset Script
Jianbin Chang
CCrawlDNS – CommonCrawl data set subdomain extracter
Laurent Gaffié
Categorizing World Wide Web
Jay Pavagadhi
CitizensFoundation/ac-keyword-scanner
Róbert Viðar Bjarnason
Clustering communities on web crawl data
Oluwaseyi Talabi, M. Rafay Aleem, Prashanth Rao, Nandita Dwivedi
Cmon Crawl: Common Crawl Extractor
Hynek Kydlíček
Common Crawl Document Download
Dominik Stadler
Common Crawl Index Athena
Edward Ross
Common Crawl News 20200110212037-00310 – A single Web ARChive (WARC) file from Common Crawl News
Gabriel Altay
Common Crawl On Laptop – Extracting Subset Of Data
Chillar Anand
Common Crawl Scala Example
Soner Altin
Common Crawl URL Index
Jason Ronallo
Common Crawl WARC/WET/WAT examples and processing code for Java + Hadoop
Stephen Merity
Common web archive utility code
the IIPC
CommonCrawl Host-IP Mapper
Mingwei Zhang