Processing Pipeline | CommonCrawl

Our Crawl Pipeline

Our internal data center hosts several groups of servers, including a Hadoop cluster and a bank of dedicated crawlers. We crawl the web over our connection to a Tier 1 ISP and transfer all downloaded content to our internal HDFS cluster. We then run various MapReduce jobs to process the content and archive it into compressed ARC files, each approximately 100MB in size. These files are then uploaded to one of our S3 buckets in the Amazon cloud.
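The archiving step can be illustrated with a minimal sketch of size-based rollover. This is not our production code and does not write the real ARC record layout. It assumes each record is gzip-compressed on its own and appended to the current archive until the next record would push it past the target size. The names `pack_archives` and `TARGET_ARCHIVE_BYTES` are hypothetical and used only for illustration.

```python
import gzip
from typing import Iterable, Iterator, List

# Assumed target: roughly 100MB per archive, as described above.
TARGET_ARCHIVE_BYTES = 100 * 1024 * 1024


def pack_archives(
    records: Iterable[bytes],
    target_bytes: int = TARGET_ARCHIVE_BYTES,
) -> Iterator[bytes]:
    """Yield archive blobs, each a concatenation of gzip members.

    Hypothetical helper: it only demonstrates rolling over to a new
    archive by compressed size, not the actual ARC file format.
    """
    current: List[bytes] = []
    current_size = 0
    for record in records:
        member = gzip.compress(record)
        # Close the current archive if this member would exceed the target.
        if current and current_size + len(member) > target_bytes:
            yield b"".join(current)
            current, current_size = [], 0
        current.append(member)
        current_size += len(member)
    # Flush whatever remains as the final (possibly smaller) archive.
    if current:
        yield b"".join(current)
```

Because the check happens before a record is appended, an archive can end up somewhat under the target, and a single record larger than the target still gets an archive of its own. In a real pipeline each yielded blob would be written to a file and uploaded, which this sketch leaves out.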