Processing Pipeline
Our Crawl Pipeline
Our internal data center hosts several sets of servers, including a Hadoop cluster and a bank of dedicated crawlers. We crawl the web over our connection to a Tier 1 ISP and transfer all downloaded content to our internal HDFS cluster. We then run various MapReduce jobs to process the content and archive it into compressed ARC files, each approximately 100 MB in size. These files are then uploaded to one of our S3 buckets in the Amazon cloud.
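As a rough illustration of the archiving step, the sketch below groups crawled records into batches of roughly 100 MB, each of which could then be written out as one archive file. It is not our actual MapReduce code: the function name `batch_by_size` and the constant `TARGET_ARCHIVE_BYTES` are hypothetical, and for simplicity it measures raw record bytes, whereas the 100 MB figure above refers to the compressed files.

```python
from typing import Iterable, Iterator, List

# Assumed target: roughly 100 MB per archive, counted on raw bytes here.
TARGET_ARCHIVE_BYTES = 100 * 1024 * 1024


def batch_by_size(
    records: Iterable[bytes], target: int = TARGET_ARCHIVE_BYTES
) -> Iterator[List[bytes]]:
    """Yield lists of records whose combined size stays near `target` bytes."""
    batch: List[bytes] = []
    size = 0
    for record in records:
        # Close the current batch if this record would push it past the target.
        if batch and size + len(record) > target:
            yield batch
            batch, size = [], 0
        # A single record larger than `target` still gets a batch of its own.
        batch.append(record)
        size += len(record)
    if batch:
        yield batch
```

Each yielded batch would correspond to one archive file; serializing it in ARC format, compressing it, and uploading it to S3 are separate steps not shown here.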