A Look Inside Our 210TB 2012 Web Corpus
Want to know in more detail what data is in the 2012 Common Crawl corpus without running a job? Now you can, thanks to Sebastian Spiegler!
Sebastian is a highly talented data scientist who works at the London-based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.
From the conclusion section of the paper:
View or download a PDF of Sebastian’s paper here. If you want to dive deeper, you can find the non-aggregated data at s3://commoncrawl/index2012 and the code on GitHub.