Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!
Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.
From the conclusion section of the paper: