June 2019 crawl archive now available

The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th, with an operational break from the 21st to the 24th.

The June crawl contains page captures of 880 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
  • a random sample of 2.0 billion outlinks taken from WAT files of the May crawl

Starting with this crawl, the WAT extraction has been improved by properly decoding HTML character entities in URLs and strings. For details, please see the issue report “WAT: unescape XML/HTML character entities”.
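
As a rough illustration of the kind of decoding involved (this is not the WAT extractor's own code, and the sample URL is made up), Python's standard `html.unescape` turns character entities in an extracted link back into the literal characters:

```python
from html import unescape

# Hypothetical link as it might appear in an HTML attribute value,
# with "&" escaped as the character entity "&amp;".
raw_href = "https://example.com/search?q=web&amp;page=2"

# Decode named and numeric character entities to get the actual URL.
decoded_href = unescape(raw_href)

print(decoded_href)
```

Without this step, a consumer of the extracted links would see `&amp;` inside the query string and might treat it as part of a parameter name.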

Archive Location and Download

The June crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-26/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2019-26/segment.paths.gz | 100 |
WARC files | CC-MAIN-2019-26/warc.paths.gz | 56000 | 49.42
WAT files | CC-MAIN-2019-26/wat.paths.gz | 56000 | 17.24
WET files | CC-MAIN-2019-26/wet.paths.gz | 56000 | 7.59
Robots.txt files | CC-MAIN-2019-26/robotstxt.paths.gz | 56000 | 0.14
Non-200 responses files | CC-MAIN-2019-26/non200responses.paths.gz | 56000 | 1.52
URL index files | CC-MAIN-2019-26/cc-index.paths.gz | 302 | 0.19
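
The prefixing step can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: `full_paths` is a name chosen for this example, and the two listing entries are invented placeholders rather than real file names. It assumes the listing has already been downloaded as gzip-compressed bytes with one relative path per line.

```python
import gzip
import io
from typing import List

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(paths_gz: bytes, prefix: str) -> List[str]:
    """Decompress a *.paths.gz listing and prepend `prefix` to every line."""
    with gzip.open(io.BytesIO(paths_gz), mode="rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]


# Stand-in for a downloaded listing; both entries are invented placeholders.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2019-26/segments/SEGMENT/warc/FILE-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2019-26/segments/SEGMENT/warc/FILE-00001.warc.gz\n"
)

for url in full_paths(sample_listing, HTTP_PREFIX):
    print(url)
```

Passing `S3_PREFIX` instead yields the corresponding S3 paths.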

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-26/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

May 2019 crawl archive now available

The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th.

The May crawl contains page captures of 825 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
  • a random sample of 1.6 billion outlinks taken from WAT files of the April crawl

Archive Location and Download

The May crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-22/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2019-22/segment.paths.gz | 100 |
WARC files | CC-MAIN-2019-22/warc.paths.gz | 56000 | 51.48
WAT files | CC-MAIN-2019-22/wat.paths.gz | 56000 | 17.75
WET files | CC-MAIN-2019-22/wet.paths.gz | 56000 | 7.7
Robots.txt files | CC-MAIN-2019-22/robotstxt.paths.gz | 56000 | 0.17
Non-200 responses files | CC-MAIN-2019-22/non200responses.paths.gz | 56000 | 1.84
URL index files | CC-MAIN-2019-22/cc-index.paths.gz | 302 | 0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-22/. The columnar index has also been updated to include this crawl.


Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The software which builds the graph from WAT and WARC files has been extended to extract more links from the HTML <head> element:

  • more links are taken from <meta> elements, e.g., the thumbnail meta name, Open Graph or twitter:* properties
  • links from <script> elements are now included

Note that previous web graph releases already include all kinds of links: not only <a href="..."> but also links to images and multi-media content, links from <form> elements, canonical links, and many more.

While the domain-level graph shows almost the same size and metrics as the previous one released three months ago, the host-level graph has increased in size by 85 million nodes but is less densely connected. The growth in the number of nodes is mainly caused by a link spam cluster of 190 million hosts distributed over 15k domains. Thanks to the webgraph, these domains (e.g., 24340.tw) are detected and the crawler is advised not to visit them again.

Host-level graph

The graph consists of 492 million nodes and 3.0 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 426 million dangling nodes (87%), and the largest strongly connected component contains 52 million nodes (10.5%).
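
To make the term concrete, here is a toy sketch (the host names and the `crawled` set are invented, and the real graph is far too large for such a naive in-memory approach) that treats every linked host outside the set of crawled hosts as a dangling node:

```python
# Toy host-level edge list: (source host, target host), reversed host names.
edges = [
    ("com.example", "org.example"),
    ("com.example", "net.example"),
    ("org.example", "com.example"),
]

# Hosts with at least one crawled page (hypothetical).
crawled = {"com.example", "org.example"}

# Every host that occurs as the source or target of a link is a node.
nodes = {host for edge in edges for host in edge}

# Dangling nodes: linked to from a crawled page, but not crawled themselves.
dangling = nodes - crawled

share = len(dangling) / len(nodes)
print(sorted(dangling), f"{share:.0%}")
```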

You can download the graph and the ranks of all 492 million hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/Apr 2019 host-level webgraph

Size | File | Description
3.36 GB | cc-main-2019-feb-mar-apr-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files
14.40 GB | cc-main-2019-feb-mar-apr-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 56 edges files
6.33 GB | cc-main-2019-feb-mar-apr-host.graph | graph in BVGraph format
2 kB | cc-main-2019-feb-mar-apr-host.properties |
7.02 GB | cc-main-2019-feb-mar-apr-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2019-feb-mar-apr-host-t.properties |
1 kB | cc-main-2019-feb-mar-apr-host.stats | WebGraph statistics
7.85 GB | cc-main-2019-feb-mar-apr-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
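
The reversal rule can be mirrored in a few lines of Python. This is only a sketch of the rule as stated above; the function names are chosen for this example, and the actual graph tooling may treat edge cases (IP addresses, a bare "www" host) differently.

```python
def reverse_host(host: str) -> str:
    """Strip a leading "www." and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Restore the usual label order (a stripped "www." cannot be recovered)."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

One practical consequence of the reversed notation is that sorting the names groups all hosts of a domain next to each other.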

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
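
The aggregation step can be sketched roughly as follows. This is a simplified illustration, not the cc-webgraph implementation: the three-entry suffix set stands in for the full public suffix list, wildcard and exception rules are ignored, dropping links within the same domain is an assumption of this sketch, and the function names are made up for the example.

```python
from typing import Iterable, Set, Tuple

# Tiny stand-in for the public suffix list, in reversed notation.
# The real list on publicsuffix.org is far longer.
SUFFIXES = {"com", "org", "uk.co"}


def pay_level_domain(rev_host: str) -> str:
    """Return the longest matching public suffix plus one more label."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in SUFFIXES:
            return ".".join(labels[: n + 1])
    # No known suffix, or the name is itself a suffix: keep it unchanged.
    return rev_host


def aggregate(host_edges: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Map both ends of every host edge to its domain and deduplicate."""
    domain_edges = set()
    for src, dst in host_edges:
        src_dom, dst_dom = pay_level_domain(src), pay_level_domain(dst)
        # Assumption of this sketch: links within one domain are dropped.
        if src_dom != dst_dom:
            domain_edges.add((src_dom, dst_dom))
    return domain_edges


host_edges = [
    ("com.example.blog", "com.example.shop"),
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example", "uk.co.bbc"),
]
print(sorted(aggregate(host_edges)))
```

Several host-level edges collapse into a single domain-level edge here, which is why the domain graph has far fewer nodes and edges than the host graph.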

The domain-level graph has 91 million nodes and 1.89 billion edges. 46 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 38 million nodes (42%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/domain/ or, via HTTPS, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/domain/.

Download files of the Common Crawl Feb/Mar/Apr 2019 domain-level webgraph

Size | File | Description
0.63 GB | cc-main-2019-feb-mar-apr-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
7.48 GB | cc-main-2019-feb-mar-apr-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.01 GB | cc-main-2019-feb-mar-apr-domain.graph | graph in BVGraph format
2 kB | cc-main-2019-feb-mar-apr-domain.properties |
4.02 GB | cc-main-2019-feb-mar-apr-domain-t.graph | transpose of the graph
2 kB | cc-main-2019-feb-mar-apr-domain-t.properties |
1 kB | cc-main-2019-feb-mar-apr-domain.stats | WebGraph statistics
1.98 GB | cc-main-2019-feb-mar-apr-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 90 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Feb/Mar/Apr 2019)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 29096444 | 1 | 0.020470 | com.googleapis
2 | 28007982 | 3 | 0.012308 | com.facebook
3 | 26801494 | 2 | 0.013202 | com.google
4 | 24862042 | 4 | 0.006929 | com.twitter
5 | 24696428 | 5 | 0.006794 | com.youtube
6 | 24174048 | 6 | 0.006211 | org.w
7 | 22325176 | 9 | 0.003651 | com.instagram
8 | 22277546 | 7 | 0.004565 | org.gmpg
9 | 21663616 | 13 | 0.002903 | com.linkedin
10 | 21253740 | 8 | 0.003880 | com.googletagmanager
11 | 20994008 | 22 | 0.001629 | com.gravatar
12 | 20770844 | 11 | 0.003144 | com.cloudflare
13 | 20763426 | 12 | 0.002915 | org.wordpress
14 | 20723350 | 15 | 0.002103 | com.wordpress
15 | 20597380 | 19 | 0.001856 | com.pinterest
16 | 20589106 | 26 | 0.001344 | org.wikipedia
17 | 20538196 | 14 | 0.002455 | com.bootstrapcdn
18 | 20340120 | 18 | 0.001857 | com.apple
19 | 20168332 | 28 | 0.001208 | com.vimeo
20 | 20100244 | 40 | 0.000945 | com.blogspot
21 | 20076842 | 21 | 0.001721 | com.jquery
22 | 19900496 | 44 | 0.000861 | gl.goo
23 | 19874514 | 49 | 0.000756 | be.youtu
24 | 19845858 | 24 | 0.001528 | com.adobe
25 | 19808478 | 27 | 0.001240 | com.microsoft
26 | 19798758 | 50 | 0.000749 | com.amazon
27 | 19710384 | 60 | 0.000607 | com.tumblr
28 | 19689216 | 53 | 0.000667 | com.wp
29 | 19584236 | 34 | 0.001016 | com.amazonaws
30 | 19567898 | 25 | 0.001456 | com.macromedia
31 | 19563206 | 88 | 0.000433 | com.yahoo
32 | 19557094 | 51 | 0.000734 | com.flickr
33 | 19526652 | 42 | 0.000880 | com.google-analytics
34 | 19499686 | 80 | 0.000502 | ly.bit
35 | 19489648 | 32 | 0.001034 | com.googlesyndication
36 | 19472580 | 62 | 0.000580 | org.mozilla
37 | 19466578 | 23 | 0.001557 | com.gstatic
38 | 19459466 | 31 | 0.001093 | net.cloudfront
39 | 19428390 | 20 | 0.001823 | com.github
40 | 19302390 | 66 | 0.000558 | me.wp
41 | 19278286 | 39 | 0.000949 | net.doubleclick
42 | 19253848 | 46 | 0.000802 | com.paypal
43 | 19222312 | 99 | 0.000316 | com.googleusercontent
44 | 19214440 | 82 | 0.000487 | com.medium
45 | 19194374 | 41 | 0.000882 | com.squarespace
46 | 19181944 | 85 | 0.000440 | com.weebly
47 | 19164390 | 79 | 0.000520 | org.w3
48 | 19162880 | 127 | 0.000234 | com.nytimes
49 | 19140860 | 86 | 0.000440 | io.github
50 | 19138696 | 102 | 0.000307 | com.reddit
51 | 19125448 | 92 | 0.000375 | org.creativecommons
52 | 19052088 | 154 | 0.000166 | net.slideshare
53 | 19050126 | 162 | 0.000162 | com.theguardian
54 | 19047964 | 139 | 0.000189 | com.imgur
55 | 19010700 | 57 | 0.000626 | com.bing
56 | 19007804 | 136 | 0.000202 | com.forbes
57 | 18975024 | 166 | 0.000158 | net.sourceforge
58 | 18969344 | 217 | 0.000110 | com.businessinsider
59 | 18964518 | 64 | 0.000566 | org.schema
60 | 18930562 | 202 | 0.000115 | com.myspace
61 | 18929412 | 161 | 0.000162 | com.blogger
62 | 18929098 | 206 | 0.000113 | com.techcrunch
63 | 18929096 | 188 | 0.000132 | com.android
64 | 18907716 | 101 | 0.000313 | com.mailchimp
65 | 18887038 | 251 | 0.000097 | com.tinyurl
66 | 18886912 | 54 | 0.000649 | com.baidu
67 | 18881598 | 249 | 0.000098 | com.wired
68 | 18879314 | 91 | 0.000411 | de.google
69 | 18872146 | 354 | 0.000068 | com.photobucket
70 | 18870082 | 182 | 0.000140 | com.stackoverflow
71 | 18844718 | 100 | 0.000316 | org.ampproject
72 | 18842202 | 38 | 0.000953 | org.apache
73 | 18829206 | 266 | 0.000090 | com.bbc
74 | 18824202 | 103 | 0.000307 | com.shopify
75 | 18821492 | 361 | 0.000068 | com.quora
76 | 18818304 | 315 | 0.000076 | com.appspot
77 | 18801506 | 37 | 0.000974 | com.fontawesome
78 | 18797872 | 113 | 0.000275 | com.ytimg
79 | 18796778 | 36 | 0.000976 | com.addthis
80 | 18776040 | 209 | 0.000112 | com.oracle
81 | 18766484 | 558 | 0.000045 | org.chromium
82 | 18761466 | 353 | 0.000069 | com.googleblog
83 | 18753718 | 380 | 0.000064 | com.theverge
84 | 18728722 | 526 | 0.000047 | org.ieee
85 | 18726166 | 510 | 0.000048 | edu.washington
86 | 18724898 | 462 | 0.000053 | com.economist
87 | 18724374 | 96 | 0.000330 | com.statcounter
88 | 18720414 | 98 | 0.000317 | com.soundcloud
89 | 18717340 | 151 | 0.000171 | org.ietf
90 | 18714060 | 553 | 0.000046 | edu.yale
91 | 18706362 | 319 | 0.000075 | com.githubusercontent
92 | 18703214 | 300 | 0.000078 | com.ted
93 | 18695606 | 61 | 0.000589 | eu.europa
94 | 18694532 | 441 | 0.000056 | com.venturebeat
95 | 18691888 | 235 | 0.000103 | com.hubspot
96 | 18688542 | 655 | 0.000043 | com.tinypic
97 | 18680252 | 144 | 0.000180 | com.spotify
98 | 18673738 | 141 | 0.000185 | com.yelp
99 | 18671830 | 133 | 0.000213 | com.issuu
100 | 18662516 | 395 | 0.000063 | com.cisco
101 | 18657986 | 93 | 0.000354 | co.t
102 | 18652360 | 95 | 0.000340 | com.sharethis
103 | 18648440 | 438 | 0.000056 | com.deviantart
104 | 18644944 | 702 | 0.000040 | edu.princeton
105 | 18644036 | 265 | 0.000090 | com.sciencedirect
106 | 18635732 | 360 | 0.000068 | me.about
107 | 18631710 | 460 | 0.000053 | org.arxiv
108 | 18627650 | 279 | 0.000086 | org.npr
109 | 18615844 | 179 | 0.000141 | org.wikimedia
110 | 18608530 | 751 | 0.000038 | google.blog
111 | 18606402 | 341 | 0.000071 | com.theatlantic
112 | 18596892 | 345 | 0.000071 | com.mozilla
113 | 18592502 | 572 | 0.000044 | edu.ucla
114 | 18587904 | 454 | 0.000054 | com.mysql
115 | 18584416 | 134 | 0.000211 | com.dropbox
116 | 18581890 | 963 | 0.000033 | com.jetbrains
117 | 18579386 | 121 | 0.000250 | com.whatsapp
118 | 18576676 | 294 | 0.000081 | com.example
119 | 18575790 | 81 | 0.000499 | net.jsdelivr
120 | 18574812 | 271 | 0.000089 | com.fastcompany
121 | 18567024 | 331 | 0.000072 | com.typeform
122 | 18560472 | 400 | 0.000062 | com.zdnet
123 | 18556406 | 468 | 0.000052 | com.wikihow
124 | 18554544 | 30 | 0.001112 | ru.yandex
125 | 18552784 | 540 | 0.000046 | com.thenextweb
126 | 18551678 | 502 | 0.000049 | com.git-scm
127 | 18550824 | 1063 | 0.000030 | com.chrome
128 | 18546650 | 169 | 0.000156 | com.salesforce
129 | 18542982 | 375 | 0.000065 | uk.co.blogspot
130 | 18537770 | 443 | 0.000055 | com.about
131 | 18535658 | 117 | 0.000263 | org.networkadvertising
132 | 18535608 | 488 | 0.000050 | com.pixabay
133 | 18526084 | 212 | 0.000112 | com.dribbble
134 | 18525790 | 201 | 0.000116 | com.stumbleupon
135 | 18524658 | 1434 | 0.000021 | com.diigo
136 | 18509334 | 499 | 0.000049 | com.ubuntu
137 | 18501930 | 741 | 0.000038 | org.eclipse
138 | 18497596 | 505 | 0.000049 | com.slate
139 | 18497252 | 208 | 0.000112 | com.googlecode
140 | 18490168 | 58 | 0.000611 | com.wix
141 | 18486604 | 425 | 0.000058 | com.moz
142 | 18481186 | 191 | 0.000127 | com.cnn
143 | 18475472 | 122 | 0.000242 | com.stripe
144 | 18475424 | 181 | 0.000141 | uk.co.bbc
145 | 18464340 | 652 | 0.000043 | com.stackexchange
146 | 18462864 | 369 | 0.000066 | com.entrepreneur
147 | 18458960 | 275 | 0.000087 | com.nbcnews
148 | 18453376 | 253 | 0.000095 | gov.ca
149 | 18444504 | 595 | 0.000044 | com.withgoogle
150 | 18440430 | 518 | 0.000048 | com.qz
151 | 18439414 | 542 | 0.000046 | com.trello
152 | 18429892 | 214 | 0.000111 | edu.stanford
153 | 18426238 | 1071 | 0.000030 | edu.illinois
154 | 18424824 | 1040 | 0.000030 | edu.gatech
155 | 18424290 | 293 | 0.000081 | com.foursquare
156 | 18421718 | 1509 | 0.000020 | org.wikibooks
157 | 18418178 | 573 | 0.000044 | com.searchengineland
158 | 18417420 | 516 | 0.000048 | com.unity3d
159 | 18414768 | 670 | 0.000042 | org.sciencemag
160 | 18412040 | 267 | 0.000090 | com.npmjs
161 | 18401668 | 463 | 0.000053 | gov.loc
162 | 18397856 | 926 | 0.000034 | com.sap
163 | 18397328 | 16 | 0.002068 | com.wixstatic
164 | 18396224 | 1097 | 0.000029 | edu.rutgers
165 | 18394270 | 156 | 0.000165 | org.bbb
166 | 18392954 | 213 | 0.000111 | es.google
167 | 18392682 | 661 | 0.000042 | com.variety
168 | 18391296 | 155 | 0.000166 | com.twimg
169 | 18381936 | 538 | 0.000046 | com.libsyn
170 | 18380340 | 546 | 0.000046 | com.evernote
171 | 18380064 | 174 | 0.000152 | com.imdb
172 | 18378896 | 211 | 0.000112 | com.wsj
173 | 18377090 | 33 | 0.001018 | net.fbcdn
174 | 18373124 | 142 | 0.000185 | gov.privacyshield
175 | 18369136 | 705 | 0.000040 | com.techtarget
176 | 18368040 | 45 | 0.000851 | com.fb
177 | 18365978 | 1190 | 0.000026 | edu.utah
178 | 18365960 | 146 | 0.000180 | org.archive
179 | 18365486 | 377 | 0.000065 | com.getpocket
180 | 18358592 | 439 | 0.000056 | gov.fda
181 | 18357482 | 194 | 0.000125 | com.optimizely
182 | 18350542 | 419 | 0.000060 | au.com.google
183 | 18346246 | 904 | 0.000034 | com.econsultancy
184 | 18346130 | 210 | 0.000112 | net.windows
185 | 18345860 | 1315 | 0.000024 | com.douban
186 | 18345148 | 449 | 0.000055 | org.freecodecamp
187 | 18338274 | 1321 | 0.000023 | com.discogs
188 | 18338174 | 613 | 0.000043 | uk.ac.ox
189 | 18335780 | 1019 | 0.000031 | com.nike
190 | 18332678 | 1207 | 0.000026 | org.tensorflow
191 | 18325660 | 75 | 0.000531 | com.vk
192 | 18324804 | 287 | 0.000082 | edu.mit
193 | 18323908 | 953 | 0.000033 | com.buffer
194 | 18323654 | 1203 | 0.000026 | com.aljazeera
195 | 18321184 | 1133 | 0.000028 | ca.utoronto
196 | 18317982 | 813 | 0.000036 | com.netlify
197 | 18315894 | 1079 | 0.000030 | com.nvidia
198 | 18314254 | 531 | 0.000047 | net.azurewebsites
199 | 18311612 | 350 | 0.000069 | com.msn
200 | 18310748 | 984 | 0.000032 | org.kernel
201 | 18307064 | 1123 | 0.000028 | it.scoop
202 | 18305946 | 94 | 0.000346 | com.paypalobjects
203 | 18299178 | 723 | 0.000039 | com.indeed
204 | 18298870 | 807 | 0.000036 | com.mixcloud
205 | 18296584 | 236 | 0.000103 | com.live
206 | 18291116 | 1121 | 0.000028 | org.postgresql
207 | 18289064 | 810 | 0.000036 | com.neilpatel
208 | 18280540 | 365 | 0.000067 | com.discordapp
209 | 18269300 | 1217 | 0.000026 | ms.1drv
210 | 18269146 | 942 | 0.000034 | com.business2community
211 | 18267622 | 303 | 0.000078 | com.reuters
212 | 18266166 | 389 | 0.000064 | gov.nasa
213 | 18263344 | 1452 | 0.000021 | com.makeuseof
214 | 18261880 | 153 | 0.000168 | gov.nih
215 | 18261660 | 444 | 0.000055 | com.udacity
216 | 18257262 | 1289 | 0.000024 | com.hostgator
217 | 18253994 | 996 | 0.000032 | com.chron
218 | 18252416 | 264 | 0.000091 | com.ibm
219 | 18243366 | 1024 | 0.000031 | com.socialmediaexaminer
220 | 18240594 | 1126 | 0.000028 | com.trendmicro
221 | 18238016 | 203 | 0.000114 | com.washingtonpost
222 | 18236948 | 1117 | 0.000028 | com.computerworld
223 | 18235154 | 568 | 0.000045 | com.images-amazon
224 | 18232544 | 180 | 0.000141 | com.etsy
225 | 18231292 | 1340 | 0.000023 | io.itch
226 | 18226226 | 773 | 0.000037 | co.g
227 | 18218104 | 1206 | 0.000026 | edu.osu
228 | 18215060 | 968 | 0.000033 | com.yoast
229 | 18211670 | 1277 | 0.000024 | com.hbo
230 | 18210326 | 190 | 0.000130 | com.ebay
231 | 18208488 | 304 | 0.000077 | com.cnet
232 | 18207542 | 291 | 0.000082 | edu.harvard
233 | 18207312 | 1766 | 0.000017 | com.pearltrees
234 | 18206394 | 956 | 0.000033 | com.mediafire
235 | 18206036 | 715 | 0.000039 | site.business
236 | 18201166 | 56 | 0.000629 | net.akamaihd
237 | 18200054 | 1055 | 0.000030 | com.healthline
238 | 18198490 | 483 | 0.000051 | com.usnews
239 | 18196662 | 196 | 0.000120 | com.huffingtonpost
240 | 18192554 | 1144 | 0.000027 | com.bustle
241 | 18190242 | 1001 | 0.000032 | com.me
242 | 18188000 | 557 | 0.000045 | org.d3js
243 | 18186074 | 165 | 0.000159 | com.eventbrite
244 | 18185850 | 87 | 0.000437 | com.list-manage
245 | 18185730 | 1737 | 0.000018 | com.panoramio
246 | 18184450 | 384 | 0.000064 | com.mashable
247 | 18181712 | 465 | 0.000053 | edu.berkeley
248 | 18181104 | 803 | 0.000036 | co.ibb
249 | 18180796 | 277 | 0.000087 | com.bloomberg
250 | 18177922 | 760 | 0.000037 | com.adjust
251 | 18177008 | 822 | 0.000036 | com.ecwid
252 | 18174004 | 349 | 0.000070 | com.mapbox
253 | 18172194 | 891 | 0.000034 | gov.wa
254 | 18172152 | 1028 | 0.000031 | org.aarp
255 | 18170224 | 1158 | 0.000027 | edu.brookings
256 | 18169502 | 192 | 0.000127 | org.iana
257 | 18165772 | 1250 | 0.000025 | com.dw
258 | 18165654 | 1245 | 0.000025 | com.medicalnewstoday
259 | 18163584 | 218 | 0.000109 | net.php
260 | 18163574 | 420 | 0.000059 | me.telegram
261 | 18163048 | 296 | 0.000081 | org.acm
262 | 18162354 | 207 | 0.000112 | org.gnu
263 | 18158198 | 1476 | 0.000021 | com.sas
264 | 18152116 | 611 | 0.000043 | me.paypal
265 | 18150248 | 1561 | 0.000020 | com.dezeen
266 | 18150072 | 1088 | 0.000029 | com.cio
267 | 18149100 | 799 | 0.000036 | co.elastic
268 | 18148524 | 1293 | 0.000024 | uk.org.tate
269 | 18147684 | 402 | 0.000062 | com.latimes
270 | 18146412 | 199 | 0.000118 | uk.co.amazon
271 | 18146172 | 455 | 0.000054 | com.bigcommerce
272 | 18145158 | 1565 | 0.000020 | be.blogspot
273 | 18144468 | 1260 | 0.000025 | com.hackernoon
274 | 18143080 | 316 | 0.000076 | uk.co.telegraph
275 | 18141808 | 1428 | 0.000022 | com.googlesource
276 | 18141690 | 1448 | 0.000021 | edu.iastate
277 | 18138884 | 1698 | 0.000018 | org.edublogs
278 | 18137392 | 1559 | 0.000020 | com.mathworks
279 | 18134024 | 1234 | 0.000025 | gov.michigan
280 | 18132964 | 372 | 0.000066 | com.livejournal
281 | 18132828 | 1140 | 0.000028 | com.xrea
282 | 18132092 | 1623 | 0.000019 | li.paper
283 | 18128158 | 47 | 0.000767 | com.qq
284 | 18127880 | 1679 | 0.000018 | com.dummies
285 | 18126586 | 143 | 0.000183 | com.unpkg
286 | 18123856 | 1015 | 0.000031 | com.searchenginejournal
287 | 18122824 | 939 | 0.000034 | com.searchenginewatch
288 | 18120806 | 1802 | 0.000017 | fr.unblog
289 | 18119164 | 286 | 0.000083 | com.go
290 | 18114598 | 658 | 0.000042 | com.livechatinc
291 | 18113078 | 149 | 0.000173 | com.opera
292 | 18112360 | 757 | 0.000037 | au.gov.nsw
293 | 18110326 | 1162 | 0.000027 | va.vatican
294 | 18104616 | 1474 | 0.000021 | jp.ac.u-tokyo
295 | 18104360 | 1016 | 0.000031 | uk.co.pinterest
296 | 18103034 | 374 | 0.000065 | com.elsevier
297 | 18097712 | 1412 | 0.000022 | com.activecampaign
298 | 18097614 | 311 | 0.000076 | com.meetup
299 | 18097106 | 1925 | 0.000016 | com.jigsy
300 | 18094268 | 1084 | 0.000029 | uk.gov.nationalarchives
301 | 18092632 | 1219 | 0.000026 | us.mn.state
302 | 18091718 | 1246 | 0.000025 | com.firebaseapp
303 | 18091416 | 1247 | 0.000025 | com.convinceandconvert
304 | 18090158 | 1367 | 0.000023 | us.fl.state
305 | 18083166 | 1700 | 0.000018 | org.emojipedia
306 | 18079266 | 529 | 0.000047 | com.adage
307 | 18079076 | 1192 | 0.000026 | org.maven
308 | 18077604 | 1341 | 0.000023 | gov.mo
309 | 18073450 | 43 | 0.000878 | net.facebook
310 | 18070040 | 712 | 0.000039 | gov.dot
311 | 18069866 | 164 | 0.000160 | uk.co.google
312 | 18069354 | 89 | 0.000415 | com.godaddy
313 | 18068256 | 172 | 0.000155 | com.zendesk
314 | 18066418 | 222 | 0.000106 | com.typepad
315 | 18066074 | 278 | 0.000087 | com.usatoday
316 | 18062078 | 324 | 0.000074 | com.mapquest
317 | 18057670 | 1423 | 0.000022 | gov.ky
318 | 18052106 | 1687 | 0.000018 | com.manta
319 | 18050286 | 426 | 0.000058 | org.hbr
320 | 18050186 | 490 | 0.000050 | net.researchgate
321 | 18049408 | 327 | 0.000073 | com.getclicky
322 | 18049396 | 1398 | 0.000022 | com.convertkit
323 | 18048286 | 2155 | 0.000014 | it.justpaste
324 | 18048048 | 1348 | 0.000023 | com.creativebloq
325 | 18047592 | 1544 | 0.000020 | org.aclweb
326 | 18045082 | 889 | 0.000034 | com.wordstream
327 | 18043064 | 1539 | 0.000020 | ly.snip
328 | 18043014 | 227 | 0.000105 | com.giphy
329 | 18040924 | 2014 | 0.000015 | me.websta
330 | 18040570 | 356 | 0.000068 | com.sxsw
331 | 18040058 | 840 | 0.000035 | edu.psu
332 | 18037266 | 1365 | 0.000023 | gov.maryland
333 | 18032816 | 1813 | 0.000017 | ca.yelp
334 | 18031110 | 1187 | 0.000026 | com.fastcodesign
335 | 18030678 | 1685 | 0.000018 | io.material
336 | 18030052 | 1494 | 0.000021 | org.amnesty
337 | 18027220 | 269 | 0.000089 | org.python
338 | 18026066 | 522 | 0.000048 | org.mediawiki
339 | 18026002 | 479 | 0.000051 | com.buzzfeed
340 | 18024772 | 1150 | 0.000027 | com.findlaw
341 | 18023106 | 783 | 0.000037 | com.arstechnica
342 | 18022710 | 382 | 0.000064 | com.oreilly
343 | 18022180 | 1667 | 0.000019 | edu.toronto
344 | 18019428 | 1703 | 0.000018 | com.healthgrades
345 | 18018232 | 2078 | 0.000015 | tl.page
346 | 18016586 | 477 | 0.000051 | edu.cornell
347 | 18013106 | 342 | 0.000071 | com.springer
348 | 18008962 | 404 | 0.000062 | it.placehold
349 | 18006368 | 1566 | 0.000020 | com.raywenderlich
350 | 18005638 | 383 | 0.000064 | com.nypost
351 | 18005080 | 937 | 0.000034 | com.contentmarketinginstitute
352 | 18002754 | 325 | 0.000073 | int.who
353 | 18001954 | 513 | 0.000048 | org.nodejs
354 | 17999682 | 1524 | 0.000020 | gov.mt
355 | 17997974 | 1345 | 0.000023 | us.pa.state
356 | 17996652 | 329 | 0.000073 | com.cnbc
357 | 17993466 | 1363 | 0.000023 | gov.oregon
358 | 17990558 | 1066 | 0.000030 | com.bandsintown
359 | 17988684 | 392 | 0.000063 | com.gmail
360 | 17988550 | 1734 | 0.000018 | com.wayfair
361 | 17984480 | 332 | 0.000072 | fr.free
362 | 17980878 | 237 | 0.000102 | org.drupal
363 | 17979326 | 910 | 0.000034 | com.angieslist
364 | 17979146 | 450 | 0.000055 | com.kickstarter
365 | 17979024 | 2154 | 0.000014 | com.brandyourself
366 | 17978800 | 399 | 0.000062 | uk.co.dailymail
367 | 17978276 | 1352 | 0.000023 | com.quicksprout
368 | 17977606 | 260 | 0.000093 | uk.org.ico
369 | 17977242 | 469 | 0.000052 | gov.whitehouse
370 | 17974632 | 1014 | 0.000031 | com.speakerdeck
371 | 17972780 | 240 | 0.000102 | com.rawgit
372 | 17964500 | 659 | 0.000042 | com.intel
373 | 17953938 | 899 | 0.000034 | com.wikia
374 | 17951944 | 83 | 0.000482 | com.googleadservices
375 | 17951496 | 506 | 0.000049 | com.box
376 | 17945642 | 1418 | 0.000022 | com.huffpost
377 | 17944878 | 1086 | 0.000029 | net.leadpages
378 | 17943278 | 508 | 0.000049 | com.cbsnews
379 | 17943212 | 323 | 0.000075 | com.time
380 | 17942606 | 1976 | 0.000015 | com.zynga
381 | 17940634 | 242 | 0.000101 | com.getbootstrap
382 | 17940514 | 811 | 0.000036 | com.superpages
383 | 17940204 | 1583 | 0.000019 | com.impactbnd
384 | 17940144 | 177 | 0.000144 | jp.co.yahoo
385 | 17937886 | 67 | 0.000548 | net.jsfiddle
386 | 17936062 | 1271 | 0.000024 | com.smallbiztrends
387 | 17936026 | 1511 | 0.000020 | org.gnupg
388 | 17934886 | 1461 | 0.000021 | co.leadpages
389 | 17934500 | 314 | 0.000076 | com.staticflickr
390 | 17933798 | 1225 | 0.000025 | com.googlegroups
391 | 17933250 | 1488 | 0.000021 | com.thumbtack
392 | 17932302 | 367 | 0.000066 | com.ft
393 | 17931464 | 2051 | 0.000015 | com.ewtn
394 | 17929878 | 370 | 0.000066 | com.office
395 | 17928686 | 1576 | 0.000019 | com.kaggle
396 | 17926934 | 137 | 0.000200 | com.wixsite
397 | 17922076 | 1996 | 0.000015 | org.spie
398 | 17919510 | 1641 | 0.000019 | com.thecut
399 | 17918918 | 1261 | 0.000025 | com.ebayimg
400 | 17917840 | 1806 | 0.000017 | com.googledrive
401 | 17917748 | 391 | 0.000063 | com.aol
402 | 17917512 | 1573 | 0.000019 | org.jenkins-ci
403 | 17915334 | 434 | 0.000056 | com.fortune
404 | 17911152 | 2275 | 0.000014 | net.organicfacts
405 | 17910680 | 357 | 0.000068 | com.unsplash
406 | 17910196 | 2118 | 0.000015 | it.polito
407 | 17902706 | 1753 | 0.000018 | com.mindbodygreen
408 | 17899942 | 599 | 0.000043 | com.proofpoint
409 | 17899484 | 1161 | 0.000027 | edu.ucsd
410 | 17898182 | 2239 | 0.000014 | net.furaffinity
411 | 17895926 | 766 | 0.000037 | com.engadget
412 | 17895860 | 131 | 0.000218 | com.weibo
413 | 17895616 | 229 | 0.000104 | com.surveymonkey
414 | 17895124 | 1556 | 0.000020 | com.crashlytics
415 | 17891646 | 1522 | 0.000020 | com.toptal
416 | 17891064 | 298 | 0.000079 | com.skype
417 | 17890434 | 1310 | 0.000024 | com.avvo
418 | 17889754 | 2012 | 0.000015 | com.doctoroz
419 | 17889278 | 1351 | 0.000023 | io.fabric
420 | 17888404 | 1652 | 0.000019 | com.thoughtworks
421 | 17888344 | 119 | 0.000253 | com.jimdo
422 | 17884296 | 339 | 0.000071 | com.w3schools
423 | 17882826 | 381 | 0.000064 | org.un
424 | 17882360 | 1912 | 0.000016 | com.mysanantonio
425 | 17880012 | 1426 | 0.000022 | com.carto
426 | 17877832 | 1478 | 0.000021 | com.grammarly
427 | 17876928 | 933 | 0.000034 | com.pexels
428 | 17875774 | 1485 | 0.000021 | org.sqlite
429 | 17875586 | 132 | 0.000214 | com.youtube-nocookie
430 | 17873060 | 986 | 0.000032 | com.gizmodo
431 | 17868426 | 1747 | 0.000018 | gov.arts
432 | 17868138 | 992 | 0.000032 | edu.upenn
433 | 17866058 | 1074 | 0.000030 | org.vim
434 | 17864842 | 1812 | 0.000017 | com.instapaper
435 | 17862354 | 690 | 0.000040 | com.vice
436 | 17861898 | 674 | 0.000041 | gov.nist
437 | 17857890 | 70 | 0.000536 | org.reactjs
438 | 17857428 | 1877 | 0.000016 | gov.la
439 | 17857390 | 1712 | 0.000018 | com.politifact
440 | 17856846 | 826 | 0.000036 | com.blackberry
441 | 17856382 | 1584 | 0.000019 | com.ogilvy
442 | 17856122 | 719 | 0.000039 | com.msdn
443 | 17855302 | 2249 | 0.000014 | edu.utep
444 | 17855124 | 1578 | 0.000019 | com.citysearch
445 | 17854150 | 893 | 0.000034 | edu.umich
446 | 17852538 | 223 | 0.000106 | net.behance
447 | 17850120 | 2156 | 0.000014 | com.dynamics
448 | 17849038 | 390 | 0.000063 | com.booking
449 | 17846000 | 2273 | 0.000014 | com.asmallorange
450 | 17843806 | 1311 | 0.000024 | com.curbed
451 | 17842712 | 478 | 0.000051 | com.herokuapp
452 | 17842298 | 216 | 0.000111 | com.automattic
453 | 17842098 | 1541 | 0.000020 | org.aiga
454 | 17842080 | 923 | 0.000034 | org.worldbank
455 | 17841888 | 147 | 0.000176 | com.aspnetcdn
456 | 17841278 | 1850 | 0.000017 | com.deepmind
457 | 17841020 | 1228 | 0.000025 | com.sprinklr
458 | 17840682 | 1058 | 0.000030 | com.thinkwithgoogle
459 | 17839270 | 2360 | 0.000013 | it.clyp
460 | 17838612 | 1540 | 0.000020 | com.instapage
461 | 17837952 | 272 | 0.000088 | com.digg
462 | 17837540 | 1614 | 0.000019 | com.cmswire
463 | 17836636 | 447 | 0.000055 | com.goodreads
464 | 17835202 | 2033 | 0.000015 | au.com.huffingtonpost
465 | 17834194 | 681 | 0.000041 | com.symantec
466 | 17832794 | 385 | 0.000064 | com.dailymotion
467 | 17832500 | 1606 | 0.000019 | com.vendio
468 | 17832000 | 2204 | 0.000014 | net.openreview
469 | 17831708 | 865 | 0.000035 | net.openid
470 | 17830532 | 2206 | 0.000014 | com.kvue
471 | 17830032 | 171 | 0.000155 | com.feedburner
472 | 17829348 | 1479 | 0.000021 | gov.wi
473 | 17826692 | 2056 | 0.000015 | com.kudzu
474 | 17826526 | 2047 | 0.000015 | com.stamen
475 | 17826378 | 1275 | 0.000024 | com.merriam-webster
476 | 17824996 | 1510 | 0.000020 | com.csoonline
477 | 17819228 | 1901 | 0.000016 | it.binged
478 | 17818128 | 1394 | 0.000022 | com.coschedule
479 | 17817378 | 2187 | 0.000014 | com.writersdigest
480 | 17817048 | 699 | 0.000040 | org.bitbucket
481 | 17815720 | 876 | 0.000035 | edu.columbia
482 | 17815700 | 1897 | 0.000016 | google.ai
483 | 17814606 | 1244 | 0.000025 | com.auth0
484 | 17814108 | 1112 | 0.000029 | edu.utexas
485 | 17813510 | 1103 | 0.000029 | org.weforum
486 | 17811808 | 1757 | 0.000018 | com.merchantcircle
487 | 17811776 | 2732 | 0.000013 | com.bitballoon
488 | 17811618 | 2121 | 0.000015 | edu.dukeupress
489 | 17810846 | 2018 | 0.000015 | com.ingress
490 | 17808694 | 148 | 0.000175 | com.tripadvisor
491 | 17808548 | 1858 | 0.000016 | com.king5
492 | 17808180 | 307 | 0.000077 | com.wiley
493 | 17803782 | 1958 | 0.000016 | com.nngroup
494 | 17803738 | 1457 | 0.000021 | com.vanityfair
495 | 17801028 | 337 | 0.000072 | com.hp
496 | 17797994 | 125 | 0.000236 | jp.co.google
497 | 17797558 | 320 | 0.000075 | com.scribd
498 | 17795684 | 336 | 0.000072 | com.tripod
499 | 17794742 | 701 | 0.000040 | io.codepen
500 | 17794630 | 2162 | 0.000014 | io.prototypr
501 | 17794528 | 446 | 0.000055 | com.aliyuncs
502 | 17794052 | 972 | 0.000033 | uk.co.guardian
503 | 17793662 | 566 | 0.000045 | com.samsung
504 | 17793286 | 451 | 0.000055 | com.slack
505 | 17793034 | 685 | 0.000041 | org.eff
506 | 17791936 | 547 | 0.000046 | com.webs
507 | 17789994 | 474 | 0.000052 | com.atlassian
508 | 17789800 | 198 | 0.000119 | de.amazon
509 | 17789764 | 2815 | 0.000012 | edu.alamo
510 | 17788872 | 1520 | 0.000020 | com.jeffbullas
511 | 17785572 | 1844 | 0.000017 | ca.ubc
512 | 17783666 | 452 | 0.000054 | com.newrelic
513 | 17778998 | 1776 | 0.000017 | com.financialexpress
514 | 17777484 | 1051 | 0.000030 | com.yellowpages
515 | 17777164 | 1647 | 0.000019 | org.owasp
516 | 17776950 | 1209 | 0.000026 | org.whatbrowser
517 | 17772704 | 1422 | 0.000022 | org.tigris
518 | 17771794 | 1723 | 0.000018 | com.thermofisher
519 | 17771104 | 429 | 0.000057 | com.businesswire
520 | 17769374 | 1664 | 0.000019 | org.wikidata
521 | 17769220 | 205 | 0.000113 | com.bandcamp
522 | 17768404 | 195 | 0.000122 | com.constantcontact
523 | 17767070 | 1444 | 0.000021 | com.pcworld
524 | 17766282 | 861 | 0.000035 | com.dropboxusercontent
525 | 17763526 | 1233 | 0.000025 | edu.purdue
526 | 17762444 | 297 | 0.000080 | com.wufoo
527 | 17762030 | 734 | 0.000038 | com.createjs
528 | 17761810 | 396 | 0.000063 | com.force
529 | 17759846 | 565 | 0.000045 | in.co.google
530 | 17759466 | 364 | 0.000067 | org.doi
531 | 17757706 | 2193 | 0.000014 | com.hotfrog
532 | 17757234 | 863 | 0.000035 | com.foxnews
533 | 17756726 | 1402 | 0.000022 | org.letsencrypt
534 | 17755956 | 200 | 0.000117 | org.icann
535 | 17755908 | 418 | 0.000060 | com.inc
536 | 17755824 | 1528 | 0.000020 | com.invisionapp
537 | 17755374 | 2322 | 0.000013 | com.yellowbook
538 | 17755084 | 295 | 0.000081 | gov.cdc
539 | 17752452 | 1135 | 0.000028 | org.altervista
540 | 17749954 | 2167 | 0.000014 | com.khou
541 | 17749580 | 2106 | 0.000015 | com.quickanddirtytips
542 | 17749254 | 1416 | 0.000022 | org.sonatype
543 | 17749176 | 2422 | 0.000013 | es.iac
544 | 17749162 | 170 | 0.000156 | ru.mail
545 | 17748102 | 1281 | 0.000024 | com.storify
546 | 17745564 | 1185 | 0.000026 | us.imageshack
547 | 17745434 | 2359 | 0.000013 | org.hg
548 | 17743856 | 696 | 0.000040 | com.psychologytoday
549 | 17743458 | 1251 | 0.000025 | com.upwork
550 | 17743324 | 1052 | 0.000030 | com.ycombinator
551 | 17742228 | 1646 | 0.000019 | com.kinsta
552 | 17742204 | 1027 | 0.000031 | com.hootsuite
553 | 17741772 | 1204 | 0.000026 | ca.blogspot
554 | 17741450 | 2284 | 0.000014 | com.theminimalists
555 | 17738328 | 1254 | 0.000025 | com.ifttt
556 | 17732728 | 299 | 0.000079 | com.prnewswire
557 | 17732640 | 2086 | 0.000015 | jp.riken
558 | 17730736 | 1938 | 0.000016 | at.tugraz
559 | 17730652 | 841 | 0.000035 | com.docker
560 | 17730110 | 1337 | 0.000023 | in.blogspot
561 | 17728102 | 2132 | 0.000014 | com.theoutline
562 | 17727422 | 1172 | 0.000027 | com.indiegogo
563 | 17724128 | 954 | 0.000033 | com.alexa
564 | 17723922 | 3550 | 0.000012 | com.twitpic
565 | 17723324 | 576 | 0.000044 | com.windowsphone
566 | 17723084 | 1173 | 0.000027 | com.homeadvisor
567 | 17722600 | 1694 | 0.000018 | uk.co.metro
568 | 17720038 | 2745 | 0.000013 | com.idt
569 | 17719404 | 2456 | 0.000013 | com.23hq
570 | 17717984 | 1471 | 0.000021 | org.khanacademy
571 | 17716196 | 1821 | 0.000017 | org.elasticsearch
572 | 17715978 | 767 | 0.000037 | com.indiatimes
573 | 17715462 | 1989 | 0.000015 | com.shoutmeloud
574 | 17714586 | 480 | 0.000051 | com.nature
575 | 17713698 | 414 | 0.000060 | edu.cmu
576 | 17713112 | 1856 | 0.000017 | com.city-data
577 | 17712356 | 2116 | 0.000015 | com.kgw
578 | 17711724 | 1238 | 0.000025 | org.pewresearch
579 | 17711380 | 975 | 0.000033 | com.sfgate
580 | 17711188 | 1672 | 0.000018 | gov.nh
581 | 17711070 | 1983 | 0.000015 | google.design
582 | 17708460 | 457 | 0.000053 | com.gitlab
583 | 17708384 | 507 | 0.000049 | uk.co.independent
584 | 17708180 | 1678 | 0.000018 | org.polymer-project
585 | 17708112 | 2129 | 0.000014 | org.designmuseum
586 | 17708080 | 219 | 0.000108 | jp.ne.hatena
587 | 17707174 | 224 | 0.000106 | to.amzn
588 | 17704286 | 1143 | 0.000027 | edu.wisc
589 | 17703708 | 527 | 0.000047 | com.statista
590 | 17702676 | 802 | 0.000036 | com.netflix
591 | 17702444 | 1208 | 0.000026 | com.firefox
592 | 17701640 | 2944 | 0.000012 | edu.brown
593 | 17700316 | 1777 | 0.000017 | com.tutsplus
594 | 17699332 | 2197 | 0.000014 | ca.uwaterloo
595 | 17697030 | 2189 | 0.000014 | com.company
596 | 17696512 | 1609 | 0.000019 | com.martechtoday
597 | 17696414 | 560 | 0.000045 | org.pbs
598 | 17696178 | 1632 | 0.000019 | com.fiverr
599 | 17693936 | 2353 | 0.000013 | com.instructables
600 | 17693836 | 1226 | 0.000025 | com.clicky
601 | 17693516 | 244 | 0.000101 | com.wpengine
602 | 17693384 | 1009 | 0.000031 | com.uservoice
603 | 17690248 | 3133 | 0.000012 | net.digitalcongo
604 | 17688104 | 496 | 0.000049 | us.icio
605 | 17688098 | 1724 | 0.000018 | us.nm.state
606 | 17685732 | 2920 | 0.000012 | com.wvec
607 | 17685630 | 2831 | 0.000012 | com.growtix
608 | 17684812 | 1786 | 0.000017 | us.ma.state
609 | 17684532 | 1116 | 0.000028 | uk.ac.cam
610 | 17684330 | 2372 | 0.000013 | com.warriorplus
611 | 17684134 | 957 | 0.000033 | com.shutterstock
612 | 17683636 | 1314 | 0.000024 | uk.co.theregister
613 | 17682188 | 1007 | 0.000032 | es.agpd
614 | 17682158 | 2168 | 0.000014 | com.what3words
615 | 17680382 | 1876 | 0.000016 | com.itsnicethat
616 | 17679870 | 326 | 0.000073 | org.joomla
617 | 17676374 | 2153 | 0.000014 | com.dreamgrow
618 | 17675740 | 1212 | 0.000026 | com.playstation
619 | 17674912 | 2325 | 0.000013 | org.webpagetest
620 | 17674664 | 1815 | 0.000017 | io.pantheon
621 | 17673952 | 3026 | 0.000012 | org.nalip
622 | 17673070 | 1382 | 0.000022 | com.digitaltrends
623 | 17672802 | 2256 | 0.000014 | com.googlelabs
624 | 17672798 | 106 | 0.000299 | net.2mdn
625 | 17671734 | 509 | 0.000049 | tv.twitch
626 | 17671174 | 1168 | 0.000027 | com.steamcommunity
627 | 17670820 | 2061 | 0.000015 | com.targetmarketingmag
628 | 17670692 | 178 | 0.000144 | me.line
629 | 17670586 | 2801 | 0.000013 | co.edureka
630 | 17670360 | 2230 | 0.000014 | eu.i-scoop
631 | 17670228 | 1929 | 0.000016 | com.wral
632 | 17669872 | 1974 | 0.000015 | us.wi.state
633 | 17668578 | 2229 | 0.000014 | net.wrightflyer
634 | 17666822 | 2355 | 0.000013 | gov.cabq
635 | 17666534 | 340 | 0.000071 | com.bitly
636 | 17666208 | 368 | 0.000066 | cn.com.sina
637 | 17665830 | 1376 | 0.000022 | com.intuit
638 | 17665486 | 1339 | 0.000023 | kr.or.kisa
639 | 17665464 | 1043 | 0.000030 | com.newsweek
640 | 17665278 | 1152 | 0.000027 | edu.northwestern
641 | 17664282 | 2384 | 0.000013 | edu.uah
642 | 17663658 | 1816 | 0.000017 | com.rabbitmq
643 | 17662888 | 2067 | 0.000015 | com.wfaa
644 | 17662822 | 1312 | 0.000024 | com.ning
645 | 17662498 | 1923 | 0.000016 | ch.ethz
646 | 17661652 | 1622 | 0.000019 | com.sharefile
647 | 17661252 | 1259 | 0.000025 | com.pcmag
648 | 17660468 | 407 | 0.000061 | edu.nyu
649 | 17659788 | 1029 | 0.000031 | gov.fcc
650 | 17658992 | 348 | 0.000070 | org.opensource
651 | 17658162 | 74 | 0.000531 | me.ogp
652 | 17658048 | 2400 | 0.000013 | com.wikidot
653 | 17657344 | 1610 | 0.000019 | com.com
654 | 17657246 | 187 | 0.000133 | com.eepurl
655 | 17657014 | 1227 | 0.000025 | com.ssrn
656 | 17656988 | 694 | 0.000040 | com.xinhuanet
657 | 17654064 | 1780 | 0.000017 | org.scala-lang
658 | 17653188 | 1408 | 0.000022 | edu.unc
659 | 17652568 | 1894 | 0.000016 | org.iihs
660 | 17652104 | 771 | 0.000037 | org.plos
661 | 17651732 | 1274 | 0.000024 | tv.ustream
662 | 17651382 | 1068 | 0.000030 | ly.ow
663 | 17650966 | 2194 | 0.000014 | com.almanac
664 | 17650526 | 2055 | 0.000015 | com.gamespot
665 | 17650220 | 2335 | 0.000013 | com.bibliocommons
666 | 17649444 | 660 | 0.000042 | com.feedly
667 | 17648694 | 797 | 0.000036 | com.deloitte
668 | 17646576 | 959 | 0.000033 | gov.senate
669 | 17646290 | 2218 | 0.000014 | org.onegreenplanet
670 | 17645346 | 2125 | 0.000014 | com.yourdomain
671 | 17645168 | 433 | 0.000057 | com.squareup
672 | 17644982 | 1886 | 0.000016 | com.mariadb
673 | 17643414 | 1548 | 0.000020 | org.postimg
674 | 17642978 | 1291 | 0.000024 | org.cambridge
675 | 17642506 | 2381 | 0.000013 | com.marksdailyapple
676 | 17642072 | 261 | 0.000091 | com.histats
677 | 17641504 | 1666 | 0.000019 | com.digitaloceanspaces
678 | 17641474 | 1267 | 0.000024 | com.canva
679 | 17641424 | 1392 | 0.000022 | im.gitter
680 | 17641326 | 1198 | 0.000026 | com.techrepublic
681 | 17640734 | 2749 | 0.000013 | com.themonitor
682 | 17640688 | 1577 | 0.000019 | uk.co.thesun
683 | 17640560 | 1618 | 0.000019 | com.nba
684 | 17639544 | 2170 | 0.000014 | com.winemag
685 | 17638276 | 1431 | 0.000022 | com.mcafee
686 | 17638268 | 913 | 0.000034 | gov.justice
687 | 17635694 | 722 | 0.000039 | com.steampowered
688 | 17633614 | 886 | 0.000035 | com.timeanddate
689 | 17633566 | 445 | 0.000055 | com.adweek
690 | 17631684 | 834 | 0.000035 | com.aliexpress
691 | 17630196 | 302 | 0.000078 | com.netdna-ssl
692 | 17630110 | 1764 | 0.000017 | us.oh.state
693 | 17629376 | 1241 | 0.000025 | com.optinmonster
694 | 17628644 | 1389 | 0.000022 | org.js
695 | 17628624 | 2240 | 0.000014 | jp.ac.kobe-u
696 | 17627756 | 657 | 0.000042 | gov.noaa
697 | 17626576 | 1782 | 0.000017 | org.openweathermap
698 | 17625866 | 853 | 0.000035 | com.marketwatch
699 | 17625214 | 2371 | 0.000013 | com.winefolly
700 | 17624654 | 1585 | 0.000019 | org.golang
701 | 17623884 | 343 | 0.000071 | ca.google
702 | 17623882 | 1171 | 0.000027 | com.hollywoodreporter
703 | 17623394 | 2642 | 0.000013 | org.travelblog
704 | 17621812 | 2915 | 0.000012 | me.pxlme
705 | 17621742 | 1718 | 0.000018 | com.crunchbase
706 | 17621104 | 2417 | 0.000013 | com.thedrinksbusiness
707 | 17620918 | 1253 | 0.000025 | com.mlb
708 | 17620844 | 2183 | 0.000014 | com.designobserver
709 | 17619798 | 1957 | 0.000016 | com.whitepages
710 | 17618860 | 1308 | 0.000024 | fr.lemonde
711 | 17617276 | 1650 | 0.000019 | com.pastebin
712 | 17616020 | 2667 | 0.000013 | com.backyardchickens
713 | 17615996 | 378 | 0.000065 | com.themeisle
714 | 17615324 | 247 | 0.000099 | io.polyfill
715 | 17614672 | 3674 | 0.000011 | org.torproject
716 | 17614462 | 1196 | 0.000026 | com.politico
717 | 17612598 | 965 | 0.000033 | de.blogspot
718 | 17612468 | 2143 | 0.000014 | com.programmableweb
719 | 17612380 | 777 | 0.000037 | gov.house
720 | 17612350 | 2378 | 0.000013 | uk.ac.hud
721 | 17612226 | 313 | 0.000076 | com.fc2
722 | 17609572 | 351 | 0.000069 | jp.co.rakuten
723 | 17609426 | 1284 | 0.000024 | se.haxx
724 | 17609170 | 401 | 0.000062 | com.smugmug
725 | 17609048 | 2191 | 0.000014 | com.azfamily
726 | 17607352 | 126 | 0.000236 | info.aboutads
727 | 17607032 | 5050 | 0.000007 | com.formula1
7281760632029480.000012com.locationrebel
729176040202520.000097com.marriott
730176033541850.000134com.xing
7311760315615430.000020org.doxygen
732176029564910.000050com.snapchat
7331760190227710.000013com.trendland
7341760064010730.000030com.americanexpress
7351760063611150.000028com.redhat
7361760060623940.000013com.sitejabber
7371760043623110.000014com.galvanize
7381760009042850.000009com.dreamstime
7391759969020350.000015com.insiderpages
7401759912614190.000022kr.flic
7411759906611100.000029gov.uspto
742175990608370.000035br.com.uol
743175960145300.000047com.163
744175958762900.000082gov.ftc
745175954724950.000049com.nasdaq
7461759512627530.000013com.lookuppage
7471759355011340.000028fr.blogspot
7481759257012780.000024com.prezi
7491759171226590.000013com.avsforum
750175913284100.000061mp.mailchi
7511759060820200.000015edu.arizona
752175902307930.000036com.nielsen
7531758973823640.000013com.chamberofcommerce
7541758941421470.000014com.towardsdatascience
7551758909010500.000030com.sciencedaily
756175881149780.000033io.readthedocs
757175878442830.000083com.dedecms
7581758750415490.000020uk.co.wired
7591758657812520.000025com.dell
7601758581014350.000021com.billboard
761175856604210.000059com.criteo
7621758552422830.000014org.zenit
7631758518810620.000030org.change
7641758484013040.000024edu.academia
765175838185880.000044com.newyorker
7661758220035910.000012com.sophos
7671758218017410.000018de.welt
768175814883520.000069net.themeforest
7691758130422930.000014org.gwtproject
7701758066227880.000013io.setosa
7711758065612760.000024st.prom
7721758061414330.000021fm.last
7731758054017300.000018com.fifa
7741758053026870.000013com.storeboard
7751758028221690.000014au.com.truelocal
7761758019422970.000014com.2findlocal
7771758007010930.000029com.visualstudio
7781757974011110.000029com.500px
779175795382500.000097jp.co.amazon
7801757858022640.000014net.webhostingsecretrevealed
7811757497617520.000018org.rubyonrails
78217574924590.000607com.messenger
7831757488616900.000018com.mtv
7841757466222770.000014com.newsbank
7851757378210950.000029de.heise
7861757338419110.000016com.ibtimes
7871757061016160.000019com.problogger
7881757012619950.000015com.ehow
7891756978419980.000015mp.j
7901756858010360.000031com.cbslocal
7911756837023980.000013com.wcnc
7921756824410960.000029com.investopedia
7931756778029300.000012edu.unl
7941756710423170.000014ly.cl
795175663446870.000041com.caniuse
796175663024310.000057com.verisign
7971756612014550.000021com.hotmail
7981756591421810.000014au.com.yellowpages
7991756560014530.000021com.rollingstone
8001756557221150.000015com.local
801175644282310.000104fr.google
802175636882150.000111it.google
8031756328820030.000015com.smartblogger
8041756286816630.000019org.coursera
8051756247222200.000014gov.louisvilleky
8061756208212020.000026com.domain
807175608685970.000043com.nationalgeographic
8081756010421230.000015com.theinnovationenterprise
8091755964028540.000012ke.co.blogspot
8101755880222650.000014io.kubernetes
8111755870223190.000014net.brownbook
8121755817218590.000016de.zeit
8131755810613440.000023com.freepik
8141755762022050.000014com.goinswriter
815175574327310.000039com.tandfonline
8161755694614800.000021edu.jhu
8171755652220710.000015com.riddle
8181755637612560.000025com.vox
8191755560211270.000028com.smashingmagazine
8201755466017560.000018edu.msu
821175544228380.000035com.uk
8221755430429580.000012org.dyndns
8231755391424030.000013com.wsoctv
8241755377624060.000013com.independent
8251755377613870.000022com.nymag
8261755298818090.000017com.posterous
8271755083411890.000026com.digitalocean
828175505168830.000035com.gofundme
829175498042550.000095com.myshopify
8301754935627460.000013com.spoke
8311754912220640.000015com.chambermaster
8321754830211790.000027de.spiegel
8331754818817840.000017com.ikea
8341754815422630.000014com.bizcommunity
8351754809427300.000013com.communitywalk
8361754751623990.000013com.ibmbigdatahub
8371754748619060.000016com.thewritepractice
8381754684615990.000019org.filezilla-project
8391754681018990.000016com.techradar
8401754667819630.000015com.visioncritical
8411754615419690.000015com.brafton
8421754585216270.000019com.codeplex
843175453384280.000057com.sohu
844175443163350.000072com.jotform
8451754371417790.000017com.lawyers
8461754344223160.000014edu.hbs
8471754305814010.000022edu.usc
848175429641520.000169com.addtoany
8491754280827050.000013com.nation2
8501754260213170.000023edu.uchicago
8511754237621820.000014com.w3techs
8521754149613710.000023sh.brew
8531754127213250.000023com.strikingly
8541754026219880.000015org.aclu
8551754023625790.000013com.kens5
856175399064530.000054jp.ne.sakura
8571753989410920.000029com.prweb
8581753981024280.000013com.tractorsupply
8591753938235400.000012com.gyazo
8601753924027170.000013com.yelloyello
8611753882012310.000025com.elpais
8621753866238640.000010com.rottentomatoes
8631753829621380.000014net.hockeyapp
8641753791216970.000018com.howstuffworks
8651753668628050.000012com.lacartes
8661753628816280.000019io.getmdl
8671753539223430.000013com.citysquares
868175342227610.000037net.daum
8691753346024070.000013com.kmov
8701753071028160.000012com.mothering
871175304264840.000051com.iconfinder
8721752968628730.000012org.rethinkingschools
8731752881014560.000021org.wiktionary
874175285327070.000040com.emarketer
875175285122590.000094me.t
8761752836828850.000012com.asus
8771752749419040.000016com.rt
878175274829930.000032com.oup
8791752724813830.000022com.theglobeandmail
8801752471212700.000024co.vine
8811752442027680.000013org.foodrevolution
8821752403223650.000013com.wpxi
883175232329730.000033com.airbnb
884175229549700.000033gov.usa
8851752287227020.000013com.njmonthly
8861752260210110.000031org.unesco
8871752204828000.000013org.thebestschools
8881752126823090.000014com.ezlocal
889175211241730.000153com.bluehost
890175210402280.000105com.maxcdn
8911752073624010.000013com.cbs
8921751997014050.000022org.example
8931751974628060.000012com.calmclinic
894175196549640.000033gov.copyright
8951751913021340.000014edu.ncsu
8961751762639000.000010com.domaintools
8971751726427990.000013com.trepup
8981751719018490.000017edu.indiana
8991751623614100.000022org.unicode
9001751492827340.000013com.mykaratestore
9011751477018600.000016com.adespresso
902175147424870.000050org.whatwg
9031751445037820.000011gd.is
9041751338418370.000017re.cli
9051751313431620.000012com.000webhostapp
906175129429070.000034com.alibaba
9071751291616240.000019com.britannica
9081751273612100.000026com.reverbnation
909175122566090.000043com.patreon
9101751153040220.000010edu.iu
911175111247940.000036com.yandex
912175110745250.000047com.outlook
9131751021011510.000027org.fao
9141750997627560.000013co.wanelo
9151750894616930.000018com.udemy
9161750881411880.000026gov.usgs
917175085889410.000034com.ggpht
9181750850613090.000024uk.co.mirror
9191750842211560.000027edu.umn
920175077303180.000075nl.google
921175050542580.000094com.disqus
9221750475413130.000024com.pwc
923175046389610.000033com.pinimg
9241750445818000.000017com.html5rocks
9251750414811240.000028com.sun
9261750346812000.000026com.uber
9271750179246430.000008com.mysite
9281750175623080.000014org.gimp
9291750172220660.000015com.packtpub
9301750169026760.000013com.pages10
9311750146824210.000013com.tuck
9321750099226150.000013org.swi-prolog
9331750021818350.000017edu.virginia
9341749988628970.000012be.brussels
9351749978411060.000029au.net.abc
936174996782260.000105com.googletagservices
9371749699635310.000012ch.cern
9381749628225800.000013com.ktvb
939174961025010.000049com.bigcartel
9401749579819300.000016com.nfl
9411749579413000.000024com.showmelocal
9421749451813580.000023org.pnas
9431749401421750.000014uk.co.realbusiness
9441749401027280.000013ly.visual
9451749244221090.000015com.discovery
9461749238426220.000013org.virginiadot
9471749197810690.000030com.us
9481749185218290.000017edu.cuny
9491749129816810.000018com.podbean
9501749117211820.000026com.accenture
9511749114027550.000013com.pushwoosh
9521749087625880.000013com.yellowbot
9531749065229030.000012com.watchuseek
9541749051014510.000021com.thehill
9551749044228340.000012com.callupcontact
9561749015623970.000013com.echelman
9571749000038530.000010org.greenpeace
9581748917417830.000017com.screencast
9591748847620430.000015com.webnode
9601748835611190.000028com.lifehacker
961174870609910.000032org.iso
962174870487280.000039com.gartner
9631748580416800.000018com.hulu
9641748578419270.000016co.gcdn
9651748460015250.000020com.windows
9661748386620010.000015com.birdeye
967174830349510.000034ru.google
9681748297039090.000010org.bitcoin
9691748132624130.000013com.topsy
9701748103624290.000013com.texasbar
971174804548820.000035com.stitcher
9721747894029240.000012com.talkbass
9731747858422780.000014ca.ualberta
974174785048430.000035gg.discord
9751747845028550.000012com.cylex-usa
9761747751436730.000011nl.xs4all
9771747739629250.000012info.ufacity
978174773821040.000304com.namecheap
9791747712228490.000012com.louisville
9801747631624600.000013uk.gov.westsussex
9811747568626680.000013com.salespider
9821747549015680.000019com.nokia
9831747547610340.000031com.digiday
9841747526228890.000012org.stnicholascenter
9851747505628500.000012au.com.hotfrog
9861747479812570.000025org.webkit
9871747449224300.000013net.blog5
9881747423622710.000014tv.periscope
989174739028770.000035uk.co.tripadvisor
9901747380229000.000012org.phys
9911747318816120.000019edu.umd
992174731048840.000035gov.ny
9931747212219810.000015ru.narod
994174716482330.000104jp.ameblo
9951747160651780.000007net.minecraft
996174707841630.000162com.youku
9971747048216770.000018org.gnome
998174701661840.000137com.nginx
9991747015614470.000021com.splashthat
10001746987826810.000013com.bleacherreport

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

April 2019 crawl archive now available

The crawl archive for April 2019 is now available! It contains 2.5 billion web pages or 198 TiB of uncompressed content, crawled between April 18th and 26th.

The April crawl contains page captures of 750 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Nov/Dec/Jan 2018/2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 3 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
  • a random sample of 1 billion outlinks taken from WAT files of the March crawl
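
To illustrate the hop limit of the breadth-first side crawl described above, here is a minimal sketch of a depth-bounded breadth-first traversal. It is not the crawler's actual implementation: the function name `bounded_bfs`, the in-memory `outlinks` dictionary and the example URLs are assumptions made for illustration only.

```python
from collections import deque


def bounded_bfs(seeds, outlinks, max_hops=3):
    """Return {url: hops} for every URL reachable within max_hops of a seed."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in outlinks.get(url, ()):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Toy link graph with hypothetical URLs: a chain of four links from a homepage.
toy_links = {
    "https://example.com/": ["https://example.com/a"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}
reached = bounded_bfs(["https://example.com/"], toy_links, max_hops=3)
```

With `max_hops=3` the page `/c` is still reached (three hops from the homepage), while `/d` is not.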

The following minor changes to the crawler configuration have been made:

  • the crawler again sends an Accept-Language HTTP header, requesting English content
  • the configuration has been tweaked to include less non-HTML content
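
As a rough illustration of the first change, the sketch below builds an HTTP request that carries an Accept-Language header using Python's standard library. The header value `en-US,en;q=0.5` and the helper name `build_request` are assumptions; the post does not state the exact value the crawler sends.

```python
import urllib.request


def build_request(url, languages="en-US,en;q=0.5"):
    """Create a request object that asks the server for English content."""
    # The header value is an assumed example, not the crawler's real setting.
    return urllib.request.Request(url, headers={"Accept-Language": languages})


req = build_request("https://example.com/")
# urllib stores header names in capitalized form, hence "Accept-language".
accept_language = req.get_header("Accept-language")
```

No network access happens here; the request is only constructed, not sent.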

Archive Location and Download

The April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-18/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line yields the S3 or HTTP path, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2019-18/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2019-18/warc.paths.gz | 56000 | 44.86 |
| WAT files | CC-MAIN-2019-18/wat.paths.gz | 56000 | 16.32 |
| WET files | CC-MAIN-2019-18/wet.paths.gz | 56000 | 6.96 |
| Robots.txt files | CC-MAIN-2019-18/robotstxt.paths.gz | 56000 | 0.16 |
| Non-200 responses files | CC-MAIN-2019-18/non200responses.paths.gz | 56000 | 1.67 |
| URL index files | CC-MAIN-2019-18/cc-index.paths.gz | 302 | 0.19 |
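
The path listings can be turned into full S3 or HTTP paths with a few lines of code. The following is a minimal sketch, assuming a listing has already been downloaded (or is available as a file-like object); the helper name `full_paths` and the sample path are hypothetical and only stand in for real entries of a listing such as warc.paths.gz.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing, prefix):
    """Yield one full path per non-empty line of a gzipped path listing.

    `listing` may be a file name or a binary file-like object.
    """
    with gzip.open(listing, "rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:
                yield prefix + path


# Hypothetical one-line listing, compressed in memory for the demonstration.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2019-18/segments/EXAMPLE/warc/example-00000.warc.gz\n"
)
http_urls = list(full_paths(io.BytesIO(sample), HTTP_PREFIX))
s3_urls = list(full_paths(io.BytesIO(sample), S3_PREFIX))
```

Passing a local file name such as `warc.paths.gz` instead of the in-memory sample works the same way.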

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-18/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March 2019 crawl archive now available

The crawl archive for March 2019 is now available! It contains 2.55 billion web pages or 210 TiB of uncompressed content, crawled between March 18th and 27th.

The March crawl contains page captures of 660 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Nov/Dec/Jan 2018/2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains
  • a random sample of outlinks taken from WAT files of the February crawl

Archive Location and Download

The March crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-13/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line yields the S3 or HTTP path, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2019-13/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2019-13/warc.paths.gz | 56000 | 49.09 |
| WAT files | CC-MAIN-2019-13/wat.paths.gz | 56000 | 17.37 |
| WET files | CC-MAIN-2019-13/wet.paths.gz | 56000 | 7.47 |
| Robots.txt files | CC-MAIN-2019-13/robotstxt.paths.gz | 56000 | 0.17 |
| Non-200 responses files | CC-MAIN-2019-13/non200responses.paths.gz | 56000 | 1.63 |
| URL index files | CC-MAIN-2019-13/cc-index.paths.gz | 302 | 0.19 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-13/. The columnar index has also been updated to include this crawl.
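
To look up captures of a URL, a query string can be sent to the index server. The sketch below only constructs such a query URL and sends nothing. It assumes the server exposes a per-crawl endpoint named `<crawl-id>-index` that accepts `url` and `output` parameters; the helper name `index_query_url` is hypothetical, and the endpoint layout should be checked against the index server's own documentation.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"


def index_query_url(crawl_id, url_pattern, output="json"):
    """Build a query URL for one crawl's URL index (assumed endpoint layout)."""
    query = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"


query_url = index_query_url("CC-MAIN-2019-13", "example.com/*")
```

`urlencode` percent-encodes the slash and the asterisk of the pattern, so the result can be passed to any HTTP client as is.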

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2019 crawl archive now available

The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th.

The February crawl contains page captures of 750 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Nov/Dec/Jan 2018/2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 5 links (“hops”) away from the homepages of the top 50 million hosts and domains
  • a random sample of outlinks taken from WAT files of the January crawl

The number of sampled URLs per domain depends on the domain’s harmonic centrality rank in the webgraph data set: higher-ranking domains are allowed to “contribute” more URLs.

The way our crawler handles politeness limits per host and/or pay-level domain has been improved:
First, limits are now configurable and are based on the harmonic centrality rank of a domain.
Second, we now also limit the number of hosts/subdomains per domain. This limit is also based on the domain rank and ranges from 500,000 subdomains for top-ranking domains (think of blogspot.com) to fewer than 100 for low-ranking domains. While the number of hosts covered in the February crawl dropped to 50 million from 60 million in January, we see a positive impact on the total number of pages crawled for large domains. Technically, every host requires a DNS lookup and a robots.txt fetch even if only a single page is fetched from it, so the crawler performs better if resources are focused on a few hundred thousand subdomains rather than spread over millions of hosts. We also hope that a limit on the number of hosts per domain makes the crawler more robust against link spam. The set of sampled subdomains for large domains will vary from month to month to ensure good overall coverage when multiple monthly crawls are combined.
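
The exact formula behind the rank-based host limit is not given here. Purely as an illustration of how such a limit could shrink with the rank, the sketch below interpolates on a logarithmic scale between an upper bound of 500,000 hosts for rank 1 and a lower bound for the last-ranked domain. The function name, the lower bound of 50 and the log-scale interpolation are all assumptions, not the crawler's configuration.

```python
import math


def max_hosts_for_rank(rank, total_domains, top_limit=500_000, low_limit=50):
    """Hypothetical host limit that decays log-linearly with the domain rank."""
    if rank < 1 or total_domains < 2:
        raise ValueError("rank must be >= 1 and total_domains >= 2")
    rank = min(rank, total_domains)
    # 0.0 for the top-ranked domain, 1.0 for the lowest-ranked one.
    fraction = math.log(rank) / math.log(total_domains)
    log_limit = math.log(top_limit) + fraction * (
        math.log(low_limit) - math.log(top_limit)
    )
    return round(math.exp(log_limit))


total = 90_000_000  # order of magnitude of the domain-level webgraph
top = max_hosts_for_rank(1, total)
middle = max_hosts_for_rank(1_000, total)
bottom = max_hosts_for_rank(total, total)
```

Any monotonically decreasing function of the rank would serve the same purpose; the logarithmic scale is chosen here only because ranks span many orders of magnitude.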

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-09/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line yields the S3 or HTTP path, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2019-09/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2019-09/warc.paths.gz | 64000 | 59.86 |
| WAT files | CC-MAIN-2019-09/wat.paths.gz | 64000 | 18.23 |
| WET files | CC-MAIN-2019-09/wet.paths.gz | 64000 | 7.62 |
| Robots.txt files | CC-MAIN-2019-09/robotstxt.paths.gz | 64000 | 0.17 |
| Non-200 responses files | CC-MAIN-2019-09/non200responses.paths.gz | 64000 | 1.79 |
| URL index files | CC-MAIN-2019-09/cc-index.paths.gz | 302 | 0.21 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-09/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 – 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 407 million nodes and 4.2 billion edges and includes dangling nodes, i.e. hosts that have not been crawled but are linked from a crawled page. There are 323 million dangling nodes (79%), and the largest strongly connected component contains 63 million nodes (15%).
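
A host that was never crawled cannot contribute outgoing links, so in an edge list it shows up only as a link target. The sketch below uses that observation to approximate dangling nodes as nodes without any outgoing edge. Treating "no outgoing edge" as "dangling" is an assumption made for this example (a crawled host without links to other hosts would also be counted), and the function name and toy edge list are hypothetical.

```python
def dangling_nodes(num_nodes, edges):
    """Return the ids of nodes that never appear as the source of an edge."""
    has_outlink = [False] * num_nodes
    for from_id, _to_id in edges:
        has_outlink[from_id] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]


# Toy graph with four hosts; edges are (from_id, to_id) pairs.
toy_edges = [(0, 1), (0, 2), (1, 0), (1, 3)]
dangling = dangling_nodes(4, toy_edges)
dangling_share = len(dangling) / 4
```

In the toy graph, hosts 2 and 3 are only pointed to, so half of the nodes are dangling.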

You can download the graph and the ranks of all 407 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Nov/Dec/Jan 2018-19 host-level webgraph

| Size | File | Description |
|---|---|---|
| 2.90 GB | cc-main-2018-19-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 42 vertices files |
| 18.84 GB | cc-main-2018-19-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 84 edges files |
| 7.81 GB | cc-main-2018-19-nov-dec-jan-host.graph | graph in BVGraph format |
| 2 kB | cc-main-2018-19-nov-dec-jan-host.properties | |
| 8.16 GB | cc-main-2018-19-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks) |
| 2 kB | cc-main-2018-19-nov-dec-jan-host-t.properties | |
| 1 kB | cc-main-2018-19-nov-dec-jan-host.stats | WebGraph statistics |
| 7.50 GB | cc-main-2018-19-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank |

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
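
The reversal can be reproduced with a few lines of code. This is a minimal sketch of the transformation described above, not the code used to build the graph; the function name `reverse_host` is hypothetical, and lowercasing and stripping a trailing dot are additional assumptions.

```python
def reverse_host(host):
    """Strip a leading "www." and reverse the order of the dot-separated labels."""
    host = host.lower().rstrip(".")  # normalisation steps assumed for this sketch
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


reversed_name = reverse_host("www.subdomain.example.com")
```

Applying the label reversal a second time restores the original order, so `reverse_host("com.example.subdomain")` gives `subdomain.example.com`.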

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
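
The aggregation step can be sketched as follows: each host is mapped to its pay-level domain, and host-level edges are collapsed into domain-level edges. The tiny suffix set below is only a stand-in for the public suffix list, wildcard and exception rules of the real list are ignored, and dropping links within the same domain is an assumption of this sketch. All function names are hypothetical.

```python
# Tiny stand-in for the public suffix list (assumption for illustration only).
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host, suffixes=TOY_SUFFIXES):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):  # longest candidate suffix is tried first
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_edges(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src_host, dst_host in host_edges:
        src, dst = pay_level_domain(src_host), pay_level_domain(dst_host)
        if src and dst and src != dst:  # skip unknown suffixes and self-links
            domain_edges.add((src, dst))
    return domain_edges


host_edges = [
    ("blog.example.com", "news.bbc.co.uk"),
    ("www.example.com", "bbc.co.uk"),
    ("a.example.com", "b.example.com"),
]
domain_edges = aggregate_edges(host_edges)
```

The three host-level links collapse into a single domain-level edge from example.com to bbc.co.uk.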

The domain-level graph has 90 million nodes and 1.69 billion edges. 48 million nodes (53%) are dangling nodes; the largest strongly connected component covers 37 million nodes (41%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/domain/ or, via HTTPS, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2018-19 domain-level webgraph

| Size | File | Description |
|---|---|---|
| 0.62 GB | cc-main-2018-19-nov-dec-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩ |
| 6.76 GB | cc-main-2018-19-nov-dec-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩ |
| 3.68 GB | cc-main-2018-19-nov-dec-jan-domain.graph | graph in BVGraph format |
| 2 kB | cc-main-2018-19-nov-dec-jan-domain.properties | |
| 3.82 GB | cc-main-2018-19-nov-dec-jan-domain-t.graph | transpose of the graph |
| 2 kB | cc-main-2018-19-nov-dec-jan-domain-t.properties | |
| 1 kB | cc-main-2018-19-nov-dec-jan-domain.stats | WebGraph statistics |
| 1.96 GB | cc-main-2018-19-nov-dec-jan-domain-ranks.txt.gz | harmonic centrality and pagerank |
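
For readers unfamiliar with harmonic centrality: the score of a node is commonly defined as the sum of the reciprocal shortest-path distances from all other nodes to that node, with unreachable nodes contributing zero. The sketch below computes it exactly on a toy graph by running a breadth-first search over inlinks (i.e. on the transposed graph). This brute-force approach is for illustration only and would not scale to graphs with millions of nodes; the function name and the toy edges are hypothetical.

```python
from collections import deque


def harmonic_centrality(num_nodes, edges):
    """Exact harmonic centrality for every node of a small directed graph."""
    inlinks = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        inlinks[dst].append(src)  # transposed adjacency: who links to dst

    scores = []
    for target in range(num_nodes):
        dist = {target: 0}
        queue = deque([target])
        score = 0.0
        while queue:
            node = queue.popleft()
            for pred in inlinks[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    score += 1.0 / dist[pred]
                    queue.append(pred)
        scores.append(score)
    return scores


# Toy graph: nodes 0 and 1 link to 2, and 2 links to 3.
toy_scores = harmonic_centrality(4, [(0, 2), (1, 2), (2, 3)])
```

Node 2 is reached from two nodes at distance 1 (score 2.0); node 3 is reached from one node at distance 1 and two nodes at distance 2 (score 1 + 0.5 + 0.5 = 2.0).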

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 90 million domain ranks is available for download.
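
A line of the ranks data can be parsed into typed fields as sketched below. The sketch assumes tab-separated columns in the same order as the table that follows (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name); the actual layout of the ranks file should be verified before use. The class and function names are hypothetical, and the three sample rows are taken from the top of the table.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_host: str


def parse_rank_line(line, sep="\t"):
    """Split one line into typed fields (column order is an assumption)."""
    hc_rank, hc_value, pr_rank, pr_value, rev_host = line.rstrip("\n").split(sep)
    return RankRow(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_host)


sample_lines = [
    "1\t27203288\t2\t0.012818\tcom.facebook",
    "2\t27081816\t1\t0.017236\tcom.googleapis",
    "3\t25533108\t3\t0.010690\tcom.google",
]
rows = [parse_rank_line(line) for line in sample_lines]
by_pagerank = sorted(rows, key=lambda row: row.pr_rank)
```

Sorting by `pr_rank` instead of `hc_rank` switches between the two rankings mentioned above.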

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2018-2019)

| Harmonic centrality rank | HC value | PageRank rank | PageRank value | Reversed hostname |
|---|---|---|---|---|
| 1 | 27203288 | 2 | 0.012818 | com.facebook |
| 2 | 27081816 | 1 | 0.017236 | com.googleapis |
| 3 | 25533108 | 3 | 0.010690 | com.google |
| 4 | 24267906 | 4 | 0.007625 | com.twitter |
| 5 | 24001384 | 5 | 0.006755 | com.youtube |
| 6 | 23187226 | 6 | 0.006532 | org.w |
| 7 | 21605786 | 8 | 0.003925 | com.instagram |
| 8 | 21386658 | 7 | 0.004753 | org.gmpg |
| 9 | 20954100 | 11 | 0.003053 | com.linkedin |
| 10 | 20252174 | 12 | 0.002871 | org.wordpress |
| 11 | 20166276 | 15 | 0.002217 | com.wordpress |
| 12 | 20071538 | 24 | 0.001532 | com.gravatar |
| 13 | 20054574 | 22 | 0.001673 | com.pinterest |
| 14 | 20035420 | 27 | 0.001366 | org.wikipedia |
| 15 | 19689680 | 21 | 0.001831 | com.apple |
| 16 | 19669598 | 13 | 0.002776 | com.bootstrapcdn |
| 17 | 19590352 | 36 | 0.000986 | com.blogspot |
| 18 | 19579602 | 28 | 0.001308 | com.vimeo |
| 19 | 19357866 | 41 | 0.000827 | be.youtu |
| 20 | 19345240 | 14 | 0.002221 | com.cloudflare |
| 21 | 19288382 | 37 | 0.000940 | gl.goo |
| 22 | 19267938 | 29 | 0.001236 | com.microsoft |
| 23 | 19181584 | 25 | 0.001444 | com.adobe |
| 24 | 19148714 | 42 | 0.000817 | com.amazon |
| 25 | 19143492 | 17 | 0.002000 | com.googletagmanager |
| 26 | 19087530 | 49 | 0.000656 | com.tumblr |
| 27 | 19040054 | 23 | 0.001572 | com.macromedia |
| 28 | 19024404 | 51 | 0.000647 | com.wp |
| 29 | 19009908 | 16 | 0.002181 | com.flickr |
| 30 | 18980982 | 71 | 0.000509 | ly.bit |
| 31 | 18864572 | 74 | 0.000479 | com.yahoo |
| 32 | 18847172 | 39 | 0.000879 | com.amazonaws |
| 33 | 18818456 | 38 | 0.000888 | com.paypal |
| 34 | 18798784 | 20 | 0.001840 | com.github |
| 35 | 18762366 | 65 | 0.000584 | org.mozilla |
| 36 | 18761330 | 26 | 0.001413 | com.gstatic |
| 37 | 18756286 | 64 | 0.000596 | me.wp |
| 38 | 18648308 | 97 | 0.000312 | com.googleusercontent |
| 39 | 18645512 | 40 | 0.000867 | net.cloudfront |
| 40 | 18636726 | 85 | 0.000364 | com.soundcloud |
| 41 | 18628624 | 109 | 0.000267 | com.nytimes |
| 42 | 18621690 | 81 | 0.000425 | com.weebly |
| 43 | 18599952 | 54 | 0.000633 | net.doubleclick |
| 44 | 18588240 | 44 | 0.000760 | org.w3 |
| 45 | 18584910 | 87 | 0.000346 | co.t |
| 46 | 18575634 | 101 | 0.000303 | com.reddit |
| 47 | 18568330 | 68 | 0.000521 | com.medium |
| 48 | 18560552 | 150 | 0.000157 | org.wikimedia |
| 49 | 18552104 | 111 | 0.000257 | com.dropbox |
| 50 | 18520572 | 83 | 0.000402 | org.creativecommons |
| 51 | 18509376 | 134 | 0.000192 | org.archive |
| 52 | 18508730 | 33 | 0.001034 | io.github |
| 53 | 18458414 | 77 | 0.000443 | com.bing |
| 54 | 18443042 | 146 | 0.000176 | net.slideshare |
| 55 | 18442342 | 124 | 0.000215 | com.imgur |
| 56 | 18429066 | 31 | 0.001176 | ru.yandex |
| 57 | 18423958 | 82 | 0.000417 | de.google |
| 58 | 18405610 | 30 | 0.001179 | net.fbcdn |
| 59 | 18398944 | 158 | 0.000150 | edu.stanford |
| 60 | 18379294 | 241 | 0.000099 | com.bbc |
| 61 | 18375750 | 215 | 0.000111 | com.tinyurl |
| 62 | 18368156 | 34 | 0.001028 | org.apache |
| 63 | 18362114 | 94 | 0.000316 | com.mailchimp |
| 64 | 18338296 | 183 | 0.000127 | com.wired |
| 65 | 18322334 | 136 | 0.000190 | com.blogger |
| 66 | 18280996 | 63 | 0.000599 | eu.europa |
| 67 | 18277804 | 130 | 0.000200 | com.issuu |
| 68 | 18271824 | 219 | 0.000109 | com.bloomberg |
| 69 | 18257422 | 182 | 0.000127 | com.myspace |
| 70 | 18254210 | 80 | 0.000425 | com.jquery |
| 71 | 18249798 | 78 | 0.000433 | com.baidu |
| 72 | 18230364 | 347 | 0.000069 | com.appspot |
| 73 | 18220768 | 137 | 0.000188 | com.eventbrite |
| 74 | 18214820 | 125 | 0.000212 | com.yelp |
| 75 | 18209194 | 138 | 0.000185 | com.spotify |
| 76 | 18208644 | 143 | 0.000180 | org.ietf |
| 77 | 18202076 | 189 | 0.000125 | com.oracle |
| 78 | 18200628 | 172 | 0.000139 | com.android |
| 79 | 18196004 | 248 | 0.000095 | org.npr |
| 80 | 18194938 | 331 | 0.000072 | com.theverge |
| 81 | 18188710 | 32 | 0.001108 | com.squarespace |
| 82 | 18180064 | 307 | 0.000077 | com.googleblog |
| 83 | 18168696 | 173 | 0.000139 | org.gnu |
| 84 | 18168282 | 115 | 0.000241 | com.youtube-nocookie |
| 85 | 18166400 | 352 | 0.000068 | com.quora |
| 86 | 18166372 | 84 | 0.000388 | com.statcounter |
| 87 | 18162270 | 355 | 0.000068 | com.deviantart |
| 88 | 18147690 | 314 | 0.000076 | com.buzzfeed |
| 89 | 18132148 | 281 | 0.000083 | org.python |
| 90 | 18130544 | 284 | 0.000082 | me.about |
| 91 | 18122852 | 426 | 0.000057 | com.slate |
| 92 | 18120874 | 443 | 0.000055 | org.ieee |
| 93 | 18109888 | 357 | 0.000068 | uk.co.independent |
| 94 | 18104220 | 117 | 0.000228 | com.whatsapp |
| 95 | 18094292 | 279 | 0.000085 | com.w3schools |
| 96 | 18092538 | 72 | 0.000495 | org.schema |
| 97 | 18087066 | 448 | 0.000054 | edu.upenn |
| 98 | 18080120 | 45 | 0.000737 | com.fontawesome |
| 99 | 18076578 | 476 | 0.000051 | edu.ucla |
| 100 | 18072366 | 424 | 0.000057 | edu.washington |
| 101 | 18072078 | 641 | 0.000045 | org.chromium |
| 102 | 18068624 | 468 | 0.000052 | uk.ac.ox |
| 103 | 18067404 | 386 | 0.000063 | com.newyorker |
| 104 | 18066728 | 186 | 0.000125 | net.behance |
| 105 | 18057700 | 282 | 0.000083 | com.example |
| 106 | 18054090 | 400 | 0.000061 | org.arxiv |
| 107 | 18052930 | 104 | 0.000285 | com.ytimg |
| 108 | 18049852 | 192 | 0.000123 | com.dribbble |
| 109 | 18029132 | 222 | 0.000109 | gov.ca |
| 110 | 18026716 | 140 | 0.000184 | com.forbes |
| 111 | 18025030 | 374 | 0.000065 | gov.loc |
| 112 | 18013454 | 228 | 0.000103 | com.fastcompany |
| 113 | 18008156 | 253 | 0.000092 | com.foursquare |
| 114 | 18007062 | 380 | 0.000064 | com.about |
| 115 | 18005498 | 179 | 0.000132 | com.cnn |
| 116 | 18005262 | 157 | 0.000150 | com.theguardian |
| 117 | 18005254 | 466 | 0.000052 | com.evernote |
| 118 | 18002708 | 379 | 0.000064 | com.git-scm |
| 119 | 18001892 | 337 | 0.000071 | au.com.google |
| 120 | 18001316 | 490 | 0.000050 | edu.princeton |
| 121 | 17997576 | 247 | 0.000096 | com.typeform |
| 122 | 17995020 | 469 | 0.000052 | com.withgoogle |
| 123 | 17991120 | 648 | 0.000044 | com.storify |
| 124 | 17986952 | 525 | 0.000047 | com.stackexchange |
| 125 | 17985482 | 652 | 0.000044 | google.blog |
| 126 | 17982244 | 9 | 0.003675 | com.godaddy |
| 127 | 17976782 | 229 | 0.000103 | com.nbcnews |
| 128 | 17974972 | 161 | 0.000148 | uk.co.bbc |
| 129 | 17973294 | 332 | 0.000072 | uk.co.blogspot |
| 130 | 17971188 | 396 | 0.000061 | com.tandfonline |
| 131 | 17957502 | 408 | 0.000060 | com.mysql |
| 132 | 17946028 | 632 | 0.000045 | ca.blogspot |
| 133 | 17943522 | 479 | 0.000051 | com.libsyn |
| 134 | 17940278 | 196 | 0.000120 | es.google |
| 135 | 17934926 | 491 | 0.000050 | com.tinypic |
| 136 | 17933522 | 482 | 0.000051 | com.ubuntu |
| 137 | 17932534 | 748 | 0.000039 | com.nike |
| 138 | 17931294 | 402 | 0.000061 | org.bitbucket |
| 139 | 17930976 | 276 | 0.000085 | org.doi |
| 140 | 17929634 | 336 | 0.000072 | com.getpocket |
| 141 | 17927596 | 676 | 0.000043 | com.jetbrains |
| 142 | 17909710 | 278 | 0.000085 | com.mozilla |
| 143 | 17909040 | 697 | 0.000041 | com.sap |
| 144 | 17900594 | 449 | 0.000054 | com.googlecode |
| 145 | 17899774 | 73 | 0.000484 | com.list-manage |
| 146 | 17895240 | 185 | 0.000126 | com.huffingtonpost |
| 147 | 17894146 | 635 | 0.000045 | tv.ustream |
| 148 | 17893688 | 86 | 0.000351 | com.paypalobjects |
| 149 | 17890446 | 459 | 0.000053 | com.trello |
| 150 | 17886818 | 269 | 0.000086 | edu.mit |
| 151 | 17882064 | 152 | 0.000154 | net.sourceforge |
| 152 | 17879878 | 245 | 0.000096 | com.githubusercontent |
| 153 | 17877554 | 498 | 0.000049 | com.chrome |
| 154 | 17864368 | 953 | 0.000033 | edu.gatech |
| 155 | 17859670 | 447 | 0.000054 | com.docker |
| 156 | 17858794 | 719 | 0.000040 | com.ssrn |
| 157 | 17858574 | 597 | 0.000045 | co.g |
| 158 | 17857678 | 90 | 0.000329 | com.wix |
| 159 | 17856626 | 205 | 0.000116 | com.washingtonpost |
| 160 | 17849554 | 1082 | 0.000029 | com.diigo |
| 161 | 17847314 | 360 | 0.000067 | gov.fda |
| 162 | 17845128 | 127 | 0.000205 | org.bbb |
| 163 | 17840318 | 961 | 0.000033 | com.flipboard |
| 164 | 17839092 | 939 | 0.000034 | it.scoop |
| 165 | 17838238 | 819 | 0.000037 | com.nvidia |
| 166 | 17836702 | 297 | 0.000080 | com.reuters |
| 167 | 17836356 | 296 | 0.000080 | com.mapquest |
| 168 | 17830996 | 570 | 0.000046 | com.pingdom |
| 169 | 17830362 | 277 | 0.000085 | com.go |
| 170 | 17827980 | 199 | 0.000119 | org.debian |
| 171 | 17822238 | 198 | 0.000119 | com.wsj |
| 172 | 17822206 | 898 | 0.000035 | com.fastcodesign |
| 173 | 17819482 | 35 | 0.001004 | com.fb |
| 174 | 17815978 | 814 | 0.000037 | site.business |
| 175 | 17814932 | 234 | 0.000101 | com.techcrunch |
| 176 | 17807884 | 250 | 0.000094 | com.usatoday |
| 177 | 17807344 | 148 | 0.000171 | gov.nih |
| 178 | 17807156 | 153 | 0.000154 | com.etsy |
| 179 | 17804386 | 543 | 0.000047 | org.eclipse |
| 180 | 17796560 | 1025 | 0.000031 | com.hbo |
| 181 | 17791748 | 46 | 0.000708 | net.akamaihd |
| 182 | 17790986 | 200 | 0.000118 | com.live |
| 183 | 17790210 | 977 | 0.000032 | ms.1drv |
| 184 | 17789540 | 975 | 0.000033 | nl.blogspot |
| 185 | 17782296 | 216 | 0.000111 | com.businessinsider |
| 186 | 17779806 | 415 | 0.000059 | com.unity3d |
| 187 | 17772776 | 471 | 0.000052 | com.cdbaby |
| 188 | 17769846 | 710 | 0.000041 | se.haxx |
| 189 | 17768930 | 163 | 0.000142 | org.iana |
| 190 | 17766540 | 98 | 0.000311 | com.shopify |
| 191 | 17765610 | 283 | 0.000082 | com.herokuapp |
| 192 | 17762614 | 280 | 0.000084 | edu.harvard |
| 193 | 17760570 | 293 | 0.000080 | net.windows |
| 194 | 17760214 | 747 | 0.000039 | org.unicode |
| 195 | 17758178 | 110 | 0.000264 | com.jimdo |
| 196 | 17758168 | 344 | 0.000070 | com.msn |
| 197 | 17757864 | 287 | 0.000081 | uk.co.telegraph |
| 198 | 17756896 | 209 | 0.000112 | com.typepad |
| 199 | 17755326 | 147 | 0.000174 | com.opera |
| 200 | 17752324 | 1084 | 0.000029 | com.creativebloq |
| 201 | 17750792 | 852 | 0.000036 | edu.rutgers |
| 202 | 17750190 | 662 | 0.000043 | gov.wa |
| 203 | 17750042 | 944 | 0.000034 | com.history |
| 204 | 17748756 | 362 | 0.000066 | gov.nasa |
| 205 | 17748704 | 844 | 0.000037 | edu.illinois |
| 206 | 17743998 | 509 | 0.000049 | au.gov.nsw |
| 207 | 17737554 | 631 | 0.000045 | gov.dot |
| 208 | 17730560 | 1024 | 0.000031 | edu.pitt |
| 209 | 17730390 | 191 | 0.000124 | com.imdb |
| 210 | 17727110 | 95 | 0.000315 | net.jsdelivr |
| 211 | 17726578 | 373 | 0.000065 | com.mashable |
| 212 | 17721654 | 67 | 0.000526 | com.vk |
| 213 | 17719938 | 47 | 0.000677 | net.facebook |
| 214 | 17719050 | 195 | 0.000121 | uk.co.amazon |
| 215 | 17717626 | 105 | 0.000279 | com.google-analytics |
| 216 | 17715986 | 313 | 0.000076 | com.cnet |
| 217 | 17712616 | 1192 | 0.000027 | org.wikibooks |
| 218 | 17711346 | 238 | 0.000100 | com.ibm |
| 219 | 17708674 | 906 | 0.000035 | ca.utoronto |
| 220 | 17706048 | 372 | 0.000065 | com.ted |
| 221 | 17703164 | 930 | 0.000034 | au.com.blogspot |
| 222 | 17696634 | 809 | 0.000038 | com.ecwid |
| 223 | 17692804 | 422 | 0.000058 | uk.co.pinterest |
| 224 | 17688850 | 954 | 0.000033 | com.theknot |
| 225 | 17683380 | 971 | 0.000033 | edu.osu |
| 226 | 17676892 | 368 | 0.000066 | com.latimes |
| 227 | 17675948 | 231 | 0.000103 | net.php |
| 228 | 17674866 | 1023 | 0.000031 | com.dw |
| 229 | 17673722 | 972 | 0.000033 | org.hrw |
| 230 | 17669176 | 181 | 0.000128 | com.stackoverflow |
| 231 | 17666396 | 1063 | 0.000030 | io.itch |
| 232 | 17663236 | 262 | 0.000090 | com.npmjs |
| 233 | 17654604 | 917 | 0.000035 | us.mn.state |
| 234 | 17654482 | 387 | 0.000063 | uk.co.dailymail |
| 235 | 17654042 | 306 | 0.000077 | com.time |
| 236 | 17653078 | 175 | 0.000137 | com.twimg |
| 237 | 17651364 | 214 | 0.000112 | com.surveymonkey |
| 238 | 17648462 | 493 | 0.000050 | net.researchgate |
| 239 | 17639638 | 1287 | 0.000024 | com.kinja |
| 240 | 17636060 | 913 | 0.000035 | gov.defense |
| 241 | 17634608 | 423 | 0.000058 | edu.cornell |
| 242 | 17632872 | 804 | 0.000038 | com.citrix |
| 243 | 17631522 | 18 | 0.001984 | com.wixstatic |
| 244 | 17627972 | 1349 | 0.000023 | com.instapaper |
| 245 | 17627386 | 456 | 0.000053 | io.readthedocs |
| 246 | 17622738 | 903 | 0.000035 | com.vogue |
| 247 | 17622484 | 339 | 0.000071 | me.telegram |
| 248 | 17622274 | 738 | 0.000040 | org.postgresql |
| 249 | 17619732 | 1211 | 0.000026 | com.dezeen |
| 250 | 17619540 | 842 | 0.000037 | com.citysearch |
| 251 | 17617810 | 440 | 0.000056 | com.ft |
| 252 | 17616210 | 688 | 0.000042 | org.kernel |
| 253 | 17615932 | 969 | 0.000033 | com.yellowpages |
| 254 | 17615850 | 144 | 0.000179 | uk.co.google |
| 255 | 17615132 | 275 | 0.000085 | org.acm |
| 256 | 17611938 | 160 | 0.000148 | com.zendesk |
| 257 | 17608414 | 420 | 0.000058 | com.kickstarter |
| 258 | 17607082 | 1060 | 0.000030 | com.strava |
| 259 | 17606762 | 419 | 0.000058 | edu.berkeley |
| 260 | 17606252 | 1045 | 0.000030 | gov.mo |
| 261 | 17604126 | 333 | 0.000072 | com.cnbc |
| 262 | 17602550 | 52 | 0.000636 | com.qq |
| 263 | 17598790 | 670 | 0.000043 | com.adjust |
| 264 | 17596982 | 1026 | 0.000031 | gov.oregon |
| 265 | 17596684 | 299 | 0.000080 | com.meetup |
| 266 | 17594788 | 1016 | 0.000031 | org.tensorflow |
| 267 | 17594178 | 312 | 0.000077 | com.mapbox |
| 268 | 17592452 | 159 | 0.000150 | com.salesforce |
| 269 | 17586524 | 353 | 0.000068 | com.gmail |
| 270 | 17577594 | 1066 | 0.000030 | com.googlesource |
| 271 | 17574716 | 1176 | 0.000027 | edu.kit |
| 272 | 17574656 | 327 | 0.000073 | com.springer |
| 273 | 17574172 | 55 | 0.000629 | net.jsfiddle |
| 274 | 17571342 | 848 | 0.000037 | com.wikia |
| 275 | 17570210 | 1123 | 0.000028 | gov.ky |
| 276 | 17570030 | 685 | 0.000042 | com.matterport |
| 277 | 17569184 | 1055 | 0.000030 | com.hackernoon |
| 278 | 17569072 | 382 | 0.000064 | com.fortune |
| 279 | 17568196 | 397 | 0.000061 | com.photobucket |
| 280 | 17565868 | 376 | 0.000065 | com.giphy |
| 281 | 17561918 | 349 | 0.000069 | com.nypost |
| 282 | 17561528 | 664 | 0.000043 | com.angieslist |
| 283 | 17558588 | 1103 | 0.000029 | gov.wi |
| 284 | 17558450 | 908 | 0.000035 | com.xrea |
| 285 | 17558178 | 187 | 0.000125 | com.ebay |
| 286 | 17557786 | 870 | 0.000036 | com.pixabay |
| 287 | 17555784 | 1035 | 0.000031 | org.wnyc |
| 288 | 17554024 | 627 | 0.000045 | com.economist |
| 289 | 17553246 | 285 | 0.000082 | com.hubspot |
| 290 | 17552904 | 858 | 0.000036 | edu.columbia |
| 291 | 17552482 | 317 | 0.000076 | org.un |
| 292 | 17551250 | 394 | 0.000062 | org.hbr |
| 293 | 17547768 | 824 | 0.000037 | com.arstechnica |
| 294 | 17547252 | 521 | 0.000048 | com.livechatinc |
| 295 | 17544648 | 967 | 0.000033 | com.missingkids |
| 296 | 17542812 | 135 | 0.000191 | com.feedburner |
| 297 | 17542646 | 563 | 0.000046 | com.nationalgeographic |
| 298 | 17542210 | 839 | 0.000037 | edu.yale |
| 299 | 17541890 | 960 | 0.000033 | org.ohchr |
| 300 | 17539760 | 826 | 0.000037 | org.aarp |
| 301 | 17538968 | 550 | 0.000046 | com.scribd |
| 302 | 17536254 | 1037 | 0.000031 | gov.maryland |
| 303 | 17535552 | 987 | 0.000032 | gov.michigan |
| 304 | 17534878 | 1170 | 0.000027 | gov.mt |
| 305 | 17532728 | 354 | 0.000068 | com.oreilly |
| 306 | 17528914 | 116 | 0.000238 | com.addthis |
| 307 | 17524962 | 410 | 0.000060 | com.theatlantic |
| 308 | 17522894 | 1164 | 0.000027 | org.amnesty |
| 309 | 17522838 | 767 | 0.000039 | com.engadget |
| 310 | 17522742 | 1048 | 0.000030 | us.pa.state |
| 311 | 17522588 | 1446 | 0.000022 | com.jigsy |
| 312 | 17520734 | 1275 | 0.000025 | com.healthgrades |
| 313 | 17520216 | 679 | 0.000042 | com.intel |
| 314 | 17517294 | 404 | 0.000061 | gov.whitehouse |
| 315 | 17517106 | 1250 | 0.000025 | com.manta |
| 316 | 17515170 | 689 | 0.000042 | com.vice |
| 317 | 17515068 | 412 | 0.000059 | com.unsplash |
| 318 | 17507818 | 311 | 0.000077 | com.wiley |
| 319 | 17506494 | 128 | 0.000204 | com.wixsite |
| 320 | 17503220 | 637 | 0.000045 | com.wikihow |
| 321 | 17499836 | 1302 | 0.000024 | com.merchantcircle |
| 322 | 17496442 | 341 | 0.000070 | com.livejournal |
| 323 | 17494952 | 342 | 0.000070 | com.booking |
| 324 | 17494632 | 1395 | 0.000022 | io.soup |
| 325 | 17493230 | 370 | 0.000065 | com.skype |
| 326 | 17490618 | 518 | 0.000048 | com.samsung |
327174905166550.000044com.zdnet
328174877723980.000061com.entrepreneur
329174859983000.000080com.staticflickr
330174854683430.000070com.prnewswire
3311748425413060.000024ca.yelp
3321748425412160.000026com.contently
333174835542720.000085int.who
334174830448280.000037com.qz
335174771203590.000067com.office
336174765984720.000052com.cisco
3371747658014240.000022com.gimletmedia
3381747646015400.000020com.designobserver
339174750422940.000080com.hp
340174748062600.000090gov.cdc
341174714962360.000101com.disqus
3421747099413760.000023us.wi.state
343174677866400.000045com.cbsnews
344174674125170.000048com.statista
345174673261260.000208com.weibo
346174663707290.000040co.elastic
347174657805510.000046ca.pinterest
348174657388320.000037edu.psu
3491746225812120.000026org.tigris
3501746055212960.000024com.thoughtworks
351174541024070.000060com.inc
352174526944920.000050org.mediawiki
353174501423400.000071com.dailymotion
354174493223890.000063com.aol
355174484269760.000033com.gizmodo
3561744737612780.000025org.emojipedia
3571744575210810.000029net.leadpages
358174455405000.000049gov.nist
3591744288014590.000021com.zynga
360174427003610.000067org.ampproject
3611744235012180.000026us.nm.state
3621744229814530.000021com.activerain
3631744180210460.000030com.bandsintown
364174394724840.000051com.nature
365174393285200.000048com.venturebeat
366174389745720.000046com.box
367174389641780.000135com.constantcontact
368174386342130.000112to.amzn
369174339649700.000033com.thenextweb
3701743374213220.000024com.superpages
371174320185080.000049com.symantec
372174249085520.000046org.nodejs
373174245382420.000099org.drupal
374174236061800.000131com.tripadvisor
375174232906980.000041com.deloitte
3761742249810440.000030us.fl.state
377174222482510.000094com.digg
378174198969910.000032edu.utexas
379174194209590.000033com.googlegroups
3801741856410930.000029com.pexels
3811741849213290.000024ly.snip
382174181083220.000075fr.free
383174173223080.000077com.sciencedirect
384174133822030.000117com.bandcamp
385174132286330.000045com.moz
3861741270414080.000022com.whitepages
387174102187320.000040com.psychologytoday
3881740755414800.000021com.digitaltrends
3891740409215390.000020edu.scad
3901739982610560.000030org.weforum
391173977623300.000072com.sxsw
392173949762020.000117de.amazon
393173948024640.000052com.goodreads
394173937209370.000034org.eff
395173928367540.000039com.indiatimes
3961739110811470.000028com.thinkwithgoogle
3971738592014420.000022org.khanacademy
398173800969010.000035com.shutterstock
399173795468290.000037edu.umich
400173779746580.000043com.raywenderlich
401173760583750.000065com.businesswire
4021737591413520.000023edu.usc
403173758362700.000086ca.google
404173736782260.000104com.stumbleupon
4051737300213710.000023com.mysanantonio
406173685542040.000116com.automattic
407173680548910.000035au.net.abc
408173656248640.000036org.worldbank
4091736468613500.000023edu.unc
4101736437011130.000028org.example
4111736273813750.000023it.eventbrite
4121736169012430.000025com.merriam-webster
4131736002615500.000020edu.hmc
414173575609120.000035uk.co.guardian
415173567648710.000036com.netflix
416173541164460.000055com.slack
4171735206214380.000022me.websta
4181735074212610.000025com.kaggle
419173502705440.000047org.pbs
420173471425060.000049com.webs
4211734161213380.000023com.ning
4221734102413390.000023com.speakerdeck
4231733871215980.000020au.com.yelp
4241733725016020.000020org.themoth
4251733683212720.000025com.canva
4261733653013840.000023com.pcworld
4271733562211860.000027com.indiegogo
4281733461612470.000025edu.toronto
4291733310425370.000014com.instructables
4301733156615170.000021com.brandyourself
431173310387220.000040org.unesco
4321733047611710.000027com.pcmag
433173303449560.000033com.marketwatch
434173293909450.000033com.foxnews
435173257725260.000047tv.twitch
4361732177811960.000026org.mozillazine
4371732092015520.000020org.owasp
4381731970813740.000023com.googleapps
4391731964411460.000028co.leadpages
4401731950216040.000020com.yellowbook
4411731946013910.000022org.coursera
4421731907413160.000024edu.academia
443173181943180.000075com.tripod
444173180849960.000032edu.ucsd
445173169887630.000039com.gartner
446173165649150.000035com.sfgate
447173153187180.000040com.blackberry
4481731436011170.000028org.haskell
449173142583640.000066it.placehold
4501731194215490.000020edu.utep
4511731115612240.000026gov.nh
4521731023012630.000025edu.northwestern
4531730631011900.000027de.spiegel
454173039363210.000075com.getclicky
455173022923280.000073com.rawgit
456173017183710.000065edu.nyu
4571730043410870.000029org.maven
458172997543480.000069edu.cmu
4591729840810650.000030edu.wisc
460172949649570.000033com.dropboxusercontent
461172949462950.000080com.smugmug
4621729090813350.000024com.googledrive
463172900869230.000034gov.fcc
464172896045340.000047com.outlook
4651728879612770.000025edu.uchicago
466172876666340.000045com.windowsphone
4671728480413270.000024gov.la
4681728452416890.000019org.maximumfun
469172843603050.000078net.datatables
470172843205710.000046com.lifehacker
471172838465010.000049in.co.google
472172837885800.000046gov.noaa
4731728349216440.000020edu.uah
474172827708020.000038com.steampowered
4751727909214270.000022com.invisionapp
476172744667040.000041com.msdn
4771727438010880.000029org.vim
478172742421690.000141jp.co.yahoo
479172740324730.000052com.cargocollective
4801727398411420.000028com.ycombinator
481172724382350.000101gov.ftc
4821727119613640.000023org.iihs
483172709848960.000035gov.census
4841727053412450.000025com.upwork
4851727049422280.000017com.ehow
486172704101000.000305org.networkadvertising
487172676706960.000041com.webmd
4881726742014220.000022edu.purdue
489172673523350.000072com.stripe
4901726661624270.000015com.techradar
4911726654610520.000030org.sciencemag
4921726482611270.000028org.altervista
4931726478211510.000028io.material
4941726361214820.000021com.fifa
4951726282621610.000018com.crunchbase
4961726242814510.000021com.technologyreview
497172617128220.000037gov.senate
498172609989930.000032ly.ow
4991726009811690.000027com.playstation
5001725995412370.000026com.target
501172582689160.000035com.clicky
5021725783215380.000020uk.co.wired
503172567423910.000062com.force
504172560469260.000034com.java
5051725523415370.000020com.gettyimages
5061725487418090.000019us.countrystudies
5071725483219650.000018com.semrush
5081725155811850.000027org.gnupg
5091725059811220.000028com.politico
5101725045216620.000019com.womentechmakers
511172501309020.000035gov.uspto
512172479968510.000036org.whatbrowser
5131724755821550.000018com.vanityfair
514172451441420.000180ru.mail
515172434584350.000056com.snapchat
5161724252211290.000028com.istockphoto
517172420962170.000110com.bitly
518172419743840.000064com.adweek
5191724180816980.000019com.ikea
520172412782680.000087com.wufoo
521172382361620.000144com.eepurl
5221723309811090.000029org.archlinux
5231723291013340.000024fr.lemonde
5241723191013310.000024com.econsultancy
5251723140811380.000028com.udemy
526172312681080.000268jp.co.google
5271723083213890.000022com.today
5281722803217170.000019com.yellowbot
5291722748212270.000026com.intuit
530172273769730.000033org.iso
5311722679615670.000020com.aliexpress
5321722625814680.000021au.com.smh
5331722546811970.000026co.vine
534172252789580.000033com.hootsuite
5351722435414320.000022com.underconsideration
5361722303016330.000020uk.ac.hud
5371722281814290.000022com.com
538172212488740.000036com.nielsen
5391721994217550.000019com.communitywalk
5401721967028060.000013com.123rf
541172173621700.000141com.xing
542172161989410.000034com.livestream
543172153529500.000033com.timeanddate
544172145668920.000035de.blogspot
545172144986870.000042com.proofpoint
546172143423160.000076org.joomla
5471721401013030.000024org.pnas
548172139769490.000033com.americanexpress
5491721322611400.000028org.fao
550172117642460.000096com.wpengine
5511721106010110.000031uk.ac.cam
5521721092813440.000023com.snap
5531721064612700.000025us.ma.state
554172101584300.000056com.barnesandnoble
555172101104270.000057com.squareup
556172096087720.000039gov.justice
5571720747613410.000023com.billboard
5581720700410200.000031com.alibaba
5591720549611980.000026net.noscript
5601720452013970.000022org.letsencrypt
5611720367023860.000016ca.uwaterloo
5621720316417110.000019com.espn
5631720126210330.000031io.fabric
5641719936423220.000016ca.ubc
565171989129800.000032com.variety
5661719506611430.000028com.bostonglobe
5671719485614160.000022com.homestars
5681719460824010.000015com.tutsplus
5691719403421920.000018edu.msu
5701719384617060.000019com.bitballoon
571171927486680.000043com.feedly
5721719245412920.000024in.blogspot
5731719150410860.000029fr.blogspot
5741719146422720.000017com.fiverr
5751718999422260.000017edu.indiana
5761718912214790.000021uk.co.thesun
577171878428840.000036gov.nps
5781718781016710.000019com.mcafee
579171866788200.000037com.gofundme
5801718635828590.000012com.twitpic
5811718320611480.000028com.dell
5821718299628920.000012com.codecademy
5831718272614260.000022com.city-data
5841718268614600.000021io.bitbucket
585171815107080.000041com.photoshelter
5861718139433340.000010com.dreamstime
5871718132024930.000015com.newscientist
5881718045413630.000023com.nytco
589171793084630.000052us.icio
5901717855212020.000026com.yandex
591171782342060.000116com.histats
5921717607623420.000016uk.ac.ed
5931717550210400.000031gov.fbi
594171746348460.000037com.500px
595171745724310.000056cn.com.sina
596171739506300.000045com.mobirise
5971717342812170.000026org.jenkins-ci
5981717285224480.000015ca.ualberta
5991717092816180.000020com.googlelabs
6001717015621640.000018com.socialmediaexaminer
6011717003410590.000030com.wayfair
6021716944412380.000026uk.co.mirror
6031716940813000.000024us.oh.state
604171694068060.000038com.buffer
605171692981550.000152it.google
606171688665290.000047com.format
6071716820013460.000023org.threejs
608171676247740.000039com.uk
6091716745014480.000022org.spie
6101716566012940.000024kr.flic
6111716551611320.000028edu.umn
6121716521413040.000024com.iconarchive
613171639042330.000103com.myshopify
614171628864340.000056com.nasdaq
615171620527090.000041com.uservoice
6161716157623530.000016com.screencast
617171613529050.000035br.com.uol
618171608502980.000080nl.google
6191716006212190.000026com.scientificamerican
6201715944228000.000013ly.visual
621171588829640.000033com.prweb
6221715885611780.000027com.smashingmagazine
6231715861813660.000023com.nymag
624171571463810.000064com.dmca
6251715699613240.000024com.hollywoodreporter
6261715537212730.000025com.warnerbros
627171551548620.000036net.openid
628171545787650.000039gov.copyright
6291715390612490.000025com.prezi
6301715200224390.000015com.aljazeera
631171510741490.000170gov.privacyshield
6321715082410420.000031com.airbnb
633171503409240.000034ca.cbc
6341714982012530.000025com.gigaom
6351714841411010.000029com.searchengineland
6361714808613200.000024net.recode
6371714739422140.000017com.searchenginejournal
6381714541411650.000027com.reverbnation
6391714464010580.000030com.redhat
640171435042730.000085com.fc2
6411714238634380.000010com.hubpages
6421714229813780.000023com.freepik
6431714189213330.000024com.nyt
644171418686990.000041com.patreon
645171418586460.000044gov.hhs
6461714171213570.000023com.kissmetrics
6471714111214850.000021com.rollingstone
6481714039810390.000031org.apa
649171393982230.000108fr.google
6501713801611910.000027com.crashlytics
6511713709640710.000008com.answers
6521713705614500.000022com.autodesk
6531713687213480.000023com.theglobeandmail
6541713681612100.000026com.indeed
655171365882570.000091com.getbootstrap
6561713576633770.000010com.domaintools
6571713455615330.000020edu.dukeupress
6581713449024740.000015edu.bu
6591713389812970.000024org.scala-lang
660171335247060.000041com.alexa
6611713317810470.000030com.sciencedaily
6621713150013230.000024com.vox
6631713145810800.000029gov.usgs
664171304221070.000269com.googleadservices
6651713005614190.000022com.elpais
6661712993817770.000019edu.alamo
667171297064740.000052br.com.google
6681712818623460.000016edu.asu
669171281683900.000062com.newrelic
6701712782615050.000021com.nba
671171273507880.000038gov.state
6721712710227780.000013com.macrumors
6731712699223930.000016edu.ncsu
6741712631213850.000023edu.jhu
6751712530627590.000013com.starwars
6761712449212250.000026us.imageshack
677171243503200.000075com.netdna-ssl
6781712399616750.000019org.virginiadot
6791712250823340.000016ch.ethz
6801712202423010.000016com.msnbc
6811712193415710.000020com.nokia
682171216927050.000041com.mckinsey
6831712126811820.000027org.gentoo
684171202944290.000057gov.irs
6851711958220460.000018com.css-tricks
686171194304170.000059com.bigcartel
6871711821213540.000023com.thehill
6881711761617990.000019edu.virginia
68917117124530.000634com.messenger
6901711693017530.000019com.fixr
691171167229920.000032io.codepen
6921711584822030.000018com.zazzle
6931711553813990.000022com.gallup
694171153529070.000035com.adage
695171153205020.000049fr.amazon
696171148841940.000121com.youku
6971711479033230.000010com.rottentomatoes
6981711446412690.000025com.businessweek
6991711435012680.000025com.uber
7001711297813070.000024com.nydailynews
701171122943250.000073com.bizjournals
7021711221018490.000018com.smartguy
7031711120815530.000020com.hotfrog
7041711065428880.000012edu.brown
7051711009215300.000021uk.co.lrb
7061710968615270.000021edu.umd
7071710856224180.000015tv.periscope
7081710733212060.000026int.coe
709171065609820.000032org.oecd
7101710647410040.000032org.change
7111710471614930.000021com.searchenginewatch
7121710462213900.000022it.binged
7131710456014960.000021io.prototypr
714171044645410.000047gov.sec
715171038761390.000185de.bund
7161710367621650.000018com.posterous
717171036306720.000043com.emarketer
7181710315625000.000015au.com.news
7191710300218530.000018edu.ucdavis
7201710240623290.000016com.blogs
7211710152623120.000016com.nfl
7221710109824780.000015com.cbs
7231710094018220.000019com.hulu
7241709983013280.000024com.pwc
7251709941813580.000023ly.plot
7261709873411800.000027com.firebaseapp
727170985283260.000073me.fb
7281709826612360.000026org.cambridge
7291709762013590.000023fm.last
7301709753612560.000025uk.co.theregister
7311709734014300.000022com.kudzu
7321709721422980.000016org.aclu
7331709714815410.000020org.ushistory
734170969702860.000082com.naver
735170959308900.000035gov.sba
7361709592426090.000014com.wikidot
737170956604650.000052gov.epa
7381709557617080.000019com.akamai
7391709482215720.000020org.jstor
740170945142550.000092com.marriott
7411709437211490.000028org.redcross
742170931703500.000068net.themeforest
7431709194224910.000015com.lonelyplanet
7441709166424410.000015mp.j
7451708935815180.000021au.com.truelocal
7461708922223330.000016com.discovery
7471708906414880.000021com.domain
7481708808610060.000031com.cbslocal
7491708740227730.000013org.phys
7501708553412950.000024gov.nyc
7511708531211340.000028io.bower
7521708525422270.000017org.rubyonrails
753170852126770.000043uk.co.tripadvisor
7541708396822860.000017com.urbandictionary
7551708372230810.000011com.fivethirtyeight
7561708339014630.000021com.insiderpages
7571708248416940.000019org.twinery
758170813441900.000124jp.ne.hatena
7591708030017950.000019org.milaap
7601707914016780.000019es.iac
7611707881611680.000027com.accenture
7621707784216900.000019com.2findlocal
763170772848720.000036com.att
7641707717822400.000017de.zeit
765170771366530.000044gov.ny
766170755929140.000035com.chicagotribune
7671707542417450.000019com.planetware
768170753002370.000100jp.co.amazon
7691707505822740.000017edu.umass
7701707469410700.000029com.investopedia
7711707442416560.000020com.wsoctv
7721707417819020.000018org.postimg
7731707377221570.000018uk.ac.ucl
7741707241418570.000018com.linkcentre
7751707231415850.000020edu.vassar
7761707132023320.000016com.ibtimes
7771707123219080.000018com.chron
7781707119421510.000018edu.cuny
7791707073610430.000030gov.va
7801707065015680.000020com.zillow
7811707060630840.000011com.lynda
7821707017616670.000019com.phnompenhpost
7831706903410020.000032com.formstack
7841706819413450.000023re.cli
785170677688310.000037com.sagepub
7861706704428420.000013com.animoto
7871706698814610.000021ca.kijiji
7881706668212640.000025com.xkcd
7891706621215640.000020com.warriorplus
7901706603211200.000028com.business2community
7911706597414730.000021org.sigcomm
792170658046730.000043org.openstreetmap
7931706467416540.000020com.tiki-toki
7941706385615540.000020jp.ac.kobe-u
7951706349026740.000013com.kaspersky
7961706271017500.000019com.trendland
797170626424780.000051com.atlassian
798170619889830.000032com.zoho
7991706127816630.000019fr.estrepublicain
800170598144510.000053gov.usda
8011705816629860.000012com.9to5mac
8021705763014770.000021com.theoutline
803170572988110.000038gov.usa
8041705589031230.000011uk.bl
8051705559213720.000023com.strikingly
8061705527617560.000019edu.ufl
807170549702560.000091com.elegantthemes
8081705480823210.000016com.apnews
809170546844540.000053com.pinimg
8101705467415550.000020org.gwtproject
81117054664930.000317com.namecheap
812170545585300.000047com.gotowebinar
8131705426023450.000016org.gimp
814170542586470.000044gov.ed
815170541181760.000136org.icann
8161705376416760.000019ws.snack
8171705358815190.000021com.hotmail
8181705348625140.000015com.ifttt
8191705323414890.000021net.hockeyapp
8201705179034650.000010com.virustotal
821170515863690.000066org.opensource
8221705153415130.000021com.acninc
8231705057229500.000012org.moma
824170505426840.000042ca.amazon
8251704954213800.000023com.stitcher
826170489149940.000032org.plos
8271704846227910.000013edu.unl
8281704840613100.000024com.over-blog
8291704800017460.000019com.mercurynews
8301704745427620.000013com.topsy
8311704693217900.000019com.khamsat
8321704659643890.000007com.lmgtfy
8331704615628530.000012com.sophos
8341704527417200.000019com.ignimgs
8351704499613920.000022us.zoom
836170443502740.000085com.maxcdn
8371704346226760.000013edu.gmu
8381704326610080.000031com.oup
839170432509470.000033com.accuweather
8401704247018550.000018net.wrightflyer
8411704223824870.000015edu.utah
8421704217811280.000028com.mixcloud
8431704194410990.000029org.doxygen
8441704193824670.000015com.producthunt
8451704116823150.000016com.thestar
8461704080622300.000017edu.arizona
8471704025414910.000021com.sky
8481703927224730.000015org.openoffice
849170388066910.000042com.163
8501703780017020.000019com.howstuffworks
8511703694615510.000020com.company
8521703689422010.000018com.pastebin
8531703649822690.000017ru.narod
8541703643013980.000022io.pantheon
8551703635816350.000020com.discordapp
8561703537032750.000010org.greenpeace
8571703461822310.000017com.deadline
8581703444614720.000021com.local
8591703408828730.000012com.campaignmonitor
860170335921930.000121jp.ameblo
8611703233628890.000012org.bitcoin
8621703199413510.000023com.socialmediatoday
8631703117415890.000020it.blogspot
8641703097612930.000024edu.si
8651703096871750.000005org.audacityteam
866170307208410.000037com.yp
8671703036822420.000017com.livestrong
8681703033424500.000015com.bestbuy
8691702945813130.000024com.globo
870170293661660.000142me.line
8711702854618520.000018tv.royanews
8721702790225350.000014com.mentalfloss
8731702709012980.000024com.gumroad
8741702695018630.000018com.boston
8751702688826170.000014com.getresponse
8761702484414350.000022com.cafepress
8771702472812080.000026com.forrester
878170220927030.000041com.usnews
879170216489990.000032com.walmart
8801702069414490.000022org.wiktionary
881170206724370.000056com.criteo
8821702027016310.000020au.com.whitepages
8831701631014440.000022ca.calgaryseocompany
884170162724210.000058com.adroll
8851701575610190.000031de.heise
8861701487814410.000022com.technorati
8871701463218080.000019de.welt
8881701459215650.000020com.bizcommunity
8891701408414010.000022mil.army
8901701294828250.000013com.fox
8911701222217290.000019com.contentmarketinginstitute
8921701171625610.000014com.yolasite
893170116185120.000048com.udacity
8941701106221700.000018com.podbean
8951701102214810.000021de.bundesverfassungsgericht
896170108362210.000109me.t
897170105981120.000255info.aboutads
8981701053830710.000011com.googlepages
8991700990217380.000019com.pushwoosh
900170093707010.000041com.gitlab
9011700910412330.000026org.sonatype
9021700873634930.000010org.notepad-plus-plus
9031700834033010.000010edu.uic
9041700824616680.000019com.waze
905170071548080.000038es.com.blogspot
9061700706414580.000021com.tiddlywiki
9071700690011000.000029com.digiday
9081700664816580.000020com.lulu
909170060868070.000038uk.co.eventbrite
9101700535229100.000012com.ndtv
9111700512616830.000019com.ssllabs
9121700466615830.000020com.sproutsocial
9131700422418300.000019me.pxlme
9141700414215280.000021com.neilpatel
9151700374214070.000022int.wipo
9161700361215020.000021org.filezilla-project
917170024724520.000053com.custhelp
9181700178622380.000017org.raspberrypi
9191700087816850.000019com.quandl
9201700060628830.000012edu.tufts
9211700011223660.000016com.salon
9221699915432790.000010org.metmuseum
9231699866034800.000010com.spreaker
9241699854625430.000014com.fineartamerica
9251699643216520.000020net.brownbook
9261699625812890.000024com.bmj
9271699481225420.000014uk.co.express
9281699454832680.000010in.lnkd
9291699349811890.000027com.techtarget
9301699183630270.000012edu.hawaii
9311699176012540.000025org.pewresearch
9321699169228760.000012com.fitbit
9331699165833920.000010org.edx
9341699112621540.000018uk.co.huffingtonpost
9351699065610300.000031com.fotolia
9361699025613110.000024com.optimizely
937169902127270.000040com.geocities
9381698944014100.000022com.mariadb
9391698938810680.000030com.infusionsoft
9401698849832100.000011com.popsci
941169879128270.000037gov.house
9421698779034670.000010cc.tiny
9431698662817660.000019com.spoke
9441698646626660.000014nl.uva
9451698592617270.000019org.unfe
9461698588010280.000031es.amazon
9471698571816470.000020uk.gov.westsussex
9481698555816810.000019com.chamberofcommerce
9491698538635840.000009gd.is
9501698528213080.000024net.java
951169852386540.000044com.houzz
9521698515610900.000029gov.archives
9531698416233130.000010com.avast
9541698394822160.000017com.examiner
9551698380216450.000020com.thefabricator
956169837905050.000049com.redbubble
9571698329618240.000019com.computerworld
9581698220434540.000010com.klout
959169810869340.000034com.delicious
9601697894233290.000010org.kiva
961169783764530.000053com.teamviewer
9621697828020480.000018com.cio
9631697783821710.000018com.thedailybeast
964169775984110.000059mp.mailchi
9651697737224350.000015br.com.blogspot
9661697661611660.000027com.netdna-cdn
967169764968890.000036com.arcgis
9681697605828440.000013com.createspace
9691697602841820.000008net.deviantart
9701697558017610.000019com.yelloyello
9711697551615760.000020gov.cabq
972169754904800.000051com.iconfinder
9731697507414310.000022au.com.yellowpages
9741697372412570.000025io.getmdl
9751697278012280.000026com.thedrum
9761697220416960.000019com.us
9771697175624220.000015org.linuxfoundation
9781696911253780.000006com.depositphotos
9791696900823280.000016com.ign
9801696790015630.000020org.gmplib
9811696788828320.000013edu.caltech
9821696725228950.000012com.infoq
9831696676422020.000018edu.uci
9841696618613320.000024com.xbox
9851696606614250.000022com.techrepublic
9861696602812620.000025com.glassdoor
9871696550615990.000020com.apachelounge
988169654548950.000035org.unicef
9891696511629900.000012com.discogs
9901696450029130.000012es.abc
991169631667430.000039com.biomedcentral
9921696267428630.000012nl.xs4all
9931696250214230.000022org.heart
9941696183226100.000014org.olympic
995169607362520.000093com.ssl-images-amazon
9961695997226800.000013de.bild
9971695979026720.000013com.nbc
9981695929816910.000019com.realtytimes
9991695922814560.000021com.mediafire
10001695908015690.000020com.galvanize
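
The last column of the table above appears to list domain names in reversed label order (for example com.imdb for imdb.com). As a minimal sketch, assuming the reversal is nothing more than the dot-separated labels in opposite order, the usual notation can be restored as follows. The function name is illustrative and not part of any Common Crawl tooling.

```python
def unreverse_domain(reversed_name: str) -> str:
    """Turn a reversed name such as 'uk.co.amazon' into 'amazon.co.uk'."""
    # Split into labels, flip their order, and join them again.
    return ".".join(reversed(reversed_name.split(".")))


if __name__ == "__main__":
    # Sample names taken from the table rows above.
    for name in ("edu.pitt", "uk.co.amazon", "jp.ne.hatena"):
        print(name, "->", unreverse_domain(name))
```

Applying the same function twice returns the original string, so it can also be used to produce the reversed form from an ordinary domain name.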

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

January 2019 crawl archive now available

The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th.

The January crawl contains page captures of 850 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Aug/Sep/Oct 2018 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 50 million hosts and domains
  • a random sample of outlinks taken from WAT files of the December crawl

The number of sampled URLs per domain depends on the domain’s harmonic centrality rank in the webgraph data set: higher-ranking domains are allowed to “contribute” more URLs.
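
For readers unfamiliar with the measure, harmonic centrality of a node is commonly defined as the sum of the reciprocal shortest-path distances from every other node to it. The sketch below computes it on a tiny made-up link graph. It only illustrates the definition and is not how the published ranks were calculated; the graph, host names, and function name are all hypothetical.

```python
from collections import deque


def harmonic_centrality(links, target):
    """Sum of 1/d(y, target) over all nodes y that can reach target.

    `links` maps each node to the nodes it links to.
    """
    # Reverse the edges so that one BFS from `target` yields the
    # distance from every other node *to* `target`.
    incoming = {}
    for source, destinations in links.items():
        for destination in destinations:
            incoming.setdefault(destination, set()).add(source)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for predecessor in incoming.get(node, ()):
            if predecessor not in distance:
                distance[predecessor] = distance[node] + 1
                queue.append(predecessor)

    # Unreachable nodes contribute nothing; the target itself is skipped.
    return sum(1.0 / d for d in distance.values() if d > 0)


# Hypothetical toy graph: two hosts link to the hub directly, one via a.example.
toy_links = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["a.example"],
    "hub.example": [],
}
print(harmonic_centrality(toy_links, "hub.example"))  # 1 + 1 + 1/2 = 2.5
```

A node that many others can reach in few hops scores high, which is why the measure is a natural basis for deciding how many URLs a domain may contribute.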

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-04/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2019-04/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2019-04/warc.paths.gz | 64000 | 58.86 |
| WAT files | CC-MAIN-2019-04/wat.paths.gz | 64000 | 18.88 |
| WET files | CC-MAIN-2019-04/wet.paths.gz | 64000 | 7.98 |
| Robots.txt files | CC-MAIN-2019-04/robotstxt.paths.gz | 64000 | 0.18 |
| Non-200 responses files | CC-MAIN-2019-04/non200responses.paths.gz | 64000 | 1.65 |
| URL index files | CC-MAIN-2019-04/cc-index.paths.gz | 302 | 0.21 |
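
The prefixing step described above can be sketched in Python. The snippet assumes the content of a downloaded *.paths.gz listing is already available as bytes; the function name and the sample path in the usage note are illustrative only.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes):
    """Yield (s3_url, http_url) for every non-empty line of a *.paths.gz listing."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:
                # The same relative path works with both prefixes.
                yield S3_PREFIX + path, HTTP_PREFIX + path
```

Given a listing with the (made-up) line crawl-data/CC-MAIN-2019-04/segments/example/warc/sample.warc.gz, the generator yields that path once with the s3:// prefix and once with the https:// prefix. The listing itself would presumably be fetched the same way, by prefixing crawl-data/CC-MAIN-2019-04/warc.paths.gz.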

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-04/. The columnar index has also been updated to include this crawl.
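
As a hedged sketch of how the URL index might be used from Python: the helper below builds a lookup URL and turns one result line into a byte range for fetching a single record. The "-index" endpoint name, the query parameters, and the JSON field names (filename, offset, length) are assumptions about the index server's CDX-style interface and are not stated in this post; check the index server's own documentation before relying on them.

```python
import json
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"


def build_query_url(crawl_id, url_pattern):
    """Build a lookup URL for one crawl (endpoint naming is an assumption)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"


def range_header(index_line):
    """Turn one JSON result line into (WARC path, HTTP Range header value)."""
    record = json.loads(index_line)
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range bounds are inclusive
    return record["filename"], f"bytes={start}-{end}"
```

The returned path can be combined with the HTTP prefix given above, and the Range value sent as a request header so that only the one compressed record is downloaded instead of the whole WARC file.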

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact us for sponsorship information.

December 2018 crawl archive now available

The crawl archive for December 2018 is now available! It contains 3.1 billion web pages or 250 TiB of uncompressed content, crawled between December 9th and 19th.

The December crawl contains page captures of 735 million URLs not contained in any crawl archive before. New URLs stem from:

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Aug/Sep/Oct 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 50 million domains of the webgraph dataset
  • a random sample of outlinks taken from WAT files of the November crawl
  • 30 million external links sampled from Wikipedia data dumps
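
The hop-limited breadth-first side crawl mentioned in the list above can be illustrated with a small sketch. It walks an in-memory link graph rather than fetching pages, so it shows only the traversal idea and not the actual crawler; the function name and the toy graph in the usage note are hypothetical.

```python
from collections import deque


def pages_within_hops(links, homepage, max_hops=6):
    """Breadth-first traversal returning {url: hops} for pages within max_hops."""
    hops = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:
                # First visit in BFS order gives the minimum hop count.
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops
```

For a chain of pages p0 → p1 → … → p8 and the default limit of 6, the result contains p0 through p6; p7 and p8 are never visited because they lie more than 6 hops from the homepage.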

Archive Location and Download

The December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-51/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2018-51/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2018-51/warc.paths.gz | 63840 | 65.31 |
| WAT files | CC-MAIN-2018-51/wat.paths.gz | 63840 | 20.01 |
| WET files | CC-MAIN-2018-51/wet.paths.gz | 63840 | 8.43 |
| Robots.txt files | CC-MAIN-2018-51/robotstxt.paths.gz | 63840 | 0.22 |
| Non-200 responses files | CC-MAIN-2018-51/non200responses.paths.gz | 63840 | 1.71 |
| URL index files | CC-MAIN-2018-51/cc-index.paths.gz | 302 | 0.24 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-51/. The columnar index has also been updated to include this crawl.

November 2018 crawl archive now available

The crawl archive for November 2018 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between November 12th and 22nd.

The November crawl contains 640 million new URLs, not contained in any crawl archive before. New URLs stem from:

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Aug/Sep/Oct 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 10 links (“hops”) away from the home pages of the top 40 million domains of the webgraph dataset
  • a random sample of outlinks taken from WAT files of the October crawl
  • 50 million external links sampled from Wikipedia data dumps

Archive Location and Download

The November crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-47/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2018-47/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2018-47/warc.paths.gz | 56000 | 54.16 |
| WAT files | CC-MAIN-2018-47/wat.paths.gz | 56000 | 17.36 |
| WET files | CC-MAIN-2018-47/wet.paths.gz | 56000 | 7.42 |
| Robots.txt files | CC-MAIN-2018-47/robotstxt.paths.gz | 56000 | 0.2 |
| Non-200 responses files | CC-MAIN-2018-47/non200responses.paths.gz | 56000 | 1.92 |
| URL index files | CC-MAIN-2018-47/cc-index.paths.gz | 302 | 0.2 |
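
To get a feel for download sizes, the file counts and compressed totals in the tables can be turned into a rough average size per WARC file. The short calculation below uses the figures from the January 2019, December 2018, and November 2018 tables and assumes 1 TiB = 1024 GiB; the variable and function names are illustrative.

```python
GIB_PER_TIB = 1024

# (number of WARC files, total compressed size in TiB), copied from the tables
warc_totals = {
    "CC-MAIN-2019-04": (64000, 58.86),
    "CC-MAIN-2018-51": (63840, 65.31),
    "CC-MAIN-2018-47": (56000, 54.16),
}


def average_gib_per_file(file_count, total_tib):
    """Average compressed size of one file in GiB."""
    return total_tib * GIB_PER_TIB / file_count


for crawl, (count, tib) in warc_totals.items():
    print(f"{crawl}: {average_gib_per_file(count, tib):.2f} GiB per WARC file")
```

In all three crawls a single WARC file comes out at roughly 1 GiB compressed, which is a useful figure when planning how many files to fetch for an experiment.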

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-47/. The columnar index has also been updated to include this crawl.
