February 2019 crawl archive now available

The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th.

The February crawl contains page captures of 750 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Nov/Dec/Jan 2018/2019 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 5 links (“hops”) away from the homepages of the top 50 million hosts and domains
  • a random sample of outlinks taken from WAT files of the January crawl

The number of sampled URLs per domain depends on the domain’s harmonic centrality rank in the webgraph data set: higher-ranking domains are allowed to “contribute” more URLs.

The way our crawler handles politeness limits per host and/or pay-level domain has been improved:
First, limits are now configurable and are based on the harmonic centrality rank of a domain.
Second, we now also put a limit on the number of hosts/subdomains per domain. This limit is also based on the domain rank and ranges from 500,000 subdomains for top-ranking domains (think of blogspot.com) to fewer than 100 for low-ranking domains. While the number of hosts covered in the February crawl dropped to 50 million from 60 million in January, we see a positive impact on the total number of pages crawled for large domains. Technically, every host requires a DNS lookup and a robots.txt fetch even if only a single page is fetched from it, so the crawler performs better when resources are focused on a few hundred thousand subdomains rather than spread over millions of hosts. We also hope that a limit on the number of hosts per domain makes the crawler more robust against link spam. The set of sampled subdomains for large domains will vary from month to month to ensure good overall coverage when multiple monthly crawls are combined.

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-09/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.
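
The prefixing step can be sketched in a few lines of Python using only the standard library. The helper name `paths_to_urls` is hypothetical, and the sample listing line is a made-up placeholder rather than a real file in the archive.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def paths_to_urls(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and prepend `prefix` to every non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]


if __name__ == "__main__":
    # Stand-in for a downloaded listing; the path below is a placeholder.
    sample = gzip.compress(b"crawl-data/CC-MAIN-2019-09/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n")
    for url in paths_to_urls(sample, S3_PREFIX):
        print(url)
```

In practice the bytes would come from downloading one of the listing files named in the table below, for example warc.paths.gz.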

File type | File list | #Files | Total size compressed (TiB)
Segments | CC-MAIN-2019-09/segment.paths.gz | 100 |
WARC files | CC-MAIN-2019-09/warc.paths.gz | 64000 | 59.86
WAT files | CC-MAIN-2019-09/wat.paths.gz | 64000 | 18.23
WET files | CC-MAIN-2019-09/wet.paths.gz | 64000 | 7.62
Robots.txt files | CC-MAIN-2019-09/robotstxt.paths.gz | 64000 | 0.17
Non-200 responses files | CC-MAIN-2019-09/non200responses.paths.gz | 64000 | 1.79
URL index files | CC-MAIN-2019-09/cc-index.paths.gz | 302 | 0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-09/. The columnar index has also been updated to include this crawl.
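
As a rough sketch of how the URL index might be queried programmatically: the snippet below only builds a query URL and parses a line-delimited JSON response, without making a network call. The endpoint name `CC-MAIN-2019-09-index` and the `url`, `output` and `limit` parameters are assumptions based on the index server's usual CDX-style API, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint name for this crawl's index; check the index page for the exact URL.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2019-09-index"


def build_index_query(url_pattern, limit=None):
    """Build a query URL asking the index for captures matching `url_pattern`."""
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return INDEX_ENDPOINT + "?" + urlencode(params)


def parse_index_response(text):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(build_index_query("commoncrawl.org/*", limit=5))
```

The resulting URL can be fetched with any HTTP client; each returned record is expected to describe one page capture.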

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 – 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 407 million nodes and 4.2 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to by a link on a crawled page. There are 323 million dangling nodes (79%), and the largest strongly connected component contains 63 million nodes (15%).

You can download the graph and the ranks of all 407 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Nov/Dec/Jan 2018-19 host-level webgraph

Size | File | Description
2.90 GB | cc-main-2018-19-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 42 vertices files
18.84 GB | cc-main-2018-19-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 84 edges files
7.81 GB | cc-main-2018-19-nov-dec-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2018-19-nov-dec-jan-host.properties |
8.16 GB | cc-main-2018-19-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2018-19-nov-dec-jan-host-t.properties |
1 kB | cc-main-2018-19-nov-dec-jan-host.stats | WebGraph statistics
7.50 GB | cc-main-2018-19-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
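
A minimal sketch of this naming convention, with hypothetical helper names. It only mirrors the rule stated above and is not taken from the cc-webgraph code.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed name back into normal order (a stripped 'www.' cannot be recovered)."""
    return ".".join(reversed(rev_host.split(".")))


if __name__ == "__main__":
    print(reverse_host("www.subdomain.example.com"))
    print(unreverse_host("com.example.subdomain"))
```

One consequence of the reversed form is that sorting the names lexicographically groups all hosts of a domain next to each other.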

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
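
To illustrate the idea of this aggregation, here is a simplified sketch. It uses a tiny, hypothetical stand-in for the public suffix list (the real list is much longer and has wildcard and exception rules that are ignored here), works on plain host names rather than node ids, and drops links within the same domain, which is an assumption of this sketch rather than a documented property of the published graph.

```python
# Toy stand-in for the public suffix list (assumption: only these suffixes exist).
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host, suffixes=TOY_SUFFIXES):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):
        # Candidates run from longest to shortest, so the first hit is the longest suffix.
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host itself is a public suffix, so there is no PLD.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def aggregate_edges(host_edges, suffixes=TOY_SUFFIXES):
    """Collapse host-level edges into a set of distinct domain-level edges."""
    domain_edges = set()
    for src_host, dst_host in host_edges:
        src = pay_level_domain(src_host, suffixes)
        dst = pay_level_domain(dst_host, suffixes)
        if src and dst and src != dst:  # skip unknown suffixes and intra-domain links
            domain_edges.add((src, dst))
    return domain_edges


if __name__ == "__main__":
    edges = [
        ("www.example.com", "blog.example.co.uk"),
        ("news.example.com", "blog.example.co.uk"),
        ("a.example.com", "b.example.com"),
    ]
    print(aggregate_edges(edges))
```

Several host-level links between the same pair of domains collapse into a single domain-level edge, which is why the domain graph has far fewer edges than the host graph.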

The domain-level graph has 90 million nodes and 1.69 billion edges. 48 million nodes (53%) are dangling nodes, and the largest strongly connected component covers 37 million nodes (41%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-19-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2018-19 domain-level webgraph

Size | File | Description
0.62 GB | cc-main-2018-19-nov-dec-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.76 GB | cc-main-2018-19-nov-dec-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.68 GB | cc-main-2018-19-nov-dec-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-19-nov-dec-jan-domain.properties |
3.82 GB | cc-main-2018-19-nov-dec-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-19-nov-dec-jan-domain-t.properties |
1 kB | cc-main-2018-19-nov-dec-jan-domain.stats | WebGraph statistics
1.96 GB | cc-main-2018-19-nov-dec-jan-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 90 million domain ranks is available for download.
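
The downloaded ranks file can be processed with a short script. The sketch below assumes tab-separated columns in the order used by the table in this post (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name) and that header or comment lines start with `#`; verify both against the actual file. The names `Rank`, `read_ranks` and `top_by_pagerank` are hypothetical, and the three sample rows are the first three entries of the harmonic centrality table in this post.

```python
from typing import NamedTuple


class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str


# First three rows of the table in this post, written as tab-separated lines.
SAMPLE_ROWS = (
    "1\t27203288\t2\t0.012818\tcom.facebook\n"
    "2\t27081816\t1\t0.017236\tcom.googleapis\n"
    "3\t25533108\t3\t0.010690\tcom.google\n"
)


def read_ranks(lines):
    """Yield one Rank per data line, skipping blank lines and '#' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hc_rank, hc_value, pr_rank, pr_value, rev_name = line.rstrip("\n").split("\t")
        yield Rank(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_name)


def top_by_pagerank(ranks, n):
    """Return the n entries with the best (lowest) PageRank rank."""
    return sorted(ranks, key=lambda r: r.pr_rank)[:n]


if __name__ == "__main__":
    for row in top_by_pagerank(read_ranks(SAMPLE_ROWS.splitlines()), 2):
        print(row.pr_rank, row.rev_name)
```

For the real file, the lines could be read with `gzip.open(path, "rt")` and passed to `read_ranks` directly, so the 90 million rows are streamed instead of loaded at once.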

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2018-2019)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 27203288 | 2 | 0.012818 | com.facebook
2 | 27081816 | 1 | 0.017236 | com.googleapis
3 | 25533108 | 3 | 0.010690 | com.google
4 | 24267906 | 4 | 0.007625 | com.twitter
5 | 24001384 | 5 | 0.006755 | com.youtube
6 | 23187226 | 6 | 0.006532 | org.w
7 | 21605786 | 8 | 0.003925 | com.instagram
8 | 21386658 | 7 | 0.004753 | org.gmpg
9 | 20954100 | 11 | 0.003053 | com.linkedin
10 | 20252174 | 12 | 0.002871 | org.wordpress
11 | 20166276 | 15 | 0.002217 | com.wordpress
12 | 20071538 | 24 | 0.001532 | com.gravatar
13 | 20054574 | 22 | 0.001673 | com.pinterest
14 | 20035420 | 27 | 0.001366 | org.wikipedia
15 | 19689680 | 21 | 0.001831 | com.apple
16 | 19669598 | 13 | 0.002776 | com.bootstrapcdn
17 | 19590352 | 36 | 0.000986 | com.blogspot
18 | 19579602 | 28 | 0.001308 | com.vimeo
19 | 19357866 | 41 | 0.000827 | be.youtu
20 | 19345240 | 14 | 0.002221 | com.cloudflare
21 | 19288382 | 37 | 0.000940 | gl.goo
22 | 19267938 | 29 | 0.001236 | com.microsoft
23 | 19181584 | 25 | 0.001444 | com.adobe
24 | 19148714 | 42 | 0.000817 | com.amazon
25 | 19143492 | 17 | 0.002000 | com.googletagmanager
26 | 19087530 | 49 | 0.000656 | com.tumblr
27 | 19040054 | 23 | 0.001572 | com.macromedia
28 | 19024404 | 51 | 0.000647 | com.wp
29 | 19009908 | 16 | 0.002181 | com.flickr
30 | 18980982 | 71 | 0.000509 | ly.bit
31 | 18864572 | 74 | 0.000479 | com.yahoo
32 | 18847172 | 39 | 0.000879 | com.amazonaws
33 | 18818456 | 38 | 0.000888 | com.paypal
34 | 18798784 | 20 | 0.001840 | com.github
35 | 18762366 | 65 | 0.000584 | org.mozilla
36 | 18761330 | 26 | 0.001413 | com.gstatic
37 | 18756286 | 64 | 0.000596 | me.wp
38 | 18648308 | 97 | 0.000312 | com.googleusercontent
39 | 18645512 | 40 | 0.000867 | net.cloudfront
40 | 18636726 | 85 | 0.000364 | com.soundcloud
41 | 18628624 | 109 | 0.000267 | com.nytimes
42 | 18621690 | 81 | 0.000425 | com.weebly
43 | 18599952 | 54 | 0.000633 | net.doubleclick
44 | 18588240 | 44 | 0.000760 | org.w3
45 | 18584910 | 87 | 0.000346 | co.t
46 | 18575634 | 101 | 0.000303 | com.reddit
47 | 18568330 | 68 | 0.000521 | com.medium
48 | 18560552 | 150 | 0.000157 | org.wikimedia
49 | 18552104 | 111 | 0.000257 | com.dropbox
50 | 18520572 | 83 | 0.000402 | org.creativecommons
51 | 18509376 | 134 | 0.000192 | org.archive
52 | 18508730 | 33 | 0.001034 | io.github
53 | 18458414 | 77 | 0.000443 | com.bing
54 | 18443042 | 146 | 0.000176 | net.slideshare
55 | 18442342 | 124 | 0.000215 | com.imgur
56 | 18429066 | 31 | 0.001176 | ru.yandex
57 | 18423958 | 82 | 0.000417 | de.google
58 | 18405610 | 30 | 0.001179 | net.fbcdn
59 | 18398944 | 158 | 0.000150 | edu.stanford
60 | 18379294 | 241 | 0.000099 | com.bbc
61 | 18375750 | 215 | 0.000111 | com.tinyurl
62 | 18368156 | 34 | 0.001028 | org.apache
63 | 18362114 | 94 | 0.000316 | com.mailchimp
64 | 18338296 | 183 | 0.000127 | com.wired
65 | 18322334 | 136 | 0.000190 | com.blogger
66 | 18280996 | 63 | 0.000599 | eu.europa
67 | 18277804 | 130 | 0.000200 | com.issuu
68 | 18271824 | 219 | 0.000109 | com.bloomberg
69 | 18257422 | 182 | 0.000127 | com.myspace
70 | 18254210 | 80 | 0.000425 | com.jquery
71 | 18249798 | 78 | 0.000433 | com.baidu
72 | 18230364 | 347 | 0.000069 | com.appspot
73 | 18220768 | 137 | 0.000188 | com.eventbrite
74 | 18214820 | 125 | 0.000212 | com.yelp
75 | 18209194 | 138 | 0.000185 | com.spotify
76 | 18208644 | 143 | 0.000180 | org.ietf
77 | 18202076 | 189 | 0.000125 | com.oracle
78 | 18200628 | 172 | 0.000139 | com.android
79 | 18196004 | 248 | 0.000095 | org.npr
80 | 18194938 | 331 | 0.000072 | com.theverge
81 | 18188710 | 32 | 0.001108 | com.squarespace
82 | 18180064 | 307 | 0.000077 | com.googleblog
83 | 18168696 | 173 | 0.000139 | org.gnu
84 | 18168282 | 115 | 0.000241 | com.youtube-nocookie
85 | 18166400 | 352 | 0.000068 | com.quora
86 | 18166372 | 84 | 0.000388 | com.statcounter
87 | 18162270 | 355 | 0.000068 | com.deviantart
88 | 18147690 | 314 | 0.000076 | com.buzzfeed
89 | 18132148 | 281 | 0.000083 | org.python
90 | 18130544 | 284 | 0.000082 | me.about
91 | 18122852 | 426 | 0.000057 | com.slate
92 | 18120874 | 443 | 0.000055 | org.ieee
93 | 18109888 | 357 | 0.000068 | uk.co.independent
94 | 18104220 | 117 | 0.000228 | com.whatsapp
95 | 18094292 | 279 | 0.000085 | com.w3schools
96 | 18092538 | 72 | 0.000495 | org.schema
97 | 18087066 | 448 | 0.000054 | edu.upenn
98 | 18080120 | 45 | 0.000737 | com.fontawesome
99 | 18076578 | 476 | 0.000051 | edu.ucla
100 | 18072366 | 424 | 0.000057 | edu.washington
101 | 18072078 | 641 | 0.000045 | org.chromium
102 | 18068624 | 468 | 0.000052 | uk.ac.ox
103 | 18067404 | 386 | 0.000063 | com.newyorker
104 | 18066728 | 186 | 0.000125 | net.behance
105 | 18057700 | 282 | 0.000083 | com.example
106 | 18054090 | 400 | 0.000061 | org.arxiv
107 | 18052930 | 104 | 0.000285 | com.ytimg
108 | 18049852 | 192 | 0.000123 | com.dribbble
109 | 18029132 | 222 | 0.000109 | gov.ca
110 | 18026716 | 140 | 0.000184 | com.forbes
111 | 18025030 | 374 | 0.000065 | gov.loc
112 | 18013454 | 228 | 0.000103 | com.fastcompany
113 | 18008156 | 253 | 0.000092 | com.foursquare
114 | 18007062 | 380 | 0.000064 | com.about
115 | 18005498 | 179 | 0.000132 | com.cnn
116 | 18005262 | 157 | 0.000150 | com.theguardian
117 | 18005254 | 466 | 0.000052 | com.evernote
118 | 18002708 | 379 | 0.000064 | com.git-scm
119 | 18001892 | 337 | 0.000071 | au.com.google
120 | 18001316 | 490 | 0.000050 | edu.princeton
121 | 17997576 | 247 | 0.000096 | com.typeform
122 | 17995020 | 469 | 0.000052 | com.withgoogle
123 | 17991120 | 648 | 0.000044 | com.storify
124 | 17986952 | 525 | 0.000047 | com.stackexchange
125 | 17985482 | 652 | 0.000044 | google.blog
126 | 17982244 | 9 | 0.003675 | com.godaddy
127 | 17976782 | 229 | 0.000103 | com.nbcnews
128 | 17974972 | 161 | 0.000148 | uk.co.bbc
129 | 17973294 | 332 | 0.000072 | uk.co.blogspot
130 | 17971188 | 396 | 0.000061 | com.tandfonline
131 | 17957502 | 408 | 0.000060 | com.mysql
132 | 17946028 | 632 | 0.000045 | ca.blogspot
133 | 17943522 | 479 | 0.000051 | com.libsyn
134 | 17940278 | 196 | 0.000120 | es.google
135 | 17934926 | 491 | 0.000050 | com.tinypic
136 | 17933522 | 482 | 0.000051 | com.ubuntu
137 | 17932534 | 748 | 0.000039 | com.nike
138 | 17931294 | 402 | 0.000061 | org.bitbucket
139 | 17930976 | 276 | 0.000085 | org.doi
140 | 17929634 | 336 | 0.000072 | com.getpocket
141 | 17927596 | 676 | 0.000043 | com.jetbrains
142 | 17909710 | 278 | 0.000085 | com.mozilla
143 | 17909040 | 697 | 0.000041 | com.sap
144 | 17900594 | 449 | 0.000054 | com.googlecode
145 | 17899774 | 73 | 0.000484 | com.list-manage
146 | 17895240 | 185 | 0.000126 | com.huffingtonpost
147 | 17894146 | 635 | 0.000045 | tv.ustream
148 | 17893688 | 86 | 0.000351 | com.paypalobjects
149 | 17890446 | 459 | 0.000053 | com.trello
150 | 17886818 | 269 | 0.000086 | edu.mit
151 | 17882064 | 152 | 0.000154 | net.sourceforge
152 | 17879878 | 245 | 0.000096 | com.githubusercontent
153 | 17877554 | 498 | 0.000049 | com.chrome
154 | 17864368 | 953 | 0.000033 | edu.gatech
155 | 17859670 | 447 | 0.000054 | com.docker
156 | 17858794 | 719 | 0.000040 | com.ssrn
157 | 17858574 | 597 | 0.000045 | co.g
158 | 17857678 | 90 | 0.000329 | com.wix
159 | 17856626 | 205 | 0.000116 | com.washingtonpost
160 | 17849554 | 1082 | 0.000029 | com.diigo
161 | 17847314 | 360 | 0.000067 | gov.fda
162 | 17845128 | 127 | 0.000205 | org.bbb
163 | 17840318 | 961 | 0.000033 | com.flipboard
164 | 17839092 | 939 | 0.000034 | it.scoop
165 | 17838238 | 819 | 0.000037 | com.nvidia
166 | 17836702 | 297 | 0.000080 | com.reuters
167 | 17836356 | 296 | 0.000080 | com.mapquest
168 | 17830996 | 570 | 0.000046 | com.pingdom
169 | 17830362 | 277 | 0.000085 | com.go
170 | 17827980 | 199 | 0.000119 | org.debian
171 | 17822238 | 198 | 0.000119 | com.wsj
172 | 17822206 | 898 | 0.000035 | com.fastcodesign
173 | 17819482 | 35 | 0.001004 | com.fb
174 | 17815978 | 814 | 0.000037 | site.business
175 | 17814932 | 234 | 0.000101 | com.techcrunch
176 | 17807884 | 250 | 0.000094 | com.usatoday
177 | 17807344 | 148 | 0.000171 | gov.nih
178 | 17807156 | 153 | 0.000154 | com.etsy
179 | 17804386 | 543 | 0.000047 | org.eclipse
180 | 17796560 | 1025 | 0.000031 | com.hbo
181 | 17791748 | 46 | 0.000708 | net.akamaihd
182 | 17790986 | 200 | 0.000118 | com.live
183 | 17790210 | 977 | 0.000032 | ms.1drv
184 | 17789540 | 975 | 0.000033 | nl.blogspot
185 | 17782296 | 216 | 0.000111 | com.businessinsider
186 | 17779806 | 415 | 0.000059 | com.unity3d
187 | 17772776 | 471 | 0.000052 | com.cdbaby
188 | 17769846 | 710 | 0.000041 | se.haxx
189 | 17768930 | 163 | 0.000142 | org.iana
190 | 17766540 | 98 | 0.000311 | com.shopify
191 | 17765610 | 283 | 0.000082 | com.herokuapp
192 | 17762614 | 280 | 0.000084 | edu.harvard
193 | 17760570 | 293 | 0.000080 | net.windows
194 | 17760214 | 747 | 0.000039 | org.unicode
195 | 17758178 | 110 | 0.000264 | com.jimdo
196 | 17758168 | 344 | 0.000070 | com.msn
197 | 17757864 | 287 | 0.000081 | uk.co.telegraph
198 | 17756896 | 209 | 0.000112 | com.typepad
199 | 17755326 | 147 | 0.000174 | com.opera
200 | 17752324 | 1084 | 0.000029 | com.creativebloq
201 | 17750792 | 852 | 0.000036 | edu.rutgers
202 | 17750190 | 662 | 0.000043 | gov.wa
203 | 17750042 | 944 | 0.000034 | com.history
204 | 17748756 | 362 | 0.000066 | gov.nasa
205 | 17748704 | 844 | 0.000037 | edu.illinois
206 | 17743998 | 509 | 0.000049 | au.gov.nsw
207 | 17737554 | 631 | 0.000045 | gov.dot
208 | 17730560 | 1024 | 0.000031 | edu.pitt
209 | 17730390 | 191 | 0.000124 | com.imdb
210 | 17727110 | 95 | 0.000315 | net.jsdelivr
211 | 17726578 | 373 | 0.000065 | com.mashable
212 | 17721654 | 67 | 0.000526 | com.vk
213 | 17719938 | 47 | 0.000677 | net.facebook
214 | 17719050 | 195 | 0.000121 | uk.co.amazon
215 | 17717626 | 105 | 0.000279 | com.google-analytics
216 | 17715986 | 313 | 0.000076 | com.cnet
217 | 17712616 | 1192 | 0.000027 | org.wikibooks
218 | 17711346 | 238 | 0.000100 | com.ibm
219 | 17708674 | 906 | 0.000035 | ca.utoronto
220 | 17706048 | 372 | 0.000065 | com.ted
221 | 17703164 | 930 | 0.000034 | au.com.blogspot
222 | 17696634 | 809 | 0.000038 | com.ecwid
223 | 17692804 | 422 | 0.000058 | uk.co.pinterest
224 | 17688850 | 954 | 0.000033 | com.theknot
225 | 17683380 | 971 | 0.000033 | edu.osu
226 | 17676892 | 368 | 0.000066 | com.latimes
227 | 17675948 | 231 | 0.000103 | net.php
228 | 17674866 | 1023 | 0.000031 | com.dw
229 | 17673722 | 972 | 0.000033 | org.hrw
230 | 17669176 | 181 | 0.000128 | com.stackoverflow
231 | 17666396 | 1063 | 0.000030 | io.itch
232 | 17663236 | 262 | 0.000090 | com.npmjs
233 | 17654604 | 917 | 0.000035 | us.mn.state
234 | 17654482 | 387 | 0.000063 | uk.co.dailymail
235 | 17654042 | 306 | 0.000077 | com.time
236 | 17653078 | 175 | 0.000137 | com.twimg
237 | 17651364 | 214 | 0.000112 | com.surveymonkey
238 | 17648462 | 493 | 0.000050 | net.researchgate
239 | 17639638 | 1287 | 0.000024 | com.kinja
240 | 17636060 | 913 | 0.000035 | gov.defense
241 | 17634608 | 423 | 0.000058 | edu.cornell
242 | 17632872 | 804 | 0.000038 | com.citrix
243 | 17631522 | 18 | 0.001984 | com.wixstatic
244 | 17627972 | 1349 | 0.000023 | com.instapaper
245 | 17627386 | 456 | 0.000053 | io.readthedocs
246 | 17622738 | 903 | 0.000035 | com.vogue
247 | 17622484 | 339 | 0.000071 | me.telegram
248 | 17622274 | 738 | 0.000040 | org.postgresql
249 | 17619732 | 1211 | 0.000026 | com.dezeen
250 | 17619540 | 842 | 0.000037 | com.citysearch
251 | 17617810 | 440 | 0.000056 | com.ft
252 | 17616210 | 688 | 0.000042 | org.kernel
253 | 17615932 | 969 | 0.000033 | com.yellowpages
254 | 17615850 | 144 | 0.000179 | uk.co.google
255 | 17615132 | 275 | 0.000085 | org.acm
256 | 17611938 | 160 | 0.000148 | com.zendesk
257 | 17608414 | 420 | 0.000058 | com.kickstarter
258 | 17607082 | 1060 | 0.000030 | com.strava
259 | 17606762 | 419 | 0.000058 | edu.berkeley
260 | 17606252 | 1045 | 0.000030 | gov.mo
261 | 17604126 | 333 | 0.000072 | com.cnbc
262 | 17602550 | 52 | 0.000636 | com.qq
263 | 17598790 | 670 | 0.000043 | com.adjust
264 | 17596982 | 1026 | 0.000031 | gov.oregon
265 | 17596684 | 299 | 0.000080 | com.meetup
266 | 17594788 | 1016 | 0.000031 | org.tensorflow
267 | 17594178 | 312 | 0.000077 | com.mapbox
268 | 17592452 | 159 | 0.000150 | com.salesforce
269 | 17586524 | 353 | 0.000068 | com.gmail
270 | 17577594 | 1066 | 0.000030 | com.googlesource
271 | 17574716 | 1176 | 0.000027 | edu.kit
272 | 17574656 | 327 | 0.000073 | com.springer
273 | 17574172 | 55 | 0.000629 | net.jsfiddle
274 | 17571342 | 848 | 0.000037 | com.wikia
275 | 17570210 | 1123 | 0.000028 | gov.ky
276 | 17570030 | 685 | 0.000042 | com.matterport
277 | 17569184 | 1055 | 0.000030 | com.hackernoon
278 | 17569072 | 382 | 0.000064 | com.fortune
279 | 17568196 | 397 | 0.000061 | com.photobucket
280 | 17565868 | 376 | 0.000065 | com.giphy
281 | 17561918 | 349 | 0.000069 | com.nypost
282 | 17561528 | 664 | 0.000043 | com.angieslist
283 | 17558588 | 1103 | 0.000029 | gov.wi
284 | 17558450 | 908 | 0.000035 | com.xrea
285 | 17558178 | 187 | 0.000125 | com.ebay
286 | 17557786 | 870 | 0.000036 | com.pixabay
287 | 17555784 | 1035 | 0.000031 | org.wnyc
288 | 17554024 | 627 | 0.000045 | com.economist
289 | 17553246 | 285 | 0.000082 | com.hubspot
290 | 17552904 | 858 | 0.000036 | edu.columbia
291 | 17552482 | 317 | 0.000076 | org.un
292 | 17551250 | 394 | 0.000062 | org.hbr
293 | 17547768 | 824 | 0.000037 | com.arstechnica
294 | 17547252 | 521 | 0.000048 | com.livechatinc
295 | 17544648 | 967 | 0.000033 | com.missingkids
296 | 17542812 | 135 | 0.000191 | com.feedburner
297 | 17542646 | 563 | 0.000046 | com.nationalgeographic
298 | 17542210 | 839 | 0.000037 | edu.yale
299 | 17541890 | 960 | 0.000033 | org.ohchr
300 | 17539760 | 826 | 0.000037 | org.aarp
301 | 17538968 | 550 | 0.000046 | com.scribd
302 | 17536254 | 1037 | 0.000031 | gov.maryland
303 | 17535552 | 987 | 0.000032 | gov.michigan
304 | 17534878 | 1170 | 0.000027 | gov.mt
305 | 17532728 | 354 | 0.000068 | com.oreilly
306 | 17528914 | 116 | 0.000238 | com.addthis
307 | 17524962 | 410 | 0.000060 | com.theatlantic
308 | 17522894 | 1164 | 0.000027 | org.amnesty
309 | 17522838 | 767 | 0.000039 | com.engadget
310 | 17522742 | 1048 | 0.000030 | us.pa.state
311 | 17522588 | 1446 | 0.000022 | com.jigsy
312 | 17520734 | 1275 | 0.000025 | com.healthgrades
313 | 17520216 | 679 | 0.000042 | com.intel
314 | 17517294 | 404 | 0.000061 | gov.whitehouse
315 | 17517106 | 1250 | 0.000025 | com.manta
316 | 17515170 | 689 | 0.000042 | com.vice
317 | 17515068 | 412 | 0.000059 | com.unsplash
318 | 17507818 | 311 | 0.000077 | com.wiley
319 | 17506494 | 128 | 0.000204 | com.wixsite
320 | 17503220 | 637 | 0.000045 | com.wikihow
321 | 17499836 | 1302 | 0.000024 | com.merchantcircle
322 | 17496442 | 341 | 0.000070 | com.livejournal
323 | 17494952 | 342 | 0.000070 | com.booking
324 | 17494632 | 1395 | 0.000022 | io.soup
325 | 17493230 | 370 | 0.000065 | com.skype
326 | 17490618 | 518 | 0.000048 | com.samsung
327 | 17490516 | 655 | 0.000044 | com.zdnet
328 | 17487772 | 398 | 0.000061 | com.entrepreneur
329 | 17485998 | 300 | 0.000080 | com.staticflickr
330 | 17485468 | 343 | 0.000070 | com.prnewswire
331 | 17484254 | 1306 | 0.000024 | ca.yelp
332 | 17484254 | 1216 | 0.000026 | com.contently
333 | 17483554 | 272 | 0.000085 | int.who
334 | 17483044 | 828 | 0.000037 | com.qz
335 | 17477120 | 359 | 0.000067 | com.office
336 | 17476598 | 472 | 0.000052 | com.cisco
337 | 17476580 | 1424 | 0.000022 | com.gimletmedia
338 | 17476460 | 1540 | 0.000020 | com.designobserver
339 | 17475042 | 294 | 0.000080 | com.hp
340 | 17474806 | 260 | 0.000090 | gov.cdc
341 | 17471496 | 236 | 0.000101 | com.disqus
342 | 17470994 | 1376 | 0.000023 | us.wi.state
343 | 17467786 | 640 | 0.000045 | com.cbsnews
344 | 17467412 | 517 | 0.000048 | com.statista
345 | 17467326 | 126 | 0.000208 | com.weibo
346 | 17466370 | 729 | 0.000040 | co.elastic
347 | 17465780 | 551 | 0.000046 | ca.pinterest
348 | 17465738 | 832 | 0.000037 | edu.psu
349 | 17462258 | 1212 | 0.000026 | org.tigris
350 | 17460552 | 1296 | 0.000024 | com.thoughtworks
351 | 17454102 | 407 | 0.000060 | com.inc
352 | 17452694 | 492 | 0.000050 | org.mediawiki
353 | 17450142 | 340 | 0.000071 | com.dailymotion
354 | 17449322 | 389 | 0.000063 | com.aol
355 | 17448426 | 976 | 0.000033 | com.gizmodo
356 | 17447376 | 1278 | 0.000025 | org.emojipedia
357 | 17445752 | 1081 | 0.000029 | net.leadpages
358 | 17445540 | 500 | 0.000049 | gov.nist
359 | 17442880 | 1459 | 0.000021 | com.zynga
360 | 17442700 | 361 | 0.000067 | org.ampproject
361 | 17442350 | 1218 | 0.000026 | us.nm.state
362 | 17442298 | 1453 | 0.000021 | com.activerain
363 | 17441802 | 1046 | 0.000030 | com.bandsintown
364 | 17439472 | 484 | 0.000051 | com.nature
365 | 17439328 | 520 | 0.000048 | com.venturebeat
366 | 17438974 | 572 | 0.000046 | com.box
367 | 17438964 | 178 | 0.000135 | com.constantcontact
368 | 17438634 | 213 | 0.000112 | to.amzn
369 | 17433964 | 970 | 0.000033 | com.thenextweb
370 | 17433742 | 1322 | 0.000024 | com.superpages
371 | 17432018 | 508 | 0.000049 | com.symantec
372 | 17424908 | 552 | 0.000046 | org.nodejs
373 | 17424538 | 242 | 0.000099 | org.drupal
374 | 17423606 | 180 | 0.000131 | com.tripadvisor
375 | 17423290 | 698 | 0.000041 | com.deloitte
376 | 17422498 | 1044 | 0.000030 | us.fl.state
377 | 17422248 | 251 | 0.000094 | com.digg
378 | 17419896 | 991 | 0.000032 | edu.utexas
379 | 17419420 | 959 | 0.000033 | com.googlegroups
380 | 17418564 | 1093 | 0.000029 | com.pexels
381 | 17418492 | 1329 | 0.000024 | ly.snip
382 | 17418108 | 322 | 0.000075 | fr.free
383 | 17417322 | 308 | 0.000077 | com.sciencedirect
384 | 17413382 | 203 | 0.000117 | com.bandcamp
385 | 17413228 | 633 | 0.000045 | com.moz
386 | 17412704 | 1408 | 0.000022 | com.whitepages
387 | 17410218 | 732 | 0.000040 | com.psychologytoday
388 | 17407554 | 1480 | 0.000021 | com.digitaltrends
389 | 17404092 | 1539 | 0.000020 | edu.scad
390 | 17399826 | 1056 | 0.000030 | org.weforum
391 | 17397762 | 330 | 0.000072 | com.sxsw
392 | 17394976 | 202 | 0.000117 | de.amazon
393 | 17394802 | 464 | 0.000052 | com.goodreads
394 | 17393720 | 937 | 0.000034 | org.eff
395 | 17392836 | 754 | 0.000039 | com.indiatimes
396 | 17391108 | 1147 | 0.000028 | com.thinkwithgoogle
397 | 17385920 | 1442 | 0.000022 | org.khanacademy
398 | 17380096 | 901 | 0.000035 | com.shutterstock
399 | 17379546 | 829 | 0.000037 | edu.umich
400 | 17377974 | 658 | 0.000043 | com.raywenderlich
401 | 17376058 | 375 | 0.000065 | com.businesswire
402 | 17375914 | 1352 | 0.000023 | edu.usc
403 | 17375836 | 270 | 0.000086 | ca.google
404 | 17373678 | 226 | 0.000104 | com.stumbleupon
405 | 17373002 | 1371 | 0.000023 | com.mysanantonio
406 | 17368554 | 204 | 0.000116 | com.automattic
407 | 17368054 | 891 | 0.000035 | au.net.abc
408 | 17365624 | 864 | 0.000036 | org.worldbank
409 | 17364686 | 1350 | 0.000023 | edu.unc
410 | 17364370 | 1113 | 0.000028 | org.example
411 | 17362738 | 1375 | 0.000023 | it.eventbrite
412 | 17361690 | 1243 | 0.000025 | com.merriam-webster
413 | 17360026 | 1550 | 0.000020 | edu.hmc
414 | 17357560 | 912 | 0.000035 | uk.co.guardian
415 | 17356764 | 871 | 0.000036 | com.netflix
416 | 17354116 | 446 | 0.000055 | com.slack
417 | 17352062 | 1438 | 0.000022 | me.websta
418 | 17350742 | 1261 | 0.000025 | com.kaggle
419 | 17350270 | 544 | 0.000047 | org.pbs
420 | 17347142 | 506 | 0.000049 | com.webs
421 | 17341612 | 1338 | 0.000023 | com.ning
422 | 17341024 | 1339 | 0.000023 | com.speakerdeck
423 | 17338712 | 1598 | 0.000020 | au.com.yelp
424 | 17337250 | 1602 | 0.000020 | org.themoth
425 | 17336832 | 1272 | 0.000025 | com.canva
426 | 17336530 | 1384 | 0.000023 | com.pcworld
427 | 17335622 | 1186 | 0.000027 | com.indiegogo
428 | 17334616 | 1247 | 0.000025 | edu.toronto
429 | 17333104 | 2537 | 0.000014 | com.instructables
430 | 17331566 | 1517 | 0.000021 | com.brandyourself
431 | 17331038 | 722 | 0.000040 | org.unesco
432 | 17330476 | 1171 | 0.000027 | com.pcmag
433 | 17330344 | 956 | 0.000033 | com.marketwatch
434 | 17329390 | 945 | 0.000033 | com.foxnews
435 | 17325772 | 526 | 0.000047 | tv.twitch
436 | 17321778 | 1196 | 0.000026 | org.mozillazine
437 | 17320920 | 1552 | 0.000020 | org.owasp
438 | 17319708 | 1374 | 0.000023 | com.googleapps
439 | 17319644 | 1146 | 0.000028 | co.leadpages
440 | 17319502 | 1604 | 0.000020 | com.yellowbook
441 | 17319460 | 1391 | 0.000022 | org.coursera
442 | 17319074 | 1316 | 0.000024 | edu.academia
443 | 17318194 | 318 | 0.000075 | com.tripod
444 | 17318084 | 996 | 0.000032 | edu.ucsd
445 | 17316988 | 763 | 0.000039 | com.gartner
446 | 17316564 | 915 | 0.000035 | com.sfgate
447 | 17315318 | 718 | 0.000040 | com.blackberry
448 | 17314360 | 1117 | 0.000028 | org.haskell
449 | 17314258 | 364 | 0.000066 | it.placehold
450 | 17311942 | 1549 | 0.000020 | edu.utep
451 | 17311156 | 1224 | 0.000026 | gov.nh
452 | 17310230 | 1263 | 0.000025 | edu.northwestern
453 | 17306310 | 1190 | 0.000027 | de.spiegel
454 | 17303936 | 321 | 0.000075 | com.getclicky
455 | 17302292 | 328 | 0.000073 | com.rawgit
456 | 17301718 | 371 | 0.000065 | edu.nyu
457 | 17300434 | 1087 | 0.000029 | org.maven
458 | 17299754 | 348 | 0.000069 | edu.cmu
459 | 17298408 | 1065 | 0.000030 | edu.wisc
460 | 17294964 | 957 | 0.000033 | com.dropboxusercontent
461 | 17294946 | 295 | 0.000080 | com.smugmug
462 | 17290908 | 1335 | 0.000024 | com.googledrive
463 | 17290086 | 923 | 0.000034 | gov.fcc
464 | 17289604 | 534 | 0.000047 | com.outlook
465 | 17288796 | 1277 | 0.000025 | edu.uchicago
466 | 17287666 | 634 | 0.000045 | com.windowsphone
467 | 17284804 | 1327 | 0.000024 | gov.la
468 | 17284524 | 1689 | 0.000019 | org.maximumfun
469 | 17284360 | 305 | 0.000078 | net.datatables
470 | 17284320 | 571 | 0.000046 | com.lifehacker
471 | 17283846 | 501 | 0.000049 | in.co.google
472 | 17283788 | 580 | 0.000046 | gov.noaa
473 | 17283492 | 1644 | 0.000020 | edu.uah
474 | 17282770 | 802 | 0.000038 | com.steampowered
475 | 17279092 | 1427 | 0.000022 | com.invisionapp
476 | 17274466 | 704 | 0.000041 | com.msdn
477 | 17274380 | 1088 | 0.000029 | org.vim
478 | 17274242 | 169 | 0.000141 | jp.co.yahoo
479 | 17274032 | 473 | 0.000052 | com.cargocollective
480 | 17273984 | 1142 | 0.000028 | com.ycombinator
481 | 17272438 | 235 | 0.000101 | gov.ftc
482 | 17271196 | 1364 | 0.000023 | org.iihs
483 | 17270984 | 896 | 0.000035 | gov.census
484 | 17270534 | 1245 | 0.000025 | com.upwork
485 | 17270494 | 2228 | 0.000017 | com.ehow
486 | 17270410 | 100 | 0.000305 | org.networkadvertising
487 | 17267670 | 696 | 0.000041 | com.webmd
488 | 17267420 | 1422 | 0.000022 | edu.purdue
489 | 17267352 | 335 | 0.000072 | com.stripe
490 | 17266616 | 2427 | 0.000015 | com.techradar
491 | 17266546 | 1052 | 0.000030 | org.sciencemag
492 | 17264826 | 1127 | 0.000028 | org.altervista
493 | 17264782 | 1151 | 0.000028 | io.material
494 | 17263612 | 1482 | 0.000021 | com.fifa
495 | 17262826 | 2161 | 0.000018 | com.crunchbase
496 | 17262428 | 1451 | 0.000021 | com.technologyreview
497 | 17261712 | 822 | 0.000037 | gov.senate
498 | 17260998 | 993 | 0.000032 | ly.ow
499 | 17260098 | 1169 | 0.000027 | com.playstation
500 | 17259954 | 1237 | 0.000026 | com.target
501 | 17258268 | 916 | 0.000035 | com.clicky
502 | 17257832 | 1538 | 0.000020 | uk.co.wired
503 | 17256742 | 391 | 0.000062 | com.force
504 | 17256046 | 926 | 0.000034 | com.java
505 | 17255234 | 1537 | 0.000020 | com.gettyimages
506 | 17254874 | 1809 | 0.000019 | us.countrystudies
507 | 17254832 | 1965 | 0.000018 | com.semrush
508 | 17251558 | 1185 | 0.000027 | org.gnupg
509 | 17250598 | 1122 | 0.000028 | com.politico
510 | 17250452 | 1662 | 0.000019 | com.womentechmakers
511 | 17250130 | 902 | 0.000035 | gov.uspto
512 | 17247996 | 851 | 0.000036 | org.whatbrowser
513 | 17247558 | 2155 | 0.000018 | com.vanityfair
514 | 17245144 | 142 | 0.000180 | ru.mail
515 | 17243458 | 435 | 0.000056 | com.snapchat
516 | 17242522 | 1129 | 0.000028 | com.istockphoto
517 | 17242096 | 217 | 0.000110 | com.bitly
518 | 17241974 | 384 | 0.000064 | com.adweek
519 | 17241808 | 1698 | 0.000019 | com.ikea
520 | 17241278 | 268 | 0.000087 | com.wufoo
521 | 17238236 | 162 | 0.000144 | com.eepurl
522 | 17233098 | 1109 | 0.000029 | org.archlinux
523 | 17232910 | 1334 | 0.000024 | fr.lemonde
524 | 17231910 | 1331 | 0.000024 | com.econsultancy
525 | 17231408 | 1138 | 0.000028 | com.udemy
526 | 17231268 | 108 | 0.000268 | jp.co.google
527 | 17230832 | 1389 | 0.000022 | com.today
528 | 17228032 | 1717 | 0.000019 | com.yellowbot
529 | 17227482 | 1227 | 0.000026 | com.intuit
530 | 17227376 | 973 | 0.000033 | org.iso
531 | 17226796 | 1567 | 0.000020 | com.aliexpress
532 | 17226258 | 1468 | 0.000021 | au.com.smh
533 | 17225468 | 1197 | 0.000026 | co.vine
534 | 17225278 | 958 | 0.000033 | com.hootsuite
535 | 17224354 | 1432 | 0.000022 | com.underconsideration
536 | 17223030 | 1633 | 0.000020 | uk.ac.hud
537 | 17222818 | 1429 | 0.000022 | com.com
538 | 17221248 | 874 | 0.000036 | com.nielsen
539 | 17219942 | 1755 | 0.000019 | com.communitywalk
540 | 17219670 | 2806 | 0.000013 | com.123rf
541 | 17217362 | 170 | 0.000141 | com.xing
542 | 17216198 | 941 | 0.000034 | com.livestream
543 | 17215352 | 950 | 0.000033 | com.timeanddate
544 | 17214566 | 892 | 0.000035 | de.blogspot
545 | 17214498 | 687 | 0.000042 | com.proofpoint
546 | 17214342 | 316 | 0.000076 | org.joomla
547 | 17214010 | 1303 | 0.000024 | org.pnas
548 | 17213976 | 949 | 0.000033 | com.americanexpress
549 | 17213226 | 1140 | 0.000028 | org.fao
550 | 17211764 | 246 | 0.000096 | com.wpengine
551 | 17211060 | 1011 | 0.000031 | uk.ac.cam
552 | 17210928 | 1344 | 0.000023 | com.snap
553 | 17210646 | 1270 | 0.000025 | us.ma.state
554 | 17210158 | 430 | 0.000056 | com.barnesandnoble
555 | 17210110 | 427 | 0.000057 | com.squareup
556 | 17209608 | 772 | 0.000039 | gov.justice
557 | 17207476 | 1341 | 0.000023 | com.billboard
558 | 17207004 | 1020 | 0.000031 | com.alibaba
559 | 17205496 | 1198 | 0.000026 | net.noscript
560 | 17204520 | 1397 | 0.000022 | org.letsencrypt
561 | 17203670 | 2386 | 0.000016 | ca.uwaterloo
562 | 17203164 | 1711 | 0.000019 | com.espn
563 | 17201262 | 1033 | 0.000031 | io.fabric
564 | 17199364 | 2322 | 0.000016 | ca.ubc
565 | 17198912 | 980 | 0.000032 | com.variety
566 | 17195066 | 1143 | 0.000028 | com.bostonglobe
567 | 17194856 | 1416 | 0.000022 | com.homestars
568 | 17194608 | 2401 | 0.000015 | com.tutsplus
569 | 17194034 | 2192 | 0.000018 | edu.msu
570 | 17193846 | 1706 | 0.000019 | com.bitballoon
571 | 17192748 | 668 | 0.000043 | com.feedly
572 | 17192454 | 1292 | 0.000024 | in.blogspot
573 | 17191504 | 1086 | 0.000029 | fr.blogspot
574 | 17191464 | 2272 | 0.000017 | com.fiverr
575 | 17189994 | 2226 | 0.000017 | edu.indiana
576 | 17189122 | 1479 | 0.000021 | uk.co.thesun
577 | 17187842 | 884 | 0.000036 | gov.nps
578 | 17187810 | 1671 | 0.000019 | com.mcafee
579 | 17186678 | 820 | 0.000037 | com.gofundme
580 | 17186358 | 2859 | 0.000012 | com.twitpic
581 | 17183206 | 1148 | 0.000028 | com.dell
582 | 17182996 | 2892 | 0.000012 | com.codecademy
583 | 17182726 | 1426 | 0.000022 | com.city-data
584 | 17182686 | 1460 | 0.000021 | io.bitbucket
585 | 17181510 | 708 | 0.000041 | com.photoshelter
586 | 17181394 | 3334 | 0.000010 | com.dreamstime
587 | 17181320 | 2493 | 0.000015 | com.newscientist
588 | 17180454 | 1363 | 0.000023 | com.nytco
589 | 17179308 | 463 | 0.000052 | us.icio
590 | 17178552 | 1202 | 0.000026 | com.yandex
591 | 17178234 | 206 | 0.000116 | com.histats
592 | 17176076 | 2342 | 0.000016 | uk.ac.ed
593 | 17175502 | 1040 | 0.000031 | gov.fbi
594 | 17174634 | 846 | 0.000037 | com.500px
595 | 17174572 | 431 | 0.000056 | cn.com.sina
596 | 17173950 | 630 | 0.000045 | com.mobirise
597 | 17173428 | 1217 | 0.000026 | org.jenkins-ci
598 | 17172852 | 2448 | 0.000015 | ca.ualberta
599 | 17170928 | 1618 | 0.000020 | com.googlelabs
600 | 17170156 | 2164 | 0.000018 | com.socialmediaexaminer
601 | 17170034 | 1059 | 0.000030 | com.wayfair
602 | 17169444 | 1238 | 0.000026 | uk.co.mirror
603 | 17169408 | 1300 | 0.000024 | us.oh.state
604 | 17169406 | 806 | 0.000038 | com.buffer
605 | 17169298 | 155 | 0.000152 | it.google
606 | 17168866 | 529 | 0.000047 | com.format
607 | 17168200 | 1346 | 0.000023 | org.threejs
608 | 17167624 | 774 | 0.000039 | com.uk
609 | 17167450 | 1448 | 0.000022 | org.spie
610 | 17165660 | 1294 | 0.000024 | kr.flic
611 | 17165516 | 1132 | 0.000028 | edu.umn
612 | 17165214 | 1304 | 0.000024 | com.iconarchive
613 | 17163904 | 233 | 0.000103 | com.myshopify
614 | 17162886 | 434 | 0.000056 | com.nasdaq
615 | 17162052 | 709 | 0.000041 | com.uservoice
616 | 17161576 | 2353 | 0.000016 | com.screencast
617 | 17161352 | 905 | 0.000035 | br.com.uol
618 | 17160850 | 298 | 0.000080 | nl.google
619 | 17160062 | 1219 | 0.000026 | com.scientificamerican
620 | 17159442 | 2800 | 0.000013 | ly.visual
621 | 17158882 | 964 | 0.000033 | com.prweb
622 | 17158856 | 1178 | 0.000027 | com.smashingmagazine
623 | 17158618 | 1366 | 0.000023 | com.nymag
624 | 17157146 | 381 | 0.000064 | com.dmca
625 | 17156996 | 1324 | 0.000024 | com.hollywoodreporter
626 | 17155372 | 1273 | 0.000025 | com.warnerbros
627 | 17155154 | 862 | 0.000036 | net.openid
628 | 17154578 | 765 | 0.000039 | gov.copyright
629 | 17153906 | 1249 | 0.000025 | com.prezi
630 | 17152002 | 2439 | 0.000015 | com.aljazeera
631 | 17151074 | 149 | 0.000170 | gov.privacyshield
632 | 17150824 | 1042 | 0.000031 | com.airbnb
633 | 17150340 | 924 | 0.000034 | ca.cbc
634 | 17149820 | 1253 | 0.000025 | com.gigaom
635 | 17148414 | 1101 | 0.000029 | com.searchengineland
636 | 17148086 | 1320 | 0.000024 | net.recode
637 | 17147394 | 2214 | 0.000017 | com.searchenginejournal
638 | 17145414 | 1165 | 0.000027 | com.reverbnation
639 | 17144640 | 1058 | 0.000030 | com.redhat
640 | 17143504 | 273 | 0.000085 | com.fc2
641 | 17142386 | 3438 | 0.000010 | com.hubpages
642 | 17142298 | 1378 | 0.000023 | com.freepik
643 | 17141892 | 1333 | 0.000024 | com.nyt
644 | 17141868 | 699 | 0.000041 | com.patreon
645 | 17141858 | 646 | 0.000044 | gov.hhs
646 | 17141712 | 1357 | 0.000023 | com.kissmetrics
647 | 17141112 | 1485 | 0.000021 | com.rollingstone
648 | 17140398 | 1039 | 0.000031 | org.apa
649 | 17139398 | 223 | 0.000108 | fr.google
650 | 17138016 | 1191 | 0.000027 | com.crashlytics
651 | 17137096 | 4071 | 0.000008 | com.answers
652 | 17137056 | 1450 | 0.000022 | com.autodesk
653 | 17136872 | 1348 | 0.000023 | com.theglobeandmail
654 | 17136816 | 1210 | 0.000026 | com.indeed
655 | 17136588 | 257 | 0.000091 | com.getbootstrap
656 | 17135766 | 3377 | 0.000010 | com.domaintools
657 | 17134556 | 1533 | 0.000020 | edu.dukeupress
658 | 17134490 | 2474 | 0.000015 | edu.bu
659 | 17133898 | 1297 | 0.000024 | org.scala-lang
660 | 17133524 | 706 | 0.000041 | com.alexa
661 | 17133178 | 1047 | 0.000030 | com.sciencedaily
662 | 17131500 | 1323 | 0.000024 | com.vox
663 | 17131458 | 1080 | 0.000029 | gov.usgs
664 | 17130422 | 107 | 0.000269 | com.googleadservices
665 | 17130056 | 1419 | 0.000022 | com.elpais
666 | 17129938 | 1777 | 0.000019 | edu.alamo
667 | 17129706 | 474 | 0.000052 | br.com.google
668 | 17128186 | 2346 | 0.000016 | edu.asu
669 | 17128168 | 390 | 0.000062 | com.newrelic
670 | 17127826 | 1505 | 0.000021 | com.nba
671 | 17127350 | 788 | 0.000038 | gov.state
672 | 17127102 | 2778 | 0.000013 | com.macrumors
673 | 17126992 | 2393 | 0.000016 | edu.ncsu
674 | 17126312 | 1385 | 0.000023 | edu.jhu
675 | 17125306 | 2759 | 0.000013 | com.starwars
676 | 17124492 | 1225 | 0.000026 | us.imageshack
677 | 17124350 | 320 | 0.000075 | com.netdna-ssl
678 | 17123996 | 1675 | 0.000019 | org.virginiadot
679 | 17122508 | 2334 | 0.000016 | ch.ethz
680 | 17122024 | 2301 | 0.000016 | com.msnbc
681 | 17121934 | 1571 | 0.000020 | com.nokia
682 | 17121692 | 705 | 0.000041 | com.mckinsey
683 | 17121268 | 1182 | 0.000027 | org.gentoo
684 | 17120294 | 429 | 0.000057 | gov.irs
685 | 17119582 | 2046 | 0.000018 | com.css-tricks
686 | 17119430 | 417 | 0.000059 | com.bigcartel
687 | 17118212 | 1354 | 0.000023 | com.thehill
688 | 17117616 | 1799 | 0.000019 | edu.virginia
689 | 17117124 | 53 | 0.000634 | com.messenger
690 | 17116930 | 1753 | 0.000019 | com.fixr
691 | 17116722 | 992 | 0.000032 | io.codepen
692 | 17115848 | 2203 | 0.000018 | com.zazzle
693 | 17115538 | 1399 | 0.000022 | com.gallup
694 | 17115352 | 907 | 0.000035 | com.adage
695 | 17115320 | 502 | 0.000049 | fr.amazon
696 | 17114884 | 194 | 0.000121 | com.youku
697 | 17114790 | 3323 | 0.000010 | com.rottentomatoes
698 | 17114464 | 1269 | 0.000025 | com.businessweek
699 | 17114350 | 1268 | 0.000025 | com.uber
700 | 17112978 | 1307 | 0.000024 | com.nydailynews
701 | 17112294 | 325 | 0.000073 | com.bizjournals
702 | 17112210 | 1849 | 0.000018 | com.smartguy
703 | 17111208 | 1553 | 0.000020 | com.hotfrog
704 | 17110654 | 2888 | 0.000012 | edu.brown
705 | 17110092 | 1530 | 0.000021 | uk.co.lrb
706 | 17109686 | 1527 | 0.000021 | edu.umd
707 | 17108562 | 2418 | 0.000015 | tv.periscope
708 | 17107332 | 1206 | 0.000026 | int.coe
709 | 17106560 | 982 | 0.000032 | org.oecd
710 | 17106474 | 1004 | 0.000032 | org.change
711 | 17104716 | 1493 | 0.000021 | com.searchenginewatch
712 | 17104622 | 1390 | 0.000022 | it.binged
713 | 17104560 | 1496 | 0.000021 | io.prototypr
714 | 17104464 | 541 | 0.000047 | gov.sec
715 | 17103876 | 139 | 0.000185 | de.bund
716 | 17103676 | 2165 | 0.000018 | com.posterous
717 | 17103630 | 672 | 0.000043 | com.emarketer
718 | 17103156 | 2500 | 0.000015 | au.com.news
719 | 17103002 | 1853 | 0.000018 | edu.ucdavis
720 | 17102406 | 2329 | 0.000016 | com.blogs
721 | 17101526 | 2312 | 0.000016 | com.nfl
722 | 17101098 | 2478 | 0.000015 | com.cbs
723 | 17100940 | 1822 | 0.000019 | com.hulu
724 | 17099830 | 1328 | 0.000024 | com.pwc
725 | 17099418 | 1358 | 0.000023 | ly.plot
726 | 17098734 | 1180 | 0.000027 | com.firebaseapp
727 | 17098528 | 326 | 0.000073 | me.fb
728 | 17098266 | 1236 | 0.000026 | org.cambridge
729 | 17097620 | 1359 | 0.000023 | fm.last
730 | 17097536 | 1256 | 0.000025 | uk.co.theregister
731 | 17097340 | 1430 | 0.000022 | com.kudzu
732 | 17097214 | 2298 | 0.000016 | org.aclu
733 | 17097148 | 1541 | 0.000020 | org.ushistory
734 | 17096970 | 286 | 0.000082 | com.naver
735 | 17095930 | 890 | 0.000035 | gov.sba
736 | 17095924 | 2609 | 0.000014 | com.wikidot
737 | 17095660 | 465 | 0.000052 | gov.epa
738 | 17095576 | 1708 | 0.000019 | com.akamai
739 | 17094822 | 1572 | 0.000020 | org.jstor
740 | 17094514 | 255 | 0.000092 | com.marriott
741 | 17094372 | 1149 | 0.000028 | org.redcross
742 | 17093170 | 350 | 0.000068 | net.themeforest
743 | 17091942 | 2491 | 0.000015 | com.lonelyplanet
744 | 17091664 | 2441 | 0.000015 | mp.j
745 | 17089358 | 1518 | 0.000021 | au.com.truelocal
746 | 17089222 | 2333 | 0.000016 | com.discovery
747 | 17089064 | 1488 | 0.000021 | com.domain
748 | 17088086 | 1006 | 0.000031 | com.cbslocal
749 | 17087402 | 2773 | 0.000013 | org.phys
750 | 17085534 | 1295 | 0.000024 | gov.nyc
751 | 17085312 | 1134 | 0.000028 | io.bower
752 | 17085254 | 2227 | 0.000017 | org.rubyonrails
753 | 17085212 | 677 | 0.000043 | uk.co.tripadvisor
754 | 17083968 | 2286 | 0.000017 | com.urbandictionary
755 | 17083722 | 3081 | 0.000011 | com.fivethirtyeight
756 | 17083390 | 1463 | 0.000021 | com.insiderpages
757 | 17082484 | 1694 | 0.000019 | org.twinery
758 | 17081344 | 190 | 0.000124 | jp.ne.hatena
759 | 17080300 | 1795 | 0.000019 | org.milaap
760 | 17079140 | 1678 | 0.000019 | es.iac
761 | 17078816 | 1168 | 0.000027 | com.accenture
762 | 17077842 | 1690 | 0.000019 | com.2findlocal
763 | 17077284 | 872 | 0.000036 | com.att
764 | 17077178 | 2240 | 0.000017 | de.zeit
765 | 17077136 | 653 | 0.000044 | gov.ny
766 | 17075592 | 914 | 0.000035 | com.chicagotribune
767 | 17075424 | 1745 | 0.000019 | com.planetware
768 | 17075300 | 237 | 0.000100 | jp.co.amazon
769 | 17075058 | 2274 | 0.000017 | edu.umass
770 | 17074694 | 1070 | 0.000029 | com.investopedia
771 | 17074424 | 1656 | 0.000020 | com.wsoctv
772 | 17074178 | 1902 | 0.000018 | org.postimg
773 | 17073772 | 2157 | 0.000018 | uk.ac.ucl
774 | 17072414 | 1857 | 0.000018 | com.linkcentre
775 | 17072314 | 1585 | 0.000020 | edu.vassar
776 | 17071320 | 2332 | 0.000016 | com.ibtimes
777 | 17071232 | 1908 | 0.000018 | com.chron
778 | 17071194 | 2151 | 0.000018 | edu.cuny
779 | 17070736 | 1043 | 0.000030 | gov.va
780 | 17070650 | 1568 | 0.000020 | com.zillow
781 | 17070606 | 3084 | 0.000011 | com.lynda
782 | 17070176 | 1667 | 0.000019 | com.phnompenhpost
783 | 17069034 | 1002 | 0.000032 | com.formstack
784 | 17068194 | 1345 | 0.000023 | re.cli
785 | 17067768 | 831 | 0.000037 | com.sagepub
786 | 17067044 | 2842 | 0.000013 | com.animoto
787 | 17066988 | 1461 | 0.000021 | ca.kijiji
788 | 17066682 | 1264 | 0.000025 | com.xkcd
789 | 17066212 | 1564 | 0.000020 | com.warriorplus
790 | 17066032 | 1120 | 0.000028 | com.business2community
7911706597414730.000021org.sigcomm
792170658046730.000043org.openstreetmap
7931706467416540.000020com.tiki-toki
7941706385615540.000020jp.ac.kobe-u
7951706349026740.000013com.kaspersky
7961706271017500.000019com.trendland
797170626424780.000051com.atlassian
798170619889830.000032com.zoho
7991706127816630.000019fr.estrepublicain
800170598144510.000053gov.usda
8011705816629860.000012com.9to5mac
8021705763014770.000021com.theoutline
803170572988110.000038gov.usa
8041705589031230.000011uk.bl
8051705559213720.000023com.strikingly
8061705527617560.000019edu.ufl
807170549702560.000091com.elegantthemes
8081705480823210.000016com.apnews
809170546844540.000053com.pinimg
8101705467415550.000020org.gwtproject
81117054664930.000317com.namecheap
812170545585300.000047com.gotowebinar
8131705426023450.000016org.gimp
814170542586470.000044gov.ed
815170541181760.000136org.icann
8161705376416760.000019ws.snack
8171705358815190.000021com.hotmail
8181705348625140.000015com.ifttt
8191705323414890.000021net.hockeyapp
8201705179034650.000010com.virustotal
821170515863690.000066org.opensource
8221705153415130.000021com.acninc
8231705057229500.000012org.moma
824170505426840.000042ca.amazon
8251704954213800.000023com.stitcher
826170489149940.000032org.plos
8271704846227910.000013edu.unl
8281704840613100.000024com.over-blog
8291704800017460.000019com.mercurynews
8301704745427620.000013com.topsy
8311704693217900.000019com.khamsat
8321704659643890.000007com.lmgtfy
8331704615628530.000012com.sophos
8341704527417200.000019com.ignimgs
8351704499613920.000022us.zoom
836170443502740.000085com.maxcdn
8371704346226760.000013edu.gmu
8381704326610080.000031com.oup
839170432509470.000033com.accuweather
8401704247018550.000018net.wrightflyer
8411704223824870.000015edu.utah
8421704217811280.000028com.mixcloud
8431704194410990.000029org.doxygen
8441704193824670.000015com.producthunt
8451704116823150.000016com.thestar
8461704080622300.000017edu.arizona
8471704025414910.000021com.sky
8481703927224730.000015org.openoffice
849170388066910.000042com.163
8501703780017020.000019com.howstuffworks
8511703694615510.000020com.company
8521703689422010.000018com.pastebin
8531703649822690.000017ru.narod
8541703643013980.000022io.pantheon
8551703635816350.000020com.discordapp
8561703537032750.000010org.greenpeace
8571703461822310.000017com.deadline
8581703444614720.000021com.local
8591703408828730.000012com.campaignmonitor
860170335921930.000121jp.ameblo
8611703233628890.000012org.bitcoin
8621703199413510.000023com.socialmediatoday
8631703117415890.000020it.blogspot
8641703097612930.000024edu.si
8651703096871750.000005org.audacityteam
866170307208410.000037com.yp
8671703036822420.000017com.livestrong
8681703033424500.000015com.bestbuy
8691702945813130.000024com.globo
870170293661660.000142me.line
8711702854618520.000018tv.royanews
8721702790225350.000014com.mentalfloss
8731702709012980.000024com.gumroad
8741702695018630.000018com.boston
8751702688826170.000014com.getresponse
8761702484414350.000022com.cafepress
8771702472812080.000026com.forrester
878170220927030.000041com.usnews
879170216489990.000032com.walmart
8801702069414490.000022org.wiktionary
881170206724370.000056com.criteo
8821702027016310.000020au.com.whitepages
8831701631014440.000022ca.calgaryseocompany
884170162724210.000058com.adroll
8851701575610190.000031de.heise
8861701487814410.000022com.technorati
8871701463218080.000019de.welt
8881701459215650.000020com.bizcommunity
8891701408414010.000022mil.army
8901701294828250.000013com.fox
8911701222217290.000019com.contentmarketinginstitute
8921701171625610.000014com.yolasite
893170116185120.000048com.udacity
8941701106221700.000018com.podbean
8951701102214810.000021de.bundesverfassungsgericht
896170108362210.000109me.t
897170105981120.000255info.aboutads
8981701053830710.000011com.googlepages
8991700990217380.000019com.pushwoosh
900170093707010.000041com.gitlab
9011700910412330.000026org.sonatype
9021700873634930.000010org.notepad-plus-plus
9031700834033010.000010edu.uic
9041700824616680.000019com.waze
905170071548080.000038es.com.blogspot
9061700706414580.000021com.tiddlywiki
9071700690011000.000029com.digiday
9081700664816580.000020com.lulu
909170060868070.000038uk.co.eventbrite
9101700535229100.000012com.ndtv
9111700512616830.000019com.ssllabs
9121700466615830.000020com.sproutsocial
9131700422418300.000019me.pxlme
9141700414215280.000021com.neilpatel
9151700374214070.000022int.wipo
9161700361215020.000021org.filezilla-project
917170024724520.000053com.custhelp
9181700178622380.000017org.raspberrypi
9191700087816850.000019com.quandl
9201700060628830.000012edu.tufts
9211700011223660.000016com.salon
9221699915432790.000010org.metmuseum
9231699866034800.000010com.spreaker
9241699854625430.000014com.fineartamerica
9251699643216520.000020net.brownbook
9261699625812890.000024com.bmj
9271699481225420.000014uk.co.express
9281699454832680.000010in.lnkd
9291699349811890.000027com.techtarget
9301699183630270.000012edu.hawaii
9311699176012540.000025org.pewresearch
9321699169228760.000012com.fitbit
9331699165833920.000010org.edx
9341699112621540.000018uk.co.huffingtonpost
9351699065610300.000031com.fotolia
9361699025613110.000024com.optimizely
937169902127270.000040com.geocities
9381698944014100.000022com.mariadb
9391698938810680.000030com.infusionsoft
9401698849832100.000011com.popsci
941169879128270.000037gov.house
9421698779034670.000010cc.tiny
9431698662817660.000019com.spoke
9441698646626660.000014nl.uva
9451698592617270.000019org.unfe
9461698588010280.000031es.amazon
9471698571816470.000020uk.gov.westsussex
9481698555816810.000019com.chamberofcommerce
9491698538635840.000009gd.is
9501698528213080.000024net.java
951169852386540.000044com.houzz
9521698515610900.000029gov.archives
9531698416233130.000010com.avast
9541698394822160.000017com.examiner
9551698380216450.000020com.thefabricator
956169837905050.000049com.redbubble
9571698329618240.000019com.computerworld
9581698220434540.000010com.klout
959169810869340.000034com.delicious
9601697894233290.000010org.kiva
961169783764530.000053com.teamviewer
9621697828020480.000018com.cio
9631697783821710.000018com.thedailybeast
964169775984110.000059mp.mailchi
9651697737224350.000015br.com.blogspot
9661697661611660.000027com.netdna-cdn
967169764968890.000036com.arcgis
9681697605828440.000013com.createspace
9691697602841820.000008net.deviantart
9701697558017610.000019com.yelloyello
9711697551615760.000020gov.cabq
972169754904800.000051com.iconfinder
9731697507414310.000022au.com.yellowpages
9741697372412570.000025io.getmdl
9751697278012280.000026com.thedrum
9761697220416960.000019com.us
9771697175624220.000015org.linuxfoundation
9781696911253780.000006com.depositphotos
9791696900823280.000016com.ign
9801696790015630.000020org.gmplib
9811696788828320.000013edu.caltech
9821696725228950.000012com.infoq
9831696676422020.000018edu.uci
9841696618613320.000024com.xbox
9851696606614250.000022com.techrepublic
9861696602812620.000025com.glassdoor
9871696550615990.000020com.apachelounge
988169654548950.000035org.unicef
9891696511629900.000012com.discogs
9901696450029130.000012es.abc
991169631667430.000039com.biomedcentral
9921696267428630.000012nl.xs4all
9931696250214230.000022org.heart
9941696183226100.000014org.olympic
995169607362520.000093com.ssl-images-amazon
9961695997226800.000013de.bild
9971695979026720.000013com.nbc
9981695929816910.000019com.realtytimes
9991695922814560.000021com.mediafire
10001695908015690.000020com.galvanize

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

January 2019 crawl archive now available

The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th.

The January crawl contains page captures of 850 million URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the Aug/Sep/Oct 2018 webgraph data set from the following sources:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 50 million hosts and domains
  • a random sample of outlinks taken from WAT files of the December crawl
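
The breadth-first side crawl in the list above can be pictured as a depth-limited graph traversal. The following is a minimal sketch of that idea, not the crawler's actual code: the link graph is a hypothetical in-memory dictionary, and the names bounded_bfs and max_hops are introduced here for illustration only.

```python
from collections import deque

def bounded_bfs(links, seeds, max_hops):
    """Return {url: hops} for every URL reachable within max_hops of a seed."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(hops)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:  # in BFS the first visit is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical toy link graph: home -> a -> b -> c
toy_links = {"home": ["a"], "a": ["b"], "b": ["c"]}
reached = bounded_bfs(toy_links, ["home"], max_hops=2)
print(reached)
```

With a limit of 2 hops, the page three links away from the homepage is never visited, which is how a hop limit keeps a side crawl close to the seed pages.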

The number of sampled URLs per domain depends on the domain’s harmonic centrality rank in the webgraph data set – higher-ranking domains are allowed to “contribute” more URLs.

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-04/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-04/segment.paths.gz          100
WARC files               CC-MAIN-2019-04/warc.paths.gz             64000   58.86
WAT files                CC-MAIN-2019-04/wat.paths.gz              64000   18.88
WET files                CC-MAIN-2019-04/wet.paths.gz              64000   7.98
Robots.txt files         CC-MAIN-2019-04/robotstxt.paths.gz        64000   0.18
Non-200 responses files  CC-MAIN-2019-04/non200responses.paths.gz  64000   1.65
URL index files          CC-MAIN-2019-04/cc-index.paths.gz         302     0.21
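
Turning a path listing into full locations is a one-line transformation per entry. The sketch below assumes a listing has already been downloaded; to stay self-contained it builds a small gzipped stand-in in memory, and the two file names inside it are made up. The function name expand_paths is likewise only illustrative.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def expand_paths(gz_bytes, prefix):
    """Decompress a *.paths.gz listing and prepend prefix to each non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]

# Stand-in for a downloaded warc.paths.gz; these file names are invented.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2019-04/segments/0000/warc/example-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2019-04/segments/0000/warc/example-00001.warc.gz\n"
)
http_urls = expand_paths(sample, HTTP_PREFIX)
s3_urls = expand_paths(sample, S3_PREFIX)
print(http_urls[0])
print(s3_urls[0])
```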

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-04/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

December 2018 crawl archive now available

The crawl archive for December 2018 is now available! It contains 3.1 billion web pages or 250 TiB of uncompressed content, crawled between December 9th and 19th.

The December crawl contains page captures of 735 million URLs not contained in any crawl archive before. New URLs stem from:

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Aug/Sep/Oct 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 50 million domains of the webgraph dataset
  • a random sample of outlinks taken from WAT files of the November crawl
  • 30 million external links sampled from Wikipedia data dumps
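
The list above mentions random samples of outlinks and of Wikipedia links without saying how they are drawn. One standard way to take a fixed-size uniform sample from a stream whose length is not known in advance is reservoir sampling (Algorithm R). The sketch below illustrates that general technique only; it is not a description of the crawler, and the URLs are hypothetical.

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive bounds: 0 <= j <= i
            if j < k:
                sample[j] = item   # happens with probability k / (i + 1)
    return sample

# Hypothetical outlinks; the fixed seed only makes the demo repeatable.
outlinks = (f"https://example.com/page{n}" for n in range(1000))
picked = reservoir_sample(outlinks, k=5, rng=random.Random(42))
print(picked)
```

The stream is read once and only k items are held in memory, which matters when the input is a full crawl's worth of outlinks.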

Archive Location and Download

The December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-51/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2018-51/segment.paths.gz          100
WARC files               CC-MAIN-2018-51/warc.paths.gz             63840   65.31
WAT files                CC-MAIN-2018-51/wat.paths.gz              63840   20.01
WET files                CC-MAIN-2018-51/wet.paths.gz              63840   8.43
Robots.txt files         CC-MAIN-2018-51/robotstxt.paths.gz        63840   0.22
Non-200 responses files  CC-MAIN-2018-51/non200responses.paths.gz  63840   1.71
URL index files          CC-MAIN-2018-51/cc-index.paths.gz         302     0.24

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-51/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November 2018 crawl archive now available

The crawl archive for November 2018 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between November 12th and 22nd.

The November crawl contains 640 million new URLs, not contained in any crawl archive before. New URLs stem from:

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Aug/Sep/Oct 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 10 links (“hops”) away from the home pages of the top 40 million domains of the webgraph dataset
  • a random sample of outlinks taken from WAT files of the October crawl
  • 50 million external links sampled from Wikipedia data dumps

Archive Location and Download

The November crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-47/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2018-47/segment.paths.gz          100
WARC files               CC-MAIN-2018-47/warc.paths.gz             56000   54.16
WAT files                CC-MAIN-2018-47/wat.paths.gz              56000   17.36
WET files                CC-MAIN-2018-47/wet.paths.gz              56000   7.42
Robots.txt files         CC-MAIN-2018-47/robotstxt.paths.gz        56000   0.2
Non-200 responses files  CC-MAIN-2018-47/non200responses.paths.gz  56000   1.92
URL index files          CC-MAIN-2018-47/cc-index.paths.gz         302     0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-47/. The columnar index has also been updated to include this crawl.
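
The URL index can also be queried programmatically. The sketch below only builds a query URL and sends no request. It assumes the index server exposes a CDX-style endpoint named after the crawl with an -index suffix and accepts the url, output and limit parameters. That endpoint layout is not spelled out in this post, so treat it as an assumption to verify against the index server's own documentation. The helper name cdx_query_url is hypothetical.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"

def cdx_query_url(crawl_id, url_pattern, limit=None):
    """Build a lookup URL for the index of one crawl (endpoint layout assumed)."""
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return f"{INDEX_SERVER}{crawl_id}-index?{urlencode(params)}"

# Look up captures below commoncrawl.org in the November 2018 crawl.
query = cdx_query_url("CC-MAIN-2018-47", "commoncrawl.org/*", limit=5)
print(query)
```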

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., the Feb/Mar/Apr 2017 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 903 million nodes and 5.25 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to by a link on a crawled page. There are 819 million dangling nodes (91%) and the largest strongly connected component contains only 60 million (6.5%) nodes. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
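
The host name transformation described above is easy to reproduce. The following is a minimal sketch based only on that description; the function names are illustrative and are not taken from the cc-webgraph tools.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

def unreverse_host(rev_host):
    """Turn a reversed name back into the usual order (the 'www.' is not restored)."""
    return ".".join(reversed(rev_host.split(".")))

print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
print(unreverse_host("com.example.subdomain"))    # subdomain.example.com
```

One practical benefit of the reversed form is that sorting the names groups all hosts of a domain next to each other.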

You can download the graph and the ranks of all 903 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-aug-sep-oct/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-aug-sep-oct/host/ as a prefix to access the files from anywhere.

The following files and formats are provided:

Size      File                                             Description
5.66 GB   cc-main-2018-aug-sep-oct-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 42 vertices files
23.60 GB  cc-main-2018-aug-sep-oct-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 98 edges files
9.63 GB   cc-main-2018-aug-sep-oct-host.graph              graph in BVGraph format
2 kB      cc-main-2018-aug-sep-oct-host.properties
10.83 GB  cc-main-2018-aug-sep-oct-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2018-aug-sep-oct-host-t.properties
1 kB      cc-main-2018-aug-sep-oct-host.stats              WebGraph statistics
13.47 GB  cc-main-2018-aug-sep-oct-host-ranks.txt.gz       harmonic centrality and pagerank
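
To give an idea of how the vertices and edges files fit together, here is a small sketch that reads both and lists the dangling nodes, i.e. nodes without outgoing links. It assumes the decompressed files hold one tab-separated record per line in the field order shown in the table, which this post does not state explicitly. The sample data is invented and tiny, whereas the real files are far too large to load into dictionaries like this.

```python
import io

def read_vertices(lines):
    """Parse '<id>\t<reversed host>' lines into {id: reversed host}."""
    names = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")[:2]
        names[int(node_id)] = rev_host
    return names

def read_edges(lines):
    """Parse '<from_id>\t<to_id>' lines into (from_id, to_id) integer pairs."""
    for line in lines:
        from_id, to_id = line.split("\t")
        yield int(from_id), int(to_id)

def dangling_nodes(names, edges):
    """Return the names of nodes that never appear as the source of an edge."""
    has_outlink = {src for src, _ in edges}
    return sorted(name for node, name in names.items() if node not in has_outlink)

# Invented sample data (assumed format: tab-separated, one record per line).
vertices_txt = "0\tcom.example\n1\torg.example\n2\tnet.example\n"
edges_txt = "0\t1\n0\t2\n1\t2\n"

names = read_vertices(io.StringIO(vertices_txt))
dangling = dangling_nodes(names, read_edges(io.StringIO(edges_txt)))
print(dangling)
```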

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
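
The aggregation step can be sketched as follows. This is a simplified illustration rather than the actual pipeline: the three-entry suffix set is a hypothetical stand-in for the real public suffix list (which also has wildcard and exception rules that are ignored here), and dropping links that stay within one domain is an assumption made for this example.

```python
# Suffixes in reversed label order, matching the reversed host names.
TOY_SUFFIXES = {"com", "org", "uk.co"}

def rev_pld(rev_host, suffixes=TOY_SUFFIXES):
    """Map a reversed host name to its reversed pay-level domain."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in suffixes:
            return ".".join(labels[:n + 1])  # suffix plus one more label
    return rev_host  # no known suffix: leave the name unchanged

def aggregate_edges(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        a, b = rev_pld(src), rev_pld(dst)
        if a != b:  # assumption: links inside one domain are dropped
            domain_edges.add((a, b))
    return domain_edges

host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.www2", "com.example.blog"),
    ("com.example.shop", "uk.co.bbc"),
]
domain_edges = aggregate_edges(host_edges)
print(domain_edges)
```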

The domain-level graph has 87 million nodes and 1.48 billion edges. 56% or 49 million nodes are dangling nodes; the largest strongly connected component covers 33.5 million or 38% of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-aug-sep-oct/domain/ or via HTTP under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-aug-sep-oct/domain/.

Download files of the Common Crawl Aug/Sep/Oct 2018 domain-level webgraph

Size      File                                             Description
0.60 GB   cc-main-2018-aug-sep-oct-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
5.95 GB   cc-main-2018-aug-sep-oct-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
3.24 GB   cc-main-2018-aug-sep-oct-domain.graph            graph in BVGraph format
2 kB      cc-main-2018-aug-sep-oct-domain.properties
3.39 GB   cc-main-2018-aug-sep-oct-domain-t.graph          transpose of the graph
2 kB      cc-main-2018-aug-sep-oct-domain-t.properties
1 kB      cc-main-2018-aug-sep-oct-domain.stats            WebGraph statistics
1.89 GB   cc-main-2018-aug-sep-oct-domain-ranks.txt.gz     harmonic centrality and pagerank
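
The ranks file from the table above can be processed with a few lines of code. The sketch below assumes tab-separated columns in the order harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name, with optional header lines starting with #. That layout is an assumption to check against the downloaded file. The three sample rows reuse values from the top of the domain ranking.

```python
import csv
import io
from typing import NamedTuple

class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str

def read_ranks(lines):
    """Parse rank records, skipping '#' header lines and empty lines."""
    data = (line for line in lines if not line.startswith("#"))
    rows = csv.reader(data, delimiter="\t")
    return [Rank(int(r[0]), float(r[1]), int(r[2]), float(r[3]), r[4])
            for r in rows if r]

# Sample in the assumed layout; values taken from the first three ranked domains.
sample_txt = (
    "#hc_rank\thc_value\tpr_rank\tpr_value\trev_name\n"
    "1\t24993276\t2\t0.012750\tcom.facebook\n"
    "2\t24671056\t1\t0.017210\tcom.googleapis\n"
    "3\t23453366\t3\t0.010761\tcom.google\n"
)
ranks = read_ranks(io.StringIO(sample_txt))
by_pagerank = sorted(ranks, key=lambda r: r.pr_rank)
print([r.rev_name for r in by_pagerank])
```

Sorting by pr_rank instead of hc_rank shows how the two measures can order the same domains differently.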

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 87 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Aug/Sep/Oct 2018)

harmonic centrality rank  hc value  page rank  page rank value  reversed hostname
12499327620.012750com.facebook
22467105610.017210com.googleapis
32345336630.010761com.google
42237157240.008252com.twitter
52213683650.006786com.youtube
62111524660.006404org.w
71959833890.003741com.instagram
81949947280.004616org.gmpg
919210640100.003396com.linkedin
1018557384140.002133com.wordpress
1118475084130.002729org.wordpress
1218409784250.001383org.wikipedia
1318366100240.001489com.gravatar
1418361154210.001623com.pinterest
1517992020120.002819com.bootstrapcdn
1617977192190.001774com.apple
1717931956320.001000com.blogspot
1817718492410.000758be.youtu
1917629132340.000918gl.goo
2017616074260.001308com.microsoft
2117591076160.001984com.googletagmanager
2217583026370.000840com.amazon
2317567490150.002013com.cloudflare
2417539566460.000658com.tumblr
2517512444230.001557com.adobe
2617478800280.001256com.vimeo
2717351964610.000482com.yahoo
2817328248220.001618com.macromedia
2917262608450.000669com.wp
3017247612360.000867com.paypal
3117245380200.001675com.github
3217243016270.001272com.gstatic
3317237096330.000928com.amazonaws
3417218638470.000617me.wp
3517175034520.000568org.mozilla
3617148704980.000318com.googleusercontent
3717124438400.000793co.t
3817114262750.000425com.weebly
39170882261150.000256com.nytimes
4017080324390.000796net.cloudfront
4117058308780.000424org.creativecommons
4217046746510.000584org.w3
43170344461500.000171org.wikimedia
4416978138650.000465com.medium
4516977018590.000490com.flickr
4616969818600.000482ly.bit
4716934330380.000830io.github
48169255641390.000188net.slideshare
49169079181490.000171com.theguardian
5016886648710.000433com.jquery
51168480681330.000196com.imgur
52168380701650.000145com.myspace
5316825278530.000551eu.europa
54168139681590.000152com.imdb
5516800572350.000899net.fbcdn
56167515371270.000208com.issuu
57167312961130.000267org.apache
5816730104300.001101net.doubleclick
59167130841880.000124com.tinyurl
60166901642780.000085com.theverge
61166862241050.000288com.reddit
62166815261200.000229com.yelp
6316673945170.001886com.wixstatic
64166664272600.000091com.appspot
65166612482740.000086com.buzzfeed
66166602871430.000182com.oracle
67166536601340.000191com.spotify
68166428502410.000097me.about
69166412621230.000216com.android
70166351423350.000071org.chromium
711661984570.004633com.godaddy
72166191181550.000164com.tripadvisor
7316611356310.001041com.squarespace
74166098652850.000083com.mysql
75165999433080.000076com.about
76165931643460.000067org.arxiv
77165922991310.000204org.ietf
7816588645950.000343com.soundcloud
79165838513370.000071edu.upenn
80165815914350.000057edu.princeton
81165748033580.000066org.ieee
82165737291560.000164org.gnu
83165673421240.000215com.dropbox
84165671263240.000073com.deviantart
85165438361480.000174com.forbes
86165283341300.000206com.whatsapp
8716521358870.000400com.statcounter
88165212544270.000058google.blog
89165204604300.000057com.ssrn
9016519609620.000481org.schema
91165167901320.000199org.archive
92165111141440.000182net.sourceforge
93165024311800.000130com.cnn
94164991283200.000074gov.loc
95164893331910.000121com.foursquare
96164883801890.000122edu.stanford
9716487861880.000386com.bing
98164837083710.000065edu.ucla
99164798523890.000062com.stackexchange
100164789364940.000050edu.gatech
101164784084140.000059org.sciencemag
102164761801830.000129com.dribbble
103164754712030.000109com.nbcnews
104164709334130.000059com.withgoogle
105164500542550.000092com.example
106164479773590.000066com.googlecode
107164439871080.000278com.ytimg
108164386571690.000142uk.co.bbc
109164367612360.000098edu.mit
110164282942300.000099com.mozilla
111164243522240.000102com.githubusercontent
112164146274580.000054com.sap
113164122535650.000046com.flipboard
114164119582050.000109com.washingtonpost
115164084021460.000176com.blogger
116164068975460.000047com.chrome
11716404143490.000590com.fb
118164027075710.000045edu.utah
119163981224020.000060com.jetbrains
120163961634920.000050com.chron
121163954493220.000073com.git-scm
122163946551860.000125com.huffingtonpost
123163939012230.000102com.businessinsider
12416393651970.000330com.wix
12516391015890.000380com.paypalobjects
126163798471220.000225org.bbb
127163795472430.000097com.live
128163726702870.000082gov.fda
129163726412800.000084au.com.google
13016372405540.000524com.list-manage
131163710022860.000082edu.harvard
132163670165320.000047com.fastcodesign
133163661463650.000066com.tinypic
134163649371940.000117com.wsj
135163475624100.000059tv.ustream
136163441052980.000080com.cnet
137163427653340.000071com.bbc
138163399483870.000062com.variety
139163392484070.000060org.eclipse
140163386364410.000056co.g
141163368193040.000078com.reuters
142163331162580.000091org.doi
143163262282620.000090com.ibm
144163211632660.000088com.wired
145163189573120.000076uk.co.telegraph
146163172741970.000112com.typepad
147163172623490.000067com.gmail
148163163424180.000058org.iana
149163097962690.000087com.bloomberg
150163093632480.000095net.windows
151163083941040.000290com.shopify
152163049614560.000054co.ibb
153163048911810.000129com.stackoverflow
154163032802400.000097com.techcrunch
15516297167550.000519net.akamaihd
156162966752720.000087com.go
157162962171540.000166gov.nih
158162895193940.000061gov.nasa
159162886953390.000071com.msn
160162875253520.000067com.latimes
161162854231620.000147com.etsy
162162823101090.000274com.google-analytics
163162822865080.000049edu.rutgers
164162822325450.000047ca.utoronto
165162756721700.000142com.twimg
166162755821030.000293com.mailchimp
16716274598900.000378de.google
168162712402650.000088org.acm
169162687073660.000066com.mashable
170162674304980.000050com.quora
171162647464160.000058au.gov.nsw
172162641481160.000242com.jimdo
17316261109500.000589com.fontawesome
174162526685500.000047com.vogue
175162516424670.000053com.zdnet
176162508183570.000067uk.co.dailymail
177162477306630.000044com.hbo
178162476214470.000055com.googleblog
179162456807610.000039com.dezeen
180162446872770.000085com.usatoday
181162443201580.000162com.eventbrite
182162431635820.000045edu.osu
183162394592630.000090com.meetup
184162294624290.000058gov.archives
185162241804500.000055edu.cornell
186162231754610.000053edu.berkeley
187162187633960.000061com.ted
188162179361510.000170com.opera
189162149805810.000045edu.washington
190162113632990.000080com.udacity
191162083635800.000045org.hrw
192161978692080.000107com.surveymonkey
193161954993160.000075com.time
194161924494860.000051com.ecwid
195161874384090.000060com.kickstarter
196161874073210.000074org.npr
197161873446960.000042com.discogs
198161811087000.000042io.itch
199161778074960.000050org.unicode
200161777483130.000076com.springer
20116176015290.001149ru.yandex
202161742314460.000055org.kernel
203161731993700.000065com.aol
204161730597010.000042com.economist
205161714042900.000081com.hp
206161689832310.000099com.mapquest
20716167485480.000602com.qq
208161637857580.000039org.wikibooks
209161605183620.000066com.cnbc
210161545403900.000062org.un
211161527693330.000072org.python
212161524224880.000051com.ft
213161514742100.000107org.drupal
214161489814010.000060me.paypal
215161487406900.000042com.strava
216161481524170.000058com.angieslist
217161442222670.000088com.hubspot
218161424741360.000191com.zendesk
219161414035040.000049org.aarp
220161399273640.000066com.giphy
221161380247410.000040org.amnesty
222161360865520.000046com.yellowpages
223161336163430.000069com.nypost
224161327977670.000038com.wikia
225161322237140.000041com.dropboxusercontent
226161316154190.000058com.fortune
22716128988700.000439net.jsfiddle
228161284283300.000072com.wiley
22916127117910.000355com.baidu
230161264492010.000110uk.co.amazon
231161246485090.000049com.unsplash
232161233351450.000179uk.co.google
233161228603610.000066com.prnewswire
234161190938210.000037com.slate
235161178004820.000051com.cisco
236161143233530.000067com.photobucket
237161120365610.000046com.venturebeat
238161111668730.000036com.pixabay
239161084369760.000034com.arstechnica
240161056541980.000111org.purl
241161026422060.000108com.ebay
242161012427980.000038com.manta
243160992231370.000189com.wixsite
244160983187020.000042com.intel
245160975986850.000043com.nationalgeographic
246160963994420.000056com.entrepreneur
247160902874050.000060gov.whitehouse
248160900764590.000054com.nature
249160898023190.000074com.oreilly
250160883193760.000064com.office
251160871485760.000045com.samsung
25216084044570.000494com.vk
253160828144790.000052com.matterport
254160805534750.000052org.postgresql
255160780466010.000045com.newyorker
256160752552970.000081gov.cdc
257160751191730.000138com.constantcontact
258160739336970.000042com.vice
259160731138290.000037edu.psu
2601607140211360.000031com.gizmodo
261160714025510.000047com.scribd
262160703539230.000035com.qz
263160686163560.000067org.ampproject
264160686035570.000046gov.nist
265160683562940.000081me.telegram
266160670254900.000051com.wikihow
267160663868120.000037ly.snip
268160661422200.000104com.disqus
269160656949730.000034edu.yale
270160629224740.000052com.cbsnews
271160622647790.000038edu.kit
272160602416070.000044org.eff
273160597865830.000045com.box
274160593282370.000098net.php
275160579171260.000209com.feedburner
276160571794760.000052com.theatlantic
277160556828280.000037com.engadget
278160520132640.000089gov.ftc
279160475077910.000038com.merchantcircle
280160449012520.000093com.digg
281160441414480.000055org.hbr
282160426607070.000041org.nodejs
283160421894530.000055com.inc
284160408923740.000064com.images-amazon
285160397453790.000064com.skype
286160388012120.000107com.salesforce
287160382917160.000041com.statista
2881603536414210.000027edu.utexas
289160342942930.000081com.staticflickr
290160336002910.000081com.fastcompany
2911603303714390.000027com.pexels
292160304766940.000042edu.columbia
293160286188760.000036com.marketwatch
294160271536000.000045com.avvo
2951602434914360.000027com.storify
296160239863400.000070int.who
297160232921060.000284com.addthis
298160213207750.000038com.indiatimes
2991601617714450.000026com.thinkwithgoogle
300160157214060.000060org.maven
301160143994490.000055com.w3schools
3021601365814080.000028com.smashingmagazine
303160121648780.000036com.mysanantonio
304160117143720.000064co.elastic
305160117052150.000105com.stumbleupon
306160112292260.000101to.amzn
3071600818314920.000025edu.purdue
308160065451950.000116net.behance
309160060385600.000046org.pbs
31016005359630.000476me.fb
311160033753020.000079com.googlesyndication
312160029949690.000034au.net.abc
3131600295116010.000022com.vanityfair
314  16002791  499  0.000050  com.slack
315  16001492  270  0.000087  gov.ca
316  15999972  311  0.000076  com.tripod
317  15995276  338  0.000071  com.sxsw
318  15993036  408  0.000060  uk.co.blogspot
319  15990251  141  0.000185  com.weibo
320  15987681  684  0.000043  net.researchgate
321  15985567  1415  0.000027  com.alexa
322  15984335  355  0.000067  com.dailymotion
323  15982835  1401  0.000028  edu.ucsd
324  15982138  686  0.000042  com.blackberry
325  15981891  1014  0.000033  org.worldbank
326  15979100  315  0.000075  fr.free
327  15978004  472  0.000052  net.leadpages
328  15976584  974  0.000034  com.thenextweb
329  15971688  554  0.000046  com.moz
330  15970463  1699  0.000020  org.owasp
331  15969404  360  0.000066  com.sciencedirect
332  15968776  762  0.000039  com.uservoice
333  15968498  1003  0.000033  com.shutterstock
334  15965774  375  0.000064  edu.cmu
335  15962210  176  0.000134  org.icann
336  15961179  732  0.000040  com.proofpoint
337  15958400  903  0.000035  edu.uark
338  15957323  1082  0.000032  com.evernote
339  15956890  373  0.000064  com.livejournal
340  15955233  691  0.000042  com.googlesource
341  15951480  1006  0.000033  ly.ow
342  15949779  589  0.000045  gov.sec
343  15946301  955  0.000034  com.speakerdeck
344  15944949  1351  0.000029  com.lifehacker
345  15941679  584  0.000045  com.citysearch
346  15941304  879  0.000035  org.unesco
347  15940525  814  0.000037  com.psychologytoday
348  15937813  1319  0.000031  com.trello
349  15937606  913  0.000035  com.sfgate
350  15936285  994  0.000033  com.designobserver
351  15934103  1536  0.000024  edu.northwestern
352  15933505  457  0.000054  com.snapchat
353  15932071  1320  0.000031  uk.ac.ox
354  15931696  671  0.000043  tv.twitch
355  15931551  1021  0.000032  gov.fcc
356  15930923  678  0.000043  org.bitbucket
357  15929820  1778  0.000019  com.fifa
358  15929724  412  0.000059  com.businesswire
359  15928928  803  0.000037  org.aiga
360  15926617  244  0.000096  com.wufoo
361  15926593  569  0.000045  com.atlassian
362  15926203  214  0.000106  de.amazon
363  15925496  327  0.000072  com.typeform
364  15924766  1603  0.000022  com.mcafee
365  15922998  1047  0.000032  com.libsyn
366  15922233  1878  0.000017  org.coursera
367  15921752  916  0.000035  com.zynga
368  15921549  961  0.000034  com.kudzu
369  15921529  1852  0.000018  com.semrush
370  15920541  674  0.000043  com.ubuntu
371  15920148  1673  0.000021  com.econsultancy
372  15918200  1440  0.000027  com.indiegogo
373  15917275  1383  0.000028  com.politico
374  15917261  295  0.000081  org.mediawiki
375  15916739  754  0.000039  org.aclweb
376  15916639  963  0.000034  com.deloitte
377  15914854  930  0.000034  org.spie
378  15914760  981  0.000033  com.livestream
379  15912203  1449  0.000026  co.vine
380  15910119  1553  0.000023  org.khanacademy
381  15908826  516  0.000048  com.goodreads
382  15908339  989  0.000033  gov.uspto
383  15907762  303  0.000079  org.joomla
384  15906992  1398  0.000028  com.zoho
385  15902686  908  0.000035  me.websta
386  15901516  825  0.000037  com.foxnews
387  15900302  350  0.000067  com.booking
388  15899160  1013  0.000033  io.codepen
389  15898382  129  0.000206  com.youtube-nocookie
390  15897026  163  0.000146  jp.co.yahoo
391  15896378  1613  0.000022  edu.unc
392  15896141  1652  0.000021  com.technologyreview
393  15894277  1709  0.000020  com.digitaltrends
394  15893383  1217  0.000031  org.iso
395  15893229  1569  0.000023  com.pingdom
396  15893117  914  0.000035  gov.senate
397  15892860  289  0.000082  com.smugmug
398  15890595  199  0.000111  com.bandcamp
399  15889383  975  0.000034  com.mckinsey
400  15888601  920  0.000035  it.binged
401  15888102  1389  0.000028  com.udemy
402  15885995  991  0.000033  com.what3words
403  15885247  2561  0.000012  com.sophos
404  15884644  1622  0.000022  org.weforum
405  15884560  380  0.000064  net.themeforest
406  15884272  626  0.000044  gov.noaa
407  15882777  1896  0.000017  com.ehow
408  15881215  718  0.000041  org.vim
409  15879594  1490  0.000025  com.elpais
410  15879197  1343  0.000030  com.sciencedaily
411  15879074  445  0.000056  com.squareup
412  15879052  816  0.000037  com.gartner
413  15877064  439  0.000056  com.netflix
414  15873982  470  0.000053  com.webs
415  15873498  271  0.000087  com.rawgit
416  15873485  1035  0.000032  edu.uah
417  15873257  1710  0.000020  uk.co.wired
418  15871503  463  0.000053  com.bizjournals
419  15871161  1002  0.000033  com.americanexpress
420  15870167  1568  0.000023  org.pnas
421  15869151  433  0.000057  com.monster
422  15869106  1024  0.000032  com.nielsen
423  15866670  1317  0.000031  com.redhat
424  15866441  667  0.000044  com.java
425  15865156  76  0.000425  org.reactjs
426  15864692  1949  0.000017  ch.ethz
427  15862191  400  0.000060  com.force
428  15861930  404  0.000060  com.herokuapp
429  15861377  1798  0.000019  com.socialmediaexaminer
430  15860803  1108  0.000031  com.adage
431  15860212  892  0.000035  com.googledrive
432  15859653  1899  0.000017  com.tutsplus
433  15858284  114  0.000263  jp.co.google
434  15857073  1696  0.000020  edu.usc
435  15856570  984  0.000033  com.prweb
436  15856191  760  0.000039  gov.justice
437  15855748  1481  0.000025  com.playstation
438  15855432  1734  0.000020  com.canva
439  15854961  514  0.000049  us.icio
440  15852813  172  0.000138  com.xing
441  15850197  866  0.000036  re.cli
442  15850114  1572  0.000023  edu.uchicago
443  15849567  1400  0.000028  com.bostonglobe
444  15848723  801  0.000038  com.steampowered
445  15844441  292  0.000081  ca.google
446  15844259  437  0.000057  com.bigcartel
447  15843423  2150  0.000015  com.urbandictionary
448  15842728  844  0.000036  io.material
449  15841425  481  0.000051  com.bigcommerce
450  15839196  1540  0.000024  com.caniuse
451  15838309  245  0.000096  com.getclicky
452  15834448  1384  0.000028  com.dell
453  15834375  808  0.000037  gov.state
454  15834214  1732  0.000020  com.hotmail
455  15833672  250  0.000094  es.google
456  15831338  1692  0.000021  au.com.smh
457  15830767  1632  0.000022  com.upwork
458  15830199  737  0.000040  org.gnupg
459  15829812  998  0.000033  edu.utep
460  15829095  354  0.000067  com.stripe
461  15828852  901  0.000035  com.msdn
462  15828183  422  0.000058  com.adweek
463  15826915  1701  0.000020  com.codeplex
464  15826257  2005  0.000016  ca.uwaterloo
465  15825896  107  0.000283  org.networkadvertising
466  15824996  2475  0.000013  com.twitpic
467  15823630  1375  0.000029  uk.ac.cam
468  15823242  225  0.000101  com.myshopify
469  15823088  1752  0.000019  com.nike
470  15822418  845  0.000036  com.outlook
471  15822314  1498  0.000025  com.gettyimages
472  15821233  1341  0.000030  com.istockphoto
473  15820921  1189  0.000031  de.heise
474  15819474  1600  0.000022  com.marketo
475  15818475  520  0.000048  com.cargocollective
476  15818334  1368  0.000029  ca.blogspot
477  15817231  1990  0.000016  com.norton
478  15815744  1459  0.000026  de.spiegel
479  15814626  846  0.000036  jp.co.fujixerox
480  15813630  997  0.000033  com.chicagotribune
481  15812887  1807  0.000018  com.ikea
482  15812477  1550  0.000023  com.ning
483  15812254  2052  0.000016  com.crunchbase
484  15811034  699  0.000042  com.webmd
485  15808826  202  0.000110  com.windowsphone
486  15808375  1521  0.000024  com.scientificamerican
487  15808267  239  0.000097  com.getbootstrap
488  15808212  2567  0.000012  com.codecademy
489  15807983  1099  0.000031  edu.alamo
490  15807399  507  0.000049  com.npmjs
491  15806866  1585  0.000022  com.billboard
492  15806166  1052  0.000032  com.theschooloflife
493  15805533  2014  0.000016  com.msnbc
494  15804337  2303  0.000014  com.instructables
495  15803920  725  0.000040  gov.copyright
496  15803640  1530  0.000024  uk.ac.ucl
497  15803572  1676  0.000021  fr.lemonde
498  15802334  925  0.000034  edu.umich
499  15800853  1378  0.000028  edu.wisc
500  15800747  140  0.000188  ru.mail
501  15800371  2358  0.000013  com.starwars
502  15797878  865  0.000036  de.blogspot
503  15797791  1508  0.000024  com.kissmetrics
504  15797047  1115  0.000031  com.beautifulpixels
505  15796947  1386  0.000028  com.airbnb
506  15796853  2451  0.000013  edu.hbs
507  15796325  166  0.000145  com.eepurl
508  15795637  768  0.000038  com.css-tricks
509  15795363  233  0.000098  com.bitly
510  15794822  1615  0.000022  edu.jhu
511  15793691  1362  0.000029  com.alibaba
512  15792654  1164  0.000031  com.sun
513  15792271  772  0.000038  com.tandfonline
514  15791593  893  0.000035  com.underconsideration
515  15790633  518  0.000048  in.co.google
516  15789087  793  0.000038  com.uber
517  15788507  704  0.000042  com.photoshelter
518  15787332  566  0.000046  com.symantec
519  15787196  2936  0.000010  uk.bl
520  15786855  683  0.000043  gov.hhs
521  15783209  807  0.000037  io.getmdl
522  15782831  1691  0.000021  com.irishtimes
523  15781874  2154  0.000015  edu.ncsu
524  15781333  1393  0.000028  com.searchenginejournal
525  15781203  67  0.000448  com.messenger
526  15780051  517  0.000048  org.sonatype
527  15778677  979  0.000033  ca.cbc
528  15778587  1592  0.000022  com.yandex
529  15777890  431  0.000057  com.clicky
530  15777427  1768  0.000019  com.hulu
531  15776625  1443  0.000026  com.accenture
532  15774420  1610  0.000022  edu.academia
533  15773314  528  0.000047  gov.epa
534  15772833  1403  0.000028  com.marketingland
535  15772472  972  0.000034  uk.co.guardian
536  15771759  2040  0.000016  tv.periscope
537  15769999  1616  0.000022  com.today
538  15768447  2453  0.000013  ly.visual
539  15766945  369  0.000065  edu.nyu
540  15766843  1409  0.000028  org.apa
541  15766561  2894  0.000011  com.girlswhocode
542  15766409  1546  0.000024  com.hollywoodreporter
543  15765504  549  0.000047  uk.co.independent
544  15764791  2378  0.000013  com.glamour
545  15764760  2131  0.000015  au.com.news
546  15764489  553  0.000046  gov.ed
547  15763374  1714  0.000020  com.invisionapp
548  15763321  2450  0.000013  org.gimp
549  15763104  709  0.000041  com.feedly
550  15763001  1322  0.000031  org.change
551  15761645  2072  0.000015  com.ibtimes
552  15761255  1598  0.000022  com.thomsonreuters
553  15760281  1517  0.000024  gov.nyc
554  15760200  1826  0.000018  com.posterous
555  15759019  943  0.000034  com.bravesites
556  15758123  3649  0.000008  com.space
557  15758105  1434  0.000027  gov.bls
558  15756979  443  0.000056  cn.com.sina
559  15756712  397  0.000061  com.custhelp
560  15755071  2389  0.000013  com.tesla
561  15753906  1476  0.000025  com.businessweek
562  15753434  774  0.000038  com.uk
563  15753186  1782  0.000019  com.zillow
564  15752235  1814  0.000018  com.zapier
565  15751997  2583  0.000012  com.dreamstime
566  15751546  3003  0.000010  com.klout
567  15750992  1623  0.000022  com.thehill
568  15750722  234  0.000098  com.wpengine
569  15750076  2978  0.000010  com.rottentomatoes
570  15749795  2693  0.000012  com.campaignmonitor
571  15749130  1629  0.000022  uk.ac.ed
572  15748626  2246  0.000014  com.wikidot
573  15748387  2252  0.000014  com.123rf
574  15748038  217  0.000105  fr.google
575  15747873  1611  0.000022  com.intuit
576  15747479  1641  0.000021  org.letsencrypt
577  15746782  875  0.000036  com.questionpro
578  15744807  664  0.000044  com.gotowebinar
579  15744526  1987  0.000016  com.nokia
580  15742939  2658  0.000012  edu.brown
581  15742494  3600  0.000008  com.formula1
582  15742364  2184  0.000014  com.mentalfloss
583  15742342  451  0.000055  gov.irs
584  15742266  491  0.000050  net.openid
585  15740664  1688  0.000021  com.nba
586  15739222  1593  0.000022  org.pewresearch
587  15738724  2222  0.000014  com.aljazeera
588  15738356  1058  0.000032  com.ezlocal
589  15737381  1437  0.000027  org.altervista
590  15737002  1478  0.000025  in.blogspot
591  15736320  279  0.000084  it.placehold
592  15735548  3233  0.000009  edu.uic
593  15735280  2251  0.000014  com.programmableweb
594  15735247  2153  0.000015  com.cbs
595  15734807  1153  0.000031  gov.sba
596  15734218  2198  0.000014  com.techradar
597  15734158  826  0.000037  gov.census
598  15733513  1747  0.000019  org.postimg
599  15732878  506  0.000049  gov.usda
600  15732533  1535  0.000024  com.target
601  15731343  721  0.000041  com.docker
602  15731122  1519  0.000024  com.gigaom
603  15731013  2800  0.000011  com.oxforddictionaries
604  15728172  1693  0.000021  net.daum
605  15727989  962  0.000034  com.gofundme
606  15727980  1639  0.000022  kr.flic
607  15726681  1156  0.000031  com.formstack
608  15726256  763  0.000039  org.sqlite
609  15725436  1661  0.000021  com.autodesk
610  15724812  1396  0.000028  com.techrepublic
611  15724806  817  0.000037  com.patreon
612  15721240  970  0.000034  com.insiderpages
613  15721227  1795  0.000019  com.us
614  15720198  1031  0.000032  com.hotfrog
615  15720166  966  0.000034  com.whitepages
616  15719605  1900  0.000017  edu.illinois
617  15719578  1642  0.000021  com.pwc
618  15718460  2039  0.000016  edu.asu
619  15718391  2486  0.000013  com.animoto
620  15717320  249  0.000094  com.fc2
621  15717052  1825  0.000018  org.rubyonrails
622  15716625  742  0.000040  com.wunderground
623  15716037  213  0.000106  org.debian
624  15715960  1124  0.000031  org.cmlibrary
625  15715829  1062  0.000032  com.idt
626  15715705  1359  0.000029  com.investopedia
627  15715452  1856  0.000018  com.howstuffworks
628  15714753  1326  0.000030  org.redcross
629  15714617  1493  0.000025  com.indeed
630  15713762  2101  0.000015  com.lonelyplanet
631  15713705  2054  0.000016  com.gamespot
632  15713431  910  0.000035  gov.nps
633  15713159  1084  0.000032  com.thesprintbook
634  15712729  1141  0.000031  com.smartguy
635  15711951  832  0.000037  com.att
636  15711660  2049  0.000016  com.refinery29
637  15709292  522  0.000048  com.vendio
638  15709144  2851  0.000011  com.domaintools
639  15708842  874  0.000036  com.itsnicethat
640  15707939  1801  0.000018  org.filezilla-project
641  15707760  1395  0.000028  com.vmware
642  15707005  171  0.000139  it.google
643  15706361  3994  0.000007  com.boredpanda
644  15705364  1391  0.000028  gov.va
645  15705335  849  0.000036  com.pinimg
646  15704965  1428  0.000027  com.reverbnation
647  15704604  2016  0.000016  ca.ubc
648  15704346  1995  0.000016  com.nfl
649  15703768  666  0.000044  com.houzz
650  15703700  1516  0.000024  com.prezi
651  15703074  1912  0.000017  edu.indiana
652  15702174  3049  0.000010  com.hubpages
653  15701522  436  0.000057  com.nasdaq
654  15701359  2734  0.000011  com.9to5mac
655  15701239  1579  0.000023  com.pcworld
656  15700785  1824  0.000018  edu.ucdavis
657  15700731  1416  0.000027  gov.usgs
658  15700075  886  0.000035  com.500px
659  15699652  1001  0.000033  com.acninc
660  15699448  2157  0.000015  com.livestrong
661  15699048  1328  0.000030  org.oecd
662  15698519  2267  0.000014  com.newscientist
663  15697206  1846  0.000018  com.espn
664  15697101  1484  0.000025  edu.umn
665  15697074  1703  0.000020  com.freepik
666  15696322  1902  0.000017  edu.virginia
667  15694887  1605  0.000022  com.vox
668  15694606  1858  0.000018  com.deadline
669  15693525  483  0.000051  org.whatbrowser
670  15692991  1499  0.000025  com.mixcloud
671  15691728  847  0.000036  com.emarketer
672  15691597  1360  0.000029  fr.blogspot
673  15691539  695  0.000042  com.flippa
674  15691203  256  0.000092  com.elegantthemes
675  15690356  1590  0.000022  com.newsweek
676  15689675  2170  0.000015  com.getresponse
677  15688589  460  0.000054  io.atom
678  15688584  1700  0.000020  com.gallup
679  15688318  2187  0.000014  edu.bu
680  15687369  2815  0.000011  org.moma
681  15686394  1888  0.000017  com.findlaw
682  15683775  1542  0.000024  edu.si
683  15683516  2094  0.000015  com.pastebin
684  15682917  1155  0.000031  dk.fcm
685  15682640  1547  0.000024  com.globo
686  15682617  368  0.000065  org.openstreetmap
687  15681889  1142  0.000031  org.writersleague
688  15680577  1884  0.000017  edu.cuny
689  15680551  1925  0.000017  com.starbucks
690  15680465  1447  0.000026  com.warnerbros
691  15679238  2075  0.000015  com.socialmediatoday
692  15678966  1150  0.000031  com.prosperent
693  15678673  1114  0.000031  org.grayarea
694  15678448  1984  0.000016  org.aclu
695  15677879  739  0.000040  org.jenkins-ci
696  15674592  2002  0.000016  com.mercurynews
697  15674443  1552  0.000023  com.business2community
698  15674402  1836  0.000018  mp.j
699  15674363  4368  0.000007  com.petapixel
700  15673782  2630  0.000012  com.googlepages
701  15673492  1894  0.000017  com.hostgator
702  15673279  745  0.000039  com.geocities
703  15672825  1336  0.000030  org.mayoclinic
704  15672261  167  0.000143  gov.privacyshield
705  15671234  1049  0.000032  com.ycombinator
706  15670571  1496  0.000025  net.java
707  15670017  1463  0.000026  us.imageshack
708  15669905  2432  0.000013  com.psychcentral
709  15669061  1624  0.000022  com.boston
710  15667800  1509  0.000024  org.fao
711  15666438  1980  0.000016  edu.arizona
712  15665639  1581  0.000023  com.nydailynews
713  15665192  1832  0.000018  de.welt
714  15665099  238  0.000098  com.youku
715  15664693  1915  0.000017  com.salon
716  15664554  2365  0.000013  edu.gmu
717  15663687  1017  0.000032  com.aweber
718  15663661  242  0.000097  jp.co.amazon
719  15663656  2299  0.000014  com.yourdomain
720  15662150  2021  0.000016  com.domain
721  15661511  2285  0.000014  com.ew
722  15659706  1149  0.000031  com.collegian
723  15659043  796  0.000038  org.elasticsearch
724  15658487  1380  0.000028  com.mlb
725  15658455  899  0.000035  com.delicious
726  15658257  2239  0.000014  ca.ualberta
727  15657830  3265  0.000009  org.edx
728  15655920  988  0.000033  google.design
729  15655666  2776  0.000011  org.kiva
730  15654526  1410  0.000028  com.weather
731  15654299  1837  0.000018  net.codecanyon
732  15654288  2743  0.000011  com.lynda
733  15654205  1503  0.000024  com.merriam-webster
734  15654147  1042  0.000032  com.womentechmakers
735  15654091  1065  0.000032  net.brownbook
736  15653789  986  0.000033  com.hootsuite
737  15653450  3979  0.000007  com.lmgtfy
738  15652070  426  0.000058  com.ea
739  15651918  1779  0.000019  edu.umd
740  15651858  1497  0.000025  com.thedrum
741  15650961  1735  0.000020  com.aliexpress
742  15650884  204  0.000109  com.automattic
743  15650243  1514  0.000024  int.coe
744  15650019  2247  0.000014  org.openoffice
745  15649988  1658  0.000021  com.firefox
746  15649772  1599  0.000022  com.searchenginewatch
747  15649620  1853  0.000018  com.zazzle
748  15648960  2027  0.000016  com.gq
749  15648656  1574  0.000023  org.cambridge
750  15648651  1904  0.000017  edu.msu
751  15647741  444  0.000056  com.barnesandnoble
752  15647388  2149  0.000015  com.azcentral
753  15647181  2429  0.000013  edu.wustl
754  15647110  2554  0.000012  org.semanticscholar
755  15646960  1887  0.000017  edu.umass
756  15646650  1555  0.000023  fm.last
757  15646167  2060  0.000016  au.com.blogspot
758  15645607  1191  0.000031  site.tenerifeforum
759  15645549  3140  0.000010  com.copyblogger
760  15645390  1102  0.000031  uk.gov.peterborough
761  15644870  2289  0.000014  com.topsy
762  15644590  897  0.000035  com.unity3d
763  15644472  1628  0.000022  com.over-blog
764  15643911  1501  0.000025  com.waze
765  15642164  2320  0.000014  com.gawker
766  15642103  2466  0.000013  ms.1drv
767  15641828  1370  0.000029  com.timeanddate
768  15641339  3477  0.000009  com.answers
769  15641169  1325  0.000030  com.arcgis
770  15640859  794  0.000038  com.clkmg
771  15639069  1080  0.000032  com.cbslocal
772  15638930  2576  0.000012  org.phys
773  15638348  1695  0.000021  com.stitcher
774  15637681  1663  0.000021  com.gumroad
775  15637217  1366  0.000029  gov.fbi
776  15637050  2334  0.000013  com.fiverr
777  15636227  1800  0.000019  com.lulu
778  15635676  1671  0.000021  com.rollingstone
779  15635417  1880  0.000017  com.nvidia
780  15635094  2702  0.000012  com.headspace
781  15634767  341  0.000070  org.opensource
782  15634400  1684  0.000021  com.neilpatel
783  15633999  1771  0.000019  uk.co.metro
784  15633208  990  0.000033  jp.ac.kobe-u
785  15632996  1741  0.000020  com.mtv
786  15632518  56  0.000499  net.facebook
787  15632516  2649  0.000012  edu.tufts
788  15632419  915  0.000035  br.com.uol
789  15631894  2562  0.000012  com.fox
790  15631850  1057  0.000032  com.brightcove
791  15631313  1505  0.000024  com.sky
792  15630776  2933  0.000010  com.popsci
793  15630301  3389  0.000009  com.wolfram
794  15627572  2434  0.000013  com.theonion
795  15627567  1348  0.000029  org.readthedocs
796  15627335  1927  0.000017  com.trendmicro
797  15626654  388  0.000062  com.marriott
798  15626569  342  0.000070  nl.google
799  15626305  2721  0.000011  edu.caltech
800  15625896  1060  0.000032  com.2findlocal
801  15625270  1472  0.000025  uk.co.theregister
802  15625157  891  0.000035  uk.co.eventbrite
803  15625156  1121  0.000031  com.fotolia
804  15624596  1849  0.000018  com.history
805  15624274  306  0.000077  com.naver
806  15623770  2985  0.000010  edu.dartmouth
807  15623624  1606  0.000022  com.bmj
808  15623028  2515  0.000012  ch.cern
809  15622919  1914  0.000017  it.scoop
810  15621936  1357  0.000029  com.walmart
811  15621746  1930  0.000017  org.kde
812  15621344  1898  0.000017  com.nrf
813  15619330  1649  0.000021  im.gitter
814  15619286  2379  0.000013  com.bestbuy
815  15619283  473  0.000052  com.iconfinder
816  15618356  1866  0.000018  org.jstor
817  15618109  1377  0.000028  com.searchengineland
818  15616272  184  0.000128  jp.ne.hatena
819  15615800  1543  0.000024  com.splashthat
820  15614563  3110  0.000010  org.notepad-plus-plus
821  15614110  1627  0.000022  com.com
822  15613738  1529  0.000024  org.heart
823  15612896  2529  0.000012  edu.uiuc
824  15612666  2730  0.000011  com.fitbit
825  15611859  1026  0.000032  com.company
826  15610954  2489  0.000012  com.wikispaces
827  15610875  1541  0.000024  com.cafepress
828  15610542  1738  0.000020  com.ssllabs
829  15610139  2352  0.000013  de.bild
830  15608795  69  0.000447  com.parallels
831  15608630  917  0.000035  gov.usa
832  15608624  1806  0.000018  com.buffer
833  15608543  1966  0.000016  com.discordapp
834  15607778  1206  0.000031  com.infusionsoft
835  15607523  2031  0.000016  edu.uci
836  15607224  838  0.000036  org.openweathermap
837  15606632  3159  0.000010  gd.is
838  15605502  182  0.000129  jp.ameblo
839  15604837  900  0.000035  com.cdbaby
840  15604598  1000  0.000033  com.newsbank
841  15604393  1815  0.000018  com.deezer
842  15603804  1822  0.000018  com.discovery
843  15602784  765  0.000038  org.doxygen
844  15602226  1030  0.000032  org.travelblog
845  15602213  1034  0.000032  org.tpr
846  15601034  428  0.000058  net.launchpad
847  15600189  777  0.000038  com.sagepub
848  15598493  1059  0.000032  com.chamberofcommerce
849  15598065  510  0.000049  com.cracked
850  15597590  749  0.000039  org.plos
851  15597251  4943  0.000006  com.checkpoint
852  15597031  1936  0.000017  uk.co.thesun
853  15597000  99  0.000302  com.namecheap
854  15596654  3162  0.000009  com.spreaker
855  15596199  1533  0.000024  com.xkcd
856  15593759  1331  0.000030  com.tableau
857  15593642  1488  0.000025  com.pcmag
858  15593497  1934  0.000017  edu.ufl
859  15591634  3453  0.000009  edu.buffalo
860  15591344  2771  0.000011  com.producthunt
861  15591279  3424  0.000009  org.lifehack
862  15591113  1977  0.000016  com.examiner
863  15591022  1073  0.000032  net.azurewebsites
864  15590091  2360  0.000013  com.bleacherreport
865  15589566  1016  0.000033  com.bizcommunity
866  15589420  996  0.000033  com.chambermaster
867  15589294  1147  0.000031  com.oup
868  15589126  1889  0.000017  com.thedailybeast
869  15588805  2640  0.000012  com.snopes
870  15588137  2084  0.000015  com.ign
871  15588059  3592  0.000008  com.appleinsider
872  15587571  1096  0.000031  com.lookuppage
873  15587541  2059  0.000016  com.mac
874  15587446  746  0.000039  com.usnews
875  15586928  669  0.000043  com.163
876  15586259  2966  0.000010  org.greenpeace
877  15586114  3672  0.000008  edu.temple
878  15586108  919  0.000035  com.tiddlywiki
879  15585180  1993  0.000016  de.zeit
880  15584327  1660  0.000021  com.strikingly
881  15584124  1854  0.000018  co.angel
882  15583419  2237  0.000014  com.yolasite
883  15583266  541  0.000047  com.1and1
884  15583228  1650  0.000021  com.windows
885  15583160  2454  0.000013  net.comcast
886  15582183  4541  0.000007  com.blog
887  15581911  1350  0.000029  com.shareasale
888  15581584  1103  0.000031  com.spoke
889  15580712  2676  0.000012  com.macrumors
890  15580235  2106  0.000015  com.si
891  15580055  2947  0.000010  com.avast
892  15579731  1104  0.000031  com.communitywalk
893  15579534  1039  0.000032  com.independent
894  15579484  1828  0.000018  it.blogspot
895  15578882  2235  0.000014  com.icloud
896  15578813  2227  0.000014  ca.sfu
897  15578331  1759  0.000019  edu.duke
898  15578134  1406  0.000028  gov.ny
899  15578058  3048  0.000010  edu.ucsc
900  15577482  1781  0.000019  com.lithium
901  15577447  3081  0.000010  com.marieclaire
902  15577247  890  0.000035  com.mariadb
903  15576962  3202  0.000009  com.brainyquote
904  15576768  2598  0.000012  ca.globalnews
905  15576034  2895  0.000011  edu.oregonstate
906  15575967  738  0.000040  es.com.blogspot
907  15575607  681  0.000043  fr.amazon
908  15575040  2570  0.000012  com.nintendo
909  15574989  153  0.000166  de.bund
910  15574629  2078  0.000015  com.popsugar
911  15574040  1116  0.000031  com.lacartes
912  15573641  1929  0.000017  com.angelfire
913  15573588  2067  0.000015  org.poynter
914  15573573  1071  0.000032  com.citysquares
915  15573417  2202  0.000014  com.movember
916  15573266  1882  0.000017  uk.ac.lse
917  15573017  1045  0.000032  com.thegreatdiscontent
918  15572917  2098  0.000015  org.wpmudev
919  15572164  2522  0.000012  com.fineartamerica
920  15572083  2493  0.000012  edu.vt
921  15571965  2833  0.000011  edu.hawaii
922  15571730  2171  0.000015  com.teenvogue
923  15571641  1531  0.000024  com.calendly
924  15571635  1558  0.000023  com.steamcommunity
925  15571429  3026  0.000010  org.thinkprogress
926  15571425  1435  0.000027  com.techtarget
927  15571248  2045  0.000016  com.blogtalkradio
928  15571188  687  0.000042  uk.co.tripadvisor
929  15571109  1526  0.000024  com.glassdoor
930  15570512  1544  0.000024  com.xbox
931  15570312  1381  0.000028  me.m
932  15569885  2218  0.000014  uk.co.express
933  15569288  1675  0.000021  uk.co.mirror
934  15568503  119  0.000232  info.aboutads
935  15568393  2508  0.000012  com.blogs
936  15568027  2115  0.000015  com.templatemonster
937  15567578  328  0.000072  com.netdna-ssl
938  15567289  1371  0.000029  gov.dol
939  15567230  1554  0.000023  org.unicef
940  15567188  1135  0.000031  com.netdna-cdn
941  15566390  411  0.000059  com.mapbox
942  15566061  1088  0.000032  com.americantowns
943  15565930  2456  0.000013  org.7-zip
944  15565290  3852  0.000008  com.thenation
945  15564772  853  0.000036  ca.amazon
946  15564624  4817  0.000006  com.depositphotos
947  15564555  2687  0.000012  edu.pitt
948  15564538  2603  0.000012  nl.uva
949  15564284  2111  0.000015  sg.com.google
950  15564236  1020  0.000032  com.galvanize
951  15563763  1043  0.000032  com.judysbook
952  15563330  1086  0.000032  org.twinery
953  15563027  1756  0.000019  com.timeout
954  15562787  1697  0.000020  com.mediafire
955  15561709  2086  0.000015  com.w3techs
956  15561676  1373  0.000029  com.ups
957  15561291  945  0.000034  gov.house
958  15560994  855  0.000036  io.pantheon
959  15560477  2395  0.000013  com.me
960  15559169  3207  0.000009  cc.tiny
961  15559098  1958  0.000016  com.apnews
962  15557883  3248  0.000009  org.code
963  15557727  537  0.000047  com.getpocket
964  15557527  672  0.000043  com.elsevier
965  15556848  729  0.000040  com.prestashop
966  15556833  2388  0.000013  com.homedepot
967  15556798  1446  0.000026  com.bufferapp
968  15556767  3416  0.000009  com.virustotal
969  15556551  1694  0.000021  com.outbrain
970  15556036  5022  0.000006  com.wechat
971  15555722  2340  0.000013  com.pandora
972  15555713  2301  0.000014  com.foxmovies
973  15555616  4140  0.000007  com.kpcb
974  15555485  2860  0.000011  com.lanyrd
975  15555314  724  0.000041  com.redbubble
976  15553924  3262  0.000009  org.catalyst
977  15553307  2096  0.000015  tech.ces
978  15553284  1591  0.000022  gov.wa
979  15553160  1477  0.000025  jp.blogspot
980  15552948  2110  0.000015  com.twilio
981  15552738  477  0.000052  mp.mailchi
982  15552671  3581  0.000008  com.biography
983  15552065  1932  0.000017  com.healthline
984  15551576  1085  0.000032  com.pacegallery
985  15551541  3002  0.000010  com.iconosquare
986  15550836  2000  0.000016  com.baltimoresun
987  15550166  2803  0.000011  com.imageshack
988  15549943  1945  0.000017  gov.uscourts
989  15549663  2875  0.000011  int.esa
990  15549147  2731  0.000011  com.virgin
991  15548392  4588  0.000006  com.diigo
992  15548390  1810  0.000018  com.people
993  15548386  1473  0.000025  se.haxx
994  15547759  1706  0.000020  com.visualstudio
995  15547737  3064  0.000010  com.freelancer
996  15547640  2307  0.000014  com.xerox
997  15547505  512  0.000049  com.myportfolio
998  15547438  1364  0.000029  es.amazon
999  15546424  3356  0.000009  com.complex
1000  15545977  469  0.000053  br.com.google

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

October 2018 crawl archive now available

The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th.

The October crawl contains 600 million new URLs, not contained in any crawl archive before. New URLs stem from:

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the May/June/July 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 10 links (“hops”) away from the home pages of the top 40 million domains of the webgraph dataset
  • a random sample of outlinks taken from WAT files of the September crawl
  • 15 million external links sampled from Wikipedia data dumps

Please note that the character set detection was not fully working for the first 13 segments of the October crawl – about 15% of the page captures in these segments have no charset and language assigned. More information is found in the bug report.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-43/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2018-43/segment.paths.gz          100
WARC files               CC-MAIN-2018-43/warc.paths.gz             56000   58.84
WAT files                CC-MAIN-2018-43/wat.paths.gz              56000   19.34
WET files                CC-MAIN-2018-43/wet.paths.gz              56000   8.22
Robots.txt files         CC-MAIN-2018-43/robotstxt.paths.gz        56000   0.21
Non-200 responses files  CC-MAIN-2018-43/non200responses.paths.gz  56000   1.78
URL index files          CC-MAIN-2018-43/cc-index.paths.gz         302     0.23
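
The prefix rule described above can be sketched in Python. This is a minimal illustration, not an official tool: expand_paths is a hypothetical helper name, and it assumes the listing has already been downloaded and is passed in as raw gzip bytes.

```python
import gzip
import io

# The two prefixes quoted in the text above.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress the bytes of a *.paths.gz listing and prepend `prefix` to each line.

    `expand_paths` is a hypothetical helper, not part of any Common Crawl tooling.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as listing:
        # Skip blank lines so a trailing newline does not yield a bare prefix.
        return [prefix + line.strip() for line in listing if line.strip()]
```

Calling expand_paths(data) on the bytes of a downloaded listing yields HTTP URLs, while expand_paths(data, S3_PREFIX) yields S3 paths.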

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-43/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

September 2018 crawl archive now available

The crawl archive for September 2018 is now available! It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th.

The following improvements and fixes to the data formats have been made:

  • the columnar index contains the content language of a web page as a new field. Please read the instructions below on how to upgrade your tools to read the newly added fields.
  • WARC revisit records (HTTP status 304) in the URL indexes do not include a field for the payload "digest" anymore. The corresponding column "content_digest" in the columnar index now contains null values.
  • we’ve fixed a bug in the WARC writer which added an extra line break (\r\n) between HTTP header and payload in WARC response records. See the announcement on our Google group for details. Thanks again to Greg Lindahl for discovering this bug!

The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from:

  • the continued seed donation of URLs from mixnode.com
  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the May/June/July 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 25 million domains of the webgraph dataset
  • a random sample taken from WAT files of the August crawl

New Fields in the Columnar URL Index

The columnar index has been updated to contain two new fields added to WARC and CDX files starting with the August crawl:

In addition, the column content_digest now contains null values.

The table schema in the cc-index-table project on GitHub has been updated to reflect these changes.

Please follow the instructions below to upgrade to the new schema for Spark, Athena/Presto or Hive. If you do not want to use the new fields, no action is required: the tools should continue to work with the old schema.

Spark

The property spark.sql.parquet.mergeSchema must be set to true, e.g. by running the Spark job with the command

spark-submit --conf spark.sql.parquet.mergeSchema=true ...

Note that enabling schema merging has a negative impact on the performance of Spark jobs, so you may want to enable it only if the new fields are required for your task.

Athena / Presto

Please create a new table using the updated schema. The old schema will continue to work but the new fields cannot be used. Further information can be found in the chapter about schema updates in the Athena documentation.

Hive

Schema evolution is supported since version 0.13. The procedure is essentially the same as for Athena: you need to drop and re-create the table with the updated schema only if you use the new fields.

Archive Location and Download

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-39/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2018-39/segment.paths.gz          100
WARC files               CC-MAIN-2018-39/warc.paths.gz             56320   47.9
WAT files                CC-MAIN-2018-39/wat.paths.gz              56320   18.5
WET files                CC-MAIN-2018-39/wet.paths.gz              56320   7.87
Robots.txt files         CC-MAIN-2018-39/robotstxt.paths.gz        56320   0.19
Non-200 responses files  CC-MAIN-2018-39/non200responses.paths.gz  56320   1.83
URL index files          CC-MAIN-2018-39/cc-index.paths.gz         302     0.22

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-39/. The columnar index has also been updated to include this crawl.

We are grateful to our friends at mixnode for donating a seed list of 200 million URLs to enhance the Common Crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

August Crawl Archive Introduces Language Annotations

The crawl archive for August 2018 is now available! It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd. Together with an upgrade of the crawler software, we’ve plugged in a language detector and now annotate each web page with the language it is written in.

Please note that the WARC files of August 2018 (CC-MAIN-2018-34) are affected by a WARC format error and contain an extra \r\n between HTTP header and payload content. Also the given "Content-Length" is off by 2 bytes. For more information about this bug see this post on our user forum.
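
For readers who need to work around this error, here is a minimal Python sketch. It assumes you already hold the HTTP block of a response record (HTTP header plus payload) from an affected file as bytes. strip_extra_crlf is a hypothetical helper name, and the check is only a heuristic: it cannot tell the surplus line break apart from a payload that legitimately starts with \r\n.

```python
def strip_extra_crlf(http_block: bytes) -> bytes:
    """Drop one surplus CRLF directly after the HTTP header terminator, if present.

    Hypothetical helper; heuristic only, since a payload may itself start with CRLF.
    """
    # The HTTP header ends at the first blank line (CRLF CRLF).
    head, sep, body = http_block.partition(b"\r\n\r\n")
    if sep and body.startswith(b"\r\n"):
        body = body[2:]  # remove the two extra bytes described above
    return head + sep + body
```

Blocks without the surplus line break, or without any header terminator at all, are returned unchanged.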

Language Annotations

We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 can identify 160 different languages and up to 3 languages per document. The ISO-639-3 codes of the detected languages are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage:

languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}
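
As an illustration, the JSON value of such a languages-cld2 entry can be reduced to the comma-separated form shown in the URL index. This is a sketch only: iso_codes and its min_coverage parameter are hypothetical names, and the JSON string is copied from the example above.

```python
import json

# JSON value of the languages-cld2 example shown above.
CLD2_EXAMPLE = (
    '{"reliable":true,"text-bytes":3783,"languages":['
    '{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},'
    '{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}'
)


def iso_codes(cld2_json, min_coverage=0.0):
    """Return the ISO-639-3 codes of a CLD2 result, in the order reported.

    `min_coverage` is a hypothetical filter on the "text-covered" share.
    """
    result = json.loads(cld2_json)
    return [
        lang["code-iso-639-3"]
        for lang in result["languages"]
        if lang["text-covered"] >= min_coverage
    ]


print(",".join(iso_codes(CLD2_EXAMPLE)))  # zho,eng
```

Raising min_coverage, for example to 0.5, keeps only languages that cover at least that share of the text, which leaves just zho for this record.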

On GitHub you’ll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics.

Please note that the columnar index does not contain the detected languages for now. This requires a change of the table schema. We plan to add the new fields later after we’ve verified that an update of the schema does not break common tools (e.g., Spark or Presto/Athena) used to process the table.

Crawler Software Upgrade and Minor Changes to WARC Files

Our crawler has been upgraded and is now based on the most recent version of Apache Nutch (1.15). The source code can be found on GitHub in our Nutch fork.

In conjunction with the crawler upgrade we made the following minor changes affecting the WARC record format of the crawl archives:

  • "HTTP 304 notmodified" responses are now stored as WARC revisit records in the "crawldiagnostics" subset along with 404s, redirects and other non-200 responses. For now the revisit records contain a payload digest although there is no payload sent together with HTTP 304 responses. The stupid reason is that the columnar index requires the digest field and we want to make sure that all tools continue to work as expected. The SHA-1 digest of an empty payload (zero bytes) is used for the revisit records.
  • All HTTP response headers are now preserved. As before, if the page content is truncated or was compressed or chunked during transfer, the headers "Content-Encoding", "Transfer-Encoding" and "Content-Length" need to be rewritten, otherwise WARC readers may fail reading the record payload. E.g., a page compressed on the HTTP protocol layer may have the following headers – the original headers are prefixed with X-Crawler-:
    X-Crawler-Content-Encoding: gzip
    X-Crawler-Content-Length: 2010
    Content-Length: 16125
    
  • The crawler may now also store pages fetched partially because of a network disconnect. These captures are marked as WARC-Truncated: disconnect in the WARC record header. Note that the crawler may also truncate the page payload because of a content limit (we store only 1 MB per page) or a time limit (after 10 minutes a page download is canceled).
  • The WARC record headers still indicate "WARC/1.0" although we follow version 1.1 of the WARC specification. While testing various WARC reader libraries we’ve found that at least two of them fail on records with a "WARC/1.1" header.
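Two of the points above can be illustrated with a short sketch. It computes the SHA-1 digest of an empty payload and recovers the original HTTP headers from their X-Crawler- prefixed copies. The helper names are hypothetical, the headers are assumed to be available as a plain dictionary, and the Base32 form is shown only on the assumption that the digest is written in the Base32 notation commonly seen in WARC files.

```python
import base64
import hashlib


def empty_payload_digest():
    """Return the SHA-1 of zero bytes as (hex string, Base32 string)."""
    sha1 = hashlib.sha1(b"")
    return sha1.hexdigest(), base64.b32encode(sha1.digest()).decode("ascii")


def original_headers(headers, prefix="X-Crawler-"):
    """Collect the headers as sent by the server.

    The crawler keeps the original value of a rewritten header under the
    same name with the given prefix, so stripping the prefix restores it.
    """
    return {
        name[len(prefix):]: value
        for name, value in headers.items()
        if name.startswith(prefix)
    }


# Headers of the compressed example page shown above.
stored = {
    "X-Crawler-Content-Encoding": "gzip",
    "X-Crawler-Content-Length": "2010",
    "Content-Length": "16125",
}

hex_digest, b32_digest = empty_payload_digest()
print(hex_digest)
print(original_headers(stored))
```

Here `Content-Length: 16125` describes the payload as stored in the WARC record, while the prefixed value 2010 is the length of the compressed transfer.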

Please note that, due to a bug, the first two crawled segments contain no robots.txt captures.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-34/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ gives you the S3 and HTTP paths respectively.
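As a rough sketch, the prefixing step could look like this in Python. The function `expand_paths` is a hypothetical helper, and the sample listing is fabricated in memory for illustration only; real listings contain the actual file paths of the crawl.

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and prefix every non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Stand-in for a downloaded listing; the path below is made up.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2018-34/segments/EXAMPLE/warc/EXAMPLE-00000.warc.gz\n"
)

for url in expand_paths(sample):
    print(url)
for url in expand_paths(sample, prefix=S3_PREFIX):
    print(url)
```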

Data type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2018-34/segment.paths.gz | 100 |
WARC files | CC-MAIN-2018-34/warc.paths.gz | 71520 | 67.79
WAT files | CC-MAIN-2018-34/wat.paths.gz | 71520 | 16.76
WET files | CC-MAIN-2018-34/wet.paths.gz | 71520 | 6.92
Robots.txt files | CC-MAIN-2018-34/robotstxt.paths.gz | 71520 | 0.19
Non-200 responses files | CC-MAIN-2018-34/non200responses.paths.gz | 71520 | 1.79
URL index files | CC-MAIN-2018-34/cc-index.paths.gz | 302 | 0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-34/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs May/June/July 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs).

Host-level graph

The graph consists of 886 million nodes and 5.4 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 793 million dangling nodes (89.5%), and the largest strongly connected component contains only 67 million nodes (7.5%). The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
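The host name normalization can be sketched in a few lines of Python. This is an illustration of the rule stated above, not the code used to build the graph; lowercasing and trimming a trailing dot are extra assumptions.

```python
def reverse_host(host):
    """Strip a leading "www." and reverse the dot-separated labels."""
    host = host.lower().rstrip(".")  # assumption: case and trailing dot ignored
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed name from the graph files back into a host name."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Note that the stripped www. cannot be restored: `unreverse_host` returns subdomain.example.com for the example above.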

You can download the graph and the ranks of all 886 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/ as prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl May/June/July 2018 host-level webgraph

Size | File | Description
5.60 GB | cc-main-2018-may-jun-jul-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 98 vertices files
25.12 GB | cc-main-2018-may-jun-jul-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 196 edges files
9.99 GB | cc-main-2018-may-jun-jul-host.graph | graph in BVGraph format
2 kB | cc-main-2018-may-jun-jul-host.properties |
11.30 GB | cc-main-2018-may-jun-jul-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2018-may-jun-jul-host-t.properties |
1 kB | cc-main-2018-may-jun-jul-host.stats | WebGraph statistics
13.35 GB | cc-main-2018-may-jun-jul-host-ranks.txt.gz | harmonic centrality and pagerank
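To make the vertices and edges formats more concrete, here is a small sketch that reads ⟨id, rev host⟩ and ⟨from_id, to_id⟩ records and lists the dangling nodes, taken here to be nodes without any outgoing edge. It assumes the decompressed files hold one tab-separated record per line; the toy data and function names are made up for illustration.

```python
def parse_vertices(lines):
    """Map node id -> reversed host name from "id<TAB>rev_host" lines."""
    vertices = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        vertices[int(node_id)] = rev_host
    return vertices


def dangling_nodes(vertices, edge_lines):
    """Return ids of nodes that never appear as the source of an edge."""
    has_outlink = set()
    for line in edge_lines:
        from_id, _to_id = line.rstrip("\n").split("\t")
        has_outlink.add(int(from_id))
    return sorted(set(vertices) - has_outlink)


# Toy data in the assumed format (not taken from the real files).
vertex_lines = ["0\tcom.example", "1\tcom.example.blog", "2\torg.example"]
edge_lines = ["0\t1", "0\t2", "1\t0"]

vertices = parse_vertices(vertex_lines)
for node_id in dangling_nodes(vertices, edge_lines):
    print(node_id, vertices[node_id])
```

For the full host graph the edge list is far too large for this in-memory approach on a typical machine; the sketch only shows how the two file types relate.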

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
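A simplified sketch of this aggregation step follows. It uses a tiny stand-in for the public suffix list and hypothetical helper names; the real list is much longer and has wildcard and exception rules that are ignored here. Dropping links between hosts of the same domain is also an assumption of this sketch.

```python
from collections import Counter

# Tiny stand-in for the public suffix list (assumption, far from complete).
TOY_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(rev_host, suffixes=TOY_SUFFIXES):
    """Map a reversed host name to its reversed pay-level domain."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first, then keep one more label.
    for n in range(len(labels) - 1, 0, -1):
        suffix = ".".join(reversed(labels[:n]))
        if suffix in suffixes:
            return ".".join(labels[:n + 1])
    return rev_host


def aggregate(host_edges, hosts):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        a, b = pay_level_domain(hosts[src]), pay_level_domain(hosts[dst])
        if a != b:  # assumption: links within one domain are dropped
            domain_edges.add((a, b))
    return domain_edges


hosts = {
    0: "com.example.blog",
    1: "com.example.shop",
    2: "uk.co.bbc.news",
    3: "org.wikipedia.en",
}
host_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]

print(sorted(aggregate(host_edges, hosts)))
# Number of hosts per domain, as in the "num hosts" column of the vertices file.
print(Counter(pay_level_domain(h) for h in hosts.values()))
```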

The domain-level graph has 92 million nodes and 1.45 billion edges. 53 million nodes (57%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (37%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/domain/.

Download files of the Common Crawl May/June/July 2018 domain-level webgraph

Size | File | Description
0.64 GB | cc-main-2018-may-jun-jul-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
5.85 GB | cc-main-2018-may-jun-jul-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.21 GB | cc-main-2018-may-jun-jul-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-may-jun-jul-domain.properties |
3.43 GB | cc-main-2018-may-jun-jul-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-may-jun-jul-domain-t.properties |
1 kB | cc-main-2018-may-jun-jul-domain.stats | WebGraph statistics
1.96 GB | cc-main-2018-may-jun-jul-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 92 million domains is available for download.
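For readers unfamiliar with the measure: the harmonic centrality of a node is the sum of 1/d over all other nodes that can reach it, where d is the shortest-path distance to the node. The following toy sketch computes it with one breadth-first search per node on the transposed graph. It only illustrates the definition on a three-node example and is not the pipeline used to compute the published ranks, which has to handle graphs many orders of magnitude larger.

```python
from collections import deque


def harmonic_centrality(nodes, edges):
    """Return {node: sum of 1/distance from every node that can reach it}."""
    # Build the transposed adjacency: for each node, who links to it.
    inlinks = {node: [] for node in nodes}
    for src, dst in edges:
        inlinks[dst].append(src)

    scores = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        # BFS backwards along inlinks yields shortest distances to target.
        while queue:
            node = queue.popleft()
            for pred in inlinks[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores[target] = total
    return scores


# Toy chain a -> b -> c: c is reached from b (distance 1) and a (distance 2).
print(harmonic_centrality(["a", "b", "c"], [("a", "b"), ("b", "c")]))
```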

Top 1000 domains ranked by harmonic centrality (May/June/July 2018)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 25381622 | 2 | 0.013272 | com.facebook
2 | 24767500 | 1 | 0.016429 | com.googleapis
3 | 23574566 | 3 | 0.009596 | com.google
4 | 22826384 | 4 | 0.008408 | com.twitter
5 | 22398676 | 5 | 0.007043 | com.youtube
6 | 21446850 | 6 | 0.006211 | org.w
7 | 20000170 | 7 | 0.004495 | org.gmpg
8 | 19917792 | 9 | 0.003686 | com.instagram
9 | 19565892 | 11 | 0.003123 | com.linkedin
10 | 18904142 | 25 | 0.001434 | com.gravatar
11 | 18886866 | 14 | 0.002009 | com.wordpress
12 | 18791656 | 26 | 0.001378 | org.wikipedia
13 | 18683474 | 23 | 0.001591 | com.pinterest
14 | 18605644 | 13 | 0.002616 | org.wordpress
15 | 18523062 | 21 | 0.001661 | com.apple
16 | 18506550 | 12 | 0.002795 | com.bootstrapcdn
17 | 18299454 | 33 | 0.000893 | com.blogspot
18 | 18261082 | 24 | 0.001454 | com.vimeo
19 | 18104372 | 37 | 0.000799 | com.amazon
20 | 18101742 | 34 | 0.000875 | gl.goo
21 | 18052860 | 38 | 0.000756 | be.youtu
22 | 18015066 | 28 | 0.001162 | com.microsoft
23 | 17990876 | 16 | 0.001783 | com.googletagmanager
24 | 17945844 | 19 | 0.001702 | com.adobe
25 | 17942552 | 44 | 0.000652 | com.tumblr
26 | 17901968 | 15 | 0.001947 | com.cloudflare
27 | 17853502 | 20 | 0.001684 | com.macromedia
28 | 17832868 | 45 | 0.000641 | com.wp
29 | 17823530 | 61 | 0.000483 | com.yahoo
30 | 17781678 | 40 | 0.000719 | com.flickr
31 | 17733538 | 46 | 0.000626 | ly.bit
32 | 17680810 | 48 | 0.000606 | me.wp
33 | 17674312 | 35 | 0.000857 | com.paypal
34 | 17654864 | 32 | 0.000904 | com.amazonaws
35 | 17602138 | 22 | 0.001598 | com.github
36 | 17588460 | 104 | 0.000250 | com.nytimes
37 | 17550714 | 54 | 0.000545 | org.mozilla
38 | 17517872 | 70 | 0.000400 | com.weebly
39 | 17506702 | 89 | 0.000291 | com.googleusercontent
40 | 17435634 | 41 | 0.000714 | io.github
41 | 17400536 | 184 | 0.000140 | com.wsj
42 | 17353638 | 144 | 0.000209 | com.dropbox
43 | 17341062 | 166 | 0.000161 | org.wikimedia
44 | 17336710 | 141 | 0.000217 | com.imgur
45 | 17319686 | 57 | 0.000497 | com.medium
46 | 17318280 | 68 | 0.000411 | org.creativecommons
47 | 17316806 | 65 | 0.000434 | com.bing
48 | 17274976 | 147 | 0.000198 | com.blogger
49 | 17261700 | 29 | 0.001115 | com.gstatic
50 | 17257470 | 66 | 0.000422 | com.jquery
51 | 17236104 | 211 | 0.000119 | com.businessinsider
52 | 17211698 | 155 | 0.000182 | net.slideshare
53 | 17211430 | 207 | 0.000120 | com.wired
54 | 17203902 | 53 | 0.000577 | co.t
55 | 17197252 | 56 | 0.000520 | eu.europa
56 | 17187720 | 182 | 0.000142 | com.myspace
57 | 17184156 | 92 | 0.000278 | com.mailchimp
58 | 17153616 | 36 | 0.000843 | org.apache
59 | 17131630 | 31 | 0.000912 | net.doubleclick
60 | 17127558 | 69 | 0.000402 | com.statcounter
61 | 17120814 | 63 | 0.000477 | com.list-manage
62 | 17112850 | 246 | 0.000100 | org.npr
63 | 17105220 | 145 | 0.000203 | com.issuu
64 | 17099450 | 27 | 0.001250 | ru.yandex
65 | 17089826 | 314 | 0.000078 | com.theverge
66 | 17089012 | 321 | 0.000077 | com.appspot
67 | 17080688 | 168 | 0.000159 | org.gnu
68 | 17059280 | 142 | 0.000216 | com.yelp
69 | 17056948 | 52 | 0.000581 | org.w3
70 | 17053438 | 331 | 0.000075 | com.about
71 | 17048062 | 267 | 0.000090 | me.about
72 | 17035726 | 176 | 0.000148 | com.oracle
73 | 17031278 | 8 | 0.004428 | com.godaddy
74 | 17016312 | 175 | 0.000148 | org.ietf
75 | 17014942 | 377 | 0.000065 | com.slate
76 | 16988198 | 42 | 0.000702 | net.cloudfront
77 | 16987550 | 301 | 0.000082 | com.buzzfeed
78 | 16986236 | 226 | 0.000111 | com.tinyurl
79 | 16984038 | 436 | 0.000056 | edu.princeton
80 | 16970148 | 336 | 0.000074 | com.deviantart
81 | 16945098 | 206 | 0.000122 | com.cnn
82 | 16943242 | 366 | 0.000066 | edu.washington
83 | 16941304 | 105 | 0.000250 | com.reddit
84 | 16922914 | 393 | 0.000062 | edu.ucla
85 | 16917390 | 85 | 0.000302 | com.soundcloud
86 | 16917338 | 449 | 0.000055 | com.nike
87 | 16909478 | 193 | 0.000136 | uk.co.bbc
88 | 16901808 | 60 | 0.000485 | org.schema
89 | 16899732 | 378 | 0.000064 | org.arxiv
90 | 16896852 | 397 | 0.000060 | org.chromium
91 | 16894404 | 181 | 0.000142 | com.theguardian
92 | 16888586 | 165 | 0.000163 | com.forbes
93 | 16887802 | 380 | 0.000064 | com.stackexchange
94 | 16886350 | 161 | 0.000172 | com.android
95 | 16885880 | 343 | 0.000070 | gov.loc
96 | 16881304 | 437 | 0.000056 | com.qz
97 | 16878064 | 333 | 0.000074 | com.foursquare
98 | 16871092 | 242 | 0.000102 | com.nbcnews
99 | 16862884 | 312 | 0.000079 | gov.fda
100 | 16856244 | 389 | 0.000063 | org.ieee
101 | 16855606 | 30 | 0.000991 | com.squarespace
102 | 16849916 | 443 | 0.000055 | org.sciencemag
103 | 16828640 | 82 | 0.000323 | net.akamaihd
104 | 16825050 | 284 | 0.000085 | com.example
105 | 16821922 | 423 | 0.000057 | com.trello
106 | 16815472 | 172 | 0.000152 | com.whatsapp
107 | 16812836 | 215 | 0.000119 | es.google
108 | 16812598 | 298 | 0.000082 | com.typeform
109 | 16793418 | 616 | 0.000043 | com.flipboard
110 | 16787856 | 59 | 0.000493 | net.fbcdn
111 | 16782836 | 151 | 0.000190 | org.bbb
112 | 16781822 | 218 | 0.000118 | edu.stanford
113 | 16781438 | 427 | 0.000057 | com.libsyn
114 | 16779444 | 502 | 0.000049 | google.blog
115 | 16777134 | 283 | 0.000085 | com.go
116 | 16774636 | 419 | 0.000057 | com.withgoogle
117 | 16763315 | 610 | 0.000043 | edu.utah
118 | 16762919 | 64 | 0.000437 | com.ytimg
119 | 16761512 | 328 | 0.000076 | com.reuters
120 | 16756209 | 251 | 0.000097 | com.live
121 | 16749983 | 164 | 0.000163 | org.archive
122 | 16747795 | 518 | 0.000048 | edu.gatech
123 | 16741185 | 75 | 0.000357 | com.fb
124 | 16739169 | 106 | 0.000250 | edu.utexas
125 | 16738538 | 208 | 0.000120 | com.huffingtonpost
126 | 16737695 | 286 | 0.000084 | com.bloomberg
127 | 16732629 | 241 | 0.000103 | com.techcrunch
128 | 16726768 | 317 | 0.000078 | edu.harvard
129 | 16724125 | 205 | 0.000123 | com.dribbble
130 | 16723935 | 318 | 0.000078 | com.git-scm
131 | 16719142 | 169 | 0.000159 | gov.nih
132 | 16707450 | 146 | 0.000199 | net.sourceforge
133 | 16706716 | 340 | 0.000072 | com.msn
134 | 16705396 | 77 | 0.000351 | com.wix
135 | 16697236 | 294 | 0.000083 | uk.co.blogspot
136 | 16694665 | 316 | 0.000078 | com.googlecode
137 | 16688264 | 369 | 0.000066 | com.bbc
138 | 16685043 | 225 | 0.000111 | com.typepad
139 | 16684281 | 234 | 0.000106 | com.washingtonpost
140 | 16683086 | 213 | 0.000119 | com.imdb
141 | 16679284 | 538 | 0.000047 | com.chron
142 | 16671633 | 706 | 0.000042 | com.hbo
143 | 16668347 | 364 | 0.000067 | com.mashable
144 | 16665099 | 87 | 0.000294 | com.shopify
145 | 16661523 | 76 | 0.000351 | com.paypalobjects
146 | 16659375 | 262 | 0.000092 | edu.mit
147 | 16650967 | 394 | 0.000062 | com.tinypic
148 | 16650908 | 293 | 0.000083 | au.com.google
149 | 16647149 | 308 | 0.000080 | com.cnet
150 | 16641164 | 300 | 0.000082 | com.usatoday
151 | 16640112 | 296 | 0.000083 | net.windows
152 | 16637416 | 416 | 0.000058 | au.gov.nsw
153 | 16626638 | 299 | 0.000082 | com.ibm
154 | 16622823 | 371 | 0.000065 | uk.co.dailymail
155 | 16622125 | 339 | 0.000073 | uk.co.telegraph
156 | 16619319 | 362 | 0.000067 | com.gmail
157 | 16615887 | 173 | 0.000151 | com.eventbrite
158 | 16612204 | 227 | 0.000110 | net.php
159 | 16609048 | 523 | 0.000048 | com.fastcodesign
160 | 16596577 | 334 | 0.000074 | com.time
161 | 16594204 | 421 | 0.000057 | com.ted
162 | 16593660 | 73 | 0.000366 | de.google
163 | 16591616 | 531 | 0.000047 | org.rubyonrails
164 | 16575402 | 271 | 0.000088 | com.mapquest
165 | 16575133 | 561 | 0.000045 | edu.illinois
166 | 16571670 | 170 | 0.000154 | com.opera
167 | 16568994 | 395 | 0.000061 | com.latimes
168 | 16565169 | 819 | 0.000037 | com.dezeen
169 | 16560426 | 306 | 0.000081 | com.hp
170 | 16556998 | 180 | 0.000143 | com.stackoverflow
171 | 16554184 | 452 | 0.000055 | org.eclipse
172 | 16549466 | 236 | 0.000105 | com.ebay
173 | 16543706 | 440 | 0.000055 | com.kickstarter
174 | 16540301 | 425 | 0.000057 | gov.nasa
175 | 16538974 | 231 | 0.000106 | uk.co.amazon
176 | 16533349 | 486 | 0.000051 | edu.cornell
177 | 16531917 | 188 | 0.000139 | com.etsy
178 | 16529057 | 410 | 0.000058 | com.aol
179 | 16526090 | 542 | 0.000046 | com.quora
180 | 16525042 | 288 | 0.000084 | com.meetup
181 | 16520185 | 520 | 0.000048 | com.googleblog
182 | 16518640 | 761 | 0.000039 | io.itch
183 | 16517972 | 414 | 0.000058 | com.variety
184 | 16517219 | 499 | 0.000050 | edu.berkeley
185 | 16508733 | 622 | 0.000043 | uk.co.pinterest
186 | 16506112 | 100 | 0.000257 | com.livestream
187 | 16502945 | 514 | 0.000049 | com.ft
188 | 16501376 | 466 | 0.000053 | co.g
189 | 16499713 | 439 | 0.000056 | com.theatlantic
190 | 16498131 | 519 | 0.000048 | com.zdnet
191 | 16492014 | 244 | 0.000101 | com.surveymonkey
192 | 16488627 | 199 | 0.000130 | com.tripadvisor
193 | 16485175 | 390 | 0.000063 | com.cnbc
194 | 16483162 | 1037 | 0.000031 | com.engadget
195 | 16481038 | 463 | 0.000054 | com.mixcloud
196 | 16476002 | 605 | 0.000044 | com.vogue
197 | 16470277 | 752 | 0.000039 | com.nationalgeographic
198 | 16468657 | 750 | 0.000040 | com.creativebloq
199 | 16467451 | 565 | 0.000045 | com.yellowpages
200 | 16467125 | 90 | 0.000290 | com.addthis
201 | 16466147 | 239 | 0.000103 | org.drupal
202 | 16464114 | 311 | 0.000079 | com.udacity
203 | 16463605 | 898 | 0.000035 | com.sfgate
204 | 16456191 | 748 | 0.000040 | com.discogs
205 | 16455549 | 255 | 0.000095 | com.digg
206 | 16453381 | 817 | 0.000037 | com.wikia
207 | 16449510 | 528 | 0.000047 | com.nature
208 | 16448944 | 185 | 0.000140 | com.spotify
209 | 16448054 | 560 | 0.000045 | org.pbs
210 | 16448046 | 86 | 0.000300 | com.twimg
211 | 16444439 | 431 | 0.000056 | com.angieslist
212 | 16437315 | 384 | 0.000063 | com.skype
213 | 16435380 | 469 | 0.000053 | com.fortune
214 | 16434501 | 102 | 0.000255 | net.jsfiddle
215 | 16433743 | 602 | 0.000044 | com.newyorker
216 | 16431871 | 492 | 0.000051 | com.cbsnews
217 | 16430182 | 405 | 0.000059 | gov.whitehouse
218 | 16427414 | 354 | 0.000069 | org.python
219 | 16425704 | 280 | 0.000086 | com.hubspot
220 | 16424605 | 324 | 0.000076 | gov.cdc
221 | 16421283 | 547 | 0.000046 | org.aarp
222 | 16420623 | 541 | 0.000046 | com.findlaw
223 | 16419061 | 194 | 0.000135 | com.zendesk
224 | 16416565 | 879 | 0.000036 | com.arstechnica
225 | 16415324 | 493 | 0.000051 | org.hbr
226 | 16415039 | 160 | 0.000173 | com.wixsite
227 | 16414264 | 513 | 0.000049 | com.cisco
228 | 16414231 | 58 | 0.000495 | com.vk
229 | 16414007 | 361 | 0.000067 | com.photobucket
230 | 16408388 | 357 | 0.000069 | com.springer
231 | 16407667 | 509 | 0.000049 | com.superpages
232 | 16406036 | 728 | 0.000041 | com.intel
233 | 16405431 | 407 | 0.000058 | com.giphy
234 | 16404519 | 253 | 0.000096 | to.amzn
235 | 16404240 | 800 | 0.000038 | com.manta
236 | 16403321 | 47 | 0.000608 | com.qq
237 | 16403260 | 1188 | 0.000029 | com.gizmodo
238 | 16403076 | 498 | 0.000050 | com.entrepreneur
239 | 16402950 | 608 | 0.000043 | com.venturebeat
240 | 16402108 | 1047 | 0.000031 | edu.upenn
241 | 16401145 | 360 | 0.000068 | com.nypost
242 | 16401094 | 406 | 0.000059 | org.un
243 | 16395874 | 1405 | 0.000028 | uk.ac.ox
244 | 16395078 | 626 | 0.000043 | com.scribd
245 | 16394056 | 961 | 0.000034 | com.thenextweb
246 | 16393237 | 587 | 0.000044 | com.unsplash
247 | 16392164 | 470 | 0.000053 | com.xrea
248 | 16390829 | 764 | 0.000039 | com.hackernoon
249 | 16390735 | 1013 | 0.000032 | edu.columbia
250 | 16385680 | 713 | 0.000041 | com.box
251 | 16383953 | 230 | 0.000108 | com.stumbleupon
252 | 16383638 | 1551 | 0.000024 | edu.purdue
253 | 16381649 | 593 | 0.000044 | com.vice
254 | 16381302 | 913 | 0.000035 | ly.snip
255 | 16380414 | 229 | 0.000109 | net.behance
256 | 16380111 | 600 | 0.000044 | com.symantec
257 | 16379592 | 111 | 0.000236 | com.jimdo
258 | 16378318 | 959 | 0.000034 | com.googledrive
259 | 16377172 | 257 | 0.000094 | com.salesforce
260 | 16375969 | 387 | 0.000063 | com.images-amazon
261 | 16374033 | 504 | 0.000049 | org.unicode
262 | 16372683 | 445 | 0.000055 | com.office
263 | 16370954 | 635 | 0.000043 | com.citysearch
264 | 16365786 | 876 | 0.000036 | com.healthgrades
265 | 16363957 | 303 | 0.000081 | org.acm
266 | 16363446 | 240 | 0.000103 | com.disqus
267 | 16363144 | 338 | 0.000073 | com.tripod
268 | 16356913 | 992 | 0.000033 | com.pixabay
269 | 16354234 | 370 | 0.000066 | com.oreilly
270 | 16343963 | 1048 | 0.000031 | com.indiegogo
271 | 16342521 | 1040 | 0.000031 | com.evernote
272 | 16337476 | 735 | 0.000040 | gov.noaa
273 | 16337226 | 785 | 0.000038 | com.spreadshirt
274 | 16332902 | 1364 | 0.000029 | com.searchengineland
275 | 16330466 | 1597 | 0.000023 | uk.co.theregister
276 | 16327360 | 620 | 0.000043 | com.avvo
277 | 16325045 | 189 | 0.000138 | com.constantcontact
278 | 16324354 | 489 | 0.000051 | com.inc
279 | 16322948 | 920 | 0.000035 | com.naturalnews
280 | 16321702 | 376 | 0.000065 | org.ampproject
281 | 16321578 | 450 | 0.000055 | me.paypal
282 | 16320188 | 342 | 0.000071 | com.livejournal
283 | 16319952 | 422 | 0.000057 | com.businesswire
284 | 16319387 | 974 | 0.000033 | au.net.abc
285 | 16319144 | 259 | 0.000093 | org.joomla
286 | 16314140 | 885 | 0.000035 | com.dropboxusercontent
287 | 16312103 | 853 | 0.000036 | com.statista
288 | 16308378 | 558 | 0.000045 | com.goodreads
289 | 16308361 | 1398 | 0.000028 | com.sciencedaily
290 | 16307378 | 1377 | 0.000029 | com.storify
291 | 16302792 | 796 | 0.000038 | com.curbed
292 | 16302729 | 186 | 0.000139 | com.feedburner
293 | 16300939 | 1548 | 0.000024 | com.pcmag
294 | 16300342 | 644 | 0.000042 | gov.defense
295 | 16300060 | 726 | 0.000041 | org.eff
296 | 16298671 | 383 | 0.000064 | com.sxsw
297 | 16298398 | 1594 | 0.000023 | com.mcafee
298 | 16296843 | 471 | 0.000052 | com.snapchat
299 | 16295047 | 1002 | 0.000032 | com.shutterstock
300 | 16294087 | 700 | 0.000042 | com.moz
301 | 16293401 | 158 | 0.000175 | uk.co.google
302 | 16291354 | 433 | 0.000056 | com.adweek
303 | 16289733 | 337 | 0.000073 | gov.ca
304 | 16288869 | 237 | 0.000105 | com.bandcamp
305 | 16285197 | 233 | 0.000106 | de.amazon
306 | 16281279 | 909 | 0.000035 | gov.census
307 | 16279095 | 627 | 0.000043 | site.business
308 | 16278411 | 767 | 0.000039 | com.economist
309 | 16273403 | 346 | 0.000070 | com.wiley
310 | 16271595 | 157 | 0.000177 | com.weibo
311 | 16271436 | 911 | 0.000035 | gov.uspto
312 | 16270078 | 93 | 0.000273 | me.fb
313 | 16266403 | 990 | 0.000033 | gov.fcc
314 | 16263211 | 1624 | 0.000022 | com.pcworld
315 | 16263070 | 1366 | 0.000029 | org.worldbank
316 | 16262404 | 1695 | 0.000021 | com.fifa
317 | 16261893 | 892 | 0.000035 | com.merchantcircle
318 | 16261024 | 749 | 0.000040 | tv.twitch
319 | 16257455 | 1719 | 0.000020 | edu.unc
320 | 16255699 | 954 | 0.000034 | com.steampowered
321 | 16255672 | 1637 | 0.000022 | org.khanacademy
322 | 16255397 | 880 | 0.000036 | com.indiatimes
323 | 16255076 | 309 | 0.000080 | com.smugmug
324 | 16254667 | 564 | 0.000045 | com.wikihow
325 | 16253704 | 949 | 0.000034 | org.unesco
326 | 16253384 | 1590 | 0.000023 | edu.northwestern
327 | 16252481 | 1056 | 0.000031 | com.redhat
328 | 16250050 | 1599 | 0.000023 | com.scientificamerican
329 | 16247225 | 808 | 0.000037 | gov.nist
330 | 16246611 | 1563 | 0.000024 | com.smashingmagazine
331 | 16241341 | 822 | 0.000037 | com.deloitte
332 | 16240554 | 1388 | 0.000028 | com.politico
333 | 16239514 | 325 | 0.000076 | com.googlesyndication
334 | 16238580 | 787 | 0.000038 | org.tigris
335 | 16237777 | 367 | 0.000066 | com.prnewswire
336 | 16235711 | 1017 | 0.000032 | edu.yale
337 | 16235543 | 925 | 0.000035 | com.ubuntu
338 | 16233613 | 742 | 0.000040 | org.aiga
339 | 16232678 | 1487 | 0.000026 | com.pexels
340 | 16230507 | 1493 | 0.000026 | com.thinkwithgoogle
341 | 16229621 | 1440 | 0.000027 | com.alibaba
342 | 16226635 | 304 | 0.000081 | ca.google
343 | 16226232 | 359 | 0.000068 | com.dailymotion
344 | 16226122 | 1703 | 0.000021 | com.vanityfair
345 | 16223766 | 1794 | 0.000019 | com.udemy
346 | 16223352 | 277 | 0.000086 | com.windowsphone
347 | 16222378 | 621 | 0.000043 | com.slack
348 | 16219936 | 1409 | 0.000028 | ca.blogspot
349 | 16219473 | 332 | 0.000074 | com.bitly
350 | 16214229 | 962 | 0.000034 | gov.nps
351 | 16213218 | 279 | 0.000086 | com.wufoo
352 | 16213088 | 755 | 0.000039 | com.webmd
353 | 16212554 | 794 | 0.000038 | de.blogspot
354 | 16209004 | 1019 | 0.000032 | com.prweb
355 | 16204931 | 1766 | 0.000019 | edu.usc
356 | 16204520 | 546 | 0.000046 | com.homeadvisor
357 | 16203590 | 991 | 0.000033 | com.deepmind
358 | 16203266 | 322 | 0.000077 | com.mozilla
359 | 16201535 | 1652 | 0.000022 | org.weforum
360 | 16200784 | 1912 | 0.000017 | com.ehow
361 | 16199879 | 586 | 0.000044 | com.netflix
362 | 16199442 | 724 | 0.000041 | com.samsung
363 | 16197387 | 477 | 0.000052 | com.webs
364 | 16197196 | 1833 | 0.000018 | com.ikea
365 | 16196228 | 183 | 0.000141 | jp.co.yahoo
366 | 16195807 | 2548 | 0.000012 | com.sophos
367 | 16195059 | 813 | 0.000037 | org.amnesty
368 | 16191501 | 996 | 0.000033 | org.spie
369 | 16191482 | 1635 | 0.000022 | com.billboard
370 | 16191120 | 1374 | 0.000029 | com.hootsuite
371 | 16188221 | 1001 | 0.000032 | com.whitepages
372 | 16188111 | 319 | 0.000078 | fr.free
373 | 16187195 | 187 | 0.000139 | com.xing
374 | 16186531 | 815 | 0.000037 | com.java
375 | 16186252 | 1827 | 0.000018 | org.coursera
376 | 16184139 | 1381 | 0.000029 | com.speakerdeck
377 | 16182369 | 152 | 0.000190 | com.youtube-nocookie
378 | 16178099 | 1908 | 0.000017 | com.tutsplus
379 | 16177836 | 899 | 0.000035 | com.marketwatch
380 | 16175545 | 995 | 0.000033 | edu.psu
381 | 16174492 | 1688 | 0.000021 | com.chrome
382 | 16173930 | 1077 | 0.000031 | com.airbnb
383 | 16173485 | 1740 | 0.000020 | au.com.smh
384 | 16173008 | 952 | 0.000034 | gov.senate
385 | 16172258 | 276 | 0.000087 | com.getbootstrap
386 | 16171414 | 1443 | 0.000027 | com.marketingland
387 | 16169883 | 1104 | 0.000030 | com.ycombinator
388 | 16168541 | 413 | 0.000058 | int.who
389 | 16168403 | 1021 | 0.000032 | edu.umich
390 | 16167647 | 1511 | 0.000025 | com.xkcd
391 | 16165208 | 1617 | 0.000022 | com.merriam-webster
392 | 16164062 | 986 | 0.000033 | it.binged
393 | 16162437 | 1052 | 0.000031 | com.sun
394 | 16162150 | 769 | 0.000039 | com.googlesource
395 | 16161882 | 1439 | 0.000027 | edu.ucsd
396 | 16160824 | 399 | 0.000060 | com.mysql
397 | 16156698 | 418 | 0.000057 | com.bigcartel
398 | 16154861 | 846 | 0.000036 | gov.state
399 | 16153725 | 981 | 0.000033 | com.itsnicethat
400 | 16153693 | 1459 | 0.000027 | uk.ac.cam
401 | 16152535 | 260 | 0.000093 | com.myshopify
402 | 16152053 | 1482 | 0.000026 | co.vine
403 | 16150830 | 540 | 0.000047 | gov.usda
404 | 16147936 | 1934 | 0.000017 | edu.ucdavis
405 | 16147539 | 1840 | 0.000018 | com.autodesk
406 | 16147195 | 795 | 0.000038 | org.aclweb
407 | 16146633 | 1403 | 0.000028 | com.css-tricks
408 | 16144670 | 2154 | 0.000014 | edu.ncsu
409 | 16143655 | 1646 | 0.000022 | com.playstation
410 | 16143349 | 946 | 0.000034 | io.material
411 | 16143314 | 1105 | 0.000030 | org.iso
412 | 16142931 | 836 | 0.000036 | gov.justice
413 | 16141858 | 827 | 0.000037 | com.foxnews
414 | 16141188 | 829 | 0.000037 | com.gartner
415 | 16140967 | 1721 | 0.000020 | uk.ac.ucl
416 | 16137787 | 382 | 0.000064 | com.booking
417 | 16136856 | 945 | 0.000034 | com.psychologytoday
418 | 16136353 | 81 | 0.000337 | com.baidu
419 | 16135804 | 786 | 0.000038 | gov.copyright
420 | 16135574 | 1593 | 0.000023 | com.target
421 | 16135083 | 2021 | 0.000016 | edu.arizona
422 | 16131304 | 1450 | 0.000027 | io.codepen
423 | 16130058 | 415 | 0.000058 | com.monster
424 | 16126494 | 482 | 0.000052 | gov.irs
425 | 16126161 | 1732 | 0.000020 | com.freepik
426 | 16123562 | 1430 | 0.000027 | com.gumroad
427 | 16122500 | 1519 | 0.000025 | de.spiegel
428 | 16120242 | 285 | 0.000085 | gov.ftc
429 | 16117722 | 1660 | 0.000021 | com.com
430 | 16116716 | 363 | 0.000067 | com.githubusercontent
431 | 16115063 | 1977 | 0.000016 | com.msnbc
432 | 16113873 | 572 | 0.000045 | in.co.google
433 | 16110022 | 1722 | 0.000020 | com.gigaom
434 | 16109670 | 1438 | 0.000027 | com.dell
435 | 16108948 | 907 | 0.000035 | com.tandfonline
436 | 16108826 | 396 | 0.000060 | net.themeforest
437 | 16108552 | 1442 | 0.000027 | com.businessweek
438 | 16107032 | 533 | 0.000047 | gov.epa
439 | 16106841 | 1039 | 0.000031 | com.gofundme
440 | 16106821 | 326 | 0.000076 | com.rawgit
441 | 16106609 | 1875 | 0.000018 | com.angelfire
442 | 16105593 | 1824 | 0.000018 | com.yoast
443 | 16104950 | 2525 | 0.000012 | com.fiverr
444 | 16104946 | 1623 | 0.000022 | com.nymag
445 | 16103599 | 1616 | 0.000022 | com.hollywoodreporter
446 | 16103457 | 1027 | 0.000032 | ca.cbc
447 | 16103234 | 1648 | 0.000022 | com.sap
448 | 16103036 | 1098 | 0.000030 | com.nielsen
449 | 16102998 | 426 | 0.000057 | org.nodejs
450 | 16102709 | 2510 | 0.000012 | edu.hbs
451 | 16102701 | 195 | 0.000135 | com.eepurl
452 | 16102688 | 741 | 0.000040 | com.blackberry
453 | 16102121 | 2618 | 0.000012 | edu.caltech
454 | 16101515 | 1576 | 0.000024 | com.ning
455 | 16100948 | 582 | 0.000044 | uk.co.independent
456 | 16099080 | 967 | 0.000033 | com.underconsideration
457 | 16098787 | 1904 | 0.000017 | com.semrush
458 | 16096514 | 2562 | 0.000012 | com.popsci
459 | 16096221 | 1903 | 0.000017 | com.howstuffworks
460 | 16096111 | 734 | 0.000040 | gov.hhs
461 | 16094981 | 792 | 0.000038 | com.usnews
462 | 16093468 | 17 | 0.001739 | com.wixstatic
463 | 16093240 | 1591 | 0.000023 | org.fao
464 | 16091056 | 1985 | 0.000016 | tv.periscope
465 | 16090419 | 2152 | 0.000014 | com.cbs
466 | 16089586 | 1457 | 0.000027 | org.altervista
467 | 16088656 | 480 | 0.000052 | us.icio
468 | 16087093 | 451 | 0.000055 | com.force
469 | 16086649 | 1023 | 0.000032 | com.500px
470 | 16085221 | 2114 | 0.000015 | uk.ac.ed
471 | 16083763 | 2228 | 0.000014 | com.instructables
472 | 16083658 | 1936 | 0.000017 | org.filezilla-project
473 | 16082249 | 1675 | 0.000021 | com.nba
474 | 16081782 | 2599 | 0.000012 | com.codecademy
475 | 16081087 | 1531 | 0.000025 | com.elpais
476 | 16080942 | 1091 | 0.000030 | es.iac
477 | 16078592 | 120 | 0.000226 | com.google-analytics
478 | 16078096 | 305 | 0.000081 | com.staticflickr
479 | 16077806 | 861 | 0.000036 | uk.co.guardian
480 | 16076776 | 1517 | 0.000025 | com.warnerbros
481 | 16076399 | 496 | 0.000050 | com.cargocollective
482 | 16076336 | 1868 | 0.000018 | com.canva
483 | 16076212 | 2072 | 0.000015 | com.gamespot
484 | 16075519 | 1665 | 0.000021 | edu.jhu
485 | 16074907 | 1437 | 0.000027 | edu.wisc
486 | 16072844 | 976 | 0.000033 | com.uservoice
487 | 16070548 | 963 | 0.000033 | net.researchgate
488 | 16070276 | 1420 | 0.000028 | com.istockphoto
489 | 16069880 | 1022 | 0.000032 | com.insiderpages
490 | 16068798 | 790 | 0.000038 | tv.ustream
491 | 16068337 | 2191 | 0.000014 | au.com.news
492 | 16068265 | 4005 | 0.000007 | com.space
493 | 16067417 | 881 | 0.000036 | gov.arts
494 | 16067121 | 281 | 0.000085 | com.fc2
495 | 16067028 | 444 | 0.000055 | com.sciencedirect
496 | 16066807 | 1726 | 0.000020 | com.hulu
497 | 16066412 | 1434 | 0.000027 | gov.usgs
498 | 16064058 | 1804 | 0.000019 | com.fedex
499 | 16063290 | 1460 | 0.000027 | com.forrester
500 | 16061979 | 1897 | 0.000017 | org.pnas
501 | 16061256 | 799 | 0.000038 | com.feedly
502 | 16060176 | 2986 | 0.000010 | com.hubpages
503 | 16060072 | 2107 | 0.000015 | com.crunchbase
504 | 16059447 | 1725 | 0.000020 | com.mercurynews
505 | 16057745 | 1415 | 0.000028 | com.reverbnation
506 | 16057012 | 1004 | 0.000032 | com.lighthouseapp
507 | 16056491 | 1609 | 0.000023 | com.indeed
508 | 16056336 | 2396 | 0.000013 | com.programmableweb
509 | 16055614 | 746 | 0.000040 | com.gotowebinar
510 | 16055125 | 1234 | 0.000029 | com.mlb
511 | 16054055 | 1005 | 0.000032 | com.timeanddate
512 | 16053957 | 1713 | 0.000020 | kr.flic
513 | 16053898 | 159 | 0.000174 | com.googleadservices
514 | 16053749 | 1600 | 0.000023 | edu.si
515 | 16052692 | 247 | 0.000099 | com.getclicky
516 | 16052033 | 222 | 0.000114 | jp.co.amazon
517 | 16051826 | 1749 | 0.000020 | com.today
518 | 16051338 | 1376 | 0.000029 | ly.ow
519 | 16050686 | 441 | 0.000055 | edu.cmu
520 | 16050498 | 1375 | 0.000029 | org.redcross
521 | 16050390 | 601 | 0.000044 | com.squareup
522 | 16048735 | 1809 | 0.000019 | com.domain
523 | 16048179 | 1692 | 0.000021 | edu.uchicago
524 | 16047498 | 1385 | 0.000028 | de.heise
525 | 16046589 | 1042 | 0.000031 | com.googlelabs
526 | 16046497 | 935 | 0.000034 | com.patreon
527 | 16045843 | 2144 | 0.000015 | com.ibtimes
528 | 16045451 | 401 | 0.000059 | com.clicky
529 | 16043594 | 1826 | 0.000018 | com.socialmediaexaminer
530 | 16042787 | 1416 | 0.000028 | com.americanexpress
531 | 16042447 | 411 | 0.000058 | com.w3schools
532 | 16042132 | 2179 | 0.000014 | org.gimp
533 | 16041800 | 727 | 0.000041 | com.photoshelter
534 | 16041645 | 386 | 0.000063 | edu.nyu
535 | 16039704 | 910 | 0.000035 | org.scala-lang
536 | 16038413 | 2877 | 0.000010 | com.oxforddictionaries
537 | 16038164 | 888 | 0.000035 | ca.amazon
538 | 16037904 | 1761 | 0.000019 | com.upwork
539 | 16036849 | 1604 | 0.000023 | org.apa
540 | 16036736 | 1549 | 0.000024 | com.accenture
541 | 16036020 | 2320 | 0.000013 | com.csmonitor
542 | 16035521 | 2772 | 0.000011 | com.lynda
543 | 16034570 | 2092 | 0.000015 | com.bestbuy
544 | 16034530 | 943 | 0.000034 | com.emarketer
545 | 16032865 | 403 | 0.000059 | com.herokuapp
546 | 16032606 | 998 | 0.000033 | au.com.yellowpages
547 | 16032334 | 581 | 0.000045 | com.houzz
548 | 16031562 | 1657 | 0.000022 | com.codeplex
549 | 16027393 | 108 | 0.000243 | jp.co.google
550 | 16027146 | 1654 | 0.000022 | com.theglobeandmail
551 | 16027065 | 1817 | 0.000019 | com.zillow
552 | 16026139 | 3127 | 0.000009 | org.notepad-plus-plus
553 | 16025586 | 1035 | 0.000031 | com.uber
554 | 16025512 | 2158 | 0.000014 | com.aljazeera
555 | 16025193 | 481 | 0.000052 | org.doi
556 | 16024531 | 1402 | 0.000028 | gov.fbi
557 | 16024228 | 278 | 0.000086 | com.youku
558 | 16024147 | 1135 | 0.000030 | edu.alamo
559 | 16023519 | 1774 | 0.000019 | org.letsencrypt
560 | 16023515 | 1847 | 0.000018 | com.lulu
561 | 16022000 | 1418 | 0.000028 | com.unity3d
562 | 16021724 | 472 | 0.000052 | com.iconfinder
563 | 16019791 | 198 | 0.000133 | com.histats
564 | 16019669 | 1878 | 0.000017 | com.norton
565 | 16019565 | 703 | 0.000042 | uk.co.tripadvisor
566 | 16019111 | 1386 | 0.000028 | com.walmart
567 | 16018429 | 2180 | 0.000014 | edu.asu
568 | 16017954 | 1556 | 0.000024 | com.prezi
569 | 16016889 | 989 | 0.000033 | gov.usa
570 | 16015420 | 1760 | 0.000020 | com.thehill
571 | 16013857 | 2053 | 0.000015 | com.thestar
572 | 16013612 | 1561 | 0.000024 | in.blogspot
573 | 16012265 | 1041 | 0.000031 | jp.co.fujixerox
574 | 16011010 | 1980 | 0.000016 | com.trendmicro
575 | 16010772 | 1533 | 0.000025 | com.bufferapp
576 | 16009533 | 1579 | 0.000024 | com.intuit
577 | 16008583 | 1507 | 0.000025 | edu.umn
578 | 16007713 | 2512 | 0.000012 | edu.wustl
579 | 16006379 | 1029 | 0.000032 | com.chamberofcommerce
580 | 16005982 | 1119 | 0.000030 | net.brownbook
581 | 16005447 | 1679 | 0.000021 | com.hotmail
582 | 16004096 | 430 | 0.000056 | cn.com.sina
583 | 16003516 | 1392 | 0.000028 | com.techrepublic
584 | 16002880 | 1666 | 0.000021 | com.econsultancy
585 | 16002231 | 4188 | 0.000007 | com.boredpanda
586 | 16001968 | 107 | 0.000247 | com.messenger
587 | 16001601 | 2167 | 0.000014 | com.icloud
588 | 16001177 | 957 | 0.000034 | com.outlook
589 | 16000754 | 2412 | 0.000013 | com.twitpic
590 | 16000697 | 2045 | 0.000015 | com.ifttt
591 | 16000659 | 2134 | 0.000015 | com.lonelyplanet
592 | 16000307 | 1851 | 0.000018 | edu.virginia
593 | 16000043 | 329 | 0.000075 | com.naver
594 | 15999112 | 2103 | 0.000015 | com.mentalfloss
595 | 15998718 | 2260 | 0.000014 | com.refinery29
596 | 15997519 | 3501 | 0.000008 | net.minecraft
597 | 15997019 | 235 | 0.000106 | fr.google
598 | 15996815 | 1554 | 0.000024 | com.jetbrains
599 | 15996044 | 1399 | 0.000028 | com.aweber
600 | 15995815 | 2389 | 0.000013 | com.animoto
601 | 15995749 | 1395 | 0.000028 | us.imageshack
602 | 15995631 | 1839 | 0.000018 | com.zazzle
603 | 15995483 | 1111 | 0.000030 | com.ezlocal
604 | 15995033 | 400 | 0.000060 | com.newrelic
605 | 15994874 | 1754 | 0.000020 | com.posterous
606 | 15994709 | 1986 | 0.000016 | org.aclu
607 | 15994590 | 633 | 0.000043 | gov.sec
608 | 15994083 | 980 | 0.000033 | uk.co.eventbrite
609 | 15992997 | 2533 | 0.000012 | edu.unl
610 | 15991024 | 3390 | 0.000009 | com.fitbit
611 | 15990563 | 3562 | 0.000008 | com.wolfram
612 | 15990135 | 1049 | 0.000031 | edu.utep
613 | 15989699 | 1765 | 0.000019 | org.owasp
614 | 15989560 | 1914 | 0.000017 | com.people
615 | 15989353 | 1700 | 0.000021 | com.irishtimes
616 | 15987982 | 1658 | 0.000021 | org.cambridge
617 | 15987488 | 1714 | 0.000020 | com.aliexpress
618 | 15986613 | 2745 | 0.000011 | org.kiva
619 | 15986099 | 2201 | 0.000014 | com.getresponse
620 | 15985328 | 870 | 0.000036 | ca.yelp
621 | 15984766 | 2650 | 0.000011 | com.klout
622 | 15984420 | 1724 | 0.000020 | edu.academia
623 | 15984370 | 3826 | 0.000008 | edu.byu
624 | 15984363 | 1918 | 0.000017 | edu.cuny
625 | 15983934 | 3152 | 0.000009 | edu.dartmouth
626 | 15983757 | 3907 | 0.000007 | com.lmgtfy
627 | 15982277 | 1397 | 0.000028 | com.alexa
628 | 15982117 | 2841 | 0.000011 | com.lastpass
629 | 15981087 | 1479 | 0.000026 | com.mckinsey
630 | 15980938 | 2078 | 0.000015 | it.scoop
631 | 15980757 | 98 | 0.000261 | org.reactjs
632 | 15979441 | 88 | 0.000292 | net.facebook
633 | 15979247 | 2766 | 0.000011 | com.campaignmonitor
634 | 15978259 | 3392 | 0.000009 | edu.uic
635 | 15977194 | 2122 | 0.000015 | ch.ethz
636 | 15976084 | 148 | 0.000197 | ru.mail
637 | 15975453 | 2328 | 0.000013 | com.glamour
638 | 15975183 | 197 | 0.000135 | it.google
639 | 15974182 | 1128 | 0.000030 | fr.blogspot
640 | 15973189 | 1773 | 0.000019 | com.foxbusiness
641 | 15972517 | 1993 | 0.000016 | edu.msu
642 | 15972470 | 2206 | 0.000014 | ca.ualberta
643 | 15972287 | 956 | 0.000034 | com.city-data
644 | 15970326 | 2029 | 0.000016 | edu.uci
645 | 15970066 | 1729 | 0.000020 | com.newsweek
646 | 15969815 | 797 | 0.000038 | org.jenkins-ci
647 | 15969741 | 1676 | 0.000021 | com.marketo
648 | 15969666 | 1012 | 0.000032 | com.cdbaby
649 | 15969585 | 1848 | 0.000018 | com.hostgator
650 | 15969244 | 2216 | 0.000014 | com.softpedia
651 | 15969035 | 4712 | 0.000006 | com.diigo
652 | 15968736 | 997 | 0.000033 | au.com.truelocal
653 | 15968524 | 1705 | 0.000021 | com.yandex
654 | 15968467 | 3507 | 0.000008 | com.starwars
655 | 15967022 | 3277 | 0.000009 | com.softonic
656 | 15966584 | 1387 | 0.000028 | com.lifehacker
657 | 15964769 | 417 | 0.000057 | com.stripe
658 | 15964236 | 1681 | 0.000021 | com.thomsonreuters
659 | 15964012 | 1967 | 0.000016 | com.nfl
660 | 15962731 | 780 | 0.000038 | com.uk
661 | 15962641 | 1451 | 0.000027 | com.weather
662 | 15962045 | 2262 | 0.000014 | edu.bu
663 | 15960841 | 177 | 0.000146 | org.icann
664 | 15960580 | 2896 | 0.000010 | org.ala
665 | 15960416 | 835 | 0.000036 | org.openstreetmap
666 | 15957618 | 1647 | 0.000022 | mp.j
667 | 15957395 | 289 | 0.000084 | com.maxcdn
668 | 15957230 | 91 | 0.000290 | org.networkadvertising
669 | 15957144 | 3168 | 0.000009 | com.avast
670 | 15955925 | 2664 | 0.000011 | org.virtualbox
671 | 15955293 | 2008 | 0.000016 | edu.umass
672 | 15954737 | 1829 | 0.000018 | gov.nyc
673 | 15954491 | 2343 | 0.000013 | com.homedepot
674 | 15953595 | 1942 | 0.000017 | edu.ufl
675 | 15952668 | 1691 | 0.000021 | com.nokia
676 | 15952333 | 2161 | 0.000014 | com.livestrong
677 | 15951184 | 2124 | 0.000015 | com.history
678 | 15948690 | 409 | 0.000058 | com.fastcompany
679 | 15947723 | 2289 | 0.000013 | com.newscientist
680 | 15946716 | 1706 | 0.000021 | com.vox
681 | 15944352 | 313 | 0.000078 | com.taobao
682 | 15944010 | 516 | 0.000048 | net.openid
683 | 15943715 | 1584 | 0.000023 | fm.last
684 | 15943623 | 2190 | 0.000014 | org.craigslist
685 | 15943310 | 878 | 0.000036 | br.com.uol
686 | 15942512 | 2950 | 0.000010 | ca.uwaterloo
687 | 15940080 | 374 | 0.000065 | com.netdna-ssl
688 | 15938438 | 1730 | 0.000020 | com.pwc
689 | 15935620 | 988 | 0.000033 | gov.sba
690 | 15935207 | 458 | 0.000054 | com.barnesandnoble
691 | 15935045 | 2653 | 0.000011 | org.moma
692 | 15934238 | 2744 | 0.000011 | org.phys
693 | 15933975 | 708 | 0.000041 | com.docker
694 | 15933262 | 1055 | 0.000031 | com.adage
695 | 15933136 | 1114 | 0.000030 | com.formstack
696 | 15932731 | 3680 | 0.000008 | cc.co
697 | 15932160 | 969 | 0.000033 | com.pinimg
698 | 15932109 | 1670 | 0.000021 | com.xbox
699 | 15931176 | 500 | 0.000050 | com.cracked
700 | 15930661 | 349 | 0.000070 | nl.google
701 | 15930196 | 204 | 0.000123 | jp.ameblo
702 | 15929700 | 2800 | 0.000011 | edu.hawaii
703 | 15929305 | 2102 | 0.000015 | com.blogtalkradio
704 | 15929173 | 857 | 0.000036 | com.delicious
705 | 15928584 | 2993 | 0.000010 | com.123rf
706 | 15928210 | 2163 | 0.000014 | com.britannica
707 | 15927984 | 2946 | 0.000010 | org.greenpeace
708 | 15927067 | 1859 | 0.000018 | com.stitcher
709 | 15926910 | 1831 | 0.000018 | com.marketwired
710 | 15926723 | 1514 | 0.000025 | gov.ny
711 | 15926477 | 3027 | 0.000010 | uk.bl
712 | 15926226 | 1998 | 0.000016 | net.boingboing
713 | 15926205 | 352 | 0.000070 | org.opensource
714 | 15925493 | 697 | 0.000042 | fr.amazon
715 | 15925417 | 2059 | 0.000015 | com.templatemonster
716 | 15925384 | 1928 | 0.000017 | com.networkworld
717 | 15925363 | 1038 | 0.000031 | com.infusionsoft
718 | 15925035 | 1424 | 0.000028 | com.shareasale
719 | 15924932 | 1020 | 0.000032 | au.com.yelp
720 | 15924102 | 1025 | 0.000032 | org.designmuseum
721 | 15924071 | 4109 | 0.000007 | org.libreoffice
722 | 15922360 | 3072 | 0.000010 | com.wikidot
723 | 15921986 | 1536 | 0.000024 | com.globo
724 | 15921710 | 2960 | 0.000010 | ca.globalnews
725 | 15921257 | 2464 | 0.000012 | com.fox
726 | 15920771 | 771 | 0.000039 | com.163
727 | 15920695 | 3399 | 0.000009 | org.edx
728 | 15919256 | 1961 | 0.000016 | com.mac
729 | 15918811 | 1842 | 0.000018 | gov.treasury
730 | 15918386 | 2085 | 0.000015 | com.urbandictionary
731 | 15917993 | 1528 | 0.000025 | gov.bls
732 | 15917781 | 202 | 0.000126 | jp.ne.hatena
733 | 15916788 | 1400 | 0.000028 | com.arcgis
734 | 15915656 | 543 | 0.000046 | com.technologyreview
735 | 15915609 | 1709 | 0.000021 | com.gettyimages
736 | 15915236 | 872 | 0.000036 | com.msdn
737 | 15915195 | 1704 | 0.000021 | com.windows
738 | 15914569 | 1094 | 0.000030 | com.mtnonline
739 | 15912776 | 2802 | 0.000011 | com.knowyourmeme
740 | 15912472 | 223 | 0.000111 | com.automattic
741 | 15912083 | 783 | 0.000038 | com.discordapp
742 | 15912041 | 1213 | 0.000029 | com.gloworld
743 | 15910756 | 4260 | 0.000007 | com.trulia
744 | 15910422 | 948 | 0.000034 | com.mysanantonio
745 | 15910398 | 238 | 0.000104 | com.parallels
746 | 15910326 | 1092 | 0.000030 | com.cbslocal
747 | 15909788 | 454 | 0.000055 | com.mapbox
748 | 15909109 | 1694 | 0.000021 | com.mtv
749 | 15909076 | 2695 | 0.000011 | com.imageshack
750 | 15908308 | 1855 | 0.000018 | edu.duke
751 | 15907269 | 1406 | 0.000028 | com.accuweather
752 | 15907201 | 4130 | 0.000007 | com.techsmith
753 | 15907143 | 1742 | 0.000020 | uk.co.wired
754 | 15907097 | 3594 | 0.000008 | com.makezine
755 | 15906461 | 2759 | 0.000011 | edu.pitt
756 | 15906416 | 2077 | 0.000015 | edu.indiana
757 | 15905735 | 1085 | 0.000030 | edu.uah
758 | 15905631 | 1496 | 0.000025 | me.m
759 | 15905303 | 1083 | 0.000030 | com.judysbook
760 | 15905116 | 1494 | 0.000026 | com.buffer
761 | 15904923 | 1612 | 0.000023 | com.searchenginewatch
762 | 15904458 | 4619 | 0.000006 | org.edublogs
763 | 15902765 | 1428 | 0.000028 | com.ups
764 | 15902175 | 777 | 0.000038 | gov.ed
765 | 15902033 | 960 | 0.000034 | au.com.whitepages
766 | 15901135 | 2288 | 0.000013 | uk.co.metro
767 | 15900981 | 2136 | 0.000015 | com.ign
768 | 15900568 | 1940 | 0.000017 | net.codecanyon
769 | 15899526 | 1987 | 0.000016 | com.pastebin
770 | 15898425 | 2015 | 0.000016 | com.nvidia
771 | 15898288 | 1068 | 0.000031 | com.womentechmakers
772 | 15898118 | 3296 | 0.000009 | org.code
773 | 15898010 | 2976 | 0.000010 | edu.oregonstate
774 | 15897795 | 1963 | 0.000016 | com.espn
775 | 15897477 | 1723 | 0.000020 | org.gnome
776 | 15896920 | 833 | 0.000036 | com.proofpoint
777 | 15896661 | 1463 | 0.000027 | gov.dot
778 | 15896612 | 1067 | 0.000031 | com.zoho
779 | 15895176 | 2534 | 0.000012 | com.producthunt
780 | 15895150 | 751 | 0.000039 | com.atlassian
781 | 15894496 | 2188 | 0.000014 | ca.ubc
782 | 15894439 | 576 | 0.000045 | com.us
783 | 15894285 | 1815 | 0.000019 | com.contentmarketinginstitute
784 | 15894081 | 1501 | 0.000025 | com.investopedia
785 | 15893267 | 2508 | 0.000012 | com.bankofamerica
786 | 15893187 | 1685 | 0.000021 | gov.wa
787 | 15892999 | 1921 | 0.000017 | com.deadline
788 | 15892117 | 2544 | 0.000012 | com.nhl
789 | 15892001 | 3657 | 0.000008 | org.lifehack
790 | 15891790 | 1683 | 0.000021 | com.vmware
791 | 15891718 | 2451 | 0.000012 | com.starbucks
792 | 15891455 | 2362 | 0.000013 | ly.visual
793 | 15891399 | 1204 | 0.000029 | org.change
794 | 15891348 | 1923 | 0.000017 | uk.ac.lse
795 | 15891194 | 2115 | 0.000015 | com.magentocommerce
796 | 15890881 | 248 | 0.000098 | org.iana
797 | 15890376 | 2631 | 0.000012 | com.lifewire
798 | 15889967 | 1015 | 0.000032 | com.visualstudio
799 | 15889397 | 1559 | 0.000024 | jp.blogspot
800 | 15889184 | 1606 | 0.000023 | com.sky
801 | 15889094 | 1601 | 0.000023 | com.gotomeeting
802 | 15889052 | 1060 | 0.000031 | com.bizcommunity
803 | 15888940 | 3281 | 0.000009 | com.smashwords
804 | 15887168 | 1651 | 0.000022 | com.mediafire
805 | 15886599 | 1578 | 0.000024 | com.ssrn
806 | 15886095 | 1686 | 0.000021 | net.recode
807 | 15885832 | 2593 | 0.000012 | com.asus
808 | 15884393 | 1636 | 0.000022 | se.haxx
809 | 15884138 | 1370 | 0.000029 | es.amazon
810 | 15883786 | 474 | 0.000052 | com.teamviewer
811 | 15883621 | 1787 | 0.000019 | com.outbrain
812 | 15882577 | 699 | 0.000042 | com.getpocket
813 | 15881184 | 3053 | 0.000010 | com.macrumors
814 | 15880772 | 3268 | 0.000009 | net.battle
815 | 15880362 | 1581 | 0.000024 | com.nydailynews
816 | 15879641 | 1797 | 0.000019 | edu.vanderbilt
817 | 15879231 | 2417 | 0.000013 | com.thestreet
818 | 15878673 | 941 | 0.000034 | net.azurewebsites
819 | 15878499 | 1641 | 0.000022 | fr.lemonde
820 | 15878493 | 1620 | 0.000022 | org.postimg
821 | 15877702 | 3393 | 0.000009 | com.formula1
822 | 15877565 | 1455 | 0.000027 | com.oup
823 | 15876621 | 2624 | 0.000012 | gov.cia
824 | 15876304 | 2710 | 0.000011 | org.olympic
825 | 15875628 | 2576 | 0.000012 | org.7-zip
826 | 15875577 | 3906 | 0.000007 | uk.ac.warwick
827 | 15875481 | 2573 | 0.000012 | com.tesla
828 | 15873726 | 2350 | 0.000013 | hk.com.google
8291587362918720.000018com.ecwid
8301587312010030.000032com.mlstatic
8311587216916630.000021com.glassdoor
8321587216320760.000015ca.utoronto
8331587158323030.000013net.comcast
8341587041621180.000015com.readwrite
8351587013021850.000014ca.qc.gouv
8361586991617850.000019gov.congress
837158677249470.000034com.att
8381586732718010.000019uk.co.mirror
839158672994950.000050com.marriott
8401586623929560.000010com.coinbase
8411586613420240.000016com.me
8421586603532560.000009gd.is
8431586551914210.000028org.plos
8441586492915180.000025com.business2community
8451586377710790.000030com.sagepub
8461586340628760.000010com.fineartamerica
8471586302411770.000029me.pxlme
8481586299515690.000024com.over-blog
8491586290915200.000025com.techtarget
8501586268719700.000016ru.narod
8511586249617700.000019com.ssllabs
8521586183524230.000013com.ge
8531586130118380.000018org.unicef
8541586098720320.000016int.wipo
855158606211790.000143de.bund
856158592329820.000033gov.house
8571585877221120.000015uk.co.thesun
8581585873230780.000010net.sucuri
8591585841721840.000014com.yolasite
8601585841325670.000012ms.1drv
8611585827121230.000015au.com.blogspot
8621585785426860.000011com.fool
8631585783439500.000007com.thenation
8641585761044180.000007edu.temple
8651585758333410.000009com.makeuseof
8661585701818320.000018edu.umd
867158569437820.000038es.com.blogspot
8681585536415830.000023com.pingdom
8691585533422920.000013com.macworld
870158552904550.000054jp.ne.sakura
8711585508825630.000012com.webnode
8721585484030440.000010com.freelancer
8731585400022970.000013gov.nsf
8741585387226230.000012edu.brown
8751585260149140.000006ca.gc.statcan
8761585259519970.000016com.getfirebug
8771585193322820.000013com.wikispaces
8781585073720090.000016org.jstor
8791585069118560.000018co.angel
8801585066529120.000010edu.tufts
881158500017430.000040org.bitbucket
8821584976422430.000014edu.osu
8831584921421210.000015edu.tamu
8841584906521270.000015org.wpmudev
885158476368340.000036net.noscript
8861584723237350.000008com.appleinsider
887158470681490.000195com.ggpht
8881584653217630.000019it.blogspot
8891584648524900.000012org.documentcloud
8901584590219260.000017com.cc
8911584458822070.000014us.zoom
8921584403217550.000020com.rollingstone
8931584349642050.000007li.paper
8941584317120520.000015edu.rutgers
8951584266923650.000013com.theonion
896158422927230.000041com.geocities
8971584215733280.000009com.indiewire
8981584170229780.000010int.esa
899158409549390.000034com.netdna-cdn
9001584066726040.000012ly.generalassemb
9011584060235710.000008edu.buffalo
902158400944850.000051br.com.google
9031584003810780.000030com.bitballoon
904158397034840.000051com.1and1
9051583897025710.000012com.sony
906158387706740.000042com.trustpilot
9071583737912000.000029org.oecd
9081583703420640.000015com.azcentral
9091583608111420.000030com.communitywalk
9101583582121970.000014org.videolan
9111583490725460.000012com.pandora
9121583328750620.000006org.anitaborg
9131583328423880.000013gov.in
9141583308130170.000010com.4shared
9151583279330580.000010org.metmuseum
9161583253321490.000015com.theknot
917158324798970.000035org.osgeo
918158301142010.000127me.line
919158276103720.000065com.bizjournals
9201582687425840.000012com.fujitsu
9211582674626740.000011com.blogs
922158262442720.000087org.debian
9231582489377700.000004edu.du
9241582376923370.000013com.bleacherreport
925158234105660.000045com.quantcast
9261582291724090.000013uk.co.express
9271582247921480.000015com.redbubble
9281582241328560.000010com.cosmopolitan
9291582221018030.000019org.cancer
9301582181711360.000030com.graphis
9311582106820580.000015de.zeit
9321582105323460.000013ca.sfu
933158207297320.000040com.wunderground
9341582067523110.000013com.convinceandconvert
9351582037228880.000010org.bitcoin
9361582019214640.000027com.usps
9371581987245450.000006com.blog
9381581959020670.000015com.salon
9391581727727790.000011com.technet
9401581726117990.000019net.daringfireball
9411581718925600.000012com.googlepages
942158166471150.000229com.bluehost
9431581655720890.000015com.w3techs
9441581626017710.000019com.calendly
9451581546930500.000010com.rottentomatoes
9461581499972680.000004com.elance
9471581453225290.000012com.createspace
948158139438550.000036com.comscore
9491581361024040.000013edu.colorado
9501581359910890.000030com.2findlocal
9511581356510570.000031org.tpr
9521581290816430.000022com.bt
953158128413650.000067com.rackcdn
9541581123434100.000009com.kotaku
9551581100837830.000008edu.syr
956158104977290.000041com.verisign
9571580990210090.000032com.tiddlywiki
9581580895717360.000020com.strikingly
9591580886761970.000005com.mercedes-benz
9601580886027360.000011com.oprah
9611580885117070.000021com.bmj
9621580836721400.000015com.popsugar
9631580740821750.000014org.hrw
9641580735520340.000016com.shareholder
9651580731618880.000017com.digicert
9661580708716450.000022com.steamcommunity
9671580703640570.000007com.pastemagazine
9681580666830990.000010com.voanews
9691580662610440.000031org.travelblog
9701580647014410.000027org.heart
9711580629029630.000010com.thrillist
9721580577241640.000007com.youcaring
9731580572910750.000031com.independent
9741580541121770.000014net.atlassian
9751580497145290.000006com.secondlife
9761580431115460.000024int.coe
9771580425223630.000013com.xerox
9781580208019070.000017com.computerworld
9791580186126080.000012com.groupon
9801580125638090.000008edu.rochester
9811580070934280.000009com.sas
9821579966320370.000016com.getsatisfaction
983157994674760.000052com.aliyuncs
9841579862665250.000005com.threatpost
9851579827118250.000018ru.spb
9861579794422240.000014com.gawker
9871579780924680.000012me.flavors
9881579773041970.000007com.slides
9891579764127940.000011com.madmimi
9901579750924110.000013com.hindustantimes
9911579697843380.000007org.teamusa
9921579657014670.000026gov.va
9931579567218700.000018mil.navy
994157954783500.000070jp.co.rakuten
995157953897200.000041com.hilton
9961579503610460.000031com.chicagotribune
9971579502414860.000026com.cafepress
9981579428323360.000013org.dyndns
9991579411322710.000014com.teenvogue
1000157936988960.000035gov.export
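
To work with rows like those in the list above, it can help to split each one into typed fields and turn the reversed domain name back into its usual form. The following is a minimal sketch, not an official tool: it assumes each row is available as five whitespace-separated fields (rank by harmonic centrality, harmonic centrality value, rank by PageRank, PageRank value, reversed domain name). The names `RankRow`, `parse_row` and `unreverse` are hypothetical labels chosen for this illustration.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    # Field names are illustrative labels, not taken from the data set.
    hc_rank: int       # position when sorted by harmonic centrality
    hc_value: float    # harmonic centrality score
    pr_rank: int       # position when sorted by PageRank
    pr_value: float    # PageRank score
    domain_rev: str    # domain in reversed notation, e.g. "au.com.yelp"


def parse_row(line: str) -> RankRow:
    """Split one whitespace-separated row into typed fields."""
    hc_rank, hc_value, pr_rank, pr_value, domain_rev = line.split()
    return RankRow(int(hc_rank), float(hc_value), int(pr_rank),
                   float(pr_value), domain_rev)


def unreverse(domain_rev: str) -> str:
    """Turn "au.com.yelp" back into "yelp.com.au"."""
    return ".".join(reversed(domain_rev.split(".")))


# Three rows copied from the list above, assumed to be space-separated.
SAMPLE = """\
715 15925417 2059 0.000015 com.templatemonster
719 15924932 1020 0.000032 au.com.yelp
732 15917781 202 0.000126 jp.ne.hatena
"""

rows = [parse_row(line) for line in SAMPLE.splitlines()]

# Re-order the sample by PageRank position instead of harmonic centrality.
for row in sorted(rows, key=lambda r: r.pr_rank):
    print(row.pr_rank, unreverse(row.domain_rev))
```

Because `str.split()` without arguments splits on any run of whitespace, the same sketch should also handle tab-separated input.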

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!