Host- and Domain-Level Web Graphs May/June/July 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs).

Host-level graph

The graph consists of 886 million nodes and 5.4 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled but are pointed to by a link on a crawled page. There are 793 million dangling nodes (89.5%), and the largest strongly connected component contains only 67 million nodes (7.5%). Host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
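The naming rule can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated above, not the code used to build the graph; the function names `reverse_host` and `unreverse_host` are made up for this example.

```python
def reverse_host(host: str) -> str:
    """Reverse the labels of a host name, dropping one leading "www." first."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Put a reversed name back into the usual order.

    The stripped "www." cannot be recovered, so it is not restored.
    """
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

With the example from the text, the first call yields com.example.subdomain and the second yields subdomain.example.com.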

You can download the graph and the ranks of all 886 million hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/ as a prefix to access the files from anywhere.
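The relation between the two locations is a plain prefix swap, sketched below. The helper name `s3_to_https` is hypothetical, and the example assumes that a file sits directly under the host/ prefix, which is how the text describes it.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

HOST_GRAPH_DIR = (
    "s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/host/"
)


def s3_to_https(s3_path: str) -> str:
    """Map an s3://commoncrawl/... path to the matching HTTPS URL."""
    if not s3_path.startswith(S3_PREFIX):
        raise ValueError("not a path in the commoncrawl bucket: " + s3_path)
    return HTTPS_PREFIX + s3_path[len(S3_PREFIX):]


# Build the HTTPS URL of one of the files in the host-level release.
print(s3_to_https(HOST_GRAPH_DIR + "cc-main-2018-may-jun-jul-host.stats"))
```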

The following files and formats are provided:

Download files of the Common Crawl May/June/July 2018 host-level webgraph

Size      File                                             Description
5.60 GB   cc-main-2018-may-jun-jul-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 98 vertices files
25.12 GB  cc-main-2018-may-jun-jul-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 196 edges files
9.99 GB   cc-main-2018-may-jun-jul-host.graph              graph in BVGraph format
2 kB      cc-main-2018-may-jun-jul-host.properties
11.30 GB  cc-main-2018-may-jun-jul-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2018-may-jun-jul-host-t.properties
1 kB      cc-main-2018-may-jun-jul-host.stats              WebGraph statistics
13.35 GB  cc-main-2018-may-jun-jul-host-ranks.txt.gz       harmonic centrality and pagerank
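As a rough sketch of how the edge tuples and the transpose relate, the snippet below parses a few edge lines and inverts them. It assumes that each line holds the two IDs of ⟨from_id, to_id⟩ separated by a tab; the separator is an assumption, since the table only names the fields. The sample IDs and the function names are invented for illustration.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


def parse_edges(lines: Iterable[str]) -> List[Edge]:
    """Parse lines of the assumed form "from_id<TAB>to_id" into integer pairs."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        from_id, to_id = line.split("\t")
        edges.append((int(from_id), int(to_id)))
    return edges


def transpose(edges: Iterable[Edge]) -> List[Edge]:
    """Invert every edge, turning outlinks into inlinks."""
    return [(to_id, from_id) for from_id, to_id in edges]


# Hypothetical sample: node 0 links to 1 and 2, node 2 links to 1.
sample_lines = ["0\t1", "0\t2", "2\t1"]
edges = parse_edges(sample_lines)
print(edges)
print(transpose(edges))
```

In the transposed list, following an edge from a node leads to the nodes that link to it, which is what the -t.graph files provide in BVGraph form.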

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
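The aggregation step can be sketched as follows. This is a minimal illustration, not the pipeline used for the release: `TOY_SUFFIXES` is a hypothetical three-entry stand-in for the public suffix list (the real list also has wildcard and exception rules, which are ignored here), the function names are invented, and dropping links that stay within one domain is a choice of this sketch that the announcement does not state.

```python
from typing import Iterable, Set, Tuple

# Hypothetical stand-in for the public suffix list, in reversed notation.
TOY_SUFFIXES = frozenset({"com", "org", "uk.co"})


def pay_level_domain(rev_host: str, suffixes=TOY_SUFFIXES) -> str:
    """Return the reversed pay-level domain of a reversed host name."""
    labels = rev_host.split(".")
    # Look for the longest leading run of labels that is a known suffix.
    for n in range(len(labels), 0, -1):
        if ".".join(labels[:n]) in suffixes:
            # The PLD is the suffix plus one more label (if there is one).
            return ".".join(labels[: n + 1])
    # Unknown suffix: fall back to the first two labels.
    return ".".join(labels[:2])


def aggregate_to_domains(
    host_edges: Iterable[Tuple[str, str]]
) -> Set[Tuple[str, str]]:
    """Collapse host-level edges into deduplicated domain-level edges."""
    domain_edges = set()
    for src_host, dst_host in host_edges:
        src = pay_level_domain(src_host)
        dst = pay_level_domain(dst_host)
        if src != dst:  # sketch-only choice: ignore links within one domain
            domain_edges.add((src, dst))
    return domain_edges


host_edges = [
    ("com.example.a", "com.example.b"),
    ("com.example.a", "uk.co.bbc.news"),
    ("com.example.b", "uk.co.bbc"),
]
print(aggregate_to_domains(host_edges))
```

Here the three host-level links collapse into a single domain-level edge from com.example to uk.co.bbc.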

The domain-level graph has 92 million nodes and 1.45 billion edges. 53 million nodes (57%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (37%).
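For readers who want to reproduce such figures on a small edge list, here is a toy sketch. It assumes that a dangling node appears in the graph as a node without any outgoing edge, and it finds the largest strongly connected component with an iterative Kosaraju-style two-pass search. The function names and the six-node graph are invented, and this approach is only meant for graphs that fit comfortably in memory.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def dangling_nodes(num_nodes: int, edges: Sequence[Edge]) -> List[int]:
    """Nodes without outgoing edges (the assumed meaning of "dangling")."""
    has_outlink = [False] * num_nodes
    for from_id, _ in edges:
        has_outlink[from_id] = True
    return [n for n in range(num_nodes) if not has_outlink[n]]


def largest_scc_size(num_nodes: int, edges: Sequence[Edge]) -> int:
    """Size of the largest strongly connected component (Kosaraju, iterative)."""
    out = [[] for _ in range(num_nodes)]
    inc = [[] for _ in range(num_nodes)]
    for u, v in edges:
        out[u].append(v)
        inc[v].append(u)
    # Pass 1: depth-first search on the graph, recording finish order.
    visited = [False] * num_nodes
    order = []
    for start in range(num_nodes):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, 0)]
        while stack:
            node, i = stack.pop()
            if i < len(out[node]):
                stack.append((node, i + 1))
                nxt = out[node][i]
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, 0))
            else:
                order.append(node)
    # Pass 2: search the transpose in reverse finish order; each search
    # started from an unassigned node collects exactly one component.
    assigned = [False] * num_nodes
    best = 0
    for start in reversed(order):
        if assigned[start]:
            continue
        assigned[start] = True
        size, stack = 0, [start]
        while stack:
            node = stack.pop()
            size += 1
            for pred in inc[node]:
                if not assigned[pred]:
                    assigned[pred] = True
                    stack.append(pred)
        best = max(best, size)
    return best


toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (0, 5)]
dangling = dangling_nodes(6, toy_edges)
print(dangling, len(dangling) / 6, largest_scc_size(6, toy_edges))
```

In the toy graph, nodes 4 and 5 are dangling and nodes 0, 1 and 2 form the largest strongly connected component.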

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/domain/ or, via HTTPS, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-may-jun-jul/domain/.

Download files of the Common Crawl May/June/July 2018 domain-level webgraph

Size      File                                             Description
0.64 GB   cc-main-2018-may-jun-jul-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
5.85 GB   cc-main-2018-may-jun-jul-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
3.21 GB   cc-main-2018-may-jun-jul-domain.graph            graph in BVGraph format
2 kB      cc-main-2018-may-jun-jul-domain.properties
3.43 GB   cc-main-2018-may-jun-jul-domain-t.graph          transpose of the graph
2 kB      cc-main-2018-may-jun-jul-domain-t.properties
1 kB      cc-main-2018-may-jun-jul-domain.stats            WebGraph statistics
1.96 GB   cc-main-2018-may-jun-jul-domain-ranks.txt.gz     harmonic centrality and pagerank
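A reader of the vertices file might look like the sketch below. It assumes one node per line with the three fields ⟨id, rev domain, num hosts⟩ separated by tabs; the separator is an assumption, and the class name, function names and sample lines are hypothetical.

```python
import gzip
from typing import Dict, Iterable, Iterator, NamedTuple


class DomainVertex(NamedTuple):
    node_id: int
    rev_domain: str
    num_hosts: int


def parse_domain_vertices(lines: Iterable[str]) -> Iterator[DomainVertex]:
    """Parse lines of the assumed form "id<TAB>rev_domain<TAB>num_hosts"."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        node_id, rev_domain, num_hosts = line.split("\t")
        yield DomainVertex(int(node_id), rev_domain, int(num_hosts))


def read_domain_vertices(path: str) -> Dict[int, DomainVertex]:
    """Load a locally downloaded, gzip-compressed vertices file into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return {v.node_id: v for v in parse_domain_vertices(handle)}


# Hypothetical sample lines, not taken from the real file.
sample = ["0\tcom.example\t3\n", "1\torg.example\t1\n"]
vertices = list(parse_domain_vertices(sample))
print(vertices)
print(sum(v.num_hosts for v in vertices))
```

Keeping the parser separate from the gzip handling makes it easy to try the logic on a few lines before pointing `read_domain_vertices` at the full download.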

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 92 million domains is available for download.
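For orientation, harmonic centrality of a node is commonly defined as the sum of 1/d over all other nodes that can reach it, where d is the length of the shortest path to the node. The sketch below computes it exactly with one breadth-first search per node on a toy graph. It is an illustration of the definition only, under that assumption about the definition in use; the function name and the four-node graph are invented, and exact search of this kind is not practical for a graph with 92 million nodes.

```python
from collections import deque
from typing import List, Sequence, Tuple


def harmonic_centrality(
    num_nodes: int, edges: Sequence[Tuple[int, int]]
) -> List[float]:
    """Exact harmonic centrality: sum of 1/distance from every node that
    can reach the target (unreachable nodes contribute nothing)."""
    inlinks = [[] for _ in range(num_nodes)]
    for from_id, to_id in edges:
        inlinks[to_id].append(from_id)
    scores = []
    for target in range(num_nodes):
        # Breadth-first search backwards along the links into `target`.
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in inlinks[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores.append(total)
    return scores


# Toy graph: 0 -> 2, 1 -> 2, 3 -> 0.
print(harmonic_centrality(4, [(0, 2), (1, 2), (3, 0)]))
```

In the toy graph, node 2 is reached from nodes 0 and 1 in one step and from node 3 in two steps, so its score is 1 + 1 + 0.5 = 2.5.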

Top 1000 domains ranked by harmonic centrality (May/June/July 2018)

harmonic centrality rank  hc value  pagerank rank  pagerank value  reversed hostname
1  25381622  2  0.013272  com.facebook
2  24767500  1  0.016429  com.googleapis
3  23574566  3  0.009596  com.google
4  22826384  4  0.008408  com.twitter
5  22398676  5  0.007043  com.youtube
6  21446850  6  0.006211  org.w
7  20000170  7  0.004495  org.gmpg
8  19917792  9  0.003686  com.instagram
9  19565892  11  0.003123  com.linkedin
10  18904142  25  0.001434  com.gravatar
11  18886866  14  0.002009  com.wordpress
12  18791656  26  0.001378  org.wikipedia
13  18683474  23  0.001591  com.pinterest
14  18605644  13  0.002616  org.wordpress
15  18523062  21  0.001661  com.apple
16  18506550  12  0.002795  com.bootstrapcdn
17  18299454  33  0.000893  com.blogspot
18  18261082  24  0.001454  com.vimeo
19  18104372  37  0.000799  com.amazon
20  18101742  34  0.000875  gl.goo
21  18052860  38  0.000756  be.youtu
22  18015066  28  0.001162  com.microsoft
23  17990876  16  0.001783  com.googletagmanager
24  17945844  19  0.001702  com.adobe
25  17942552  44  0.000652  com.tumblr
26  17901968  15  0.001947  com.cloudflare
27  17853502  20  0.001684  com.macromedia
28  17832868  45  0.000641  com.wp
29  17823530  61  0.000483  com.yahoo
30  17781678  40  0.000719  com.flickr
31  17733538  46  0.000626  ly.bit
32  17680810  48  0.000606  me.wp
33  17674312  35  0.000857  com.paypal
34  17654864  32  0.000904  com.amazonaws
35  17602138  22  0.001598  com.github
36  17588460  104  0.000250  com.nytimes
37  17550714  54  0.000545  org.mozilla
38  17517872  70  0.000400  com.weebly
39  17506702  89  0.000291  com.googleusercontent
40  17435634  41  0.000714  io.github
41  17400536  184  0.000140  com.wsj
42  17353638  144  0.000209  com.dropbox
43  17341062  166  0.000161  org.wikimedia
44  17336710  141  0.000217  com.imgur
45  17319686  57  0.000497  com.medium
46  17318280  68  0.000411  org.creativecommons
47  17316806  65  0.000434  com.bing
48  17274976  147  0.000198  com.blogger
49  17261700  29  0.001115  com.gstatic
50  17257470  66  0.000422  com.jquery
51  17236104  211  0.000119  com.businessinsider
52  17211698  155  0.000182  net.slideshare
53  17211430  207  0.000120  com.wired
54  17203902  53  0.000577  co.t
55  17197252  56  0.000520  eu.europa
56  17187720  182  0.000142  com.myspace
57  17184156  92  0.000278  com.mailchimp
58  17153616  36  0.000843  org.apache
59  17131630  31  0.000912  net.doubleclick
60  17127558  69  0.000402  com.statcounter
61  17120814  63  0.000477  com.list-manage
62  17112850  246  0.000100  org.npr
63  17105220  145  0.000203  com.issuu
64  17099450  27  0.001250  ru.yandex
65  17089826  314  0.000078  com.theverge
66  17089012  321  0.000077  com.appspot
67  17080688  168  0.000159  org.gnu
68  17059280  142  0.000216  com.yelp
69  17056948  52  0.000581  org.w3
70  17053438  331  0.000075  com.about
71  17048062  267  0.000090  me.about
72  17035726  176  0.000148  com.oracle
73  17031278  8  0.004428  com.godaddy
74  17016312  175  0.000148  org.ietf
75  17014942  377  0.000065  com.slate
76  16988198  42  0.000702  net.cloudfront
77  16987550  301  0.000082  com.buzzfeed
78  16986236  226  0.000111  com.tinyurl
79  16984038  436  0.000056  edu.princeton
80  16970148  336  0.000074  com.deviantart
81  16945098  206  0.000122  com.cnn
82  16943242  366  0.000066  edu.washington
83  16941304  105  0.000250  com.reddit
84  16922914  393  0.000062  edu.ucla
85  16917390  85  0.000302  com.soundcloud
86  16917338  449  0.000055  com.nike
87  16909478  193  0.000136  uk.co.bbc
88  16901808  60  0.000485  org.schema
89  16899732  378  0.000064  org.arxiv
90  16896852  397  0.000060  org.chromium
91  16894404  181  0.000142  com.theguardian
92  16888586  165  0.000163  com.forbes
93  16887802  380  0.000064  com.stackexchange
94  16886350  161  0.000172  com.android
95  16885880  343  0.000070  gov.loc
96  16881304  437  0.000056  com.qz
97  16878064  333  0.000074  com.foursquare
98  16871092  242  0.000102  com.nbcnews
99  16862884  312  0.000079  gov.fda
100  16856244  389  0.000063  org.ieee
101  16855606  30  0.000991  com.squarespace
102  16849916  443  0.000055  org.sciencemag
103  16828640  82  0.000323  net.akamaihd
104  16825050  284  0.000085  com.example
105  16821922  423  0.000057  com.trello
106  16815472  172  0.000152  com.whatsapp
107  16812836  215  0.000119  es.google
108  16812598  298  0.000082  com.typeform
109  16793418  616  0.000043  com.flipboard
110  16787856  59  0.000493  net.fbcdn
111  16782836  151  0.000190  org.bbb
112  16781822  218  0.000118  edu.stanford
113  16781438  427  0.000057  com.libsyn
114  16779444  502  0.000049  google.blog
115  16777134  283  0.000085  com.go
116  16774636  419  0.000057  com.withgoogle
117  16763315  610  0.000043  edu.utah
118  16762919  64  0.000437  com.ytimg
119  16761512  328  0.000076  com.reuters
120  16756209  251  0.000097  com.live
121  16749983  164  0.000163  org.archive
122  16747795  518  0.000048  edu.gatech
123  16741185  75  0.000357  com.fb
124  16739169  106  0.000250  edu.utexas
125  16738538  208  0.000120  com.huffingtonpost
126  16737695  286  0.000084  com.bloomberg
127  16732629  241  0.000103  com.techcrunch
128  16726768  317  0.000078  edu.harvard
129  16724125  205  0.000123  com.dribbble
130  16723935  318  0.000078  com.git-scm
131  16719142  169  0.000159  gov.nih
132  16707450  146  0.000199  net.sourceforge
133  16706716  340  0.000072  com.msn
134  16705396  77  0.000351  com.wix
135  16697236  294  0.000083  uk.co.blogspot
136  16694665  316  0.000078  com.googlecode
137  16688264  369  0.000066  com.bbc
138  16685043  225  0.000111  com.typepad
139  16684281  234  0.000106  com.washingtonpost
140  16683086  213  0.000119  com.imdb
141  16679284  538  0.000047  com.chron
142  16671633  706  0.000042  com.hbo
143  16668347  364  0.000067  com.mashable
144  16665099  87  0.000294  com.shopify
145  16661523  76  0.000351  com.paypalobjects
146  16659375  262  0.000092  edu.mit
147  16650967  394  0.000062  com.tinypic
148  16650908  293  0.000083  au.com.google
149  16647149  308  0.000080  com.cnet
150  16641164  300  0.000082  com.usatoday
151  16640112  296  0.000083  net.windows
152  16637416  416  0.000058  au.gov.nsw
153  16626638  299  0.000082  com.ibm
154  16622823  371  0.000065  uk.co.dailymail
155  16622125  339  0.000073  uk.co.telegraph
156  16619319  362  0.000067  com.gmail
157  16615887  173  0.000151  com.eventbrite
158  16612204  227  0.000110  net.php
159  16609048  523  0.000048  com.fastcodesign
160  16596577  334  0.000074  com.time
161  16594204  421  0.000057  com.ted
162  16593660  73  0.000366  de.google
163  16591616  531  0.000047  org.rubyonrails
164  16575402  271  0.000088  com.mapquest
165  16575133  561  0.000045  edu.illinois
166  16571670  170  0.000154  com.opera
167  16568994  395  0.000061  com.latimes
168  16565169  819  0.000037  com.dezeen
169  16560426  306  0.000081  com.hp
170  16556998  180  0.000143  com.stackoverflow
171  16554184  452  0.000055  org.eclipse
172  16549466  236  0.000105  com.ebay
173  16543706  440  0.000055  com.kickstarter
174  16540301  425  0.000057  gov.nasa
175  16538974  231  0.000106  uk.co.amazon
176  16533349  486  0.000051  edu.cornell
177  16531917  188  0.000139  com.etsy
178  16529057  410  0.000058  com.aol
179  16526090  542  0.000046  com.quora
180  16525042  288  0.000084  com.meetup
181  16520185  520  0.000048  com.googleblog
182  16518640  761  0.000039  io.itch
183  16517972  414  0.000058  com.variety
184  16517219  499  0.000050  edu.berkeley
185  16508733  622  0.000043  uk.co.pinterest
186  16506112  100  0.000257  com.livestream
187  16502945  514  0.000049  com.ft
188  16501376  466  0.000053  co.g
189  16499713  439  0.000056  com.theatlantic
190  16498131  519  0.000048  com.zdnet
191  16492014  244  0.000101  com.surveymonkey
192  16488627  199  0.000130  com.tripadvisor
193  16485175  390  0.000063  com.cnbc
194  16483162  1037  0.000031  com.engadget
195  16481038  463  0.000054  com.mixcloud
196  16476002  605  0.000044  com.vogue
197  16470277  752  0.000039  com.nationalgeographic
198  16468657  750  0.000040  com.creativebloq
199  16467451  565  0.000045  com.yellowpages
200  16467125  90  0.000290  com.addthis
201  16466147  239  0.000103  org.drupal
202  16464114  311  0.000079  com.udacity
203  16463605  898  0.000035  com.sfgate
204  16456191  748  0.000040  com.discogs
205  16455549  255  0.000095  com.digg
206  16453381  817  0.000037  com.wikia
207  16449510  528  0.000047  com.nature
208  16448944  185  0.000140  com.spotify
209  16448054  560  0.000045  org.pbs
210  16448046  86  0.000300  com.twimg
211  16444439  431  0.000056  com.angieslist
212  16437315  384  0.000063  com.skype
213  16435380  469  0.000053  com.fortune
214  16434501  102  0.000255  net.jsfiddle
215  16433743  602  0.000044  com.newyorker
216  16431871  492  0.000051  com.cbsnews
217  16430182  405  0.000059  gov.whitehouse
218  16427414  354  0.000069  org.python
219  16425704  280  0.000086  com.hubspot
220  16424605  324  0.000076  gov.cdc
221  16421283  547  0.000046  org.aarp
222  16420623  541  0.000046  com.findlaw
223  16419061  194  0.000135  com.zendesk
224  16416565  879  0.000036  com.arstechnica
225  16415324  493  0.000051  org.hbr
226  16415039  160  0.000173  com.wixsite
227  16414264  513  0.000049  com.cisco
228  16414231  58  0.000495  com.vk
229  16414007  361  0.000067  com.photobucket
230  16408388  357  0.000069  com.springer
231  16407667  509  0.000049  com.superpages
232  16406036  728  0.000041  com.intel
233  16405431  407  0.000058  com.giphy
234  16404519  253  0.000096  to.amzn
235  16404240  800  0.000038  com.manta
236  16403321  47  0.000608  com.qq
237  16403260  1188  0.000029  com.gizmodo
238  16403076  498  0.000050  com.entrepreneur
239  16402950  608  0.000043  com.venturebeat
240  16402108  1047  0.000031  edu.upenn
241  16401145  360  0.000068  com.nypost
242  16401094  406  0.000059  org.un
243  16395874  1405  0.000028  uk.ac.ox
244  16395078  626  0.000043  com.scribd
245  16394056  961  0.000034  com.thenextweb
246  16393237  587  0.000044  com.unsplash
247  16392164  470  0.000053  com.xrea
248  16390829  764  0.000039  com.hackernoon
249  16390735  1013  0.000032  edu.columbia
250  16385680  713  0.000041  com.box
251  16383953  230  0.000108  com.stumbleupon
252  16383638  1551  0.000024  edu.purdue
253  16381649  593  0.000044  com.vice
254  16381302  913  0.000035  ly.snip
255  16380414  229  0.000109  net.behance
256  16380111  600  0.000044  com.symantec
257  16379592  111  0.000236  com.jimdo
258  16378318  959  0.000034  com.googledrive
259  16377172  257  0.000094  com.salesforce
260  16375969  387  0.000063  com.images-amazon
261  16374033  504  0.000049  org.unicode
262  16372683  445  0.000055  com.office
263  16370954  635  0.000043  com.citysearch
264  16365786  876  0.000036  com.healthgrades
265  16363957  303  0.000081  org.acm
266  16363446  240  0.000103  com.disqus
267  16363144  338  0.000073  com.tripod
268  16356913  992  0.000033  com.pixabay
269  16354234  370  0.000066  com.oreilly
270  16343963  1048  0.000031  com.indiegogo
271  16342521  1040  0.000031  com.evernote
272  16337476  735  0.000040  gov.noaa
273  16337226  785  0.000038  com.spreadshirt
274  16332902  1364  0.000029  com.searchengineland
275  16330466  1597  0.000023  uk.co.theregister
276  16327360  620  0.000043  com.avvo
277  16325045  189  0.000138  com.constantcontact
278  16324354  489  0.000051  com.inc
279  16322948  920  0.000035  com.naturalnews
280  16321702  376  0.000065  org.ampproject
281  16321578  450  0.000055  me.paypal
282  16320188  342  0.000071  com.livejournal
283  16319952  422  0.000057  com.businesswire
284  16319387  974  0.000033  au.net.abc
285  16319144  259  0.000093  org.joomla
286  16314140  885  0.000035  com.dropboxusercontent
287  16312103  853  0.000036  com.statista
288  16308378  558  0.000045  com.goodreads
289  16308361  1398  0.000028  com.sciencedaily
290  16307378  1377  0.000029  com.storify
291  16302792  796  0.000038  com.curbed
292  16302729  186  0.000139  com.feedburner
293  16300939  1548  0.000024  com.pcmag
294  16300342  644  0.000042  gov.defense
295  16300060  726  0.000041  org.eff
296  16298671  383  0.000064  com.sxsw
297  16298398  1594  0.000023  com.mcafee
298  16296843  471  0.000052  com.snapchat
299  16295047  1002  0.000032  com.shutterstock
300  16294087  700  0.000042  com.moz
301  16293401  158  0.000175  uk.co.google
302  16291354  433  0.000056  com.adweek
303  16289733  337  0.000073  gov.ca
304  16288869  237  0.000105  com.bandcamp
305  16285197  233  0.000106  de.amazon
306  16281279  909  0.000035  gov.census
307  16279095  627  0.000043  site.business
308  16278411  767  0.000039  com.economist
309  16273403  346  0.000070  com.wiley
310  16271595  157  0.000177  com.weibo
311  16271436  911  0.000035  gov.uspto
312  16270078  93  0.000273  me.fb
313  16266403  990  0.000033  gov.fcc
314  16263211  1624  0.000022  com.pcworld
315  16263070  1366  0.000029  org.worldbank
316  16262404  1695  0.000021  com.fifa
317  16261893  892  0.000035  com.merchantcircle
318  16261024  749  0.000040  tv.twitch
319  16257455  1719  0.000020  edu.unc
320  16255699  954  0.000034  com.steampowered
321  16255672  1637  0.000022  org.khanacademy
322  16255397  880  0.000036  com.indiatimes
323  16255076  309  0.000080  com.smugmug
324  16254667  564  0.000045  com.wikihow
325  16253704  949  0.000034  org.unesco
326  16253384  1590  0.000023  edu.northwestern
327  16252481  1056  0.000031  com.redhat
328  16250050  1599  0.000023  com.scientificamerican
329  16247225  808  0.000037  gov.nist
330  16246611  1563  0.000024  com.smashingmagazine
331  16241341  822  0.000037  com.deloitte
332  16240554  1388  0.000028  com.politico
333  16239514  325  0.000076  com.googlesyndication
334  16238580  787  0.000038  org.tigris
335  16237777  367  0.000066  com.prnewswire
336  16235711  1017  0.000032  edu.yale
337  16235543  925  0.000035  com.ubuntu
338  16233613  742  0.000040  org.aiga
339  16232678  1487  0.000026  com.pexels
340  16230507  1493  0.000026  com.thinkwithgoogle
341  16229621  1440  0.000027  com.alibaba
342  16226635  304  0.000081  ca.google
343  16226232  359  0.000068  com.dailymotion
344  16226122  1703  0.000021  com.vanityfair
345  16223766  1794  0.000019  com.udemy
346  16223352  277  0.000086  com.windowsphone
347  16222378  621  0.000043  com.slack
348  16219936  1409  0.000028  ca.blogspot
349  16219473  332  0.000074  com.bitly
350  16214229  962  0.000034  gov.nps
351  16213218  279  0.000086  com.wufoo
352  16213088  755  0.000039  com.webmd
353  16212554  794  0.000038  de.blogspot
354  16209004  1019  0.000032  com.prweb
355  16204931  1766  0.000019  edu.usc
356  16204520  546  0.000046  com.homeadvisor
357  16203590  991  0.000033  com.deepmind
358  16203266  322  0.000077  com.mozilla
359  16201535  1652  0.000022  org.weforum
360  16200784  1912  0.000017  com.ehow
361  16199879  586  0.000044  com.netflix
362  16199442  724  0.000041  com.samsung
363  16197387  477  0.000052  com.webs
364  16197196  1833  0.000018  com.ikea
365  16196228  183  0.000141  jp.co.yahoo
366  16195807  2548  0.000012  com.sophos
367  16195059  813  0.000037  org.amnesty
368  16191501  996  0.000033  org.spie
369  16191482  1635  0.000022  com.billboard
370  16191120  1374  0.000029  com.hootsuite
371  16188221  1001  0.000032  com.whitepages
372  16188111  319  0.000078  fr.free
373  16187195  187  0.000139  com.xing
374  16186531  815  0.000037  com.java
375  16186252  1827  0.000018  org.coursera
376  16184139  1381  0.000029  com.speakerdeck
377  16182369  152  0.000190  com.youtube-nocookie
378  16178099  1908  0.000017  com.tutsplus
379  16177836  899  0.000035  com.marketwatch
380  16175545  995  0.000033  edu.psu
381  16174492  1688  0.000021  com.chrome
382  16173930  1077  0.000031  com.airbnb
383  16173485  1740  0.000020  au.com.smh
384  16173008  952  0.000034  gov.senate
385  16172258  276  0.000087  com.getbootstrap
386  16171414  1443  0.000027  com.marketingland
387  16169883  1104  0.000030  com.ycombinator
388  16168541  413  0.000058  int.who
389  16168403  1021  0.000032  edu.umich
390  16167647  1511  0.000025  com.xkcd
391  16165208  1617  0.000022  com.merriam-webster
392  16164062  986  0.000033  it.binged
393  16162437  1052  0.000031  com.sun
394  16162150  769  0.000039  com.googlesource
395  16161882  1439  0.000027  edu.ucsd
396  16160824  399  0.000060  com.mysql
397  16156698  418  0.000057  com.bigcartel
398  16154861  846  0.000036  gov.state
399  16153725  981  0.000033  com.itsnicethat
400  16153693  1459  0.000027  uk.ac.cam
401  16152535  260  0.000093  com.myshopify
402  16152053  1482  0.000026  co.vine
403  16150830  540  0.000047  gov.usda
404  16147936  1934  0.000017  edu.ucdavis
405  16147539  1840  0.000018  com.autodesk
406  16147195  795  0.000038  org.aclweb
407  16146633  1403  0.000028  com.css-tricks
408  16144670  2154  0.000014  edu.ncsu
409  16143655  1646  0.000022  com.playstation
410  16143349  946  0.000034  io.material
411  16143314  1105  0.000030  org.iso
412  16142931  836  0.000036  gov.justice
413  16141858  827  0.000037  com.foxnews
414  16141188  829  0.000037  com.gartner
415  16140967  1721  0.000020  uk.ac.ucl
416  16137787  382  0.000064  com.booking
417  16136856  945  0.000034  com.psychologytoday
418  16136353  81  0.000337  com.baidu
419  16135804  786  0.000038  gov.copyright
420  16135574  1593  0.000023  com.target
421  16135083  2021  0.000016  edu.arizona
422  16131304  1450  0.000027  io.codepen
423  16130058  415  0.000058  com.monster
424  16126494  482  0.000052  gov.irs
425  16126161  1732  0.000020  com.freepik
426  16123562  1430  0.000027  com.gumroad
427  16122500  1519  0.000025  de.spiegel
428  16120242  285  0.000085  gov.ftc
429  16117722  1660  0.000021  com.com
430  16116716  363  0.000067  com.githubusercontent
431  16115063  1977  0.000016  com.msnbc
432  16113873  572  0.000045  in.co.google
433  16110022  1722  0.000020  com.gigaom
434  16109670  1438  0.000027  com.dell
435  16108948  907  0.000035  com.tandfonline
436  16108826  396  0.000060  net.themeforest
437  16108552  1442  0.000027  com.businessweek
438  16107032  533  0.000047  gov.epa
439  16106841  1039  0.000031  com.gofundme
440  16106821  326  0.000076  com.rawgit
441  16106609  1875  0.000018  com.angelfire
442  16105593  1824  0.000018  com.yoast
443  16104950  2525  0.000012  com.fiverr
444  16104946  1623  0.000022  com.nymag
445  16103599  1616  0.000022  com.hollywoodreporter
446  16103457  1027  0.000032  ca.cbc
447  16103234  1648  0.000022  com.sap
448  16103036  1098  0.000030  com.nielsen
449  16102998  426  0.000057  org.nodejs
450  16102709  2510  0.000012  edu.hbs
451  16102701  195  0.000135  com.eepurl
452  16102688  741  0.000040  com.blackberry
453  16102121  2618  0.000012  edu.caltech
454  16101515  1576  0.000024  com.ning
455  16100948  582  0.000044  uk.co.independent
456  16099080  967  0.000033  com.underconsideration
457  16098787  1904  0.000017  com.semrush
458  16096514  2562  0.000012  com.popsci
459  16096221  1903  0.000017  com.howstuffworks
460  16096111  734  0.000040  gov.hhs
461  16094981  792  0.000038  com.usnews
462  16093468  17  0.001739  com.wixstatic
463  16093240  1591  0.000023  org.fao
464  16091056  1985  0.000016  tv.periscope
465  16090419  2152  0.000014  com.cbs
466  16089586  1457  0.000027  org.altervista
467  16088656  480  0.000052  us.icio
468  16087093  451  0.000055  com.force
469  16086649  1023  0.000032  com.500px
470  16085221  2114  0.000015  uk.ac.ed
471  16083763  2228  0.000014  com.instructables
472  16083658  1936  0.000017  org.filezilla-project
473  16082249  1675  0.000021  com.nba
474  16081782  2599  0.000012  com.codecademy
475  16081087  1531  0.000025  com.elpais
476  16080942  1091  0.000030  es.iac
477  16078592  120  0.000226  com.google-analytics
478  16078096  305  0.000081  com.staticflickr
479  16077806  861  0.000036  uk.co.guardian
480  16076776  1517  0.000025  com.warnerbros
481  16076399  496  0.000050  com.cargocollective
482  16076336  1868  0.000018  com.canva
483  16076212  2072  0.000015  com.gamespot
484  16075519  1665  0.000021  edu.jhu
485  16074907  1437  0.000027  edu.wisc
486  16072844  976  0.000033  com.uservoice
487  16070548  963  0.000033  net.researchgate
488  16070276  1420  0.000028  com.istockphoto
489  16069880  1022  0.000032  com.insiderpages
490  16068798  790  0.000038  tv.ustream
491  16068337  2191  0.000014  au.com.news
492  16068265  4005  0.000007  com.space
493  16067417  881  0.000036  gov.arts
494  16067121  281  0.000085  com.fc2
495  16067028  444  0.000055  com.sciencedirect
496  16066807  1726  0.000020  com.hulu
497  16066412  1434  0.000027  gov.usgs
498  16064058  1804  0.000019  com.fedex
499  16063290  1460  0.000027  com.forrester
500  16061979  1897  0.000017  org.pnas
501  16061256  799  0.000038  com.feedly
502  16060176  2986  0.000010  com.hubpages
503  16060072  2107  0.000015  com.crunchbase
504  16059447  1725  0.000020  com.mercurynews
505  16057745  1415  0.000028  com.reverbnation
506  16057012  1004  0.000032  com.lighthouseapp
507  16056491  1609  0.000023  com.indeed
508  16056336  2396  0.000013  com.programmableweb
509  16055614  746  0.000040  com.gotowebinar
510  16055125  1234  0.000029  com.mlb
511  16054055  1005  0.000032  com.timeanddate
512  16053957  1713  0.000020  kr.flic
513  16053898  159  0.000174  com.googleadservices
514  16053749  1600  0.000023  edu.si
515  16052692  247  0.000099  com.getclicky
516  16052033  222  0.000114  jp.co.amazon
517  16051826  1749  0.000020  com.today
518  16051338  1376  0.000029  ly.ow
519  16050686  441  0.000055  edu.cmu
520  16050498  1375  0.000029  org.redcross
521  16050390  601  0.000044  com.squareup
522  16048735  1809  0.000019  com.domain
523  16048179  1692  0.000021  edu.uchicago
524  16047498  1385  0.000028  de.heise
525  16046589  1042  0.000031  com.googlelabs
526  16046497  935  0.000034  com.patreon
527  16045843  2144  0.000015  com.ibtimes
528  16045451  401  0.000059  com.clicky
529  16043594  1826  0.000018  com.socialmediaexaminer
530  16042787  1416  0.000028  com.americanexpress
531  16042447  411  0.000058  com.w3schools
532  16042132  2179  0.000014  org.gimp
533  16041800  727  0.000041  com.photoshelter
534  16041645  386  0.000063  edu.nyu
535  16039704  910  0.000035  org.scala-lang
536  16038413  2877  0.000010  com.oxforddictionaries
537  16038164  888  0.000035  ca.amazon
538  16037904  1761  0.000019  com.upwork
539  16036849  1604  0.000023  org.apa
540  16036736  1549  0.000024  com.accenture
541  16036020  2320  0.000013  com.csmonitor
542  16035521  2772  0.000011  com.lynda
543  16034570  2092  0.000015  com.bestbuy
544  16034530  943  0.000034  com.emarketer
545  16032865  403  0.000059  com.herokuapp
546  16032606  998  0.000033  au.com.yellowpages
547  16032334  581  0.000045  com.houzz
548  16031562  1657  0.000022  com.codeplex
549  16027393  108  0.000243  jp.co.google
550  16027146  1654  0.000022  com.theglobeandmail
551  16027065  1817  0.000019  com.zillow
552  16026139  3127  0.000009  org.notepad-plus-plus
553  16025586  1035  0.000031  com.uber
554  16025512  2158  0.000014  com.aljazeera
555  16025193  481  0.000052  org.doi
556  16024531  1402  0.000028  gov.fbi
557  16024228  278  0.000086  com.youku
558  16024147  1135  0.000030  edu.alamo
559  16023519  1774  0.000019  org.letsencrypt
560  16023515  1847  0.000018  com.lulu
561  16022000  1418  0.000028  com.unity3d
562  16021724  472  0.000052  com.iconfinder
563  16019791  198  0.000133  com.histats
564  16019669  1878  0.000017  com.norton
565  16019565  703  0.000042  uk.co.tripadvisor
566  16019111  1386  0.000028  com.walmart
567  16018429  2180  0.000014  edu.asu
568  16017954  1556  0.000024  com.prezi
569  16016889  989  0.000033  gov.usa
570  16015420  1760  0.000020  com.thehill
571  16013857  2053  0.000015  com.thestar
572  16013612  1561  0.000024  in.blogspot
573  16012265  1041  0.000031  jp.co.fujixerox
574  16011010  1980  0.000016  com.trendmicro
575  16010772  1533  0.000025  com.bufferapp
576  16009533  1579  0.000024  com.intuit
577  16008583  1507  0.000025  edu.umn
578  16007713  2512  0.000012  edu.wustl
579  16006379  1029  0.000032  com.chamberofcommerce
580  16005982  1119  0.000030  net.brownbook
581  16005447  1679  0.000021  com.hotmail
582  16004096  430  0.000056  cn.com.sina
583  16003516  1392  0.000028  com.techrepublic
584  16002880  1666  0.000021  com.econsultancy
585  16002231  4188  0.000007  com.boredpanda
586  16001968  107  0.000247  com.messenger
587  16001601  2167  0.000014  com.icloud
588  16001177  957  0.000034  com.outlook
589  16000754  2412  0.000013  com.twitpic
590  16000697  2045  0.000015  com.ifttt
591  16000659  2134  0.000015  com.lonelyplanet
592  16000307  1851  0.000018  edu.virginia
593  16000043  329  0.000075  com.naver
594  15999112  2103  0.000015  com.mentalfloss
595  15998718  2260  0.000014  com.refinery29
596  15997519  3501  0.000008  net.minecraft
597  15997019  235  0.000106  fr.google
598  15996815  1554  0.000024  com.jetbrains
599  15996044  1399  0.000028  com.aweber
600  15995815  2389  0.000013  com.animoto
601  15995749  1395  0.000028  us.imageshack
602  15995631  1839  0.000018  com.zazzle
603  15995483  1111  0.000030  com.ezlocal
604  15995033  400  0.000060  com.newrelic
605  15994874  1754  0.000020  com.posterous
606  15994709  1986  0.000016  org.aclu
607  15994590  633  0.000043  gov.sec
608  15994083  980  0.000033  uk.co.eventbrite
609  15992997  2533  0.000012  edu.unl
610  15991024  3390  0.000009  com.fitbit
611  15990563  3562  0.000008  com.wolfram
612  15990135  1049  0.000031  edu.utep
613  15989699  1765  0.000019  org.owasp
614  15989560  1914  0.000017  com.people
615  15989353  1700  0.000021  com.irishtimes
616  15987982  1658  0.000021  org.cambridge
617  15987488  1714  0.000020  com.aliexpress
618  15986613  2745  0.000011  org.kiva
619  15986099  2201  0.000014  com.getresponse
620  15985328  870  0.000036  ca.yelp
621  15984766  2650  0.000011  com.klout
622  15984420  1724  0.000020  edu.academia
623  15984370  3826  0.000008  edu.byu
624  15984363  1918  0.000017  edu.cuny
625  15983934  3152  0.000009  edu.dartmouth
626  15983757  3907  0.000007  com.lmgtfy
627  15982277  1397  0.000028  com.alexa
628  15982117  2841  0.000011  com.lastpass
629  15981087  1479  0.000026  com.mckinsey
630  15980938  2078  0.000015  it.scoop
631  15980757  98  0.000261  org.reactjs
632  15979441  88  0.000292  net.facebook
633  15979247  2766  0.000011  com.campaignmonitor
634  15978259  3392  0.000009  edu.uic
635  15977194  2122  0.000015  ch.ethz
636  15976084  148  0.000197  ru.mail
637  15975453  2328  0.000013  com.glamour
638  15975183  197  0.000135  it.google
639  15974182  1128  0.000030  fr.blogspot
640  15973189  1773  0.000019  com.foxbusiness
641  15972517  1993  0.000016  edu.msu
642  15972470  2206  0.000014  ca.ualberta
643  15972287  956  0.000034  com.city-data
644  15970326  2029  0.000016  edu.uci
645  15970066  1729  0.000020  com.newsweek
646  15969815  797  0.000038  org.jenkins-ci
647  15969741  1676  0.000021  com.marketo
648  15969666  1012  0.000032  com.cdbaby
649  15969585  1848  0.000018  com.hostgator
650  15969244  2216  0.000014  com.softpedia
651  15969035  4712  0.000006  com.diigo
652  15968736  997  0.000033  au.com.truelocal
653  15968524  1705  0.000021  com.yandex
654  15968467  3507  0.000008  com.starwars
655  15967022  3277  0.000009  com.softonic
656  15966584  1387  0.000028  com.lifehacker
657  15964769  417  0.000057  com.stripe
658  15964236  1681  0.000021  com.thomsonreuters
659  15964012  1967  0.000016  com.nfl
660  15962731  780  0.000038  com.uk
661  15962641  1451  0.000027  com.weather
662  15962045  2262  0.000014  edu.bu
663  15960841  177  0.000146  org.icann
664  15960580  2896  0.000010  org.ala
665  15960416  835  0.000036  org.openstreetmap
666  15957618  1647  0.000022  mp.j
667  15957395  289  0.000084  com.maxcdn
668  15957230  91  0.000290  org.networkadvertising
669  15957144  3168  0.000009  com.avast
670  15955925  2664  0.000011  org.virtualbox
671  15955293  2008  0.000016  edu.umass
672  15954737  1829  0.000018  gov.nyc
673  15954491  2343  0.000013  com.homedepot
674  15953595  1942  0.000017  edu.ufl
675  15952668  1691  0.000021  com.nokia
676  15952333  2161  0.000014  com.livestrong
677  15951184  2124  0.000015  com.history
678  15948690  409  0.000058  com.fastcompany
679  15947723  2289  0.000013  com.newscientist
680  15946716  1706  0.000021  com.vox
681  15944352  313  0.000078  com.taobao
682  15944010  516  0.000048  net.openid
683  15943715  1584  0.000023  fm.last
684  15943623  2190  0.000014  org.craigslist
685  15943310  878  0.000036  br.com.uol
686  15942512  2950  0.000010  ca.uwaterloo
687  15940080  374  0.000065  com.netdna-ssl
688  15938438  1730  0.000020  com.pwc
689  15935620  988  0.000033  gov.sba
690  15935207  458  0.000054  com.barnesandnoble
691  15935045  2653  0.000011  org.moma
692  15934238  2744  0.000011  org.phys
693  15933975  708  0.000041  com.docker
694  15933262  1055  0.000031  com.adage
695  15933136  1114  0.000030  com.formstack
696  15932731  3680  0.000008  cc.co
697  15932160  969  0.000033  com.pinimg
698  15932109  1670  0.000021  com.xbox
699  15931176  500  0.000050  com.cracked
700  15930661  349  0.000070  nl.google
701  15930196  204  0.000123  jp.ameblo
702  15929700  2800  0.000011  edu.hawaii
703  15929305  2102  0.000015  com.blogtalkradio
704  15929173  857  0.000036  com.delicious
705  15928584  2993  0.000010  com.123rf
706  15928210  2163  0.000014  com.britannica
707  15927984  2946  0.000010  org.greenpeace
708  15927067  1859  0.000018  com.stitcher
709  15926910  1831  0.000018  com.marketwired
710  15926723  1514  0.000025  gov.ny
711  15926477  3027  0.000010  uk.bl
712  15926226  1998  0.000016  net.boingboing
713  15926205  352  0.000070  org.opensource
714  15925493  697  0.000042  fr.amazon
715  15925417  2059  0.000015  com.templatemonster
716  15925384  1928  0.000017  com.networkworld
717  15925363  1038  0.000031  com.infusionsoft
718  15925035  1424  0.000028  com.shareasale
719  15924932  1020  0.000032  au.com.yelp
720  15924102  1025  0.000032  org.designmuseum
721  15924071  4109  0.000007  org.libreoffice
722  15922360  3072  0.000010  com.wikidot
723  15921986  1536  0.000024  com.globo
724  15921710  2960  0.000010  ca.globalnews
725  15921257  2464  0.000012  com.fox
726  15920771  771  0.000039  com.163
727  15920695  3399  0.000009  org.edx
728  15919256  1961  0.000016  com.mac
729  15918811  1842  0.000018  gov.treasury
730  15918386  2085  0.000015  com.urbandictionary
731  15917993  1528  0.000025  gov.bls
732  15917781  202  0.000126  jp.ne.hatena
733  15916788  1400  0.000028  com.arcgis
734  15915656  543  0.000046  com.technologyreview
735  15915609  1709  0.000021  com.gettyimages
736  15915236  872  0.000036  com.msdn
737  15915195  1704  0.000021  com.windows
738  15914569  1094  0.000030  com.mtnonline
739  15912776  2802  0.000011  com.knowyourmeme
740  15912472  223  0.000111  com.automattic
741  15912083  783  0.000038  com.discordapp
742  15912041  1213  0.000029  com.gloworld
743  15910756  4260  0.000007  com.trulia
744  15910422  948  0.000034  com.mysanantonio
745  15910398  238  0.000104  com.parallels
746  15910326  1092  0.000030  com.cbslocal
747  15909788  454  0.000055  com.mapbox
748  15909109  1694  0.000021  com.mtv
749  15909076  2695  0.000011  com.imageshack
750  15908308  1855  0.000018  edu.duke
751  15907269  1406  0.000028  com.accuweather
752  15907201  4130  0.000007  com.techsmith
753  15907143  1742  0.000020  uk.co.wired
754  15907097  3594  0.000008  com.makezine
755  15906461  2759  0.000011  edu.pitt
756  15906416  2077  0.000015  edu.indiana
757  15905735  1085  0.000030  edu.uah
758  15905631  1496  0.000025  me.m
759  15905303  1083  0.000030  com.judysbook
760  15905116  1494  0.000026  com.buffer
761  15904923  1612  0.000023  com.searchenginewatch
762  15904458  4619  0.000006  org.edublogs
763  15902765  1428  0.000028  com.ups
764  15902175  777  0.000038  gov.ed
765  15902033  960  0.000034  au.com.whitepages
766  15901135  2288  0.000013  uk.co.metro
767  15900981  2136  0.000015  com.ign
768  15900568  1940  0.000017  net.codecanyon
769  15899526  1987  0.000016  com.pastebin
770  15898425  2015  0.000016  com.nvidia
771  15898288  1068  0.000031  com.womentechmakers
772  15898118  3296  0.000009  org.code
773  15898010  2976  0.000010  edu.oregonstate
774  15897795  1963  0.000016  com.espn
775  15897477  1723  0.000020  org.gnome
776  15896920  833  0.000036  com.proofpoint
777  15896661  1463  0.000027  gov.dot
778  15896612  1067  0.000031  com.zoho
779  15895176  2534  0.000012  com.producthunt
780  15895150  751  0.000039  com.atlassian
781  15894496  2188  0.000014  ca.ubc
782  15894439  576  0.000045  com.us
783  15894285  1815  0.000019  com.contentmarketinginstitute
784  15894081  1501  0.000025  com.investopedia
785  15893267  2508  0.000012  com.bankofamerica
786  15893187  1685  0.000021  gov.wa
787  15892999  1921  0.000017  com.deadline
788  15892117  2544  0.000012  com.nhl
789  15892001  3657  0.000008  org.lifehack
790  15891790  1683  0.000021  com.vmware
791  15891718  2451  0.000012  com.starbucks
792  15891455  2362  0.000013  ly.visual
793  15891399  1204  0.000029  org.change
794  15891348  1923  0.000017  uk.ac.lse
795  15891194  2115  0.000015  com.magentocommerce
796  15890881  248  0.000098  org.iana
797  15890376  2631  0.000012  com.lifewire
798  15889967  1015  0.000032  com.visualstudio
799  15889397  1559  0.000024  jp.blogspot
800  15889184  1606  0.000023  com.sky
801  15889094  1601  0.000023  com.gotomeeting
802  15889052  1060  0.000031  com.bizcommunity
803  15888940  3281  0.000009  com.smashwords
804  15887168  1651  0.000022  com.mediafire
805  15886599  1578  0.000024  com.ssrn
806  15886095  1686  0.000021  net.recode
807  15885832  2593  0.000012  com.asus
808  15884393  1636  0.000022  se.haxx
809  15884138  1370  0.000029  es.amazon
810  15883786  474  0.000052  com.teamviewer
811  15883621  1787  0.000019  com.outbrain
812  15882577  699  0.000042  com.getpocket
813  15881184  3053  0.000010  com.macrumors
814  15880772  3268  0.000009  net.battle
815  15880362  1581  0.000024  com.nydailynews
816  15879641  1797  0.000019  edu.vanderbilt
817  15879231  2417  0.000013  com.thestreet
818  15878673  941  0.000034  net.azurewebsites
819  15878499  1641  0.000022  fr.lemonde
820  15878493  1620  0.000022  org.postimg
821  15877702  3393  0.000009  com.formula1
822  15877565  1455  0.000027  com.oup
823  15876621  2624  0.000012  gov.cia
824  15876304  2710  0.000011  org.olympic
825  15875628  2576  0.000012  org.7-zip
826  15875577  3906  0.000007  uk.ac.warwick
827  15875481  2573  0.000012  com.tesla
828  15873726  2350  0.000013  hk.com.google
829  15873629  1872  0.000018  com.ecwid
830  15873120  1003  0.000032  com.mlstatic
831  15872169  1663  0.000021  com.glassdoor
832  15872163  2076  0.000015  ca.utoronto
833  15871583  2303  0.000013  net.comcast
834  15870416  2118  0.000015  com.readwrite
835  15870130  2185  0.000014  ca.qc.gouv
836  15869916  1785  0.000019  gov.congress
837  15867724  947  0.000034  com.att
838  15867327  1801  0.000019  uk.co.mirror
839  15867299  495  0.000050  com.marriott
840  15866239  2956  0.000010  com.coinbase
841  15866134  2024  0.000016  com.me
842  15866035  3256  0.000009  gd.is
843  15865519  1421  0.000028  org.plos
844  15864929  1518  0.000025  com.business2community
845  15863777  1079  0.000030  com.sagepub
846  15863406  2876  0.000010  com.fineartamerica
847  15863024  1177  0.000029  me.pxlme
848  15862995  1569  0.000024  com.over-blog
849  15862909  1520  0.000025  com.techtarget
850  15862687  1970  0.000016  ru.narod
851  15862496  1770  0.000019  com.ssllabs
852  15861835  2423  0.000013  com.ge
853  15861301  1838  0.000018  org.unicef
854  15860987  2032  0.000016  int.wipo
855  15860621  179  0.000143  de.bund
856  15859232  982  0.000033  gov.house
857  15858772  2112  0.000015  uk.co.thesun
858  15858732  3078  0.000010  net.sucuri
859  15858417  2184  0.000014  com.yolasite
860  15858413  2567  0.000012  ms.1drv
861  15858271  2123  0.000015  au.com.blogspot
862  15857854  2686  0.000011  com.fool
863  15857834  3950  0.000007  com.thenation
864  15857610  4418  0.000007  edu.temple
865  15857583  3341  0.000009  com.makeuseof
866  15857018  1832  0.000018  edu.umd
867  15856943  782  0.000038  es.com.blogspot
868  15855364  1583  0.000023  com.pingdom
869  15855334  2292  0.000013  com.macworld
870  15855290  455  0.000054  jp.ne.sakura
871  15855088  2563  0.000012  com.webnode
872  15854840  3044  0.000010  com.freelancer
873  15854000  2297  0.000013  gov.nsf
874  15853872  2623  0.000012  edu.brown
875  15852601  4914  0.000006  ca.gc.statcan
876  15852595  1997  0.000016  com.getfirebug
877  15851933  2282  0.000013  com.wikispaces
878  15850737  2009  0.000016  org.jstor
879  15850691  1856  0.000018  co.angel
880  15850665  2912  0.000010  edu.tufts
881  15850001  743  0.000040  org.bitbucket
882  15849764  2243  0.000014  edu.osu
883  15849214  2121  0.000015  edu.tamu
884  15849065  2127  0.000015  org.wpmudev
885  15847636  834  0.000036  net.noscript
886  15847232  3735  0.000008  com.appleinsider
887158470681490.000195com.ggpht
8881584653217630.000019it.blogspot
8891584648524900.000012org.documentcloud
8901584590219260.000017com.cc
8911584458822070.000014us.zoom
8921584403217550.000020com.rollingstone
8931584349642050.000007li.paper
8941584317120520.000015edu.rutgers
8951584266923650.000013com.theonion
896158422927230.000041com.geocities
8971584215733280.000009com.indiewire
8981584170229780.000010int.esa
899158409549390.000034com.netdna-cdn
9001584066726040.000012ly.generalassemb
9011584060235710.000008edu.buffalo
902158400944850.000051br.com.google
9031584003810780.000030com.bitballoon
904158397034840.000051com.1and1
9051583897025710.000012com.sony
906158387706740.000042com.trustpilot
9071583737912000.000029org.oecd
9081583703420640.000015com.azcentral
9091583608111420.000030com.communitywalk
9101583582121970.000014org.videolan
9111583490725460.000012com.pandora
9121583328750620.000006org.anitaborg
9131583328423880.000013gov.in
9141583308130170.000010com.4shared
9151583279330580.000010org.metmuseum
9161583253321490.000015com.theknot
917158324798970.000035org.osgeo
918158301142010.000127me.line
919158276103720.000065com.bizjournals
9201582687425840.000012com.fujitsu
9211582674626740.000011com.blogs
922158262442720.000087org.debian
9231582489377700.000004edu.du
9241582376923370.000013com.bleacherreport
925158234105660.000045com.quantcast
9261582291724090.000013uk.co.express
9271582247921480.000015com.redbubble
9281582241328560.000010com.cosmopolitan
9291582221018030.000019org.cancer
9301582181711360.000030com.graphis
9311582106820580.000015de.zeit
9321582105323460.000013ca.sfu
933158207297320.000040com.wunderground
9341582067523110.000013com.convinceandconvert
9351582037228880.000010org.bitcoin
9361582019214640.000027com.usps
9371581987245450.000006com.blog
9381581959020670.000015com.salon
9391581727727790.000011com.technet
9401581726117990.000019net.daringfireball
9411581718925600.000012com.googlepages
942158166471150.000229com.bluehost
9431581655720890.000015com.w3techs
9441581626017710.000019com.calendly
9451581546930500.000010com.rottentomatoes
9461581499972680.000004com.elance
9471581453225290.000012com.createspace
948158139438550.000036com.comscore
9491581361024040.000013edu.colorado
9501581359910890.000030com.2findlocal
9511581356510570.000031org.tpr
9521581290816430.000022com.bt
953158128413650.000067com.rackcdn
9541581123434100.000009com.kotaku
9551581100837830.000008edu.syr
956158104977290.000041com.verisign
9571580990210090.000032com.tiddlywiki
9581580895717360.000020com.strikingly
9591580886761970.000005com.mercedes-benz
9601580886027360.000011com.oprah
9611580885117070.000021com.bmj
9621580836721400.000015com.popsugar
9631580740821750.000014org.hrw
9641580735520340.000016com.shareholder
9651580731618880.000017com.digicert
9661580708716450.000022com.steamcommunity
9671580703640570.000007com.pastemagazine
9681580666830990.000010com.voanews
9691580662610440.000031org.travelblog
9701580647014410.000027org.heart
9711580629029630.000010com.thrillist
9721580577241640.000007com.youcaring
9731580572910750.000031com.independent
9741580541121770.000014net.atlassian
9751580497145290.000006com.secondlife
9761580431115460.000024int.coe
9771580425223630.000013com.xerox
9781580208019070.000017com.computerworld
9791580186126080.000012com.groupon
9801580125638090.000008edu.rochester
9811580070934280.000009com.sas
9821579966320370.000016com.getsatisfaction
983157994674760.000052com.aliyuncs
9841579862665250.000005com.threatpost
9851579827118250.000018ru.spb
9861579794422240.000014com.gawker
9871579780924680.000012me.flavors
9881579773041970.000007com.slides
9891579764127940.000011com.madmimi
9901579750924110.000013com.hindustantimes
9911579697843380.000007org.teamusa
9921579657014670.000026gov.va
9931579567218700.000018mil.navy
994157954783500.000070jp.co.rakuten
995157953897200.000041com.hilton
9961579503610460.000031com.chicagotribune
9971579502414860.000026com.cafepress
9981579428323360.000013org.dyndns
9991579411322710.000014com.teenvogue
1000157936988960.000035gov.export

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and similar topics. Let us know about your results via Common Crawl’s Google Group!

3.25 Billion Pages Crawled in July 2018

The crawl archive for July 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-30/. It contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd.

The July crawl contains 625 million new URLs that were not contained in any previous crawl archive. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Feb/Mar/Apr 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 25 million hosts or top 25 million domains of the webgraph dataset
  • a random sample taken from WAT files of the June crawl
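The breadth-first side crawl in the second bullet can be pictured as a depth-limited breadth-first search. The following is a minimal sketch only: the function name, the in-memory `links` map, and the toy data are hypothetical stand-ins for real page fetches, not Common Crawl's crawler code.

```python
from collections import deque

def bounded_bfs(links, seeds, max_hops):
    """Return {url: hops} for every URL reachable within max_hops links of a seed."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:  # in BFS the first visit is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical toy link graph: home -> a -> b -> c
links = {"home": ["a"], "a": ["b"], "b": ["c"]}
reached = bounded_bfs(links, ["home"], max_hops=2)
print(reached)
```

With a limit of 2 hops, page `c` (3 hops from the home page) is never queued.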

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2018-30/segment.paths.gz | 100 |
WARC files | CC-MAIN-2018-30/warc.paths.gz | 64000 | 61.07
WAT files | CC-MAIN-2018-30/wat.paths.gz | 64000 | 20.93
WET files | CC-MAIN-2018-30/wet.paths.gz | 64000 | 9.08
Robots.txt files | CC-MAIN-2018-30/robotstxt.paths.gz | 64000 | 0.21
Non-200 responses files | CC-MAIN-2018-30/non200responses.paths.gz | 64000 | 2.15
URL index files | CC-MAIN-2018-30/cc-index.paths.gz | 302 | 0.25
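Turning a downloaded path listing into full locations could look like the sketch below. It assumes the listing holds one relative path per line; the function name is hypothetical, and the sample path is made up so that the example runs without network access.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def expand_paths(gz_bytes, prefix):
    """Decompress a *.paths.gz listing and prepend prefix to every non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]

# Stand-in for a downloaded listing; this path is invented for illustration.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2018-30/segments/example/warc/example.warc.gz\n"
)
print(expand_paths(sample, HTTP_PREFIX)[0])
print(expand_paths(sample, S3_PREFIX)[0])
```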

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-30/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

June 2018 Crawl Archive Now Available

The crawl archive for June 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-26/. It contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th.

The June crawl contains 700 million new URLs that were not contained in any previous crawl archive. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Feb/Mar/Apr 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 25 million hosts or top 25 million domains of the webgraph dataset
  • a random sample taken from WAT files of the May crawl

The remaining URLs (more than 2 billion) were already included in one of the previous monthly crawl archives and were stored in our URL database for a later re-fetch, unless marked as duplicates, classified as spam, etc. This huge “collection of bookmarks” dates back multiple years, as far as 2012, when we first received seed donations from Blekko. This month we started to remove old “bookmarks” from our URL database. In the future we’ll remember a URL for only 12 months after it was last seen as a seed or outlink. We hope this makes the crawls more dynamic, and a smaller URL database will also save resources.
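The 12-month retention rule can be illustrated with a small sketch. Everything here is hypothetical: the in-memory dictionary stands in for the URL database, the URLs and dates are invented, and treating "12 months" as whole calendar months is an assumption about how the window is measured.

```python
from datetime import date

RETENTION_MONTHS = 12  # retention window named in the text

def months_between(earlier, later):
    """Whole calendar months elapsed from earlier to later."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return months

def prune(last_seen, today):
    """Keep only URLs last seen as a seed or outlink within the retention window."""
    return {url: seen for url, seen in last_seen.items()
            if months_between(seen, today) < RETENTION_MONTHS}

# Hypothetical records: URL -> date it was last seen
last_seen = {
    "http://example.com/old": date(2017, 5, 1),
    "http://example.com/recent": date(2018, 1, 10),
}
kept = prune(last_seen, today=date(2018, 6, 25))
print(sorted(kept))
```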

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2018-26/segment.paths.gz | 100 |
WARC files | CC-MAIN-2018-26/warc.paths.gz | 64000 | 56.58
WAT files | CC-MAIN-2018-26/wat.paths.gz | 64000 | 19.19
WET files | CC-MAIN-2018-26/wet.paths.gz | 64000 | 8.37
Robots.txt files | CC-MAIN-2018-26/robotstxt.paths.gz | 64000 | 0.2
Non-200 responses files | CC-MAIN-2018-26/non200responses.paths.gz | 64000 | 1.73
URL index files | CC-MAIN-2018-26/cc-index.paths.gz | 302 | 0.23

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-26/. The columnar index has also been updated to include this crawl.


May 2018 Crawl Archive Now Available

The crawl archive for May 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-22/. It contains 2.75 billion web pages and 215 TiB of uncompressed content, crawled between May 20th and 28th.

The May crawl contains 550 million new URLs that were not contained in any previous crawl archive. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the Feb/Mar/Apr 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 40 million hosts or top 40 million domains of the webgraph dataset
  • a random sample taken from WAT files of the April crawl

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2018-22/segment.paths.gz | 100 |
WARC files | CC-MAIN-2018-22/warc.paths.gz | 64000 | 51.61
WAT files | CC-MAIN-2018-22/wat.paths.gz | 64000 | 17.61
WET files | CC-MAIN-2018-22/wet.paths.gz | 64000 | 7.59
Robots.txt files | CC-MAIN-2018-22/robotstxt.paths.gz | 64000 | 0.19
Non-200 responses files | CC-MAIN-2018-22/non200responses.paths.gz | 64000 | 1.65
URL index files | CC-MAIN-2018-22/cc-index.paths.gz | 302 | 0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-22/. The columnar index has also been updated to include this crawl.


Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs).

What’s new?

The graphs now contain links from sitemap announcements in robots.txt files. This small addition of 2.5 million inter-host links is motivated by the fact that sitemap directives are sometimes (see examples 1, 2, 3) used for link spam or aggressive SEO, often in combination with excessive use of inter-host hyperlinks on HTML pages. We hope that this addition helps link spam detection algorithms achieve a better detection rate.
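How a sitemap announcement turns into an inter-host link can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual extraction code: the function name and sample robots.txt are hypothetical, and a link is counted only when the sitemap URL's host differs from the host serving the robots.txt.

```python
from urllib.parse import urlsplit

def sitemap_links(robots_host, robots_txt):
    """Return (from_host, to_host) pairs for sitemap directives naming other hosts."""
    links = set()
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() != "sitemap":
            continue  # skip User-agent, Disallow and other directives
        target = urlsplit(value.strip()).hostname
        if target and target != robots_host:
            links.add((robots_host, target))
    return sorted(links)

# Hypothetical robots.txt served by example.com
robots = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://other.example.net/sitemap.xml
"""
found = sitemap_links("example.com", robots)
print(found)
```

Only the second directive yields a link, because the first points back to the same host.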

Host-level graph

The graph consists of 2.14 billion nodes and 10.15 billion edges and includes dangling nodes, i.e., hosts that have not been crawled, yet are pointed to from a link on a crawled page. There are 2.02 billion dangling nodes (94%), and the largest strongly connected component contains only 77 million nodes (3.6%). The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
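The host name transformation described above can be reproduced in a few lines. The function names are hypothetical; the sketch only covers the two steps named in the text (strip one leading www., reverse the labels).

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

def unreverse_host(rev_host):
    """Restore the usual label order (a stripped 'www.' cannot be recovered)."""
    return ".".join(reversed(rev_host.split(".")))

print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
print(unreverse_host("com.example.subdomain"))    # subdomain.example.com
```

Reversed names sort by top-level domain first, so hosts of the same domain end up next to each other in a sorted vertex list.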

You can download the graph and the ranks of all 2.14 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/host/ as a prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Feb/Mar/Apr 2018 host-level webgraph

Size | File | Description
12.45 GB | cc-main-2018-feb-mar-apr-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 80 vertices files
50.22 GB | cc-main-2018-feb-mar-apr-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 160 edges files
20.68 GB | cc-main-2018-feb-mar-apr-host.graph | graph in BVGraph format
2 kB | cc-main-2018-feb-mar-apr-host.properties |
24.82 GB | cc-main-2018-feb-mar-apr-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2018-feb-mar-apr-host-t.properties |
1 kB | cc-main-2018-feb-mar-apr-host.stats | WebGraph statistics
28.84 GB | cc-main-2018-feb-mar-apr-host-ranks.txt.gz | harmonic centrality and pagerank
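The vertices files pair numeric ids with reversed host names, and the edges files pair ids with each other. A minimal sketch of joining the two follows, assuming one record per line with tab-separated fields (the separator is an assumption; the table only gives the field order). The function names and sample rows are made up for illustration.

```python
def parse_vertices(lines):
    """Map node id -> reversed host name from tab-separated (id, rev host) lines."""
    vertices = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        vertices[int(node_id)] = rev_host
    return vertices

def parse_edges(lines):
    """Yield (from_id, to_id) integer pairs from tab-separated lines."""
    for line in lines:
        from_id, to_id = line.split()
        yield int(from_id), int(to_id)

# Invented sample rows, not taken from the real files.
vertices = parse_vertices(["0\tcom.example\n", "1\torg.example\n"])
edges = list(parse_edges(["0\t1\n"]))
named = [(vertices[a], vertices[b]) for a, b in edges]
print(named)
```

For the full graph, the id-to-name map has billions of entries, so streaming the sorted files or using the BVGraph files is likely more practical than an in-memory dictionary.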

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.

The domain-level graph has 98 million nodes and 1.5 billion edges. 56 million nodes (57%) are dangling nodes, and the largest strongly connected component covers 37 million nodes (38%).
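The aggregation from hosts to pay-level domains can be sketched as below. This is a simplified illustration: `TOY_SUFFIXES` is a hypothetical four-entry stand-in for the public suffix list, wildcard and exception rules are ignored, and dropping links between hosts of the same domain is an assumption rather than a documented step.

```python
# Toy public suffixes in reversed notation. The real list on publicsuffix.org is
# far larger and has wildcard and exception rules that this sketch ignores.
TOY_SUFFIXES = {"com", "org", "uk", "uk.co"}

def pay_level_domain(rev_host):
    """Return the reversed pay-level domain: longest known suffix plus one label."""
    labels = rev_host.split(".")
    suffix_len = 0
    for n in range(1, len(labels) + 1):
        if ".".join(labels[:n]) in TOY_SUFFIXES:
            suffix_len = n
    if suffix_len == 0 or suffix_len == len(labels):
        return None  # unknown suffix, or the host is itself a public suffix
    return ".".join(labels[:suffix_len + 1])

def aggregate(host_edges):
    """Collapse host-level edges into distinct domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        a, b = pay_level_domain(src), pay_level_domain(dst)
        if a and b and a != b:  # assumption: intra-domain links are dropped
            domain_edges.add((a, b))
    return sorted(domain_edges)

host_edges = [
    ("com.example.blog", "uk.co.example.www"),
    ("com.example.shop", "uk.co.example"),
    ("com.example.blog", "com.example.shop"),
]
domain_edges = aggregate(host_edges)
print(domain_edges)
```

The three host-level links collapse into a single domain-level edge, which is why the domain graph is so much smaller than the host graph.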

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2018-feb-mar-apr/domain/.

Download files of the Common Crawl Feb/Mar/Apr 2018 domain-level webgraph

Size | File | Description
0.68 GB | cc-main-2018-feb-mar-apr-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.04 GB | cc-main-2018-feb-mar-apr-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.32 GB | cc-main-2018-feb-mar-apr-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-feb-mar-apr-domain.properties |
3.53 GB | cc-main-2018-feb-mar-apr-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-feb-mar-apr-domain-t.properties |
1 kB | cc-main-2018-feb-mar-apr-domain.stats | WebGraph statistics
2.06 GB | cc-main-2018-feb-mar-apr-domain-ranks.txt.gz | harmonic centrality and pagerank
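The ranks file holds harmonic centrality and PageRank values. As a reminder of what the first measure means, here is a toy sketch of the textbook definition: the harmonic centrality of a node is the sum of 1/d(y, node) over all other nodes y that can reach it. The published values were computed with the WebGraph framework on the full graph; the function below and its four-node graph are illustrative only.

```python
from collections import deque

def harmonic_centrality(edges, node):
    """Sum of 1/d(y, node) over all nodes y that can reach node."""
    inlinks = {}
    for src, dst in edges:
        inlinks.setdefault(dst, []).append(src)  # transpose: follow links backwards
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for pred in inlinks.get(current, ()):
            if pred not in dist:  # first visit gives the shortest distance
                dist[pred] = dist[current] + 1
                queue.append(pred)
    return sum(1.0 / d for d in dist.values() if d > 0)

# Toy graph: a -> b -> c and d -> c
edges = [("a", "b"), ("b", "c"), ("d", "c")]
print(harmonic_centrality(edges, "c"))  # 1/1 + 1/1 + 1/2 = 2.5
```

The backward search is the reason a transposed graph (outlinks inverted to inlinks) is useful for this computation.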

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 98 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (Feb/Mar/Apr 2018)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
12716049010.016926com.googleapis
22691740020.013683com.facebook
32502320830.009981com.google
42424149240.008519com.twitter
52392399050.007249com.youtube
62305594460.006339org.w
72120813270.004853org.gmpg
82112645280.003674com.instagram
920871022100.003315com.linkedin
1020308692140.002286com.wordpress
1120254236120.002787org.wordpress
1220083558230.001537com.gravatar
1320073594260.001290org.wikipedia
1419978538180.001671com.pinterest
1519658798110.002973com.bootstrapcdn
1619570742200.001624com.apple
1719538492310.000948com.blogspot
1819437996250.001294com.vimeo
1919203616350.000867gl.goo
2019193288370.000853com.amazon
2119158752170.001701com.adobe
2219158570290.001124com.microsoft
2319126248450.000704com.tumblr
2419082854150.001792com.macromedia
2519011582360.000856com.wp
2619005558410.000794ly.bit
2718975638540.000499com.yahoo
2818971200160.001786com.googletagmanager
2918852580440.000711be.youtu
3018806694330.000921com.amazonaws
3118791732320.000922com.paypal
3218760344190.001634com.cloudflare
3318706230420.000775com.flickr
3418680604610.000433org.mozilla
3518680160300.001017com.github
3618678006960.000297com.googleusercontent
3718625646690.000409com.weebly
3818604632490.000594org.w3
3918547840660.000424org.creativecommons
4018536644830.000354com.soundcloud
4118495276240.001407net.doubleclick
42184658441490.000207com.blogger
43184016201470.000212com.imgur
4418400744400.000809me.wp
4518397664280.001163com.gstatic
4618392744550.000498com.list-manage
47183914161740.000162com.myspace
4818390430670.000417com.medium
49183732701540.000194net.slideshare
5018360766870.000337io.github
51183488881770.000161com.wsj
5218332910510.000589co.t
53182562782280.000115com.reuters
5418245054390.000818org.apache
5518231016380.000837com.statcounter
56182209402360.000110uk.co.telegraph
5718214734680.000415eu.europa
58182073822320.000111org.npr
59182000742920.000087com.appspot
6018195584530.000550com.jquery
61181857381660.000178com.android
62181842722660.000096com.cnbc
63181836661460.000213com.issuu
6418181476780.000387com.cnn
65181701362610.000099com.about
66181617441330.000240com.nytimes
67181608521370.000231com.yelp
68181527922530.000102me.about
69181328881700.000172com.spotify
70181113361900.000146uk.co.bbc
71181047481510.000201com.wixsite
72180993941710.000168com.tripadvisor
73180989821780.000160org.gnu
74180862402040.000134org.wikimedia
75180859183560.000072edu.washington
7618084938430.000730net.cloudfront
77180808722010.000135com.oracle
78180754842820.000088org.python
79180658584270.000063org.chromium
80180632681650.000182org.ietf
8118058042720.000391com.huffingtonpost
82180507304030.000067edu.ucla
83180364824490.000059edu.princeton
84180287203680.000070com.slate
85180231161200.000254com.reddit
86180179442410.000107com.mozilla
87180139523050.000085com.mysql
8818004788630.000428com.ytimg
8917983326600.000457com.bing
90179795521390.000220com.dropbox
91179706584140.000065com.pixabay
92179692402130.000122com.nbcnews
93179480721640.000182com.forbes
94179477223320.000077gov.loc
95179435223800.000069com.googlecode
96179388521670.000176org.archive
97179365043110.000083com.foursquare
98179351221600.000186net.sourceforge
99179229562710.000093com.go
100179149925110.000053edu.gatech
101179147281890.000148com.theguardian
10217911318570.000476org.schema
103179087002070.000129es.google
104179044662680.000094com.example
105178939342340.000110com.hubspot
10617893100340.000899com.squarespace
10717882030710.000392com.paypalobjects
10817876036910.000322com.mashable
109178717923900.000068com.steampowered
110178713383290.000079gov.fda
11117866752810.000369com.wix
11217863474560.000486net.fbcdn
113178619883840.000069com.tinypic
114178605284310.000062com.variety
115178594743820.000069org.nodejs
116178563782030.000135edu.stanford
117178485522210.000119com.dribbble
1181784499690.003327com.godaddy
119178356382100.000126com.tinyurl
12017833096620.000433com.fb
121178275182520.000102com.businessinsider
122178201425850.000048edu.utah
123178168305580.000050edu.illinois
124178149982110.000125com.imdb
125178079362730.000093com.live
126177988282290.000112com.washingtonpost
127177956962860.000088au.com.google
128177948846650.000046com.chrome
129177940143100.000083edu.mit
130177931622150.000121com.typepad
131177928282460.000104com.techcrunch
132177926123910.000068com.sun
133177881464590.000059org.sciencemag
134177862884630.000058org.eclipse
135177855684360.000061com.withgoogle
136177634881040.000272com.addthis
137177626481440.000217com.jimdo
13817751854890.000332net.akamaihd
139177507665050.000053com.nike
140177466982810.000088com.bloomberg
141177465224540.000059org.ampproject
142177445963060.000084edu.harvard
143177426186770.000045com.hbo
144177404483080.000083com.cnet
145177399284660.000058co.g
146177334105100.000053com.chron
147177327065440.000051com.jetbrains
148177310545310.000052edu.tamu
149177262361720.000167com.etsy
150177241624790.000055com.sap
151177234225910.000047com.wikidot
152177230744450.000060com.libsyn
15317708952270.001246ru.yandex
154177008023070.000084com.wired
155176993086780.000045uk.org.tate
156176961863230.000080com.aol
157176954801560.000192com.twimg
158176921402780.000090com.usatoday
159176917563810.000069uk.co.dailymail
160176897805370.000051com.cc
161176893002740.000092io.codepen
162176886143600.000071com.cdbaby
163176883605920.000047org.virtualbox
164176869665140.000053edu.berkeley
165176843665990.000047com.googleblog
166176766826940.000044com.discogs
167176739383270.000079com.time
168176696421100.000259com.shopify
169176683962910.000087com.images-amazon
170176682981830.000155gov.nih
171176662181360.000237com.mailchimp
172176560163860.000069com.bbc
173176544506740.000045com.quora
174176497144340.000062com.marketingland
175176425602970.000086com.ibm
176176381464220.000064gov.nasa
177176370043180.000081com.photobucket
178176368121810.000159com.eventbrite
179176365085420.000051com.theverge
180176359565000.000054edu.cornell
181176354164180.000064com.git-scm
182176341544170.000065com.livejournal
183176335903280.000079com.msn
184176253405160.000052com.strikingly
185176241542550.000102com.mapquest
186176208767270.000042id.co.blogspot
187176184605740.000049com.yellowpages
188176175225510.000050org.rubyonrails
189176123424720.000057com.xrea
190176107123830.000069au.gov.nsw
191176090803630.000071com.latimes
192176080104750.000056org.kernel
193176036583760.000069com.gmail
19417597398990.000283edu.utexas
195175959222960.000086net.windows
196175951024650.000058me.paypal
197175908762430.000105com.stackoverflow
198175844744810.000055com.buzzfeed
199175780723340.000077com.meetup
200175735422120.000124com.ebay
201175680483030.000085com.staticflickr
202175680243580.000072com.npmjs
203175678927180.000042io.itch
204175671163700.000070com.ted
205175609265570.000050org.aarp
206175522742400.000108uk.co.amazon
20717551168590.000473com.vk
208175489222060.000130com.opera
209175309565350.000051net.codecanyon
210175293841760.000161com.feedburner
211175276045220.000052com.cbsnews
212175199486660.000046com.scribd
213175193365500.000050com.neilpatel
214175191203260.000079com.googlesyndication
215175187343970.000067com.springer
216175159582480.000103net.php
217175151964100.000066uk.co.blogspot
218175133465700.000049com.zdnet
219175132084370.000061com.angieslist
220175131583250.000079com.nypost
221175118765290.000052com.venturebeat
222175045464710.000057com.theatlantic
223175040044670.000058com.fortune
224175028781000.000282de.google
22517502814970.000292com.livestream
22617499932640.000427com.qq
227174962082630.000097com.surveymonkey
228174943486710.000046com.vice
229174922761610.000185uk.co.google
230174880662250.000117com.getclicky
231174865005770.000049net.daringfireball
232174853065860.000048org.maven
233174813802720.000093io.atom
2341748103210120.000034org.arxiv
235174780922420.000107com.digg
236174739824250.000063gov.whitehouse
237174731205250.000052com.box
238174720226990.000043com.newyorker
239174670943940.000068com.wiley
240174656661090.000264cc.co
241174652687390.000041com.arstechnica
242174628183730.000069org.mediawiki
243174618564420.000060com.kickstarter
244174614727890.000039com.jsbin
245174578485640.000050com.ft
24617454848940.000305me.fb
247174541105410.000051com.citysearch
248174523462670.000094com.hp
249174521462890.000087gov.ftc
250174517203140.000083gov.cdc
251174506921980.000137com.zendesk
252174504307160.000042com.unsplash
253174501322590.000101com.disqus
254174484461500.000204net.behance
255174456281620.000184com.salesforce
256174393803870.000069com.prnewswire
257174373005760.000049org.pbs
258174364964400.000060com.entrepreneur
259174317348370.000036edu.yale
260174292304910.000054com.inc
261174291865730.000049com.wikihow
262174261001010.000280com.baidu
263174236286810.000045com.dropboxusercontent
264174232347260.000042com.nationalgeographic
265174227967560.000040com.foxnews
266174219105620.000050com.wikia
267174197281950.000140jp.co.yahoo
268174195447940.000038com.naturalnews
269174184725660.000050com.deviantart
270174143962760.000091com.webs
271174138723480.000075fr.free
272174122103170.000081org.acm
273174028163960.000067ly.ow
2741739488414530.000026edu.purdue
275173936803450.000075gov.ca
276173906742180.000120com.stumbleupon
277173892549180.000035edu.psu
278173851444230.000063org.un
279173844005010.000054com.cisco
2801738343413490.000029edu.ucsd
281173817064740.000056com.giphy
282173805167360.000041com.economist
283173793742510.000103com.wufoo
2841737417412650.000032com.gizmodo
285173737523690.000070com.dailymotion
2861737286412910.000031gov.fbi
287173727805560.000050com.office
288173724827530.000041org.aclweb
289173723701730.000165com.constantcontact
290173718804530.000059com.businesswire
291173700584150.000065com.skype
292173697407380.000041org.amnesty
293173669266840.000045com.nature
294173665749320.000034edu.columbia
295173658585180.000052org.postgresql
296173646686830.000045org.tigris
2971736136815470.000023com.hotmail
298173565867460.000041com.storify
2991735200415870.000022com.vanityfair
300173513423210.000081it.placehold
301173485587400.000041com.yarnpkg
302173418863570.000072com.oreilly
303173382084940.000054com.snapchat
3041733731410320.000033edu.upenn
305173352048070.000037com.googledrive
3061733483012770.000032com.qz
3071733377410790.000033com.evernote
308173317583220.000080com.tripod
309173306347040.000043com.googlesource
310173296364260.000063int.who
311173263667100.000043com.intel
312173257867340.000041com.sublimetext
3131732487011350.000033com.shutterstock
314173220721820.000158com.weibo
315173189469440.000034org.ieee
316173185949910.000034gov.uspto
3171731635418260.000018com.fifa
318173150947900.000039ly.snip
319173146686930.000044com.bandsintown
320173140082300.000111com.bandcamp
321173135728920.000035com.statista
322173096565720.000049com.goodreads
323173095869110.000035com.bizcommunity
3241730909214850.000025uk.co.theregister
325173088427490.000041com.engadget
3261730618410720.000033org.eff
3271730333614060.000027com.trello
328173020601940.000141de.amazon
329173013804440.000060org.hbr
330173010809050.000035com.psychologytoday
3311730047614700.000025com.elpais
3321729975015760.000023com.upwork
333172953407250.000042com.samsung
334172917667190.000042org.gnupg
3351729156615320.000024edu.northwestern
336172914543240.000080com.smugmug
337172904848460.000036com.stackexchange
338172888644350.000061com.force
3391728276816020.000022org.khanacademy
3401728143012840.000031com.hootsuite
3411727868613960.000028com.ning
3421727855210010.000034com.thenextweb
343172780989550.000034com.weather
344172776249060.000035org.webmproject
345172772428870.000035com.timeanddate
3461727678215000.000025com.pexels
347172760507640.000040com.manta
348172755787970.000038com.mysanantonio
3491727380013840.000028co.vine
350172723267840.000039uk.co.guardian
3511726970416600.000021com.speakerdeck
352172689707740.000039com.uk
3531726515212590.000032org.iso
3541726329815600.000023com.billboard
355172604228740.000035com.marketwatch
356172587607650.000040com.sciencedaily
3571725860014420.000026com.thinkwithgoogle
358172581207690.000040com.digitaljournal
3591725560213790.000028in.blogspot
36017255046930.000313org.networkadvertising
3611725478415190.000024com.pcworld
3621725330815900.000022com.posterous
3631725315814320.000026com.pcmag
3641725169012810.000032com.mckinsey
365172497587440.000041com.blackberry
366172486269040.000035org.unesco
367172479166920.000044gov.noaa
3681724731613510.000029com.airbnb
3691724063213560.000029com.istockphoto
3701724012213500.000029org.altervista
371172395065530.000050com.githubusercontent
372172389362640.000096to.amzn
3731723855217190.000020org.owasp
374172378481080.000266com.bleacherreport
375172374842440.000105org.joomla
3761723679012750.000032com.netflix
377172351709170.000035edu.utep
378172349281400.000218com.googleadservices
379172345847930.000038com.merchantcircle
380172315621580.000190org.bbb
3811722750217820.000018uk.co.metro
382172270965040.000053com.bizjournals
3831722509012740.000032ca.blogspot
3841722454815560.000023com.merriam-webster
385172243283330.000077com.rawgit
3861722296616630.000021edu.usc
3871722260613370.000029uk.ac.ox
388172221207170.000042gov.sec
3891722179414360.000026de.spiegel
390172216887910.000039com.vagrantup
391172198508030.000038com.sfgate
3921721573612730.000032us.imageshack
3931721443210830.000033com.lifehacker
394172132623660.000070com.sxsw
3951721290018500.000017net.boingboing
396172119169850.000034com.americanexpress
397172114226960.000044com.moz
398172109364920.000054net.openid
3991720972412660.000032com.indiegogo
400172094302880.000088com.windowsphone
401172093663130.000083ca.google
402172088721050.000271com.people
4031720868216280.000021edu.jhu
404172062742190.000120org.drupal
4051720235218420.000018com.nfl
4061719960613600.000028gov.usgs
407171994482950.000086com.fc2
4081719894219140.000017ca.uwaterloo
409171973321310.000252jp.co.google
4101719690017890.000018com.socialmediaexaminer
4111719431016080.000022com.mcafee
4121719394619650.000016com.tutsplus
4131719209222560.000014com.twitpic
414171916064160.000065com.booking
415171905043300.000078com.bitly
416171894205320.000052com.w3schools
4171718929614570.000026com.boston
418171892685800.000048com.squareup
4191718830617280.000019com.technologyreview
4201718667214610.000026com.gumroad
4211718661012970.000031com.redhat
4221718649816770.000020com.hulu
4231718603211320.000033gov.nist
4241718547814690.000025com.discovery
4251718544413000.000031fr.blogspot
426171826648060.000037gov.nps
427171814146170.000047uk.co.independent
4281717867013060.000030com.politico
429171761065360.000051com.typeform
4301717396814180.000027com.zoho
4311717374217530.000019com.ehow
4321717309819850.000016com.cbs
4331717242015750.000023com.codeplex
434171721349610.000034ca.cbc
435171720748260.000037com.whitepages
4361717114614310.000026com.alibaba
437171703962230.000118org.icann
438171703144970.000054org.doi
439171700749310.000034net.researchgate
440171697009620.000034au.net.abc
4411716942816580.000021org.gnome
4421716872629280.000010com.hubpages
443171683101850.000153it.google
444171668363430.000075com.nielsen
445171661981970.000138com.histats
446171638348910.000035com.gofundme
4471716276815940.000022com.mtv
448171619408520.000036gov.copyright
449171617365300.000052com.sciencedirect
450171609946000.000047org.doxygen
451171599684690.000058us.icio
452171593587110.000042com.slack
4531715929416340.000021edu.academia
4541715888413550.000029com.pingdom
455171571348280.000036it.binged
456171571247350.000041com.java
4571715711210140.000034edu.alamo
4581715645216110.000022edu.unc
459171557284580.000059com.sitelock
460171555141990.000137com.xing
461171554004380.000061com.adweek
4621715532815380.000024gov.nyc
463171543607280.000042org.vim
4641715302617700.000019edu.cuny
4651715153016230.000021com.nba
466171512824950.000054mp.j
4671715127210620.000033tv.ustream
468171510149360.000034com.groupspaces
4691714948416310.000021com.udemy
470171477104890.000054edu.cmu
4711714575014740.000025com.over-blog
4721714386221100.000015com.mentalfloss
473171432387010.000043de.blogspot
4741714320213970.000028uk.ac.cam
4751714292619220.000017com.fiverr
476171415367480.000041com.webmd
477171403407470.000041com.questionpro
478171393849600.000034gov.fcc
4791713697213320.000030edu.wisc
4801713650013180.000030org.postimg
481171358209220.000035edu.umich
482171332821930.000141com.eepurl
4831713239013210.000030com.deloitte
484171323001030.000278me.m
4851713148418410.000018com.gamespot
486171287585790.000048org.whatbrowser
487171256167950.000038br.com.uol
488171235067410.000041org.jenkins-ci
489171234548250.000037gov.senate
4901712069015440.000023com.hollywoodreporter
4911711967819110.000017uk.ac.ucl
4921711877632970.000009com.blog
4931711523215340.000024net.recode
494171134928730.000035com.att
4951711307617520.000019com.angelfire
4961711281815160.000024com.techrepublic
497171115582310.000111fr.google
4981710829218460.000017com.ikea
4991710811415120.000024com.prezi
500171072167230.000042com.adage
5011710697413680.000028com.gigaom
5021710509618170.000018com.canva
5031710493815530.000023edu.uchicago
5041710452815580.000023com.econsultancy
5051710391012950.000031com.formstack
506171038024570.000059com.bigcartel
5071710332415490.000023com.scientificamerican
508171032988320.000036gov.census
5091710203810520.000033com.searchengineland
510171019303610.000071com.fastcompany
511171009826310.000046com.symantec
5121710097220410.000016ca.ualberta
513170991909710.000034com.pinimg
514170991627240.000042com.geocities
515170985929920.000034com.hotfrog
51617098588210.001560com.wixstatic
517170979024510.000059gov.ed
5181709781416680.000021net.daum
5191709517020400.000016ch.ethz
5201709416013140.000030org.redcross
521170941483920.000068com.naver
522170938485880.000048tv.twitch
5231709369618350.000018com.sky
524170936465130.000053cn.com.sina
5251709272010880.000033com.collegian
526170925084090.000066net.themeforest
5271709206818760.000017tv.periscope
5281708956023480.000013com.flipboard
5291708937019590.000016com.ign
530170886982690.000094com.myshopify
531170846983090.000083com.whatsapp
5321708382017920.000018com.starbucks
5331708180217960.000018com.aliexpress
5341708094018940.000017com.ibtimes
5351708076015210.000024com.target
5361707989211850.000033com.fotolia
537170798367310.000041gov.hhs
5381707797218620.000017edu.msu
5391707756213280.000030com.animoto
540170770969520.000034jp.ac.kobe-u
541170752089290.000034com.ubuntu
5421707475026930.000011com.klout
5431707422419490.000016it.scoop
544170736103790.000069nl.google
5451707299214400.000026com.com
5461707178418060.000018com.mac
5471707064016890.000020edu.umd
5481706858623090.000013org.gimp
549170681165120.000053com.msdn
5501706727614600.000026com.nydailynews
551170663863400.000076edu.nyu
5521706542819610.000016org.aclu
5531706514015990.000022com.intuit
554170634388140.000037com.indiatimes
5551706278814990.000025com.nymag
556170624029240.000035com.lighthouseapp
557170617768780.000035com.insiderpages
558170613466750.000045com.delicious
559170611206730.000046com.cargocollective
5601706103214430.000026fm.last
5611706083415660.000023gov.uscourts
5621706078812610.000032org.worldbank
5631705901819400.000016com.getsatisfaction
5641705898022480.000014edu.gmu
5651705777012820.000032org.change
5661705743613100.000030gov.sba
567170565188850.000035com.cbslocal
568170561165630.000050gov.usda
5691705586821220.000015org.d3js
5701705583013570.000029com.500px
5711705574615170.000024com.businessweek
572170545684410.000060com.clicky
5731705326616560.000021com.vox
5741705284811250.000033net.digitalcongo
5751705227615480.000023de.heise
576170520305550.000050in.co.google
5771705189818890.000017edu.arizona
57817050912920.000320com.atdmt
5791705052021180.000015edu.asu
5801704809810330.000033com.citysquares
581170476465830.000048gov.epa
582170474329450.000034com.feedly
5831704629414560.000026edu.si
5841704602420910.000015uk.co.wired
5851704601214750.000025com.globo
5861704596016690.000021fr.lemonde
5871704581214540.000026edu.umn
588170442501840.000154com.youtube-nocookie
589170434287770.000039com.usnews
5901704334230290.000010gd.is
5911704113418250.000018com.autodesk
5921704054819780.000016com.exacttarget
5931703948410160.000034com.alexa
594170390389760.000034com.wsoctv
5951703694620850.000015com.yolasite
596170363161550.000193com.google-analytics
5971703606216590.000021it.blogspot
598170346401630.000183com.ggpht
5991703400010510.000033org.plos
600170323947570.000040es.com.blogspot
6011703233015520.000023org.cancer
602170312248610.000036com.tiddlywiki
6031702951418330.000018au.com.blogspot
604170286884300.000062gov.irs
6051702690017840.000018edu.virginia
606170264962790.000089com.getbootstrap
607170262182850.000088jp.co.amazon
608170251601570.000191ru.mail
6091702493221470.000015com.bestbuy
610170248902200.000120jp.ameblo
6111702460013620.000028com.ycombinator
61217024552740.000390com.messenger
6131702320616930.000020com.zazzle
614170231364470.000060com.barnesandnoble
6151702178411280.000033com.reverbnation
6161702101811140.000033site.tenerifeforum
617170208568450.000036org.bouncycastle
618170206329980.000034com.chamberofcommerce
6191702056620100.000016com.marketingprofs
6201702045620610.000015com.invisionapp
6211701780814160.000027com.searchenginejournal
6221701778215180.000024org.apa
623170168247140.000042fr.amazon
6241701523415050.000024com.kissmetrics
625170151624010.000067com.nasdaq
626170147649990.000034com.2findlocal
6271701449410360.000033net.brownbook
6281701355418600.000017com.msnbc
6291701342813830.000028com.bufferapp
630170130121870.000151com.amazon-adsystem
6311701213227740.000011com.popsci
6321701143816010.000022kr.flic
633170113542080.000128jp.ne.hatena
634170111887020.000043com.herokuapp
635170109024290.000062com.custhelp
6361701079824120.000013com.starwars
6371701063417730.000019com.knowyourmeme
6381701056826340.000011org.kiva
6391701031014350.000026com.cafepress
6401700955614820.000025com.searchenginewatch
6411700888416200.000021com.newsweek
6421700874615330.000024com.uber
643170086944520.000059com.stripe
6441700860022060.000014com.freep
6451700729017320.000019com.salon
646170066968390.000036com.newsbank
6471700668022140.000014com.wikispaces
6481700638016850.000020com.splashthat
649170054648110.000037gov.state
6501700314417990.000018google.blog
6511700200012680.000032com.patreon
6521700141233470.000009com.space
6531700101018360.000018edu.rutgers
6541700093021610.000015net.comcast
6551700091216570.000021com.today
6561699883215590.000023org.coursera
657169979988440.000036com.tandfonline
6581699735817340.000019org.filezilla-project
6591699711815740.000023int.coe
660169968727670.000040com.photoshelter
661169967869280.000035jp.co.fujixerox
6621699614610270.000034com.uservoice
6631699578220840.000015uk.ac.ed
6641699512813270.000030com.walmart
6651699498218230.000018com.semrush
6661699457639290.000007com.wolframalpha
6671699368618080.000018com.blogtalkradio
6681699364816800.000020au.com.smh
669169926389780.000034com.independent
6701699210210670.000033com.lacartes
6711699039621530.000015edu.bu
672169900967030.000043com.emarketer
673169900589870.000034uk.co.eventbrite
6741699004413800.000028org.iana
675169896926720.000046com.houzz
676169890788930.000035com.chambermaster
6771698903223030.000014com.xerox
6781698897622260.000014com.instructables
6791698840617210.000020uk.co.mirror
6801698766821630.000015nl.blogspot
6811698722420930.000015nl.xs4all
6821698693014040.000027com.investopedia
6831698666821620.000015de.bild
684169865047090.000043org.sonatype
6851698582421050.000015com.newscientist
686169855329810.000034com.strawberryperl
6871698510013300.000030gov.dot
6881698352815640.000023com.mediafire
6891698296616210.000021org.weforum
6901698296417670.000019com.thedailybeast
691169826461860.000152com.sharethis
6921698147830180.000010com.tvguide
6931698134621920.000014com.foxsports
6941698032418450.000018ru.narod
6951697884426210.000012edu.caltech
6961697842817310.000019org.cambridge
6971697834415020.000025com.mixcloud
6981697789615360.000024com.smashingmagazine
6991697750628190.000011org.greenpeace
7001697681816520.000021com.timeout
7011697652822820.000014com.homedepot
7021697596214810.000025com.ssrn
7031697578218950.000017uk.ac.lse
7041697546229230.000010org.icrc
7051697358027920.000011pt.sapo
7061697308817180.000020br.com.blogspot
7071697299823230.000013com.googlepages
7081697274815400.000024com.playstation
7091697219614280.000026com.elsevier
7101697218217330.000019mil.navy
7111697080621330.000015com.britannica
7121697032810250.000034org.gwtproject
7131696945421600.000015com.gawker
7141696851621570.000015au.com.news
7151696843013920.000028com.dell
7161696808429140.000010com.fivethirtyeight
7171696794813160.000030net.java
7181696754219360.000017com.vogue
719169668468360.000036com.gartner
7201696588618050.000018com.deezer
721169657063620.000071com.heroku
7221696562610310.000033org.twinery
7231696474012880.000031net.azurewebsites
724169643783780.000069com.ea
7251696424021410.000015com.seattletimes
7261696360027930.000011org.angularjs
727169606949210.000035com.prweb
7281696001418580.000017edu.ucdavis
7291695991820260.000016edu.uci
7301695860613910.000028com.bostonglobe
7311695843213950.000028com.hostgator
7321695829015810.000023com.nokia
7331695771819500.000016edu.ucsf
7341695712616250.000021com.bmj
7351695647422300.000014com.nbc
7361695622021380.000015ca.ubc
737169558408000.000038st.assi
7381695582020900.000015uk.co.thesun
7391695452825410.000012edu.brown
7401695447417880.000018com.zillow
7411695447410490.000033org.swi-prolog
7421695352017010.000020edu.duke
7431695351817830.000018com.csmonitor
7441695345028900.000010com.voanews
7451695205010380.000033com.zwire
7461695203417020.000020com.yandex
7471695181216350.000021com.rollingstone
7481695127221780.000014com.webnode
7491695121231190.000009edu.uic
7501694957416830.000020com.examiner
751169487944990.000054com.marriott
7521694725814330.000026com.digiday
7531694682632470.000009com.lmgtfy
7541694630029150.000010com.campaignmonitor
7551694599216170.000022com.thomsonreuters
7561694580415670.000023com.vmware
7571694572810180.000034com.themonitor
7581694533419420.000016edu.indiana
7591694486412690.000032net.yahoo
7601694479612960.000031org.openstreetmap
7611694469610430.000033com.sagepub
7621694354814620.000026jp.blogspot
763169434082940.000087com.wpengine
7641694272023810.000013com.rt
7651694236018020.000018com.pastebin
7661694211628460.000011int.esa
7671693995821260.000015net.oauth
7681693980416480.000021com.css-tricks
7691693937621540.000015org.acs
7701693920220640.000015com.aljazeera
771169388128170.000037io.getmdl
7721693816421930.000014com.redbubble
773169359289190.000035com.sacurrent
774169357543770.000069com.monster
7751693541824480.000013edu.unl
7761693530214580.000026gov.congress
7771693529817800.000018org.ap
7781693513617870.000018org.golang
7791693511231940.000009ru.blogspot
7801693460817270.000019net.battle
7811693374812800.000032com.marketwired
7821693299616240.000021com.hyatt
7831693277418340.000018int.wipo
784169325881690.000173info.aboutads
7851693196410550.000033com.hatenadiary
7861693144620270.000016com.irishtimes
7871693074214450.000026com.mlb
7881693055810260.000034com.showmelocal
7891692995820690.000015com.getfirebug
7901692973222200.000014com.crunchbase
7911692950825920.000012ms.1drv
7921692935417160.000020ru.spb
793169288608290.000036com.engineyard
7941692765213880.000028com.justgiving
7951692612621440.000015org.hrw
7961692580628620.000010com.threadless
797169256728620.000036com.qualaroo
7981692514627880.000011com.lynda
799169247589070.000035com.proofpoint
8001692470816000.000022com.xbox
8011692455215080.000024com.steamcommunity
8021692438423340.000013com.theonion
8031692416026000.000012edu.hawaii
804169233748150.000037gov.justice
805169231705670.000050com.bigcommerce
8061692288214250.000026com.docker
8071692089619860.000016com.si
8081692064820170.000016com.allthingsd
8091692033825270.000012com.madmimi
8101691971812890.000031com.atlassian
8111691968410350.000033com.bitballoon
8121691908015410.000024com.theglobeandmail
8131691811425510.000012org.phys
8141691759019180.000017edu.umass
815169173069820.000034com.judysbook
8161691688218710.000017com.buffer
817169157043670.000070jp.co.rakuten
8181691558610590.000033com.spoke
8191691536031790.000009uk.bl
82016915232820.000360com.parallels
8211691509424670.000013com.fox
822169145703470.000075com.youku
8231691393421850.000014com.topsy
8241691341416740.000021com.fedex
825169129304390.000061com.taobao
8261691162433620.000009com.friendfeed
8271691136210570.000033com.ibegin
8281690838226840.000011org.semanticscholar
8291690811216880.000020com.ecwid
830169053442800.000089com.automattic
831169051643650.000070me.t
832169040144460.000060com.cracked
833169028807520.000041ru.google
8341690256824640.000013com.groupon
8351690223021370.000015org.videolan
8361690115016360.000021org.unicode
8371690001616530.000021com.gettyimages
838168997909340.000034com.gfmag
8391689937621560.000015com.lonelyplanet
8401689895824550.000013edu.hbs
8411689856416490.000021com.unity3d
8421689840010600.000033dk.brics
8431689833010870.000033com.salespider
8441689745823410.000013hk.com.google
8451689724013230.000030org.oecd
8461689620421020.000015com.livestrong
8471689590017860.000018com.freewebs
848168950462830.000088com.garmin
8491689480615460.000023gov.wa
8501689429021160.000015com.netvibes
8511689211017080.000020com.lulu
8521689092835790.000008com.aviary
8531689040213460.000029es.amazon
8541689009226560.000011org.wikiquote
8551688995220110.000016com.getresponse
8561688940020230.000016com.espn
8571688911614790.000025com.xkcd
8581688870835470.000008org.edx
8591688859614370.000026gov.va
860168885668600.000036com.infusionsoft
8611688841820740.000015com.history
8621688837422630.000014me.flavors
8631688774828060.000011org.moma
8641688765272330.000004com.weheartit
8651688759619870.000016com.popsugar
8661688586415920.000022org.mayoclinic
867168857849640.000034com.chicagotribune
868168857821790.000160com.fontawesome
8691688558415300.000024com.oup
8701688502620440.000016com.w3techs
8711688432014490.000026gov.bls
8721688243835340.000008com.squidoo
8731688151233310.000009edu.rochester
8741688045424320.000013gov.cia
8751688036621010.000015com.mercurynews
8761688025816660.000021se.haxx
8771688023819380.000017org.jstor
8781688011216700.000021com.adjust
8791688002223600.000013com.hindustantimes
8801687989410420.000033com.enterprisenetworkingplanet
8811687789438850.000008com.4shared
8821687658217450.000019com.ssllabs
8831687617217580.000019com.freepik
8841687600817570.000019de.welt
8851687539426730.000011com.panasonic
8861687529414650.000025com.forrester
8871687498022650.000014com.readwriteweb
8881687445029680.000010com.softonic
8891687380630720.000010org.psychologicalscience
8901687297238090.000008org.libreoffice
8911687121024020.000013net.faz
8921687047420080.000016com.urbandictionary
893168704489680.000034com.marketersmedia
8941687020412920.000031com.brightcove
8951687019814390.000026com.techtarget
8961686967810110.000034com.shareasale
897168695828190.000037gov.house
898168688581110.000259com.namecheap
8991686859634450.000008edu.rit
9001686849613290.000030gov.archives
9011686833211310.000033place.yellow
902168681264320.000062pl.google
9031686789624830.000012com.shutterfly
9041686782216030.000022mil.army
9051686779032440.000009com.avast
9061686694023760.000013com.iconfinder
9071686692856560.000005ro.blogspot
9081686654416260.000021com.warnerbros
9091686651610760.000033org.rethinkingschools
9101686625230770.000010cc.tiny
9111686514625940.000012com.html5rocks
9121686497832430.000009com.askmen
9131686469819520.000016com.nvidia
9141686335222730.000014com.cbssports
9151686262622750.000014edu.osu
9161686171632690.000009edu.missouri
9171686141428570.000010edu.oregonstate
9181686049619790.000016com.colourlovers
9191686046010730.000033com.elocal
9201686033222090.000014com.ask
921168602107870.000039org.bitbucket
9221686001620420.000016com.me
9231685993025560.000012com.technet
9241685951428910.000010com.ezinearticles
9251685897229830.000010com.scmp
926168588044730.000057br.com.google
9271685830031290.000009com.fitbit
9281685788421120.000015com.comcast
929168578844560.000059com.udacity
9301685764227090.000011edu.pitt
9311685707017560.000019com.findlaw
9321685540220630.000015org.openoffice
9331685527244820.000007net.minecraft
9341685507420120.000016org.kde
935168548764960.000054jp.ne.sakura
9361685473626550.000011edu.tufts
9371685437238230.000008com.techsmith
9381685405418130.000018com.screencast
9391685373422740.000014net.earthlink
9401685373217600.000019uk.co.thetimes
941168533529160.000035com.campaign-archive1
9421685326230310.000010edu.ucsc
9431685284215860.000022com.outlook
9441685264613440.000029com.usps
9451685248417040.000020gov.uscis
9461685224026390.000011com.virgin
9471685192026850.000011ly.cl
9481685190427630.000011com.asus
9491685136021640.000014fr.lefigaro
9501685085820030.000016org.poynter
9511685044639620.000007edu.byu
9521685029828220.000011com.rottentomatoes
9531685012623720.000013uk.co.express
9541684940619290.000017org.craigslist
9551684921818160.000018com.smartinsights
956168490682260.000117me.line
9571684895815770.000023com.yoast
9581684791221950.000014com.podbean
9591684712227340.000011com.tesla
9601684677231770.000009com.9to5mac
9611684562236790.000008org.wikibooks
9621684557615310.000024com.business2community
9631684455219340.000017tech.ces
9641684403427850.000011ca.huffingtonpost
9651684378825780.000012com.sophos
9661684353210630.000033com.fixr
9671684350229720.000010edu.dartmouth
9681684338037330.000008org.laptop
9691684230422760.000014com.denverpost
9701684220427730.000011com.business
9711684217414410.000026jp.geocities
9721684203411410.000033com.microbiologybytes
9731684163621760.000014ca.sfu
9741684129018530.000017com.deadline
9751684037219990.000016com.jamanetwork
9761684013420150.000016jp.ne.biglobe
9771683988215140.000024org.fao
978168398767330.000041com.sprinklr
9791683893215200.000024jp.ne.goo
9801683869642880.000007com.panoramio
9811683854430780.000010com.freelancer
9821683851643860.000007com.ladygaga
9831683791832350.000009be.blogspot
984168370207500.000041com.aweber
9851683686474390.000004com.xanga
9861683653429520.000010se.blogspot
9871683626613240.000030com.linksynergy
9881683603417590.000019com.ew
9891683492435200.000008com.diigo
9901683365424180.000013com.asahi
991168336429700.000034de.bund
9921683275823060.000014com.teenvogue
9931683174435620.000008edu.fsu
994168314849530.000034net.viainfo
9951683112410660.000033com.superiorthreads
9961683049811060.000033lt.yn
9971682987213350.000029it.amazon
998168294348210.000037ca.amazon
9991682929218480.000017com.bigthink
1000168286403350.000077org.debian

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

April 2018 Crawl Archive Now Available

The crawl archive for April 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-17/. It contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th.

The April crawl contains 625 million new URLs, not contained in any crawl archive before. New URLs are “mined” by

  • extracting and sampling URLs from
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 40 million hosts or top 40 million domains of the webgraph dataset
  • a random sample taken from WAT files of the March crawl

We took action to reduce the number of images unintentionally crawled. Although our crawler focuses on fetching HTML pages, there has always been a small share (1-2%) of other document formats. We accept these: they are part of the web, and these WARC records are useful to gain insights, e.g. to test PDF or Office document parsers at scale. However, because image links contained in sitemaps hadn’t been properly filtered out, the share of images had grown in recent months and reached 2% in March 2018. As a result of filtering image links from sitemaps, the share of images has now dropped to approx. 0.5%, cf. the MIME type statistics of the latest three monthly crawls.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2018-17/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2018-17/warc.paths.gz | 64320 | 54.24 |
| WAT files | CC-MAIN-2018-17/wat.paths.gz | 64320 | 19.22 |
| WET files | CC-MAIN-2018-17/wet.paths.gz | 64320 | 8.4 |
| Robots.txt files | CC-MAIN-2018-17/robotstxt.paths.gz | 64320 | 0.2 |
| Non-200 responses files | CC-MAIN-2018-17/non200responses.paths.gz | 64320 | 1.58 |
| URL index files | CC-MAIN-2018-17/cc-index.paths.gz | 302 | 0.23 |
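
The prefixing step described above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the helper name `full_paths` is hypothetical, and the two relative paths in the sample listing are invented stand-ins for the contents of a real `*.paths.gz` file.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing_gz: bytes, prefix: str) -> list[str]:
    """Decompress a *.paths.gz listing and prepend `prefix` to every non-empty line."""
    with gzip.open(io.BytesIO(listing_gz), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Stand-in for a downloaded listing; both relative paths are made up for illustration.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2018-17/segments/example-segment/warc/example-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2018-17/segments/example-segment/warc/example-00001.warc.gz\n"
)

http_urls = full_paths(sample_listing, HTTP_PREFIX)
s3_urls = full_paths(sample_listing, S3_PREFIX)
print(http_urls[0])
```

With a real listing, the bytes of the downloaded `.paths.gz` file would take the place of `sample_listing`.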

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2018-17/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March 2018 Crawl Archive Now Available

The crawl archive for March 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-13/. It contains 3.2 billion web pages and 250+ TiB of uncompressed content, crawled between March 17th and 25th.

The March crawl contains 800 million new URLs, not contained in any crawl archive before. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the Nov/Dec/Jan 2017/2018 webgraph data set
  • a breadth-first side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 60 million hosts or top 30 million domains of the webgraph dataset
  • a random sample taken from WAT files of the February crawl
  • and the continued and increased donation of URLs from mixnode.com

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2018-13/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2018-13/warc.paths.gz | 80000 | 67.66 |
| WAT files | CC-MAIN-2018-13/wat.paths.gz | 80000 | 20.38 |
| WET files | CC-MAIN-2018-13/wet.paths.gz | 80000 | 8.8 |
| Robots.txt files | CC-MAIN-2018-13/robotstxt.paths.gz | 80000 | 0.21 |
| Non-200 responses files | CC-MAIN-2018-13/non200responses.paths.gz | 80000 | 1.83 |
| URL index files | CC-MAIN-2018-13/cc-index.paths.gz | 302 | 0.24 |

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2018-13/. The columnar index has also been updated to include this crawl.

We are grateful to our friends at mixnode for donating a seed list of 320 million URLs to enhance the Common Crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2018 Crawl Archive Now Available

The crawl archive for February 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-09/. It contains 3.4 billion web pages and 270+ TiB of uncompressed content, crawled between February 17th and 26th.

The February crawl contains more than one billion new URLs, not contained in any crawl archive before. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the January 2018 webgraph data set
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts or top 25 million domains of the webgraph dataset
  • a random sample taken from WAT files of the January crawl
  • and the continued and increased donation of URLs from mixnode.com

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2018-09/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2018-09/warc.paths.gz | 80000 | 71.96 |
| WAT files | CC-MAIN-2018-09/wat.paths.gz | 80000 | 22.03 |
| WET files | CC-MAIN-2018-09/wet.paths.gz | 80000 | 9.62 |
| Robots.txt files | CC-MAIN-2018-09/robotstxt.paths.gz | 80000 | 0.22 |
| Non-200 responses files | CC-MAIN-2018-09/non200responses.paths.gz | 80000 | 2.1 |
| URL index files | CC-MAIN-2018-09/cc-index.paths.gz | 302 | 0.25 |

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2018-09/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

We are grateful to our friends at mixnode for donating a seed list of 300 million URLs to enhance the Common Crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Index to WARC Files and URLs in Columnar Format

We’re happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) makes it possible to query or process the index efficiently and saves time and computing resources. Especially if only a few columns are accessed, recent big data tools will run impressively fast. So far, we’ve tested two of them: Apache Spark and AWS Athena. The latter makes it possible to run SQL queries on the columnar data even without launching a server. Below you’ll find examples of how to query the data with Athena. Examples and instructions for SparkSQL are in preparation. But you are free to use any other tool: the columnar index is free to access or download for anybody. You’ll find all files on:
s3://commoncrawl/cc-index/table/cc-main/warc/

Running SQL Queries with Athena

AWS Athena is a serverless service to analyze data on S3 using SQL. With Presto under the hood you even get a long list of extra functions including lambda expressions. Usage of Athena is not free, but it has an attractive pricing model: you pay only for the data scanned (currently $5.0 per TiB). The index table of a single monthly crawl is about 300 GB in size. That defines the upper bound for a query over one crawl, but most queries require only part of the data to be scanned.

Let’s start and register the Common Crawl index as a database table in Athena:

1. open the Athena query editor. Make sure you’re in the us-east-1 region where all the Common Crawl data is located. You need an AWS account to access Athena; please follow the AWS Athena user guide on how to register and set up Athena.

2. to create a database (here called “ccindex”) enter the command

CREATE DATABASE ccindex


and press “Run query”

3. make sure that the database “ccindex” is selected and proceed with “New Query”

4. create the table by executing the following SQL statement:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';


It will create a table “ccindex” with a schema that fits the data on S3. The two “PARTITIONED BY” columns are actually subdirectories, one for every monthly crawl and the WARC subset. Partitions allow us to update the table every month and also help to limit the costs to query the data.

5. to make Athena recognize the data partitions on S3, you have to execute the SQL statement:

MSCK REPAIR TABLE ccindex


Note that this command is also necessary to make newer crawls appear in the table. Every month we’ll add a new partition (a “directory”, e.g., crawl=CC-MAIN-2018-09/). The new partition is not visible and searchable unless it has been discovered by the repair table command. If you run the command you’ll see which partitions have been newly discovered, e.g.:

Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=warc
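
Because each partition is just a subdirectory named `column=value`, the S3 location of a partition can be derived from the repair output. The following is a small sketch of that mapping, assuming the table location used in the CREATE TABLE statement above; the helper names `parse_partition` and `partition_location` are hypothetical.

```python
TABLE_LOCATION = "s3://commoncrawl/cc-index/table/cc-main/warc/"


def parse_partition(spec: str) -> dict[str, str]:
    """Split a spec such as 'crawl=.../subset=...' into a column -> value mapping."""
    return dict(part.split("=", 1) for part in spec.strip("/").split("/"))


def partition_location(crawl: str, subset: str) -> str:
    """Build the subdirectory that holds one (crawl, subset) partition."""
    return f"{TABLE_LOCATION}crawl={crawl}/subset={subset}/"


# One line of the repair output shown above.
repair_line = "Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=warc"

# Everything after the table name is the partition spec.
spec = repair_line.split("ccindex:", 1)[1]
partition = parse_partition(spec)
location = partition_location(partition["crawl"], partition["subset"])
print(location)
```

Filtering a query on `crawl` and `subset` restricts the scan to the files below one such directory, which is why the partition columns help to keep costs down.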

Now you’re ready to run the first query. We’ll count the number of pages per domain within a single top-level domain. As before, press “Run query” after you’ve entered the query into the query editor frame:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC


The result appears seconds later and only 2.12 MB of data has been scanned! Not bad: the query has cost less than one cent. We’ve filtered the data by a partition (a monthly crawl) and selected a small top-level domain (.no). It’s good practice to start developing more complex queries with such filters applied to keep the cost of trial runs low.
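
As a rough check of the cost figures, the arithmetic can be written out in Python. This sketch assumes the price of $5.0 per TiB quoted above, treats MB and GB as binary units, and ignores any minimum charge or rounding the service may apply; `scan_cost_usd` is a hypothetical helper, not part of any AWS library.

```python
PRICE_PER_TIB_USD = 5.0  # price quoted in the text; check the current pricing
BYTES_PER_TIB = 2 ** 40


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of a query that scans `bytes_scanned` bytes, billed per TiB scanned."""
    return bytes_scanned / BYTES_PER_TIB * PRICE_PER_TIB_USD


first_query = scan_cost_usd(2.12 * 2 ** 20)  # the 2.12 MB scanned by the query above
full_table = scan_cost_usd(300 * 2 ** 30)    # about 300 GB: one monthly crawl, all columns

print(f"first query: ${first_query:.6f}")
print(f"full scan of one crawl: ${full_table:.2f}")
```

Under these assumptions the first query costs a tiny fraction of a cent, while scanning the complete index of one monthly crawl stays below two dollars.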

But let’s continue with a second example which demonstrates the power of Presto functions: we try to find domains which provide multilingual content. One possible way is to look for ISO 639-1 language codes in the URL, e.g., en in https://example.com/about/en/page.html. You can find the full SQL expression on GitHub. For demonstration purposes we restrict the search to a single, small TLD (.va for Vatican State). The magic is done by

UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)


which first extracts all two-letter path elements (e.g., /en/) and unrolls the elements into a new column “url_path_lang” (if two or more path elements are found, you get multiple rows). Now we count pages and unique languages and let Presto/Athena also create a histogram of language codes.
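
To see what the regular expression and the unrolling do, here is a small local emulation in Python. It is only a sketch of the idea, not the Athena query itself: `re.findall` stands in for `regexp_extract_all`, the list comprehension for `UNNEST`, and `Counter` for the histogram; the sample paths are invented.

```python
import re
from collections import Counter

# Same pattern as in the SQL snippet: two lowercase letters enclosed by slashes.
# The lookbehind and lookahead leave the slashes unconsumed, so adjacent
# elements such as /en/fr/ are both found.
LANG_SEGMENT = re.compile(r"(?<=/)(?:[a-z][a-z])(?=/)")


def path_languages(url_path: str) -> list[str]:
    """Return every two-letter path element of `url_path`."""
    return LANG_SEGMENT.findall(url_path)


# Invented sample paths.
paths = [
    "/about/en/page.html",
    "/en/fr/index.html",
    "/content/vatican/it/news.html",
    "/index.html",
]

# "Unnest": one row per (path, language) pair; paths without a match yield no row.
rows = [(path, lang) for path in paths for lang in path_languages(path)]

# Histogram of language codes over all rows.
histogram = Counter(lang for _, lang in rows)
print(rows)
print(histogram)
```

Note that the pattern accepts any two-letter path element, so a directory such as /js/ would also be counted; the result is a heuristic rather than a validated list of ISO 639-1 codes.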

You can find more SQL examples and resources on the cc-index-table project page on github. We’re also working to provide examples to process the table using SparkSQL. First experiments are promising: you get results within minutes even on a small Spark cluster. That’s not seconds as with Athena, but you’re more flexible, esp. regarding the output format – Athena supports only CSV. Please also check the Athena release notes and the current list of limitations to find out which Presto version is used and which functions are supported.

We hope the new data format will help you to get value from the Common Crawl archives, in addition to the existing services.

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018. These graphs, along with ranked lists of hosts and domains, follow the prior web graph releases (Feb/Mar/Apr 2017, May/Jun/Jul 2017 and Aug/Sep/Oct 2017). Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the preceding announcements.

Please note that the first released version (released 2018-02-08, withdrawn 2018-02-21) contained only links from the January 2018 crawl; see the notice on the Common Crawl user group. On 2018-02-28 a fix was provided with graphs and rankings containing all links, hosts and/or domains over all 3 crawls. We also provide the erroneously released graphs and rankings from the January 2018 crawl.

What’s new?

Here is a summary of notable aspects and changes of this web graph release:

  • a bug has been fixed which caused relative links pointing to a different host (//www.example.com/index.html) not to be added as edges of the host/domain-level webgraphs
  • the domain graph now contains the number of hosts per domain as additional column in the vertices and rankings files
  • the naming scheme has changed – the release name is now part of the file name
  • webgraph offset files are not released any more; they can be created by running

    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-host
    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-domain

Host-level graph

The graph consists of 2.75 billion nodes and 8.6 billion edges. The graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to from a link on a crawled page. There are 2.67 billion dangling nodes (97%) and the largest strongly connected component contains only 65 million (2.3%) nodes. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
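The host name normalization described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (strip a leading www., reverse the dot-separated labels), not the code used in the processing pipeline; the function names are hypothetical.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed host name back into the usual notation.

    A stripped leading 'www.' cannot be recovered.
    """
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Reversed names sort hosts of the same domain next to each other, which is convenient when scanning the vertices files.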

You can download the graph and the ranks of all 2.75 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/ as prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Nov/Dec/Jan 2017-18 host-level webgraph

Size | File | Description
15.9 GB | cc-main-2017-18-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files
40.0 GB | cc-main-2017-18-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 28 edges files
16.4 GB | cc-main-2017-18-nov-dec-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2017-18-nov-dec-jan-host.properties |
24.2 GB | cc-main-2017-18-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2017-18-nov-dec-jan-host-t.properties |
1 kB | cc-main-2017-18-nov-dec-jan-host.stats | WebGraph statistics
38.1 GB | cc-main-2017-18-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not accepted (cf. section “divisions” in the documentation on publicsuffix.org). For example, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains. Instead, they are aggregated into the domains blogspot.com and amazonaws.com, respectively, and stored in the reversed form (e.g., com.blogspot).
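The aggregation rule can be illustrated with a toy Python sketch. It uses a tiny hard-coded set of suffixes as a hypothetical stand-in for the “ICANN” division of the public suffix list and ignores the list's wildcard and exception rules, so it is not a replacement for a real public suffix library. All names are made up for illustration.

```python
# Hypothetical, heavily truncated stand-in for the ICANN division
# of the public suffix list.
ICANN_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host, suffixes=ICANN_SUFFIXES):
    """Return the public suffix plus one more label, or None if not found."""
    labels = host.split(".")
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None


def reversed_domain(host):
    """Aggregate a host to its pay-level domain in reversed notation."""
    pld = pay_level_domain(host)
    return None if pld is None else ".".join(reversed(pld.split(".")))


print(reversed_domain("foo.blogspot.com"))
print(reversed_domain("commoncrawl.s3.amazonaws.com"))
```

Because blogspot.com and s3.amazonaws.com are listed only in the “private” division, they are absent from the suffix set here and both hosts collapse to the domain registered under com.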

The domain-level graph has 94 million nodes and 1.44 billion edges. 56 million nodes (59%) are dangling nodes; the largest strongly connected component covers 33 million nodes (35%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2017-18 domain-level webgraph

Size | File | Description
0.67 GB | cc-main-2017-18-nov-dec-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
5.7 GB | cc-main-2017-18-nov-dec-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.1 GB | cc-main-2017-18-nov-dec-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2017-18-nov-dec-jan-domain.properties |
3.3 GB | cc-main-2017-18-nov-dec-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2017-18-nov-dec-jan-domain-t.properties |
1 kB | cc-main-2017-18-nov-dec-jan-domain.stats | WebGraph statistics
2.0 GB | cc-main-2017-18-nov-dec-jan-domain-ranks.txt.gz | harmonic centrality and pagerank
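Since the release name is part of both the path and the file names, download URLs for the host and domain files can be assembled mechanically. The following Python sketch only builds the URL strings from the prefix and file names given above; the function name is hypothetical and nothing is downloaded.

```python
RELEASE = "cc-main-2017-18-nov-dec-jan"
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph"


def graph_file_url(level, suffix, release=RELEASE):
    """Build the HTTPS URL of a webgraph file.

    level  -- "host" or "domain"
    suffix -- remainder of the file name, e.g. "-ranks.txt.gz" or ".graph"
    """
    if level not in ("host", "domain"):
        raise ValueError("level must be 'host' or 'domain'")
    return f"{HTTPS_PREFIX}/{release}/{level}/{release}-{level}{suffix}"


print(graph_file_url("domain", "-ranks.txt.gz"))
print(graph_file_url("host", "-t.graph"))
```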

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 94 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2017-2018)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 26073210 | 2 | 0.013220 | com.facebook
2 | 25501832 | 1 | 0.016444 | com.googleapis
3 | 23718256 | 3 | 0.009278 | com.google
4 | 23371534 | 4 | 0.008406 | com.twitter
5 | 22832192 | 5 | 0.007823 | com.youtube
6 | 21653376 | 6 | 0.006112 | org.w
7 | 20324636 | 7 | 0.004710 | org.gmpg
8 | 20045928 | 8 | 0.003501 | com.instagram
9 | 19837996 | 10 | 0.002871 | com.linkedin
10 | 19439618 | 12 | 0.002753 | org.wordpress
11 | 19334234 | 14 | 0.002070 | com.wordpress
12 | 19214522 | 17 | 0.001665 | com.pinterest
13 | 19145770 | 27 | 0.001242 | org.wikipedia
14 | 19121822 | 23 | 0.001462 | com.gravatar
15 | 18842296 | 33 | 0.000966 | com.blogspot
16 | 18810990 | 11 | 0.002837 | com.bootstrapcdn
17 | 18718320 | 19 | 0.001594 | com.apple
18 | 18626224 | 26 | 0.001255 | com.vimeo
19 | 18434062 | 15 | 0.001863 | com.adobe
20 | 18419880 | 44 | 0.000691 | be.youtu
21 | 18397832 | 34 | 0.000964 | com.amazon
22 | 18350614 | 13 | 0.002084 | com.macromedia
23 | 18323552 | 29 | 0.001015 | com.microsoft
24 | 18321908 | 41 | 0.000757 | gl.goo
25 | 18302296 | 31 | 0.001009 | com.flickr
26 | 18270630 | 46 | 0.000657 | com.tumblr
27 | 18183288 | 59 | 0.000540 | com.yahoo
28 | 18136014 | 20 | 0.001531 | net.doubleclick
29 | 18074436 | 70 | 0.000464 | ly.bit
30 | 18072284 | 32 | 0.000988 | com.amazonaws
31 | 18039506 | 18 | 0.001618 | com.googletagmanager
32 | 17994916 | 35 | 0.000913 | com.paypal
33 | 17957448 | 78 | 0.000417 | eu.europa
34 | 17950818 | 25 | 0.001280 | com.cloudflare
35 | 17880136 | 87 | 0.000397 | com.weebly
36 | 17863816 | 30 | 0.001012 | com.github
37 | 17859140 | 81 | 0.000412 | org.mozilla
38 | 17838500 | 40 | 0.000769 | net.cloudfront
39 | 17830430 | 95 | 0.000348 | co.t
40 | 17794416 | 80 | 0.000414 | org.creativecommons
41 | 17773226 | 102 | 0.000289 | com.googleusercontent
42 | 17757566 | 57 | 0.000562 | org.w3
43 | 17751372 | 39 | 0.000782 | io.github
44 | 17703562 | 97 | 0.000340 | com.soundcloud
45 | 17674626 | 118 | 0.000226 | com.blogger
46 | 17673486 | 138 | 0.000182 | net.slideshare
47 | 17666384 | 108 | 0.000265 | com.reddit
48 | 17650506 | 51 | 0.000617 | com.bing
49 | 17622678 | 147 | 0.000171 | com.myspace
50 | 17614686 | 65 | 0.000474 | com.medium
51 | 17600302 | 117 | 0.000233 | org.archive
52 | 17597652 | 136 | 0.000187 | com.imgur
53 | 17581558 | 66 | 0.000474 | com.list-manage
54 | 17545184 | 37 | 0.000804 | org.apache
55 | 17499074 | 155 | 0.000154 | com.imdb
56 | 17493316 | 240 | 0.000097 | com.about
57 | 17491778 | 28 | 0.001104 | com.gstatic
58 | 17471556 | 169 | 0.000144 | com.wsj
59 | 17464136 | 126 | 0.000218 | com.jimdo
60 | 17462240 | 234 | 0.000101 | com.livejournal
61 | 17450286 | 47 | 0.000649 | com.wp
62 | 17447836 | 129 | 0.000206 | com.issuu
63 | 17445244 | 130 | 0.000204 | com.android
64 | 17443518 | 122 | 0.000222 | com.yelp
65 | 17419300 | 43 | 0.000721 | com.statcounter
66 | 17406774 | 50 | 0.000626 | me.wp
67 | 17392892 | 179 | 0.000138 | com.oracle
68 | 17372570 | 162 | 0.000148 | com.digg
69 | 17368632 | 231 | 0.000102 | me.about
70 | 17367318 | 299 | 0.000078 | com.scribd
71 | 17361946 | 255 | 0.000091 | org.python
72 | 17359688 | 127 | 0.000210 | uk.co.google
73 | 17357006 | 61 | 0.000525 | com.cnn
74 | 17342014 | 124 | 0.000220 | com.nytimes
75 | 17339968 | 319 | 0.000073 | com.quora
76 | 17329620 | 249 | 0.000092 | com.ted
77 | 17321450 | 153 | 0.000161 | com.spotify
78 | 17301998 | 148 | 0.000168 | com.wixsite
79 | 17300500 | 233 | 0.000101 | com.dailymotion
80 | 17297908 | 208 | 0.000118 | com.staticflickr
81 | 17288954 | 390 | 0.000062 | org.chromium
82 | 17276404 | 106 | 0.000273 | com.ytimg
83 | 17269890 | 259 | 0.000089 | com.webs
84 | 17265306 | 145 | 0.000177 | org.ietf
85 | 17255342 | 222 | 0.000109 | com.mozilla
86 | 17243666 | 187 | 0.000133 | net.behance
87 | 17243048 | 191 | 0.000130 | com.disqus
88 | 17242476 | 273 | 0.000085 | com.mysql
89 | 17240042 | 158 | 0.000152 | com.stumbleupon
90 | 17236410 | 268 | 0.000085 | com.foursquare
91 | 17231272 | 314 | 0.000075 | gov.loc
92 | 17213010 | 151 | 0.000164 | org.gnu
93 | 17210118 | 146 | 0.000171 | com.tripadvisor
94 | 17203374 | 361 | 0.000066 | org.nodejs
95 | 17201882 | 378 | 0.000064 | com.storify
96 | 17178790 | 156 | 0.000153 | com.forbes
97 | 17177956 | 60 | 0.000527 | com.huffingtonpost
98 | 17168464 | 133 | 0.000196 | com.dropbox
99 | 17164012 | 199 | 0.000125 | com.typepad
100 | 17156522 | 241 | 0.000097 | com.example
101 | 17150188 | 166 | 0.000146 | uk.co.bbc
102 | 17148528 | 479 | 0.000051 | edu.virginia
103 | 17142618 | 89 | 0.000384 | com.paypalobjects
104 | 17140226 | 48 | 0.000645 | net.fbcdn
105 | 17130568 | 403 | 0.000060 | com.pixabay
106 | 17126316 | 383 | 0.000063 | ca.blogspot
107 | 17118492 | 200 | 0.000124 | org.wikimedia
108 | 17116358 | 297 | 0.000079 | com.githubusercontent
109 | 17115676 | 363 | 0.000066 | com.sun
110 | 17111592 | 36 | 0.000863 | com.squarespace
111 | 17106572 | 292 | 0.000079 | com.goodreads
112 | 17105500 | 56 | 0.000574 | com.fb
113 | 17103768 | 459 | 0.000053 | kr.flic
114 | 17094226 | 431 | 0.000057 | org.ampproject
115 | 17086396 | 530 | 0.000048 | edu.gatech
116 | 17086356 | 180 | 0.000137 | com.theguardian
117 | 17085768 | 96 | 0.000344 | com.wix
118 | 17083032 | 518 | 0.000049 | it.scoop
119 | 17081382 | 427 | 0.000057 | org.sciencemag
120 | 17072106 | 139 | 0.000182 | net.sourceforge
121 | 17062258 | 515 | 0.000049 | com.nike
122 | 17056708 | 400 | 0.000060 | org.eclipse
123 | 17054770 | 432 | 0.000056 | co.g
124 | 17052438 | 269 | 0.000085 | com.tinyurl
125 | 17052256 | 62 | 0.000509 | net.akamaihd
126 | 17047904 | 437 | 0.000055 | org.kernel
127 | 17045616 | 68 | 0.000467 | com.mashable
128 | 17045460 | 489 | 0.000051 | au.com.blogspot
129 | 17042294 | 64 | 0.000480 | org.schema
130 | 17041062 | 620 | 0.000043 | com.discogs
131 | 17038234 | 141 | 0.000181 | com.youtube-nocookie
132 | 17037262 | 370 | 0.000065 | com.npmjs
133 | 17034618 | 298 | 0.000079 | com.symantec
134 | 17023588 | 196 | 0.000126 | com.live
135 | 17018612 | 328 | 0.000072 | com.googlecode
136 | 17016620 | 396 | 0.000061 | com.git-scm
137 | 17012130 | 394 | 0.000061 | com.500px
138 | 17011310 | 198 | 0.000126 | edu.stanford
139 | 17010782 | 416 | 0.000058 | com.unity3d
140 | 17010532 | 686 | 0.000042 | com.wikidot
141 | 16992494 | 334 | 0.000071 | com.alexa
142 | 16983610 | 447 | 0.000054 | com.sap
143 | 16978146 | 250 | 0.000092 | com.businessinsider
144 | 16976726 | 272 | 0.000085 | com.cnet
145 | 16976366 | 372 | 0.000064 | com.getpocket
146 | 16971698 | 252 | 0.000092 | com.go
147 | 16964456 | 232 | 0.000101 | com.washingtonpost
148 | 16957034 | 567 | 0.000046 | com.chrome
149 | 16955976 | 9 | 0.003080 | com.godaddy
150 | 16954152 | 140 | 0.000182 | com.sharethis
151 | 16953542 | 211 | 0.000115 | com.ebay
152 | 16949556 | 506 | 0.000050 | edu.berkeley
153 | 16948812 | 377 | 0.000064 | au.gov.nsw
154 | 16942594 | 289 | 0.000080 | com.msn
155 | 16938334 | 333 | 0.000072 | com.time
156 | 16937752 | 213 | 0.000114 | com.nbcnews
157 | 16934364 | 75 | 0.000429 | edu.utexas
158 | 16930038 | 532 | 0.000048 | com.jetbrains
159 | 16927712 | 317 | 0.000074 | edu.harvard
160 | 16924770 | 545 | 0.000047 | ms.1drv
161 | 16917850 | 189 | 0.000130 | com.etsy
162 | 16914838 | 176 | 0.000140 | gov.nih
163 | 16911752 | 664 | 0.000043 | com.klout
164 | 16905058 | 327 | 0.000072 | edu.mit
165 | 16903928 | 316 | 0.000074 | com.reuters
166 | 16898946 | 235 | 0.000098 | com.mapquest
167 | 16898876 | 318 | 0.000074 | com.wired
168 | 16893364 | 570 | 0.000046 | com.crunchbase
169 | 16893270 | 401 | 0.000060 | gov.nasa
170 | 16890130 | 722 | 0.000040 | com.4shared
171 | 16885770 | 281 | 0.000082 | io.codepen
172 | 16882882 | 295 | 0.000079 | com.photobucket
173 | 16875932 | 257 | 0.000090 | com.udacity
174 | 16865692 | 309 | 0.000076 | com.aol
175 | 16858168 | 408 | 0.000059 | com.cnbc
176 | 16853816 | 293 | 0.000079 | com.tripod
177 | 16848676 | 517 | 0.000049 | org.aarp
178 | 16847720 | 563 | 0.000046 | edu.utah
179 | 16846928 | 342 | 0.000070 | org.npr
180 | 16844128 | 746 | 0.000039 | com.diigo
181 | 16842074 | 303 | 0.000077 | com.meetup
182 | 16840924 | 120 | 0.000223 | com.mailchimp
183 | 16840096 | 367 | 0.000065 | com.gmail
184 | 16835606 | 24 | 0.001310 | ru.yandex
185 | 16834612 | 425 | 0.000057 | com.appspot
186 | 16833556 | 287 | 0.000080 | com.ibm
187 | 16827030 | 338 | 0.000071 | gov.ca
188 | 16826202 | 242 | 0.000095 | com.surveymonkey
189 | 16825532 | 276 | 0.000083 | com.usatoday
190 | 16824988 | 778 | 0.000038 | com.googledrive
191 | 16822846 | 749 | 0.000039 | com.naturalnews
192 | 16819990 | 764 | 0.000038 | io.soup
193 | 16815880 | 340 | 0.000070 | uk.co.telegraph
194 | 16814236 | 163 | 0.000148 | com.eventbrite
195 | 16813884 | 206 | 0.000119 | com.opera
196 | 16813306 | 676 | 0.000043 | com.zappos
197 | 16811868 | 88 | 0.000394 | com.jquery
198 | 16811796 | 692 | 0.000042 | com.wholefoodsmarket
199 | 16809508 | 535 | 0.000048 | com.createspace
200 | 16809072 | 322 | 0.000073 | com.images-amazon
201 | 16807592 | 304 | 0.000077 | com.bloomberg
202 | 16796502 | 193 | 0.000128 | com.twimg
203 | 16793364 | 414 | 0.000058 | com.kickstarter
204 | 16792730 | 103 | 0.000285 | com.addthis
205 | 16791844 | 251 | 0.000092 | com.techcrunch
206 | 16791074 | 804 | 0.000037 | edu.washington
207 | 16790852 | 689 | 0.000042 | com.abebooks
208 | 16790694 | 294 | 0.000079 | com.googlesyndication
209 | 16790246 | 511 | 0.000049 | edu.cornell
210 | 16785260 | 529 | 0.000048 | com.buzzfeed
211 | 16783130 | 412 | 0.000059 | org.un
212 | 16781132 | 263 | 0.000087 | com.stackoverflow
213 | 16780958 | 149 | 0.000166 | com.feedburner
214 | 16779250 | 608 | 0.000044 | com.theverge
215 | 16775130 | 796 | 0.000037 | com.pearltrees
216 | 16774700 | 67 | 0.000473 | com.vk
217 | 16774586 | 375 | 0.000064 | com.latimes
218 | 16765579 | 699 | 0.000042 | com.sublimetext
219 | 16760696 | 498 | 0.000050 | org.rubyonrails
220 | 16755911 | 170 | 0.000142 | com.zendesk
221 | 16754800 | 880 | 0.000035 | com.fotolog
222 | 16754091 | 69 | 0.000466 | me.fb
223 | 16751300 | 577 | 0.000045 | com.audible
224 | 16750615 | 549 | 0.000047 | org.pbs
225 | 16749314 | 536 | 0.000048 | com.deviantart
226 | 16747765 | 410 | 0.000059 | com.wiley
227 | 16746660 | 307 | 0.000077 | org.acm
228 | 16745326 | 862 | 0.000036 | tl.page
229 | 16744572 | 212 | 0.000114 | com.ssl-images-amazon
230 | 16743890 | 824 | 0.000037 | com.instapaper
231 | 16742662 | 741 | 0.000039 | com.kinja
232 | 16742008 | 110 | 0.000253 | com.shopify
233 | 16740811 | 767 | 0.000038 | com.newyorker
234 | 16740369 | 503 | 0.000050 | com.yellowpages
235 | 16736172 | 203 | 0.000122 | org.drupal
236 | 16734978 | 758 | 0.000039 | com.xda-developers
237 | 16732311 | 921 | 0.000035 | com.adsoftheworld
238 | 16731895 | 221 | 0.000110 | org.mediawiki
239 | 16731137 | 279 | 0.000083 | fr.free
240 | 16730080 | 805 | 0.000037 | co.ello
241 | 16729515 | 444 | 0.000054 | com.theatlantic
242 | 16725251 | 409 | 0.000059 | uk.co.dailymail
243 | 16721365 | 1189 | 0.000031 | edu.columbia
244 | 16720295 | 388 | 0.000062 | com.bbc
245 | 16720112 | 45 | 0.000661 | com.yimg
246 | 16718905 | 451 | 0.000054 | com.wikihow
247 | 16718697 | 236 | 0.000098 | net.php
248 | 16714787 | 589 | 0.000044 | com.citysearch
249 | 16701081 | 811 | 0.000037 | com.jigsy
250 | 16699551 | 684 | 0.000043 | com.vice
251 | 16693416 | 992 | 0.000034 | ly.ow
252 | 16692056 | 534 | 0.000048 | com.exacttarget
253 | 16685527 | 261 | 0.000089 | com.salesforce
254 | 16682819 | 539 | 0.000047 | com.cbsnews
255 | 16678018 | 502 | 0.000050 | com.zdnet
256 | 16676526 | 397 | 0.000061 | gov.whitehouse
257 | 16675419 | 582 | 0.000045 | com.ft
258 | 16669694 | 105 | 0.000280 | de.google
259 | 16667239 | 1190 | 0.000031 | edu.yale
260 | 16661478 | 1213 | 0.000031 | edu.ucla
261 | 16657707 | 606 | 0.000044 | uk.co.guardian
262 | 16655324 | 685 | 0.000043 | com.googleblog
263 | 16654076 | 734 | 0.000040 | com.nationalgeographic
264 | 16651951 | 92 | 0.000369 | com.qq
265 | 16649666 | 1166 | 0.000032 | edu.psu
266 | 16649394 | 399 | 0.000060 | uk.co.blogspot
267 | 16648730 | 766 | 0.000038 | com.foxnews
268 | 16648321 | 644 | 0.000043 | org.virtualbox
269 | 16647850 | 523 | 0.000048 | org.maven
270 | 16647058 | 77 | 0.000418 | com.people
271 | 16646731 | 216 | 0.000113 | uk.co.amazon
272 | 16645964 | 258 | 0.000089 | com.hp
273 | 16642738 | 550 | 0.000047 | com.cisco
274 | 16640084 | 777 | 0.000038 | com.economist
275 | 16639283 | 321 | 0.000073 | gov.cdc
276 | 16634710 | 590 | 0.000044 | com.bandsintown
277 | 16634368 | 1326 | 0.000027 | com.indiegogo
278 | 16630754 | 1187 | 0.000031 | com.gizmodo
279 | 16630079 | 218 | 0.000112 | com.windowsphone
280 | 16628634 | 584 | 0.000045 | org.hbr
281 | 16628196 | 919 | 0.000035 | com.authorstream
282 | 16627672 | 439 | 0.000055 | edu.cmu
283 | 16624396 | 851 | 0.000036 | com.timeanddate
284 | 16621468 | 1186 | 0.000031 | com.evernote
285 | 16620476 | 578 | 0.000045 | com.dropboxusercontent
286 | 16619394 | 1160 | 0.000033 | com.sciencedaily
287 | 16616793 | 687 | 0.000042 | com.wikia
288 | 16615286 | 224 | 0.000108 | com.bandcamp
289 | 16613293 | 395 | 0.000061 | org.whatbrowser
290 | 16612143 | 256 | 0.000090 | io.atom
291 | 16612009 | 1259 | 0.000029 | in.blogspot
292 | 16610174 | 714 | 0.000040 | com.dpreview
293 | 16610098 | 280 | 0.000083 | com.smugmug
294 | 16609456 | 171 | 0.000142 | com.weibo
295 | 16605754 | 528 | 0.000048 | com.theknot
296 | 16604115 | 751 | 0.000039 | com.merchantcircle
297 | 16599614 | 871 | 0.000035 | us.imageshack
298 | 16598567 | 882 | 0.000035 | com.slate
299 | 16598491 | 197 | 0.000126 | com.blogblog
300 | 16596575 | 715 | 0.000040 | org.imagemagick
301 | 16594197 | 1197 | 0.000031 | org.arxiv
302 | 16591680 | 476 | 0.000051 | com.squareup
303 | 16591661 | 369 | 0.000065 | com.skype
304 | 16588342 | 1428 | 0.000023 | edu.ucsd
305 | 16586521 | 1297 | 0.000028 | com.ning
306 | 16582959 | 575 | 0.000046 | com.tinypic
307 | 16582704 | 493 | 0.000050 | com.giphy
308 | 16582423 | 696 | 0.000042 | com.box
309 | 16582058 | 311 | 0.000076 | com.nypost
310 | 16576626 | 1454 | 0.000023 | com.posterous
311 | 16576158 | 688 | 0.000042 | com.bookdepository
312 | 16576073 | 885 | 0.000035 | com.brandyourself
313 | 16575175 | 1249 | 0.000029 | edu.upenn
314 | 16573309 | 1155 | 0.000033 | org.eff
315 | 16572621 | 478 | 0.000051 | org.postgresql
316 | 16571814 | 677 | 0.000043 | de.blogspot
317 | 16568213 | 407 | 0.000059 | com.angieslist
318 | 16564953 | 787 | 0.000038 | com.samsung
319 | 16563339 | 843 | 0.000036 | com.comixology
320 | 16561663 | 1408 | 0.000024 | edu.wisc
321 | 16560984 | 1161 | 0.000032 | gov.census
322 | 16559941 | 747 | 0.000039 | com.shutterstock
323 | 16559463 | 1323 | 0.000027 | uk.ac.cam
324 | 16558927 | 1171 | 0.000032 | gov.nist
325 | 16558858 | 543 | 0.000047 | com.geocities
326 | 16558841 | 168 | 0.000144 | com.xing
327 | 16558455 | 422 | 0.000057 | com.oreilly
328 | 16558027 | 1459 | 0.000023 | edu.purdue
329 | 16556658 | 716 | 0.000040 | com.nature
330 | 16556180 | 1397 | 0.000024 | com.hotmail
331 | 16555028 | 802 | 0.000037 | com.uk
332 | 16554351 | 996 | 0.000034 | com.livestream
333 | 16553202 | 920 | 0.000035 | com.arstechnica
334 | 16552006 | 337 | 0.000071 | com.prnewswire
335 | 16548858 | 284 | 0.000081 | ca.google
336 | 16546727 | 705 | 0.000041 | org.vim
337 | 16545866 | 220 | 0.000111 | com.getclicky
338 | 16543548 | 415 | 0.000058 | int.who
339 | 16541425 | 1436 | 0.000023 | edu.princeton
340 | 16538268 | 569 | 0.000046 | com.entrepreneur
341 | 16538222 | 382 | 0.000063 | com.sxsw
342 | 16538110 | 1499 | 0.000022 | com.angelfire
343 | 16537923 | 1229 | 0.000030 | edu.umich
344 | 16537889 | 426 | 0.000057 | com.springer
345 | 16533600 | 779 | 0.000038 | com.bravesites
346 | 16533096 | 1038 | 0.000033 | org.unesco
347 | 16531594 | 1351 | 0.000026 | uk.ac.ox
348 | 16531456 | 484 | 0.000051 | com.office
349 | 16529055 | 1260 | 0.000029 | org.iso
350 | 16528766 | 1330 | 0.000027 | com.pcworld
351 | 16527778 | 860 | 0.000036 | com.unsplash
352 | 16527375 | 755 | 0.000039 | com.blackberry
353 | 16526680 | 210 | 0.000117 | de.amazon
354 | 16525790 | 781 | 0.000038 | gov.state
355 | 16523581 | 449 | 0.000054 | com.fortune
356 | 16522217 | 703 | 0.000041 | org.aclweb
357 | 16522178 | 786 | 0.000038 | net.vnexpress
358 | 16522049 | 354 | 0.000068 | com.booking
359 | 16521727 | 879 | 0.000035 | com.dynamics
360 | 16521106 | 1020 | 0.000034 | com.weather
361 | 16520245 | 1003 | 0.000034 | com.communitywalk
362 | 16519672 | 763 | 0.000039 | com.vagrantup
363 | 16516137 | 159 | 0.000152 | com.constantcontact
364 | 16514887 | 708 | 0.000041 | jobs.amazon
365 | 16514721 | 1039 | 0.000033 | com.indiatimes
366 | 16512625 | 775 | 0.000038 | com.cbslocal
367 | 16512276 | 1200 | 0.000031 | com.lifehacker
368 | 16511972 | 1432 | 0.000023 | com.vox
369 | 16509793 | 270 | 0.000085 | it.placehold
370 | 16508711 | 565 | 0.000046 | com.newsweek
371 | 16508203 | 1609 | 0.000020 | net.comcast
372 | 16505501 | 209 | 0.000118 | org.joomla
373 | 16505442 | 448 | 0.000054 | com.force
374 | 16505148 | 1299 | 0.000028 | com.politico
375 | 16502701 | 1310 | 0.000028 | org.altervista
376 | 16500671 | 588 | 0.000044 | com.venturebeat
377 | 16498765 | 278 | 0.000083 | gov.ftc
378 | 16497198 | 756 | 0.000039 | com.java
379 | 16497000 | 1264 | 0.000029 | co.vine
380 | 16493364 | 1068 | 0.000033 | com.ubuntu
381 | 16493303 | 1463 | 0.000023 | com.thinkwithgoogle
382 | 16488408 | 446 | 0.000054 | com.businesswire
383 | 16488348 | 253 | 0.000091 | to.amzn
384 | 16488175 | 1343 | 0.000026 | fm.last
385 | 16487061 | 868 | 0.000035 | hu.elte
386 | 16486012 | 1203 | 0.000031 | com.gofundme
387 | 16485698 | 1168 | 0.000032 | ca.cbc
388 | 16484940 | 1071 | 0.000033 | gov.senate
389 | 16482707 | 1590 | 0.000020 | edu.uchicago
390 | 16482550 | 679 | 0.000043 | com.googlesource
391 | 16481201 | 713 | 0.000040 | org.sqlite
392 | 16473573 | 1335 | 0.000026 | com.airbnb
393 | 16471045 | 680 | 0.000043 | gov.noaa
394 | 16470456 | 719 | 0.000040 | com.manta
395 | 16470297 | 142 | 0.000180 | org.bbb
396 | 16466733 | 1237 | 0.000029 | com.searchengineland
397 | 16465690 | 2103 | 0.000014 | com.twitpic
398 | 16465265 | 1406 | 0.000024 | edu.umn
399 | 16464820 | 884 | 0.000035 | com.googlelabs
400 | 16464294 | 1169 | 0.000032 | com.engadget
401 | 16464089 | 1399 | 0.000024 | uk.co.theregister
402 | 16463266 | 519 | 0.000049 | com.inc
403 | 16463082 | 79 | 0.000414 | com.bleacherreport
404 | 16461353 | 264 | 0.000086 | es.google
405 | 16461168 | 1324 | 0.000027 | com.dell
406 | 16459804 | 1650 | 0.000019 | com.blogs
407 | 16459344 | 1236 | 0.000030 | com.stackexchange
408 | 16458810 | 1676 | 0.000019 | edu.usc
409 | 16458357 | 1482 | 0.000022 | com.mtv
410 | 16456299 | 527 | 0.000048 | org.sonatype
411 | 16456115 | 1720 | 0.000018 | mp.j
412 | 16456086 | 1316 | 0.000027 | com.variety
413 | 16455553 | 740 | 0.000039 | org.gnupg
414 | 16454441 | 2510 | 0.000011 | edu.unl
415 | 16453361 | 1332 | 0.000027 | org.ieee
416 | 16452222 | 1554 | 0.000021 | edu.northwestern
417 | 16451197 | 1184 | 0.000031 | com.americanexpress
418 | 16450122 | 456 | 0.000053 | com.snapchat
419 | 16450065 | 219 | 0.000111 | fr.google
420 | 16448305 | 1307 | 0.000028 | com.discovery
421 | 16447926 | 1257 | 0.000029 | com.businessweek
422 | 16447711 | 1219 | 0.000030 | com.netflix
423 | 16445870 | 1599 | 0.000020 | edu.jhu
424 | 16445859 | 769 | 0.000038 | com.jsbin
425 | 16445370 | 128 | 0.000209 | com.googleadservices
426 | 16445100 | 735 | 0.000040 | com.intel
427 | 16444823 | 566 | 0.000046 | com.delicious
428 | 16444541 | 1152 | 0.000033 | com.pinimg
429 | 16443282 | 474 | 0.000052 | com.nwsource
430 | 16442373 | 1156 | 0.000033 | tv.ustream
431 | 16439601 | 165 | 0.000147 | it.google
432 | 16439513 | 723 | 0.000040 | br.com.uol
433 | 16439408 | 521 | 0.000048 | com.herokuapp
434 | 16438062 | 312 | 0.000075 | com.bitly
435 | 16432114 | 184 | 0.000134 | com.eepurl
436 | 16431325 | 1620 | 0.000019 | com.examiner
437 | 16431244 | 358 | 0.000067 | com.bizjournals
438 | 16430338 | 855 | 0.000036 | com.souq
439 | 16429560 | 1174 | 0.000032 | au.net.abc
440 | 16429209 | 1192 | 0.000031 | fr.blogspot
441 | 16428874 | 1806 | 0.000016 | edu.rutgers
442 | 16428650 | 858 | 0.000036 | ca.pinterest
443 | 16428386 | 1630 | 0.000019 | com.udemy
444 | 16426324 | 1680 | 0.000018 | uk.co.thesun
445 | 16425975 | 1429 | 0.000023 | com.prezi
446 | 16422871 | 1693 | 0.000018 | com.speakerdeck
447 | 16421467 | 1290 | 0.000028 | com.mlb
448 | 16421465 | 782 | 0.000038 | com.mysanantonio
449 | 16421194 | 1211 | 0.000031 | com.chicagotribune
450 | 16420605 | 720 | 0.000040 | com.shopbop
451 | 16418476 | 1575 | 0.000020 | it.blogspot
452 | 16418348 | 290 | 0.000080 | com.hubspot
453 | 16416163 | 1899 | 0.000015 | edu.msu
454 | 16415995 | 282 | 0.000082 | com.fc2
455 | 16415711 | 697 | 0.000042 | com.moz
456 | 16414248 | 784 | 0.000038 | com.boxofficemojo
457 | 16413782 | 727 | 0.000040 | io.getmdl
458 | 16410305 | 76 | 0.000421 | me.m
459 | 16409061 | 1274 | 0.000028 | gov.fbi
460 | 16406432 | 1966 | 0.000015 | ch.ethz
461 | 16406244 | 262 | 0.000088 | com.dribbble
462 | 16404420 | 194 | 0.000126 | jp.co.yahoo
463 | 16402794 | 1491 | 0.000022 | com.trello
464 | 16401561 | 1026 | 0.000034 | com.slack
465 | 16401275 | 1325 | 0.000027 | net.researchgate
466 | 16400689 | 332 | 0.000072 | edu.nyu
467 | 16398770 | 137 | 0.000185 | com.google-analytics
468 | 16398291 | 555 | 0.000047 | com.wunderground
469 | 16398105 | 429 | 0.000057 | com.naver
470 | 16397960 | 1824 | 0.000016 | com.tutsplus
471 | 16396587 | 2171 | 0.000013 | com.googlepages
472 | 16396128 | 1594 | 0.000020 | edu.academia
473 | 16395104 | 413 | 0.000059 | com.bigcartel
474 | 16394348 | 810 | 0.000037 | it.binged
475 | 16393983 | 1380 | 0.000025 | org.khanacademy
476 | 16393285 | 1194 | 0.000031 | com.reverbnation
477 | 16393035 | 1587 | 0.000020 | com.mac
478 | 16392781 | 1472 | 0.000022 | com.target
479 | 16392485 | 2085 | 0.000014 | edu.asu
480 | 16391796 | 275 | 0.000084 | com.wufoo
481 | 16391219 | 2036 | 0.000014 | edu.arizona
482 | 16390459 | 700 | 0.000041 | uk.co.independent
483 | 16389602 | 1519 | 0.000022 | com.pexels
484 | 16389509 | 1412 | 0.000024 | com.over-blog
485 | 16388260 | 466 | 0.000052 | com.adweek
486 | 16387362 | 260 | 0.000089 | com.myshopify
487 | 16387344 | 1395 | 0.000024 | com.bostonglobe
488 | 16387145 | 1572 | 0.000020 | com.zazzle
489 | 16387134 | 1361 | 0.000025 | com.libsyn
490 | 16386010 | 418 | 0.000058 | com.fastcompany
491 | 16385497 | 580 | 0.000045 | gov.ed
492 | 16385161 | 119 | 0.000223 | com.baidu
493 | 16385107 | 612 | 0.000044 | cn.com.sina
494 | 16384455 | 591 | 0.000044 | gov.fda
495 | 16384415 | 728 | 0.000040 | es.com.blogspot
496 | 16384099 | 1040 | 0.000033 | gov.nps
497 | 16383586 | 1646 | 0.000019 | com.vanityfair
498 | 16382978 | 887 | 0.000035 | ws.snack
499 | 16382297 | 1046 | 0.000033 | com.marketwatch
500 | 16381800 | 1888 | 0.000016 | com.yolasite
501 | 16381324 | 1558 | 0.000021 | com.nba
502 | 16379962 | 109 | 0.000261 | org.networkadvertising
503 | 16379082 | 1153 | 0.000033 | gov.house
504 | 16376893 | 898 | 0.000035 | com.sfgate
505 | 16373505 | 2289 | 0.000012 | edu.caltech
506 | 16372783 | 465 | 0.000053 | com.w3schools
507 | 16372566 | 123 | 0.000221 | jp.co.google
508 | 16372041 | 2035 | 0.000014 | com.instructables
509 | 16369468 | 1821 | 0.000016 | com.msnbc
510 | 16368539 | 1470 | 0.000022 | com.scientificamerican
511 | 16368503 | 1543 | 0.000021 | com.ehow
512 | 16366625 | 1984 | 0.000015 | uk.ac.ucl
513 | 16366040 | 690 | 0.000042 | org.bitbucket
514 | 16365725 | 2162 | 0.000013 | ca.ualberta
515 | 16364806 | 464 | 0.000053 | net.openid
516 | 16364617 | 768 | 0.000038 | org.gradle
517 | 16364203 | 1772 | 0.000017 | org.aclu
518 | 16363540 | 1474 | 0.000022 | com.elpais
519 | 16363484 | 724 | 0.000040 | com.yarnpkg
520 | 16363077 | 2742 | 0.000010 | com.hubpages
521 | 16362675 | 633 | 0.000043 | com.cargocollective
522 | 16361964 | 1640 | 0.000019 | com.mercurynews
523 | 16359187 | 1177 | 0.000032 | com.steampowered
524 | 16358974 | 1845 | 0.000016 | edu.ufl
525 | 16358831 | 1235 | 0.000030 | org.change
526 | 16358719 | 583 | 0.000045 | gov.usda
527 | 16358083 | 828 | 0.000036 | com.warriorplus
528 | 16357302 | 1244 | 0.000029 | com.thenextweb
529 | 16357275 | 1386 | 0.000024 | de.spiegel
530 | 16357064 | 1158 | 0.000033 | com.proofpoint
531 | 16356965 | 809 | 0.000037 | com.whitepages
532 | 16353352 | 1188 | 0.000031 | gov.fcc
533 | 16352815 | 1795 | 0.000017 | com.nfl
534 | 16352068 | 1345 | 0.000026 | com.globo
535 | 16351967 | 3065 | 0.000009 | com.answers
536 | 16351579 | 706 | 0.000041 | org.jenkins-ci
537 | 16351057 | 1557 | 0.000021 | com.billboard
538 | 16350457 | 776 | 0.000038 | ly.snip
539 | 16349376 | 1172 | 0.000032 | com.ggpht
540 | 16349324 | 1780 | 0.000017 | org.ap
541 | 16349172 | 1901 | 0.000015 | edu.indiana
542 | 16349066 | 1445 | 0.000023 | com.nokia
543 | 16348888 | 1943 | 0.000015 | com.ign
544 | 16348185 | 1926 | 0.000015 | com.ikea
545 | 16348047 | 1706 | 0.000018 | edu.umd
546 | 16348024 | 52 | 0.000596 | com.messenger
547 | 16347814 | 757 | 0.000039 | com.msdn
548 | 16345860 | 1801 | 0.000017 | org.weforum
549 | 16345787 | 526 | 0.000048 | org.doi
550 | 16345233 | 202 | 0.000122 | jp.ameblo
551 | 16344577 | 891 | 0.000035 | com.woot
552 | 16344447 | 1283 | 0.000028 | com.patreon
553 | 16343957 | 1538 | 0.000021 | br.com.blogspot
554 | 16343414 | 132 | 0.000199 | ru.mail
555 | 16343276 | 2104 | 0.000014 | com.oxforddictionaries
556 | 16342590 | 744 | 0.000039 | com.photoshelter
557 | 16341919 | 1322 | 0.000027 | gov.uspto
558 | 16341309 | 1652 | 0.000019 | fr.lemonde
559 | 16340939 | 1591 | 0.000020 | com.rollingstone
560 | 16340630 | 1763 | 0.000017 | uk.co.metro
561 | 16340043 | 602 | 0.000044 | com.sciencedirect
562 | 16339730 | 3779 | 0.000007 | mx.unam
563 | 16339436 | 944 | 0.000035 | com.hotfrog
564 | 16338592 | 2127 | 0.000014 | com.fiverr
565 | 16337795 | 173 | 0.000141 | jp.ne.hatena
566 | 16337443 | 1840 | 0.000016 | com.aliexpress
567 | 16336373 | 3072 | 0.000009 | com.123rf
568 | 16336076 | 386 | 0.000063 | au.com.google
569 | 16334801 | 1238 | 0.000029 | com.prweb
570 | 16334152 | 1835 | 0.000016 | br.com.abril
571 | 16332875 | 1487 | 0.000022 | com.pcmag
572 | 16332192 | 873 | 0.000035 | ly.plot
573 | 16331709 | 3250 | 0.000008 | com.blog
574 | 16331549 | 391 | 0.000061 | us.icio
575 | 16331199 | 837 | 0.000036 | com.folkd
576 | 16331176 | 2316 | 0.000012 | org.kiva
577 | 16330752 | 2396 | 0.000012 | edu.brown
578 | 16330624 | 1478 | 0.000022 | com.qz
579 | 16330490 | 1180 | 0.000032 | com.psychologytoday
580 | 16329689 | 2088 | 0.000014 | com.newscientist
581 | 16329114 | 1577 | 0.000020 | com.playstation
582 | 16326425 | 1401 | 0.000024 | edu.si
583 | 16324979 | 846 | 0.000036 | io.material
584 | 16324000 | 1072 | 0.000033 | gov.usa
585 | 16322896 | 1608 | 0.000020 | com.hulu
586 | 16321672 | 1341 | 0.000026 | com.cafepress
587 | 16321195 | 1986 | 0.000015 | ca.utoronto
588 | 16321003 | 1597 | 0.000020 | com.econsultancy
589 | 16320939 | 815 | 0.000037 | gov.copyright
590 | 16320038 | 440 | 0.000055 | gov.irs
591 | 16318690 | 3008 | 0.000009 | cc.co
592 | 16318681 | 1834 | 0.000016 | com.canva
593 | 16317432 | 2792 | 0.000010 | pt.sapo
594 | 16315297 | 1705 | 0.000018 | com.colourlovers
595 | 16314881 | 994 | 0.000034 | com.hotukdeals
596 | 16314271 | 830 | 0.000036 | com.getskeleton
597 | 16312515 | 1495 | 0.000022 | com.nymag
598 | 16312187 | 406 | 0.000059 | com.barnesandnoble
599 | 16311552 | 1258 | 0.000029 | org.worldbank
600 | 16310245 | 2006 | 0.000014 | com.bestbuy
601 | 16310108 | 1783 | 0.000017 | com.nhl
602 | 16308879 | 2013 | 0.000014 | edu.uci
603 | 16308831 | 1398 | 0.000024 | com.boston
604 | 16308814 | 878 | 0.000035 | com.insiderpages
605 | 16307539 | 2856 | 0.000010 | edu.tufts
606 | 16307217 | 365 | 0.000066 | nl.google
607 | 16306423 | 826 | 0.000037 | gov.hhs
608 | 16306066 | 2229 | 0.000013 | edu.osu
609 | 16306024 | 1716 | 0.000018 | edu.duke
610 | 16304996 | 1226 | 0.000030 | com.hootsuite
611 | 16304703 | 247 | 0.000093 | jp.co.amazon
612 | 16302217 | 1534 | 0.000021 | gov.nyc
613 | 16301587 | 1855 | 0.000016 | com.fifa
614 | 16301259 | 1642 | 0.000019 | com.withgoogle
615 | 16300727 | 424 | 0.000057 | com.clicky
616 | 16298229 | 524 | 0.000048 | com.whatsapp
617 | 16297862 | 704 | 0.000041 | com.redbubble
618 | 16297676 | 2934 | 0.000009 | com.friendfeed
619 | 16297637 | 1954 | 0.000015 | com.gawker
620 | 16296981 | 1333 | 0.000027 | org.oecd
621 | 16296568 | 2082 | 0.000014 | nl.xs4all
622 | 16296383 | 1857 | 0.000016 | com.pastebin
623 | 16295427 | 938 | 0.000035 | com.tiki-toki
624 | 16294836 | 2809 | 0.000010 | edu.uic
625 | 16294750 | 1281 | 0.000028 | com.istockphoto
626 | 16294356 | 1605 | 0.000020 | com.hyatt
627 | 16294322 | 2059 | 0.000014 | edu.tamu
628 | 16293017 | 2164 | 0.000013 | edu.ncsu
629 | 16292389 | 1385 | 0.000024 | com.com
630 | 16292103 | 946 | 0.000034 | jp.ac.kobe-u
631 | 16291741 | 906 | 0.000035 | com.quantcast
632 | 16291635 | 1931 | 0.000015 | nl.blogspot
633 | 16291618 | 803 | 0.000037 | com.webmd
634 | 16291353 | 2823 | 0.000010 | com.wolfram
635 | 16291336 | 729 | 0.000040 | ca.amazon
636 | 16290941 | 455 | 0.000053 | net.launchpad
637 | 16290095 | 2170 | 0.000013 | com.wikispaces
638 | 16289600 | 1304 | 0.000028 | com.walmart
639 | 16288978 | 3081 | 0.000009 | edu.colostate
640 | 16288094 | 520 | 0.000048 | in.co.google
641 | 16286417 | 1207 | 0.000031 | com.redhat
642 | 16286409 | 1574 | 0.000020 | com.merriam-webster
643 | 16286142 | 1730 | 0.000018 | int.wipo
644 | 16284744 | 1196 | 0.000031 | com.adage
645 | 16284151 | 1224 | 0.000030 | com.ups
646 | 16283988 | 844 | 0.000036 | com.newsbank
647 | 16283941 | 3078 | 0.000009 | com.squidoo
648 | 16283791 | 1337 | 0.000026 | gov.dot
649 | 16283705 | 1677 | 0.000018 | com.me
650 | 16283271 | 1444 | 0.000023 | com.mediafire
651 | 16283190 | 2122 | 0.000014 | ca.ubc
652 | 16282261 | 2632 | 0.000011 | ca.uwaterloo
653 | 16281888 | 1600 | 0.000020 | edu.unc
654 | 16281405 | 2002 | 0.000015 | org.kde
655 | 16280999 | 2109 | 0.000014 | org.gimp
656 | 16280611 | 477 | 0.000051 | com.pingdom
657 | 16279262 | 2517 | 0.000011 | gd.is
658 | 16279233 | 2713 | 0.000010 | edu.hawaii
659 | 16278204 | 2076 | 0.000014 | com.aljazeera
660 | 16277880 | 1525 | 0.000021 | com.xbox
661 | 16276593 | 1611 | 0.000020 | com.freewebs
662 | 16276064 | 2253 | 0.000013 | com.britannica
663 | 16275159 | 1686 | 0.000018 | uk.co.mirror
664 | 16274962 | 2261 | 0.000013 | uk.co.timesonline
665 | 16274102 | 2065 | 0.000014 | au.com.news
666 | 16273788 | 1532 | 0.000021 | com.xkcd
667 | 16273554 | 1198 | 0.000031 | com.feedly
668 | 16273045 | 2931 | 0.000009 | com.laughingsquid
669 | 16272291 | 1507 | 0.000022 | gov.wa
670 | 16272286 | 1842 | 0.000016 | tv.periscope
671 | 16272122 | 1460 | 0.000023 | com.mixcloud
672 | 16270257 | 2919 | 0.000010 | com.codecademy
673 | 16269871 | 2003 | 0.000015 | edu.illinois
674 | 16269399 | 1679 | 0.000018 | uk.co.huffingtonpost
675 | 16269107 | 387 | 0.000062 | net.themeforest
676 | 16269031 | 1654 | 0.000019 | uk.co.ebay
677 | 16268894 | 379 | 0.000063 | com.ea
678 | 16268536 | 998 | 0.000034 | com.att
679 | 16268423 | 1666 | 0.000019 | net.daum
680 | 16267963 | 2613 | 0.000011 | ca.mcgill
681 | 16265041 | 659 | 0.000043 | com.houzz
682 | 16264648 | 1514 | 0.000022 | com.intuit
683 | 16264335 | 573 | 0.000046 | fr.amazon
684 | 16262384 | 2087 | 0.000014 | com.softpedia
685 | 16261968 | 1872 | 0.000016 | com.autodesk
686 | 16261899 | 207 | 0.000119 | org.icann
687 | 16261856 | 1812 | 0.000016 | com.deadline
688 | 16261306 | 2708 | 0.000010 | edu.vanderbilt
689 | 16261208 | 1643 | 0.000019 | com.foxbusiness
690 | 16260540 | 1598 | 0.000020 | gov.uscourts
691 | 16259038 | 380 | 0.000063 | com.heroku
692 | 16258429 | 1471 | 0.000022 | com.gumroad
693 | 16257667 | 2215 | 0.000013 | com.flipboard
694 | 16256425 | 1578 | 0.000020 | com.us
695 | 16256212 | 1634 | 0.000019 | de.welt
696 | 16255761 | 1193 | 0.000031 | com.deloitte
697 | 16254736 | 2276 | 0.000012 | com.yfrog
698 | 16254687 | 1596 | 0.000020 | org.owasp
699 | 16254424 | 2729 | 0.000010 | com.lynda
700 | 16254139 | 2046 | 0.000014 | org.coursera
701 | 16253423 | 942 | 0.000035 | com.cdbaby
702 | 16252259 | 1303 | 0.000028 | com.sagepub
703 | 16252243 | 1585 | 0.000020 | com.vmware
704 | 16252225 | 2045 | 0.000014 | net.earthlink
705 | 16251417 | 711 | 0.000041 | com.usnews
706 | 16251315 | 1383 | 0.000025 | org.unicef
707 | 16251165 | 3714 | 0.000007 | com.space
708 | 16250745 | 2121 | 0.000014 | com.vogue
709 | 16249713 | 423 | 0.000057 | com.cracked
710 | 16249436 | 1844 | 0.000016 | com.domain
711 | 16249262 | 505 | 0.000050 | net.yahoo
712 | 16248921 | 248 | 0.000092 | com.nielsen
713 | 16247818 | 1066 | 0.000033 | site.tenerifeforum
714 | 16247744 | 2129 | 0.000014 | com.theonion
715 | 16247457 | 732 | 0.000040 | com.atlassian
716 | 16246881 | 726 | 0.000040 | com.sharefile
717 | 16245304 | 821 | 0.000037 | org.osgeo
718 | 16244753 | 2329 | 0.000012 | com.searchenginejournal
719 | 16244591 | 1541 | 0.000021 | com.searchenginewatch
720 | 16243862 | 1669 | 0.000019 | com.windows
721 | 16243809 | 2503 | 0.000011 | org.greenpeace
722 | 16242894 | 1061 | 0.000033 | org.bravenewvoices
723 | 16242695 | 2314 | 0.000012 | edu.wustl
724 | 16242478 | 2015 | 0.000014 | uk.ac.lse
725 | 16242148 | 958 | 0.000034 | com.2findlocal
726 | 16241958 | 1929 | 0.000015 | edu.ucdavis
727 | 16238994 | 2808 | 0.000010 | edu.uoregon
728 | 16238622 | 772 | 0.000038 | org.openweathermap
729 | 16238445 | 1479 | 0.000022 | com.kissmetrics
730 | 16237769 | 2095 | 0.000014 | net.jsfiddle
731 | 16237405 | 1629 | 0.000019 | com.chron
732 | 16237225 | 1900 | 0.000015 | gov.usaid
733 | 16237011 | 1404 | 0.000024 | com.steamcommunity
734 | 16236194 | 941 | 0.000035 | com.ripple
735 | 16234398 | 1773 | 0.000017 | org.craigslist
736 | 16234380 | 1768 | 0.000017 | com.howstuffworks
737 | 16233822 | 788 | 0.000038 | com.hilton
738 | 16233735 | 1382 | 0.000025 | com.alibaba
739 | 16233361 | 2582 | 0.000011 | edu.uga
740 | 16232973 | 2725 | 0.000010 | edu.pitt
741 | 16232849 | 1663 | 0.000019 | com.yoast
742 | 16232226 | 2847 | 0.000010 | com.rottentomatoes
743 | 16232203 | 239 | 0.000097 | org.purl
744 | 16230013 | 1416 | 0.000024 | org.plos
745 | 16229506 | 1849 | 0.000016 | com.espn
746 | 16228657 | 2078 | 0.000014 | com.gamespot
747 | 16226657 | 4149 | 0.000007 | ca.yorku
748 | 16226145 | 2437 | 0.000012 | gov.cia
749 | 16224968 | 385 | 0.000063 | com.youku
750 | 16224769 | 1683 | 0.000018 | com.csmonitor
751 | 16224511 | 890 | 0.000035 | tv.twitch
752 | 16224300 | 3502 | 0.000008 | com.secondlife
753 | 16223843 | 1481 | 0.000022 | com.hollywoodreporter
754 | 16222670 | 1668 | 0.000019 | net.battle
755 | 16221357 | 1870 | 0.000016 | com.irishtimes
756 | 16221082 | 927 | 0.000035 | com.bizcommunity
757 | 16220820 | 2328 | 0.000012 | edu.vt
758 | 16220679 | 2375 | 0.000012 | com.technet
759 | 16220669 | 806 | 0.000037 | uk.co.currys
760 | 16219206 | 3157 | 0.000009 | com.avast
761 | 16217463 | 1485 | 0.000022 | org.fao
762 | 16217257 | 2055 | 0.000014 | com.twilio
763 | 16215764 | 717 | 0.000040 | com.netdna-cdn
764 | 16215730 | 2844 | 0.000010 | com.popsci
765 | 16215299 | 2205 | 0.000013 | com.podbean
766 | 16214805 | 1256 | 0.000029 | org.redcross
767 | 16213945 | 2813 | 0.000010 | org.kqed
768 | 16213937 | 1453 | 0.000023 | us.tx.state
769 | 16213886 | 420 | 0.000058 | br.com.google
770 | 16212530 | 1785 | 0.000017 | mil.navy
771 | 16211785 | 2010 | 0.000014 | com.netvibes
772 | 16211715 | 3255 | 0.000008 | edu.iastate
773 | 16209341 | 1631 | 0.000019 | com.animoto
774 | 16209268 | 2916 | 0.000010 | int.esa
775 | 16209261 | 2214 | 0.000013 | com.makezine
776 | 16208037 | 2024 | 0.000014 | edu.ucsf
777 | 16208029 | 3576 | 0.000008 | uk.ac.manchester
778 | 16206623 | 1904 | 0.000015 | com.foxsports
779 | 16206024 | 1792 | 0.000017 | com.blogtalkradio
780 | 16205145 | 1336 | 0.000026 | com.docker
781 | 16204975 | 1632 | 0.000019 | mil.army
782 | 16204618 | 2335 | 0.000012 | com.lonelyplanet
783 | 16204345 | 1440 | 0.000023 | jp.blogspot
784 | 16203706 | 3517 | 0.000008 | edu.wsu
785 | 16203462 | 1754 | 0.000017 | co.angel
786 | 16202931 | 702 | 0.000041 | com.technorati
787 | 16202713 | 1621 | 0.000019 | com.today
788 | 16202473 | 228 | 0.000104 | com.elegantthemes
789 | 16201395 | 1553 | 0.000021 | com.fedex
790 | 16201336 | 1827 | 0.000016 | com.macworld
791 | 16200855 | 1689 | 0.000018 | ru.spb
792 | 16200440 | 3491 | 0.000008 | org.eu
793 | 16199798 | 3940 | 0.000007 | edu.byu
794 | 16199604 | 1980 | 0.000015 | com.topsy
795 | 16199518 | 1308 | 0.000028 | gov.energy
796 | 16199425 | 1932 | 0.000015 | edu.umass
797 | 16198010 | 1782 | 0.000017 | org.cancer
798 | 16197684 | 827 | 0.000037 | com.themonitor
799 | 16197204 | 1539 | 0.000021 | gov.congress
800 | 16197173 | 453 | 0.000054 | com.zenfolio
801 | 16195131 | 326 | 0.000073 | com.newrelic
802 | 16193637 | 981 | 0.000034 | com.scribblemaps
803 | 16193445 | 1327 | 0.000027 | com.webnode
804 | 16193351 | 1309 | 0.000028 | com.zoho
805 | 16192916 | 1403 | 0.000024 | com.techrepublic
806 | 16192600 | 469 | 0.000052 | jp.ne.sakura
807 | 16191941 | 1475 | 0.000022 | com.html5rocks
808 | 16191512 | 752 | 0.000039 | gov.sec
809 | 16191011 | 246 | 0.000093 | me.line
810 | 16190396 | 560 | 0.000046 | gov.export
811 | 16190321 | 2421 | 0.000012 | com.redbull
812 | 16190245 | 1289 | 0.000028 | de.bund
813 | 16190148 | 1273 | 0.000028 | com.formstack
814 | 16189401 | 1727 | 0.000018 | org.pewresearch
815 | 16187055 | 2504 | 0.000011 | org.documentcloud
816 | 16186153 | 2206 | 0.000013 | com.denverpost
817 | 16185105 | 1791 | 0.000017 | com.freepik
818 | 16184501 | 1164 | 0.000032 | gov.justice
819 | 16184479 | 83 | 0.000405 | com.shareaholic
820 | 16184211 | 857 | 0.000036 | org.bouncycastle
821 | 16184134 | 181 | 0.000137 | info.aboutads
822 | 16183073 | 1006 | 0.000034 | com.weddingbee
823 | 16182519 | 22 | 0.001469 | com.wixstatic
824 | 16181153 | 1822 | 0.000016 | com.sky
8251618089835870.000008edu.syr
826161807044420.000055com.teamviewer
8271618050217410.000017edu.cuny
8281617943714920.000022de.heise
8291617939822900.000012com.refinery29
8301617910513730.000025com.gigaom
8311617905176530.000004nr.co
8321617894330660.000009com.seekingalpha
833161787955090.000049com.informit
8341617839421690.000013com.pbworks
8351617677928570.000010com.threadless
8361617652310090.000034com.spoke
8371617624416900.000018com.salon
838161753148640.000036com.tractorsupply
839161750364360.000055ru.vkontakte
8401617341970920.000004com.xanga
841161730618340.000036com.withoutabox
8421617216734310.000008edu.rochester
8431617200419280.000015google.blog
8441617199630390.000009cc.tiny
8451617186223380.000012com.sony
846161717454950.000050com.mapbox
8471617157922160.000013edu.uiuc
8481617143513690.000025com.justgiving
849161710189730.000034com.quandl
8501617090730440.000009edu.oregonstate
8511617077930880.000009edu.rice
852161702579890.000034com.citysquares
8531616930315220.000021com.accenture
8541616904517170.000018gov.weather
8551616826425780.000011ch.cern
8561616778723650.000012com.nbcsports
8571616766534580.000008tt.db
8581616745713110.000027gov.ny
8591616690437640.000007com.panoramio
860161664713980.000061com.list-manage1
8611616571634990.000008edu.fsu
8621616560216720.000019com.indeed
8631616482416700.000019org.gnome
8641616426723060.000012com.motherjones
8651616410932860.000008com.techsmith
8661616402320210.000014de.bild
867161637869870.000034com.zwire
868161637689360.000035org.gwtproject
8691616344115930.000020uk.co.thetimes
8701616174511830.000031com.hostgator
8711616168322470.000013com.shutterfly
8721616122475570.000004com.weheartit
8731616107810370.000033com.lacartes
8741615988521200.000014me.flavors
8751615949418690.000016com.digitaltrends
8761615887725180.000011com.lego
8771615886746850.000006com.skyrock
8781615824814550.000023com.ssrn
879161565917740.000038ru.google
8801615638016610.000019ru.narod
8811615531927110.000010au.edu.anu
8821615513529070.000010net.nocookie
8831615439516820.000018com.infoworld
8841615373617770.000017com.starbucks
8851615281710180.000034com.live5news
8861615193239840.000007to.gplus
8871615147040440.000007org.nypl
8881615145421060.000014com.trendmicro
8891615093516160.000019com.codeplex
8901615078616490.000019com.gettyimages
891161501625160.000049com.typeform
8921614938018140.000016com.amzn
8931614921217330.000018com.upwork
8941614895923740.000012com.hatenablog
8951614853712750.000028uk.co.eventbrite
8961614831129520.000009ly.cl
897161482679790.000034au.com.yelp
8981614766912210.000030com.linksynergy
8991614762324500.000012tv.blip
900161475939660.000034com.strawberryperl
9011614669225160.000011com.ezinearticles
9021614625057460.000005com.minus
9031614612312800.000028gov.archives
904161460029910.000034net.brownbook
9051614589520410.000014org.c-span
9061614583343990.000006com.treehugger
9071614569614960.000022se.google
9081614559014130.000024com.smashingmagazine
9091614520531200.000009com.askmen
9101614459923590.000012com.rt
9111614436213400.000026gov.sba
9121614336821910.000013com.madmimi
9131614328932010.000009com.voanews
9141614219310310.000034edu.alamo
9151614127812930.000028be.google
91616141249980.000323org.nginx
9171614025527900.000010com.asus
9181613977816990.000018com.techradar
9191613970220090.000014com.allthingsd
9201613907421500.000013com.mentalfloss
9211613895540090.000007net.minecraft
9221613770244170.000006com.pbase
9231613622316590.000019com.bloglovin
9241613601415230.000021com.forrester
925161359249290.000035com.sacurrent
9261613555611820.000032com.strikingly
9271613537717810.000017org.openoffice
9281613481710540.000033com.garmin
9291613475411570.000033org.postimg
9301613456524750.000011com.eonline
9311613418015950.000020com.lulu
9321613193618090.000016com.ibtimes
933161317789240.000035com.fabric
9341613171316550.000019com.zillow
935161316239900.000034com.shareasale
9361613149121610.000013com.history
9371613133215420.000021com.mcafee
9381613103154420.000005com.archdaily
939161307913240.000073com.cloudinary
9401613060437000.000007com.thingiverse
9411613041636330.000008com.starwars
9421613003931490.000009com.pitchfork
9431613000735280.000008com.gyazo
9441612970818610.000016ca.huffingtonpost
945161290393550.000068com.monster
9461612894740340.000007com.tistory
9471612878340790.000007edu.utk
9481612854938580.000007com.lmgtfy
9491612849610640.000033mp.mailchi
9501612786017240.000018com.ssllabs
9511612748112470.000029org.moodle
9521612630610170.000034org.simile-widgets
9531612614222310.000013com.invisionapp
9541612601521050.000014com.real
9551612528936400.000007edu.buffalo
9561612497333420.000008com.indiewire
957161249592830.000082org.debian
9581612481120300.000014com.ew
9591612481115310.000021com.uber
9601612474750510.000006edu.gsu
961161244578360.000036com.list-manage2
9621612438013640.000025net.java
9631612393311670.000032com.tandfonline
964161239114860.000051com.taobao
9651612360316600.000019com.bmj
9661612324034200.000008org.lifehack
9671612280823020.000012com.canalblog
9681612259721410.000013edu.ucsc
969161223689800.000034org.tpr
9701612235827810.000010nl.utwente
9711612160819410.000015com.getresponse
9721612157726310.000011com.dallasnews
9731612099822370.000013edu.colorado
9741611856016380.000019com.ecwid
9751611847612870.000028es.amazon
9761611847110220.000034com.ibegin
9771611813516370.000019com.deezer
9781611798913940.000024jp.ne.goo
9791611777219710.000015jp.ne.biglobe
9801611775621300.000014edu.bu
981161177092140.000114com.homestead
982161174779310.000035com.chamberofcommerce
9831611613058920.000005ie.tcd
9841611577240850.000007edu.uconn
9851611473135900.000008edu.usf
9861611470215260.000021com.warnerbros
9871611434847770.000006ca.ucalgary
9881611385720140.000014hk.com.google
989161137861780.000139com.parallels
9901611346718410.000016com.getfirebug
9911611321915300.000021com.waze
9921611314133720.000008ru.org
9931611294931830.000009com.polyvore
9941611262424730.000011com.campaignmonitor
9951611255516840.000018com.thehill
996161124079850.000034com.showmelocal
9971611235313210.000027gov.usgs
9981611193719080.000015jp.or.nhk
9991611175758510.000005com.rapidshare
10001611164730400.000009com.expedia
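
For readers who want to process the rankings programmatically, here is a minimal Python sketch that parses one line of a ranks file and puts the reversed name back into its usual order. It assumes whitespace-separated columns in the order harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name. The names `RankRow`, `unreverse` and `parse_rank_line` are illustrative and not part of any released tooling.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    harmonic_rank: int
    harmonic_value: float
    pagerank_rank: int
    pagerank_value: float
    domain: str  # name in its usual (un-reversed) order


def unreverse(rev_name: str) -> str:
    """Turn a reversed name such as 'uk.ac.lse' back into 'lse.ac.uk'."""
    return ".".join(reversed(rev_name.split(".")))


def parse_rank_line(line: str) -> RankRow:
    # Assumed column order (an assumption, check the header of the file you use):
    # harmonic rank, harmonic value, PageRank rank, PageRank value, reversed name.
    # Any further columns are ignored.
    hc_rank, hc_val, pr_rank, pr_val, rev_name = line.split()[:5]
    return RankRow(int(hc_rank), float(hc_val), int(pr_rank),
                   float(pr_val), unreverse(rev_name))


row = parse_rank_line("724\t16242478\t2015\t0.000014\tuk.ac.lse")
print(row.domain, row.harmonic_rank, row.pagerank_rank)  # lse.ac.uk 724 2015
```

Un-reversing is purely a string operation here; it does not restore a stripped leading www.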

Graphs of January 2018 Crawl

We erroneously released webgraphs and rankings of a single monthly crawl (January 2018) instead of a quarterly release covering three crawls. To ensure reproducibility, we've preserved the erroneous release.

The host-level graph consists of 775 million nodes and 2.7 billion edges. The graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. There are 719 million dangling nodes (93%).

Download files of the Common Crawl Jan 2018 host-level webgraph

Size | File | Description
4.84 GB | cc-main-2018-jan-host-vertices.txt.gz | nodes ⟨id, rev host⟩
10.21 GB | cc-main-2018-jan-host-edges.txt.gz | edges ⟨from_id, to_id⟩
4.90 GB | cc-main-2018-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-host.properties |
5.94 GB | cc-main-2018-jan-host-t.graph | transpose of the graph (outlinks mapped to inlinks)
2 kB | cc-main-2018-jan-host-t.properties |
1 kB | cc-main-2018-jan-host.stats | WebGraph statistics
10.79 GB | cc-main-2018-jan-host-ranks.txt.gz | harmonic centrality and pagerank
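
As a rough illustration of how the share of dangling nodes could be estimated from an edges file, the sketch below counts nodes that never appear as the source of an edge. It rests on two assumptions: that each edge line holds two whitespace-separated integer ids, and that "no outgoing edge" is an acceptable stand-in for "not crawled". The helper names `read_edges` and `dangling_fraction` are made up for this example, and the data is a tiny synthetic sample rather than the real file.

```python
import gzip
import io


def read_edges(stream):
    """Yield (from_id, to_id) integer pairs from a text stream of edge lines."""
    for line in stream:
        if line.strip():
            from_id, to_id = line.split()[:2]
            yield int(from_id), int(to_id)


def dangling_fraction(num_nodes, edges):
    """Return (count, fraction) of nodes without any outgoing edge."""
    has_outlink = set()
    for from_id, _ in edges:
        has_outlink.add(from_id)
    dangling = num_nodes - len(has_outlink)
    return dangling, dangling / num_nodes


# Tiny synthetic example: 5 nodes, only nodes 0 and 1 link out.
sample = gzip.compress(b"0\t1\n0\t2\n1\t3\n1\t4\n")
with gzip.open(io.BytesIO(sample), "rt") as stream:
    count, fraction = dangling_fraction(5, read_edges(stream))
print(count, fraction)  # 3 0.6
```

For a graph with hundreds of millions of nodes, a Python set of source ids may need a lot of memory; a bit array indexed by node id would be a more economical choice.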

The domain-level graph has 70 million nodes and 835 million edges. 42 million nodes (60%) are dangling nodes, and the largest strongly connected component covers 22 million nodes (31%).

Download files of the Common Crawl Jan 2018 domain-level webgraph

Size | File | Description
0.49 GB | cc-main-2018-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
3.30 GB | cc-main-2018-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
1.80 GB | cc-main-2018-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-domain.properties |
1.89 GB | cc-main-2018-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-jan-domain-t.properties |
1 kB | cc-main-2018-jan-domain.stats | WebGraph statistics
1.46 GB | cc-main-2018-jan-domain-ranks.txt.gz | harmonic centrality and pagerank
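
To make the relation between a graph and its transpose concrete, the following small sketch swaps the direction of every edge in a toy edge list, so that the adjacency lists of the transposed graph hold inlinks instead of outlinks. It works on plain Python tuples and does not read the BVGraph files; the function names `transpose` and `adjacency` are chosen for illustration only.

```python
from collections import defaultdict


def transpose(edges):
    """Swap the direction of every (from_id, to_id) edge."""
    return [(to_id, from_id) for from_id, to_id in edges]


def adjacency(edges):
    """Group edge targets by their source node."""
    result = defaultdict(list)
    for from_id, to_id in edges:
        result[from_id].append(to_id)
    return dict(result)


edges = [(0, 1), (0, 2), (1, 2)]
outlinks = adjacency(edges)
inlinks = adjacency(transpose(edges))
print(outlinks)  # {0: [1, 2], 1: [2]}
print(inlinks)   # {1: [0], 2: [0, 1]}
```

Having the transpose precomputed means the inlinks of a node can be looked up directly, without scanning all edges.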

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!