We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The following improvements have been made for this webgraph release:

  • the graphs now also include edges stemming from HTTP 303 “See Other” redirects (in addition to other HTTP redirect status codes)
  • the Common Crawl robots.txt WARC files are used to get additional host-level redirects including hosts which exclude the entire content in their robots.txt
  • links from robots.txt files to sitemaps are now extracted directly from the robots.txt WARC files; see the Feb/Mar/Apr 2018 web graph announcement for more details about this type of host-level link

Host-level graph

The graph consists of 820 million nodes and 4.55 billion edges. It includes dangling nodes, i.e., hosts that have not been crawled but are linked to from a crawled page. There are 752 million dangling nodes (92%), and the largest strongly connected component contains 50 million nodes (6%).
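To make the notion of a dangling node concrete, here is a minimal Python sketch on a toy edge list. It treats every node without outgoing edges as dangling, which is a simplifying assumption: a crawled host without any outlinks would look the same. The host names and the helper `find_dangling` are made up for illustration and are not part of the cc-webgraph tools.

```python
def find_dangling(edges):
    """Return the nodes that occur in the edge list but have no outgoing edge."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return (sources | targets) - sources


# Toy edge list with hypothetical hosts in reversed notation
edges = [
    ("com.example", "org.example"),
    ("com.example", "net.example"),
    ("org.example", "com.example"),
]
dangling = find_dangling(edges)
print(sorted(dangling))
```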

You can download the graph and the ranks of all 820 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/ as a prefix to access the files from anywhere.

Size | File | Description
5.29 GB | cc-main-2019-aug-sep-oct-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files
20.73 GB | cc-main-2019-aug-sep-oct-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 56 edges files
8.15 GB | cc-main-2019-aug-sep-oct-host.graph | graph in BVGraph format
2 kB | cc-main-2019-aug-sep-oct-host.properties |
10.00 GB | cc-main-2019-aug-sep-oct-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2019-aug-sep-oct-host-t.properties |
1 kB | cc-main-2019-aug-sep-oct-host.stats | WebGraph statistics
11.59 GB | cc-main-2019-aug-sep-oct-host-ranks.txt.gz | harmonic centrality and pagerank
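As a small convenience, the HTTPS prefix and the file names listed above can be combined into full download URLs. The following Python sketch only builds the strings and does not download anything; the helper name `file_url` is an assumption made for this example.

```python
# Prefix and graph name as given in this announcement
BASE_URL = ("https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/"
            "cc-main-2019-aug-sep-oct/host/")
GRAPH_NAME = "cc-main-2019-aug-sep-oct-host"

# File name suffixes taken from the table of host-level graph files
SUFFIXES = [
    "-vertices.paths.gz",
    "-edges.paths.gz",
    ".graph",
    ".properties",
    "-t.graph",
    "-t.properties",
    ".stats",
    "-ranks.txt.gz",
]


def file_url(suffix):
    """Build the HTTPS URL of one host-level graph file (hypothetical helper)."""
    return BASE_URL + GRAPH_NAME + suffix


urls = [file_url(suffix) for suffix in SUFFIXES]
for url in urls:
    print(url)
```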

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
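The reversal can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the code used in the processing pipeline, which may treat edge cases differently; `reverse_host` and `unreverse_host` are hypothetical helper names.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed host name back into the usual order (the 'www.' is not restored)."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```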

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
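The aggregation step can be illustrated with a toy Python sketch. It is not the actual implementation: the suffix set below is a tiny, hand-picked stand-in for the public suffix list (whose wildcard and exception rules are ignored here), the host names are invented, and dropping links within the same domain is an assumption of this sketch.

```python
# Tiny stand-in for the public suffix list, in reversed notation (assumption)
TOY_SUFFIXES = {"com", "org", "uk.co"}


def pay_level_domain(rev_host):
    """Map a reversed host name to its reversed pay-level domain."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first, then keep one more label
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in TOY_SUFFIXES:
            return ".".join(labels[:n + 1])
    return rev_host  # no known suffix matched: leave the name unchanged


def aggregate(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_pld, dst_pld = pay_level_domain(src), pay_level_domain(dst)
        if src_pld != dst_pld:  # assumption: links within one domain are dropped
            domain_edges.add((src_pld, dst_pld))
    return domain_edges


host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.shop", "uk.co.bbc"),
    ("com.example.blog", "com.example.shop"),
]
domain_edges = aggregate(host_edges)
print(domain_edges)
```

Several host-level edges between the same pair of domains end up as a single domain-level edge, which is why the domain graph has far fewer edges than the host graph.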

The domain-level graph has 92.7 million nodes and 2.4 billion edges. Of these nodes, 48 million (52%) are dangling, and the largest strongly connected component covers 36 million (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/.

Download files of the Common Crawl Aug/Sep/Oct 2019 domain-level webgraph

Size | File | Description
0.64 GB | cc-main-2019-aug-sep-oct-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
9.06 GB | cc-main-2019-aug-sep-oct-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.64 GB | cc-main-2019-aug-sep-oct-domain.graph | graph in BVGraph format
2 kB | cc-main-2019-aug-sep-oct-domain.properties |
4.82 GB | cc-main-2019-aug-sep-oct-domain-t.graph | transpose of the graph
2 kB | cc-main-2019-aug-sep-oct-domain-t.properties |
1 kB | cc-main-2019-aug-sep-oct-domain.stats | WebGraph statistics
1.97 GB | cc-main-2019-aug-sep-oct-domain-ranks.txt.gz | harmonic centrality and pagerank
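For working with the ranks file, a line parser along the following lines may be a useful starting point. The column order (harmonic centrality rank and value, PageRank rank and value, reversed name) follows the top-1000 list in this post; that the file is tab-separated and that comment lines start with `#` are assumptions of this sketch and should be checked against the downloaded file. The sample line reuses the values of the first entry of the list.

```python
from typing import NamedTuple, Optional


class RankEntry(NamedTuple):
    hc_rank: int      # rank by harmonic centrality
    hc_value: float   # harmonic centrality value
    pr_rank: int      # rank by PageRank
    pr_value: float   # PageRank value
    rev_name: str     # reversed domain name


def parse_rank_line(line: str) -> Optional[RankEntry]:
    """Parse one line of a ranks file; return None for comment or empty lines."""
    if line.startswith("#") or not line.strip():
        return None
    # Assumption: five tab-separated columns in the order described above
    hc_rank, hc_value, pr_rank, pr_value, rev_name = line.rstrip("\n").split("\t")
    return RankEntry(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_name)


sample = "1\t32193222\t1\t0.020989\tcom.googleapis\n"
entry = parse_rank_line(sample)
print(entry)
```

For the full file, the same function could be applied line by line to a stream opened with `gzip.open(path, "rt")`, so that the 1.97 GB archive never has to be held in memory.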

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 92 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Aug/Sept/Oct 2019)

harmonic centrality rank | hc value | PageRank rank | PageRank value | reversed hostname
1 | 32193222 | 1 | 0.020989 | com.googleapis
2 | 29939490 | 3 | 0.012691 | com.facebook
3 | 29202762 | 2 | 0.012925 | com.google
4 | 26823584 | 4 | 0.007369 | com.twitter
5 | 26304804 | 5 | 0.006660 | org.w
6 | 26023106 | 6 | 0.006435 | com.youtube
7 | 24330930 | 9 | 0.003914 | com.instagram
8 | 23954698 | 7 | 0.004993 | org.gmpg
9 | 23537216 | 8 | 0.004863 | com.googletagmanager
10 | 23456362 | 13 | 0.002913 | com.linkedin
11 | 22601398 | 12 | 0.003086 | org.wordpress
12 | 22511672 | 10 | 0.003602 | com.cloudflare
13 | 22484028 | 22 | 0.001698 | com.gravatar
14 | 22366260 | 23 | 0.001509 | com.pinterest
15 | 22337722 | 19 | 0.002143 | com.wordpress
16 | 22168804 | 14 | 0.002422 | com.bootstrapcdn
17 | 22152656 | 32 | 0.001134 | org.wikipedia
18 | 21939876 | 20 | 0.001777 | com.apple
19 | 21666694 | 42 | 0.000842 | com.blogspot
20 | 21595338 | 21 | 0.001736 | com.jquery
21 | 21576018 | 49 | 0.000713 | be.youtu
22 | 21574068 | 35 | 0.001064 | com.vimeo
23 | 21525328 | 30 | 0.001154 | com.microsoft
24 | 21517514 | 18 | 0.002164 | com.gstatic
25 | 21444512 | 17 | 0.002176 | com.adobe
26 | 21427572 | 39 | 0.000964 | com.amazonaws
27 | 21426674 | 50 | 0.000695 | com.wp
28 | 21393434 | 51 | 0.000681 | com.amazon
29 | 21314258 | 65 | 0.000516 | com.tumblr
30 | 21291646 | 46 | 0.000767 | gl.goo
31 | 21256658 | 25 | 0.001309 | com.macromedia
32 | 21253410 | 29 | 0.001173 | com.baidu
33 | 21136800 | 67 | 0.000501 | ly.bit
34 | 21087412 | 27 | 0.001200 | com.flickr
35 | 21083642 | 24 | 0.001391 | com.github
36 | 21068510 | 89 | 0.000381 | com.yahoo
37 | 21063868 | 41 | 0.000928 | com.google-analytics
38 | 21063204 | 31 | 0.001139 | com.googlesyndication
39 | 21053852 | 57 | 0.000608 | eu.europa
40 | 21051522 | 61 | 0.000541 | org.mozilla
41 | 20997812 | 106 | 0.000300 | com.reddit
42 | 20934970 | 37 | 0.001002 | net.cloudfront
43 | 20930738 | 28 | 0.001184 | ru.yandex
44 | 20907836 | 38 | 0.000964 | com.addthis
45 | 20875048 | 48 | 0.000734 | co.t
46 | 20863874 | 47 | 0.000744 | net.doubleclick
47 | 20860222 | 70 | 0.000482 | org.w3
48 | 20822794 | 98 | 0.000329 | com.googleusercontent
49 | 20819594 | 43 | 0.000814 | com.squarespace
50 | 20815248 | 75 | 0.000462 | com.medium
51 | 20812946 | 91 | 0.000376 | org.creativecommons
52 | 20811076 | 175 | 0.000140 | org.wikimedia
53 | 20788398 | 86 | 0.000417 | com.weebly
54 | 20786242 | 63 | 0.000534 | me.wp
55 | 20764144 | 129 | 0.000221 | com.nytimes
56 | 20754576 | 88 | 0.000400 | io.github
57 | 20744406 | 55 | 0.000625 | com.paypal
58 | 20731050 | 165 | 0.000148 | uk.co.bbc
59 | 20729776 | 58 | 0.000557 | net.jsdelivr
60 | 20723608 | 108 | 0.000297 | com.soundcloud
61 | 20720614 | 172 | 0.000141 | com.imgur
62 | 20696474 | 130 | 0.000210 | com.dropbox
63 | 20676226 | 137 | 0.000181 | com.forbes
64 | 20641078 | 173 | 0.000141 | net.slideshare
65 | 20640886 | 54 | 0.000634 | org.schema
66 | 20637016 | 153 | 0.000163 | com.theguardian
67 | 20619410 | 187 | 0.000136 | com.cnn
68 | 20614482 | 204 | 0.000118 | com.businessinsider
69 | 20589376 | 217 | 0.000109 | com.wsj
70 | 20575384 | 281 | 0.000086 | edu.harvard
71 | 20572806 | 167 | 0.000147 | com.bing
72 | 20571044 | 241 | 0.000098 | com.techcrunch
73 | 20567292 | 290 | 0.000084 | edu.mit
74 | 20557144 | 285 | 0.000084 | com.reuters
75 | 20554074 | 375 | 0.000067 | com.msn
76 | 20549852 | 329 | 0.000075 | com.cnet
77 | 20542104 | 140 | 0.000178 | org.archive
78 | 20538104 | 250 | 0.000094 | com.bloomberg
79 | 20522694 | 33 | 0.001120 | com.fontawesome
80 | 20521374 | 141 | 0.000175 | gov.nih
81 | 20513408 | 93 | 0.000355 | com.shopify
82 | 20513140 | 271 | 0.000089 | com.myspace
83 | 20507048 | 207 | 0.000116 | edu.stanford
84 | 20496152 | 53 | 0.000647 | com.wix
85 | 20489102 | 200 | 0.000120 | com.stackoverflow
86 | 20487938 | 434 | 0.000057 | com.googleblog
87 | 20484632 | 154 | 0.000163 | org.apache
88 | 20478194 | 229 | 0.000102 | com.oracle
89 | 20475392 | 214 | 0.000110 | com.washingtonpost
90 | 20472386 | 260 | 0.000091 | com.android
91 | 20469298 | 267 | 0.000090 | com.bbc
92 | 20466440 | 194 | 0.000123 | org.ietf
93 | 20438542 | 310 | 0.000079 | com.time
94 | 20436352 | 298 | 0.000081 | uk.co.telegraph
95 | 20417976 | 369 | 0.000067 | com.ted
96 | 20415662 | 372 | 0.000067 | gov.nasa
97 | 20412316 | 368 | 0.000067 | com.githubusercontent
98 | 20409888 | 185 | 0.000136 | com.npmjs
99 | 20401604 | 394 | 0.000063 | com.quora
100 | 20396146 | 601 | 0.000042 | com.thenextweb
101 | 20395674 | 161 | 0.000156 | com.giphy
102 | 20388778 | 726 | 0.000037 | com.wikia
103 | 20380728 | 343 | 0.000072 | uk.co.dailymail
104 | 20379652 | 294 | 0.000082 | com.usatoday
105 | 20378396 | 371 | 0.000067 | com.latimes
106 | 20370212 | 713 | 0.000037 | org.chromium
107 | 20369700 | 306 | 0.000079 | org.un
108 | 20368148 | 144 | 0.000174 | com.wixsite
109 | 20366610 | 493 | 0.000050 | com.economist
110 | 20361312 | 26 | 0.001226 | com.qq
111 | 20343426 | 268 | 0.000090 | com.appspot
112 | 20339266 | 480 | 0.000052 | com.pixabay
113 | 20337398 | 491 | 0.000050 | com.zdnet
114 | 20328308 | 315 | 0.000079 | com.example
115 | 20325422 | 358 | 0.000070 | com.livejournal
116 | 20322334 | 380 | 0.000066 | com.mashable
117 | 20308200 | 302 | 0.000080 | com.cnbc
118 | 20308066 | 253 | 0.000093 | org.ampproject
119 | 20306984 | 442 | 0.000056 | com.nationalgeographic
120 | 20293426 | 505 | 0.000049 | com.venturebeat
121 | 20292380 | 404 | 0.000062 | com.dailymotion
122 | 20285502 | 139 | 0.000178 | com.twimg
123 | 20284164 | 476 | 0.000052 | org.bitbucket
124 | 20282368 | 547 | 0.000046 | com.pexels
125 | 20280714 | 327 | 0.000075 | com.springer
126 | 20279992 | 218 | 0.000108 | com.huffingtonpost
127 | 20279190 | 94 | 0.000355 | com.whatsapp
128 | 20277928 | 459 | 0.000054 | com.cisco
129 | 20268416 | 146 | 0.000170 | com.blogger
130 | 20267684 | 123 | 0.000234 | com.ytimg
131 | 20264730 | 413 | 0.000061 | com.fortune
132 | 20263014 | 641 | 0.000040 | uk.ac.ox
133 | 20262258 | 231 | 0.000100 | com.getbootstrap
134 | 20261648 | 847 | 0.000035 | org.cambridge
135 | 20261268 | 629 | 0.000040 | org.weforum
136 | 20250854 | 197 | 0.000123 | com.typepad
137 | 20250698 | 279 | 0.000086 | com.sciencedirect
138 | 20250162 | 512 | 0.000048 | com.about
139 | 20247192 | 286 | 0.000084 | com.wired
140 | 20240130 | 317 | 0.000078 | com.skype
141 | 20235202 | 558 | 0.000045 | org.worldbank
142 | 20230192 | 134 | 0.000186 | com.issuu
143 | 20225004 | 504 | 0.000049 | com.mysql
144 | 20220996 | 650 | 0.000039 | org.sciencemag
145 | 20220972 | 531 | 0.000047 | org.arxiv
146 | 20218296 | 624 | 0.000041 | uk.co.guardian
147 | 20216194 | 407 | 0.000062 | com.nature
148 | 20214012 | 127 | 0.000226 | com.unpkg
149 | 20213638 | 143 | 0.000175 | com.spotify
150 | 20195500 | 824 | 0.000036 | com.playstation
151 | 20195352 | 177 | 0.000139 | uk.co.google
152 | 20195260 | 440 | 0.000057 | gov.noaa
153 | 20193574 | 323 | 0.000077 | com.staticflickr
154 | 20193512 | 366 | 0.000068 | com.gmail
155 | 20191934 | 1037 | 0.000028 | org.eclipse
156 | 20191832 | 395 | 0.000063 | net.researchgate
157 | 20185934 | 342 | 0.000072 | com.fc2
158 | 20179194 | 603 | 0.000042 | org.ieee
159 | 20177140 | 132 | 0.000201 | com.zendesk
160 | 20177108 | 383 | 0.000065 | com.theatlantic
161 | 20173850 | 590 | 0.000043 | com.git-scm
162 | 20173722 | 182 | 0.000136 | me.t
163 | 20169446 | 282 | 0.000085 | com.googlecode
164 | 20167964 | 212 | 0.000113 | net.behance
165 | 20166960 | 364 | 0.000068 | com.w3schools
166 | 20165408 | 657 | 0.000039 | com.stackexchange
167 | 20147566 | 128 | 0.000222 | com.youtube-nocookie
168 | 20144266 | 430 | 0.000058 | com.buzzfeed
169 | 20143168 | 573 | 0.000043 | br.com.uol
170 | 20141222 | 828 | 0.000036 | ca.blogspot
171 | 20138528 | 592 | 0.000042 | com.evernote
172 | 20137536 | 854 | 0.000034 | com.scientificamerican
173 | 20123000 | 227 | 0.000102 | com.dribbble
174 | 20122966 | 495 | 0.000049 | com.vice
175 | 20119812 | 180 | 0.000137 | com.feedburner
176 | 20118786 | 574 | 0.000043 | net.azurewebsites
177 | 20113370 | 536 | 0.000046 | com.alexa
178 | 20110780 | 418 | 0.000059 | com.outlook
179 | 20103382 | 424 | 0.000059 | com.gitlab
180 | 20092588 | 422 | 0.000059 | me.about
181 | 20092232 | 409 | 0.000061 | com.goodreads
182 | 20091842 | 1102 | 0.000026 | com.nvidia
183 | 20082450 | 419 | 0.000059 | com.mozilla
184 | 20078524 | 447 | 0.000056 | com.entrepreneur
185 | 20073740 | 236 | 0.000099 | com.ft
186 | 20071534 | 452 | 0.000055 | com.wikihow
187 | 20066124 | 245 | 0.000096 | com.disqus
188 | 20064942 | 1092 | 0.000026 | com.jetbrains
189 | 20063756 | 1327 | 0.000023 | org.phys
190 | 20062066 | 602 | 0.000042 | org.greenpeace
191 | 20061474 | 386 | 0.000065 | org.hbr
192 | 20059468 | 178 | 0.000139 | com.salesforce
193 | 20058532 | 537 | 0.000046 | com.adage
194 | 20056012 | 300 | 0.000080 | org.doi
195 | 20055914 | 1106 | 0.000026 | org.ap
196 | 20054068 | 860 | 0.000034 | com.500px
197 | 20051824 | 488 | 0.000051 | gov.loc
198 | 20051342 | 957 | 0.000030 | com.sap
199 | 20050500 | 626 | 0.000041 | com.marketwatch
200 | 20049824 | 1265 | 0.000024 | com.siemens
201 | 20049584 | 1173 | 0.000025 | ca.utoronto
202 | 20049300 | 428 | 0.000058 | uk.co.independent
203 | 20048034 | 222 | 0.000104 | com.hubspot
204 | 20045788 | 593 | 0.000042 | com.slate
205 | 20042018 | 349 | 0.000071 | gg.discord
206 | 20024956 | 1435 | 0.000021 | com.hackernoon
207 | 20022096 | 487 | 0.000051 | uk.co.blogspot
208 | 20012130 | 1451 | 0.000021 | org.tensorflow
209 | 20007682 | 401 | 0.000062 | com.indiatimes
210 | 20007486 | 1035 | 0.000028 | org.kernel
211 | 20001698 | 530 | 0.000047 | com.trello
212 | 19999034 | 666 | 0.000038 | com.searchengineland
213 | 19997084 | 1009 | 0.000029 | com.unity3d
214 | 19996940 | 473 | 0.000052 | com.computerworld
215 | 19996232 | 549 | 0.000045 | com.withgoogle
216 | 19993078 | 1369 | 0.000022 | edu.osu
217 | 19991880 | 949 | 0.000030 | edu.si
218 | 19990236 | 612 | 0.000041 | au.net.abc
219 | 19988088 | 1428 | 0.000021 | com.lego
220 | 19987532 | 287 | 0.000084 | com.nbcnews
221 | 19977482 | 1356 | 0.000022 | com.angelfire
222 | 19976080 | 499 | 0.000049 | com.moz
223 | 19975358 | 199 | 0.000122 | net.sourceforge
224 | 19969236 | 667 | 0.000038 | co.ibb
225 | 19968114 | 1618 | 0.000019 | org.edx
226 | 19967072 | 515 | 0.000048 | com.box
227 | 19961458 | 986 | 0.000029 | com.huffpost
228 | 19961370 | 598 | 0.000042 | gov.state
229 | 19956418 | 1563 | 0.000019 | blog.home
230 | 19955608 | 1678 | 0.000018 | com.oregonlive
231 | 19954284 | 631 | 0.000040 | com.pinimg
232 | 19953180 | 863 | 0.000034 | gov.usgs
233 | 19949892 | 2048 | 0.000016 | com.sputniknews
234 | 19948950 | 1047 | 0.000027 | co.elastic
235 | 19947460 | 1196 | 0.000025 | edu.rutgers
236 | 19946614 | 211 | 0.000115 | com.optimizely
237 | 19945418 | 1409 | 0.000021 | org.maven
238 | 19942668 | 1373 | 0.000022 | net.seesaa
239 | 19939512 | 237 | 0.000099 | com.aliyuncs
240 | 19939300 | 291 | 0.000083 | com.tinyurl
241 | 19939182 | 188 | 0.000134 | com.eepurl
242 | 19938152 | 224 | 0.000103 | com.wpengine
243 | 19936538 | 2235 | 0.000014 | com.slides
244 | 19935586 | 659 | 0.000039 | com.sciencedaily
245 | 19933262 | 136 | 0.000183 | com.addtoany
246 | 19933088 | 946 | 0.000031 | com.storify
247 | 19932194 | 142 | 0.000175 | com.yimg
248 | 19927032 | 354 | 0.000070 | com.getpocket
249 | 19925642 | 715 | 0.000037 | com.vox
250 | 19922530 | 60 | 0.000546 | com.vk
251 | 19920994 | 171 | 0.000142 | org.allaboutcookies
252 | 19919990 | 1165 | 0.000025 | com.vogue
253 | 19918364 | 335 | 0.000074 | com.wufoo
254 | 19914676 | 1282 | 0.000023 | ms.1drv
255 | 19906484 | 1481 | 0.000020 | io.itch
256 | 19906312 | 834 | 0.000035 | com.techtarget
257 | 19905162 | 600 | 0.000042 | org.change
258 | 19901530 | 597 | 0.000042 | com.uk
259 | 19901258 | 421 | 0.000059 | com.squareup
260 | 19897576 | 1408 | 0.000021 | com.itv
261 | 19896802 | 954 | 0.000030 | com.thehill
262 | 19896772 | 1291 | 0.000023 | com.scmp
263 | 19894514 | 1777 | 0.000017 | com.diigo
264 | 19893192 | 316 | 0.000079 | es.google
265 | 19890244 | 651 | 0.000039 | com.lifehacker
266 | 19888786 | 671 | 0.000038 | gov.fcc
267 | 19886980 | 739 | 0.000037 | com.chicagotribune
268 | 19886180 | 2309 | 0.000014 | com.pearltrees
269 | 19885516 | 1554 | 0.000019 | org.unep
270 | 19881960 | 313 | 0.000079 | net.windows
271 | 19881842 | 248 | 0.000094 | ru.rambler
272 | 19880642 | 506 | 0.000049 | us.icio
273 | 19877580 | 92 | 0.000358 | com.weibo
274 | 19876556 | 109 | 0.000290 | com.paypalobjects
275 | 19874826 | 891 | 0.000033 | com.strikingly
276 | 19873598 | 1178 | 0.000025 | com.netlify
277 | 19867654 | 456 | 0.000055 | gov.epa
278 | 19866350 | 292 | 0.000083 | com.criteo
279 | 19864080 | 714 | 0.000037 | org.pewresearch
280 | 19861136 | 533 | 0.000047 | org.plos
281 | 19860954 | 1225 | 0.000024 | com.newscientist
282 | 19860836 | 849 | 0.000035 | uk.co.mirror
283 | 19860700 | 1010 | 0.000029 | com.mediafire
284 | 19860298 | 1072 | 0.000027 | com.sky
285 | 19859946 | 928 | 0.000031 | com.buffer
286 | 19858910 | 1228 | 0.000024 | com.aljazeera
287 | 19858168 | 1339 | 0.000022 | it.scoop
288 | 19858040 | 209 | 0.000116 | org.iana
289 | 19857260 | 2070 | 0.000016 | com.coca-colacompany
290 | 19856912 | 683 | 0.000038 | com.flipboard
291 | 19853900 | 1801 | 0.000017 | jp.ac.u-tokyo
292 | 19853116 | 1018 | 0.000028 | uk.co.metro
293 | 19851054 | 309 | 0.000079 | com.ibm
294 | 19846968 | 322 | 0.000077 | com.go
295 | 19846838 | 1552 | 0.000019 | uk.bl
296 | 19841556 | 1264 | 0.000024 | com.nikkei
297 | 19840090 | 52 | 0.000667 | com.fb
298 | 19839844 | 2506 | 0.000013 | it.unimi
299 | 19836858 | 1595 | 0.000019 | com.googlesource
300 | 19834504 | 474 | 0.000052 | com.udacity
301 | 19834024 | 835 | 0.000035 | uk.co.thetimes
302 | 19832262 | 168 | 0.000144 | com.imdb
303 | 19831660 | 843 | 0.000035 | gov.congress
304 | 19828142 | 668 | 0.000038 | org.fao
305 | 19826656 | 1191 | 0.000025 | org.acs
306 | 19825238 | 1728 | 0.000018 | com.toptal
307 | 19824736 | 1065 | 0.000027 | edu.duke
308 | 19823982 | 621 | 0.000041 | site.business
309 | 19820920 | 1133 | 0.000026 | com.trendmicro
310 | 19817822 | 955 | 0.000030 | com.theconversation
311 | 19814258 | 983 | 0.000029 | co.g
312 | 19813034 | 851 | 0.000034 | com.bmj
313 | 19812202 | 170 | 0.000143 | com.amazon-adsystem
314 | 19808398 | 1045 | 0.000027 | com.searchenginewatch
315 | 19806128 | 1376 | 0.000022 | edu.gatech
316 | 19803474 | 2207 | 0.000015 | com.viki
317 | 19803388 | 1135 | 0.000026 | edu.brookings
318 | 19803178 | 971 | 0.000030 | com.reverbnation
319 | 19798960 | 1069 | 0.000027 | au.com.smh
320 | 19797938 | 44 | 0.000797 | com.googleadservices
321 | 19796164 | 475 | 0.000052 | org.freecodecamp
322 | 19792806 | 658 | 0.000039 | br.com.google
323 | 19791896 | 1766 | 0.000017 | jp.co.japantimes
324 | 19791234 | 400 | 0.000063 | me.telegram
325 | 19790208 | 1332 | 0.000022 | com.msnbc
326 | 19789672 | 1915 | 0.000016 | org.wikibooks
327 | 19789356 | 1296 | 0.000023 | com.dw
328 | 19787622 | 1366 | 0.000022 | com.hostgator
329 | 19784154 | 477 | 0.000052 | com.theverge
330 | 19780916 | 1574 | 0.000019 | com.bankofamerica
331 | 19776986 | 994 | 0.000029 | com.yoast
332 | 19775742 | 997 | 0.000029 | com.socialmediaexaminer
333 | 19774146 | 841 | 0.000035 | org.apa
334 | 19772748 | 426 | 0.000058 | com.elsevier
335 | 19771404 | 458 | 0.000055 | com.bigcartel
336 | 19770190 | 2240 | 0.000014 | com.kinja
337 | 19770024 | 1701 | 0.000018 | com.mediaplex
338 | 19769080 | 1058 | 0.000027 | uk.co.huffingtonpost
339 | 19766820 | 1682 | 0.000018 | org.bitcoin
340 | 19765668 | 1430 | 0.000021 | com.grammarly
341 | 19765220 | 2071 | 0.000016 | com.mathworks
342 | 19764662 | 1253 | 0.000024 | com.livescience
343 | 19764202 | 249 | 0.000094 | com.live
344 | 19763516 | 2265 | 0.000014 | org.biorxiv
345 | 19762024 | 1794 | 0.000017 | com.makeuseof
346 | 19760700 | 942 | 0.000031 | com.econsultancy
347 | 19759296 | 518 | 0.000047 | com.bigcommerce
348 | 19759088 | 953 | 0.000030 | com.searchenginejournal
349 | 19757028 | 62 | 0.000537 | net.akamaihd
350 | 19755866 | 1764 | 0.000017 | com.colourlovers
351 | 19751232 | 314 | 0.000079 | com.rackcdn
352 | 19749162 | 1834 | 0.000017 | com.sas
353 | 19746962 | 223 | 0.000104 | org.gnu
354 | 19742390 | 2490 | 0.000013 | com.itsnicethat
355 | 19741694 | 2296 | 0.000014 | uk.ac.sussex
356 | 19739212 | 820 | 0.000036 | com.neilpatel
357 | 19738554 | 162 | 0.000156 | com.opera
358 | 19738540 | 951 | 0.000030 | com.gumroad
359 | 19733434 | 868 | 0.000034 | com.business2community
360 | 19731138 | 909 | 0.000032 | uk.co.pinterest
361 | 19730570 | 617 | 0.000041 | uk.parliament
362 | 19729560 | 898 | 0.000032 | com.ecwid
363 | 19729024 | 526 | 0.000047 | me.m
364 | 19728302 | 1186 | 0.000025 | com.thelancet
365 | 19727476 | 1677 | 0.000018 | uk.co.timesonline
366 | 19725568 | 1662 | 0.000018 | edu.iastate
367 | 19720948 | 890 | 0.000033 | com.thedrum
368 | 19718200 | 1234 | 0.000024 | com.seattletimes
369 | 19716772 | 116 | 0.000258 | com.jimdo
370 | 19715158 | 1748 | 0.000018 | org.rsc
371 | 19713318 | 318 | 0.000078 | me.wa
372 | 19713052 | 2312 | 0.000014 | io.soup
373 | 19712174 | 240 | 0.000098 | net.php
374 | 19710524 | 996 | 0.000029 | com.healthline
375 | 19706662 | 103 | 0.000317 | net.facebook
376 | 19700662 | 389 | 0.000064 | com.meetup
377 | 19698168 | 1397 | 0.000021 | int.unfccc
378 | 19697842 | 2364 | 0.000014 | com.autoblog
379 | 19697184 | 1147 | 0.000026 | uk.co.ebay
380 | 19696284 | 1512 | 0.000020 | com.channel4
381 | 19696102 | 345 | 0.000072 | int.who
382 | 19695842 | 856 | 0.000034 | com.photoshelter
383 | 19693426 | 297 | 0.000081 | org.python
384 | 19693168 | 2103 | 0.000016 | edu.miami
385 | 19693108 | 2445 | 0.000013 | com.mysanantonio
386 | 19693052 | 1314 | 0.000023 | com.bustle
387 | 19693004 | 2416 | 0.000013 | com.smore
388 | 19690872 | 1244 | 0.000024 | uk.co.express
389 | 19689496 | 1882 | 0.000016 | com.smashwords
390 | 19689346 | 1454 | 0.000021 | com.gawker
391 | 19689266 | 1492 | 0.000020 | org.hrc
392 | 19688570 | 1378 | 0.000022 | uk.gov.blog
393 | 19688210 | 266 | 0.000090 | com.rawgit
394 | 19685386 | 251 | 0.000094 | uk.org.ico
395 | 19684372 | 2229 | 0.000015 | org.vim
396 | 19683694 | 2148 | 0.000015 | uk.ac.york
397 | 19683048 | 1902 | 0.000016 | com.discovermagazine
398 | 19682466 | 2017 | 0.000016 | com.dummies
399 | 19682270 | 2811 | 0.000012 | com.iht
400 | 19678702 | 1498 | 0.000020 | fr.lesechos
401 | 19677190 | 1643 | 0.000019 | org.amnesty
402 | 19677184 | 1087 | 0.000026 | org.aarp
403 | 19675912 | 836 | 0.000035 | uk.gov.legislation
404 | 19675666 | 1582 | 0.000019 | com.pbworks
405 | 19675228 | 1197 | 0.000025 | com.cio
406 | 19675036 | 1541 | 0.000020 | com.googlegroups
407 | 19673696 | 888 | 0.000033 | uk.gov.nationalarchives
408 | 19671786 | 489 | 0.000051 | com.nwsource
409 | 19669190 | 1344 | 0.000022 | com.thestar
410 | 19668328 | 1993 | 0.000016 | com.treehugger
411 | 19668276 | 1602 | 0.000019 | com.brainyquote
412 | 19667868 | 513 | 0.000048 | com.livechatinc
413 | 19667262 | 1195 | 0.000025 | org.heart
414 | 19666046 | 259 | 0.000091 | com.unsplash
415 | 19665938 | 1475 | 0.000020 | ie.independent
416 | 19665662 | 2444 | 0.000013 | org.sciencenews
417 | 19664008 | 1478 | 0.000020 | fi.google
418 | 19662896 | 1201 | 0.000025 | uk.co.standard
419 | 19662404 | 163 | 0.000156 | com.eventbrite
420 | 19661850 | 1997 | 0.000016 | com.timesofisrael
421 | 19661340 | 1304 | 0.000023 | com.surveygizmo
422 | 19659778 | 1245 | 0.000024 | org.ohchr
423 | 19656716 | 1989 | 0.000016 | com.nationalreview
424 | 19654260 | 2022 | 0.000016 | com.gucci
425 | 19653254 | 605 | 0.000041 | org.mediawiki
426 | 19651234 | 972 | 0.000029 | com.wordstream
427 | 19651102 | 1584 | 0.000019 | com.netvibes
428 | 19649566 | 1976 | 0.000016 | org.bitcointalk
429 | 19648228 | 2372 | 0.000014 | com.deepmind
430 | 19648124 | 1773 | 0.000017 | org.iucn
431 | 19647904 | 1496 | 0.000020 | com.startribune
432 | 19646024 | 293 | 0.000082 | com.ebay
433 | 19639388 | 1355 | 0.000022 | com.convinceandconvert
434 | 19637100 | 522 | 0.000047 | edu.yale
435 | 19636614 | 384 | 0.000065 | com.kickstarter
436 | 19635776 | 100 | 0.000321 | com.godaddy
437 | 19634912 | 2157 | 0.000015 | com.instapaper
438 | 19633830 | 1767 | 0.000017 | uk.co.ibtimes
439 | 19631378 | 1261 | 0.000024 | com.imageshack
440 | 19630146 | 110 | 0.000284 | com.mailchimp
441 | 19627008 | 2887 | 0.000011 | net.openreview
442 | 19626924 | 481 | 0.000052 | gov.whitehouse
443 | 19626884 | 1301 | 0.000023 | ch.ipcc
444 | 19625858 | 959 | 0.000030 | com.bandsintown
445 | 19625598 | 388 | 0.000064 | com.office
446 | 19624032 | 2039 | 0.000016 | edu.udel
447 | 19623636 | 1818 | 0.000017 | uk.ac.kcl
448 | 19619866 | 988 | 0.000029 | org.ilo
449 | 19618636 | 1880 | 0.000016 | tl.we
450 | 19618128 | 2092 | 0.000016 | io.gitlab
451 | 19616698 | 1975 | 0.000016 | com.digitaljournal
452 | 19615278 | 84 | 0.000440 | com.list-manage
453 | 19614194 | 15 | 0.002224 | com.wixstatic
454 | 19611928 | 1791 | 0.000017 | com.secondlife
455 | 19604998 | 1171 | 0.000025 | uk.gov.tfl
456 | 19603646 | 1994 | 0.000016 | org.peta
457 | 19602880 | 1252 | 0.000024 | com.medicalnewstoday
458 | 19601844 | 1744 | 0.000018 | com.teenvogue
459 | 19601126 | 45 | 0.000773 | net.fbcdn
460 | 19600768 | 1813 | 0.000017 | com.upi
461 | 19600410 | 205 | 0.000117 | com.etsy
462 | 19598800 | 1577 | 0.000019 | no.google
463 | 19597706 | 2097 | 0.000016 | com.shell
464 | 19596732 | 1535 | 0.000020 | com.quicksprout
465 | 19596622 | 406 | 0.000062 | com.fastcompany
466 | 19596226 | 1324 | 0.000023 | org.hrw
467 | 19596164 | 559 | 0.000045 | edu.berkeley
468 | 19595736 | 826 | 0.000036 | com.intel
469 | 19593408 | 1911 | 0.000016 | com.tomsguide
470 | 19592762 | 1655 | 0.000018 | ca.pinterest
471 | 19591462 | 365 | 0.000068 | com.hp
472 | 19590312 | 649 | 0.000039 | org.nodejs
473 | 19589296 | 2135 | 0.000015 | com.politifact
474 | 19588516 | 2400 | 0.000013 | com.towardsdatascience
475 | 19588356 | 2292 | 0.000014 | com.dailykos
476 | 19588058 | 1749 | 0.000018 | com.oprah
477 | 19585238 | 3039 | 0.000011 | org.arkive
478 | 19584732 | 859 | 0.000034 | com.engadget
479 | 19584238 | 1740 | 0.000018 | com.shareholder
480 | 19584228 | 967 | 0.000030 | ly.snip
481 | 19577646 | 1359 | 0.000022 | com.smallbiztrends
482 | 19577604 | 2384 | 0.000014 | com.hsbc
483 | 19577414 | 104 | 0.000312 | com.statcounter
484 | 19577334 | 566 | 0.000044 | com.photobucket
485 | 19576468 | 2161 | 0.000015 | org.jenkins-ci
486 | 19574024 | 1017 | 0.000028 | com.contentmarketinginstitute
487 | 19569238 | 2447 | 0.000013 | uk.co.spectator
488 | 19567958 | 1966 | 0.000016 | com.thecut
489 | 19567398 | 2655 | 0.000012 | uk.ac.mmu
490 | 19563030 | 1458 | 0.000021 | net.convio
491 | 19562626 | 1897 | 0.000016 | org.project-syndicate
492 | 19562602 | 857 | 0.000034 | com.deviantart
493 | 19562312 | 1658 | 0.000018 | google.ai
494 | 19560912 | 1921 | 0.000016 | com.ogilvy
495 | 19560528 | 1775 | 0.000017 | com.csoonline
496 | 19559434 | 990 | 0.000029 | com.cognitoforms
497 | 19558398 | 2029 | 0.000016 | link.page
498 | 19557452 | 2224 | 0.000015 | com.upworthy
499 | 19555356 | 1670 | 0.000018 | com.kinsta
500 | 19551574 | 393 | 0.000063 | com.getclicky
501 | 19548794 | 1907 | 0.000016 | ms.nyti
502 | 19548294 | 1951 | 0.000016 | uk.ac.leeds
503 | 19546822 | 1247 | 0.000024 | st.po
504 | 19546690 | 359 | 0.000069 | com.mapbox
505 | 19545958 | 2341 | 0.000014 | com.sciencealert
506 | 19545120 | 2387 | 0.000013 | com.instructure
507 | 19543894 | 2343 | 0.000014 | org.theiet
508 | 19543292 | 2620 | 0.000012 | com.ksl
509 | 19540054 | 2168 | 0.000015 | com.webbyawards
510 | 19537886 | 2852 | 0.000011 | com.brandyourself
511 | 19535564 | 2735 | 0.000012 | jp.hatenablog
512 | 19534552 | 2741 | 0.000012 | com.zynga
513 | 19533780 | 382 | 0.000066 | org.acm
514 | 19532322 | 1841 | 0.000017 | com.cmswire
515 | 19531950 | 431 | 0.000058 | io.codepen
516 | 19531032 | 1343 | 0.000022 | org.pocoo
517 | 19530112 | 2931 | 0.000011 | uk.co.autocar
518 | 19529900 | 160 | 0.000158 | com.tripadvisor
519 | 19529372 | 234 | 0.000099 | org.drupal
520 | 19528028 | 991 | 0.000029 | com.gizmodo
521 | 19525144 | 2317 | 0.000014 | org.aei
522 | 19524148 | 864 | 0.000034 | com.matterport
523 | 19523142 | 1871 | 0.000017 | uk.co.thesundaytimes
524 | 19521230 | 1041 | 0.000027 | com.tinypic
525 | 19520944 | 813 | 0.000036 | com.netflix
526 | 19520420 | 2439 | 0.000013 | com.newatlas
527 | 19518764 | 2410 | 0.000013 | com.triplepundit
528 | 19518666 | 381 | 0.000066 | com.booking
529 | 19518320 | 2978 | 0.000011 | fr.hellocoton
530 | 19517366 | 2201 | 0.000015 | org.unfpa
531 | 19516300 | 1603 | 0.000019 | pt.google
532 | 19514002 | 1715 | 0.000018 | net.openid
533 | 19511332 | 3080 | 0.000011 | com.blogsky
534 | 19511240 | 1763 | 0.000017 | com.bloglines
535 | 19508014 | 272 | 0.000089 | com.adnxs
536 | 19507232 | 2106 | 0.000015 | org.royalsociety
537 | 19506586 | 2659 | 0.000012 | com.asiaone
538 | 19504284 | 2308 | 0.000014 | com.waterstones
539 | 19503858 | 2342 | 0.000014 | com.financialexpress
540 | 19503212 | 1639 | 0.000019 | uk.org.nationaltrust
541 | 19502772 | 1646 | 0.000019 | org.pypi
542 | 19501202 | 899 | 0.000032 | com.highcharts
543 | 19500790 | 1889 | 0.000016 | org.panda
544 | 19500702 | 2898 | 0.000011 | org.ifaw
545 | 19500700 | 1828 | 0.000017 | org.thinkprogress
546 | 19499724 | 901 | 0.000032 | com.arstechnica
547 | 19498236 | 2203 | 0.000015 | com.kaggle
548 | 19497656 | 1948 | 0.000016 | org.wri
549 | 19494804 | 2693 | 0.000012 | co.electrek
550 | 19493786 | 2306 | 0.000014 | uk.org.wwf
551 | 19493426 | 2436 | 0.000013 | com.mongabay
552 | 19493282 | 3319 | 0.000010 | com.carscoops
553 | 19492162 | 1082 | 0.000027 | com.mixpanel
554 | 19486550 | 1502 | 0.000020 | io.fabric
555 | 19486258 | 1269 | 0.000023 | com.firebaseapp
556 | 19485830 | 906 | 0.000032 | edu.psu
557 | 19484868 | 1848 | 0.000017 | com.infolinks
558 | 19484056 | 1647 | 0.000018 | com.coschedule
559 | 19481940 | 1672 | 0.000018 | us.pa.state
560 | 19480200 | 2209 | 0.000015 | uk.ac.nhm
561 | 19479650 | 1302 | 0.000023 | com.clicky
562 | 19477726 | 500 | 0.000049 | tv.twitch
563 | 19477544 | 532 | 0.000047 | edu.cornell
564 | 19477084 | 872 | 0.000033 | edu.washington
565 | 19476626 | 71 | 0.000478 | com.livestream
566 | 19475600 | 2307 | 0.000014 | com.autonews
567 | 19474520 | 2660 | 0.000012 | pt.publico
568 | 19474486 | 1929 | 0.000016 | org.americanprogress
569 | 19474190 | 2578 | 0.000012 | com.nordvpn
570 | 19473972 | 2206 | 0.000015 | org.sonatype
571 | 19471930 | 1457 | 0.000021 | com.activecampaign
572 | 19471612 | 625 | 0.000041 | com.samsung
573 | 19471306 | 2730 | 0.000012 | com.delawareonline
574 | 19470860 | 2848 | 0.000011 | com.topgear
575 | 19468240 | 999 | 0.000029 | edu.upenn
576 | 19465494 | 1760 | 0.000017 | uk.gov.metoffice
577 | 19464352 | 2733 | 0.000012 | com.sc
578 | 19464298 | 2573 | 0.000013 | br.inpe
579 | 19460386 | 1873 | 0.000017 | com.prweek
580 | 19460086 | 2589 | 0.000012 | com.ecowatch
581 | 19459484 | 72 | 0.000477 | net.jsfiddle
582 | 19458590 | 3293 | 0.000010 | com.algorithmia
583 | 19457214 | 2027 | 0.000016 | com.scotsman
584 | 19457126 | 429 | 0.000058 | com.slack
585 | 19455372 | 1887 | 0.000016 | com.impactbnd
586 | 19453748 | 1008 | 0.000029 | uk.ac.cam
587 | 19453316 | 2263 | 0.000014 | com.articulate
588 | 19453140 | 2780 | 0.000012 | com.nouw
589 | 19451266 | 2896 | 0.000011 | com.flock
590 | 19449038 | 2571 | 0.000013 | org.globalcitizen
591 | 19447006 | 538 | 0.000046 | com.proofpoint
592 | 19445998 | 2335 | 0.000014 | com.googledrive
593 | 19444262 | 2434 | 0.000013 | nz.co.radionz
594 | 19444224 | 2763 | 0.000012 | jp.riken
595 | 19443690 | 2388 | 0.000013 | de.greenpeace
596 | 19443190 | 119 | 0.000244 | com.youku
597 | 19442118 | 174 | 0.000141 | jp.co.yahoo
598 | 19441598 | 2831 | 0.000011 | com.mumsnet
599 | 19439924 | 1874 | 0.000017 | com.crashlytics
600 | 19439174 | 965 | 0.000030 | edu.umich
601 | 19439028 | 2114 | 0.000015 | uk.org.rspb
602 | 19438028 | 208 | 0.000116 | uk.co.amazon
603 | 19437448 | 101 | 0.000321 | de.google
604 | 19435790 | 2748 | 0.000012 | com.quickanddirtytips
605 | 19431834 | 2668 | 0.000012 | au.com.huffingtonpost
606 | 19431216 | 1896 | 0.000016 | uk.gov.london
607 | 19430698 | 2541 | 0.000013 | com.thejakartapost
608 | 19429486 | 3097 | 0.000011 | com.shanghaidaily
609 | 19428860 | 415 | 0.000061 | com.xinhuanet
610 | 19428614 | 3069 | 0.000011 | com.theminimalists
611 | 19428486 | 1271 | 0.000023 | com.sprinklr
612 | 19426496 | 1208 | 0.000025 | org.iea
613 | 19426466 | 2512 | 0.000013 | ie.thejournal
614 | 19426152 | 1785 | 0.000017 | com.jeffbullas
615 | 19424902 | 2979 | 0.000011 | com.art
616 | 19424640 | 2837 | 0.000011 | it.polito
617 | 19423008 | 1808 | 0.000017 | com.martechtoday
618 | 19422426 | 2599 | 0.000012 | uk.co.profilebusiness
619 | 19421492 | 2534 | 0.000013 | com.db
620 | 19420756 | 2851 | 0.000011 | org.onegreenplanet
621 | 19418396 | 2340 | 0.000014 | net.opendemocracy
622 | 19416952 | 1869 | 0.000017 | org.iucnredlist
623 | 19413908 | 2688 | 0.000012 | uk.org.savethechildren
624 | 19412614 | 2379 | 0.000014 | com.theyworkforyou
625 | 19411666 | 695 | 0.000037 | com.xiti
626 | 19409198 | 2661 | 0.000012 | org.oceanconservancy
627 | 19408718 | 2683 | 0.000012 | com.dreamgrow
628 | 19407976 | 2254 | 0.000014 | com.rabbitmq
629 | 19407372 | 2568 | 0.000013 | com.shoutmeloud
630 | 19407170 | 1028 | 0.000028 | com.mcafeesecure
631 | 19406866 | 449 | 0.000055 | fr.free
632 | 19403640 | 362 | 0.000069 | org.npr
633 | 19402072 | 1865 | 0.000017 | com.copyscape
634 | 19401308 | 2791 | 0.000012 | com.sitesell
635 | 19400880 | 312 | 0.000079 | gov.cdc
636 | 19399828 | 2423 | 0.000013 | com.cleantechnica
637 | 19399686 | 2809 | 0.000012 | pl.edu.uw
638 | 19397274 | 399 | 0.000063 | com.nypost
639 | 19396828 | 569 | 0.000044 | com.aol
640 | 19396446 | 3167 | 0.000010 | com.seeker
641 | 19396390 | 2760 | 0.000012 | uk.org.amnesty
642 | 19396212 | 265 | 0.000090 | com.sohu
643 | 19395962 | 1613 | 0.000019 | com.flashtalking
644 | 19395308 | 2516 | 0.000013 | com.generalmills
645 | 19393472 | 2049 | 0.000016 | com.cityam
646 | 19392474 | 3380 | 0.000010 | com.dremel
647 | 19392370 | 396 | 0.000063 | com.163
648 | 19391762 | 3020 | 0.000011 | com.brothersoft
649 | 19391670 | 2061 | 0.000016 | org.gnupg
650 | 19388022 | 36 | 0.001003 | com.createjs
651 | 19387660 | 1027 | 0.000028 | edu.ucla
652 | 19386630 | 511 | 0.000048 | com.dmca
653 | 19385442 | 1495 | 0.000020 | scot.gov
654 | 19383806 | 2347 | 0.000014 | org.grist
655 | 19383592 | 2474 | 0.000013 | uk.org.oxfam
656 | 19381766 | 2457 | 0.000013 | uk.co.thisismoney
657 | 19380480 | 3259 | 0.000010 | org.aqicn
658 | 19379848 | 2566 | 0.000013 | uk.org.rspca
659 | 19379190 | 1169 | 0.000025 | com.hollywoodreporter
660 | 19378746 | 2726 | 0.000012 | org.irena
661 | 19377826 | 2908 | 0.000011 | org.kuow
662 | 19375866 | 2934 | 0.000011 | eu.i-scoop
663 | 19375282 | 3137 | 0.000011 | com.winefolly
664 | 19374230 | 244 | 0.000096 | com.bandcamp
665 | 19373806 | 1350 | 0.000022 | net.leadpages
666 | 19371298 | 1855 | 0.000017 | net.noscript
667 | 19370726 | 1438 | 0.000021 | com.pastebin
668 | 19370120 | 2692 | 0.000012 | com.targetmarketingmag
669 | 19368524 | 3516 | 0.000010 | co.edureka
670 | 19368376 | 2773 | 0.000012 | com.ipsos-mori
671 | 19368284 | 2546 | 0.000013 | org.zsl
672 | 19368044 | 2393 | 0.000013 | com.moodys
673 | 19367896 | 1170 | 0.000025 | gov.fbi
674 | 19367686 | 2182 | 0.000015 | com.thermofisher
675 | 19366198 | 2800 | 0.000012 | uk.ac.ceh
676 | 19365484 | 273 | 0.000089 | com.surveymonkey
677 | 19364456 | 1703 | 0.000018 | uk.co.which
678 | 19363118 | 1431 | 0.000021 | uk.gov.defra
679 | 19362092 | 2626 | 0.000012 | com.wikidot
680 | 19361864 | 2112 | 0.000015 | com.problogger
681 | 19361432 | 2794 | 0.000012 | com.pnsegypt
682 | 19360486 | 3132 | 0.000011 | com.hatenadiary
683 | 19359572 | 169 | 0.000143 | com.taobao
684 | 19359506 | 333 | 0.000074 | com.pubmatic
685 | 19358770 | 377 | 0.000066 | com.scribd
686 | 19358748 | 2985 | 0.000011 | org.storyofstuff
687 | 19358106 | 3168 | 0.000010 | org.heartland
688 | 19356998 | 2902 | 0.000011 | com.nationalgrid
689 | 19355728 | 352 | 0.000070 | com.wiley
690 | 19355014 | 886 | 0.000033 | com.windowsphone
691 | 19351528 | 2511 | 0.000013 | uk.gov.forestry
692 | 19349818 | 2746 | 0.000012 | org.spie
693 | 19349596 | 816 | 0.000036 | com.mobirise
694 | 19346822 | 2963 | 0.000011 | uk.ac.mdx
695 | 19345936 | 463 | 0.000054 | com.oreilly
696 | 19345228 | 2298 | 0.000014 | com.iconarchive
697 | 19344974 | 3213 | 0.000010 | edu.uah
698 | 19344130 | 893 | 0.000032 | edu.columbia
699 | 19343846 | 2196 | 0.000015 | uk.gov.food
700 | 19342492 | 2770 | 0.000012 | edu.dukeupress
701 | 19341928 | 2518 | 0.000013 | com.wral
702 | 19337306 | 1239 | 0.000024 | google.blog
703 | 19337180 | 453 | 0.000055 | com.sxsw
704 | 19337108 | 686 | 0.000038 | com.steampowered
705 | 19332972 | 2891 | 0.000011 | com.almanac
706 | 19332496 | 915 | 0.000031 | com.docker
707 | 19332138 | 433 | 0.000057 | com.force
708 | 19330890 | 913 | 0.000032 | org.reactjs
709 | 19330434 | 3158 | 0.000011 | com.dbs
710 | 19330012 | 3320 | 0.000010 | uk.org.bornfree
711 | 19329944 | 1283 | 0.000023 | uk.org.greenpeace
712 | 19328328 | 1100 | 0.000026 | com.redhat
713 | 19328004 | 1248 | 0.000024 | com.elpais
714 | 19327924 | 785 | 0.000036 | com.webs
715 | 19324934 | 3401 | 0.000010 | org.sciencenewsforstudents
716 | 19324548 | 3476 | 0.000010 | org.sharktrust
717 | 19323678 | 3447 | 0.000010 | uk.org.caat
718 | 19322218 | 305 | 0.000080 | com.digg
719 | 19320384 | 325 | 0.000076 | com.typeform
720 | 19320196 | 2756 | 0.000012 | com.batchgeo
721 | 19319558 | 2116 | 0.000015 | com.fifa
722 | 19317480 | 2389 | 0.000013 | org.chathamhouse
723 | 19317116 | 1322 | 0.000023 | org.whatbrowser
724 | 19317094 | 2098 | 0.000016 | org.fsc
725 | 19316024 | 1706 | 0.000018 | com.nike
726 | 19315926 | 2357 | 0.000014 | uk.co.inews
727 | 19315824 | 1362 | 0.000022 | edu.ucsd
728 | 19315458 | 3400 | 0.000010 | com.artstation
729 | 19315386 | 855 | 0.000034 | org.unesco
730 | 19315260 | 2654 | 0.000012 | com.ingress
731 | 19313414 | 1561 | 0.000019 | com.technologyreview
732 | 19312758 | 2375 | 0.000014 | io.pantheon
733 | 19311846 | 2952 | 0.000011 | com.climatechangenews
734 | 19311082 | 2981 | 0.000011 | org.c2es
735 | 19309714 | 1771 | 0.000017 | com.ikea
736 | 19309506 | 3010 | 0.000011 | com.foodsafetynews
737 | 19306598 | 2574 | 0.000012 | uk.org.38degrees
738 | 19305744 | 2676 | 0.000012 | com.thecvf
739 | 19305478 | 2588 | 0.000012 | org.carbonbrief
740 | 19305458 | 2990 | 0.000011 | org.sourcewatch
741 | 19304968 | 571 | 0.000043 | com.cbsnews
742 | 19304594 | 2986 | 0.000011 | com.moneysupermarket
743 | 19304168 | 469 | 0.000053 | com.statista
744 | 19304094 | 3414 | 0.000010 | me.start
745 | 19301508 | 2844 | 0.000011 | com.tiddlywiki
746 | 19299692 | 2645 | 0.000012 | com.bnef
747 | 19298620 | 3095 | 0.000011 | uk.co.bristolpost
748 | 19297446 | 198 | 0.000122 | io.polyfill
749 | 19297002 | 3059 | 0.000011 | jp.ac.kobe-u
750 | 19296802 | 122 | 0.000238 | org.networkadvertising
751 | 19296318 | 502 | 0.000049 | com.atlassian
752 | 19294076 | 338 | 0.000073 | com.prnewswire
753 | 19291522 | 1128 | 0.000026 | com.canva
754 | 19288978 | 3012 | 0.000011 | org.twinery
755 | 19288828 | 2737 | 0.000012 | com.adcolony
756 | 19288458 | 3117 | 0.000011 | no.forskning
757 | 19286246 | 2785 | 0.000012 | com.doctoroz
758 | 19284850 | 3556 | 0.000010 | com.cmgdigital
759 | 19284678 | 3143 | 0.000011 | com.sunherald
760 | 19284062 | 3172 | 0.000010 | com.ibmbigdatahub
761 | 19283992 | 3517 | 0.000010 | com.2createawebsite
762 | 19283716 | 2996 | 0.000011 | net.organicfacts
763 | 19282858 | 2243 | 0.000014 | com.privacypolicies
764 | 19282122 | 2905 | 0.000011 | com.winemag
765 | 19281746 | 1056 | 0.000027 | com.ubuntu
766 | 19281512 | 1419 | 0.000021 | uk.co.thesun
767 | 19281086 | 470 | 0.000053 | com.inc
768 | 19281010 | 2143 | 0.000015 | org.cites
769 | 19280990 | 2290 | 0.000014 | uk.gov.dft
770 | 19279280 | 3146 | 0.000011 | com.insideevs
771 | 19279174 | 2734 | 0.000012 | de.ksta
772 | 19278422 | 2684 | 0.000012 | com.e-activist
773 | 19278376 | 1412 | 0.000021 | com.speakerdeck
774 | 19276894 | 2747 | 0.000012 | com.chubb
775 | 19273916 | 2608 | 0.000012 | org.rspo
776 | 19273894 | 964 | 0.000030 | net.2mdn
777 | 19273142 | 3265 | 0.000010 | com.jordantimes
778 | 19272034 | 319 | 0.000078 | gov.ca
779 | 19268910 | 3506 | 0.000010 | com.idt
780 | 19268426 | 2757 | 0.000012 | com.theinnovationenterprise
781 | 19267542 | 2349 | 0.000014 | uk.gov.environment-agency
782 | 19267478 | 3496 | 0.000010 | com.sutori
783 | 19266406 | 151 | 0.000163 | ru.mail
784 | 19266224 | 164 | 0.000152 | com.yelp
785 | 19265510 | 3184 | 0.000010 | com.galvanize
786 | 19264800 | 3425 | 0.000010 | com.thewritepractice
787 | 19264778 | 3212 | 0.000010 | org.carbontracker
788 | 19264570 | 3464 | 0.000010 | org.earthworksaction
789 | 19263548 | 1713 | 0.000018 | com.martechseries
790 | 19262638 | 981 | 0.000029 | com.visualstudio
791 | 19262168 | 3383 | 0.000010 | com.nutraingredients
792 | 19261694 | 3222 | 0.000010 | com.quandl
793 | 19261452 | 1484 | 0.000020 | uk.co.foe
794 | 19260924 | 232 | 0.000100 | to.amzn
795 | 19260174 | 1731 | 0.000018 | org.khanacademy
796 | 19260130 | 2699 | 0.000012 | com.businessgreen
797 | 19259920 | 524 | 0.000047 | com.airbnb
798 | 19259634 | 3200 | 0.000010 | com.thedrinksbusiness
799 | 19258704 | 3384 | 0.000010 | com.monbiot
800 | 19258488 | 2685 | 0.000012 | au.com.mumbrella
801 | 19257102 | 3072 | 0.000011 | fr.thelocal
802 | 19256728 | 3330 | 0.000010 | org.cnduk
803 | 19256628 | 660 | 0.000039 | org.eff
804 | 19256476 | 1441 | 0.000021 | com.tutsplus
805 | 19255922 | 3090 | 0.000011 | ai.fast
806 | 19255422 | 2723 | 0.000012 | com.goinswriter
807 | 19255170 | 3358 | 0.000010 | org.thechicagocouncil
808 | 19253936 | 3029 | 0.000011 | jp.hatenadiary
809 | 19252730 | 2743 | 0.000012 | gov.ferc
810 | 19252634 | 1384 | 0.000022 | com.uber
811 | 19252094 | 3444 | 0.000010 | com.visitdublin
812 | 19250954 | 2582 | 0.000012 | nz.govt.mfat
813 | 19249844 | 2223 | 0.000015 | uk.gov.charitycommission
814 | 19249406 | 1192 | 0.000025 | edu.utexas
815 | 19249112 | 3273 | 0.000010 | com.chemistryworld
816 | 19248998 | 3300 | 0.000010 | org.alaskapublic
817 | 19248984 | 1418 | 0.000021 | fr.lemonde
818 | 19248812 | 3144 | 0.000011 | com.tuck
819 | 19247226 | 3156 | 0.000011 | com.marksdailyapple
820 | 19246284 | 1005 | 0.000029 | com.americanexpress
821 | 19246204 | 579 | 0.000043 | com.patreon
822 | 19245062 | 2814 | 0.000012 | com.ing
823 | 19245032 | 166 | 0.000147 | jp.co.google
824 | 19244244 | 1932 | 0.000016 | uk.gov.education
825 | 19242896 | 2753 | 0.000012 | com.webestools
826 | 19242502 | 2504 | 0.000013 | com.instructables
827 | 19242460 | 1185 | 0.000025 | edu.princeton
828 | 19240552 | 3645 | 0.000010 | com.theppk
829 | 19240536 | 3305 | 0.000010 | com.machinelearningmastery
830 | 19238864 | 1716 | 0.000018 | se.haxx
831 | 19238712 | 1149 | 0.000026 | com.digiday
832 | 19238462 | 896 | 0.000032 | com.zoho
833 | 19238268 | 4669 | 0.000009 | com.9to5mac
834 | 19237602 | 3761 | 0.000010 | org.muslimaid
835 | 19235836 | 541 | 0.000046 | com.alibaba
836 | 19235736 | 2817 | 0.000012 | uk.ac.rcplondon
837 | 19233882 | 556 | 0.000045 | gov.sec
838 | 19232880 | 3043 | 0.000011 | com.platts
839 | 19232688 | 2651 | 0.000012 | com.recyclenow
840 | 19232618 | 3441 | 0.000010 | org.thebestschools
841 | 19231994 | 3352 | 0.000010 | com.beruby
842 | 19231826 | 202 | 0.000119 | com.constantcontact
843 | 19231002 | 2354 | 0.000014 | net.privacypolicytemplate
844 | 19230142 | 3207 | 0.000010 | com.gpsvisualizer
845 | 19227774 | 3104 | 0.000011 | com.rabobank
846 | 19227216 | 3306 | 0.000010 | com.seat61
847 | 19227198 | 3412 | 0.000010 | uk.co.lep
848 | 19226122 | 311 | 0.000079 | com.marriott
849 | 19224666 | 239 | 0.000098 | cn.com.sina
850 | 19224282 | 753 | 0.000036 | com.css-tricks
851 | 19223532 | 246 | 0.000095 | jp.co.amazon
852 | 19222846 | 1299 | 0.000023 | gd.is
853 | 19221828 | 2350 | 0.000014 | uk.co.vogue
8541922142413810.000022com.dell
855192211187220.000037fm.last
8561922110420090.000016io.getmdl
8571922043037560.000010uk.org.stopwar
8581922019626270.000012org.ramsar
8591921798819870.000016com.instapage
860192174345950.000042com.psychologytoday
8611921720235920.000010com.fox13memphis
8621921639631340.000011uk.org.sja
8631921634235380.000010com.breakingenergy
8641921607034360.000010com.star2
8651921578431030.000011org.scielo
86619215692970.000332com.sharethis
867192156868810.000033com.aliexpress
8681921532036130.000010it.diggita
869192148362100.000116jp.ne.hatena
8701921461411250.000026com.firefox
871192144926340.000040gov.nist
8721921294032520.000010org.beatthemicrobead
8731921237236030.000010nl.zoom
8741921231012320.000024com.convertkit
875192078205450.000046uk.co.eventbrite
8761920733431450.000011com.abnamro
8771920638429040.000011org.wildlifetrusts
8781920608816370.000019org.whales
8791920575010680.000027com.shutterstock
8801920467639810.000009com.visitguatemala
8811920384631280.000011uk.org.scope
8821920328810300.000028com.foxnews
8831920314826750.000012org.soilassociation
8841920284210190.000028com.cbslocal
8851920084837020.000010no.haugenbok
8861919958629140.000011com.ironsrc
887191994269520.000030com.variety
8881919934426220.000012com.feedreader
889191988765170.000048com.ea
8901919832235950.000010uk.co.theboltonnews
8911919805211900.000025com.globo
8921919681828630.000011com.itsma
8931919609815640.000019org.freecsstemplates
8941919597820780.000016com.hulu
8951919567231480.000011com.rebekahradice
896191952622890.000084com.discordapp
8971919476235430.000010info.e-ir
8981919454633640.000010org.swi-prolog
8991919228831820.000010com.wpxi
900191916984860.000051com.nasdaq
9011919086431700.000010uk.co.dennis
9021919066835510.000010com.alaskadispatch
9031919046011500.000026com.java
904191902762300.000100com.googletagservices
9051918967635040.000010es.ree
9061918956231620.000010com.sgx
9071918862637210.000010br.org.imazon
9081918833437760.000010com.citymayors
9091918818235820.000010au.com.hotfrog
9101918706632720.000010uk.org.cat
9111918628032790.000010aq.ats
912191861886780.000038com.newyorker
9131918469040020.000009net.politicalscrapbook
9141918434436060.000010com.southernfriedscience
9151918348431250.000011app.web
916191825782200.000106com.naver
9171918177017360.000018com.techrepublic
9181918097836320.000010com.theoildrum
9191918072837280.000010org.worldnuclearreport
920191806741560.000162gov.privacyshield
9211917911828710.000011uk.co.realbusiness
9221917725414630.000021edu.uchicago
9231917672614530.000021tv.ustream
9241917516418070.000017com.nba
9251917335232710.000010uk.org.cpre
9261917318817880.000017org.golang
9271917240229550.000011com.writetothem
9281917236820410.000016com.howstuffworks
9291917089614070.000021uk.co.theregister
930191706684640.000054com.adweek
931191706302430.000096com.stumbleupon
9321917048015790.000019edu.unc
9331916944422110.000015edu.virginia
9341916886036190.000010com.renewablesnow
9351916851813900.000022com.over-blog
9361916780014430.000021com.digitaltrends
9371916778240730.000009uk.co.moblog
9381916514014060.000021us.imageshack
9391916472436040.000010com.at0086
9401916449221440.000015org.coursera
9411916442837990.000010com.avivaromm
942191625829840.000029com.thinkwithgoogle
9431916245036290.000010com.eremnews
944191616604660.000053com.snapchat
9451915991814420.000021com.billboard
9461915990433940.000010uk.gov.peterborough
9471915950635300.000010org.campaigncc
948191586006420.000039org.pbs
9491915759032990.000010uk.co.siemens
9501915757434700.000010org.ilga-europe
951191562589780.000029com.dropboxusercontent
952191543808940.000032com.uservoice
9531915425215780.000019com.ssllabs
9541915399233670.000010com.trafficgenerationcafe
9551915225616140.000019com.warnerbros
956191520429220.000031com.libsyn
9571915185236650.000010uk.org.biofuelwatch
9581915171836170.000010uk.org.garyhall
9591915154823990.000013com.ehow
9601915082037710.000010no.universitetsforlaget
9611914843035590.000010br.org.idec
962191482908390.000035com.qz
9631914816429110.000011net.nend
964191474226900.000038com.webmd
9651914723816940.000018com.codeplex
9661914485213740.000022com.fiverr
9671914458439220.000009net.kjokkenutstyr
968191445724970.000049edu.cmu
9691914415840060.000009org.freedom-now
9701914393211670.000025com.smashingmagazine
9711914360431470.000011uk.org.refill
9721914337219710.000016com.invisionapp
9731914225628640.000011com.dzone
9741914215834900.000010io.dataquest
9751914153839840.000009org.alqaws
9761914123032420.000010io.dropwizard
9771914066238210.000010com.superiorthreads
9781914042033980.000010uk.co.firstnews
979191389943550.000070org.debian
9801913815421340.000015com.w3layouts
981191343888770.000033com.foursquare
9821913404030350.000011com.vungle
9831913371632050.000010org.corporateeurope
984191336448660.000034gov.census
9851913349639950.000009com.tinnedtomatoes
9861913338214000.000021com.blackberry
9871913333613350.000022jp.livedoor
9881913247633340.000010com.drillordrop
9891913183633290.000010com.ovoenergy
9901913117238040.000010com.descarteslabs
9911913077810640.000027com.politico
9921912888837880.000010org.ianfairlie
9931912872818660.000017com.nokia
9941912778626690.000012in.bbc
9951912751229870.000011org.vegsoc
9961912710833870.000010com.figure-eight
997191258183570.000070gov.ftc
998191255962420.000097org.icann
9991912535814010.000021com.xkcd
10001912521636090.000010br.com.ambev
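
Rows of this ranks table can also be processed programmatically. The following is a minimal sketch, not part of the cc-webgraph tooling. It assumes each row holds five whitespace-separated fields, read here as harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, and the name in reversed notation; that reading of the columns is an assumption. The names `RankRow`, `unreverse` and `parse_rank_row` are hypothetical helpers, and the two sample rows are taken from the table above.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    harmonic_rank: int
    harmonic_value: float
    pagerank_rank: int
    pagerank_value: float
    name: str  # name converted back to normal (non-reversed) order


def unreverse(rev_name: str) -> str:
    """Turn a reversed name such as 'uk.co.bristolpost' into 'bristolpost.co.uk'."""
    return ".".join(reversed(rev_name.split(".")))


def parse_rank_row(line: str) -> RankRow:
    """Parse one row; the five-column layout is an assumption, not a documented format."""
    h_rank, h_val, pr_rank, pr_val, rev_name = line.split()
    return RankRow(int(h_rank), float(h_val), int(pr_rank), float(pr_val),
                   unreverse(rev_name))


# Two rows copied from the table above, with columns separated by whitespace.
SAMPLE = """\
747 19298620 3095 0.000011 uk.co.bristolpost
748 19297446 198 0.000122 io.polyfill
"""

# Skip empty lines and any header line starting with '#'.
rows = [parse_rank_row(line) for line in SAMPLE.splitlines()
        if line.strip() and not line.startswith("#")]

for row in rows:
    print(row.name, row.harmonic_rank, row.pagerank_rank)
```

Because `str.split()` with no argument splits on any run of whitespace, the same sketch works for tab-separated and space-separated rows.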

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!