We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February/March, April, and May 2021. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. The projects cc-webgraph and cc-pyspark include all scripts and tools required to construct the graphs. Instructions for exploring the graphs in the webgraph format are given in our collection of webgraph notebooks.

What’s new?

The host-level graph now includes all hosts visited by the crawler, even if no link points to the host and either all visited URLs of the host failed (HTTP 404 and other error codes) or the host’s robots.txt does not allow crawling. Note that the links leading to these hosts may have been found in a prior crawl, not in one of the three crawls used to build this web graph.

Host-level graph

The graph consists of 515 million nodes and 2.82 billion edges. Hyperlinks, HTTP redirects, and link headers are all used as edges to span the graph. All types of links are included, even purely “technical” ones pointing to images, JavaScript libraries, web fonts, etc. However, only host names with a valid IANA TLD are used. Consequently, URLs with an IP address as the host component are not taken into account when building the host-level graph.

There are 452 million dangling nodes (87.9%) and the largest strongly connected component contains 45.2 million (8.8%) nodes. Dangling nodes stem from

  • hosts that have not been crawled, yet are pointed to by a link on a crawled page
  • hosts without any links pointing to a different host name
  • hosts that returned only an error page (e.g., HTTP 404)
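
As a rough illustration of the definition above, the following sketch finds dangling nodes in a tiny edge list. It assumes that a dangling node is simply a node that never appears as the source of an edge and that the edge list only contains links between different hosts. The function name `dangling_nodes` and the toy ids are made up for this example and are not part of the cc-webgraph tooling.

```python
from typing import Iterable, List, Tuple


def dangling_nodes(node_ids: Iterable[int],
                   edges: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the nodes that never appear as the source of an edge."""
    has_outlink = {from_id for from_id, _to_id in edges}
    return [node for node in node_ids if node not in has_outlink]


# Toy graph with hypothetical ids: node 3 has no links at all,
# node 2 is only linked to.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (0, 2)]

dangling = dangling_nodes(nodes, edges)
share = len(dangling) / len(nodes)
print(dangling, f"{share:.1%}")
```

For the full graph, the set of source ids would have to be built by streaming over the edges files rather than holding a Python list in memory.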

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
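
A minimal sketch of this notation, assuming that only a single leading www. label is removed before the labels are reversed. The helper name `reverse_host` is hypothetical, and the actual pipeline may apply further normalization that is not shown here.

```python
def reverse_host(host: str) -> str:
    """Convert a host name to reverse domain name notation.

    Assumption: a single leading "www." is stripped, nothing else.
    """
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```

Reversing the labels puts the most significant part first, so sorting the node list groups all hosts of a domain together.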

You can download the graph and the ranks of all 515 million hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 72 gzip-compressed files listed in two path listings: one for the nodes (vertices) and one for the edges (arcs). First, download the path listing and decompress it using “gzip”. By adding the prefix s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line of the path listing, you get the list of URLs needed to download the entire graph.
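
The prefixing step can be sketched as follows, assuming the path listing has already been downloaded and is available as gzip-compressed bytes with one relative path per line. The function name `urls_from_listing` and the sample paths are placeholders for illustration and are not real file names from this release.

```python
import gzip
import io
from typing import List

HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def urls_from_listing(gz_bytes: bytes, prefix: str = HTTPS_PREFIX) -> List[str]:
    """Decompress a path listing and prepend the prefix to every non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]


# Stand-in for a downloaded *.paths.gz file (placeholder paths).
sample_listing = gzip.compress(b"some/dir/part-00000.txt.gz\nsome/dir/part-00001.txt.gz\n")
urls = urls_from_listing(sample_listing)
print("\n".join(urls))
```

Passing `prefix="s3://commoncrawl/"` yields S3 URLs instead of HTTPS URLs.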

Download files of the Common Crawl Feb/Apr/May 2021 host-level webgraph

Size | File | Description
3.31 GB | cc-main-2021-feb-apr-may-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 24 vertices files
12.94 GB | cc-main-2021-feb-apr-may-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 48 edges files
5.57 GB | cc-main-2021-feb-apr-may-host.graph | graph in BVGraph format
2 kB | cc-main-2021-feb-apr-may-host.properties |
6.22 GB | cc-main-2021-feb-apr-may-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2021-feb-apr-may-host-t.properties |
1 kB | cc-main-2021-feb-apr-may-host.stats | WebGraph statistics
7.69 GB | cc-main-2021-feb-apr-may-host-ranks.txt.gz | harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
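
The aggregation idea can be sketched as below. This is not the cc-webgraph implementation: it uses a tiny hard-coded stand-in for the public suffix list (already in reversed notation), ignores the wildcard and exception rules of the real list, and assumes that links within the same domain are dropped. The names `TOY_REVERSED_SUFFIXES`, `reversed_pld`, and `aggregate_edges` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Toy stand-in for the public suffix list, written in reversed notation.
TOY_REVERSED_SUFFIXES = {"com", "org", "uk", "uk.co"}


def reversed_pld(rev_host: str,
                 rev_suffixes: Set[str] = TOY_REVERSED_SUFFIXES) -> Optional[str]:
    """Return the pay-level domain of a reversed host name, or None.

    The PLD is the longest matching public suffix plus one more label.
    """
    labels = rev_host.split(".")
    for i in range(len(labels), 0, -1):
        if ".".join(labels[:i]) in rev_suffixes:
            if i == len(labels):
                return None  # the host itself is a public suffix
            return ".".join(labels[: i + 1])
    return None  # no known suffix matched


def aggregate_edges(hosts: Dict[int, str],
                    host_edges: Iterable[Tuple[int, int]]) -> Set[Tuple[str, str]]:
    """Collapse host-level edges into distinct domain-level edges."""
    domain_edges = set()
    for from_id, to_id in host_edges:
        src = reversed_pld(hosts[from_id])
        dst = reversed_pld(hosts[to_id])
        # Assumption: links within the same domain are not kept.
        if src and dst and src != dst:
            domain_edges.add((src, dst))
    return domain_edges


hosts = {0: "com.example.a", 1: "com.example.b", 2: "uk.co.bbc.news", 3: "uk.co.bbc"}
host_edges = [(0, 1), (0, 2), (1, 3), (1, 2)]
domain_edges = aggregate_edges(hosts, host_edges)
print(domain_edges)
```

In practice a library that implements the full public suffix list rules would replace the toy suffix set.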

The domain-level graph has 88 million nodes and 1.58 billion edges. 44 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/domain/ or https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/domain/.

Download files of the Common Crawl Feb/Apr/May 2021 domain-level webgraph

Size | File | Description
0.61 GB | cc-main-2021-feb-apr-may-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.37 GB | cc-main-2021-feb-apr-may-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.58 GB | cc-main-2021-feb-apr-may-domain.graph | graph in BVGraph format
2 kB | cc-main-2021-feb-apr-may-domain.properties |
3.42 GB | cc-main-2021-feb-apr-may-domain-t.graph | transpose of the graph
2 kB | cc-main-2021-feb-apr-may-domain-t.properties |
1 kB | cc-main-2021-feb-apr-may-domain.stats | WebGraph statistics
1.89 GB | cc-main-2021-feb-apr-may-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 88 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Feb/Apr/May 2021)

harmonic centrality rank | hc value | page rank | page rank value | reversed domain name
1 | 31920934 | 1 | 0.017627 | com.googleapis
2 | 31032784 | 3 | 0.013762 | com.facebook
3 | 29681304 | 2 | 0.013832 | com.google
4 | 27101692 | 4 | 0.007844 | com.twitter
5 | 26954660 | 5 | 0.007519 | org.w
6 | 26886624 | 7 | 0.006967 | com.youtube
7 | 25515850 | 8 | 0.005718 | com.instagram
8 | 25031490 | 6 | 0.007143 | com.googletagmanager
9 | 24396116 | 9 | 0.005506 | org.gmpg
10 | 23807122 | 12 | 0.003347 | com.linkedin
11 | 22970992 | 13 | 0.003048 | com.gstatic
12 | 22854052 | 10 | 0.003951 | com.cloudflare
13 | 22698594 | 19 | 0.001914 | com.gravatar
14 | 22504168 | 14 | 0.002908 | org.wordpress
15 | 22434542 | 22 | 0.001564 | com.pinterest
16 | 22100870 | 25 | 0.001270 | org.wikipedia
17 | 21950578 | 17 | 0.002031 | com.wordpress
18 | 21940826 | 18 | 0.001958 | com.apple
19 | 21766696 | 15 | 0.002258 | com.bootstrapcdn
20 | 21762964 | 30 | 0.001174 | com.vimeo
21 | 21722198 | 38 | 0.000914 | be.youtu
22 | 21556142 | 21 | 0.001842 | com.jquery
23 | 21478118 | 29 | 0.001182 | com.microsoft
24 | 21432212 | 53 | 0.000703 | com.blogspot
25 | 21354260 | 35 | 0.001025 | com.amazonaws
26 | 21337432 | 44 | 0.000765 | com.amazon
27 | 21320702 | 43 | 0.000789 | gl.goo
28 | 21170722 | 62 | 0.000600 | ly.bit
29 | 21149628 | 99 | 0.000409 | com.tumblr
30 | 21148242 | 50 | 0.000739 | com.wp
31 | 21136818 | 45 | 0.000758 | org.mozilla
32 | 21110018 | 57 | 0.000689 | eu.europa
33 | 21104262 | 20 | 0.001894 | com.adobe
34 | 21048760 | 16 | 0.002200 | com.github
35 | 21040284 | 34 | 0.001026 | com.google-analytics
36 | 21027350 | 36 | 0.001015 | net.jsdelivr
37 | 20998320 | 27 | 0.001218 | com.wixstatic
38 | 20995232 | 31 | 0.001119 | net.cloudfront
39 | 20946148 | 47 | 0.000744 | com.flickr
40 | 20913104 | 107 | 0.000338 | com.yahoo
41 | 20851316 | 83 | 0.000436 | com.googleusercontent
42 | 20843068 | 37 | 0.000929 | io.github
43 | 20840670 | 111 | 0.000317 | com.reddit
44 | 20834398 | 58 | 0.000677 | com.paypal
45 | 20816886 | 23 | 0.001554 | com.fontawesome
46 | 20773582 | 103 | 0.000368 | com.weebly
47 | 20764576 | 79 | 0.000455 | com.medium
48 | 20764512 | 33 | 0.001035 | com.googlesyndication
49 | 20757582 | 32 | 0.001118 | ru.yandex
50 | 20741944 | 48 | 0.000743 | com.whatsapp
51 | 20708152 | 68 | 0.000520 | org.w3
52 | 20705826 | 132 | 0.000240 | com.nytimes
53 | 20696906 | 59 | 0.000673 | co.t
54 | 20678088 | 102 | 0.000375 | org.creativecommons
55 | 20675822 | 115 | 0.000290 | com.soundcloud
56 | 20644978 | 60 | 0.000624 | org.schema
57 | 20627114 | 74 | 0.000479 | com.shopify
58 | 20621162 | 66 | 0.000543 | com.vk
59 | 20604726 | 181 | 0.000149 | org.wikimedia
60 | 20604724 | 147 | 0.000204 | com.dropbox
61 | 20579720 | 55 | 0.000702 | com.addthis
62 | 20572950 | 138 | 0.000211 | org.archive
63 | 20570610 | 198 | 0.000133 | com.cnn
64 | 20558114 | 152 | 0.000187 | gov.cdc
65 | 20550306 | 80 | 0.000446 | me.wp
66 | 20538816 | 193 | 0.000136 | com.imgur
67 | 20530078 | 49 | 0.000740 | net.doubleclick
68 | 20512294 | 199 | 0.000133 | uk.co.bbc
69 | 20505964 | 200 | 0.000133 | net.slideshare
70 | 20499864 | 171 | 0.000155 | com.theguardian
71 | 20489756 | 158 | 0.000175 | int.who
72 | 20482256 | 120 | 0.000263 | com.spotify
73 | 20481118 | 175 | 0.000151 | com.bing
74 | 20478320 | 213 | 0.000124 | com.businessinsider
75 | 20477478 | 253 | 0.000104 | com.bloomberg
76 | 20477300 | 144 | 0.000206 | gov.nih
77 | 20473648 | 46 | 0.000748 | com.macromedia
78 | 20440520 | 254 | 0.000103 | com.wsj
79 | 20434320 | 224 | 0.000118 | edu.stanford
80 | 20419762 | 41 | 0.000847 | net.fbcdn
81 | 20417930 | 39 | 0.000885 | org.apache
82 | 20409636 | 157 | 0.000175 | org.ietf
83 | 20397792 | 90 | 0.000420 | com.list-manage
84 | 20395594 | 368 | 0.000071 | com.googleblog
85 | 20395350 | 217 | 0.000123 | com.stackoverflow
86 | 20393172 | 170 | 0.000155 | com.giphy
87 | 20391226 | 314 | 0.000085 | edu.mit
88 | 20381948 | 223 | 0.000118 | com.washingtonpost
89 | 20372602 | 134 | 0.000232 | com.ytimg
90 | 20363592 | 362 | 0.000073 | com.appspot
91 | 20360236 | 351 | 0.000076 | com.theverge
92 | 20359610 | 286 | 0.000093 | com.bbc
93 | 20358870 | 396 | 0.000067 | uk.co.telegraph
94 | 20356036 | 499 | 0.000056 | edu.berkeley
95 | 20348048 | 266 | 0.000101 | edu.harvard
96 | 20346012 | 330 | 0.000080 | com.go
97 | 20341676 | 237 | 0.000112 | com.office
98 | 20338710 | 145 | 0.000206 | us.zoom
99 | 20335782 | 247 | 0.000109 | com.android
100 | 20335366 | 327 | 0.000082 | com.wired
101 | 20334160 | 288 | 0.000092 | com.techcrunch
102 | 20331782 | 238 | 0.000111 | com.oracle
103 | 20323638 | 547 | 0.000051 | com.livejournal
104 | 20296670 | 164 | 0.000170 | com.issuu
105 | 20295840 | 296 | 0.000090 | com.cnbc
106 | 20292146 | 211 | 0.000124 | gov.ca
107 | 20291754 | 402 | 0.000066 | com.ted
108 | 20288380 | 379 | 0.000069 | gov.nasa
109 | 20283426 | 149 | 0.000195 | com.forbes
110 | 20283050 | 148 | 0.000199 | com.wixsite
111 | 20282972 | 151 | 0.000192 | com.npmjs
112 | 20282524 | 518 | 0.000054 | com.zdnet
113 | 20279656 | 447 | 0.000062 | com.msn
114 | 20277752 | 292 | 0.000091 | com.reuters
115 | 20275540 | 350 | 0.000076 | com.nature
116 | 20273474 | 78 | 0.000459 | com.godaddy
117 | 20271718 | 371 | 0.000070 | com.myspace
118 | 20270494 | 222 | 0.000119 | com.etsy
119 | 20268832 | 321 | 0.000084 | com.prnewswire
120 | 20255726 | 209 | 0.000125 | org.ampproject
121 | 20252386 | 407 | 0.000065 | org.arxiv
122 | 20252292 | 312 | 0.000085 | org.npr
123 | 20252218 | 263 | 0.000101 | com.sciencedirect
124 | 20248804 | 98 | 0.000410 | com.unpkg
125 | 20246402 | 265 | 0.000101 | com.example
126 | 20245616 | 67 | 0.000524 | net.akamaihd
127 | 20237056 | 215 | 0.000123 | com.eventbrite
128 | 20234532 | 367 | 0.000072 | org.hbr
129 | 20232338 | 176 | 0.000151 | com.blogger
130 | 20231658 | 127 | 0.000247 | org.networkadvertising
131 | 20231552 | 399 | 0.000066 | com.latimes
132 | 20228690 | 268 | 0.000101 | org.acm
133 | 20223242 | 338 | 0.000079 | com.statista
134 | 20209434 | 389 | 0.000068 | com.fastcompany
135 | 20205848 | 660 | 0.000043 | com.economist
136 | 20202482 | 343 | 0.000078 | com.time
137 | 20202452 | 226 | 0.000117 | com.twimg
138 | 20201902 | 679 | 0.000042 | edu.upenn
139 | 20201530 | 550 | 0.000050 | edu.yale
140 | 20200842 | 258 | 0.000102 | com.githubusercontent
141 | 20191272 | 474 | 0.000060 | com.steampowered
142 | 20189824 | 143 | 0.000206 | com.opera
143 | 20188620 | 444 | 0.000062 | uk.co.dailymail
144 | 20188486 | 353 | 0.000076 | com.springer
145 | 20186806 | 576 | 0.000047 | com.scribd
146 | 20184784 | 780 | 0.000041 | edu.columbia
147 | 20180100 | 535 | 0.000052 | org.chromium
148 | 20175876 | 591 | 0.000046 | me.about
149 | 20175732 | 604 | 0.000046 | google.blog
150 | 20175284 | 285 | 0.000094 | com.squarespace
151 | 20174050 | 335 | 0.000079 | com.huffingtonpost
152 | 20171356 | 431 | 0.000063 | com.nationalgeographic
153 | 20168788 | 221 | 0.000119 | uk.co.google
154 | 20165372 | 208 | 0.000125 | com.unsplash
155 | 20163580 | 388 | 0.000068 | com.w3schools
156 | 20158956 | 339 | 0.000079 | com.dribbble
157 | 20154786 | 340 | 0.000079 | com.tiktok
158 | 20153356 | 293 | 0.000091 | org.un
159 | 20137924 | 794 | 0.000040 | com.qz
160 | 20133814 | 248 | 0.000108 | com.bandcamp
161 | 20129598 | 485 | 0.000058 | edu.cornell
162 | 20125954 | 821 | 0.000039 | edu.umich
163 | 20121120 | 119 | 0.000267 | com.ft
164 | 20115342 | 435 | 0.000063 | com.theatlantic
165 | 20111028 | 966 | 0.000033 | edu.princeton
166 | 20110808 | 341 | 0.000078 | com.usatoday
167 | 20105556 | 786 | 0.000040 | com.evernote
168 | 20105482 | 133 | 0.000235 | info.aboutads
169 | 20104810 | 408 | 0.000065 | com.meetup
170 | 20102638 | 438 | 0.000062 | com.goodreads
171 | 20100894 | 625 | 0.000045 | org.ieee
172 | 20098972 | 878 | 0.000036 | com.slate
173 | 20097870 | 677 | 0.000042 | com.mysql
174 | 20097656 | 453 | 0.000061 | com.patreon
175 | 20097530 | 137 | 0.000216 | me.t
176 | 20095600 | 515 | 0.000055 | com.cbsnews
177 | 20084204 | 656 | 0.000043 | com.docker
178 | 20083336 | 291 | 0.000092 | com.wiley
179 | 20082520 | 480 | 0.000059 | gov.usda
180 | 20080664 | 454 | 0.000061 | com.dailymotion
181 | 20078818 | 817 | 0.000039 | edu.washington
182 | 20077160 | 493 | 0.000057 | com.withgoogle
183 | 20075064 | 523 | 0.000054 | io.readthedocs
184 | 20071014 | 644 | 0.000044 | com.marketwatch
185 | 20065010 | 650 | 0.000043 | uk.co.blogspot
186 | 20062734 | 868 | 0.000037 | com.shutterstock
187 | 20062652 | 54 | 0.000703 | com.fb
188 | 20059664 | 497 | 0.000056 | uk.co.independent
189 | 20056344 | 76 | 0.000467 | com.wix
190 | 20055932 | 811 | 0.000039 | org.cambridge
191 | 20051844 | 559 | 0.000049 | com.pexels
192 | 20048576 | 779 | 0.000041 | org.sciencemag
193 | 20048004 | 592 | 0.000046 | com.buzzfeed
194 | 20044248 | 819 | 0.000039 | com.stackexchange
195 | 20043466 | 179 | 0.000149 | ru.mail
196 | 20043446 | 844 | 0.000038 | com.webs
197 | 20043074 | 573 | 0.000048 | com.git-scm
198 | 20040208 | 464 | 0.000060 | com.inc
199 | 20037354 | 272 | 0.000100 | net.behance
200 | 20029744 | 425 | 0.000063 | gov.whitehouse
201 | 20025342 | 832 | 0.000038 | com.apnews
202 | 20023518 | 769 | 0.000041 | com.vox
203 | 20022030 | 1365 | 0.000024 | uk.co.thesun
204 | 20018548 | 274 | 0.000098 | com.outlook
205 | 20018318 | 772 | 0.000041 | org.bitbucket
206 | 20017276 | 40 | 0.000871 | com.qq
207 | 20014872 | 244 | 0.000110 | org.doi
208 | 20012082 | 812 | 0.000039 | uk.ac.cam
209 | 20011998 | 255 | 0.000103 | com.disqus
210 | 20007312 | 236 | 0.000112 | com.feedburner
211 | 20005630 | 670 | 0.000043 | org.worldbank
212 | 20001230 | 584 | 0.000047 | org.unicef
213 | 20000932 | 419 | 0.000064 | com.mozilla
214 | 19999740 | 593 | 0.000046 | co.ibb
215 | 19999080 | 26 | 0.001261 | io.polyfill
216 | 19997928 | 525 | 0.000054 | com.booking
217 | 19993488 | 42 | 0.000808 | com.baidu
218 | 19989784 | 260 | 0.000101 | com.cloudinary
219 | 19985856 | 289 | 0.000092 | com.tinyurl
220 | 19983980 | 345 | 0.000077 | com.ibm
221 | 19983022 | 1163 | 0.000027 | com.speakerdeck
222 | 19982506 | 597 | 0.000046 | gov.noaa
223 | 19978206 | 612 | 0.000045 | ee.linktr
224 | 19977310 | 569 | 0.000048 | com.psychologytoday
225 | 19973710 | 531 | 0.000053 | gov.loc
226 | 19972920 | 400 | 0.000066 | com.getpocket
227 | 19972760 | 1041 | 0.000031 | edu.utexas
228 | 19971794 | 320 | 0.000084 | org.pewresearch
229 | 19971310 | 1366 | 0.000024 | edu.rutgers
230 | 19970894 | 551 | 0.000050 | com.sagepub
231 | 19970200 | 309 | 0.000087 | com.nbcnews
232 | 19967962 | 1134 | 0.000028 | org.eclipse
233 | 19965586 | 648 | 0.000043 | com.trello
234 | 19964280 | 326 | 0.000082 | net.windows
235 | 19964194 | 384 | 0.000068 | com.quora
236 | 19961430 | 600 | 0.000046 | net.azurewebsites
237 | 19959910 | 275 | 0.000098 | gov.ftc
238 | 19955938 | 1057 | 0.000030 | edu.uchicago
239 | 19953308 | 311 | 0.000086 | com.netdna-ssl
240 | 19951960 | 782 | 0.000041 | org.semver
241 | 19951286 | 124 | 0.000252 | com.mailchimp
242 | 19950294 | 436 | 0.000063 | com.nypost
243 | 19949296 | 1195 | 0.000027 | com.hatenablog
244 | 19947142 | 652 | 0.000043 | com.newyorker
245 | 19943938 | 985 | 0.000033 | uk.co.guardian
246 | 19943564 | 590 | 0.000046 | com.usnews
247 | 19940498 | 220 | 0.000119 | tv.twitch
248 | 19939738 | 784 | 0.000041 | au.net.abc
249 | 19938820 | 166 | 0.000167 | com.amazon-adsystem
250 | 19936308 | 1278 | 0.000025 | com.vogue
251 | 19935466 | 230 | 0.000113 | com.wpengine
252 | 19934098 | 106 | 0.000338 | com.stripe
253 | 19933266 | 1261 | 0.000025 | org.kernel
254 | 19929738 | 941 | 0.000034 | com.politico
255 | 19926416 | 1193 | 0.000027 | org.unicode
256 | 19925602 | 580 | 0.000047 | org.eff
257 | 19925174 | 541 | 0.000051 | br.com.uol
258 | 19924806 | 852 | 0.000037 | com.about
259 | 19923644 | 1358 | 0.000024 | edu.hbs
260 | 19923600 | 954 | 0.000034 | com.dropboxusercontent
261 | 19923464 | 911 | 0.000035 | edu.jhu
262 | 19922062 | 993 | 0.000032 | co.elastic
263 | 19921888 | 913 | 0.000035 | com.steamcommunity
264 | 19920150 | 1971 | 0.000018 | com.googlesource
265 | 19919760 | 522 | 0.000054 | com.tandfonline
266 | 19918010 | 277 | 0.000097 | com.criteo
267 | 19915708 | 552 | 0.000050 | org.pbs
268 | 19912986 | 1106 | 0.000029 | edu.umd
269 | 19912224 | 64 | 0.000549 | co.g
270 | 19908340 | 865 | 0.000037 | com.foxnews
271 | 19907456 | 123 | 0.000261 | com.sharethis
272 | 19904178 | 1027 | 0.000031 | com.rollingstone
273 | 19903082 | 228 | 0.000115 | com.imdb
274 | 19902774 | 977 | 0.000033 | com.scientificamerican
275 | 19901940 | 1392 | 0.000023 | com.urbandictionary
276 | 19900876 | 775 | 0.000041 | uk.ac.ox
277 | 19900406 | 391 | 0.000067 | com.arcgis
278 | 19898520 | 2016 | 0.000018 | com.lego
279 | 19898420 | 251 | 0.000107 | page.g
280 | 19898318 | 631 | 0.000044 | gov.census
281 | 19890056 | 530 | 0.000053 | com.oup
282 | 19887968 | 346 | 0.000077 | com.optimizely
283 | 19887424 | 582 | 0.000047 | com.indiatimes
284 | 19887194 | 376 | 0.000069 | com.cnet
285 | 19884024 | 422 | 0.000064 | com.wufoo
286 | 19882930 | 704 | 0.000042 | uk.co.eventbrite
287 | 19882806 | 421 | 0.000064 | com.bigcommerce
288 | 19880306 | 1350 | 0.000024 | ca.blogspot
289 | 19879016 | 833 | 0.000038 | org.fao
290 | 19878732 | 908 | 0.000035 | com.jetbrains
291 | 19871044 | 1467 | 0.000022 | ca.ubc
292 | 19867650 | 1938 | 0.000018 | com.warnerbros
293 | 19866012 | 446 | 0.000062 | org.d3js
294 | 19865518 | 946 | 0.000034 | org.greenpeace
295 | 19864632 | 206 | 0.000127 | net.sourceforge
296 | 19863450 | 323 | 0.000083 | fr.google
297 | 19862916 | 1279 | 0.000025 | com.history
298 | 19861806 | 851 | 0.000038 | com.gumroad
299 | 19861750 | 919 | 0.000035 | com.chicagotribune
300 | 19859844 | 636 | 0.000044 | gov.archives
301 | 19858902 | 284 | 0.000095 | com.googlecode
302 | 19853502 | 342 | 0.000078 | com.slack
303 | 19851932 | 229 | 0.000114 | com.eepurl
304 | 19845626 | 114 | 0.000292 | com.paypalobjects
305 | 19841702 | 927 | 0.000035 | com.sap
306 | 19839830 | 153 | 0.000180 | com.addtoany
307 | 19837466 | 290 | 0.000092 | com.typepad
308 | 19834082 | 1562 | 0.000021 | de.mpg
309 | 19830054 | 664 | 0.000043 | com.pinimg
310 | 19828148 | 282 | 0.000095 | com.calendly
311 | 19827530 | 491 | 0.000057 | gov.epa
312 | 19825756 | 354 | 0.000076 | com.proofpoint
313 | 19821128 | 1430 | 0.000023 | ch.ethz
314 | 19821094 | 1028 | 0.000031 | com.500px
315 | 19820554 | 1732 | 0.000019 | com.diigo
316 | 19820398 | 334 | 0.000079 | com.live
317 | 19820034 | 1277 | 0.000025 | org.postgresql
318 | 19818544 | 1257 | 0.000025 | org.wiktionary
319 | 19817910 | 1274 | 0.000025 | org.aclu
320 | 19817698 | 981 | 0.000033 | edu.si
321 | 19816586 | 1394 | 0.000023 | edu.msu
322 | 19816210 | 1029 | 0.000031 | com.thehill
323 | 19814936 | 890 | 0.000036 | de.spiegel
324 | 19813172 | 916 | 0.000035 | com.huffpost
325 | 19811282 | 472 | 0.000060 | gov.hhs
326 | 19809240 | 1114 | 0.000028 | com.scmp
327 | 19806650 | 73 | 0.000484 | me.fb
328 | 19806306 | 764 | 0.000042 | org.change
329 | 19805070 | 378 | 0.000069 | com.sohu
330 | 19804336 | 1329 | 0.000024 | edu.illinois
331 | 19804164 | 185 | 0.000147 | com.xing
332 | 19801192 | 1323 | 0.000024 | org.tensorflow
333 | 19801086 | 1008 | 0.000032 | com.ssrn
334 | 19800184 | 162 | 0.000171 | com.zendesk
335 | 19798428 | 904 | 0.000035 | com.netlify
336 | 19797294 | 508 | 0.000056 | com.squareup
337 | 19797020 | 1352 | 0.000024 | com.sky
338 | 19794400 | 196 | 0.000134 | org.iana
339 | 19792714 | 1078 | 0.000029 | uk.co.thetimes
340 | 19792494 | 847 | 0.000038 | gov.congress
341 | 19788704 | 809 | 0.000039 | org.pypi
342 | 19783878 | 1422 | 0.000023 | cn.com.chinadaily
343 | 19781142 | 972 | 0.000033 | edu.academia
344 | 19780974 | 456 | 0.000061 | com.kickstarter
345 | 19780084 | 802 | 0.000040 | gov.senate
346 | 19779128 | 2415 | 0.000015 | org.pydata
347 | 19778124 | 1140 | 0.000027 | org.semanticscholar
348 | 19775716 | 620 | 0.000045 | site.business
349 | 19775012 | 1275 | 0.000025 | com.over-blog
350 | 19774866 | 792 | 0.000040 | org.oecd
351 | 19774846 | 1660 | 0.000020 | org.phys
352 | 19774334 | 999 | 0.000032 | com.yarnpkg
353 | 19772248 | 816 | 0.000039 | com.deviantart
354 | 19770936 | 1084 | 0.000029 | uk.co.mirror
355 | 19770522 | 187 | 0.000145 | com.rawgit
356 | 19770114 | 1315 | 0.000024 | com.axios
357 | 19769700 | 623 | 0.000045 | gov.house
358 | 19768998 | 894 | 0.000036 | com.discordapp
359 | 19768866 | 880 | 0.000036 | com.sciencedaily
360 | 19766292 | 511 | 0.000055 | com.gmail
361 | 19765678 | 423 | 0.000064 | com.technorati
362 | 19763944 | 216 | 0.000123 | com.hubspot
363 | 19761638 | 1433 | 0.000023 | com.unity3d
364 | 19760768 | 2137 | 0.000017 | org.threejs
365 | 19760238 | 1364 | 0.000024 | com.aljazeera
366 | 19759580 | 245 | 0.000109 | org.nodejs
367 | 19758846 | 896 | 0.000036 | com.bmj
368 | 19755564 | 261 | 0.000101 | com.ebay
369 | 19755198 | 1197 | 0.000026 | au.com.smh
370 | 19753628 | 234 | 0.000113 | org.gnu
371 | 19751964 | 1516 | 0.000021 | edu.osu
372 | 19751362 | 1025 | 0.000031 | int.coe
373 | 19750302 | 994 | 0.000032 | com.britannica
374 | 19748408 | 1312 | 0.000024 | edu.gatech
375 | 19746818 | 2691 | 0.000013 | com.openai
376 | 19744370 | 495 | 0.000056 | org.openstreetmap
377 | 19743086 | 437 | 0.000062 | com.ssl-images-amazon
378 | 19741582 | 791 | 0.000040 | br.com.google
379 | 19741030 | 855 | 0.000037 | ca.cbc
380 | 19740484 | 869 | 0.000037 | com.theconversation
381 | 19739852 | 2582 | 0.000014 | edu.toronto
382 | 19738652 | 1044 | 0.000031 | gov.usgs
383 | 19738306 | 1556 | 0.000021 | com.newscientist
384 | 19736226 | 301 | 0.000088 | net.themeforest
385 | 19735698 | 605 | 0.000046 | com.udacity
386 | 19735668 | 473 | 0.000060 | edu.nyu
387 | 19734084 | 1716 | 0.000019 | edu.ucsc
388 | 19723708 | 1700 | 0.000020 | org.emojipedia
389 | 19722194 | 2068 | 0.000017 | it.scoop
390 | 19722024 | 2754 | 0.000013 | com.slides
391 | 19721872 | 1459 | 0.000022 | ca.sfu
392 | 19720004 | 845 | 0.000038 | au.gov.nsw
393 | 19717908 | 1903 | 0.000019 | org.propublica
394 | 19717586 | 1386 | 0.000023 | com.firebaseapp
395 | 19716094 | 2247 | 0.000016 | com.skyrock
396 | 19710516 | 776 | 0.000041 | com.freepik
397 | 19707962 | 97 | 0.000412 | net.facebook
398 | 19704900 | 1454 | 0.000022 | com.penguinrandomhouse
399 | 19703572 | 195 | 0.000135 | org.bbb
400 | 19703432 | 1934 | 0.000018 | jp.co.japantimes
401 | 19701030 | 1762 | 0.000019 | com.itv
402 | 19700818 | 82 | 0.000437 | net.jsfiddle
403 | 19700616 | 1985 | 0.000018 | org.maven
404 | 19699746 | 2370 | 0.000015 | com.deepmind
405 | 19697844 | 617 | 0.000045 | com.healthline
406 | 19695324 | 506 | 0.000056 | de.gesetze-im-internet
407 | 19694720 | 465 | 0.000060 | org.python
408 | 19694428 | 2331 | 0.000015 | com.mystrikingly
409 | 19691536 | 884 | 0.000036 | gov.dhs
410 | 19688238 | 1233 | 0.000026 | com.wikia
411 | 19685986 | 2090 | 0.000017 | org.sqlite
412 | 19682976 | 1544 | 0.000021 | ms.1drv
413 | 19682094 | 178 | 0.000150 | com.salesforce
414 | 19679914 | 322 | 0.000084 | net.php
415 | 19671484 | 324 | 0.000083 | com.surveymonkey
416 | 19670962 | 634 | 0.000044 | com.mashable
417 | 19670338 | 1628 | 0.000020 | com.motherjones
418 | 19668724 | 139 | 0.000211 | com.weibo
419 | 19668554 | 2453 | 0.000014 | com.fastcodesign
420 | 19667444 | 1506 | 0.000021 | com.flipboard
421 | 19666746 | 2435 | 0.000015 | edu.byu
422 | 19665482 | 1748 | 0.000019 | edu.cuny
423 | 19664886 | 317 | 0.000085 | ru.ok
424 | 19662618 | 287 | 0.000092 | net.azureedge
425 | 19662108 | 1339 | 0.000024 | com.thedailybeast
426 | 19659672 | 246 | 0.000109 | org.aboutcookies
427 | 19658838 | 2283 | 0.000015 | com.shutterfly
428 | 19656108 | 1413 | 0.000023 | com.reverbnation
429 | 19655722 | 2666 | 0.000013 | io.material
430 | 19655254 | 537 | 0.000052 | io.codepen
431 | 19652776 | 1296 | 0.000025 | com.dw
432 | 19651986 | 125 | 0.000250 | com.youtube-nocookie
433 | 19650416 | 1724 | 0.000019 | com.esri
434 | 19650188 | 490 | 0.000057 | fr.free
435 | 19648416 | 1509 | 0.000021 | com.substack
436 | 19647438 | 561 | 0.000049 | com.matterport
437 | 19646584 | 1956 | 0.000018 | com.hindustantimes
438 | 19645830 | 1909 | 0.000019 | com.insider
439 | 19642342 | 2110 | 0.000017 | edu.oregonstate
440 | 19641814 | 2390 | 0.000015 | org.wikibooks
441 | 19640838 | 891 | 0.000036 | int.wipo
442 | 19640244 | 2820 | 0.000013 | org.aclweb
443 | 19639226 | 607 | 0.000045 | gov.state
444 | 19638894 | 2366 | 0.000015 | com.wattpad
445 | 19638652 | 160 | 0.000172 | gle.forms
446 | 19636692 | 1052 | 0.000030 | org.jstor
447 | 19636398 | 1951 | 0.000018 | com.channel4
448 | 19636126 | 1752 | 0.000019 | edu.ucsb
449 | 19635942 | 1320 | 0.000024 | gov.supremecourt
450 | 19633994 | 56 | 0.000697 | com.googleadservices
451 | 19631760 | 2441 | 0.000015 | at.ac.univie
452 | 19629096 | 2924 | 0.000013 | com.pbase
453 | 19626572 | 278 | 0.000097 | uk.org.ico
454 | 19624802 | 639 | 0.000044 | com.licdn
455 | 19623422 | 1518 | 0.000021 | ch.ipcc
456 | 19621874 | 937 | 0.000034 | com.gallup
457 | 19621780 | 496 | 0.000056 | com.herokuapp
458 | 19618584 | 1141 | 0.000027 | edu.brookings
459 | 19617388 | 963 | 0.000033 | edu.psu
460 | 19616790 | 1333 | 0.000024 | mil.army
461 | 19616626 | 434 | 0.000063 | com.rackcdn
462 | 19614484 | 385 | 0.000068 | com.atlassian
463 | 19611760 | 1226 | 0.000026 | com.smashingmagazine
464 | 19609634 | 2227 | 0.000016 | blog.home
465 | 19608450 | 1362 | 0.000024 | gov.defense
466 | 19607698 | 1131 | 0.000028 | com.photoshelter
467 | 19607464 | 483 | 0.000058 | net.imgix
468 | 19607012 | 182 | 0.000149 | jp.co.yahoo
469 | 19605316 | 2284 | 0.000015 | com.contently
470 | 19602040 | 826 | 0.000039 | com.oreilly
471 | 19597708 | 1174 | 0.000027 | com.mediafire
472 | 19595596 | 2117 | 0.000017 | com.thecut
473 | 19594604 | 1960 | 0.000018 | google.ai
474 | 19594568 | 3151 | 0.000012 | cc.uxdesign
475 | 19594280 | 3161 | 0.000012 | edu.uvm
476 | 19594100 | 520 | 0.000054 | edu.cmu
477 | 19593086 | 3137 | 0.000012 | com.instapaper
478 | 19591090 | 1591 | 0.000020 | com.thestar
479 | 19588378 | 369 | 0.000071 | net.researchgate
480 | 19587214 | 3502 | 0.000011 | com.raywenderlich
481 | 19587008 | 527 | 0.000053 | com.thinkwithgoogle
482 | 19584868 | 2149 | 0.000016 | fr.liberation
483 | 19582230 | 109 | 0.000336 | de.google
484 | 19581418 | 1574 | 0.000021 | com.buzzfeednews
485 | 19577648 | 767 | 0.000041 | org.worldwildlife
486 | 19576662 | 1013 | 0.000032 | com.ecwid
487 | 19576118 | 1477 | 0.000022 | com.findlaw
488 | 19574804 | 1012 | 0.000032 | com.thelancet
489 | 19573936 | 774 | 0.000041 | com.vice
490 | 19573506 | 813 | 0.000039 | gov.nist
491 | 19572872 | 1964 | 0.000018 | org.google
492 | 19572508 | 1531 | 0.000021 | org.hrw
493 | 19570410 | 765 | 0.000042 | com.intel
494 | 19568238 | 2695 | 0.000013 | uk.co.ibtimes
495 | 19567790 | 2372 | 0.000015 | com.oprah
496 | 19567558 | 87 | 0.000428 | com.workplace
497 | 19567194 | 3329 | 0.000011 | com.pearltrees
498 | 19567174 | 2103 | 0.000017 | com.voanews
499 | 19566762 | 965 | 0.000033 | com.engadget
500 | 19566188 | 126 | 0.000247 | com.statcounter
501 | 19564772 | 3365 | 0.000011 | org.edublogs
502 | 19563980 | 1260 | 0.000025 | org.aiga
503 | 19562828 | 1031 | 0.000031 | de.stern
504 | 19562068 | 1583 | 0.000020 | fr.francetvinfo
505 | 19560196 | 2620 | 0.000014 | com.hm
506 | 19559342 | 315 | 0.000085 | org.drupal
507 | 19559132 | 3736 | 0.000010 | fr.unblog
508 | 19558786 | 747 | 0.000042 | com.canva
509 | 19558362 | 2870 | 0.000013 | edu.ucf
510 | 19558064 | 3204 | 0.000012 | ph.telegra
511 | 19557534 | 926 | 0.000035 | uk.co.pinterest
512 | 19557072 | 2402 | 0.000015 | edu.kit
513 | 19556358 | 544 | 0.000051 | it.placehold
514 | 19555528 | 2219 | 0.000016 | net.corporate-ir
515 | 19553910 | 2768 | 0.000013 | co.ello
516 | 19553426 | 881 | 0.000036 | com.arstechnica
517 | 19553018 | 1449 | 0.000022 | com.livescience
518 | 19550968 | 2150 | 0.000016 | com.gq
519 | 19550836 | 1953 | 0.000018 | uk.gov.tfl
520 | 19550254 | 210 | 0.000125 | com.iubenda
521 | 19550042 | 533 | 0.000053 | com.pixabay
522 | 19548328 | 1408 | 0.000023 | org.undp
523 | 19547668 | 807 | 0.000039 | ca.amazon
524 | 19547226 | 1020 | 0.000031 | it.smarturl
525 | 19547032 | 2645 | 0.000014 | org.icrc
526 | 19546934 | 2447 | 0.000015 | com.webbyawards
527 | 19545564 | 2423 | 0.000015 | uk.ac.kcl
528 | 19545554 | 949 | 0.000034 | edu.ucla
529 | 19544462 | 1444 | 0.000022 | link.page
530 | 19543968 | 2861 | 0.000013 | com.dummies
531 | 19541366 | 1581 | 0.000021 | org.ocks
532 | 19540748 | 65 | 0.000544 | net.typekit
533 | 19540022 | 1122 | 0.000028 | org.ilo
534 | 19538882 | 2564 | 0.000014 | com.depositphotos
535 | 19538866 | 2502 | 0.000014 | com.unilever
536 | 19536950 | 1348 | 0.000024 | org.acs
537 | 19536262 | 81 | 0.000440 | com.livestream
538 | 19535098 | 2672 | 0.000013 | org.rsf
539 | 19535076 | 489 | 0.000057 | com.adweek
540 | 19534044 | 2050 | 0.000017 | com.msnbc
541 | 19530220 | 2509 | 0.000014 | com.slidesharecdn
542 | 19530084 | 2035 | 0.000018 | com.chronicle
543 | 19529836 | 3088 | 0.000012 | com.bepress
544 | 19529580 | 2571 | 0.000014 | com.biography
545 | 19529322 | 3384 | 0.000011 | tl.de
546 | 19527886 | 332 | 0.000079 | com.typeform
547 | 19526428 | 2185 | 0.000016 | com.newrepublic
548 | 19525400 | 2303 | 0.000015 | com.thoughtco
549 | 19523856 | 606 | 0.000045 | com.samsung
550 | 19523112 | 1100 | 0.000029 | org.ohchr
551 | 19522668 | 790 | 0.000040 | com.fiverr
552 | 19521518 | 1743 | 0.000019 | io.gitlab
553 | 19521240 | 121 | 0.000262 | com.jimdo
554 | 19520292 | 1157 | 0.000027 | com.thenextweb
555 | 19520070 | 2009 | 0.000018 | fr.orange
556 | 19519618 | 3272 | 0.000012 | net.openreview
557 | 19518936 | 2294 | 0.000015 | com.channelnewsasia
558 | 19517090 | 1283 | 0.000025 | org.aarp
559 | 19516918 | 2634 | 0.000014 | org.pewsocialtrends
560 | 19516476 | 1998 | 0.000018 | com.straitstimes
561 | 19513936 | 2310 | 0.000015 | edu.nd
562 | 19510956 | 2099 | 0.000017 | com.dallasnews
563 | 19510732 | 2130 | 0.000017 | de.br
564 | 19508818 | 2278 | 0.000015 | org.fas
565 | 19508000 | 1297 | 0.000024 | org.altervista
566 | 19507978 | 256 | 0.000103 | uk.co.amazon
567 | 19507290 | 219 | 0.000121 | to.amzn
568 | 19506624 | 2835 | 0.000013 | com.thejakartapost
569 | 19505128 | 2211 | 0.000016 | gov.lbl
570 | 19504556 | 1610 | 0.000020 | de.berlin
571 | 19504362 | 1086 | 0.000029 | com.popularmechanics
572 | 19503706 | 2743 | 0.000013 | uk.ac.leeds
573 | 19503644 | 459 | 0.000061 | com.staticflickr
574 | 19503210 | 3397 | 0.000011 | org.neocities
575 | 19502358 | 2996 | 0.000012 | org.vim
576 | 19502186 | 2883 | 0.000013 | org.globalcitizen
577 | 19499450 | 572 | 0.000048 | com.deloitte
578 | 19499392 | 922 | 0.000035 | com.zoho
579 | 19498964 | 233 | 0.000113 | io.shields
580 | 19498936 | 2328 | 0.000015 | com.indianexpress
581 | 19498902 | 3889 | 0.000010 | com.stratechery
582 | 19497728 | 2819 | 0.000013 | app.web
583 | 19496358 | 3386 | 0.000011 | org.zotero
584 | 19493624 | 2939 | 0.000013 | uk.gov.scotland
585 | 19493314 | 567 | 0.000048 | com.photobucket
586 | 19491524 | 3756 | 0.000010 | com.bravesites
587 | 19490552 | 1464 | 0.000022 | org.iea
588 | 19489976 | 432 | 0.000063 | com.hp
589 | 19489954 | 2713 | 0.000013 | uk.co.timesonline
590 | 19489478 | 365 | 0.000073 | com.quantserve
591 | 19489336 | 404 | 0.000066 | com.digg
592 | 19486660 | 560 | 0.000049 | com.cisco
593 | 19486618 | 1155 | 0.000027 | uk.parliament
594 | 19485014 | 2914 | 0.000013 | com.nwsource
595 | 19485012 | 2362 | 0.000015 | com.fineartamerica
596 | 19484598 | 267 | 0.000101 | com.onesignal
597 | 19484238 | 2234 | 0.000016 | com.foreignpolicy
598 | 19484200 | 798 | 0.000040 | org.weforum
599 | 19483398 | 2990 | 0.000012 | com.thoughtworks
600 | 19483202 | 1548 | 0.000021 | com.treehugger
601 | 19482398 | 307 | 0.000087 | com.aliyuncs
602 | 19482224 | 602 | 0.000046 | org.js
603 | 19480232 | 1527 | 0.000021 | gov.uscis
604 | 19479040 | 3256 | 0.000012 | uk.ac.city
605 | 19477476 | 2077 | 0.000017 | com.washingtontimes
606 | 19477198 | 3504 | 0.000011 | com.mariadb
607 | 19476316 | 2565 | 0.000014 | org.oas
608 | 19475236 | 417 | 0.000065 | com.gitlab
609 | 19472258 | 2584 | 0.000014 | com.mathworks
610 | 19471752 | 2830 | 0.000013 | com.dezeen
611 | 19471284 | 835 | 0.000038 | com.investopedia
612 | 19470638 | 2497 | 0.000014 | uk.co.yougov
613 | 19469316 | 2934 | 0.000013 | org.heritage
614 | 19469308 | 614 | 0.000045 | com.netflix
615 | 19466252 | 3281 | 0.000011 | com.shell
616 | 19465388 | 2540 | 0.000014 | fr.paris
617 | 19464956 | 448 | 0.000061 | gov.irs
618 | 19462732 | 4088 | 0.000009 | tl.page
619 | 19461330 | 1361 | 0.000024 | com.upwork
620 | 19461170 | 462 | 0.000061 | com.sxsw
621 | 19460914 | 1255 | 0.000025 | com.digitaloceanspaces
622 | 19460548 | 4091 | 0.000009 | com.jigsy
623 | 19460066 | 861 | 0.000037 | com.venturebeat
624 | 19458418 | 1215 | 0.000026 | com.dell
625 | 19457348 | 1016 | 0.000031 | gov.fcc
626 | 19456828 | 3229 | 0.000012 | uk.co.walesonline
627 | 19456346 | 2961 | 0.000013 | org.project-syndicate
628 | 19455696 | 2024 | 0.000018 | com.fivethirtyeight
629 | 19455242 | 920 | 0.000035 | fm.last
630 | 19455056 | 2086 | 0.000017 | info.worldometers
631 | 19454252 | 931 | 0.000034 | org.mediawiki
632 | 19453670 | 2377 | 0.000015 | ly.rebrand
633 | 19453158 | 4077 | 0.000009 | net.myanimelist
634 | 19452824 | 2075 | 0.000017 | cn.gov.fmprc
635 | 19452012 | 1502 | 0.000021 | org.amnesty
636 | 19450548 | 349 | 0.000077 | com.adnxs
637 | 19449350 | 1945 | 0.000018 | com.justia
638 | 19448712 | 4019 | 0.000009 | edu.usfca
639 | 19448298 | 2705 | 0.000013 | com.monday
640 | 19446576 | 1515 | 0.000021 | ca.bc.gov
641 | 19446486 | 943 | 0.000034 | org.reactjs
642 | 19446126 | 2285 | 0.000015 | net.openid
643 | 19445904 | 383 | 0.000068 | com.newrelic
644 | 19445366 | 1363 | 0.000024 | com.imageshack
645 | 19445144 | 3568 | 0.000010 | org.globalnetworkinitiative
646 | 19443940 | 2549 | 0.000014 | com.kaggle
647 | 19443562 | 3693 | 0.000010 | com.doodlekit
648 | 19439792 | 259 | 0.000102 | com.getbootstrap
649 | 19438670 | 2831 | 0.000013 | uk.co.inews
650 | 19438312 | 3129 | 0.000012 | com.bangkokpost
651 | 19438230 | 409 | 0.000065 | com.force
652 | 19437908 | 2107 | 0.000017 | uk.ac.imperial
653 | 19435434 | 4629 | 0.000008 | net.vingle
654 | 19434150 | 1982 | 0.000018 | be.kuleuven
655 | 19434066 | 3530 | 0.000011 | com.intensedebate
656 | 19432926 | 568 | 0.000048 | com.entrepreneur
657 | 19432350 | 3518 | 0.000011 | be.blogspot
658 | 19429740 | 3166 | 0.000012 | se.blogspot
659 | 19429712 | 1318 | 0.000024 | co.lpages
660 | 19428992 | 3266 | 0.000012 | org.carnegieendowment
661 | 19428674 | 837 | 0.000038 | com.globenewswire
662 | 19428662 | 3175 | 0.000012 | is.good
663 | 19428098 | 2246 | 0.000016 | com.instructure
664 | 19427698 | 2965 | 0.000012 | net.alarabiya
665 | 19427204 | 4090 | 0.000009 | com.kongregate
666 | 19426514 | 2795 | 0.000013 | com.discovermagazine
667 | 19425746 | 2613 | 0.000014 | org.gnupg
668 | 19425518 | 556 | 0.000049 | com.visualstudio
669 | 19424130 | 191 | 0.000139 | com.atdmt
670 | 19423528 | 3773 | 0.000010 | com.openlearning
671 | 19423230 | 3794 | 0.000010 | ch.swissinfo
672 | 19421982 | 3547 | 0.000010 | com.pixar
673 | 19420080 | 2154 | 0.000016 | com.livemint
674 | 19419708 | 957 | 0.000033 | com.variety
675 | 19417142 | 2816 | 0.000013 | uk.gov.metoffice
676 | 19414346 | 2004 | 0.000018 | com.surveygizmo
677 | 19412994 | 3337 | 0.000011 | cn.globaltimes
678 | 19411212 | 929 | 0.000035 | uk.gov.legislation
679 | 19411070 | 2639 | 0.000014 | org.ballotpedia
680 | 19409736 | 243 | 0.000110 | org.whatwg
681 | 19408620 | 3148 | 0.000012 | com.coca-colacompany
682 | 19408342 | 1343 | 0.000024 | uk.gov.nationalarchives
683 | 19406168 | 2326 | 0.000015 | com.thebalancesmb
684 | 19404822 | 3145 | 0.000012 | uk.gov.companieshouse
685 | 19403088 | 3532 | 0.000011 | com.dailykos
686 | 19401008 | 165 | 0.000170 | com.yelp
687 | 19400512 | 257 | 0.000103 | com.automattic
688 | 19400270 | 4169 | 0.000009 | com.penzu
689 | 19399686 | 2489 | 0.000014 | com.bloomberglaw
690 | 19399662 | 412 | 0.000065 | org.opensource
691 | 19398126 | 1547 | 0.000021 | org.khanacademy
692 | 19397376 | 3834 | 0.000010 | com.sfweekly
693 | 19395236 | 2779 | 0.000013 | com.thumbtack
694 | 19394202 | 2880 | 0.000013 | org.royalsociety
695 | 19393684 | 1674 | 0.000020 | kr.co.google
696 | 19393678 | 2531 | 0.000014 | com.post-gazette
697 | 19393520 | 2800 | 0.000013 | org.panda
698 | 19390648 | 2421 | 0.000015 | com.thenation
699 | 19389714 | 2823 | 0.000013 | io.fabric
700 | 19388974 | 4936 | 0.000008 | org.arkive
701 | 19388756 | 2689 | 0.000013 | uk.co.bbci
702 | 19387624 | 4042 | 0.000009 | hk.edu.cityu
703 | 19387406 | 3194 | 0.000012 | com.scribblelive
704 | 19386352 | 3553 | 0.000010 | com.gimletmedia
705 | 19385872 | 3489 | 0.000011 | com.tweetmeme
706 | 19384830 | 2541 | 0.000014 | de.uni-heidelberg
707 | 19384284 | 298 | 0.000089 | ai.shortpixel
708 | 19383872 | 1920 | 0.000019 | gov.gao
709 | 19382974 | 4425 | 0.000008 | com.storeboard
710 | 19381650 | 2814 | 0.000013 | com.politifact
711 | 19380202 | 3349 | 0.000011 | org.cato
712 | 19379282 | 4889 | 0.000008 | com.uberant
713 | 19377306 | 3183 | 0.000012 | fr.lepoint
714 | 19377194 | 3809 | 0.000010 | edu.depaul
715 | 19376126 | 3844 | 0.000010 | net.thedailystar
716 | 19375590 | 406 | 0.000066 | com.aol
717 | 19375570 | 4046 | 0.000009 | edu.umt
718 | 19372794 | 1948 | 0.000018 | tv.ustream
719 | 19372628 | 1034 | 0.000031 | com.verisign
720 | 19369588 | 3279 | 0.000011 | com.theweek
721 | 19367934 | 905 | 0.000035 | com.box
722 | 19367170 | 3724 | 0.000010 | com.eklablog
723 | 19365850 | 3488 | 0.000011 | com.militarytimes
724 | 19365832 | 866 | 0.000037 | gov.uspto
725 | 19365580 | 3483 | 0.000011 | com.multiscreensite
726 | 19364098 | 3103 | 0.000012 | uk.ac.york
727 | 19359488 | 3165 | 0.000012 | org.openweathermap
728 | 19358574 | 1526 | 0.000021 | com.techrepublic
729 | 19358070 | 3315 | 0.000011 | org.jenkins-ci
730 | 19357968 | 2815 | 0.000013 | org.wnyc
731 | 19357458 | 638 | 0.000044 | gov.copyright
732 | 19356834 | 3433 | 0.000011 | com.lawfareblog
733 | 19354610 | 2357 | 0.000015 | co.pcdn
734 | 19353004 | 3263 | 0.000012 | com.nyt
735 | 19352766 | 3101 | 0.000012 | se.svt
736 | 19351866 | 1048 | 0.000030 | net.clickbank
737 | 19351546 | 3121 | 0.000012 | com.scotsman
738 | 19348720 | 1182 | 0.000027 | com.foursquare
739 | 19348660 | 1239 | 0.000026 | com.pingdom
740 | 19348048 | 2475 | 0.000014 | com.squarespace-cdn
741 | 19346678 | 2323 | 0.000015 | com.natlawreview
742 | 19346350 | 2769 | 0.000013 | org.wri
743 | 19345800 | 3430 | 0.000011 | com.bigthink
744 | 19345054 | 4132 | 0.000009 | com.newgrounds
745 | 19344692 | 3862 | 0.000010 | org.sourcewatch
746 | 19342356 | 3720 | 0.000010 | re.cli
747 | 19341788 | 3156 | 0.000012 | gov.ncjrs
748 | 19341458 | 3087 | 0.000012 | my.com.thestar
749 | 19340698 | 3307 | 0.000011 | gov.anl
750 | 19339932 | 3117 | 0.000012 | com.nationalreview
751 | 19339132 | 2597 | 0.000014 | ca.newswire
752 | 19338090 | 1603 | 0.000020 | org.webkit
753 | 19337402 | 3700 | 0.000010 | org.elasticsearch
754 | 19335276 | 928 | 0.000035 | com.hootsuite
755 | 19334936 | 300 | 0.000088 | com.caniuse
756 | 19334252 | 3236 | 0.000012 | gov.fec
757 | 19333910 | 2327 | 0.000015 | ru.rg
758 | 19333124 | 3741 | 0.000010 | org.constitutioncenter
759 | 19332102 | 1602 | 0.000020 | com.jwplayer
760 | 19331754 | 4253 | 0.000009 | com.etymonline
761 | 19331678 | 3620 | 0.000010 | it.eventbrite
762 | 19331510 | 2960 | 0.000013 | com.madmimi
763 | 19331460 | 3491 | 0.000011 | com.afp
764 | 19330192 | 1907 | 0.000019 | com.kinstacdn
765 | 19328136 | 3163 | 0.000012 | gov.ornl
766 | 19327042 | 461 | 0.000061 | com.pubmatic
767 | 19325866 | 401 | 0.000066 | gg.discord
768 | 19325518 | 1289 | 0.000025 | com.intuit
769 | 19325482 | 1168 | 0.000027 | com.ycombinator
770 | 19325258 | 3292 | 0.000011 | com.crashlytics
771 | 19324302 | 4270 | 0.000009 | com.underconsideration
772 | 19322856 | 2599 | 0.000014 | com.articulate
773 | 19322230 | 3246 | 0.000012 | de.uni-frankfurt
774 | 19321496 | 3692 | 0.000010 | uk.co.spectator
775 | 19321096 | 867 | 0.000037 | com.wikihow
776 | 19321010 | 4275 | 0.000009 | to.gplus
777 | 19320802 | 4920 | 0.000008 | pl.pastebin
778 | 19320622 | 3791 | 0.000010 | uk.co.manchestereveningnews
779 | 19319854 | 2938 | 0.000013 | edu.unh
780 | 19318976 | 2553 | 0.000014 | de.tagesschau
781 | 19318802 | 2116 | 0.000017 | gov.energystar
782 | 19318372 | 429 | 0.000063 | com.businesswire
783 | 19318050 | 829 | 0.000038 | com.moz
784 | 19314848 | 3550 | 0.000010 | org.avaaz
785 | 19314554 | 3683 | 0.000010 | com.mnn
786 | 19314476 | 1172 | 0.000027 | com.alexa
787 | 19314150 | 2332 | 0.000015 | net.vnexpress
788 | 19313268 | 348 | 0.000077 | com.constantcontact
789 | 19312732 | 3600 | 0.000010 | com.heraldscotland
790 | 19312326 | 3843 | 0.000010 | fm.audioboo
791 | 19311750 | 4481 | 0.000008 | tv.eurovision
792 | 19311646 | 974 | 0.000033 | com.fandom
793 | 19311256 | 3717 | 0.000010 | uk.ac.uea
794 | 19311174 | 3697 | 0.000010 | uk.ac.core
795 | 19310268 | 3514 | 0.000011 | com.hsbc
796 | 19310254 | 3492 | 0.000011 | org.sciencenews
797 | 19310242 | 4916 | 0.000008 | com.blackplanet
798 | 19310096 | 3289 | 0.000011 | com.realclearpolitics
799 | 19309366 | 1698 | 0.000020 | com.pastebin
800 | 19309196 | 3190 | 0.000012 | uk.org.rspb
801 | 19308322 | 1377 | 0.000023 | com.techradar
802 | 19308094 | 529 | 0.000053 | com.indeed
803 | 19307548 | 4985 | 0.000007 | dk.bloggersdelight
804 | 19307144 | 4491 | 0.000008 | com.xtgem
805 | 19306108 | 2073 | 0.000017 | ca.on.gov
806 | 19305500 | 3536 | 0.000011 | uk.co.thisismoney
807 | 19304908 | 797 | 0.000040 | gov.sec
808 | 19302330 | 1128 | 0.000028 | net.atlassian
809 | 19302240 | 3937 | 0.000009 | com.collinsdictionary
810 | 19299944 | 1479 | 0.000022 | edu.purdue
811 | 19299020 | 3179 | 0.000012 | com.wayfair
812 | 19298908 | 3611 | 0.000010 | org.chathamhouse
813 | 19297900 | 3218 | 0.000012 | org.rferl
814 | 19297216 | 397 | 0.000066 | com.skype
815 | 19296536 | 4738 | 0.000008 | edu.ualr
816 | 19296016 | 3523 | 0.000011 | org.diva-portal
817 | 19295672 | 2785 | 0.000013 | org.cfr
818 | 19294806 | 1249 | 0.000025 | com.merriam-webster
819 | 19292968 | 4835 | 0.000008 | com.designobserver
820 | 19292734 | 3399 | 0.000011 | org.pewforum
821 | 19292200 | 270 | 0.000100 | jp.co.amazon
822 | 19291468 | 3994 | 0.000009 | uk.co.dailyrecord
823 | 19290936 | 3951 | 0.000009 | edu.swarthmore
824 | 19290570 | 3339 | 0.000011 | com.ubs
825 | 19289748 | 1075 | 0.000030 | so.notion
826 | 19289742 | 2847 | 0.000013 | us.govtrack
827 | 19289236 | 1256 | 0.000025 | com.udemy
828 | 19289040 | 333 | 0.000079 | com.hackerone
829 | 19288716 | 3787 | 0.000010 | org.nationalinterest
830 | 19288626 | 3138 | 0.000012 | com.doubleclickbygoogle
831 | 19288000 | 279 | 0.000097 | de.amazon
832 | 19287244 | 2036 | 0.000018 | org.doxygen
833 | 19286840 | 1661 | 0.000020 | scot.gov
834 | 19286652 | 3933 | 0.000009 | de.berliner-zeitung
835 | 19285868 | 1519 | 0.000021 | com.billboard
836192839106810.000042com.gartner
8371928339046980.000008net.writeablog
8381928268824650.000014com.infoworld
839192820848230.000039com.sedo
8401928170032000.000012org.aei
84119280820710.000502com.oculus
8421928065215800.000021edu.ucsd
843192803963290.000081mp.mailchi
8441928028839170.000009edu.umaine
8451927922232620.000012org.iucnredlist
8461927913028270.000013com.lexology
8471927830448510.000008com.nation2
8481927815652900.000007com.anotepad
8491927805641280.000009za.co.mg
85019276824770.000467com.messenger
8511927646020830.000017org.dejure
8521927600244940.000008net.blogfreely
8531927563013020.000024org.owasp
8541927514233090.000011com.foreignaffairs
8551927509240670.000009tw.com.books
8561927491642670.000009ca.nfb
857192748223640.000073com.bitly
8581927456032250.000012org.osce
8591927402837260.000010uk.org.wwf
8601927400639710.000009org.truthout
861192731041550.000178gov.privacyshield
8621927270819810.000018edu.uci
8631927236820440.000017se.haxx
864192722888970.000036com.emarketer
8651927211045320.000008com.symbaloo
8661927150810040.000032com.playstation
8671927133821960.000016org.sundance
868192712163630.000073eu.youronlinechoices
8691927119634960.000011com.rev
8701927108040710.000009in.thewire
871192709761590.000174org.nginx
872192705289030.000036com.libsyn
8731926865024000.000015us.pa.state
874192676101460.000205me.line
8751926747852020.000007net.bravejournal
8761926738631400.000012ru.kp
8771926733440140.000009com.ecowatch
878192667005140.000055org.debian
879192663025390.000052com.gofundme
880192661949760.000033com.pcmag
8811926491441510.000009com.theoutline
8821926451243160.000009org.icj-cij
8831926362614700.000022org.coursera
8841926161020760.000017gov.healthcare
8851926062637210.000010com.iconarchive
8861925973416570.000020net.leadpages
8871925903414860.000022com.technologyreview
8881925803223670.000015ca.citizenlab
8891925788436900.000010com.governing
8901925778233220.000011com.wikidot
8911925726023850.000015org.raspberrypi
8921925645246210.000008jp.ac.kobe-u
8931925545410730.000030com.timeanddate
8941925483610960.000029com.buffer
8951925403239780.000009com.ogilvy
896192515309400.000034com.css-tricks
8971925109615010.000021com.msdn
8981925013839580.000009com.gab
8991924999436730.000010com.what3words
9001924926012410.000026com.tableau
9011924831613190.000024com.xkcd
9021924822436950.000010com.nestle
9031924767849820.000007net.postheaven
904192464284700.000060com.fc2
9051924623817950.000019com.pcworld
9061924602825890.000014mp.j
9071924575443180.000009org.kuow
9081924530039060.000009org.migrationpolicy
909192452825850.000047com.fortune
9101924432437690.000010de.morgenpost
9111924412032820.000011uk.gov.data
9121924355849520.000007cz.webgarden
9131924310021180.000017org.donorbox
9141924219239090.000009de.uni-konstanz
9151924168442180.000009org.birdlife
9161924098238750.000010org.people-press
9171924077821320.000017to.dev
918192398469060.000035org.golang
9191923873224250.000015net.noscript
9201923774212230.000026com.podbean
9211923590641300.000009com.scienceblogs
9221923570649480.000007it.clyp
9231923549833550.000011edu.fordham
9241923169640760.000009org.oyez
9251923065634410.000011com.joebiden
9261922996028670.000013com.washingtonexaminer
9271922972811150.000028com.gizmodo
9281922911227570.000013org.healthaffairs
9291922891012320.000026com.searchengineland
930192286788540.000037fm.anchor
9311922741250840.000007com.zcubes
9321922725819950.000018com.ssllabs
9331922596410720.000030org.poynter
9341922443616440.000020net.java
9351922363215140.000021edu.usc
9361922325236800.000010org.carbonbrief
9371922150251650.000007org.csgrid
938192212863080.000087jp.ameblo
9391922006415780.000021com.sun
9401922001039590.000009org.rfa
9411921858826160.000014uk.gov.defra
9421921855639120.000009com.exxonmobil
9431921810252490.000007com.topsitenet
9441921773230120.000012com.html5rocks
9451921749436600.000010ca.yelp
9461921657629400.000013com.instructables
9471921558222120.000016org.linuxfoundation
9481921541040690.000009uk.org.woodlandtrust
9491921385420580.000017org.json
950192137902140.000124com.tripadvisor
9511921249052330.000007net.squareblogs
9521921237838640.000010ru.mid
953192121702310.000113com.myshopify
9541921110833100.000011com.flippa
9551921109238500.000010com.townandcountrymag
9561921093812920.000025build.bazel
9571921081652950.000007net.werite
9581921021212400.000026com.uk
9591920974223540.000015com.storify
9601920950832800.000011org.cjr
9611920885431580.000012org.acog
9621920844839210.000009br.com.sebrae
963192083802500.000107org.icann
9641920787616350.000020fr.blogspot
965192075821220.000262com.bizjournals
9661920728034060.000011org.cites
9671920710226870.000013com.tutsplus
9681920705834090.000011tr.com.aa
9691920638011090.000028org.whatbrowser
9701920575046800.000008org.learner
9711920514434240.000011no.yr
9721920373842710.000009com.s-nbcnews
9731920315041660.000009org.spie
9741920308213350.000024com.indiegogo
975192026347080.000042com.airbnb
9761920228842170.000009com.revolut
9771920151443390.000009org.atsjournals
9781920138610330.000031com.redhat
9791920076040660.000009uk.co.zoopla
980191998263180.000084it.google
9811919924611370.000028com.windowsphone
9821919866614850.000022edu.unc
983191985084660.000060gov.fda
984191984086530.000043com.zapier
9851919827221610.000016com.gigaom
9861919731644570.000008ru.novayagazeta
9871919650419360.000018br.com.correios
9881919646841010.000009google.design
9891919535021940.000016org.eu
9901919223837580.000010com.mail-archive
9911919131044370.000008com.out
9921919100047590.000008tw.focustaiwan
9931919094642350.000009org.insideclimatenews
9941919077420380.000017com.freeprivacypolicy
9951919044242650.000009org.escardio
9961919035446630.000008com.theschooloflife
997191897662410.000111com.naver
9981918836247110.000008edu.uah
9991918823016110.000020com.nike
10001918736043190.000009edu.mtsu
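
The names in the table above are written in reverse domain name notation. As a minimal sketch, not part of the official tooling, the following Python snippet turns such a name back into the usual order. The helper name `unreverse` is hypothetical and chosen for illustration only; the sample names are copied from the table.

```python
def unreverse(name: str) -> str:
    """Turn a name in reverse domain name notation back into the usual order.

    Hypothetical helper for illustration: it only reverses the dot-separated
    labels and does not validate the name in any way.
    """
    return ".".join(reversed(name.split(".")))


# A few names copied from the table above.
for reversed_name in ["uk.ac.york", "org.jenkins-ci", "my.com.thestar"]:
    print(reversed_name, "->", unreverse(reversed_name))
```

Applying the function twice returns the original string, so the same helper can also be used to convert a regular name into the notation used in the table.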

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!