We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June/July and August 2022. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. You may also visit the projects cc-webgraph and cc-pyspark which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks. See below for a summary of changes and improvements implemented for the current web graph release.

Changes, improvements and bug fixes

  • Unicode internationalized domain names are always converted into their ASCII equivalents (IDNA). This is now ensured for node labels in the host-level webgraph (see cc-pyspark#35) and consequently also for the domain-level webgraph, where non-ASCII characters were previously replaced by question marks (see cc-webgraph#6).
  • The nodes of the domain graph are now strictly sorted lexicographically by node label (the reverse domain name). This should allow for more efficient compression of the list of domain nodes.
  • The strict sorting was implemented to address a bug (cc-webgraph#3) which could cause duplicated nodes (two or more nodes with the same label) in the domain graph.
  • The domain graph includes domain names equal to multi-part public suffixes. Previously the assumption was that names of registered domains are exactly one level below any ICANN suffix in the public suffix list, and host names which are equal to multi-part suffixes (including at least one dot) were excluded. Such host names are now included, e.g. gov.uk, freight.aero or altoadige.it. No further validation (e.g. DNS lookup) is performed, so invalid domain names may also be included. In general, the only requirement for any domain name is that it is a valid domain name string with a valid TLD or public suffix. For more details, see cc-webgraph#1.
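
As an illustration of the first item, here is a minimal sketch of converting a Unicode host name into its ASCII form. It uses Python's built-in `idna` codec, which follows the older IDNA 2003 rules; whether the graph pipeline applies exactly the same rules is an assumption. The helper name `to_ascii_host` and the sample host names are hypothetical.

```python
def to_ascii_host(host: str) -> str:
    """Convert a (possibly Unicode) host name to its ASCII/Punycode form."""
    # The built-in "idna" codec encodes each dot-separated label on its own.
    return host.encode("idna").decode("ascii")


# A host name with a non-ASCII label, and one that is already ASCII.
print(to_ascii_host("bücher.example"))
print(to_ascii_host("www.example.com"))
```

Labels that are already ASCII pass through unchanged, so the conversion can be applied to every host name without a prior check.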

Host-level graph

The graph consists of 449 million nodes and 2.69 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to span the graph. All types of links are included, including purely “technical” ones pointing to images, JavaScript libraries, web fonts, etc. However, only host names with a valid IANA TLD are used. Consequently, URLs with an IP address as host component are not taken into account for building the host-level graph.

There are 389 million dangling nodes (86.6%) and the largest strongly connected component contains 46.4 million (10.3%) nodes. Dangling nodes stem from

  • hosts that have not been crawled, yet are pointed to from a link on a crawled page
  • hosts without any links pointing to a different host name
  • hosts which returned only an error page (e.g. HTTP 404)
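
All three cases leave a node without outgoing edges. The sketch below finds such nodes in a toy edge list, assuming the usual meaning of "dangling node" (a node with no outgoing edge). The function name `dangling_nodes` and the sample data are hypothetical, and the real graph is far too large for this in-memory approach.

```python
def dangling_nodes(node_ids, edges):
    """Return the sorted ids of nodes without any outgoing edge."""
    # Every node that appears as the source of at least one edge.
    has_outlink = {from_id for from_id, _ in edges}
    return sorted(set(node_ids) - has_outlink)


# Toy graph: edges are (from_id, to_id) pairs as in the edges files.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2)]

# Node 2 is only linked to, node 3 is isolated: neither has an outgoing edge.
print(dangling_nodes(nodes, edges))
```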

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
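
The notation can be reproduced with a few lines of Python. This is a minimal sketch of the rule as described above, not the code used to build the graph; the function name `reverse_host` is hypothetical.

```python
def reverse_host(host: str) -> str:
    """Strip a leading "www." and reverse the order of the host name labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```

Because the labels are only reordered, applying the same reversal to a node label yields the host name again (without the stripped www. prefix).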

You can download the graph and the ranks of all 449 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/host/ (this requires an account on AWS). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 96 gzip-compressed files listed in two path listings: one for the nodes (vertices), one for the edges (arcs). First, download the path listing and decompress it using “gzip -d” or “gunzip”. By adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing, you get the list of URLs to download the entire graph.
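
The two steps (decompress the listing, prepend the prefix) can be sketched in Python as follows. The function name `listing_to_urls` is hypothetical, and the demo uses a small in-memory listing with made-up placeholder paths instead of downloading a real path listing.

```python
import gzip

PREFIX = "https://data.commoncrawl.org/"


def listing_to_urls(listing_gz: bytes, prefix: str = PREFIX) -> list:
    """Decompress a gzipped path listing and prefix every non-empty line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Demo listing: the two paths below are placeholders, not real file names.
sample = gzip.compress(
    b"projects/example/part-00000.txt.gz\n"
    b"projects/example/part-00001.txt.gz\n"
)
for url in listing_to_urls(sample):
    print(url)
```

For a real listing, the bytes of the downloaded `.paths.gz` file would be passed instead of `sample`.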

Download files of the Common Crawl May/Jun/Aug 2022 host-level webgraph

Size | File | Description
3.09 GB | cc-main-2022-may-jun-aug-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 32 vertices files
11.91 GB | cc-main-2022-may-jun-aug-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 64 edges files
5.76 GB | cc-main-2022-may-jun-aug-host.graph | graph in BVGraph format
2 kB | cc-main-2022-may-jun-aug-host.properties |
6.20 GB | cc-main-2022-may-jun-aug-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2022-may-jun-aug-host-t.properties |
1 kB | cc-main-2022-may-jun-aug-host.stats | WebGraph statistics
7.46 GB | cc-main-2022-may-jun-aug-host-ranks.txt.gz | harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org. Version (commit) e5ff0c7 of the public suffix list was used (commit date 2022-09-15).
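
To illustrate the aggregation step, here is a minimal sketch that maps host names to pay-level domains and counts the hosts per domain. It is not the cc-webgraph implementation: the `SUFFIXES` set is a tiny hand-picked excerpt rather than the real public suffix list, wildcard and exception rules are ignored, and the function name `pay_level_domain` is hypothetical.

```python
from collections import Counter

# Tiny hand-picked excerpt for illustration only, NOT the real public suffix list.
SUFFIXES = {"com", "uk", "co.uk", "gov.uk"}


def pay_level_domain(host, suffixes=SUFFIXES):
    """Return the pay-level domain of host, or None if no suffix matches."""
    labels = host.split(".")
    # Candidates get shorter with growing i, so the longest suffix wins.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                # The host itself is a public suffix: keep it only if it is a
                # multi-part suffix such as gov.uk (assumption based on the
                # change list above).
                return host if "." in host else None
            # The domain is the matched suffix plus one more label.
            return ".".join(labels[i - 1:])
    return None


hosts = ["www.bbc.co.uk", "news.bbc.co.uk", "example.com", "gov.uk"]
hosts_per_domain = Counter(pay_level_domain(h) for h in hosts)
print(hosts_per_domain)
```

Here the two bbc.co.uk hosts collapse into one domain node with a host count of 2, which corresponds to the "num hosts" column of the domain vertices file.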

The domain-level graph has 91 million nodes and 1.57 billion edges. Of these nodes, 45 million (50%) are dangling nodes, and the largest strongly connected component covers 37 million (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/domain/.

Download files of the Common Crawl May/Jun/Aug 2022 domain-level webgraph

Size | File | Description
0.63 GB | cc-main-2022-may-jun-aug-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.52 GB | cc-main-2022-may-jun-aug-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.77 GB | cc-main-2022-may-jun-aug-domain.graph | graph in BVGraph format
2 kB | cc-main-2022-may-jun-aug-domain.properties |
3.59 GB | cc-main-2022-may-jun-aug-domain-t.graph | transpose of the graph
2 kB | cc-main-2022-may-jun-aug-domain-t.properties |
1 kB | cc-main-2022-may-jun-aug-domain.stats | WebGraph statistics
1.96 GB | cc-main-2022-may-jun-aug-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (May/Jun/Aug 2022)

Rank | Harmonic centrality | PageRank rank | PageRank | Domain (reversed)
1 | 32914686 | 1 | 0.018077 | com.googleapis
2 | 32131562 | 3 | 0.012273 | com.facebook
3 | 31770190 | 2 | 0.015371 | com.google
4 | 27971714 | 5 | 0.007018 | com.twitter
5 | 27839074 | 7 | 0.006164 | com.youtube
6 | 27662708 | 6 | 0.006892 | org.w
7 | 27254468 | 8 | 0.005701 | com.instagram
8 | 26739486 | 4 | 0.007602 | com.googletagmanager
9 | 25688664 | 10 | 0.004673 | org.gmpg
10 | 25578590 | 9 | 0.004792 | com.gstatic
11 | 25033650 | 12 | 0.003435 | com.linkedin
12 | 24026588 | 11 | 0.004116 | com.cloudflare
13 | 23679430 | 17 | 0.002013 | com.gravatar
14 | 23526052 | 13 | 0.002488 | org.wordpress
15 | 23497052 | 24 | 0.001546 | com.pinterest
16 | 23111634 | 28 | 0.001244 | org.wikipedia
17 | 23073066 | 14 | 0.002254 | com.apple
18 | 22828954 | 25 | 0.001434 | com.wordpress
19 | 22794652 | 31 | 0.001150 | com.vimeo
20 | 22725476 | 39 | 0.000940 | be.youtu
21 | 22500412 | 18 | 0.001913 | com.bootstrapcdn
22 | 22420202 | 32 | 0.001128 | com.microsoft
23 | 22394764 | 15 | 0.002193 | net.cloudfront
24 | 22370010 | 22 | 0.001568 | com.jquery
25 | 22285798 | 23 | 0.001553 | io.polyfill
26 | 22278972 | 51 | 0.000652 | com.blogspot
27 | 22275856 | 44 | 0.000799 | gl.goo
28 | 22242208 | 35 | 0.001012 | com.amazonaws
29 | 22199000 | 47 | 0.000701 | com.amazon
30 | 22170846 | 27 | 0.001252 | net.jsdelivr
31 | 22147346 | 46 | 0.000764 | eu.europa
32 | 22143092 | 41 | 0.000874 | ly.bit
33 | 22058786 | 42 | 0.000835 | org.mozilla
34 | 22050970 | 38 | 0.000958 | com.google-analytics
35 | 22028542 | 21 | 0.001626 | com.fontawesome
36 | 21967818 | 36 | 0.001001 | com.adobe
37 | 21947388 | 20 | 0.001865 | com.github
38 | 21939440 | 94 | 0.000371 | com.tumblr
39 | 21919148 | 19 | 0.001882 | com.googleusercontent
40 | 21916910 | 49 | 0.000687 | com.wp
41 | 21896858 | 52 | 0.000647 | com.paypal
42 | 21790948 | 61 | 0.000550 | co.t
43 | 21769982 | 48 | 0.000695 | com.whatsapp
44 | 21761882 | 54 | 0.000605 | com.flickr
45 | 21753952 | 99 | 0.000356 | com.yahoo
46 | 21729404 | 69 | 0.000515 | io.github
47 | 21713518 | 133 | 0.000248 | com.nytimes
48 | 21675054 | 34 | 0.001031 | ru.yandex
49 | 21669788 | 91 | 0.000382 | com.medium
50 | 21638440 | 30 | 0.001195 | com.wixstatic
51 | 21614440 | 67 | 0.000526 | com.shopify
52 | 21601372 | 119 | 0.000315 | com.reddit
53 | 21594356 | 158 | 0.000193 | com.forbes
54 | 21576744 | 40 | 0.000925 | com.googlesyndication
55 | 21576166 | 63 | 0.000546 | org.w3
56 | 21562668 | 131 | 0.000257 | com.soundcloud
57 | 21539526 | 108 | 0.000328 | com.weebly
58 | 21506586 | 59 | 0.000571 | org.schema
59 | 21485114 | 122 | 0.000306 | org.creativecommons
60 | 21473180 | 153 | 0.000207 | gov.nih
61 | 21464862 | 179 | 0.000156 | int.who
62 | 21460442 | 65 | 0.000529 | com.vk
63 | 21421598 | 177 | 0.000158 | com.theguardian
64 | 21419292 | 203 | 0.000129 | com.cnn
65 | 21414034 | 147 | 0.000213 | org.archive
66 | 21410902 | 218 | 0.000122 | uk.co.bbc
67 | 21408640 | 50 | 0.000660 | net.doubleclick
68 | 21393638 | 62 | 0.000550 | com.unpkg
69 | 21390194 | 217 | 0.000122 | com.businessinsider
70 | 21384262 | 149 | 0.000212 | com.tiktok
71 | 21379450 | 198 | 0.000134 | com.imgur
72 | 21369906 | 106 | 0.000332 | me.wp
73 | 21362530 | 78 | 0.000407 | com.android
74 | 21361648 | 148 | 0.000213 | com.wixsite
75 | 21359636 | 56 | 0.000603 | com.addthis
76 | 21350540 | 281 | 0.000098 | com.bloomberg
77 | 21338508 | 60 | 0.000564 | com.fb
78 | 21332614 | 337 | 0.000083 | edu.stanford
79 | 21330884 | 355 | 0.000078 | com.theverge
80 | 21308772 | 57 | 0.000588 | com.macromedia
81 | 21306812 | 240 | 0.000109 | com.imdb
82 | 21305630 | 117 | 0.000324 | me.t
83 | 21301714 | 181 | 0.000154 | com.bing
84 | 21299082 | 92 | 0.000379 | com.giphy
85 | 21277366 | 300 | 0.000093 | com.bbc
86 | 21270958 | 100 | 0.000353 | com.list-manage
87 | 21266506 | 43 | 0.000827 | net.fbcdn
88 | 21265192 | 143 | 0.000218 | gle.forms
89 | 21263378 | 252 | 0.000106 | com.wsj
90 | 21237826 | 368 | 0.000075 | com.go
91 | 21236654 | 321 | 0.000087 | com.reuters
92 | 21236614 | 220 | 0.000120 | org.ietf
93 | 21233526 | 132 | 0.000253 | com.statcounter
94 | 21223988 | 93 | 0.000375 | com.stripe
95 | 21222332 | 194 | 0.000137 | uk.gov
96 | 21221406 | 302 | 0.000093 | edu.mit
97 | 21219964 | 254 | 0.000105 | org.un
98 | 21218302 | 295 | 0.000096 | edu.harvard
99 | 21218140 | 184 | 0.000151 | com.issuu
100 | 21215566 | 175 | 0.000159 | gov.cdc
101 | 21213292 | 120 | 0.000314 | de.google
102 | 21212876 | 285 | 0.000097 | com.oracle
103 | 21208848 | 150 | 0.000209 | com.ytimg
104 | 21206504 | 396 | 0.000068 | com.cnet
105 | 21204790 | 338 | 0.000083 | com.techcrunch
106 | 21203072 | 365 | 0.000075 | gov.nasa
107 | 21198090 | 157 | 0.000198 | com.dropbox
108 | 21197416 | 476 | 0.000055 | com.msn
109 | 21196722 | 249 | 0.000107 | com.twimg
110 | 21191914 | 357 | 0.000077 | com.quora
111 | 21190924 | 367 | 0.000075 | com.wired
112 | 21184172 | 289 | 0.000097 | net.slideshare
113 | 21184136 | 190 | 0.000142 | com.unsplash
114 | 21183394 | 73 | 0.000469 | com.wix
115 | 21181708 | 174 | 0.000160 | org.apache
116 | 21171102 | 455 | 0.000058 | com.googleblog
117 | 21169330 | 135 | 0.000237 | com.mailchimp
118 | 21168736 | 182 | 0.000153 | com.etsy
119 | 21167548 | 364 | 0.000075 | org.hbr
120 | 21162976 | 125 | 0.000284 | com.spotify
121 | 21159700 | 248 | 0.000107 | com.stackoverflow
122 | 21145710 | 206 | 0.000127 | com.blogger
123 | 21144796 | 397 | 0.000067 | org.arxiv
124 | 21140336 | 292 | 0.000096 | com.slack
125 | 21139536 | 270 | 0.000101 | net.researchgate
126 | 21138540 | 263 | 0.000104 | uk.co.amazon
127 | 21136788 | 345 | 0.000080 | org.npr
128 | 21134310 | 374 | 0.000073 | com.example
129 | 21128534 | 156 | 0.000200 | us.zoom
130 | 21128018 | 236 | 0.000110 | com.washingtonpost
131 | 21124494 | 363 | 0.000076 | com.appspot
132 | 21117290 | 124 | 0.000298 | com.ft
133 | 21115674 | 325 | 0.000086 | com.cnbc
134 | 21115136 | 309 | 0.000091 | com.wiley
135 | 21112604 | 356 | 0.000078 | com.nature
136 | 21104960 | 510 | 0.000052 | edu.berkeley
137 | 21100090 | 483 | 0.000055 | com.myspace
138 | 21095354 | 216 | 0.000122 | com.outlook
139 | 21093680 | 298 | 0.000095 | org.acm
140 | 21091496 | 155 | 0.000203 | com.weibo
141 | 21088936 | 152 | 0.000208 | org.networkadvertising
142 | 21080382 | 572 | 0.000047 | com.cbsnews
143 | 21079152 | 279 | 0.000099 | org.gnu
144 | 21074760 | 437 | 0.000061 | uk.co.telegraph
145 | 21073516 | 68 | 0.000520 | com.godaddy
146 | 21068068 | 291 | 0.000096 | uk.co.google
147 | 21065050 | 130 | 0.000266 | com.youtube-nocookie
148 | 21065030 | 195 | 0.000136 | org.wikimedia
149 | 21062066 | 393 | 0.000069 | com.usatoday
150 | 21061852 | 645 | 0.000041 | com.intel
151 | 21052752 | 478 | 0.000055 | com.goodreads
152 | 21048112 | 378 | 0.000072 | com.time
153 | 21041138 | 481 | 0.000055 | com.theatlantic
154 | 21040374 | 633 | 0.000042 | com.box
155 | 21038502 | 274 | 0.000100 | com.squarespace
156 | 21031308 | 204 | 0.000129 | com.eventbrite
157 | 21030242 | 37 | 0.000977 | com.qq
158 | 21030040 | 185 | 0.000150 | com.yelp
159 | 21025664 | 137 | 0.000230 | com.opera
160 | 21022330 | 360 | 0.000076 | ee.linktr
161 | 21022182 | 1141 | 0.000026 | com.wikia
162 | 21020670 | 349 | 0.000079 | com.springer
163 | 21017524 | 465 | 0.000056 | com.latimes
164 | 21017202 | 170 | 0.000165 | com.zendesk
165 | 21015596 | 424 | 0.000062 | com.huffingtonpost
166 | 21014202 | 162 | 0.000185 | org.ampproject
167 | 21011874 | 574 | 0.000046 | com.indiatimes
168 | 21009830 | 145 | 0.000217 | info.aboutads
169 | 21006182 | 886 | 0.000035 | com.qz
170 | 21005110 | 704 | 0.000039 | org.chromium
171 | 21003538 | 682 | 0.000040 | com.buzzfeed
172 | 21000898 | 221 | 0.000120 | org.doi
173 | 20999442 | 585 | 0.000045 | com.vice
174 | 20989458 | 1116 | 0.000027 | com.thenextweb
175 | 20987914 | 304 | 0.000092 | com.typeform
176 | 20983612 | 261 | 0.000104 | com.sciencedirect
177 | 20982702 | 506 | 0.000053 | edu.cornell
178 | 20982294 | 544 | 0.000049 | com.mashable
179 | 20977172 | 626 | 0.000043 | com.scribd
180 | 20973696 | 523 | 0.000051 | edu.yale
181 | 20971220 | 501 | 0.000053 | uk.co.independent
182 | 20970866 | 258 | 0.000105 | net.behance
183 | 20970776 | 679 | 0.000040 | com.economist
184 | 20968290 | 747 | 0.000037 | edu.upenn
185 | 20964282 | 278 | 0.000099 | org.pewresearch
186 | 20960828 | 545 | 0.000049 | com.cisco
187 | 20960582 | 451 | 0.000058 | com.bigcommerce
188 | 20956062 | 564 | 0.000047 | com.psychologytoday
189 | 20942726 | 513 | 0.000052 | com.fortune
190 | 20942618 | 193 | 0.000139 | page.g
191 | 20940302 | 382 | 0.000071 | com.gitlab
192 | 20939136 | 462 | 0.000057 | uk.co.dailymail
193 | 20936146 | 432 | 0.000061 | com.pixabay
194 | 20933922 | 306 | 0.000091 | com.tinyurl
195 | 20932536 | 497 | 0.000053 | com.deloitte
196 | 20932040 | 956 | 0.000031 | com.evernote
197 | 20925436 | 542 | 0.000049 | io.codepen
198 | 20924450 | 212 | 0.000125 | com.calendly
199 | 20923226 | 694 | 0.000039 | com.vox
200 | 20919494 | 731 | 0.000038 | com.git-scm
201 | 20918582 | 610 | 0.000044 | org.unesco
202 | 20917448 | 1008 | 0.000030 | com.about
203 | 20916974 | 71 | 0.000469 | net.facebook
204 | 20915832 | 571 | 0.000047 | org.weforum
205 | 20915084 | 419 | 0.000062 | com.w3schools
206 | 20914978 | 326 | 0.000086 | com.typepad
207 | 20911268 | 515 | 0.000052 | com.squareup
208 | 20907742 | 904 | 0.000034 | com.arstechnica
209 | 20900952 | 473 | 0.000055 | com.nbcnews
210 | 20899924 | 373 | 0.000074 | co.ibb
211 | 20899532 | 632 | 0.000042 | com.withgoogle
212 | 20898928 | 809 | 0.000036 | edu.washington
213 | 20896616 | 521 | 0.000051 | com.inc
214 | 20891820 | 898 | 0.000034 | uk.ac.cam
215 | 20886310 | 405 | 0.000066 | com.sagepub
216 | 20878134 | 543 | 0.000049 | fm.anchor
217 | 20876826 | 683 | 0.000040 | com.apnews
218 | 20875670 | 967 | 0.000031 | com.slate
219 | 20875650 | 442 | 0.000059 | gov.whitehouse
220 | 20872664 | 689 | 0.000040 | com.venturebeat
221 | 20871102 | 530 | 0.000050 | com.pexels
222 | 20866632 | 242 | 0.000109 | org.iana
223 | 20865428 | 260 | 0.000105 | de.amazon
224 | 20862054 | 549 | 0.000048 | gov.noaa
225 | 20860884 | 755 | 0.000037 | me.about
226 | 20858432 | 33 | 0.001073 | com.baidu
227 | 20856406 | 1312 | 0.000023 | org.eclipse
228 | 20854214 | 609 | 0.000044 | com.mysql
229 | 20847014 | 244 | 0.000108 | com.live
230 | 20846064 | 654 | 0.000041 | com.nationalgeographic
231 | 20844358 | 876 | 0.000035 | edu.asu
232 | 20842882 | 299 | 0.000094 | com.ibm
233 | 20839080 | 196 | 0.000136 | jp.co.google
234 | 20835838 | 351 | 0.000078 | com.dribbble
235 | 20835472 | 716 | 0.000038 | ca.cbc
236 | 20828032 | 558 | 0.000048 | org.worldbank
237 | 20827500 | 1278 | 0.000023 | com.nike
238 | 20814914 | 459 | 0.000057 | gov.fda
239 | 20813098 | 603 | 0.000044 | org.pbs
240 | 20811434 | 586 | 0.000045 | gov.loc
241 | 20810490 | 467 | 0.000056 | gov.usda
242 | 20810284 | 485 | 0.000054 | com.gofundme
243 | 20807808 | 316 | 0.000088 | com.feedburner
244 | 20807006 | 329 | 0.000084 | net.windows
245 | 20805276 | 1132 | 0.000027 | com.hollywoodreporter
246 | 20804930 | 161 | 0.000187 | com.staticflickr
247 | 20804690 | 1003 | 0.000030 | org.greenpeace
248 | 20802310 | 492 | 0.000054 | com.tandfonline
249 | 20802308 | 339 | 0.000081 | eu.youronlinechoices
250 | 20801724 | 992 | 0.000031 | app.netlify
251 | 20801376 | 1281 | 0.000023 | com.billboard
252 | 20799338 | 642 | 0.000042 | com.newyorker
253 | 20798194 | 875 | 0.000035 | edu.wisc
254 | 20796936 | 702 | 0.000039 | au.net.abc
255 | 20796272 | 916 | 0.000033 | org.pypi
256 | 20795900 | 176 | 0.000159 | com.office
257 | 20795318 | 1266 | 0.000024 | com.technologyreview
258 | 20784974 | 477 | 0.000055 | com.theconversation
259 | 20782898 | 887 | 0.000035 | org.sciencemag
260 | 20782640 | 253 | 0.000105 | com.jotform
261 | 20779480 | 984 | 0.000031 | com.gizmodo
262 | 20778708 | 673 | 0.000040 | org.cambridge
263 | 20777714 | 1294 | 0.000023 | com.500px
264 | 20777238 | 730 | 0.000038 | com.walmart
265 | 20775900 | 525 | 0.000051 | com.oup
266 | 20773216 | 608 | 0.000044 | com.xinhuanet
267 | 20772144 | 429 | 0.000061 | com.getpocket
268 | 20770590 | 1181 | 0.000025 | edu.umd
269 | 20768772 | 500 | 0.000053 | gov.epa
270 | 20767862 | 709 | 0.000039 | org.bitbucket
271 | 20767390 | 1144 | 0.000026 | edu.purdue
272 | 20763440 | 1383 | 0.000022 | ms.1drv
273 | 20762974 | 1084 | 0.000028 | co.elastic
274 | 20760120 | 891 | 0.000034 | org.semver
275 | 20755544 | 430 | 0.000061 | org.debian
276 | 20753630 | 1308 | 0.000023 | org.kernel
277 | 20749768 | 757 | 0.000037 | com.britannica
278 | 20749716 | 963 | 0.000031 | com.nypost
279 | 20747130 | 638 | 0.000042 | com.elpais
280 | 20744652 | 929 | 0.000032 | com.foxnews
281 | 20738360 | 502 | 0.000053 | com.dailymotion
282 | 20736612 | 1154 | 0.000026 | com.sky
283 | 20735678 | 1000 | 0.000030 | com.uk
284 | 20729688 | 246 | 0.000108 | com.wpengine
285 | 20728890 | 1623 | 0.000019 | com.googlesource
286 | 20726846 | 1007 | 0.000030 | edu.princeton
287 | 20725440 | 548 | 0.000048 | gov.house
288 | 20722436 | 592 | 0.000045 | com.mozilla
289 | 20721772 | 86 | 0.000393 | com.wsimg
290 | 20721658 | 1404 | 0.000021 | com.over-blog
291 | 20718906 | 488 | 0.000054 | com.ted
292 | 20716838 | 1660 | 0.000018 | com.lego
293 | 20715928 | 754 | 0.000037 | gov.justice
294 | 20714832 | 1005 | 0.000030 | uk.co.guardian
295 | 20713856 | 425 | 0.000062 | com.arcgis
296 | 20713486 | 1319 | 0.000023 | com.digitaltrends
297 | 20711750 | 795 | 0.000036 | edu.umich
298 | 20710650 | 428 | 0.000061 | org.openstreetmap
299 | 20709586 | 241 | 0.000109 | net.sourceforge
300 | 20708624 | 947 | 0.000032 | com.ssrn
301 | 20703088 | 1697 | 0.000018 | org.usenix
302 | 20700354 | 386 | 0.000070 | com.netdna-ssl
303 | 20698318 | 935 | 0.000032 | com.ggpht
304 | 20697518 | 232 | 0.000113 | com.amazon-adsystem
305 | 20696838 | 314 | 0.000090 | tv.twitch
306 | 20696330 | 950 | 0.000032 | uk.co.blogspot
307 | 20696136 | 1416 | 0.000021 | com.hatenablog
308 | 20692884 | 1149 | 0.000026 | co.g
309 | 20691964 | 230 | 0.000114 | gov.ca
310 | 20689588 | 801 | 0.000036 | com.politico
311 | 20689246 | 1315 | 0.000023 | com.socialmediatoday
312 | 20686440 | 728 | 0.000038 | org.change
313 | 20685528 | 239 | 0.000110 | uk.org.ico
314 | 20685498 | 223 | 0.000119 | jp.co.yahoo
315 | 20685212 | 588 | 0.000045 | uk.gov.service
316 | 20684354 | 171 | 0.000162 | com.rawgit
317 | 20684232 | 280 | 0.000098 | net.azureedge
318 | 20681826 | 1210 | 0.000025 | io.itch
319 | 20680346 | 1318 | 0.000023 | de.mpg
320 | 20678714 | 1552 | 0.000019 | com.euronews
321 | 20676490 | 964 | 0.000031 | edu.jhu
322 | 20676186 | 940 | 0.000032 | edu.umn
323 | 20675058 | 531 | 0.000050 | site.business
324 | 20672708 | 169 | 0.000166 | com.addtoany
325 | 20671774 | 474 | 0.000055 | gov.hhs
326 | 20670114 | 412 | 0.000064 | com.ebay
327 | 20668938 | 1550 | 0.000019 | com.urbandictionary
328 | 20664866 | 1182 | 0.000025 | com.axios
329 | 20664008 | 1242 | 0.000024 | org.semanticscholar
330 | 20663034 | 1103 | 0.000027 | com.udemy
331 | 20662500 | 1395 | 0.000021 | com.reverbnation
332 | 20659858 | 1505 | 0.000020 | edu.indiana
333 | 20656824 | 1481 | 0.000020 | au.com.news
334 | 20654924 | 1079 | 0.000028 | edu.uchicago
335 | 20654274 | 752 | 0.000037 | org.fao
336 | 20653112 | 622 | 0.000043 | gov.census
337 | 20652698 | 1178 | 0.000025 | net.speedtest
338 | 20650808 | 1771 | 0.000017 | org.phys
339 | 20650016 | 74 | 0.000424 | net.akamaihd
340 | 20647938 | 229 | 0.000115 | com.hubspot
341 | 20642594 | 995 | 0.000030 | com.scientificamerican
342 | 20641466 | 1328 | 0.000023 | com.nymag
343 | 20638870 | 1788 | 0.000017 | com.martinfowler
344 | 20638382 | 1663 | 0.000018 | edu.gatech
345 | 20637680 | 555 | 0.000048 | com.kickstarter
346 | 20635558 | 187 | 0.000146 | com.xing
347 | 20635506 | 1129 | 0.000027 | org.wiktionary
348 | 20634592 | 1042 | 0.000029 | edu.utexas
349 | 20633912 | 2314 | 0.000015 | com.flipboard
350 | 20633634 | 567 | 0.000047 | com.snapchat
351 | 20633246 | 3204 | 0.000011 | com.openai
352 | 20630316 | 1423 | 0.000021 | ch.ethz
353 | 20629522 | 1420 | 0.000021 | com.businessweek
354 | 20629100 | 873 | 0.000035 | watch.fb
355 | 20626018 | 154 | 0.000206 | com.sharethis
356 | 20625672 | 948 | 0.000032 | com.timeanddate
357 | 20625036 | 720 | 0.000038 | org.d3js
358 | 20624578 | 1744 | 0.000017 | com.itv
359 | 20620690 | 1267 | 0.000024 | uk.ac.ucl
360 | 20618834 | 1455 | 0.000020 | uk.co.metro
361 | 20617718 | 320 | 0.000087 | com.statista
362 | 20617602 | 529 | 0.000050 | com.googlecode
363 | 20617378 | 1147 | 0.000026 | com.jetbrains
364 | 20617292 | 614 | 0.000044 | org.ohchr
365 | 20617266 | 915 | 0.000033 | de.spiegel
366 | 20616654 | 472 | 0.000055 | com.meetup
367 | 20616580 | 322 | 0.000086 | com.disqus
368 | 20615966 | 399 | 0.000067 | com.optimizely
369 | 20615414 | 2802 | 0.000013 | com.diigo
370 | 20615048 | 287 | 0.000097 | jp.ne.hatena
371 | 20614360 | 1285 | 0.000023 | com.smithsonianmag
372 | 20614098 | 1152 | 0.000026 | com.scmp
373 | 20614010 | 1110 | 0.000027 | com.foursquare
374 | 20610490 | 2636 | 0.000014 | blog.home
375 | 20610202 | 2020 | 0.000016 | com.knowyourmeme
376 | 20607856 | 353 | 0.000078 | net.themeforest
377 | 20607506 | 733 | 0.000038 | au.gov.nsw
378 | 20606218 | 1078 | 0.000028 | com.chicagotribune
379 | 20603548 | 1164 | 0.000026 | au.com.smh
380 | 20603248 | 1589 | 0.000019 | uk.co.express
381 | 20599986 | 1121 | 0.000027 | edu.nyu
382 | 20599508 | 268 | 0.000102 | com.npmjs
383 | 20598566 | 641 | 0.000042 | gov.senate
384 | 20594974 | 639 | 0.000042 | com.zdnet
385 | 20594264 | 1128 | 0.000027 | link.page
386 | 20591568 | 968 | 0.000031 | com.usps
387 | 20588732 | 890 | 0.000035 | gov.congress
388 | 20586888 | 293 | 0.000096 | com.eepurl
389 | 20585314 | 1002 | 0.000030 | com.history
390 | 20584024 | 677 | 0.000040 | com.pinimg
391 | 20582266 | 141 | 0.000221 | com.paypalobjects
392 | 20581216 | 66 | 0.000528 | com.googleadservices
393 | 20580344 | 450 | 0.000058 | es.google
394 | 20579052 | 2736 | 0.000014 | edu.byu
395 | 20577748 | 899 | 0.000034 | au.com.google
396 | 20577588 | 1450 | 0.000021 | uk.co.standard
397 | 20576632 | 711 | 0.000039 | com.istockphoto
398 | 20572810 | 97 | 0.000357 | net.jsfiddle
399 | 20572202 | 283 | 0.000097 | me.telegram
400 | 20568536 | 1333 | 0.000022 | cn.com.chinadaily
401 | 20568322 | 552 | 0.000048 | ca.google
402 | 20567936 | 1174 | 0.000025 | de.bild
403 | 20566704 | 1394 | 0.000022 | com.producthunt
404 | 20566074 | 392 | 0.000069 | com.proofpoint
405 | 20564788 | 955 | 0.000031 | edu.si
406 | 20562416 | 635 | 0.000042 | org.oecd
407 | 20559584 | 1479 | 0.000020 | ca.ubc
408 | 20559174 | 1467 | 0.000020 | com.wattpad
409 | 20558142 | 2132 | 0.000015 | app.web
410 | 20557956 | 888 | 0.000035 | google.blog
411 | 20557880 | 1095 | 0.000028 | com.dw
412 | 20554318 | 719 | 0.000038 | gov.archives
413 | 20553172 | 1491 | 0.000020 | com.buzzfeednews
414 | 20552996 | 503 | 0.000053 | nl.google
415 | 20551926 | 1921 | 0.000016 | com.mystrikingly
416 | 20551384 | 458 | 0.000057 | com.criteo
417 | 20550866 | 1035 | 0.000029 | uk.co.thetimes
418 | 20549656 | 352 | 0.000078 | com.prnewswire
419 | 20548982 | 1463 | 0.000020 | uk.ac.lse
420 | 20548708 | 974 | 0.000031 | in.co.google
421 | 20548238 | 380 | 0.000071 | com.sohu
422 | 20544956 | 1448 | 0.000021 | uk.co.wired
423 | 20544686 | 389 | 0.000069 | com.atlassian
424 | 20544326 | 359 | 0.000077 | net.php
425 | 20542034 | 527 | 0.000050 | com.matterport
426 | 20540666 | 1638 | 0.000018 | de.ebay
427 | 20536506 | 777 | 0.000036 | com.livejournal
428 | 20535454 | 328 | 0.000085 | ru.ok
429 | 20535130 | 1059 | 0.000029 | gov.treasury
430 | 20534250 | 1194 | 0.000025 | com.sun
431 | 20533698 | 1787 | 0.000017 | com.channel4
432 | 20532900 | 381 | 0.000071 | net.imgix
433 | 20532638 | 1932 | 0.000016 | gov.cia
434 | 20532370 | 1054 | 0.000029 | org.telegram
435 | 20531500 | 1053 | 0.000029 | uk.parliament
436 | 20531096 | 2759 | 0.000013 | ph.telegra
437 | 20530556 | 1509 | 0.000020 | uk.co.thesun
438 | 20529928 | 593 | 0.000045 | edu.cmu
439 | 20529694 | 1070 | 0.000028 | int.coe
440 | 20528058 | 494 | 0.000053 | com.media-amazon
441 | 20528052 | 1814 | 0.000017 | com.hindustantimes
442 | 20527716 | 919 | 0.000033 | com.iconfinder
443 | 20526664 | 1004 | 0.000030 | org.jstor
444 | 20525290 | 1590 | 0.000019 | com.straitstimes
445 | 20524680 | 1767 | 0.000017 | edu.tufts
446 | 20523352 | 418 | 0.000062 | com.elsevier
447 | 20520868 | 490 | 0.000054 | ru.gov
448 | 20520208 | 1083 | 0.000028 | gov.fbi
449 | 20516310 | 1371 | 0.000022 | edu.duke
450 | 20514968 | 408 | 0.000065 | com.adroll
451 | 20514226 | 1344 | 0.000022 | int.itu
452 | 20513026 | 1382 | 0.000022 | de.zeit
453 | 20512422 | 1654 | 0.000018 | com.newscientist
454 | 20511574 | 372 | 0.000074 | com.githubusercontent
455 | 20511532 | 1454 | 0.000021 | com.unity3d
456 | 20509014 | 1712 | 0.000018 | org.maven
457 | 20508604 | 988 | 0.000031 | de.focus
458 | 20508282 | 2525 | 0.000015 | com.storify
459 | 20506534 | 1475 | 0.000020 | com.irishtimes
460 | 20506474 | 627 | 0.000043 | gov.state
461 | 20505268 | 705 | 0.000039 | uk.nhs
462 | 20505188 | 1711 | 0.000018 | com.mercurynews
463 | 20505146 | 1196 | 0.000025 | edu.unc
464 | 20504400 | 311 | 0.000090 | com.mapbox
465 | 20503420 | 600 | 0.000044 | net.ctfassets
466 | 20503084 | 1406 | 0.000021 | jp.ne.goo
467 | 20501372 | 1690 | 0.000018 | org.propublica
468 | 20499916 | 900 | 0.000034 | gov.sba
469 | 20499600 | 2765 | 0.000013 | me.ogp
470 | 20498960 | 1541 | 0.000020 | com.mcafee
471 | 20498292 | 1564 | 0.000019 | com.nydailynews
472 | 20497012 | 1322 | 0.000023 | org.unhcr
473 | 20492506 | 1980 | 0.000016 | com.csmonitor
474 | 20491416 | 1645 | 0.000018 | ca.mcgill
475 | 20490972 | 496 | 0.000053 | org.python
476 | 20488474 | 259 | 0.000105 | gg.discord
477 | 20487728 | 3569 | 0.000010 | net.docdroid
478 | 20485818 | 1881 | 0.000016 | app.vercel
479 | 20484976 | 2557 | 0.000015 | com.instructure
480 | 20483828 | 1263 | 0.000024 | ch.ipcc
481 | 20481150 | 1943 | 0.000016 | io.gitlab
482 | 20481144 | 267 | 0.000102 | com.aliyuncs
483 | 20480398 | 1963 | 0.000016 | com.thoughtco
484 | 20478522 | 1025 | 0.000030 | gov.dhs
485 | 20477934 | 1635 | 0.000019 | com.lenovo
486 | 20477440 | 1124 | 0.000027 | gov.usgs
487 | 20475314 | 1037 | 0.000029 | org.ilo
488 | 20472878 | 1246 | 0.000024 | org.hrw
489 | 20472770 | 95 | 0.000363 | me.wa
490 | 20472152 | 453 | 0.000058 | com.samsung
491 | 20470420 | 142 | 0.000219 | com.salesforce
492 | 20467196 | 2818 | 0.000013 | com.oxforddictionaries
493 | 20466550 | 2539 | 0.000015 | au.com.sbs
494 | 20465652 | 436 | 0.000061 | com.filesusr
495 | 20464104 | 2051 | 0.000016 | com.brave
496 | 20462132 | 1107 | 0.000027 | com.thehill
497 | 20462008 | 1249 | 0.000024 | com.aljazeera
498 | 20461328 | 1327 | 0.000023 | com.brightcove
499 | 20460532 | 780 | 0.000036 | com.thinkwithgoogle
500 | 20460298 | 576 | 0.000046 | org.worldwildlife
501 | 20457830 | 2834 | 0.000013 | sg.edu.nus
502 | 20455892 | 435 | 0.000061 | com.visualstudio
503 | 20454454 | 3817 | 0.000009 | com.minds
504 | 20454024 | 1029 | 0.000029 | edu.brookings
505 | 20452798 | 1088 | 0.000028 | sg.com.google
506 | 20452018 | 296 | 0.000095 | gov.ftc
507 | 20451250 | 2081 | 0.000016 | com.rt
508 | 20450574 | 1335 | 0.000022 | de.welt
509 | 20450326 | 889 | 0.000035 | com.fandom
510 | 20449110 | 1307 | 0.000023 | de.sueddeutsche
511 | 20448850 | 498 | 0.000053 | com.fastcompany
512 | 20448284 | 768 | 0.000037 | com.oreilly
513 | 20448152 | 3182 | 0.000011 | cc.uxdesign
514 | 20447408 | 905 | 0.000034 | com.deviantart
515 | 20442684 | 449 | 0.000058 | com.ssl-images-amazon
516 | 20442572 | 2891 | 0.000013 | org.accessnow
517 | 20442500 | 3809 | 0.000009 | org.edublogs
518 | 20441072 | 159 | 0.000192 | com.jimdo
519 | 20439556 | 2245 | 0.000015 | tl.we
520 | 20438646 | 3143 | 0.000012 | com.instapaper
521 | 20437842 | 208 | 0.000125 | ru.mail
522 | 20436396 | 457 | 0.000057 | com.patreon
523 | 20436198 | 2841 | 0.000013 | com.bloglovin
524 | 20434984 | 1539 | 0.000020 | com.firebaseapp
525 | 20432342 | 3587 | 0.000010 | com.pearltrees
526 | 20429974 | 2565 | 0.000015 | edu.oregonstate
527 | 20428120 | 369 | 0.000074 | com.surveymonkey
528 | 20426222 | 403 | 0.000066 | com.businesswire
529 | 20425900 | 2907 | 0.000013 | org.wikibooks
530 | 20423662 | 1108 | 0.000027 | de.stern
531 | 20423310 | 1653 | 0.000018 | com.warnerbros
532 | 20418994 | 1407 | 0.000021 | be.google
533 | 20418380 | 1148 | 0.000026 | ly.rebrand
534 | 20416106 | 1913 | 0.000016 | edu.ucsb
535 | 20415774 | 598 | 0.000044 | com.airbnb
536 | 20414102 | 98 | 0.000356 | com.messenger
537 | 20413100 | 1566 | 0.000019 | org.rfc-editor
538 | 20413010 | 303 | 0.000093 | net.secureservercdn
539 | 20412834 | 1911 | 0.000016 | co.carrd
540 | 20412638 | 2555 | 0.000015 | it.scoop
541 | 20411920 | 787 | 0.000036 | com.zoho
542 | 20411722 | 539 | 0.000050 | com.gmail
543 | 20411284 | 923 | 0.000033 | com.thelancet
544 | 20410478 | 2023 | 0.000016 | com.dictionary
545 | 20408662 | 4662 | 0.000008 | com.folkd
546 | 20408416 | 953 | 0.000032 | edu.psu
547 | 20408248 | 1975 | 0.000016 | org.documentcloud
548 | 20407678 | 1203 | 0.000025 | org.undp
549 | 20406464 | 697 | 0.000039 | io.readthedocs
550 | 20406272 | 1403 | 0.000021 | net.codecanyon
551 | 20405676 | 3142 | 0.000012 | com.hubpages
552 | 20403958 | 640 | 0.000042 | com.entrepreneur
553 | 20402310 | 1855 | 0.000017 | com.france24
554 | 20400524 | 237 | 0.000110 | to.amzn
555 | 20399576 | 2550 | 0.000015 | gov.lbl
556 | 20397996 | 3296 | 0.000011 | google.ai
557 | 20397392 | 2812 | 0.000013 | com.aboutamazon
558 | 20395738 | 1284 | 0.000023 | com.snopes
559 | 20394748 | 1415 | 0.000021 | int.unfccc
560 | 20394362 | 954 | 0.000032 | com.ubuntu
561 | 20394264 | 213 | 0.000125 | com.aspnetcdn
562 | 20393036 | 561 | 0.000047 | com.steampowered
563 | 20392344 | 3041 | 0.000012 | com.dreamstime
564 | 20391666 | 1527 | 0.000020 | gov.defense
565 | 20390246 | 1829 | 0.000017 | org.iea
566 | 20387722 | 2991 | 0.000012 | com.oregonlive
567 | 20383972 | 3395 | 0.000011 | org.neocities
568 | 20383752 | 1652 | 0.000018 | io.ghost
569 | 20379568 | 2625 | 0.000014 | org.nature
570 | 20378934 | 1180 | 0.000025 | com.prweb
571 | 20378242 | 565 | 0.000047 | com.netflix
572 | 20377878 | 1459 | 0.000020 | mil.army
573 | 20377388 | 499 | 0.000053 | org.nodejs
574 | 20377144 | 2229 | 0.000015 | uk.bl
575 | 20377008 | 2049 | 0.000016 | org.archlinux
576 | 20376392 | 1119 | 0.000027 | com.dell
577 | 20376074 | 3049 | 0.000012 | org.paho
578 | 20375762 | 2103 | 0.000016 | com.thefreedictionary
579 | 20373064 | 701 | 0.000039 | com.docker
580 | 20372496 | 2721 | 0.000014 | org.computer
581 | 20372436 | 2650 | 0.000014 | com.googlegroups
582 | 20370720 | 2233 | 0.000015 | org.ap
583 | 20369470 | 3116 | 0.000012 | com.webbyawards
584 | 20369394 | 138 | 0.000229 | me.line
585 | 20369374 | 606 | 0.000044 | com.investopedia
586 | 20369154 | 3126 | 0.000012 | org.scala-lang
587 | 20368348 | 2738 | 0.000014 | com.msnbc
588 | 20365992 | 1547 | 0.000019 | ca.sfu
589 | 20363512 | 1764 | 0.000017 | com.patch
590 | 20362914 | 1239 | 0.000024 | net.clickbank
591 | 20362088 | 3223 | 0.000011 | de.chip
592 | 20359640 | 3207 | 0.000011 | org.vim
593 | 20357948 | 936 | 0.000032 | org.js
594 | 20357518 | 1390 | 0.000022 | io.shields
595 | 20357372 | 2965 | 0.000012 | org.rsf
596 | 20352942 | 1798 | 0.000017 | gov.usembassy
597 | 20351154 | 1385 | 0.000022 | com.mixpanel
598 | 20350540 | 102 | 0.000349 | com.uservoice
599 | 20350118 | 4086 | 0.000009 | com.bravesites
600 | 20350052 | 2739 | 0.000014 | edu.iastate
601 | 20349940 | 3202 | 0.000011 | com.slides
602 | 20349860 | 893 | 0.000034 | com.office365
603 | 20349118 | 3443 | 0.000010 | org.aclweb
604 | 20349104 | 3375 | 0.000011 | org.google
605 | 20347138 | 3411 | 0.000011 | uk.co.yougov
606 | 20345188 | 579 | 0.000046 | org.unicef
607 | 20345092 | 3195 | 0.000011 | com.dummies
608 | 20345072 | 4395 | 0.000008 | it.justpaste
609 | 20344796 | 3394 | 0.000011 | org.globalcitizen
610 | 20344590 | 1874 | 0.000016 | ca.globalnews
611 | 20343884 | 507 | 0.000053 | com.fc2
612 | 20342034 | 524 | 0.000051 | com.adweek
613 | 20341362 | 2867 | 0.000013 | jp.co.japantimes
614 | 20340504 | 1270 | 0.000023 | com.loom
615 | 20339376 | 937 | 0.000032 | com.digitaloceanspaces
616 | 20339258 | 72 | 0.000469 | com.oculus
617 | 20339126 | 744 | 0.000038 | uk.co.pinterest
618 | 20338770 | 1091 | 0.000028 | com.webs
619 | 20338486 | 3416 | 0.000011 | com.thecvf
620 | 20332428 | 2657 | 0.000014 | ca.ualberta
621 | 20331874 | 2252 | 0.000015 | com.channelnewsasia
622 | 20331560 | 2968 | 0.000012 | in.businessinsider
623 | 20330608 | 982 | 0.000031 | org.mediawiki
624 | 20330438 | 1525 | 0.000020 | com.bol
625 | 20328810 | 2325 | 0.000015 | com.foreignpolicy
626 | 20328038 | 557 | 0.000048 | com.digg
627 | 20326124 | 288 | 0.000097 | com.bandcamp
628 | 20325846 | 959 | 0.000031 | com.variety
629 | 20325100 | 1293 | 0.000023 | org.imf
630 | 20324962 | 1130 | 0.000027 | ly.cutt
631 | 20323872 | 3221 | 0.000011 | org.freedomhouse
632 | 20323384 | 1746 | 0.000017 | us.mn.state
633 | 20323358 | 4970 | 0.000007 | com.sendspace
634 | 20322480 | 4337 | 0.000008 | org.marxists
635 | 20322396 | 64 | 0.000540 | com.trustpilot
636 | 20321834 | 332 | 0.000084 | me.fb
637 | 20320766 | 1699 | 0.000018 | com.ipsos
638 | 20320202 | 1698 | 0.000018 | gov.uscis
639 | 20319680 | 211 | 0.000125 | org.whatwg
640 | 20319120 | 1811 | 0.000017 | eu.politico
641 | 20318744 | 5519 | 0.000006 | com.edocr
642 | 20318266 | 3351 | 0.000011 | de.diplo
643 | 20316524 | 2077 | 0.000016 | com.spreaker
644 | 20316416 | 2961 | 0.000012 | com.space
645 | 20315134 | 1866 | 0.000017 | com.voanews
646 | 20314896 | 2750 | 0.000014 | org.wikidata
647 | 20313896 | 2047 | 0.000016 | dk.google
648 | 20309998 | 1033 | 0.000029 | me.onelink
649 | 20309160 | 3695 | 0.000010 | com.prweek
650 | 20308644 | 3578 | 0.000010 | com.virgin
651 | 20307882 | 3206 | 0.000011 | com.slidesharecdn
652 | 20306230 | 605 | 0.000044 | com.canva
653 | 20305826 | 1615 | 0.000019 | com.indianexpress
654 | 20305362 | 3379 | 0.000011 | com.reason
655 | 20303486 | 908 | 0.000034 | com.imageshack
656 | 20303212 | 3815 | 0.000009 | org.cpj
657 | 20302922 | 1157 | 0.000026 | com.att
658 | 20302106 | 759 | 0.000037 | uk.co.eventbrite
659 | 20300748 | 3233 | 0.000011 | com.hm
660 | 20300012 | 761 | 0.000037 | com.gumroad
661 | 20299804 | 3151 | 0.000012 | de.taz
662 | 20297678 | 3671 | 0.000010 | uk.ac.nhm
663 | 20297472 | 945 | 0.000032 | com.fiverr
664 | 20297324 | 2729 | 0.000014 | com.verywellhealth
665 | 20297110 | 573 | 0.000046 | com.globenewswire
666 | 20296832 | 883 | 0.000035 | com.wikihow
667 | 20293998 | 2251 | 0.000015 | org.ocks
668 | 20293488 | 3211 | 0.000011 | org.iucnredlist
669 | 20293434 | 3008 | 0.000012 | edu.uoregon
670 | 20293204 | 2658 | 0.000014 | com.gfycat
671 | 20293094 | 3409 | 0.000011 | org.oxfam
672 | 20293066 | 805 | 0.000036 | int.wipo
673 | 20292834 | 2851 | 0.000013 | com.fineartamerica
674 | 20292720 | 1501 | 0.000020 | pl.gov
675 | 20292124 | 4536 | 0.000008 | com.backblazeb2
676 | 20291708 | 1818 | 0.000017 | com.jimdosite
677 | 20290958 | 1774 | 0.000017 | com.thestar
678 | 20290480 | 3139 | 0.000012 | org.eji
679 | 20290432 | 4260 | 0.000008 | com.theodysseyonline
680 | 20289286 | 1533 | 0.000020 | com.routledge
681 | 20286712 | 3550 | 0.000010 | uk.co.timesonline
682 | 20285432 | 2900 | 0.000013 | org.gnupg
683 | 20284918 | 2570 | 0.000015 | com.infogram
684 | 20284746 | 1961 | 0.000016 | uk.org.greenend
685 | 20284694 | 2269 | 0.000015 | org.rand
686 | 20284380 | 1901 | 0.000016 | com.surveygizmo
687 | 20283570 | 621 | 0.000043 | br.com.uol
688 | 20283218 | 511 | 0.000052 | org.drupal
689 | 20282834 | 3423 | 0.000011 | org.democracynow
690 | 20281418 | 1057 | 0.000029 | org.unicode
691 | 20279680 | 4248 | 0.000008 | com.roche
692 | 20279102 | 4978 | 0.000007 | re.cli
693 | 20278472 | 2959 | 0.000012 | com.kaggle
694 | 20278344 | 1208 | 0.000025 | cn.news
695 | 20278082 | 2182 | 0.000015 | cc.tiny
696 | 20277564 | 3513 | 0.000010 | org.bitcointalk
697 | 20276968 | 3180 | 0.000011 | com.gawker
698 | 20275524 | 3459 | 0.000010 | com.bigthink
699 | 20275396 | 1362 | 0.000022 | com.jekyllrb
700 | 20274250 | 1735 | 0.000017 | com.justia
701 | 20273300 | 1077 | 0.000028 | com.css-tricks
702 | 20272378 | 2821 | 0.000013 | com.motherjones
703 | 20272156 | 2850 | 0.000013 | edu.nd
704 | 20272076 | 1691 | 0.000018 | org.ourworldindata
705 | 20271098 | 1885 | 0.000016 | ca.on.gov
706 | 20270166 | 2803 | 0.000013 | com.timesofisrael
707 | 20270140 | 3646 | 0.000010 | org.project-syndicate
708 | 20269998 | 509 | 0.000052 | com.mckinsey
709 | 20269596 | 192 | 0.000140 | com.discord
710 | 20269514 | 2553 | 0.000015 | net.openid
711 | 20269466 | 1405 | 0.000021 | org.amnesty
712 | 20269452 | 2842 | 0.000013 | net.vnexpress
713 | 20268296 | 4161 | 0.000009 | com.crayola
714 | 20268292 | 1446 | 0.000021 | gov.uscourts
715 | 20267682 | 2029 | 0.000016 | gov.faa
716 | 20267344 | 484 | 0.000055 | com.onesignal
717 | 20266670 | 2380 | 0.000015 | com.lexisnexis
718 | 20265730 | 3297 | 0.000011 | com.nme
719 | 20265006 | 1231 | 0.000024 | ms.aka
720 | 20264948 | 2043 | 0.000016 | gov.usaid
721 | 20263632 | 1066 | 0.000028 | com.pcmag
722 | 20263480 | 2976 | 0.000012 | com.mathworks
723 | 20263422 | 2796 | 0.000013 | uk.ac.kcl
724 | 20263002 | 2746 | 0.000014 | fr.gouv.diplomatie
725 | 20262136 | 1854 | 0.000017 | org.worldcat
726 | 20260522 | 553 | 0.000048 | ca.youradchoices
727 | 20257736 | 3050 | 0.000012 | org.csis
728 | 20257216 | 3330 | 0.000011 | org.repec
729 | 20257192 | 2003 | 0.000016 | de.ndr
730 | 20256910 | 1193 | 0.000025 | com.playstation
731 | 20256308 | 3083 | 0.000012 | ru.kp
732 | 20254870 | 3376 | 0.000011 | no.uib
733 | 20254712 | 660 | 0.000041 | gov.nist
734 | 20253674 | 3129 | 0.000012 | org.ewg
735 | 20253570 | 2588 | 0.000014 | de.web
736 | 20253132 | 901 | 0.000034 | com.mobirise
737 | 20252678 | 3005 | 0.000012 | au.com.businessinsider
738 | 20252022 | 3586 | 0.000010 | org.polymer-project
739 | 20251846 | 541 | 0.000049 | com.sxsw
740 | 20249958 | 688 | 0.000040 | com.usnews
741 | 20248448 | 209 | 0.000125 | com.myshopify
742 | 20247520 | 342 | 0.000081 | mp.mailchi
743 | 20247494 | 812 | 0.000036 | net.b-cdn
744 | 20246430 | 4236 | 0.000008 | com.mail
745 | 20246388 | 2571 | 0.000015 | com.sina
746 | 20245658 | 1540 | 0.000020 | com.pastebin
747 | 20244724 | 4633 | 0.000008 | com.mysanantonio
748 | 20244720 | 2656 | 0.000014 | org.unctad
749 | 20243924 | 3492 | 0.000010 | com.thejakartapost
750 | 20243400 | 1289 | 0.000023 | org.coursera
751 | 20243002 | 1296 | 0.000023 | com.smashingmagazine
752 | 20242396 | 3714 | 0.000010 | io.fabric
753 | 20242344 | 164 | 0.000176 | de.bund
754 | 20241670 | 3606 | 0.000010 | com.shell
755 | 20241424 | 3064 | 0.000012 | com.biography
756 | 20241116 | 3751 | 0.000010 | com.nwsource
757 | 20240468 | 4289 | 0.000008 | build.bazel
758 | 20240144 | 2529 | 0.000015 | org.medrxiv
759 | 20236914 | 2914 | 0.000013 | com.coca-colacompany
760 | 20236502 | 946 | 0.000032 | com.shutterstock
761 | 20236242 | 949 | 0.000032 | uk.gov.legislation
762 | 20235876 | 518 | 0.000052 | com.herokuapp
763 | 20234654 | 629 | 0.000042 | it.placehold
764 | 20234380 | 5770 | 0.000006 | com.filedropper
765 | 20234228 | 4331 | 0.000008 | org.globalnetworkinitiative
766 | 20233566 | 1598 | 0.000019 | org.altervista
767 | 20233562 | 3298 | 0.000011 | com.sacbee
768 | 20233426 | 2538 | 0.000015 | org.biorxiv
769 | 20232540 | 3213 | 0.000011 | fr.rfi
770 | 20232534 | 2974 | 0.000012 | com.ericsson
771 | 20232530 | 4073 | 0.000009 | com.kinja
772 | 20230382 | 991 | 0.000031 | com.trello
773 | 20228230 | 3108 | 0.000012 | org.oas
774 | 20227650 | 1183 | 0.000025 | com.ycombinator
775 | 20226700 | 1810 | 0.000017 | org.donorbox
776 | 20226310 | 2717 | 0.000014 | com.e-monsite
777 | 20225712 | 1022 | 0.000030 | gov.fcc
778 | 20225274 | 2099 | 0.000016 | org.unodc
779 | 20224068 | 1159 | 0.000026 | com.tableau
780 | 20223780 | 75 | 0.000419 | net.cpanel
781 | 20222306 | 3580 | 0.000010 | org.tigris
782 | 20222078 | 1366 | 0.000022 | com.alexa
783 | 20221990 | 1248 | 0.000024 | gov.uspto
784 | 20221206 | 3828 | 0.000009 | com.wasabisys
785 | 20221088 | 1809 | 0.000017 | com.speakerdeck
786 | 20219794 | 2416 | 0.000015 | com.miamiherald
787 | 20219138 | 3771 | 0.000010 | com.bangkokpost
788 | 20218368 | 1125 | 0.000027 | gov.cms
789 | 20216750 | 1143 | 0.000026 | org.reactjs
790 | 20216026 | 562 | 0.000047 | com.gartner
791 | 20215672 | 1111 | 0.000027 | com.jwplayer
792 | 20215008 | 2844 | 0.000013 | edu.usf
793 | 20214258 | 2903 | 0.000013 | com.thenation
794 | 20213606 | 2730 | 0.000014 | com.washingtontimes
795 | 20213058 | 3355 | 0.000011 | com.wikidot
796 | 20212972 | 960 | 0.000031 | com.hp
797 | 20210904 | 651 | 0.000041 | gov.sec
798 | 20210232 | 2082 | 0.000016 | com.squarespace-cdn
799 | 20209450 | 2647 | 0.000014 | jp.nicovideo
800 | 20208430 | 4413 | 0.000008 | de.otto
801 | 20207608 | 2679 | 0.000014 | ru.kremlin
802 | 20207088 | 251 | 0.000106 | com.cloudinary
803 | 20206412 | 580 | 0.000046 | fr.free
804 | 20206384 | 1016 | 0.000030 | com.podbean
805 | 20206236 | 6475 | 0.000006 | com.uberant
806 | 20206196 | 714 | 0.000039 | org.apa
807 | 20204880 | 2626 | 0.000014 | se.haxx
808 | 20204770 | 4090 | 0.000009 | com.bloombergquint
809 | 20203792 | 1996 | 0.000016 | org.khanacademy
810 | 20203308 | 1081 | 0.000028 | com.engadget
811 | 20203230 | 3705 | 0.000010 | com.allafrica
812 | 20203190 | 3286 | 0.000011 | vn.com.google
813 | 20202746 | 5123 | 0.000007 | to.gplus
814 | 20201740 | 3468 | 0.000010 | my.com.thestar
8152020148436170.000010uk.org.asa
8162020094828960.000013com.simonandschuster
8172020083029040.000013com.lowes
8182020080422230.000015org.wto
819201999822070.000126com.caniuse
820201999822240.000118com.getbootstrap
8212019996626680.000014tv.ustream
8222019981441180.000009uk.co.spectator
823201988702270.000117org.icann
824201983946530.000041org.eff
8252019752234710.000010com.sputniknews
8262019634039820.000009com.manta
8272019599435090.000010uk.ac.qmul
8282019593033780.000011com.eiu
8292019540631850.000011com.financialpost
8302019539829540.000012uk.gov.metoffice
831201939822690.000102com.naver
8322019390022400.000015gov.gao
8332019313211140.000027edu.ucla
8342019242017130.000018fr.blogspot
8352019234234150.000011org.heritage
8362019205244450.000008org.scala-sbt
8372019181436520.000010com.thenationalnews
8382019140841790.000009com.rappler
8392019136042640.000008com.wusa9
8402019039632420.000011org.rferl
8412018971828070.000013ru.kommersant
8422018962238140.000009org.grist
8432018925414850.000020us.imageshack
8442018882213910.000022com.freeprivacypolicy
8452018878026280.000014org.wbur
8462018823650020.000007com.picsart
8472018803048270.000007org.frontlinedefenders
8482018770038440.000009com.newatlas
849201857805360.000050com.wufoo
8502018499614010.000021edu.northwestern
8512018341226730.000014com.fivethirtyeight
852201832527030.000039com.moz
8532018254420500.000016to.dev
8542018186037990.000009de.wwf
8552018174442660.000008com.iconarchive
8562018136637750.000009org.pri
857201794789430.000032com.redhat
85820178530550.000603com.dan
8592017848444900.000008tw.blogspot
8602017761028760.000013com.infoworld
861201757366640.000041com.aliexpress
862201756566850.000040com.photobucket
8632017454637100.000010int.au
8642017238636130.000010org.jenkins-ci
8652017196038070.000009com.obsproject
8662017018427270.000014com.discogs
8672017014845320.000008com.koreaherald
8682016968438380.000009ru.forbes
869201693809750.000031com.stackexchange
8702016757230070.000012com.yougov
8712016722835420.000010ly.plot
8722016687630860.000012org.panda
8732016668034000.000011com.law360
8742016575410640.000028com.emarketer
8752016381045490.000008org.article19
8762016377011090.000027com.merriam-webster
877201632063980.000067com.bitly
8782016195842270.000008com.prevention
8792016171263140.000006org.arkive
8802016165019360.000016com.hackerone
8812016140434500.000010com.news24
8822016138832770.000011com.foreignaffairs
8832016123244510.000008fr.huffingtonpost
884201605264430.000059com.skype
8852015839066530.000006com.booklikes
886201582829130.000033com.marketwatch
8872015800611990.000025org.webkit
8882015772640760.000009au.com.heraldsun
8892015756044760.000008org.siggraph
8902015695016480.000018com.newrelic
8912015638438840.000009gov.fec
8922015590238500.000009org.brainpickings
8932015515036310.000010de.uni-frankfurt
8942015484823390.000015com.w3techs
8952015442832670.000011edu.unh
8962015430243870.000008br.unicamp
89720153190580.000586com.afternic
8982015287657420.000006cc.kknews
8992015261010460.000029com.pwc
9002015235841820.000008com.wallethub
9012015149239870.000009com.collinsdictionary
902201502823120.000090com.webflow
9032015019445690.000008org.firstmonday
9042015016611620.000026com.appnexus
9052014971243710.000008uk.ac.westminster
9062014858447700.000007com.selfridges
9072014852233410.000011com.scotsman
9082014839419420.000016com.ssllabs
9092014767241690.000009com.datacenterknowledge
9102014658229530.000012com.washingtonexaminer
911201463964170.000063com.force
9122014571647390.000007br.ufrgs
9132014559826120.000014ru.ria
9142014507060180.000006com.armorgames
9152014441444780.000008net.middleeasteye
9162014300437430.000010com.thediplomat
9172014193043400.000008com.the-scientist
9182014185236760.000010gov.ornl
9192014133226180.000014gov.energystar
9202014031829270.000013org.wri
9212013986612730.000023org.owasp
9222013750635720.000010org.wilsoncenter
9232013722641940.000008uk.co.manchestereveningnews
924201371984520.000058gov.consumerfinance
9252013718013540.000022com.symantec
926201369368770.000035com.libsyn
9272013693014720.000020com.twilio
9282013678011770.000025com.semrush
9292013675657630.000006net.postheaven
9302013674241070.000009com.crashlytics
9312013634016080.000019com.techrepublic
9322013627814690.000020com.createjs
933201361749110.000033edu.columbia
934201355269580.000031com.buzzsprout
935201350185910.000045net.azurewebsites
9362013485827880.000013org.iucn
9372013445444840.000008com.googledrive
9382013417635560.000010org.sonatype
9392013411818040.000017ly.ow
9402013410445220.000008io.meduza
9412013399415590.000019net.msecnd
9422013397214170.000021com.weather
9432013279613750.000022com.rollingstone
9442013255041670.000009ru.aif
9452013248816490.000018com.upwork
9462013207818320.000017com.chrome
947201314184450.000059com.dmca
9482013093840970.000009org.avaaz
9492012964253340.000007cn.edu.sdu
9502012859025740.000015ru.rbc
951201285266610.000041com.figma
9522012748433910.000011nl.rug
9532012654052100.000007org.sourcewatch
9542012586651480.000007com.wsoctv
9552012554651490.000007com.linodeobjects
9562012546227240.000014int.reliefweb
9572012486031300.000012org.cfr
9582012454228590.000013com.springeropen
959201239703350.000083com.wistia
960201221989900.000031org.json
9612012198258400.000006com.grabcad
9622012111835510.000010ru.vedomosti
9632012088455050.000006org.sfpl
9642012018846370.000008ch.qos
9652011971636930.000010org.escholarship
9662011914839770.000009uk.ac.sussex
967201189823440.000081com.automattic
9682011877841310.000009com.gannett-cdn
9692011734640520.000009edu.scu
9702011727440680.000009org.nationalinterest
9712011708435100.000010com.tradingeconomics
9722011705237020.000010org.thinkprogress
9732011696442400.000008com.dawn
9742011628441660.000009cc.taplink
9752011517831060.000012ca.citizenlab
9762011482826520.000014com.bankrate
9772011468220300.000016com.tutsplus
9782011446214120.000021org.golang
9792011383257060.000006com.london2012
9802011363620330.000016org.linuxfoundation
9812011328018400.000017edu.rutgers
9822011304831790.000011org.undocs
9832011267248640.000007za.co.dailymaverick
9842011262831980.000011com.springernature
9852011255237400.000010au.edu.adelaide
9862011182246650.000008com.mnn
9872011028043120.000008ae.google
9882011027430890.000012org.crossref
9892011026235450.000010com.vox-cdn
9902011025641080.000009com.dailykos
9912010954848820.000007uk.ac.lancs
992201094728630.000036org.ieee
993201080507290.000038ca.canada
9942010684834440.000010org.cato
9952010659437720.000009gov.ustr
996201065624690.000056com.indeed
9972010655434790.000010com.cityam
9982010617241930.000008de.ebay-kleinanzeigen
9992010580410190.000030com.techtarget
1000201044946690.000040gov.copyright
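
The rows above list each domain in reversed notation together with two ranks, and the two ranks can differ considerably. As a rough illustration, the following Python sketch takes a few rows read from the table, converts the reversed labels back to ordinary domain names, and sorts the rows by the gap between the two rank positions. The column interpretation (harmonic centrality rank and value, PageRank rank and value, reversed domain) and the helper names `unreverse` and `rank_gap` are assumptions made for this example and are not part of the cc-webgraph tools.

```python
# Sample rows read from the table above. Assumed column order:
# (harmonic centrality rank, harmonic centrality value,
#  PageRank rank, PageRank value, reversed domain name)
ROWS = [
    (701, 20273300, 1077, 0.000028, "com.css-tricks"),
    (780, 20223780, 75, 0.000419, "net.cpanel"),
    (858, 20178530, 55, 0.000603, "com.dan"),
    (897, 20153190, 58, 0.000586, "com.afternic"),
    (1000, 20104494, 669, 0.000040, "gov.copyright"),
]


def unreverse(label: str) -> str:
    """Turn a reversed label such as 'uk.gov.legislation' into 'legislation.gov.uk'."""
    return ".".join(reversed(label.split(".")))


def rank_gap(row) -> int:
    """Harmonic centrality rank minus PageRank rank.

    A positive value means the domain ranks better by PageRank
    than by harmonic centrality.
    """
    hc_pos, _hc_val, pr_pos, _pr_val, _label = row
    return hc_pos - pr_pos


# Largest positive gap first.
by_gap = sorted(ROWS, key=rank_gap, reverse=True)

for row in by_gap:
    print(f"{unreverse(row[4]):20} hc #{row[0]:<5} pr #{row[2]:<5} gap {rank_gap(row):+d}")
```

Domains with a large positive gap, such as com.afternic or net.cpanel in this sample, sit much higher in the PageRank ordering than in the harmonic centrality ordering.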

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!