Host- and Domain-Level Web Graphs February/March, April and May 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February/March, April and May 2021. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. You may also visit the cc-webgraph and cc-pyspark projects, which include all scripts and tools required to construct the graphs. Instructions for exploring the graphs in the webgraph format are given in our collection of webgraph notebooks.

What’s new?

The host-level graph now includes every host visited by the crawler, even if no link points to the host and either all visited URLs of the host failed (HTTP 404 or another error code) or the host’s robots.txt does not allow crawling. Note that the links leading to these hosts may have been found in a prior crawl rather than in one of the three crawls used to build this web graph.

Host-level graph

The graph consists of 515 million nodes and 2.82 billion edges. Hyperlinks, HTTP redirects, and link headers are all used as edges to build the graph. All types of links are included, even purely “technical” ones pointing to images, JavaScript libraries, web fonts, etc. However, only host names with a valid IANA TLD are used. Consequently, URLs with an IP address as the host component are not taken into account when building the host-level graph.

There are 452 million dangling nodes (87.9%), and the largest strongly connected component contains 45.2 million nodes (8.8%). Dangling nodes stem from:

  • hosts that have not been crawled, yet are pointed to from a link on a crawled page
  • hosts without any links pointing to a different host name
  • hosts that returned only an error page (e.g. HTTP 404)
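
As a rough illustration of what "dangling" means here, the following sketch treats a node as dangling when it has no outgoing edge. The node ids and edges are made-up toy data, and the function name `dangling_nodes` is a hypothetical helper, not part of the released tooling.

```python
def dangling_nodes(node_ids, edges):
    """Return the ids of nodes that never appear as the source of an edge."""
    has_outlink = {src for src, _dst in edges}
    return [n for n in node_ids if n not in has_outlink]


# Toy graph (made-up ids): 0 -> 1, 0 -> 2, 1 -> 0; node 3 is isolated.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 0)]

dangling = dangling_nodes(nodes, edges)
share = len(dangling) / len(nodes)  # fraction of dangling nodes
print(dangling, share)
```

For the full graph, holding all source ids in a Python set may need a lot of memory, so a bitmap or the WebGraph tools are probably a better fit at that scale.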

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
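
The naming rule can be sketched in a few lines. This is only a minimal illustration of the notation described above; the function name `reverse_host` is hypothetical, and any further normalization the real pipeline may apply is not covered.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```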

You can download the graph and the ranks of all 515 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 72 gzip-compressed files, which are listed in two path listings: one for the nodes (vertices) and one for the edges (arcs). First, download a path listing and decompress it with gzip. Prefixing each line of the listing with s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ then yields the list of URLs needed to download the entire graph.
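
The prefixing step could look roughly like the sketch below. It works on the bytes of an already downloaded listing; the function name `listing_to_urls` and the two sample paths are hypothetical placeholders, not real entries from the listings.

```python
import gzip

HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"
S3_PREFIX = "s3://commoncrawl/"


def listing_to_urls(gz_bytes, prefix=HTTPS_PREFIX):
    """Decompress a *.paths.gz listing and prepend the prefix to every non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical listing content, compressed here only to make the example self-contained.
sample_listing = gzip.compress(b"path/to/part-00000.txt.gz\npath/to/part-00001.txt.gz\n")

for url in listing_to_urls(sample_listing):
    print(url)
```

Passing `prefix=S3_PREFIX` instead produces s3:// URLs for use with S3 clients.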

Download files of the Common Crawl Feb/Apr/May 2021 host-level webgraph

Size | File | Description
3.31 GB | cc-main-2021-feb-apr-may-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 24 vertices files
12.94 GB | cc-main-2021-feb-apr-may-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 48 edges files
5.57 GB | cc-main-2021-feb-apr-may-host.graph | graph in BVGraph format
2 kB | cc-main-2021-feb-apr-may-host.properties |
6.22 GB | cc-main-2021-feb-apr-may-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2021-feb-apr-may-host-t.properties |
1 kB | cc-main-2021-feb-apr-may-host.stats | WebGraph statistics
7.69 GB | cc-main-2021-feb-apr-may-host-ranks.txt.gz | harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
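
To make the aggregation step concrete, here is a simplified sketch that maps reversed host names to their pay-level domain and collapses host edges into domain edges. The `SUFFIXES` set is a tiny hand-picked stand-in for the public suffix list (the real list also has wildcard and exception rules, which are ignored here), and dropping self-loops and duplicate edges is an assumption of this example rather than a documented property of the release.

```python
# Tiny stand-in for the public suffix list, in reversed notation (assumption).
SUFFIXES = {"com", "org", "uk.co"}


def pay_level_domain(rev_host, suffixes=SUFFIXES):
    """Return the longest matching suffix plus one more label, or None if nothing matches."""
    labels = rev_host.split(".")
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in suffixes:
            return ".".join(labels[:n + 1])
    return None  # the name is itself a suffix, or its suffix is unknown


def aggregate(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s and d and s != d:  # skip unknown suffixes and self-loops (assumption)
            domain_edges.add((s, d))
    return domain_edges


host_edges = [
    ("com.example.blog", "com.example.shop"),   # same domain, dropped
    ("com.example.blog", "org.wikipedia.en"),
    ("uk.co.bbc.news", "org.wikipedia.de"),
]
print(sorted(aggregate(host_edges)))
```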

The domain-level graph has 88 million nodes and 1.58 billion edges. 44 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2021-feb-apr-may/domain/.

Download files of the Common Crawl Feb/Apr/May 2021 domain-level webgraph

Size | File | Description
0.61 GB | cc-main-2021-feb-apr-may-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.37 GB | cc-main-2021-feb-apr-may-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.58 GB | cc-main-2021-feb-apr-may-domain.graph | graph in BVGraph format
2 kB | cc-main-2021-feb-apr-may-domain.properties |
3.42 GB | cc-main-2021-feb-apr-may-domain-t.graph | transpose of the graph
2 kB | cc-main-2021-feb-apr-may-domain-t.properties |
1 kB | cc-main-2021-feb-apr-may-domain.stats | WebGraph statistics
1.89 GB | cc-main-2021-feb-apr-may-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 88 million domain ranks is available for download.
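
For working with the downloaded ranks, a small parser along the following lines may be a useful starting point. It assumes that the ranks file is tab-separated, that its first five columns follow the column order of the table below, and that header lines start with "#"; none of this is stated in this announcement, so check the actual file before relying on it. The two sample lines reuse the first two rows of the table.

```python
from typing import NamedTuple


class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_domain: str


def parse_ranks(lines):
    """Yield Rank records from tab-separated lines, skipping blank and '#' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        hc_rank, hc_value, pr_rank, pr_value, rev_domain = line.split("\t")[:5]
        yield Rank(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_domain)


sample = [
    "1\t31920934\t1\t0.017627\tcom.googleapis\n",
    "2\t31032784\t3\t0.013762\tcom.facebook\n",
]
ranks = list(parse_ranks(sample))
by_pagerank = sorted(ranks, key=lambda r: r.pr_rank)
print([r.rev_domain for r in by_pagerank])
```

For the real file, the same generator can be fed from `gzip.open(path, "rt")` so that the ranks are streamed rather than loaded at once.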

Top 1000 domains ranked by harmonic centrality (Feb/Apr/May 2021)

harmonic
centrality
rank
hc valuepage rankpage rank
value
reversed domain name
13192093410.017627com.googleapis
23103278430.013762com.facebook
32968130420.013832com.google
42710169240.007844com.twitter
52695466050.007519org.w
62688662470.006967com.youtube
72551585080.005718com.instagram
82503149060.007143com.googletagmanager
92439611690.005506org.gmpg
1023807122120.003347com.linkedin
1122970992130.003048com.gstatic
1222854052100.003951com.cloudflare
1322698594190.001914com.gravatar
1422504168140.002908org.wordpress
1522434542220.001564com.pinterest
1622100870250.001270org.wikipedia
1721950578170.002031com.wordpress
1821940826180.001958com.apple
1921766696150.002258com.bootstrapcdn
2021762964300.001174com.vimeo
2121722198380.000914be.youtu
2221556142210.001842com.jquery
2321478118290.001182com.microsoft
2421432212530.000703com.blogspot
2521354260350.001025com.amazonaws
2621337432440.000765com.amazon
2721320702430.000789gl.goo
2821170722620.000600ly.bit
2921149628990.000409com.tumblr
3021148242500.000739com.wp
3121136818450.000758org.mozilla
3221110018570.000689eu.europa
3321104262200.001894com.adobe
3421048760160.002200com.github
3521040284340.001026com.google-analytics
3621027350360.001015net.jsdelivr
3720998320270.001218com.wixstatic
3820995232310.001119net.cloudfront
3920946148470.000744com.flickr
40209131041070.000338com.yahoo
4120851316830.000436com.googleusercontent
4220843068370.000929io.github
43208406701110.000317com.reddit
4420834398580.000677com.paypal
4520816886230.001554com.fontawesome
46207735821030.000368com.weebly
4720764576790.000455com.medium
4820764512330.001035com.googlesyndication
4920757582320.001118ru.yandex
5020741944480.000743com.whatsapp
5120708152680.000520org.w3
52207058261320.000240com.nytimes
5320696906590.000673co.t
54206780881020.000375org.creativecommons
55206758221150.000290com.soundcloud
5620644978600.000624org.schema
5720627114740.000479com.shopify
5820621162660.000543com.vk
59206047261810.000149org.wikimedia
60206047241470.000204com.dropbox
6120579720550.000702com.addthis
62205729501380.000211org.archive
63205706101980.000133com.cnn
64205581141520.000187gov.cdc
6520550306800.000446me.wp
66205388161930.000136com.imgur
6720530078490.000740net.doubleclick
68205122941990.000133uk.co.bbc
69205059642000.000133net.slideshare
70204998641710.000155com.theguardian
71204897561580.000175int.who
72204822561200.000263com.spotify
73204811181750.000151com.bing
74204783202130.000124com.businessinsider
75204774782530.000104com.bloomberg
76204773001440.000206gov.nih
7720473648460.000748com.macromedia
78204405202540.000103com.wsj
79204343202240.000118edu.stanford
8020419762410.000847net.fbcdn
8120417930390.000885org.apache
82204096361570.000175org.ietf
8320397792900.000420com.list-manage
84203955943680.000071com.googleblog
85203953502170.000123com.stackoverflow
86203931721700.000155com.giphy
87203912263140.000085edu.mit
88203819482230.000118com.washingtonpost
89203726021340.000232com.ytimg
90203635923620.000073com.appspot
91203602363510.000076com.theverge
92203596102860.000093com.bbc
93203588703960.000067uk.co.telegraph
94203560364990.000056edu.berkeley
95203480482660.000101edu.harvard
96203460123300.000080com.go
97203416762370.000112com.office
98203387101450.000206us.zoom
99203357822470.000109com.android
100203353663270.000082com.wired
101203341602880.000092com.techcrunch
102203317822380.000111com.oracle
103203236385470.000051com.livejournal
104202966701640.000170com.issuu
105202958402960.000090com.cnbc
106202921462110.000124gov.ca
107202917544020.000066com.ted
108202883803790.000069gov.nasa
109202834261490.000195com.forbes
110202830501480.000199com.wixsite
111202829721510.000192com.npmjs
112202825245180.000054com.zdnet
113202796564470.000062com.msn
114202777522920.000091com.reuters
115202755403500.000076com.nature
11620273474780.000459com.godaddy
117202717183710.000070com.myspace
118202704942220.000119com.etsy
119202688323210.000084com.prnewswire
120202557262090.000125org.ampproject
121202523864070.000065org.arxiv
122202522923120.000085org.npr
123202522182630.000101com.sciencedirect
12420248804980.000410com.unpkg
125202464022650.000101com.example
12620245616670.000524net.akamaihd
127202370562150.000123com.eventbrite
128202345323670.000072org.hbr
129202323381760.000151com.blogger
130202316581270.000247org.networkadvertising
131202315523990.000066com.latimes
132202286902680.000101org.acm
133202232423380.000079com.statista
134202094343890.000068com.fastcompany
135202058486600.000043com.economist
136202024823430.000078com.time
137202024522260.000117com.twimg
138202019026790.000042edu.upenn
139202015305500.000050edu.yale
140202008422580.000102com.githubusercontent
141201912724740.000060com.steampowered
142201898241430.000206com.opera
143201886204440.000062uk.co.dailymail
144201884863530.000076com.springer
145201868065760.000047com.scribd
146201847847800.000041edu.columbia
147201801005350.000052org.chromium
148201758765910.000046me.about
149201757326040.000046google.blog
150201752842850.000094com.squarespace
151201740503350.000079com.huffingtonpost
152201713564310.000063com.nationalgeographic
153201687882210.000119uk.co.google
154201653722080.000125com.unsplash
155201635803880.000068com.w3schools
156201589563390.000079com.dribbble
157201547863400.000079com.tiktok
158201533562930.000091org.un
159201379247940.000040com.qz
160201338142480.000108com.bandcamp
161201295984850.000058edu.cornell
162201259548210.000039edu.umich
163201211201190.000267com.ft
164201153424350.000063com.theatlantic
165201110289660.000033edu.princeton
166201108083410.000078com.usatoday
167201055567860.000040com.evernote
168201054821330.000235info.aboutads
169201048104080.000065com.meetup
170201026384380.000062com.goodreads
171201008946250.000045org.ieee
172200989728780.000036com.slate
173200978706770.000042com.mysql
174200976564530.000061com.patreon
175200975301370.000216me.t
176200956005150.000055com.cbsnews
177200842046560.000043com.docker
178200833362910.000092com.wiley
179200825204800.000059gov.usda
180200806644540.000061com.dailymotion
181200788188170.000039edu.washington
182200771604930.000057com.withgoogle
183200750645230.000054io.readthedocs
184200710146440.000044com.marketwatch
185200650106500.000043uk.co.blogspot
186200627348680.000037com.shutterstock
18720062652540.000703com.fb
188200596644970.000056uk.co.independent
18920056344760.000467com.wix
190200559328110.000039org.cambridge
191200518445590.000049com.pexels
192200485767790.000041org.sciencemag
193200480045920.000046com.buzzfeed
194200442488190.000039com.stackexchange
195200434661790.000149ru.mail
196200434468440.000038com.webs
197200430745730.000048com.git-scm
198200402084640.000060com.inc
199200373542720.000100net.behance
200200297444250.000063gov.whitehouse
201200253428320.000038com.apnews
202200235187690.000041com.vox
2032002203013650.000024uk.co.thesun
204200185482740.000098com.outlook
205200183187720.000041org.bitbucket
20620017276400.000871com.qq
207200148722440.000110org.doi
208200120828120.000039uk.ac.cam
209200119982550.000103com.disqus
210200073122360.000112com.feedburner
211200056306700.000043org.worldbank
212200012305840.000047org.unicef
213200009324190.000064com.mozilla
214199997405930.000046co.ibb
21519999080260.001261io.polyfill
216199979285250.000054com.booking
21719993488420.000808com.baidu
218199897842600.000101com.cloudinary
219199858562890.000092com.tinyurl
220199839803450.000077com.ibm
2211998302211630.000027com.speakerdeck
222199825065970.000046gov.noaa
223199782066120.000045ee.linktr
224199773105690.000048com.psychologytoday
225199737105310.000053gov.loc
226199729204000.000066com.getpocket
2271997276010410.000031edu.utexas
228199717943200.000084org.pewresearch
2291997131013660.000024edu.rutgers
230199708945510.000050com.sagepub
231199702003090.000087com.nbcnews
2321996796211340.000028org.eclipse
233199655866480.000043com.trello
234199642803260.000082net.windows
235199641943840.000068com.quora
236199614306000.000046net.azurewebsites
237199599102750.000098gov.ftc
2381995593810570.000030edu.uchicago
239199533083110.000086com.netdna-ssl
240199519607820.000041org.semver
241199512861240.000252com.mailchimp
242199502944360.000063com.nypost
2431994929611950.000027com.hatenablog
244199471426520.000043com.newyorker
245199439389850.000033uk.co.guardian
246199435645900.000046com.usnews
247199404982200.000119tv.twitch
248199397387840.000041au.net.abc
249199388201660.000167com.amazon-adsystem
2501993630812780.000025com.vogue
251199354662300.000113com.wpengine
252199340981060.000338com.stripe
2531993326612610.000025org.kernel
254199297389410.000034com.politico
2551992641611930.000027org.unicode
256199256025800.000047org.eff
257199251745410.000051br.com.uol
258199248068520.000037com.about
2591992364413580.000024edu.hbs
260199236009540.000034com.dropboxusercontent
261199234649110.000035edu.jhu
262199220629930.000032co.elastic
263199218889130.000035com.steamcommunity
2641992015019710.000018com.googlesource
265199197605220.000054com.tandfonline
266199180102770.000097com.criteo
267199157085520.000050org.pbs
2681991298611060.000029edu.umd
26919912224640.000549co.g
270199083408650.000037com.foxnews
271199074561230.000261com.sharethis
2721990417810270.000031com.rollingstone
273199030822280.000115com.imdb
274199027749770.000033com.scientificamerican
2751990194013920.000023com.urbandictionary
276199008767750.000041uk.ac.ox
277199004063910.000067com.arcgis
2781989852020160.000018com.lego
279198984202510.000107page.g
280198983186310.000044gov.census
281198900565300.000053com.oup
282198879683460.000077com.optimizely
283198874245820.000047com.indiatimes
284198871943760.000069com.cnet
285198840244220.000064com.wufoo
286198829307040.000042uk.co.eventbrite
287198828064210.000064com.bigcommerce
2881988030613500.000024ca.blogspot
289198790168330.000038org.fao
290198787329080.000035com.jetbrains
2911987104414670.000022ca.ubc
2921986765019380.000018com.warnerbros
293198660124460.000062org.d3js
294198655189460.000034org.greenpeace
295198646322060.000127net.sourceforge
296198634503230.000083fr.google
2971986291612790.000025com.history
298198618068510.000038com.gumroad
299198617509190.000035com.chicagotribune
300198598446360.000044gov.archives
301198589022840.000095com.googlecode
302198535023420.000078com.slack
303198519322290.000114com.eepurl
304198456261140.000292com.paypalobjects
305198417029270.000035com.sap
306198398301530.000180com.addtoany
307198374662900.000092com.typepad
3081983408215620.000021de.mpg
309198300546640.000043com.pinimg
310198281482820.000095com.calendly
311198275304910.000057gov.epa
312198257563540.000076com.proofpoint
3131982112814300.000023ch.ethz
3141982109410280.000031com.500px
3151982055417320.000019com.diigo
316198203983340.000079com.live
3171982003412770.000025org.postgresql
3181981854412570.000025org.wiktionary
3191981791012740.000025org.aclu
320198176989810.000033edu.si
3211981658613940.000023edu.msu
3221981621010290.000031com.thehill
323198149368900.000036de.spiegel
324198131729160.000035com.huffpost
325198112824720.000060gov.hhs
3261980924011140.000028com.scmp
32719806650730.000484me.fb
328198063067640.000042org.change
329198050703780.000069com.sohu
3301980433613290.000024edu.illinois
331198041641850.000147com.xing
3321980119213230.000024org.tensorflow
3331980108610080.000032com.ssrn
334198001841620.000171com.zendesk
335197984289040.000035com.netlify
336197972945080.000056com.squareup
3371979702013520.000024com.sky
338197944001960.000134org.iana
3391979271410780.000029uk.co.thetimes
340197924948470.000038gov.congress
341197887048090.000039org.pypi
3421978387814220.000023cn.com.chinadaily
343197811429720.000033edu.academia
344197809744560.000061com.kickstarter
345197800848020.000040gov.senate
3461977912824150.000015org.pydata
3471977812411400.000027org.semanticscholar
348197757166200.000045site.business
3491977501212750.000025com.over-blog
350197748667920.000040org.oecd
3511977484616600.000020org.phys
352197743349990.000032com.yarnpkg
353197722488160.000039com.deviantart
3541977093610840.000029uk.co.mirror
355197705221870.000145com.rawgit
3561977011413150.000024com.axios
357197697006230.000045gov.house
358197689988940.000036com.discordapp
359197688668800.000036com.sciencedaily
360197662925110.000055com.gmail
361197656784230.000064com.technorati
362197639442160.000123com.hubspot
3631976163814330.000023com.unity3d
3641976076821370.000017org.threejs
3651976023813640.000024com.aljazeera
366197595802450.000109org.nodejs
367197588468960.000036com.bmj
368197555642610.000101com.ebay
3691975519811970.000026au.com.smh
370197536282340.000113org.gnu
3711975196415160.000021edu.osu
3721975136210250.000031int.coe
373197503029940.000032com.britannica
3741974840813120.000024edu.gatech
3751974681826910.000013com.openai
376197443704950.000056org.openstreetmap
377197430864370.000062com.ssl-images-amazon
378197415827910.000040br.com.google
379197410308550.000037ca.cbc
380197404848690.000037com.theconversation
3811973985225820.000014edu.toronto
3821973865210440.000031gov.usgs
3831973830615560.000021com.newscientist
384197362263010.000088net.themeforest
385197356986050.000046com.udacity
386197356684730.000060edu.nyu
3871973408417160.000019edu.ucsc
3881972370817000.000020org.emojipedia
3891972219420680.000017it.scoop
3901972202427540.000013com.slides
3911972187214590.000022ca.sfu
392197200048450.000038au.gov.nsw
3931971790819030.000019org.propublica
3941971758613860.000023com.firebaseapp
3951971609422470.000016com.skyrock
396197105167760.000041com.freepik
39719707962970.000412net.facebook
3981970490014540.000022com.penguinrandomhouse
399197035721950.000135org.bbb
4001970343219340.000018jp.co.japantimes
4011970103017620.000019com.itv
40219700818820.000437net.jsfiddle
4031970061619850.000018org.maven
4041969974623700.000015com.deepmind
405196978446170.000045com.healthline
406196953245060.000056de.gesetze-im-internet
407196947204650.000060org.python
4081969442823310.000015com.mystrikingly
409196915368840.000036gov.dhs
4101968823812330.000026com.wikia
4111968598620900.000017org.sqlite
4121968297615440.000021ms.1drv
413196820941780.000150com.salesforce
414196799143220.000084net.php
415196714843240.000083com.surveymonkey
416196709626340.000044com.mashable
4171967033816280.000020com.motherjones
418196687241390.000211com.weibo
4191966855424530.000014com.fastcodesign
4201966744415060.000021com.flipboard
4211966674624350.000015edu.byu
4221966548217480.000019edu.cuny
423196648863170.000085ru.ok
424196626182870.000092net.azureedge
4251966210813390.000024com.thedailybeast
426196596722460.000109org.aboutcookies
4271965883822830.000015com.shutterfly
4281965610814130.000023com.reverbnation
4291965572226660.000013io.material
430196552545370.000052io.codepen
4311965277612960.000025com.dw
432196519861250.000250com.youtube-nocookie
4331965041617240.000019com.esri
434196501884900.000057fr.free
4351964841615090.000021com.substack
436196474385610.000049com.matterport
4371964658419560.000018com.hindustantimes
4381964583019090.000019com.insider
4391964234221100.000017edu.oregonstate
4401964181423900.000015org.wikibooks
441196408388910.000036int.wipo
4421964024428200.000013org.aclweb
443196392266070.000045gov.state
4441963889423660.000015com.wattpad
445196386521600.000172gle.forms
4461963669210520.000030org.jstor
4471963639819510.000018com.channel4
4481963612617520.000019edu.ucsb
4491963594213200.000024gov.supremecourt
45019633994560.000697com.googleadservices
4511963176024410.000015at.ac.univie
4521962909629240.000013com.pbase
453196265722780.000097uk.org.ico
454196248026390.000044com.licdn
4551962342215180.000021ch.ipcc
456196218749370.000034com.gallup
457196217804960.000056com.herokuapp
4581961858411410.000027edu.brookings
459196173889630.000033edu.psu
4601961679013330.000024mil.army
461196166264340.000063com.rackcdn
462196144843850.000068com.atlassian
4631961176012260.000026com.smashingmagazine
4641960963422270.000016blog.home
4651960845013620.000024gov.defense
4661960769811310.000028com.photoshelter
467196074644830.000058net.imgix
468196070121820.000149jp.co.yahoo
4691960531622840.000015com.contently
470196020408260.000039com.oreilly
4711959770811740.000027com.mediafire
4721959559621170.000017com.thecut
4731959460419600.000018google.ai
4741959456831510.000012cc.uxdesign
4751959428031610.000012edu.uvm
476195941005200.000054edu.cmu
4771959308631370.000012com.instapaper
4781959109015910.000020com.thestar
479195883783690.000071net.researchgate
4801958721435020.000011com.raywenderlich
481195870085270.000053com.thinkwithgoogle
4821958486821490.000016fr.liberation
483195822301090.000336de.google
4841958141815740.000021com.buzzfeednews
485195776487670.000041org.worldwildlife
4861957666210130.000032com.ecwid
4871957611814770.000022com.findlaw
4881957480410120.000032com.thelancet
489195739367740.000041com.vice
490195735068130.000039gov.nist
4911957287219640.000018org.google
4921957250815310.000021org.hrw
493195704107650.000042com.intel
4941956823826950.000013uk.co.ibtimes
4951956779023720.000015com.oprah
49619567558870.000428com.workplace
4971956719433290.000011com.pearltrees
4981956717421030.000017com.voanews
499195667629650.000033com.engadget
500195661881260.000247com.statcounter
5011956477233650.000011org.edublogs
5021956398012600.000025org.aiga
5031956282810310.000031de.stern
5041956206815830.000020fr.francetvinfo
5051956019626200.000014com.hm
506195593423150.000085org.drupal
5071955913237360.000010fr.unblog
508195587867470.000042com.canva
5091955836228700.000013edu.ucf
5101955806432040.000012ph.telegra
511195575349260.000035uk.co.pinterest
5121955707224020.000015edu.kit
513195563585440.000051it.placehold
5141955552822190.000016net.corporate-ir
5151955391027680.000013co.ello
516195534268810.000036com.arstechnica
5171955301814490.000022com.livescience
5181955096821500.000016com.gq
5191955083619530.000018uk.gov.tfl
520195502542100.000125com.iubenda
521195500425330.000053com.pixabay
5221954832814080.000023org.undp
523195476688070.000039ca.amazon
5241954722610200.000031it.smarturl
5251954703226450.000014org.icrc
5261954693424470.000015com.webbyawards
5271954556424230.000015uk.ac.kcl
528195455549490.000034edu.ucla
5291954446214440.000022link.page
5301954396828610.000013com.dummies
5311954136615810.000021org.ocks
53219540748650.000544net.typekit
5331954002211220.000028org.ilo
5341953888225640.000014com.depositphotos
5351953886625020.000014com.unilever
5361953695013480.000024org.acs
53719536262810.000440com.livestream
5381953509826720.000013org.rsf
539195350764890.000057com.adweek
5401953404420500.000017com.msnbc
5411953022025090.000014com.slidesharecdn
5421953008420350.000018com.chronicle
5431952983630880.000012com.bepress
5441952958025710.000014com.biography
5451952932233840.000011tl.de
546195278863320.000079com.typeform
5471952642821850.000016com.newrepublic
5481952540023030.000015com.thoughtco
549195238566060.000045com.samsung
5501952311211000.000029org.ohchr
551195226687900.000040com.fiverr
5521952151817430.000019io.gitlab
553195212401210.000262com.jimdo
5541952029211570.000027com.thenextweb
5551952007020090.000018fr.orange
5561951961832720.000012net.openreview
5571951893622940.000015com.channelnewsasia
5581951709012830.000025org.aarp
5591951691826340.000014org.pewsocialtrends
5601951647619980.000018com.straitstimes
5611951393623100.000015edu.nd
5621951095620990.000017com.dallasnews
5631951073221300.000017de.br
5641950881822780.000015org.fas
5651950800012970.000024org.altervista
566195079782560.000103uk.co.amazon
567195072902190.000121to.amzn
5681950662428350.000013com.thejakartapost
5691950512822110.000016gov.lbl
5701950455616100.000020de.berlin
5711950436210860.000029com.popularmechanics
5721950370627430.000013uk.ac.leeds
573195036444590.000061com.staticflickr
5741950321033970.000011org.neocities
5751950235829960.000012org.vim
5761950218628830.000013org.globalcitizen
577194994505720.000048com.deloitte
578194993929220.000035com.zoho
579194989642330.000113io.shields
5801949893623280.000015com.indianexpress
5811949890238890.000010com.stratechery
5821949772828190.000013app.web
5831949635833860.000011org.zotero
5841949362429390.000013uk.gov.scotland
585194933145670.000048com.photobucket
5861949152437560.000010com.bravesites
5871949055214640.000022org.iea
588194899764320.000063com.hp
5891948995427130.000013uk.co.timesonline
590194894783650.000073com.quantserve
591194893364040.000066com.digg
592194866605600.000049com.cisco
5931948661811550.000027uk.parliament
5941948501429140.000013com.nwsource
5951948501223620.000015com.fineartamerica
596194845982670.000101com.onesignal
5971948423822340.000016com.foreignpolicy
598194842007980.000040org.weforum
5991948339829900.000012com.thoughtworks
6001948320215480.000021com.treehugger
601194823983070.000087com.aliyuncs
602194822246020.000046org.js
6031948023215270.000021gov.uscis
6041947904032560.000012uk.ac.city
6051947747620770.000017com.washingtontimes
6061947719835040.000011com.mariadb
6071947631625650.000014org.oas
608194752364170.000065com.gitlab
6091947225825840.000014com.mathworks
6101947175228300.000013com.dezeen
611194712848350.000038com.investopedia
6121947063824970.000014uk.co.yougov
6131946931629340.000013org.heritage
614194693086140.000045com.netflix
6151946625232810.000011com.shell
6161946538825400.000014fr.paris
617194649564480.000061gov.irs
6181946273240880.000009tl.page
6191946133013610.000024com.upwork
620194611704620.000061com.sxsw
6211946091412550.000025com.digitaloceanspaces
6221946054840910.000009com.jigsy
623194600668610.000037com.venturebeat
6241945841812150.000026com.dell
6251945734810160.000031gov.fcc
6261945682832290.000012uk.co.walesonline
6271945634629610.000013org.project-syndicate
6281945569620240.000018com.fivethirtyeight
629194552429200.000035fm.last
6301945505620860.000017info.worldometers
631194542529310.000034org.mediawiki
6321945367023770.000015ly.rebrand
6331945315840770.000009net.myanimelist
6341945282420750.000017cn.gov.fmprc
6351945201215020.000021org.amnesty
636194505483490.000077com.adnxs
6371944935019450.000018com.justia
6381944871240190.000009edu.usfca
6391944829827050.000013com.monday
6401944657615150.000021ca.bc.gov
641194464869430.000034org.reactjs
6421944612622850.000015net.openid
643194459043830.000068com.newrelic
6441944536613630.000024com.imageshack
6451944514435680.000010org.globalnetworkinitiative
6461944394025490.000014com.kaggle
6471944356236930.000010com.doodlekit
648194397922590.000102com.getbootstrap
6491943867028310.000013uk.co.inews
6501943831231290.000012com.bangkokpost
651194382304090.000065com.force
6521943790821070.000017uk.ac.imperial
6531943543446290.000008net.vingle
6541943415019820.000018be.kuleuven
6551943406635300.000011com.intensedebate
656194329265680.000048com.entrepreneur
6571943235035180.000011be.blogspot
6581942974031660.000012se.blogspot
6591942971213180.000024co.lpages
6601942899232660.000012org.carnegieendowment
661194286748370.000038com.globenewswire
6621942866231750.000012is.good
6631942809822460.000016com.instructure
6641942769829650.000012net.alarabiya
6651942720440900.000009com.kongregate
6661942651427950.000013com.discovermagazine
6671942574626130.000014org.gnupg
668194255185560.000049com.visualstudio
669194241301910.000139com.atdmt
6701942352837730.000010com.openlearning
6711942323037940.000010ch.swissinfo
6721942198235470.000010com.pixar
6731942008021540.000016com.livemint
674194197089570.000033com.variety
6751941714228160.000013uk.gov.metoffice
6761941434620040.000018com.surveygizmo
6771941299433370.000011cn.globaltimes
678194112129290.000035uk.gov.legislation
6791941107026390.000014org.ballotpedia
680194097362430.000110org.whatwg
6811940862031480.000012com.coca-colacompany
6821940834213430.000024uk.gov.nationalarchives
6831940616823260.000015com.thebalancesmb
6841940482231450.000012uk.gov.companieshouse
6851940308835320.000011com.dailykos
686194010081650.000170com.yelp
687194005122570.000103com.automattic
6881940027041690.000009com.penzu
6891939968624890.000014com.bloomberglaw
690193996624120.000065org.opensource
6911939812615470.000021org.khanacademy
6921939737638340.000010com.sfweekly
6931939523627790.000013com.thumbtack
6941939420228800.000013org.royalsociety
6951939368416740.000020kr.co.google
6961939367825310.000014com.post-gazette
6971939352028000.000013org.panda
6981939064824210.000015com.thenation
6991938971428230.000013io.fabric
7001938897449360.000008org.arkive
7011938875626890.000013uk.co.bbci
7021938762440420.000009hk.edu.cityu
7031938740631940.000012com.scribblelive
7041938635235530.000010com.gimletmedia
7051938587234890.000011com.tweetmeme
7061938483025410.000014de.uni-heidelberg
707193842842980.000089ai.shortpixel
7081938387219200.000019gov.gao
7091938297444250.000008com.storeboard
7101938165028140.000013com.politifact
7111938020233490.000011org.cato
7121937928248890.000008com.uberant
7131937730631830.000012fr.lepoint
7141937719438090.000010edu.depaul
7151937612638440.000010net.thedailystar
716193755904060.000066com.aol
7171937557040460.000009edu.umt
7181937279419480.000018tv.ustream
7191937262810340.000031com.verisign
7201936958832790.000011com.theweek
721193679349050.000035com.box
7221936717037240.000010com.eklablog
7231936585034880.000011com.militarytimes
724  19365832  866  0.000037  gov.uspto
725  19365580  3483  0.000011  com.multiscreensite
726  19364098  3103  0.000012  uk.ac.york
727  19359488  3165  0.000012  org.openweathermap
728  19358574  1526  0.000021  com.techrepublic
729  19358070  3315  0.000011  org.jenkins-ci
730  19357968  2815  0.000013  org.wnyc
731  19357458  638  0.000044  gov.copyright
732  19356834  3433  0.000011  com.lawfareblog
733  19354610  2357  0.000015  co.pcdn
734  19353004  3263  0.000012  com.nyt
735  19352766  3101  0.000012  se.svt
736  19351866  1048  0.000030  net.clickbank
737  19351546  3121  0.000012  com.scotsman
738  19348720  1182  0.000027  com.foursquare
739  19348660  1239  0.000026  com.pingdom
740  19348048  2475  0.000014  com.squarespace-cdn
741  19346678  2323  0.000015  com.natlawreview
742  19346350  2769  0.000013  org.wri
743  19345800  3430  0.000011  com.bigthink
744  19345054  4132  0.000009  com.newgrounds
745  19344692  3862  0.000010  org.sourcewatch
746  19342356  3720  0.000010  re.cli
747  19341788  3156  0.000012  gov.ncjrs
748  19341458  3087  0.000012  my.com.thestar
749  19340698  3307  0.000011  gov.anl
750  19339932  3117  0.000012  com.nationalreview
751  19339132  2597  0.000014  ca.newswire
752  19338090  1603  0.000020  org.webkit
753  19337402  3700  0.000010  org.elasticsearch
754  19335276  928  0.000035  com.hootsuite
755  19334936  300  0.000088  com.caniuse
756  19334252  3236  0.000012  gov.fec
757  19333910  2327  0.000015  ru.rg
758  19333124  3741  0.000010  org.constitutioncenter
759  19332102  1602  0.000020  com.jwplayer
760  19331754  4253  0.000009  com.etymonline
761  19331678  3620  0.000010  it.eventbrite
762  19331510  2960  0.000013  com.madmimi
763  19331460  3491  0.000011  com.afp
764  19330192  1907  0.000019  com.kinstacdn
765  19328136  3163  0.000012  gov.ornl
766  19327042  461  0.000061  com.pubmatic
767  19325866  401  0.000066  gg.discord
768  19325518  1289  0.000025  com.intuit
769  19325482  1168  0.000027  com.ycombinator
770  19325258  3292  0.000011  com.crashlytics
771  19324302  4270  0.000009  com.underconsideration
772  19322856  2599  0.000014  com.articulate
773  19322230  3246  0.000012  de.uni-frankfurt
774  19321496  3692  0.000010  uk.co.spectator
775  19321096  867  0.000037  com.wikihow
776  19321010  4275  0.000009  to.gplus
777  19320802  4920  0.000008  pl.pastebin
778  19320622  3791  0.000010  uk.co.manchestereveningnews
779  19319854  2938  0.000013  edu.unh
780  19318976  2553  0.000014  de.tagesschau
781  19318802  2116  0.000017  gov.energystar
782  19318372  429  0.000063  com.businesswire
783  19318050  829  0.000038  com.moz
784  19314848  3550  0.000010  org.avaaz
785  19314554  3683  0.000010  com.mnn
786  19314476  1172  0.000027  com.alexa
787  19314150  2332  0.000015  net.vnexpress
788  19313268  348  0.000077  com.constantcontact
789  19312732  3600  0.000010  com.heraldscotland
790  19312326  3843  0.000010  fm.audioboo
791  19311750  4481  0.000008  tv.eurovision
792  19311646  974  0.000033  com.fandom
793  19311256  3717  0.000010  uk.ac.uea
794  19311174  3697  0.000010  uk.ac.core
795  19310268  3514  0.000011  com.hsbc
796  19310254  3492  0.000011  org.sciencenews
797  19310242  4916  0.000008  com.blackplanet
798  19310096  3289  0.000011  com.realclearpolitics
799  19309366  1698  0.000020  com.pastebin
800  19309196  3190  0.000012  uk.org.rspb
801  19308322  1377  0.000023  com.techradar
802  19308094  529  0.000053  com.indeed
803  19307548  4985  0.000007  dk.bloggersdelight
804  19307144  4491  0.000008  com.xtgem
805  19306108  2073  0.000017  ca.on.gov
806  19305500  3536  0.000011  uk.co.thisismoney
807  19304908  797  0.000040  gov.sec
808  19302330  1128  0.000028  net.atlassian
809  19302240  3937  0.000009  com.collinsdictionary
810  19299944  1479  0.000022  edu.purdue
811  19299020  3179  0.000012  com.wayfair
812  19298908  3611  0.000010  org.chathamhouse
813  19297900  3218  0.000012  org.rferl
814  19297216  397  0.000066  com.skype
815  19296536  4738  0.000008  edu.ualr
816  19296016  3523  0.000011  org.diva-portal
817  19295672  2785  0.000013  org.cfr
818  19294806  1249  0.000025  com.merriam-webster
819  19292968  4835  0.000008  com.designobserver
820  19292734  3399  0.000011  org.pewforum
821  19292200  270  0.000100  jp.co.amazon
822  19291468  3994  0.000009  uk.co.dailyrecord
823  19290936  3951  0.000009  edu.swarthmore
824  19290570  3339  0.000011  com.ubs
825  19289748  1075  0.000030  so.notion
826  19289742  2847  0.000013  us.govtrack
827  19289236  1256  0.000025  com.udemy
828  19289040  333  0.000079  com.hackerone
829  19288716  3787  0.000010  org.nationalinterest
830  19288626  3138  0.000012  com.doubleclickbygoogle
831  19288000  279  0.000097  de.amazon
832  19287244  2036  0.000018  org.doxygen
833  19286840  1661  0.000020  scot.gov
834  19286652  3933  0.000009  de.berliner-zeitung
835  19285868  1519  0.000021  com.billboard
836  19283910  681  0.000042  com.gartner
837  19283390  4698  0.000008  net.writeablog
838  19282688  2465  0.000014  com.infoworld
839  19282084  823  0.000039  com.sedo
840  19281700  3200  0.000012  org.aei
841  19280820  71  0.000502  com.oculus
842  19280652  1580  0.000021  edu.ucsd
843  19280396  329  0.000081  mp.mailchi
844  19280288  3917  0.000009  edu.umaine
845  19279222  3262  0.000012  org.iucnredlist
846  19279130  2827  0.000013  com.lexology
847  19278304  4851  0.000008  com.nation2
848  19278156  5290  0.000007  com.anotepad
849  19278056  4128  0.000009  za.co.mg
850  19276824  77  0.000467  com.messenger
851  19276460  2083  0.000017  org.dejure
852  19276002  4494  0.000008  net.blogfreely
853  19275630  1302  0.000024  org.owasp
854  19275142  3309  0.000011  com.foreignaffairs
855  19275092  4067  0.000009  tw.com.books
856  19274916  4267  0.000009  ca.nfb
857  19274822  364  0.000073  com.bitly
858  19274560  3225  0.000012  org.osce
859  19274028  3726  0.000010  uk.org.wwf
860  19274006  3971  0.000009  org.truthout
861  19273104  155  0.000178  gov.privacyshield
862  19272708  1981  0.000018  edu.uci
863  19272368  2044  0.000017  se.haxx
864  19272288  897  0.000036  com.emarketer
865  19272110  4532  0.000008  com.symbaloo
866  19271508  1004  0.000032  com.playstation
867  19271338  2196  0.000016  org.sundance
868  19271216  363  0.000073  eu.youronlinechoices
869  19271196  3496  0.000011  com.rev
870  19271080  4071  0.000009  in.thewire
871  19270976  159  0.000174  org.nginx
872  19270528  903  0.000036  com.libsyn
873  19268650  2400  0.000015  us.pa.state
874  19267610  146  0.000205  me.line
875  19267478  5202  0.000007  net.bravejournal
876  19267386  3140  0.000012  ru.kp
877  19267334  4014  0.000009  com.ecowatch
878  19266700  514  0.000055  org.debian
879  19266302  539  0.000052  com.gofundme
880  19266194  976  0.000033  com.pcmag
881  19264914  4151  0.000009  com.theoutline
882  19264512  4316  0.000009  org.icj-cij
883  19263626  1470  0.000022  org.coursera
884  19261610  2076  0.000017  gov.healthcare
885  19260626  3721  0.000010  com.iconarchive
886  19259734  1657  0.000020  net.leadpages
887  19259034  1486  0.000022  com.technologyreview
888  19258032  2367  0.000015  ca.citizenlab
889  19257884  3690  0.000010  com.governing
890  19257782  3322  0.000011  com.wikidot
891  19257260  2385  0.000015  org.raspberrypi
892  19256452  4621  0.000008  jp.ac.kobe-u
893  19255454  1073  0.000030  com.timeanddate
894  19254836  1096  0.000029  com.buffer
895  19254032  3978  0.000009  com.ogilvy
896  19251530  940  0.000034  com.css-tricks
897  19251096  1501  0.000021  com.msdn
898  19250138  3958  0.000009  com.gab
899  19249994  3673  0.000010  com.what3words
900  19249260  1241  0.000026  com.tableau
901  19248316  1319  0.000024  com.xkcd
902  19248224  3695  0.000010  com.nestle
903  19247678  4982  0.000007  net.postheaven
904  19246428  470  0.000060  com.fc2
905  19246238  1795  0.000019  com.pcworld
906  19246028  2589  0.000014  mp.j
907  19245754  4318  0.000009  org.kuow
908  19245300  3906  0.000009  org.migrationpolicy
909  19245282  585  0.000047  com.fortune
910  19244324  3769  0.000010  de.morgenpost
911  19244120  3282  0.000011  uk.gov.data
912  19243558  4952  0.000007  cz.webgarden
913  19243100  2118  0.000017  org.donorbox
914  19242192  3909  0.000009  de.uni-konstanz
915  19241684  4218  0.000009  org.birdlife
916  19240982  3875  0.000010  org.people-press
917  19240778  2132  0.000017  to.dev
918  19239846  906  0.000035  org.golang
919  19238732  2425  0.000015  net.noscript
920  19237742  1223  0.000026  com.podbean
921  19235906  4130  0.000009  com.scienceblogs
922  19235706  4948  0.000007  it.clyp
923  19235498  3355  0.000011  edu.fordham
924  19231696  4076  0.000009  org.oyez
925  19230656  3441  0.000011  com.joebiden
926  19229960  2867  0.000013  com.washingtonexaminer
927  19229728  1115  0.000028  com.gizmodo
928  19229112  2757  0.000013  org.healthaffairs
929  19228910  1232  0.000026  com.searchengineland
930  19228678  854  0.000037  fm.anchor
931  19227412  5084  0.000007  com.zcubes
932  19227258  1995  0.000018  com.ssllabs
933  19225964  1072  0.000030  org.poynter
934  19224436  1644  0.000020  net.java
935  19223632  1514  0.000021  edu.usc
936  19223252  3680  0.000010  org.carbonbrief
937  19221502  5165  0.000007  org.csgrid
938  19221286  308  0.000087  jp.ameblo
939  19220064  1578  0.000021  com.sun
940  19220010  3959  0.000009  org.rfa
941  19218588  2616  0.000014  uk.gov.defra
942  19218556  3912  0.000009  com.exxonmobil
943  19218102  5249  0.000007  com.topsitenet
944  19217732  3012  0.000012  com.html5rocks
945  19217494  3660  0.000010  ca.yelp
946  19216576  2940  0.000013  com.instructables
947  19215582  2212  0.000016  org.linuxfoundation
948  19215410  4069  0.000009  uk.org.woodlandtrust
949  19213854  2058  0.000017  org.json
950  19213790  214  0.000124  com.tripadvisor
951  19212490  5233  0.000007  net.squareblogs
952  19212378  3864  0.000010  ru.mid
953  19212170  231  0.000113  com.myshopify
954  19211108  3310  0.000011  com.flippa
955  19211092  3850  0.000010  com.townandcountrymag
956  19210938  1292  0.000025  build.bazel
957  19210816  5295  0.000007  net.werite
958  19210212  1240  0.000026  com.uk
959  19209742  2354  0.000015  com.storify
960  19209508  3280  0.000011  org.cjr
961  19208854  3158  0.000012  org.acog
962  19208448  3921  0.000009  br.com.sebrae
963  19208380  250  0.000107  org.icann
964  19207876  1635  0.000020  fr.blogspot
965  19207582  122  0.000262  com.bizjournals
966  19207280  3406  0.000011  org.cites
967  19207102  2687  0.000013  com.tutsplus
968  19207058  3409  0.000011  tr.com.aa
969  19206380  1109  0.000028  org.whatbrowser
970  19205750  4680  0.000008  org.learner
971  19205144  3424  0.000011  no.yr
972  19203738  4271  0.000009  com.s-nbcnews
973  19203150  4166  0.000009  org.spie
974  19203082  1335  0.000024  com.indiegogo
975  19202634  708  0.000042  com.airbnb
976  19202288  4217  0.000009  com.revolut
977  19201514  4339  0.000009  org.atsjournals
978  19201386  1033  0.000031  com.redhat
979  19200760  4066  0.000009  uk.co.zoopla
980  19199826  318  0.000084  it.google
981  19199246  1137  0.000028  com.windowsphone
982  19198666  1485  0.000022  edu.unc
983  19198508  466  0.000060  gov.fda
984  19198408  653  0.000043  com.zapier
985  19198272  2161  0.000016  com.gigaom
986  19197316  4457  0.000008  ru.novayagazeta
987  19196504  1936  0.000018  br.com.correios
988  19196468  4101  0.000009  google.design
989  19195350  2194  0.000016  org.eu
990  19192238  3758  0.000010  com.mail-archive
991  19191310  4437  0.000008  com.out
992  19191000  4759  0.000008  tw.focustaiwan
993  19190946  4235  0.000009  org.insideclimatenews
994  19190774  2038  0.000017  com.freeprivacypolicy
995  19190442  4265  0.000009  org.escardio
996  19190354  4663  0.000008  com.theschooloflife
997  19189766  241  0.000111  com.naver
998  19188362  4711  0.000008  edu.uah
999  19188230  1611  0.000020  com.nike
1000  19187360  4319  0.000009  edu.mtsu

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

May 2021 crawl archive now available

The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.28 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The May crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-21/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.
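As a minimal sketch of that step, the following Python snippet decompresses a path listing and prepends a prefix to every line. The helper name `expand_paths` and the two sample paths are made up for illustration; a real listing would be downloaded first.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(listing_gz: bytes, prefix: str) -> list[str]:
    """Decompress a *.paths.gz listing and prepend `prefix` to each non-empty line."""
    with gzip.open(io.BytesIO(listing_gz), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Stand-in for a downloaded listing; both relative paths are hypothetical.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2021-21/segments/0000000000000.00/warc/example-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2021-21/segments/0000000000000.00/warc/example-00001.warc.gz\n"
)

http_urls = expand_paths(sample_listing, HTTP_PREFIX)
s3_urls = expand_paths(sample_listing, S3_PREFIX)
print(http_urls[0])
```

The same helper works for any of the listings, since they all share the one-relative-path-per-line layout described above.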

File Type                File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2021-21/segment.paths.gz          100
WARC files               CC-MAIN-2021-21/warc.paths.gz             64000   66.17
WAT files                CC-MAIN-2021-21/wat.paths.gz              64000   17.61
WET files                CC-MAIN-2021-21/wet.paths.gz              64000   7.65
Robots.txt files         CC-MAIN-2021-21/robotstxt.paths.gz        64000   0.17
Non-200 responses files  CC-MAIN-2021-21/non200responses.paths.gz  64000   1.86
URL index files          CC-MAIN-2021-21/cc-index.paths.gz         302     0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-21/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

April 2021 crawl archive now available

The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content. It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-17/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

File Type                File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2021-17/segment.paths.gz          100
WARC files               CC-MAIN-2021-17/warc.paths.gz             64000   69.78
WAT files                CC-MAIN-2021-17/wat.paths.gz              64000   21.05
WET files                CC-MAIN-2021-17/wet.paths.gz              64000   9.25
Robots.txt files         CC-MAIN-2021-17/robotstxt.paths.gz        64000   0.16
Non-200 responses files  CC-MAIN-2021-17/non200responses.paths.gz  64000   1.76
URL index files          CC-MAIN-2021-17/cc-index.paths.gz         302     0.23

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-17/. The columnar index has also been updated to include this crawl.

February/March 2021 crawl archive now available

The crawl archive for February/March 2021 is now available! The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.2 billion new URLs, not visited in any of our prior crawls.

Improvements and Fixes

The ISO639-3 code for the Hmong language was updated to "hmn" – the code "blu" used so far was already deprecated in 2008. Crawl archives prior to this crawl will still use the code "blu". More details about this update are found here.

Archive Location and Download

The February/March crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-10/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

File Type                File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2021-10/segment.paths.gz          100
WARC files               CC-MAIN-2021-10/warc.paths.gz             64000   62.51
WAT files                CC-MAIN-2021-10/wat.paths.gz              64000   18.44
WET files                CC-MAIN-2021-10/wet.paths.gz              64000   8.06
Robots.txt files         CC-MAIN-2021-10/robotstxt.paths.gz        64000   0.2
Non-200 responses files  CC-MAIN-2021-10/non200responses.paths.gz  64000   1.58
URL index files          CC-MAIN-2021-10/cc-index.paths.gz         302     0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-10/. The columnar index has also been updated to include this crawl.

Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2020 and January 2021. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.

Host-level graph

The graph consists of 490 million nodes and 2.57 billion edges and includes dangling nodes, i.e., hosts that have not been crawled, yet are pointed to from a link on a crawled page. There are 414 million dangling nodes (84.4%), and the largest strongly connected component contains 42.6 million (8.7%) nodes.
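In graph terms, a dangling node is a node without any outgoing edge. The sketch below counts such nodes in a toy edge list. It assumes node ids run from 0 to `num_nodes - 1`; the function name and the five-node graph are made up for illustration.

```python
def count_dangling(num_nodes, edges):
    """Return (count, share) of nodes that have no outgoing edge.

    Assumes every source id in `edges` lies in 0 .. num_nodes - 1.
    """
    nodes_with_outlinks = {src for src, _dst in edges}
    dangling = num_nodes - len(nodes_with_outlinks)
    return dangling, dangling / num_nodes


# Toy graph: nodes 0 and 1 link to each other and to nodes 2 and 3;
# node 4 is isolated. Nodes 2, 3 and 4 have no outgoing edge.
toy_edges = [(0, 1), (1, 0), (0, 2), (1, 3)]
dangling, share = count_dangling(5, toy_edges)
print(f"{dangling} dangling nodes ({share:.1%})")  # 3 dangling nodes (60.0%)
```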

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
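The conversion can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only covers what is stated above (stripping a leading www. and reversing the labels); any further normalization the actual pipeline may apply is not shown.

```python
def to_reversed_host(host: str) -> str:
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def from_reversed_host(rev_host: str) -> str:
    """Turn a reversed host name back into the usual left-to-right order."""
    return ".".join(reversed(rev_host.split(".")))


print(to_reversed_host("www.subdomain.example.com"))  # com.example.subdomain
print(from_reversed_host("com.example.subdomain"))    # subdomain.example.com
```

Note that the stripped www. cannot be recovered when converting back.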

You can download the graph and the ranks of all 490 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/host/ as prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 36 gzip-compressed files listed in two path listings – one for the nodes, one for the edges. First, download the paths listing and uncompress it using “gzip”. By adding the prefix s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line in the path listing you get the list of URLs to download the entire graph.

Download files of the Common Crawl Oct/Nov/Jan 2020-2021 host-level webgraph

Size      File                                                Description
3.08 GB   cc-main-2020-21-oct-nov-jan-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 12 vertices files
11.76 GB  cc-main-2020-21-oct-nov-jan-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 24 edges files
5.18 GB   cc-main-2020-21-oct-nov-jan-host.graph              graph in BVGraph format
2 kB      cc-main-2020-21-oct-nov-jan-host.properties
5.63 GB   cc-main-2020-21-oct-nov-jan-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2020-21-oct-nov-jan-host-t.properties
1 kB      cc-main-2020-21-oct-nov-jan-host.stats              WebGraph statistics
7.04 GB   cc-main-2020-21-oct-nov-jan-host-ranks.txt.gz       harmonic centrality and pagerank

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
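To illustrate the idea, here is a simplified sketch of such an aggregation. It is not the actual pipeline: `PUBLIC_SUFFIXES` is a tiny hypothetical excerpt of the public suffix list, wildcard and exception rules are ignored, and dropping links that stay within one domain is an assumption made for this example only.

```python
from collections import defaultdict

# Tiny hypothetical excerpt; the real public suffix list is far larger
# and also contains wildcard and exception rules, which are ignored here.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host, suffixes=PUBLIC_SUFFIXES):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):  # longest candidate suffix first
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None


host_edges = [
    ("blog.example.com", "news.bbc.co.uk"),
    ("www.example.com", "bbc.co.uk"),
    ("blog.example.com", "www.example.com"),
]

hosts_per_domain = defaultdict(set)
domain_edges = set()
for src, dst in host_edges:
    src_pld, dst_pld = pay_level_domain(src), pay_level_domain(dst)
    if src_pld is None or dst_pld is None:
        continue
    hosts_per_domain[src_pld].add(src)
    hosts_per_domain[dst_pld].add(dst)
    if src_pld != dst_pld:  # assumption: links within one domain are dropped
        domain_edges.add((src_pld, dst_pld))

print(sorted(domain_edges))
print({domain: len(hosts) for domain, hosts in hosts_per_domain.items()})
```

The three host-level edges collapse into a single domain-level edge from example.com to bbc.co.uk, and each domain is left with a count of two hosts.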

The domain-level graph has 86 million nodes and 1.47 billion edges. 43 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/domain/ or via HTTP under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/domain/.

Download files of the Common Crawl Oct/Nov/Jan 2020-2021 domain-level webgraph

Size     File                                                Description
0.59 GB  cc-main-2020-21-oct-nov-jan-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
6.00 GB  cc-main-2020-21-oct-nov-jan-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
3.40 GB  cc-main-2020-21-oct-nov-jan-domain.graph            graph in BVGraph format
2 kB     cc-main-2020-21-oct-nov-jan-domain.properties
3.26 GB  cc-main-2020-21-oct-nov-jan-domain-t.graph          transpose of the graph
2 kB     cc-main-2020-21-oct-nov-jan-domain-t.properties
1 kB     cc-main-2020-21-oct-nov-jan-domain.stats            WebGraph statistics
1.85 GB  cc-main-2020-21-oct-nov-jan-domain-ranks.txt.gz     harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 86 million domain ranks is available for download.
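A hedged sketch of how the ranks file could be read in Python follows. It assumes the file is tab-separated with its first five columns in the order harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value and reversed domain name, and that header lines start with "#"; check the actual file before relying on this layout. The sample reuses the first three entries of the top-1000 list, with values as printed there.

```python
import csv
import gzip
import io
from typing import NamedTuple


class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_domain: str


def read_ranks(lines):
    """Yield Rank records from tab-separated text (column layout is assumed)."""
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        yield Rank(int(row[0]), float(row[1]), int(row[2]), float(row[3]), row[4])


# In-memory stand-in for the gzipped ranks file.
sample = gzip.compress(
    b"1\t30355566\t1\t0.017956\tcom.googleapis\n"
    b"2\t29427164\t3\t0.012871\tcom.facebook\n"
    b"3\t28173562\t2\t0.012899\tcom.google\n"
)

with gzip.open(io.BytesIO(sample), mode="rt", encoding="utf-8", newline="") as f:
    ranks = list(read_ranks(f))

# Re-order the harmonic-centrality list by PageRank position.
by_pagerank = sorted(ranks, key=lambda r: r.pr_rank)
print([r.rev_domain for r in by_pagerank])
```

Because `read_ranks` is a generator, the full file can be streamed line by line without loading all 86 million rows into memory; only the `list(...)` call in the example materializes them.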

Top 1000 domains ranked by harmonic centrality (Oct/Nov/Jan 2020-2021)

harmonic centrality rank  hc value  page rank  page rank value  reversed domain name
1  30355566  1  0.017956  com.googleapis
2  29427164  3  0.012871  com.facebook
3  28173562  2  0.012899  com.google
4  25702812  5  0.007348  com.twitter
5  25628314  4  0.007628  org.w
6  25297808  6  0.007231  com.youtube
7  24195466  9  0.005352  com.instagram
8  23355356  8  0.005532  org.gmpg
9  23233674  7  0.006500  com.googletagmanager
10  22492432  11  0.003277  com.linkedin
11  21576402  10  0.004076  com.cloudflare
12  21468510  14  0.002649  com.gravatar
13  21395642  13  0.003020  org.wordpress
14  21353798  22  0.001726  com.pinterest
15  20946722  30  0.001242  org.wikipedia
16  20926308  19  0.001834  com.wordpress
17  20877776  16  0.002056  com.gstatic
18  20799472  15  0.002451  com.bootstrapcdn
19  20795402  18  0.001943  com.apple
20  20626472  32  0.001165  com.vimeo
21  20527986  41  0.000886  be.youtu
22  20419038  21  0.001769  com.jquery
23  20391686  28  0.001246  com.microsoft
24  20327544  24  0.001500  com.wp
25  20314602  45  0.000769  com.blogspot
26  20231490  37  0.001025  com.amazonaws
27  20208912  51  0.000691  com.amazon
28  20199388  47  0.000740  gl.goo
29  20093688  71  0.000448  com.tumblr
30  20070176  35  0.001070  com.google-analytics
31  20050256  61  0.000598  ly.bit
32  20030452  20  0.001794  com.adobe
33  19998314  17  0.002005  com.github
34  19989010  50  0.000715  org.mozilla
35  19962834  58  0.000639  eu.europa
36  19945306  34  0.001103  net.cloudfront
37  19849112  52  0.000682  com.flickr
38  19843288  40  0.000909  net.jsdelivr
39  19833032  91  0.000369  com.googleusercontent
40  19823560  105  0.000347  com.yahoo
41  19752300  56  0.000650  co.t
42  19722088  33  0.001114  com.googlesyndication
43  19712406  23  0.001517  com.fontawesome
44  19708354  81  0.000392  com.weebly
45  19706054  55  0.000653  com.paypal
46  19695288  109  0.000308  com.reddit
47  19641534  31  0.001231  me.wp
48  19640398  73  0.000435  com.medium
49  19635162  67  0.000491  io.github
50  19590444  137  0.000225  com.nytimes
51  19587880  121  0.000280  com.soundcloud
52  19585192  27  0.001262  ru.yandex
53  19583494  43  0.000786  com.addthis
54  19582250  44  0.000776  com.macromedia
55  19560416  66  0.000504  org.w3
56  19549714  70  0.000451  com.shopify
57  19518672  146  0.000201  com.forbes
58  19502448  144  0.000205  org.archive
59  19496300  90  0.000371  org.creativecommons
60  19490348  194  0.000131  uk.co.bbc
61  19482926  59  0.000630  org.schema
62  19479528  39  0.000910  com.baidu
63  19464572  36  0.001035  net.doubleclick
64  19459966  200  0.000129  com.cnn
65  19451100  53  0.000677  com.whatsapp
66  19449068  60  0.000611  com.vk
67  19444966  206  0.000126  net.slideshare
68  19443956  158  0.000169  com.bing
69  19419878  174  0.000152  com.imdb
70  19385956  186  0.000140  com.imgur
71  19372520  236  0.000112  com.washingtonpost
72  19371076  176  0.000150  com.theguardian
73  19356952  254  0.000102  com.wsj
74  19356474  210  0.000123  org.wikimedia
75  19352128  219  0.000117  com.businessinsider
76  19347698  209  0.000123  com.stackoverflow
77  19342712  409  0.000065  com.msn
78  19326654  327  0.000079  com.appspot
79  19324334  157  0.000172  int.who
80  19321112  216  0.000119  edu.stanford
81  19316796  179  0.000148  org.apache
82  19310390  333  0.000078  com.ibm
83  19309354  337  0.000077  edu.mit
84  19304938  225  0.000116  net.sourceforge
85  19292932  116  0.000288  com.ytimg
86  19287812  57  0.000649  net.fbcdn
87  19282486  285  0.000091  com.techcrunch
88  19276500  269  0.000094  com.bbc
89  19275480  155  0.000181  com.wixsite
90  19275222  152  0.000189  gov.nih
91  19275200  220  0.000117  com.livejournal
92  19270650  233  0.000113  uk.co.google
93  19270610  440  0.000062  gov.nasa
94  19263354  54  0.000666  com.googleadservices
95  19243404  262  0.000097  edu.harvard
96  19243154  270  0.000094  com.oracle
97  19243126  276  0.000093  org.acm
98  19238650  218  0.000117  org.ietf
99  19238450  185  0.000142  com.blogger
100  19238426  223  0.000116  gov.ca
101  19234630  465  0.000059  fr.free
102  19232058  259  0.000098  com.bloomberg
103  19221844  275  0.000093  com.android
104  19218636  304  0.000085  com.live
105  19210812  126  0.000271  com.jimdo
106  19208896  169  0.000159  com.issuu
107  19205802  166  0.000162  com.giphy
108  19194156  438  0.000062  com.ted
109  19190178  348  0.000075  com.huffingtonpost
110  19187782  130  0.000254  com.weibo
111  19186862  154  0.000186  us.zoom
112  19185794  252  0.000103  org.gnu
113  19176332  403  0.000066  com.myspace
114  19162122  1039  0.000030  com.wikia
115  19152582  373  0.000071  net.researchgate
116  19150058  343  0.000075  com.usatoday
117  19148332  309  0.000084  com.reuters
118  19143988  400  0.000067  uk.co.telegraph
119  19141202  446  0.000061  com.latimes
120  19130976  372  0.000071  com.example
121  19129552  345  0.000075  com.githubusercontent
122  19127344  93  0.000366  com.unpkg
123  19127116  384  0.000069  com.nature
124  19125396  336  0.000077  com.wired
125  19124320  25  0.001485  com.wixstatic
126  19114842  299  0.000087  org.npr
127  19111018  308  0.000084  com.cnbc
128  19107772  328  0.000079  com.ebay
129  19103704  293  0.000088  com.wiley
130  19102814  111  0.000299  de.google
131  19097732  191  0.000135  com.npmjs
132  19095454  344  0.000075  com.hp
133  19088550  539  0.000050  com.cisco
134  19084048  932  0.000034  com.stackexchange
135  19081736  132  0.000251  com.youtube-nocookie
136  19080638  134  0.000250  com.ft
137  19078814  213  0.000120  org.ampproject
138  19077232  532  0.000051  com.steampowered
139  19074638  365  0.000072  com.patreon
140  19072918  455  0.000061  com.theatlantic
141  19072880  476  0.000057  com.gitlab
142  19072344  890  0.000035  com.pcmag
143  19068436  195  0.000131  com.unsplash
144  19065494  877  0.000036  edu.psu
145  19063926  376  0.000070  com.time
146  19061142  208  0.000125  com.twimg
147  19061064  164  0.000165  com.yelp
148  19059332  873  0.000036  edu.washington
149  19057196  533  0.000051  edu.cornell
150  19054152  148  0.000197  com.dropbox
151  19051738  603  0.000046  org.arxiv
152  19047626  379  0.000070  com.statista
153  19043050  324  0.000080  org.un
154  19042602  249  0.000104  com.bandcamp
155  19040914  824  0.000038  com.venturebeat
156  19040684  75  0.000432  me.fb
157  19039882  841  0.000037  org.chromium
158  19033464  65  0.000519  com.wix
159  19026244  284  0.000092  com.sciencedirect
160  19019766  629  0.000045  edu.yale
161  19016326  584  0.000047  com.pexels
162  19015230  826  0.000038  org.bitbucket
163  19010452  832  0.000038  org.ieee
164  19007636  388  0.000068  com.springer
165  19001810  765  0.000041  com.evernote
166  18997506  855  0.000037  edu.upenn
167  18994926  258  0.000098  jp.ameblo
168  18993772  149  0.000195  me.t
169  18992834  416  0.000065  org.hbr
170  18992028  296  0.000088  com.outlook
171  18985954  168  0.000160  jp.co.yahoo
172  18983238  577  0.000048  com.cbsnews
173  18982546  792  0.000040  me.about
174  18981228  891  0.000035  com.git-scm
175  18980336  829  0.000038  com.economist
176  18980328  150  0.000193  com.opera
177  18978056  138  0.000223  me.line
178  18974996  450  0.000061  com.goodreads
179  18973364  645  0.000044  com.mysql
180  18973114  842  0.000037  com.docker
181  18969708  562  0.000048  com.buzzfeed
182  18969566  565  0.000048  com.mashable
183  18968398  587  0.000047  com.mozilla
184  18964540  951  0.000034  com.about
185  18962632  797  0.000040  org.worldbank
186  18956128  815  0.000039  com.newyorker
187  18954668  342  0.000076  com.dribbble
188  18954236  265  0.000096  net.behance
189  18951876  390  0.000068  com.theverge
190  18951838  501  0.000054  gov.whitehouse
191  18950142  456  0.000061  uk.co.dailymail
192  18943890  347  0.000075  com.xinhuanet
193  18942812  320  0.000080  com.w3schools
194  18941124  378  0.000070  com.fc2
195  18936488  1151  0.000027  edu.wisc
196  18935074  764  0.000041  gov.noaa
197  18932396  294  0.000088  com.disqus
198  18931228  1337  0.000023  co.elastic
199  18927646  38  0.000956  com.qq
200  18926694  448  0.000061  com.bigcommerce
201  18926442  624  0.000045  gov.loc
202  18925620  156  0.000179  gov.cdc
203  18924632  929  0.000035  gov.fcc
204  18922816  136  0.000228  info.aboutads
205  18921630  821  0.000039  com.qz
206  18921308  2295  0.000015  com.wikidot
207  18919240  385  0.000069  com.scribd
208  18915104  748  0.000042  org.unesco
209  18914418  959  0.000033  com.apnews
210  18912426  375  0.000070  com.digg
211  18911082  779  0.000040  com.vox
212  18910370  180  0.000147  com.amazon-adsystem
213  18910110  272  0.000094  com.squareup
214  18907410  495  0.000054  uk.co.independent
215  18906224  256  0.000100  org.iana
216  18905608  1251  0.000025  edu.uchicago
217  18901398  420  0.000064  com.force
218  18898702  646  0.000044  com.usnews
219  18898108  647  0.000044  com.gartner
220  18894918  295  0.000088  com.nbcnews
221  18890160  470  0.000058  com.dailymotion
222  18883488  1004  0.000031  com.dropboxusercontent
223  18878276  617  0.000045  org.pbs
224  18876454  181  0.000147  jp.co.google
225  18876164  113  0.000292  com.sharethis
226  18875824  467  0.000059  com.nationalgeographic
227  18874112  811  0.000039  uk.co.blogspot
228  18873340  844  0.000037  au.net.abc
229  18868000  934  0.000034  com.foxnews
230  18865322  1559  0.000020  org.eclipse
231  18859464  399  0.000067  com.getpocket
232  18859228  947  0.000034  com.slate
233  18859062  266  0.000095  org.doi
234  18858866  63  0.000541  com.fb
235  18856638  968  0.000033  com.politico
236  18849992  907  0.000035  com.playstation
237  18849334  600  0.000046  org.semver
238  18848468  1565  0.000020  gd.is
239  18847004  1311  0.000024  edu.unc
240  18846758  1523  0.000021  org.kernel
241  18846310  839  0.000037  org.sciencemag
242  18846038  257  0.000099  com.typepad
243  18844998  1152  0.000027  com.hatenablog
244  18844004  1981  0.000018  com.googlesource
245  18842180  202  0.000128  com.naver
246  18840548  248  0.000104  com.feedburner
247  18839830  1028  0.000030  edu.umn
248  18837518  421  0.000064  com.ecwid
249  18833048  332  0.000078  net.windows
250  18831042  914  0.000035  com.trello
251  18829176  554  0.000049  com.tandfonline
252  18829172  1369  0.000023  cn.com.chinadaily
253  18828382  189  0.000138  org.allaboutcookies
254  18825844  746  0.000042  gov.senate
255  18823946  119  0.000286  com.paypalobjects
256  18819980  1005  0.000031  ly.ow
257  18818724  2014  0.000017  org.tensorflow
258  18818710  901  0.000035  edu.umich
259  18817936  291  0.000089  com.tinyurl
260  18817212  479  0.000056  org.pewresearch
261  18815000  76  0.000423  com.list-manage
262  18811132  239  0.000111  com.wpengine
263  18806908  834  0.000038  ca.cbc
264  18805144  740  0.000043  co.ibb
265  18804044  477  0.000057  gov.fda
266  18802934  222  0.000117  com.eepurl
267  18802462  318  0.000081  it.google
268  18798744  79  0.000413  net.facebook
269  18797046  2019  0.000017  com.instructables
270  18795562  1200  0.000026  edu.northwestern
271  18794710  752  0.000042  org.change
272  18793610  394  0.000068  es.google
273  18793484  893  0.000035  org.cambridge
274  18790202  251  0.000103  com.calendly
275  18784862  962  0.000033  gov.congress
276  18784862  1022  0.000030  uk.co.guardian
277  18782014  555  0.000049  com.bigcartel
278  18777808  1348  0.000023  org.semanticscholar
279  18776340  1006  0.000031  com.gumroad
280  18775690  637  0.000044  org.plos
281  18774956  1341  0.000023  com.nikkei
282  18773712  313  0.000083  com.optimizely
283  18772988  405  0.000066  com.googlecode
284  18766674  896  0.000035  gov.justice
285  18764788  1044  0.000029  com.huffpost
286  18764312  153  0.000186  com.addtoany
287  18763408  398  0.000067  me.m
288  18761658  80  0.000403  com.wsimg
289  18760046  411  0.000065  com.tripod
290  18754884  957  0.000033  ee.linktr
291  18754526  1021  0.000030  gov.usgs
292  18753164  1459  0.000021  uk.co.wired
293  18752728  338  0.000077  fr.google
294  18751846  1059  0.000029  com.500px
295  18751636  452  0.000061  ca.google
296  18749418  1996  0.000017  com.amd
297  18744444  1944  0.000018  com.azure
298  18742964  777  0.000040  au.com.google
299  18742506  481  0.000056  com.163
300  18741292  1091  0.000028  com.ssrn
301  18740758  1065  0.000029  com.newsweek
302  18734910  1688  0.000019  ca.utoronto
303  18734620  139  0.000218  com.spotify
304  18731112  744  0.000042  cn.com.people
305  18730384  334  0.000078  page.g
306  18730074  2751  0.000012  com.nabble
307  18728400  1454  0.000021  com.howstuffworks
308  18722938  2107  0.000016  com.lego
309  18719762  1675  0.000019  com.storify
310  18719332  1140  0.000027  uk.co.thetimes
311  18717930  801  0.000039  site.business
312  18717726  884  0.000036  uk.ac.ox
313  18716206  311  0.000083  com.bitly
314  18715060  1218  0.000026  com.scmp
315  18713618  798  0.000040  com.adage
316  18713552  654  0.000044  com.indiatimes
317  18712564  1908  0.000018  de.mpg
318  18712368  1057  0.000029  com.thehill
319  18705466  519  0.000052  com.criteo
320  18704754  1078  0.000028  org.ohchr
321  18704474  1531  0.000020  com.aljazeera
322  18703348  802  0.000039  uk.gov.service
323  18701482  1545  0.000020  org.greenpeace
324  18699064  331  0.000078  com.netdna-ssl
325  18698378  967  0.000033  ch.google
326  18693994  784  0.000040  us.icio
327  18693690  1153  0.000027  int.coe
328  18692556  933  0.000034  org.d3js
329  18690456  1499  0.000021  com.history
330  18689794  1018  0.000030  com.netlify
331  18688064  1320  0.000023  com.nymag
332  18687064  1363  0.000023  org.wiktionary
333  18684868  287  0.000091  ru.ok
334  18683792  1293  0.000024  com.intuit
335  18682796  1419  0.000022  uk.co.standard
336  18681388  1995  0.000017  edu.arizona
337  18679058  944  0.000034  gov.archives
338  18678794  953  0.000034  ru.google
339  18677084  1054  0.000029  sg.com.google
340  18675890  900  0.000035  br.com.google
341  18674402  85  0.000385  co.g
342  18674068  1975  0.000018  com.wattpad
343  18673754  526  0.000051  ru.gov
344  18673370  1351  0.000023  com.ikea
345  18668598  1461  0.000021  com.reverbnation
346  18668444  2681  0.000013  edu.drexel
347  18668276  1121  0.000027  edu.si
348  18666994  1174  0.000027  uk.co.mirror
349  18666846  2572  0.000013  org.maven
350  18666724  412  0.000065  com.cnet
351  18664542  580  0.000048  org.openstreetmap
352  18663710  1373  0.000023  com.jetbrains
353  18663688  1032  0.000030  com.theconversation
354  18663542  1921  0.000018  com.newscientist
355  18661472  847  0.000037  gov.state
356  18661146  1572  0.000020  ms.1drv
357  18660152  2644  0.000013  com.mystrikingly
358  18655360  973  0.000032  org.fao
359  18654458  590  0.000047  cn.google
360  18653472  235  0.000112  com.etsy
361  18652232  1485  0.000021  com.flipboard
362  18651820  767  0.000041  com.deviantart
363  18651514  1375  0.000023  com.thedailybeast
364  18651404  1220  0.000026  org.jstor
365  18649024  1270  0.000024  com.strikingly
366  18647422  2045  0.000017  blog.home
367  18646812  634  0.000044  com.zdnet
368  18644828  325  0.000079  tv.twitch
369  18642272  2781  0.000012  com.diigo
370  18640482  1123  0.000027  com.britannica
371  18639254  1904  0.000018  ca.ubc
372  18638840  367  0.000072  com.jotform
373  18635188  1959  0.000018  com.gettyimages
374  18634254  1685  0.000019  com.channel4
375  18631278  1494  0.000021  org.pypi
376  18630386  813  0.000039  in.co.google
377  18627814  417  0.000064  com.ssl-images-amazon
378  18626978  161  0.000166  gle.forms
379  18623310  1982  0.000018  org.hrw
380  18623132  281  0.000092  com.cloudinary
381  18618612  1382  0.000022  au.com.smh
382  18617234  1566  0.000020  uk.co.metro
383  18617180  2031  0.000017  hk.com.google
384  18617072  1599  0.000020  edu.ufl
385  18613590  2332  0.000015  ly.rebrand
386  18612786  457  0.000061  net.imgix
387  18609746  418  0.000064  com.webflow
388  18609050  2311  0.000015  com.shutterfly
389  18607782  568  0.000048  com.feedly
390  18603850  538  0.000050  gov.epa
391  18602470  104  0.000348  com.stripe
392  18601118  83  0.000391  net.jsfiddle
393  18599796  3423  0.000010  org.aclweb
394  18597166  2348  0.000014  com.yarnpkg
395  18596278  69  0.000461  net.akamaihd
396  18596202  1907  0.000018  gov.supremecourt
397  18595244  2344  0.000014  com.thefreedictionary
398  18593816  468  0.000058  nl.google
399  18592072  1578  0.000020  com.dw
400  18588294  2955  0.000012  com.upi
401  18587932  981  0.000032  com.thelancet
402  18587926  425  0.000064  com.slack
403  18587680  396  0.000067  com.kickstarter
404  18587378  787  0.000040  com.urldefense
405  18585950  1713  0.000019  ca.sfu
406  18583582  460  0.000060  com.livechatinc
407  18581082  623  0.000045  com.quora
408  18580964  428  0.000063  com.rackcdn
409  18580620  1967  0.000018  com.euronews
410  18580552  451  0.000061  com.go
411  18580130  1368  0.000023  com.tunein
412  18578076  594  0.000046  ru.liveinternet
413  18576712  475  0.000057  com.googleblog
414  18571776  2597  0.000013  pt.sapo
415  18571212  2109  0.000016  com.itv
416  18570630  1945  0.000018  uk.co.huffingtonpost
417  18570542  1286  0.000024  edu.brookings
418  18570528  4423  0.000008  tl.page
419  18570058  2369  0.000014  com.angelfire
420  18568882  2614  0.000013  org.wikibooks
421  18567302  1692  0.000019  com.ifttt
422  18564134  861  0.000036  com.freepik
423  18563246  2244  0.000015  com.netvibes
424  18562602  133  0.000251  com.mailchimp
425  18562564  364  0.000072  me.telegram
426  18562400  561  0.000048  com.microsoftonline
427  18562224  1976  0.000018  uk.co.express
428  18559206  2888  0.000012  sg.edu.nus
429  18559092  1928  0.000018  io.webflow
430  18557292  772  0.000041  pl.google
431  18555900  480  0.000056  com.meetup
432  18555482  4752  0.000007  com.newgrounds
433  18554944  2397  0.000014  google.ai
434  18554512  2439  0.000014  com.yolasite
435  18553912  2124  0.000016  jp.geocities
436  18552986  3394  0.000011  com.instapaper
437  18551338  362  0.000072  com.proofpoint
438  18548844  1358  0.000023  com.people
439  18546296  64  0.000531  net.typekit
440  18543694  2104  0.000016  org.c-span
441  18541918  159  0.000169  ru.mail
442  18541834  2043  0.000017  com.avg
443  18540650  2249  0.000015  app.netlify
444  18539394  3004  0.000011  com.000webhostapp
445  18539316  485  0.000055  com.elsevier
446  18538008  3494  0.000010  cn.edu.pku
447  18536872  1609  0.000020  com.asahi
448  18535422  876  0.000036  org.worldwildlife
449  18535204  1127  0.000027  uk.parliament
450  18534822  1956  0.000018  uk.gov.ons
451  18533694  188  0.000138  com.iubenda
452  18532790  2113  0.000016  org.documentcloud
453  18532338  3074  0.000011  uk.co.timesonline
454  18531118  264  0.000096  com.office
455  18527764  237  0.000112  com.eventbrite
456  18527012  2699  0.000013  com.self
457  18526172  2511  0.000013  com.foreignpolicy
458  18524804  2421  0.000014  org.sundance
459  18524702  214  0.000120  com.aliyuncs
460  18524140  1213  0.000026  be.google
461  18523242  2200  0.000016  ie.google
462  18523000  1432  0.000022  gov.weather
463  18522694  3136  0.000011  com.openai
464  18522588  879  0.000036  org.mediawiki
465  18521124  2806  0.000012  com.pearltrees
466  18520306  1704  0.000019  com.firebaseapp
467  18516520  3620  0.000010  com.dailycaller
468  18514512  498  0.000054  it.placehold
469  18514168  2695  0.000013  com.france24
470  18513026  644  0.000044  edu.berkeley
471  18512138  492  0.000055  cn.360
472  18511428  2296  0.000015  com.msnbc
473  18510986  2089  0.000017  com.thestar
474  18510258  3732  0.000009  me.site123
475  18509392  2133  0.000016  com.gfycat
476  18508906  341  0.000076  com.rawgit
477  18507920  521  0.000052  com.gmail
478  18507686  1952  0.000018  org.ocks
479  18506872  2739  0.000012  org.rsc
480  18504310  2486  0.000014  edu.hawaii
481  18503766  2366  0.000014  de.br
482  18503250  2447  0.000014  edu.colostate
483  18502578  171  0.000154  com.zendesk
484  18501424  2222  0.000015  org.nobelprize
485  18501096  3293  0.000011  net.pixnet
486  18500188  1528  0.000020  net.seesaa
487  18500164  2471  0.000014  com.motherjones
488  18499720  756  0.000042  com.vice
489  18499378  4234  0.000008  com.masslive
490  18496634  2355  0.000014  com.cision
491  18495058  101  0.000361  com.godaddy
492  18492104  886  0.000036  gov.nist
493  18491956  1249  0.000025  org.ilo
494  18490654  2070  0.000017  com.surveygizmo
4951849062833780.000011com.minds
496184905766350.000044com.matterport
4971848985826560.000013ph.com.google
498184881063690.000071org.python
499184870329800.000032gov.va
5001848580011660.000027at.google
501 | 18485152 | 1318 | 0.000023 | se.google
502 | 18483644 | 1961 | 0.000018 | ru.ucoz
503 | 18482996 | 2401 | 0.000014 | com.freep
504 | 18482190 | 3874 | 0.000009 | com.wizards
505 | 18481738 | 3583 | 0.000010 | edu.uvm
506 | 18478142 | 3711 | 0.000010 | org.tvtropes
507 | 18476988 | 1506 | 0.000021 | com.cognitoforms
508 | 18476516 | 1493 | 0.000021 | gov.uscourts
509 | 18476024 | 3530 | 0.000010 | org.oxfam
510 | 18473992 | 2235 | 0.000015 | cn.t
511 | 18473054 | 4331 | 0.000008 | fm.ask
512 | 18473034 | 1708 | 0.000019 | dk.google
513 | 18470526 | 3122 | 0.000011 | de.dw
514 | 18467204 | 2009 | 0.000017 | ua.com.google
515 | 18467126 | 3935 | 0.000009 | com.youdao
516 | 18464016 | 128 | 0.000262 | org.networkadvertising
517 | 18462968 | 1031 | 0.000030 | com.arstechnica
518 | 18462674 | 2310 | 0.000015 | int.unfccc
519 | 18461844 | 3323 | 0.000011 | ch.nzz
520 | 18460156 | 123 | 0.000276 | com.statcounter
521 | 18460126 | 3757 | 0.000009 | net.hinet
522 | 18460018 | 2484 | 0.000014 | com.washingtontimes
523 | 18459778 | 3391 | 0.000011 | edu.miami
524 | 18459648 | 5025 | 0.000007 | tw.com.gamer
525 | 18459120 | 4313 | 0.000008 | ch.qos
526 | 18458774 | 788 | 0.000040 | com.intel
527 | 18456584 | 2220 | 0.000015 | mx.com.google
528 | 18455734 | 2241 | 0.000015 | gov.ky
529 | 18455504 | 3426 | 0.000010 | com.nwsource
530 | 18454948 | 856 | 0.000037 | io.readthedocs
531 | 18453730 | 2187 | 0.000016 | gov.cisa
532 | 18451988 | 2256 | 0.000015 | com.straitstimes
533 | 18449466 | 371 | 0.000071 | io.codepen
534 | 18447006 | 361 | 0.000072 | com.prnewswire
535 | 18446224 | 4097 | 0.000009 | com.smore
536 | 18446132 | 2188 | 0.000016 | pt.google
537 | 18445920 | 2719 | 0.000012 | net.bplaced
538 | 18445802 | 5349 | 0.000007 | net.wargaming
539 | 18445232 | 3272 | 0.000011 | org.csis
540 | 18444732 | 1435 | 0.000022 | org.aarp
541 | 18444080 | 289 | 0.000090 | net.php
542 | 18443758 | 2282 | 0.000015 | no.google
543 | 18443228 | 3924 | 0.000009 | com.steemit
544 | 18443146 | 1304 | 0.000024 | tw.com.google
545 | 18442018 | 314 | 0.000083 | com.squarespace
546 | 18440872 | 743 | 0.000043 | com.oreilly
547 | 18440596 | 199 | 0.000130 | com.hubspot
548 | 18439354 | 4877 | 0.000007 | com.bonanza
549 | 18438802 | 2020 | 0.000017 | co.lpages
550 | 18438606 | 1079 | 0.000028 | net.ovh
551 | 18438208 | 835 | 0.000037 | com.imageshack
552 | 18437874 | 4023 | 0.000009 | com.doodlekit
553 | 18436818 | 2425 | 0.000014 | com.voanews
554 | 18436680 | 358 | 0.000073 | ru.rambler
555 | 18436048 | 2805 | 0.000012 | com.nationalpost
556 | 18435420 | 4534 | 0.000008 | by.google
557 | 18435256 | 614 | 0.000045 | org.nodejs
558 | 18435200 | 397 | 0.000067 | com.onesignal
559 | 18434470 | 3374 | 0.000011 | fr.rfi
560 | 18434466 | 463 | 0.000060 | gov.irs
561 | 18434444 | 2584 | 0.000013 | com.snopes
562 | 18434230 | 1899 | 0.000018 | link.page
563 | 18434190 | 3637 | 0.000010 | org.vim
564 | 18434018 | 2240 | 0.000015 | th.co.google
565 | 18433782 | 3395 | 0.000010 | org.scala-lang
566 | 18432434 | 3142 | 0.000011 | com.inquirer
567 | 18430898 | 2887 | 0.000012 | org.ballotpedia
568 | 18430888 | 3324 | 0.000011 | com.real
569 | 18428600 | 649 | 0.000044 | br.com.uol
570 | 18428004 | 513 | 0.000052 | com.pixabay
571 | 18426658 | 2142 | 0.000016 | uk.co.which
572 | 18426634 | 4070 | 0.000009 | com.viki
573 | 18425674 | 1038 | 0.000030 | com.thenextweb
574 | 18424302 | 3146 | 0.000011 | org.aps
575 | 18424050 | 2764 | 0.000012 | com.post-gazette
576 | 18423516 | 2499 | 0.000014 | net.openid
577 | 18422702 | 2627 | 0.000013 | edu.usf
578 | 18421138 | 82 | 0.000391 | com.livestream
579 | 18420414 | 961 | 0.000033 | jp.shinobi
580 | 18420272 | 956 | 0.000033 | int.wipo
581 | 18417146 | 4450 | 0.000008 | com.bravesites
582 | 18415542 | 2881 | 0.000012 | ru.aif
583 | 18414574 | 2906 | 0.000012 | io.gitlab
584 | 18414284 | 3387 | 0.000011 | org.pri
585 | 18414276 | 1932 | 0.000018 | gov.ct
586 | 18413984 | 2602 | 0.000013 | il.co.google
587 | 18413906 | 1910 | 0.000018 | org.oxfordjournals
588 | 18413218 | 4664 | 0.000008 | com.ucoz
589 | 18412422 | 566 | 0.000048 | com.photobucket
590 | 18412344 | 2191 | 0.000016 | com.xrea
591 | 18412198 | 2234 | 0.000015 | nz.co.google
592 | 18410920 | 2088 | 0.000017 | net.cnki
593 | 18410828 | 2847 | 0.000012 | com.webbyawards
594 | 18410164 | 433 | 0.000063 | com.staticflickr
595 | 18409934 | 3675 | 0.000010 | org.heritage
596 | 18408908 | 1993 | 0.000018 | tr.com.google
597 | 18408574 | 2053 | 0.000017 | com.treehugger
598 | 18406062 | 1695 | 0.000019 | net.leadpages
599 | 18405282 | 2112 | 0.000016 | fi.google
600 | 18402764 | 5153 | 0.000007 | kz.google
601 | 18402708 | 211 | 0.000121 | to.amzn
602 | 18402670 | 569 | 0.000048 | com.deloitte
603 | 18402662 | 1100 | 0.000028 | cz.google
604 | 18402526 | 4562 | 0.000008 | com.freehostia
605 | 18402334 | 2156 | 0.000016 | gov.faa
606 | 18402326 | 2724 | 0.000012 | com.detroitnews
607 | 18402220 | 2774 | 0.000012 | com.slidesharecdn
608 | 18402102 | 346 | 0.000075 | com.adnxs
609 | 18396726 | 812 | 0.000039 | com.thinkwithgoogle
610 | 18392816 | 1471 | 0.000021 | com.trustwave
611 | 18392376 | 2640 | 0.000013 | org.iea
612 | 18392262 | 2883 | 0.000012 | jp.blog
613 | 18391148 | 4426 | 0.000008 | com.goal
614 | 18390184 | 3284 | 0.000011 | com.financialpost
615 | 18389140 | 3636 | 0.000010 | net.alarabiya
616 | 18389082 | 3570 | 0.000010 | org.neocities
617 | 18388580 | 3784 | 0.000009 | co.ello
618 | 18388256 | 207 | 0.000126 | com.salesforce
619 | 18386478 | 3500 | 0.000010 | com.archdaily
620 | 18385984 | 4517 | 0.000008 | com.alamy
621 | 18385924 | 2297 | 0.000015 | gr.google
622 | 18385398 | 160 | 0.000168 | gov.privacyshield
623 | 18385020 | 2569 | 0.000013 | org.kqed
624 | 18383196 | 277 | 0.000093 | org.drupal
625 | 18382110 | 354 | 0.000074 | com.snapchat
626 | 18381496 | 2338 | 0.000015 | ro.google
627 | 18381392 | 3367 | 0.000011 | uk.ac.leeds
628 | 18381316 | 271 | 0.000094 | com.mapbox
629 | 18380144 | 3907 | 0.000009 | uk.gov.scotland
630 | 18379620 | 1946 | 0.000018 | hu.google
631 | 18378224 | 4399 | 0.000008 | co.aeon
632 | 18377446 | 374 | 0.000070 | com.cdninstagram
633 | 18376062 | 3545 | 0.000010 | gov.fec
634 | 18376022 | 3312 | 0.000011 | com.virgin
635 | 18375628 | 2219 | 0.000015 | ar.com.google
636 | 18375060 | 4128 | 0.000009 | cn.globaltimes
637 | 18374688 | 4333 | 0.000008 | com.corel
638 | 18374066 | 464 | 0.000059 | com.herokuapp
639 | 18373200 | 4062 | 0.000009 | jp.go.ndl
640 | 18373110 | 791 | 0.000040 | google.blog
641 | 18372316 | 2208 | 0.000016 | com.justia
642 | 18372216 | 2320 | 0.000015 | za.co.google
643 | 18370616 | 2216 | 0.000016 | ru.ria
644 | 18370232 | 3694 | 0.000010 | com.intensedebate
645 | 18369794 | 3793 | 0.000009 | com.visualcapitalist
646 | 18369094 | 2722 | 0.000012 | si.google
647 | 18368512 | 4182 | 0.000008 | com.rediff
648 | 18367604 | 3834 | 0.000009 | ca.uvic
649 | 18367236 | 2577 | 0.000013 | ru.rosminzdrav
650 | 18365918 | 439 | 0.000062 | com.nypost
651 | 18365880 | 4678 | 0.000008 | org.wikimapia
652 | 18365350 | 3439 | 0.000010 | com.nationalreview
653 | 18364962 | 2134 | 0.000016 | uk.org.asa
654 | 18364282 | 3850 | 0.000009 | tw.edu.ntu
655 | 18363974 | 598 | 0.000046 | com.samsung
656 | 18363190 | 2703 | 0.000012 | is.google
657 | 18362598 | 3869 | 0.000009 | com.podomatic
658 | 18361242 | 316 | 0.000082 | cn.bshare
659 | 18360424 | 3484 | 0.000010 | org.wri
660 | 18360028 | 4160 | 0.000009 | uk.co.spectator
661 | 18359858 | 1711 | 0.000019 | ly.cutt
662 | 18358316 | 4989 | 0.000007 | to.gplus
663 | 18358086 | 4908 | 0.000007 | com.atwebpages
664 | 18357826 | 177 | 0.000150 | com.tripadvisor
665 | 18357438 | 5003 | 0.000007 | org.scala-sbt
666 | 18356488 | 4276 | 0.000008 | ru.msu
667 | 18356450 | 1161 | 0.000027 | com.udemy
668 | 18355358 | 2973 | 0.000011 | com.timesofisrael
669 | 18352506 | 5213 | 0.000007 | edu.csulb
670 | 18351622 | 4744 | 0.000007 | com.authorstream
671 | 18350944 | 4127 | 0.000009 | gy.rb
672 | 18350110 | 3204 | 0.000011 | us.ny.state
673 | 18349876 | 3644 | 0.000010 | com.linuxquota
674 | 18349798 | 3563 | 0.000010 | com.udn
675 | 18349578 | 3845 | 0.000009 | org.jenkins-ci
676 | 18349508 | 1686 | 0.000019 | com.pcworld
677 | 18349104 | 2481 | 0.000014 | uk.ac.imperial
678 | 18348784 | 5238 | 0.000007 | com.etymonline
679 | 18348026 | 3492 | 0.000010 | eg.com.google
680 | 18347774 | 3363 | 0.000011 | uk.co.bbci
681 | 18347338 | 2386 | 0.000014 | com.name
682 | 18346938 | 3745 | 0.000009 | com.novell
683 | 18345924 | 1487 | 0.000021 | com.digitaloceanspaces
684 | 18345376 | 6040 | 0.000006 | net.vingle
685 | 18345350 | 2615 | 0.000013 | us.pa.state
686 | 18345040 | 642 | 0.000044 | com.xiti
687 | 18345006 | 2302 | 0.000015 | fr.pagesjaunes
688 | 18344246 | 4604 | 0.000008 | by.tut
689 | 18341982 | 78 | 0.000417 | com.messenger
690 | 18341502 | 1672 | 0.000019 | id.co.google
691 | 18341492 | 4012 | 0.000009 | com.donaldjtrump
692 | 18339724 | 2359 | 0.000014 | co.pcdn
693 | 18338674 | 606 | 0.000046 | com.indeed
694 | 18338446 | 459 | 0.000060 | com.sxsw
695 | 18337870 | 2379 | 0.000014 | sk.google
696 | 18337126 | 246 | 0.000105 | uk.co.amazon
697 | 18336826 | 351 | 0.000074 | com.atlassian
698 | 18336810 | 1225 | 0.000025 | com.dell
699 | 18336442 | 4947 | 0.000007 | fr.online
700 | 18336226 | 1933 | 0.000018 | com.law
701 | 18335648 | 3783 | 0.000009 | com.wmtransfer
702 | 18335422 | 2242 | 0.000015 | kr.co.google
703 | 18335402 | 4709 | 0.000008 | edu.odu
704 | 18335130 | 2971 | 0.000011 | cl.google
705 | 18335024 | 4300 | 0.000008 | il.ac.huji
706 | 18334782 | 4271 | 0.000008 | tw.gov.cdc
707 | 18333794 | 2886 | 0.000012 | my.com.google
708 | 18333014 | 3385 | 0.000011 | com.scotsman
709 | 18332864 | 3322 | 0.000011 | com.instructure
710 | 18332832 | 4563 | 0.000008 | com.hackaday
711 | 18332194 | 2131 | 0.000016 | gov.pa
712 | 18332054 | 627 | 0.000045 | com.withgoogle
713 | 18331108 | 1997 | 0.000017 | scot.gov
714 | 18330912 | 3178 | 0.000011 | com.broadwayworld
715 | 18330804 | 858 | 0.000036 | com.canva
716 | 18330694 | 4525 | 0.000008 | com.mongabay
717 | 18329802 | 4508 | 0.000008 | com.macobserver
718 | 18329686 | 3725 | 0.000010 | org.sonatype
719 | 18328118 | 2391 | 0.000014 | gov.wi
720 | 18327736 | 2683 | 0.000013 | org.usgbc
721 | 18327662 | 4113 | 0.000009 | gov.peacecorps
722 | 18327624 | 4652 | 0.000008 | cn.tianya
723 | 18326710 | 3495 | 0.000010 | pk.com.google
724 | 18326302 | 870 | 0.000036 | com.marketwatch
725 | 18326164 | 1490 | 0.000021 | com.billboard
726 | 18324976 | 107 | 0.000316 | net.gandi
727 | 18324878 | 2845 | 0.000012 | com.thecut
728 | 18324686 | 89 | 0.000372 | me.ogp
729 | 18323980 | 4585 | 0.000008 | io.meduza
730 | 18323898 | 2827 | 0.000012 | uk.org.nationaltrust
731 | 18323758 | 3911 | 0.000009 | au.edu.adelaide
732 | 18323398 | 4766 | 0.000007 | de.uni-erlangen
733 | 18322482 | 3759 | 0.000009 | uk.org.rspb
734 | 18322376 | 3773 | 0.000009 | cv.google
735 | 18321256 | 5135 | 0.000007 | cat.bcn
736 | 18319736 | 3728 | 0.000009 | com.ipage
737 | 18319726 | 5311 | 0.000007 | com.brother
738 | 18318148 | 2410 | 0.000014 | my.com.thestar
739 | 18317872 | 3401 | 0.000010 | uk.ac.york
740 | 18317504 | 3315 | 0.000011 | com.politifact
741 | 18317408 | 3128 | 0.000011 | ee.google
742 | 18317178 | 3326 | 0.000011 | org.thinkprogress
743 | 18317034 | 2102 | 0.000016 | se.haxx
744 | 18316764 | 4554 | 0.000008 | au.edu.rmit
745 | 18316272 | 2959 | 0.000011 | hr.google
746 | 18315296 | 5212 | 0.000007 | com.selfridges
747 | 18315244 | 3772 | 0.000009 | au.com.telstra
748 | 18313746 | 1436 | 0.000022 | com.fiverr
749 | 18313044 | 3420 | 0.000010 | de.hu-berlin
750 | 18311516 | 3572 | 0.000010 | com.nola
751 | 18311094 | 3458 | 0.000010 | sa.com.google
752 | 18310436 | 4145 | 0.000009 | ca.dal
753 | 18310126 | 6237 | 0.000006 | org.arkive
754 | 18309422 | 2759 | 0.000012 | bg.google
755 | 18308696 | 3429 | 0.000010 | com.monday
756 | 18308664 | 4635 | 0.000008 | at.tugraz
757 | 18308432 | 3508 | 0.000010 | com.eiseverywhere
758 | 18308298 | 3764 | 0.000009 | uk.co.cfdr
759 | 18308102 | 3298 | 0.000011 | org.iucn
760 | 18307444 | 3571 | 0.000010 | app.web
761 | 18306932 | 3702 | 0.000010 | org.iucnredlist
762 | 18306908 | 292 | 0.000088 | com.surveymonkey
763 | 18306390 | 3806 | 0.000009 | gi.com.google
764 | 18306038 | 5056 | 0.000007 | ec.com.google
765 | 18305962 | 3875 | 0.000009 | de.uni-freiburg
766 | 18305528 | 4244 | 0.000008 | au.com.heraldsun
767 | 18305222 | 515 | 0.000052 | io.shields
768 | 18304914 | 610 | 0.000046 | org.eff
769 | 18304878 | 3829 | 0.000009 | com.psmag
770 | 18304506 | 4721 | 0.000007 | ua.at
771 | 18302798 | 930 | 0.000034 | gov.uspto
772 | 18302648 | 190 | 0.000137 | com.automattic
773 | 18301286 | 3948 | 0.000009 | com.mozello
774 | 18300612 | 1108 | 0.000028 | com.gizmodo
775 | 18300418 | 3596 | 0.000010 | pl.wp
776 | 18300322 | 3471 | 0.000010 | org.royalsociety
777 | 18299622 | 2819 | 0.000012 | org.unep
778 | 18299452 | 3606 | 0.000010 | com.realclearpolitics
779 | 18298298 | 3531 | 0.000010 | jp.coocan
780 | 18298296 | 2613 | 0.000013 | vn.com.google
781 | 18298218 | 4434 | 0.000008 | jp.hatenablog
782 | 18297896 | 4281 | 0.000008 | com.waitrose
783 | 18297876 | 4676 | 0.000008 | info.webry
784 | 18297852 | 4427 | 0.000008 | net.inquirer
785 | 18297704 | 4274 | 0.000008 | jp.gree
786 | 18297178 | 4611 | 0.000008 | org.nationalinterest
787 | 18296330 | 2981 | 0.000011 | edu.uconn
788 | 18295610 | 946 | 0.000034 | edu.columbia
789 | 18295554 | 5531 | 0.000006 | org.mises
790 | 18295452 | 1274 | 0.000024 | com.smashingmagazine
791 | 18295224 | 3303 | 0.000011 | uk.gov.companieshouse
792 | 18294866 | 4442 | 0.000008 | gov.ourdocuments
793 | 18294666 | 3894 | 0.000009 | sl.com.google
794 | 18292912 | 6218 | 0.000006 | com.rhino3d
795 | 18292842 | 3435 | 0.000010 | org.cfr
796 | 18292780 | 790 | 0.000040 | com.airbnb
797 | 18292712 | 283 | 0.000092 | jp.co.amazon
798 | 18291570 | 413 | 0.000065 | com.pubmatic
799 | 18290920 | 878 | 0.000036 | com.box
800 | 18290426 | 5610 | 0.000006 | com.coroflot
801 | 18290346 | 4348 | 0.000008 | com.thediplomat
802 | 18286902 | 4066 | 0.000009 | com.inhabitat
803 | 18286668 | 3277 | 0.000011 | com.bp
804 | 18286522 | 4592 | 0.000008 | cat.uab
805 | 18283480 | 3827 | 0.000009 | uk.co.villiers-london
806 | 18283014 | 4140 | 0.000009 | org.grist
807 | 18282452 | 4016 | 0.000009 | com.foreignaffairs
808 | 18281324 | 1081 | 0.000028 | com.tapad
809 | 18280378 | 1347 | 0.000023 | org.altervista
810 | 18280358 | 382 | 0.000069 | com.skype
811 | 18280324 | 4349 | 0.000008 | com.worldsecuresystems
812 | 18279680 | 2409 | 0.000014 | com.volusion
813 | 18279516 | 2907 | 0.000012 | ru.nethouse
814 | 18279480 | 3527 | 0.000010 | pe.com.google
815 | 18279438 | 4779 | 0.000007 | be.lesoir
816 | 18278874 | 3288 | 0.000011 | co.com.google
817 | 18278816 | 3885 | 0.000009 | de.uni-koeln
818 | 18278778 | 2910 | 0.000012 | org.gnupg
819 | 18278022 | 4656 | 0.000008 | com.mihanblog
820 | 18277554 | 3360 | 0.000011 | org.panda
821 | 18277186 | 3440 | 0.000010 | lv.google
822 | 18276674 | 5300 | 0.000007 | lu.google
823 | 18276442 | 484 | 0.000055 | com.inc
824 | 18275676 | 5103 | 0.000007 | cn.com.caijing
825 | 18275134 | 3331 | 0.000011 | uk.gov.metoffice
826 | 18274258 | 68 | 0.000471 | com.oculus
827 | 18273732 | 2364 | 0.000014 | org.donorbox
828 | 18273312 | 3038 | 0.000011 | rs.google
829 | 18273256 | 1197 | 0.000026 | com.merriam-webster
830 | 18271448 | 5051 | 0.000007 | ee.ut
831 | 18271060 | 2519 | 0.000013 | com.amebaownd
832 | 18270922 | 4482 | 0.000008 | com.marksandspencer
833 | 18270780 | 6447 | 0.000006 | su.clan
834 | 18269948 | 4096 | 0.000009 | ru.interfax
835 | 18269620 | 3852 | 0.000009 | org.rferl
836 | 18268756 | 2904 | 0.000012 | gov.nd
837 | 18267994 | 548 | 0.000049 | com.fortune
838 | 18267776 | 4693 | 0.000008 | it.unitn
839 | 18267714 | 5665 | 0.000006 | am.google
840 | 18266762 | 3502 | 0.000010 | org.iaea
841 | 18263748 | 3893 | 0.000009 | pr.com.google
842 | 18262158 | 5045 | 0.000007 | com.tok2
843 | 18261938 | 1901 | 0.000018 | ch.ethz
844 | 18261922 | 3342 | 0.000011 | gov.la
845 | 18261182 | 4507 | 0.000008 | org.democracynow
846 | 18261176 | 2593 | 0.000013 | net.noscript
847 | 18260216 | 836 | 0.000037 | com.mix
848 | 18259862 | 408 | 0.000066 | net.adform
849 | 18259608 | 5208 | 0.000007 | tn.google
850 | 18257978 | 4212 | 0.000008 | jp.hateblo
851 | 18257888 | 6029 | 0.000006 | hk.edu.hkbu
852 | 18257680 | 3884 | 0.000009 | nl.wur
853 | 18257594 | 5009 | 0.000007 | gr.auth
854 | 18257406 | 997 | 0.000031 | com.webs
855 | 18256760 | 4512 | 0.000008 | com.mnn
856 | 18256702 | 5759 | 0.000006 | ru.nnov
857 | 18256238 | 3954 | 0.000009 | com.afp
858 | 18255744 | 1365 | 0.000023 | com.format
859 | 18255662 | 5209 | 0.000007 | nf.co
860 | 18253954 | 329 | 0.000079 | com.getbootstrap
861 | 18252988 | 4961 | 0.000007 | jp.hatenadiary
862 | 18252154 | 4728 | 0.000007 | hk.com.hkex
863 | 18251258 | 1193 | 0.000026 | com.redhat
864 | 18250974 | 5600 | 0.000006 | com.gust
865 | 18250088 | 1067 | 0.000029 | com.symantec
866 | 18249466 | 2562 | 0.000013 | net.ucoz
867 | 18249320 | 268 | 0.000095 | com.typeform
868 | 18248694 | 6327 | 0.000006 | com.x10host
869 | 18248332 | 3547 | 0.000010 | uk.co.saveourschools
870 | 18247898 | 2934 | 0.000012 | com.squarespace-cdn
871 | 18247292 | 2977 | 0.000011 | lt.google
872 | 18246872 | 525 | 0.000051 | com.adweek
873 | 18246844 | 4295 | 0.000008 | com.scienceblogs
874 | 18246472 | 4848 | 0.000007 | de.uni-konstanz
875 | 18245562 | 6362 | 0.000006 | com.ueuo
876 | 18245048 | 3856 | 0.000009 | uk.gov.data
877 | 18244756 | 4005 | 0.000009 | tr.com.hurriyet
878 | 18243652 | 3070 | 0.000011 | ae.google
879 | 18243570 | 1891 | 0.000019 | com.speakerdeck
880 | 18243330 | 5079 | 0.000007 | com.blogsky
881 | 18243134 | 2044 | 0.000017 | tv.ustream
882 | 18240374 | 6711 | 0.000006 | su.moy
883 | 18239298 | 761 | 0.000041 | gov.copyright
884 | 18239096 | 5292 | 0.000007 | ru.novayagazeta
885 | 18239044 | 2789 | 0.000012 | gov.nh
886 | 18238990 | 4057 | 0.000009 | org.hathitrust
887 | 18238948 | 3648 | 0.000010 | org.annualreviews
888 | 18238932 | 1154 | 0.000027 | pl.home
889 | 18238882 | 3815 | 0.000009 | com.businesscatalyst
890 | 18237740 | 472 | 0.000058 | com.ea
891 | 18237726 | 3087 | 0.000011 | uk.gov.hmrc
892 | 18236940 | 3930 | 0.000009 | cc.uxdesign
893 | 18236894 | 6015 | 0.000006 | com.artfire
894 | 18236704 | 366 | 0.000072 | org.opensource
895 | 18236530 | 3467 | 0.000010 | it.beniculturali
896 | 18236132 | 2507 | 0.000014 | gov.mn
897 | 18236076 | 1019 | 0.000030 | com.engadget
898 | 18235902 | 3682 | 0.000010 | ve.co.google
899 | 18235452 | 4973 | 0.000007 | com.teslamotors
900 | 18234038 | 7475 | 0.000005 | com.hangame
901 | 18233966 | 427 | 0.000063 | com.fastcompany
902 | 18233600 | 4263 | 0.000008 | com.hsbc
903 | 18233074 | 2462 | 0.000014 | com.netsolhost
904 | 18232582 | 5556 | 0.000006 | me.google
905 | 18232344 | 5643 | 0.000006 | mu.google
906 | 18231570 | 5529 | 0.000006 | com.yam
907 | 18231242 | 3969 | 0.000009 | tz.co.google
908 | 18230998 | 974 | 0.000032 | com.verisign
909 | 18230916 | 3364 | 0.000011 | tw.com.pchome
910 | 18230662 | 7293 | 0.000005 | com.addr
911 | 18230628 | 2636 | 0.000013 | com.shell
912 | 18230602 | 6599 | 0.000006 | com.dropmark
913 | 18229708 | 5635 | 0.000006 | li.google
914 | 18229116 | 5002 | 0.000007 | com.gab
915 | 18229106 | 4493 | 0.000008 | com.tapatalk
916 | 18228194 | 1325 | 0.000023 | edu.ucla
917 | 18227958 | 3557 | 0.000010 | uk.co.newmedianow
918 | 18227938 | 4988 | 0.000007 | edu.whoi
919 | 18227810 | 3738 | 0.000009 | ng.com.google
920 | 18227630 | 5444 | 0.000007 | ni.com.google
921 | 18226076 | 4110 | 0.000009 | uk.co.sainsburys
922 | 18225458 | 4412 | 0.000008 | com.iconarchive
923 | 18225080 | 5380 | 0.000007 | gr.ntua
924 | 18224924 | 6152 | 0.000006 | com.epochtimes
925 | 18224716 | 5198 | 0.000007 | org.birdlife
926 | 18224610 | 3532 | 0.000010 | uk.co.intersol
927 | 18224178 | 5615 | 0.000006 | id.co.kaskus
928 | 18223762 | 950 | 0.000034 | com.zoho
929 | 18223166 | 5403 | 0.000007 | cr.co.google
930 | 18223046 | 5695 | 0.000006 | sv.com.google
931 | 18222882 | 4074 | 0.000009 | vn.zing
932 | 18222714 | 4537 | 0.000008 | uk.co.zoopla
933 | 18222480 | 4039 | 0.000009 | uk.ac.jisc
934 | 18221034 | 3836 | 0.000009 | com.prweek
935 | 18220422 | 3098 | 0.000011 | int.wmo
936 | 18220410 | 5466 | 0.000006 | mz.co.google
937 | 18220202 | 4966 | 0.000007 | edu.umb
938 | 18220196 | 1290 | 0.000024 | uk.co.freeukbusinessdirectory
939 | 18220068 | 1476 | 0.000021 | org.owasp
940 | 18219726 | 6669 | 0.000006 | net.comunidades
941 | 18218976 | 4141 | 0.000009 | com.scotusblog
942 | 18218840 | 5636 | 0.000006 | com.cyberlink
943 | 18218738 | 3828 | 0.000009 | do.com.google
944 | 18218672 | 2966 | 0.000011 | io.termly
945 | 18218262 | 4735 | 0.000007 | com.fatcow
946 | 18218172 | 3851 | 0.000009 | mt.com.google
947 | 18218110 | 3589 | 0.000010 | uk.org.oxonaa
948 | 18217958 | 3774 | 0.000009 | gt.com.google
949 | 18216908 | 3737 | 0.000009 | com.solidworks
950 | 18216782 | 3641 | 0.000010 | uk.co.profilebusiness
951 | 18216270 | 3625 | 0.000010 | uk.co.heatall
952 | 18216034 | 4506 | 0.000008 | com.theringer
953 | 18215388 | 2558 | 0.000013 | nl.jouwweb
954 | 18215320 | 800 | 0.000039 | com.wikihow
955 | 18215060 | 5953 | 0.000006 | com.symbaloo
956 | 18214768 | 5171 | 0.000007 | pl.cba
957 | 18214162 | 5740 | 0.000006 | kg.google
958 | 18213594 | 2321 | 0.000015 | com.freeprivacypolicy
959 | 18212850 | 1222 | 0.000026 | com.att
960 | 18212680 | 5203 | 0.000007 | pl.lublin
961 | 18212672 | 1541 | 0.000020 | edu.umd
962 | 18212174 | 5485 | 0.000006 | uk.org.labour
963 | 18212074 | 4288 | 0.000008 | us.ms.state
964 | 18211828 | 3449 | 0.000010 | com.wantedly
965 | 18211570 | 4396 | 0.000008 | org.ametsoc
966 | 18211542 | 3701 | 0.000010 | uy.com.google
967 | 18211486 | 5553 | 0.000006 | jp.ifdef
968 | 18211438 | 5218 | 0.000007 | es.usal
969 | 18211398 | 769 | 0.000041 | com.netflix
970 | 18211196 | 6329 | 0.000006 | org.cgsociety
971 | 18210854 | 3897 | 0.000009 | hn.google
972 | 18210544 | 5602 | 0.000006 | org.svoboda
973 | 18207828 | 4432 | 0.000008 | org.ascd
974 | 18207784 | 4500 | 0.000008 | uk.co.dailystar
975 | 18207712 | 3651 | 0.000010 | uk.co.articlelistings
976 | 18207370 | 503 | 0.000054 | com.dmca
977 | 18207114 | 916 | 0.000035 | com.ggpht
978 | 18207032 | 5199 | 0.000007 | com.curseforge
979 | 18206432 | 5265 | 0.000007 | org.nsidc
980 | 18206340 | 1520 | 0.000021 | com.technologyreview
981 | 18205908 | 5668 | 0.000006 | ug.co.google
982 | 18205822 | 4030 | 0.000009 | org.lacity
983 | 18205348 | 4843 | 0.000007 | com.cbn
984 | 18204716 | 434 | 0.000063 | com.businesswire
985 | 18204712 | 5860 | 0.000006 | mn.google
986 | 18204394 | 6868 | 0.000005 | kr.ac.postech
987 | 18204332 | 5613 | 0.000006 | it.unige
988 | 18203526 | 3314 | 0.000011 | uk.gov.food
989 | 18203314 | 6353 | 0.000006 | com.skepticalscience
990 | 18203052 | 909 | 0.000035 | org.weforum
991 | 18202434 | 4907 | 0.000007 | com.globalpost
992 | 18202416 | 5172 | 0.000007 | com.weightwatchers
993 | 18202000 | 3403 | 0.000010 | com.lexology
994 | 18200738 | 5944 | 0.000006 | tt.google
995 | 18200210 | 5282 | 0.000007 | com.betfair
996 | 18199968 | 5428 | 0.000007 | py.com.google
997 | 18198928 | 4815 | 0.000007 | com.abcnews
998 | 18198698 | 763 | 0.000041 | com.psychologytoday
999 | 18198512 | 6974 | 0.000005 | org.toile-libre
1000 | 18198414 | 3291 | 0.000011 | net.vnexpress

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

January 2021 crawl archive now available

The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content. It includes page captures of 1.15 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-04/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

Data type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2021-04/segment.paths.gz | 100 |
WARC files | CC-MAIN-2021-04/warc.paths.gz | 79840 | 78.98
WAT files | CC-MAIN-2021-04/wat.paths.gz | 79840 | 22.92
WET files | CC-MAIN-2021-04/wet.paths.gz | 79840 | 10.04
Robots.txt files | CC-MAIN-2021-04/robotstxt.paths.gz | 79840 | 0.23
Non-200 responses files | CC-MAIN-2021-04/non200responses.paths.gz | 79840 | 2.11
URL index files | CC-MAIN-2021-04/cc-index.paths.gz | 302 | 0.26
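
As a minimal sketch of the path construction described above, the following Python snippet turns one line of a paths listing into its S3 and HTTP locations. The function name and the sample line are illustrative assumptions; real lines come from the decompressed *.paths.gz files.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def to_locations(path_line):
    """Return (s3_url, http_url) for one line of a *.paths.gz listing."""
    path = path_line.strip()  # drop the trailing newline
    return S3_PREFIX + path, HTTP_PREFIX + path

# Hypothetical listing line, for illustration only.
sample = "crawl-data/CC-MAIN-2021-04/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"
s3_url, http_url = to_locations(sample)
print(s3_url)
print(http_url)
```

The same helper applies to the WAT, WET, robots.txt, non-200 and index listings, since all of them hold paths relative to the bucket root.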

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-04/. The columnar index has also been updated to include this crawl.
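
For readers who want to query the URL index programmatically, here is a small sketch that only builds a query URL and performs no network access. It assumes the index server accepts queries at a "<crawl-id>-index" endpoint with "url" and "output" parameters, which is not stated in this announcement; check the index server's own documentation before relying on it.

```python
from urllib.parse import urlencode

INDEX_BASE = "https://index.commoncrawl.org/"

def index_query_url(crawl_id, url_pattern):
    """Build a URL index query for one crawl, requesting JSON output.

    The '<crawl-id>-index' endpoint and the parameter names are assumptions.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_BASE}{crawl_id}-index?{params}"

print(index_query_url("CC-MAIN-2021-04", "commoncrawl.org/*"))
```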

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November/December 2020 crawl archive now available

The crawl archive for November/December 2020 is now available! The data was crawled between November 23 and December 6 and contains 2.64 billion web pages or 270 TiB of uncompressed content. It includes page captures of 1.4 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The November/December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-50/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

Data type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-50/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-50/warc.paths.gz | 72000 | 59.95
WAT files | CC-MAIN-2020-50/wat.paths.gz | 72000 | 17.82
WET files | CC-MAIN-2020-50/wet.paths.gz | 72000 | 7.89
Robots.txt files | CC-MAIN-2020-50/robotstxt.paths.gz | 72000 | 0.2
Non-200 responses files | CC-MAIN-2020-50/non200responses.paths.gz | 72000 | 1.71
URL index files | CC-MAIN-2020-50/cc-index.paths.gz | 302 | 0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-50/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

October 2020 crawl archive now available

The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-45/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

Data type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-45/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-45/warc.paths.gz | 72000 | 63.79
WAT files | CC-MAIN-2020-45/wat.paths.gz | 72000 | 18.39
WET files | CC-MAIN-2020-45/wet.paths.gz | 72000 | 8.23
Robots.txt files | CC-MAIN-2020-45/robotstxt.paths.gz | 72000 | 0.2
Non-200 responses files | CC-MAIN-2020-45/non200responses.paths.gz | 72000 | 1.75
URL index files | CC-MAIN-2020-45/cc-index.paths.gz | 302 | 0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-45/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Interactive Webgraph Statistics Notebook Released

We are pleased to announce the release of an interactive Jupyter notebook that is used to provide:

  • Visualization of web graph statistics
  • An interface for interacting with the webgraph

The visualization of the web graph statistics leverages the WebGraph framework, which can gather many interesting data points of a web graph, such as the frequency distribution of indegrees and outdegrees, or the size distribution of the connected components. We then use pandas and matplotlib to visualize the data produced by WebGraph. This effort was largely inspired by the Topology of the 2012 WDC Hyperlink Graph document. Further details on installing and using the WebGraph tools, and on the data visualization, can be found in the cc-notebooks repository.
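
To make "frequency distribution of indegrees" concrete, here is a minimal plain-Python sketch on a toy edge list. It is not the notebook's code: the notebook obtains these statistics from WebGraph, and the function name and the sample edges are assumptions made for illustration.

```python
from collections import Counter

def indegree_distribution(edges):
    """Map each indegree value to the number of nodes that have it.

    Nodes with indegree 0 never appear as a target and are not counted.
    """
    indegree = Counter(to_id for _, to_id in edges)
    return Counter(indegree.values())

# Toy edge list of (from_id, to_id) pairs, for illustration only.
edges = [(0, 1), (2, 1), (3, 1), (0, 2), (1, 2), (0, 3)]
for degree, count in sorted(indegree_distribution(edges).items()):
    print(degree, count)
```

A distribution of this shape (degree versus number of nodes) is the kind of data that is typically drawn on log-log axes.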

The interface for interacting with the webgraph is built on pyWebGraph, a front end that connects Jython with WebGraph. Before using this interface, the string maps must be rebuilt to create a mapping from node ID (a numerical value) to domain name, and vice versa. Once the maps are in place, you can load the graph into pyWebGraph and traverse it interactively.
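
Conceptually, the string maps provide a two-way lookup between node IDs and names. The sketch below illustrates that idea with plain dictionaries. It is not how pyWebGraph or WebGraph store the maps, and it assumes tab-separated vertex lines of the form "id, reversed name, ..."; both the function name and the sample lines are hypothetical.

```python
def build_string_maps(vertex_lines):
    """Build id -> name and name -> id dictionaries from vertex lines."""
    id_to_name = {}
    name_to_id = {}
    for line in vertex_lines:
        fields = line.rstrip("\n").split("\t")
        node_id, name = int(fields[0]), fields[1]  # extra columns are ignored
        id_to_name[node_id] = name
        name_to_id[name] = node_id
    return id_to_name, name_to_id

# Toy vertex lines, for illustration only.
lines = ["0\tcom.example\t3\n", "1\torg.example\t1\n"]
id_to_name, name_to_id = build_string_maps(lines)
print(id_to_name[1], name_to_id["com.example"])
```

In-memory dictionaries would not scale comfortably to tens of millions of nodes, which is one reason to rely on the purpose-built maps described in the README.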

Further details on installing and using pyWebGraph, and on rebuilding the string maps, can be found in the interactive webgraph README of the cc-notebooks repository.

The Jupyter notebook is available on GitHub in the same repository. More details about how to navigate the repository can be found in the notebook itself, as well as in the README.

We hope that users will be able to use these notebooks to gain more insight into the web graph in a numerical and practical sense.

We are grateful to the authors of WebGraph for providing extremely useful tools for processing the web graph, and to Massimo Santini for developing pyWebGraph.

Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 539 million nodes and 3.02 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled, yet are pointed to from a link on a crawled page. There are 467 million dangling nodes (86.7%) and the largest strongly connected component contains 46 million (8.5%) nodes.

You can download the graph and the ranks of all 539 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/ as a prefix to access the files from anywhere.

Size | File | Description
3.32 GB | cc-main-2020-jul-aug-sep-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files
13.7 GB | cc-main-2020-jul-aug-sep-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files
5.95 GB | cc-main-2020-jul-aug-sep-host.graph | graph in BVGraph format
2 kB | cc-main-2020-jul-aug-sep-host.properties |
6.76 GB | cc-main-2020-jul-aug-sep-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2020-jul-aug-sep-host-t.properties |
1 kB | cc-main-2020-jul-aug-sep-host.stats | WebGraph statistics
7.77 GB | cc-main-2020-jul-aug-sep-host-ranks.txt.gz | harmonic centrality and pagerank
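
Given nodes ⟨id, rev host⟩ and edges ⟨from_id, to_id⟩, a dangling node is one that never appears as the source of an edge. The sketch below shows that computation on toy data. The function name and the data are assumptions for illustration, and at the scale of the real graph a framework such as WebGraph or Spark would be the practical choice.

```python
def dangling_nodes(node_ids, edges):
    """Return the nodes that never appear as the source of an edge."""
    sources = {from_id for from_id, _ in edges}
    return set(node_ids) - sources

# Toy graph with five nodes, for illustration only.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 0), (0, 2), (1, 3)]
dangling = dangling_nodes(nodes, edges)
share = 100.0 * len(dangling) / len(nodes)
print(sorted(dangling), f"{share:.1f}%")
```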

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
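
The reversal rule can be sketched in a few lines of Python. The function names are hypothetical, and the sketch covers only the two steps named above (stripping a leading www. and reversing the dot-separated labels); any further normalization done by the actual pipeline is not reproduced here.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

def unreverse_host(rev_host):
    """Turn a reversed name back into the usual order (without 'www.')."""
    return ".".join(reversed(rev_host.split(".")))

print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Reversed names sort hosts of the same domain next to each other, which makes prefix lookups in the sorted vertices files straightforward.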

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
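
The aggregation step can be sketched as follows. This is a simplified illustration, not the pipeline's code: the three-entry suffix set stands in for the full public suffix list, wildcard and exception rules are not handled, host names are in their usual (non-reversed) order, and dropping links within the same domain is an assumption of this sketch.

```python
# Tiny stand-in for the public suffix list (assumption, illustration only).
SAMPLE_SUFFIXES = {"com", "org", "co.uk"}

def pay_level_domain(host):
    """Return the public suffix plus one label, or None if not resolvable."""
    labels = host.split(".")
    for i in range(len(labels)):  # longest candidate suffix first
        if ".".join(labels[i:]) in SAMPLE_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None

def aggregate_edges(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s and d and s != d:  # assumption: intra-domain links are dropped
            domain_edges.add((s, d))
    return domain_edges

host_edges = [
    ("a.example.com", "news.bbc.co.uk"),
    ("b.example.com", "www.bbc.co.uk"),
    ("a.example.com", "b.example.com"),
]
print(aggregate_edges(host_edges))
```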

The domain-level graph has 89 million nodes and 1.71 billion edges. 45 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 35 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/.

Download files of the Common Crawl Jul/Aug/Sep 2020 domain-level webgraph

Size | File | Description
0.61 GB | cc-main-2020-jul-aug-sep-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.80 GB | cc-main-2020-jul-aug-sep-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.75 GB | cc-main-2020-jul-aug-sep-domain.graph | graph in BVGraph format
2 kB | cc-main-2020-jul-aug-sep-domain.properties |
3.69 GB | cc-main-2020-jul-aug-sep-domain-t.graph | transpose of the graph
2 kB | cc-main-2020-jul-aug-sep-domain-t.properties |
1 kB | cc-main-2020-jul-aug-sep-domain.stats | WebGraph statistics
1.91 GB | cc-main-2020-jul-aug-sep-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 89 million domain ranks is available for download.
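
When working with the downloaded ranks, each record can be parsed into typed fields. The sketch below assumes one tab-separated record per line with the columns in the same order as the top-1000 listing (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name); the exact layout of the ranks file is an assumption, so check its first lines before use. The sample values are modelled on the first row of the listing.

```python
from typing import NamedTuple

class RankRow(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str

def parse_rank_line(line):
    """Parse one tab-separated rank record; trailing columns are ignored."""
    f = line.rstrip("\n").split("\t")
    return RankRow(int(f[0]), float(f[1]), int(f[2]), float(f[3]), f[4])

# Sample record, column layout assumed as described above.
row = parse_rank_line("1\t32027928\t1\t0.018888\tcom.googleapis\n")
print(row.rev_name, row.hc_rank, row.pr_rank)
```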

Top 1000 domains ranked by harmonic centrality (Jul/Aug/Sep 2020)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 32027928 | 1 | 0.018888 | com.googleapis
2 | 30312944 | 3 | 0.012001 | com.facebook
3 | 29025948 | 2 | 0.013237 | com.google
4 | 26560472 | 4 | 0.007343 | org.w
5 | 26516534 | 5 | 0.007172 | com.twitter
6 | 26016464 | 6 | 0.006600 | com.youtube
7 | 24614190 | 9 | 0.004795 | com.instagram
8 | 24220712 | 8 | 0.005190 | org.gmpg
9 | 23572970 | 7 | 0.005599 | com.googletagmanager
10 | 23188190 | 11 | 0.003202 | com.linkedin
11 | 22457894 | 15 | 0.002590 | com.gravatar
12 | 22451350 | 10 | 0.003967 | com.cloudflare
13 | 22364152 | 14 | 0.002726 | com.gstatic
14 | 22350042 | 12 | 0.003105 | org.wordpress
15 | 21926906 | 22 | 0.001505 | com.pinterest
16 | 21699168 | 21 | 0.001752 | com.wordpress
17 | 21599006 | 26 | 0.001181 | org.wikipedia
18 | 21538264 | 16 | 0.002431 | com.bootstrapcdn
19 | 21497526 | 18 | 0.001836 | com.apple
20 | 21314410 | 30 | 0.001106 | com.vimeo
21 | 21248994 | 41 | 0.000830 | be.youtu
22 | 21186566 | 20 | 0.001794 | com.jquery
23 | 21081822 | 23 | 0.001444 | com.microsoft
24 | 21073240 | 45 | 0.000773 | com.blogspot
25 | 20994964 | 39 | 0.000952 | com.amazonaws
26 | 20975988 | 46 | 0.000732 | gl.goo
27 | 20971574 | 25 | 0.001384 | com.wp
28 | 20921220 | 47 | 0.000723 | com.amazon
29 | 20788608 | 72 | 0.000439 | com.tumblr
30 | 20716256 | 19 | 0.001804 | com.adobe
31 | 20694562 | 67 | 0.000535 | ly.bit
32 | 20675418 | 34 | 0.001018 | com.google-analytics
33 | 20627694 | 53 | 0.000673 | org.mozilla
34 | 20618998 | 17 | 0.001975 | com.github
35 | 20617620 | 31 | 0.001059 | net.cloudfront
36 | 20579928 | 71 | 0.000449 | com.yahoo
37 | 20571130 | 29 | 0.001127 | com.googlesyndication
38 | 20570586 | 60 | 0.000612 | eu.europa
39 | 20562028 | 52 | 0.000679 | com.flickr
40 | 20560188 | 42 | 0.000818 | net.jsdelivr
41 | 20526264 | 97 | 0.000347 | com.googleusercontent
42 | 20481758 | 62 | 0.000606 | co.t
43 | 20480218 | 109 | 0.000313 | com.reddit
44 | 20451670 | 24 | 0.001419 | com.fontawesome
45 | 20436180 | 83 | 0.000389 | com.weebly
46 | 20387228 | 56 | 0.000628 | com.paypal
47 | 20375802 | 40 | 0.000910 | com.macromedia
48 | 20372972 | 70 | 0.000450 | com.medium
49 | 20370180 | 43 | 0.000808 | com.addthis
50 | 20360678 | 28 | 0.001156 | ru.yandex
51 | 20338498 | 27 | 0.001156 | me.wp
52 | 20331252 | 64 | 0.000559 | org.w3
53 | 20326560 | 79 | 0.000411 | io.github
54 | 20292836 | 138 | 0.000223 | com.nytimes
55 | 20275824 | 76 | 0.000414 | org.creativecommons
56 | 20274244 | 59 | 0.000615 | org.schema
57 | 20255326 | 150 | 0.000192 | com.forbes
58 | 20246068 | 173 | 0.000151 | com.imgur
59 | 20227930 | 36 | 0.000979 | net.doubleclick
60 | 20219612 | 194 | 0.000133 | uk.co.bbc
61 | 20210924 | 114 | 0.000285 | com.soundcloud
62 | 20171070 | 66 | 0.000548 | com.vk
63 | 20155222 | 195 | 0.000133 | com.cnn
64 | 20142696 | 44 | 0.000803 | org.apache
65 | 20134806 | 63 | 0.000587 | com.whatsapp
66 | 20129582 | 314 | 0.000082 | edu.mit
67 | 20123032 | 180 | 0.000146 | com.imdb
68 | 20118310 | 208 | 0.000124 | net.slideshare
69 | 20116626 | 243 | 0.000101 | com.wsj
70 | 20115768 | 197 | 0.000128 | org.wikimedia
71 | 20089462 | 85 | 0.000388 | com.shopify
72 | 20082204 | 215 | 0.000120 | edu.stanford
73 | 20076684 | 154 | 0.000181 | gov.cdc
74 | 20075632 | 328 | 0.000079 | com.wired
75 | 20069724 | 268 | 0.000094 | com.techcrunch
76 | 20057066 | 255 | 0.000096 | edu.harvard
77 | 20051336 | 353 | 0.000076 | com.appspot
78 | 20051292 | 207 | 0.000124 | net.sourceforge
79 | 20051264 | 257 | 0.000096 | com.oracle
80 | 20051250 | 155 | 0.000177 | int.who
81 | 20050888 | 206 | 0.000124 | com.businessinsider
82 | 20046050 | 137 | 0.000227 | org.archive
83 | 20038198 | 230 | 0.000113 | com.washingtonpost
84 | 20035810 | 250 | 0.000097 | com.live
85 | 20029940 | 164 | 0.000163 | com.bing
86 | 20028210 | 549 | 0.000054 | com.livejournal
87 | 20027622 | 424 | 0.000069 | com.go
88 | 20024666 | 456 | 0.000066 | com.msn
89 | 20019992 | 407 | 0.000072 | uk.co.telegraph
90 | 20009306 | 170 | 0.000154 | com.theguardian
91 | 20002514 | 527 | 0.000056 | edu.cornell
92 | 19997146 | 199 | 0.000128 | org.ietf
93 | 19996714 | 486 | 0.000063 | gov.nasa
94 | 19995476 | 259 | 0.000096 | com.android
95 | 19986252 | 302 | 0.000084 | com.reuters
96 | 19983946 | 51 | 0.000702 | net.fbcdn
97 | 19974890 | 240 | 0.000102 | com.bloomberg
98 | 19966464 | 162 | 0.000164 | com.giphy
99 | 19960428 | 77 | 0.000414 | com.list-manage
100 | 19959046 | 520 | 0.000057 | com.googleblog
101 | 19956558 | 269 | 0.000093 | com.bbc
102 | 19955204 | 409 | 0.000071 | com.slack
103 | 19942056 | 143 | 0.000205 | com.spotify
104 | 19938828 | 591 | 0.000049 | com.zdnet
105 | 19936894 | 48 | 0.000721 | net.facebook
106 | 19935010 | 586 | 0.000050 | com.quora
107 | 19931072 | 126 | 0.000265 | com.ytimg
108 | 19922774 | 444 | 0.000067 | com.myspace
109 | 19922046 | 757 | 0.000038 | edu.umich
110 | 19920178 | 715 | 0.000040 | edu.upenn
111 | 19917482 | 151 | 0.000185 | gov.nih
112 | 19907886 | 344 | 0.000077 | com.usatoday
113 | 19903896 | 654 | 0.000045 | com.economist
114 | 19903722 | 313 | 0.000082 | com.cnbc
115 | 19902700 | 308 | 0.000083 | com.example
116 | 19896552 | 525 | 0.000056 | com.pixabay
117 | 19895014 | 418 | 0.000070 | net.researchgate
118 | 19882790 | 449 | 0.000066 | com.latimes
119 | 19881164 | 188 | 0.000138 | com.blogger
120 | 19870046 | 387 | 0.000075 | org.python
121 | 19864804 | 65 | 0.000555 | com.wix
122 | 19860760 | 433 | 0.000068 | com.githubusercontent
123 | 19858732 | 693 | 0.000042 | org.ieee
124 | 19854254 | 499 | 0.000061 | com.mashable
125 | 19850918 | 571 | 0.000052 | edu.berkeley
126 | 19847554 | 135 | 0.000241 | com.youtube-nocookie
127 | 19845130 | 160 | 0.000167 | com.issuu
128 | 19843068 | 218 | 0.000118 | org.acm
129 | 19839736 | 834 | 0.000036 | org.chromium
130 | 19839550 | 235 | 0.000106 | uk.co.google
131 | 19835790 | 551 | 0.000054 | org.arxiv
132 | 19833020 | 246 | 0.000099 | net.behance
133 | 19832682 | 291 | 0.000086 | org.npr
134 | 19831994 | 108 | 0.000320 | com.unpkg
135 | 19831136 | 884 | 0.000034 | com.arstechnica
136 | 19826840 | 213 | 0.000121 | com.unsplash
137 | 19822884 | 341 | 0.000078 | com.outlook
138 | 19822670 | 110 | 0.000303 | de.google
139 | 19812430 | 54 | 0.000654 | com.googleadservices
140 | 19810872 | 347 | 0.000077 | com.prnewswire
141 | 19806458 | 678 | 0.000043 | edu.columbia
142 | 19805382 | 171 | 0.000153 | me.t
143 | 19804886 | 297 | 0.000085 | com.dribbble
144 | 19804142 | 256 | 0.000096 | com.squarespace
145 | 19799032 | 139 | 0.000215 | gov.privacyshield
146 | 19798806 | 306 | 0.000083 | com.huffingtonpost
147 | 19797964 | 260 | 0.000096 | com.bandcamp
148 | 19795112 | 398 | 0.000074 | com.time
149 | 19793874 | 37 | 0.000975 | com.baidu
150 | 19792082 | 616 | 0.000048 | com.gitlab
151 | 19790406 | 334 | 0.000079 | com.nationalgeographic
152 | 19788214 | 443 | 0.000067 | com.nature
153 | 19785178 | 794 | 0.000037 | com.stackexchange
154 | 19782114 | 179 | 0.000147 | gle.forms
155 | 19781676 | 258 | 0.000096 | org.ampproject
156 | 19778534 | 548 | 0.000054 | com.fortune
157 | 19777902 | 813 | 0.000036 | com.git-scm
158 | 19776608 | 33 | 0.001030 | com.wixstatic
159 | 19774030 | 771 | 0.000038 | com.qz
160 | 19772390 | 281 | 0.000089 | com.wiley
161 | 19772268 | 646 | 0.000046 | au.net.abc
162 | 19770930 | 638 | 0.000046 | edu.yale
163 | 19769582 | 428 | 0.000068 | com.meetup
164 | 19767876 | 468 | 0.000064 | com.ted
165 | 19761386 | 1160 | 0.000026 | com.hatenablog
166 | 19759052 | 448 | 0.000066 | com.patreon
167 | 19757472 | 283 | 0.000089 | com.disqus
168 | 19756748 | 936 | 0.000032 | edu.ucla
169 | 19753998 | 147 | 0.000195 | com.dropbox
170 | 19753380 | 168 | 0.000158 | com.yelp
171 | 19750678 | 271 | 0.000093 | org.un
172 | 19746384 | 212 | 0.000122 | com.twimg
173 | 19743118 | 254 | 0.000096 | org.drupal
174 | 19741474 | 689 | 0.000042 | org.bitbucket
175 | 19736540 | 422 | 0.000069 | com.statista
176 | 19735440 | 903 | 0.000033 | uk.ac.cam
177 | 19731940 | 718 | 0.000040 | com.evernote
178 | 19731916 | 682 | 0.000043 | com.newyorker
179 | 19725638 | 603 | 0.000049 | com.buzzfeed
180 | 19719544 | 606 | 0.000049 | me.about
181 | 19718654 | 722 | 0.000040 | com.mysql
182 | 19716804 | 850 | 0.000035 | com.thenextweb
183 | 19715420 | 495 | 0.000061 | com.theatlantic
184 | 19710920 | 279 | 0.000091 | com.sciencedirect
185 | 19710826 | 403 | 0.000073 | com.getpocket
186 | 19705326 | 669 | 0.000043 | uk.co.blogspot
187 | 19702126 | 1293 | 0.000023 | com.tinypic
188 | 19696730 | 450 | 0.000066 | com.booking
189 | 19695652 | 514 | 0.000058 | com.xinhuanet
190 | 19694904 | 743 | 0.000039 | org.weforum
191 | 19694268 | 247 | 0.000098 | gov.ca
192 | 19692322 | 602 | 0.000049 | gov.loc
193 | 19690998 | 1282 | 0.000023 | org.postgresql
194 | 19689908 | 828 | 0.000036 | edu.princeton
195 | 19687954 | 239 | 0.000103 | uk.co.amazon
196 | 19685942 | 480 | 0.000063 | com.dailymotion
197 | 19679672 | 1452 | 0.000021 | ru.narod
198 | 19678926 | 189 | 0.000138 | com.xing
199 | 19675914 | 879 | 0.000034 | edu.jhu
200 | 19673670 | 500 | 0.000060 | gov.whitehouse
201 | 19671846 | 665 | 0.000044 | org.worldbank
202 | 19668706 | 1365 | 0.000022 | org.eclipse
203 | 19667770 | 400 | 0.000073 | com.springer
204 | 19667684 | 445 | 0.000067 | com.nypost
205 | 19665872 | 316 | 0.000081 | com.ft
206 | 19660930 | 61 | 0.000606 | com.fb
207 | 19658986 | 204 | 0.000125 | com.feedburner
208 | 19658394 | 826 | 0.000036 | org.cambridge
209 | 19654762 | 476 | 0.000063 | uk.co.dailymail
210 | 19654386 | 766 | 0.000038 | edu.washington
211 | 19654242 | 496 | 0.000061 | org.eff
212 | 19653044 | 32 | 0.001054 | com.qq
213 | 19650144 | 473 | 0.000064 | com.goodreads
214 | 19649524 | 264 | 0.000095 | org.doi
215 | 19649502 | 512 | 0.000058 | com.w3schools
216 | 19641242 | 1311 | 0.000023 | edu.virginia
217 | 19641212 | 440 | 0.000067 | com.googlecode
218 | 19638348 | 633 | 0.000047 | com.vice
219 | 19633128 | 506 | 0.000059 | com.force
220 | 19632976 | 723 | 0.000040 | com.trello
221 | 19632780 | 836 | 0.000035 | com.about
222 | 19630562 | 523 | 0.000056 | com.inc
223 | 19629482 | 453 | 0.000066 | com.scribd
224 | 19629368 | 2053 | 0.000016 | com.wikidot
225 | 19628436 | 619 | 0.000048 | org.semver
226 | 19614496 | 607 | 0.000049 | com.cbsnews
227 | 19607794 | 651 | 0.000045 | com.withgoogle
228 | 19605512 | 146 | 0.000196 | me.line
229 | 19603410 | 2089 | 0.000016 | com.googlesource
230 | 19601476 | 219 | 0.000118 | org.iana
231 | 19601452 | 546 | 0.000054 | gov.usda
232 | 19599800 | 309 | 0.000083 | com.tinyurl
233 | 19598290 | 1090 | 0.000027 | com.techradar
234 | 19597674 | 858 | 0.000035 | com.dropboxusercontent
235 | 19597446 | 384 | 0.000076 | com.ibm
236 | 19595200 | 1284 | 0.000023 | co.elastic
237 | 19594024 | 289 | 0.000087 | com.squareup
238 | 19593336 | 1434 | 0.000021 | org.linuxfoundation
239 | 19592388 | 1134 | 0.000026 | org.coursera
240 | 19589830 | 1027 | 0.000029 | gov.fbi
241 | 19588284 | 1158 | 0.000026 | edu.unc
242 | 19586008 | 705 | 0.000041 | com.vox
243 | 19583350 | 193 | 0.000134 | de.amazon
244 | 19583096 | 550 | 0.000054 | uk.co.independent
245 | 19580554 | 1423 | 0.000021 | ms.1drv
246 | 19578950 | 383 | 0.000076 | com.digg
247 | 19567612 | 1393 | 0.000022 | org.kernel
248 | 19563948 | 113 | 0.000287 | com.sharethis
249 | 19563468 | 751 | 0.000039 | org.d3js
250 | 19557490 | 801 | 0.000037 | gov.fcc
251 | 19557292 | 1026 | 0.000029 | com.hollywoodreporter
252 | 19556258 | 1369 | 0.000022 | com.howstuffworks
253 | 19553700 | 430 | 0.000068 | com.cnet
254 | 19552068 | 804 | 0.000037 | com.foxnews
255 | 19547134 | 152 | 0.000183 | com.addtoany
256 | 19547006 | 644 | 0.000046 | com.indiatimes
257 | 19546928 | 995 | 0.000029 | com.steamcommunity
258 | 19546864 | 1105 | 0.000026 | cn.com.chinadaily
259 | 19545628 | 584 | 0.000050 | com.psychologytoday
260 | 19544130 | 823 | 0.000036 | uk.co.guardian
261 | 19543920 | 1463 | 0.000021 | it.scoop
262 | 19543754 | 133 | 0.000247 | com.mailchimp
263 | 19542234 | 837 | 0.000035 | com.slate
264 | 19542214 | 153 | 0.000182 | com.opera
265 | 19538412 | 589 | 0.000050 | com.mckinsey
266 | 19536816 | 1020 | 0.000029 | com.sap
267 | 19536418 | 2605 | 0.000013 | org.wikiquote
268 | 19534334 | 307 | 0.000083 | com.bitly
269 | 19533308 | 627 | 0.000047 | com.mozilla
270 | 19533054 | 262 | 0.000095 | jp.ameblo
271 | 19531260 | 735 | 0.000039 | org.sciencemag
272 | 19528246 | 116 | 0.000284 | com.paypalobjects
273 | 19528108 | 2345 | 0.000014 | org.wikibooks
274 | 19527104 | 176 | 0.000151 | com.amazon-adsystem
275 | 19526948 | 688 | 0.000042 | gov.noaa
276 | 19524868 | 305 | 0.000083 | com.netdna-ssl
277 | 19524544 | 310 | 0.000083 | com.nbcnews
278 | 19523330 | 989 | 0.000030 | com.target
279 | 19522776 | 1523 | 0.000020 | com.instructables
280 | 19517526 | 975 | 0.000030 | edu.umn
281 | 19516530 | 965 | 0.000031 | com.merriam-webster
282 | 19516260 | 1431 | 0.000021 | hk.com.google
283 | 19514852 | 185 | 0.000140 | com.tripadvisor
284 | 19514608 | 2377 | 0.000014 | com.diigo
285 | 19503916 | 497 | 0.000061 | ca.google
286 | 19499262 | 236 | 0.000106 | com.wpengine
287 | 19499246 | 1029 | 0.000028 | com.sun
288 | 19496562 | 1189 | 0.000025 | com.digitaltrends
289 | 19496340 | 391 | 0.000075 | com.stumbleupon
290 | 19491846 | 115 | 0.000284 | com.weibo
291 | 19491638 | 1626 | 0.000019 | com.ign
292 | 19491210 | 1314 | 0.000023 | com.mercurynews
293 | 19490964 | 1352 | 0.000022 | de.zeit
294 | 19490636 | 229 | 0.000114 | com.etsy
295 | 19489106 | 797 | 0.000037 | uk.ac.ox
296 | 19487454 | 284 | 0.000089 | com.optimizely
297 | 19485106 | 73 | 0.000425 | net.akamaihd
298 | 19484368 | 1207 | 0.000025 | net.speedtest
299 | 19484284 | 1522 | 0.000020 | org.greenpeace
300 | 19483622 | 1553 | 0.000020 | net.seesaa
301 | 19479450 | 720 | 0.000040 | au.com.google
302 | 19478604 | 904 | 0.000033 | de.spiegel
303 | 19476336 | 1077 | 0.000027 | com.podbean
304 | 19475142 | 628 | 0.000047 | org.pbs
305 | 19474722 | 516 | 0.000058 | com.gofundme
306 | 19474484 | 416 | 0.000070 | com.kickstarter
307 | 19473590 | 1340 | 0.000022 | com.urbandictionary
308 | 19472422 | 472 | 0.000064 | org.pewresearch
309 | 19471320 | 519 | 0.000057 | com.bigcommerce
310 | 19467912 | 2137 | 0.000015 | de.bild
311 | 19467240 | 231 | 0.000112 | com.eepurl
312 | 19465300 | 515 | 0.000058 | com.theverge
313 | 19464792 | 273 | 0.000092 | com.stackoverflow
314 | 19464598 | 926 | 0.000032 | com.politico
315 | 19463036 | 811 | 0.000036 | co.ibb
316 | 19462394 | 332 | 0.000079 | it.google
317 | 19462162 | 2110 | 0.000016 | ly.visual
318 | 19461840 | 955 | 0.000031 | org.unicef
319 | 19460932 | 2020 | 0.000016 | org.tensorflow
320 | 19457592 | 1688 | 0.000018 | com.itv
321 | 19457150 | 1013 | 0.000029 | com.lifehacker
322 | 19456512 | 106 | 0.000334 | com.stripe
323 | 19456272 | 1349 | 0.000022 | edu.msu
324 | 19455412 | 312 | 0.000083 | net.windows
325 | 19453374 | 805 | 0.000037 | edu.academia
326 | 19450284 | 1391 | 0.000022 | com.storify
327 | 19449638 | 1257 | 0.000024 | com.crunchbase
328 | 19449386 | 595 | 0.000049 | com.tandfonline
329 | 19449132 | 1958 | 0.000017 | com.lego
330 | 19444682 | 1187 | 0.000025 | com.jetbrains
331 | 19443796 | 677 | 0.000043 | gov.senate
332 | 19443664 | 855 | 0.000035 | com.chicagotribune
333 | 19443234 | 2301 | 0.000014 | com.rottentomatoes
334 | 19440224 | 770 | 0.000038 | ca.cbc
335 | 19439934 | 205 | 0.000125 | com.eventbrite
336 | 19439496 | 1273 | 0.000023 | hk.hku
337 | 19436402 | 1035 | 0.000028 | edu.wisc
338 | 19436104 | 691 | 0.000042 | com.libsyn
339 | 19435742 | 1051 | 0.000028 | edu.northwestern
340 | 19433212 | 944 | 0.000031 | com.scientificamerican
341 | 19432798 | 1043 | 0.000028 | edu.uchicago
342 | 19431182 | 1288 | 0.000023 | uk.co.wired
343 | 19425546 | 190 | 0.000137 | jp.co.google
344 | 19424346 | 2002 | 0.000016 | org.maven
345 | 19423732 | 1030 | 0.000028 | com.mediafire
346 | 19423350 | 415 | 0.000070 | me.telegram
347 | 19418440 | 396 | 0.000074 | com.criteo
348 | 19417208 | 357 | 0.000076 | fr.google
349 | 19417038 | 664 | 0.000044 | us.icio
350 | 19416402 | 1477 | 0.000020 | com.deadline
351 | 19415808 | 640 | 0.000046 | com.sagepub
352 | 19414256 | 730 | 0.000039 | com.ecwid
353 | 19413466 | 1275 | 0.000023 | org.aclu
354 | 19413258 | 576 | 0.000051 | com.typepad
355 | 19412168 | 471 | 0.000064 | com.photobucket
356 | 19407294 | 533 | 0.000055 | com.oup
357 | 19407168 | 1199 | 0.000025 | com.reverbnation
358 | 19406968 | 1514 | 0.000020 | de.mpg
359 | 19405330 | 1389 | 0.000022 | edu.rutgers
360 | 19404790 | 1067 | 0.000027 | com.scmp
361 | 19403976 | 81 | 0.000392 | net.jsfiddle
362 | 19403692 | 421 | 0.000069 | com.calendly
363 | 19403618 | 844 | 0.000035 | com.sciencedaily
364 | 19403468 | 727 | 0.000039 | gov.justice
365 | 19400830 | 575 | 0.000051 | gov.hhs
366 | 19398258 | 919 | 0.000032 | com.theconversation
367 | 19397596 | 991 | 0.000030 | com.apnews
368 | 19397442 | 938 | 0.000032 | com.huffpost
369 | 19394934 | 1518 | 0.000020 | com.newscientist
370 | 19394656 | 608 | 0.000049 | org.openstreetmap
371 | 19393300 | 1287 | 0.000023 | com.aljazeera
372 | 19393230 | 216 | 0.000119 | com.hubspot
373 | 19390018 | 645 | 0.000046 | gov.house
374 | 19388118 | 2682 | 0.000012 | uk.co.timesonline
375 | 19388034 | 2564 | 0.000013 | com.space
376 | 19383910 | 700 | 0.000041 | com.pinimg
377 | 19383504 | 432 | 0.000068 | page.g
378 | 19381990 | 1241 | 0.000024 | com.sky
379 | 19381844 | 866 | 0.000035 | gov.congress
380 | 19381026 | 912 | 0.000033 | com.500px
381 | 19380632 | 1217 | 0.000024 | org.wiktionary
382 | 19380340 | 958 | 0.000031 | com.ssrn
383 | 19379742 | 1709 | 0.000018 | edu.bu
384 | 19377640 | 1757 | 0.000018 | gov.cia
385 | 19375740 | 214 | 0.000120 | org.bbb
386 | 19375634 | 1438 | 0.000021 | com.foxbusiness
387 | 19371814 | 624 | 0.000047 | ru.gov
388 | 19371056 | 1598 | 0.000019 | ca.mcgill
389 | 19367926 | 790 | 0.000037 | com.qualtrics
390 | 19366054 | 1290 | 0.000023 | org.semanticscholar
391 | 19365778 | 761 | 0.000038 | site.business
392 | 19365760 | 267 | 0.000094 | ru.ok
393 | 19363798 | 977 | 0.000030 | edu.si
394 | 19363758 | 887 | 0.000034 | br.com.google
395 | 19363688 | 847 | 0.000035 | co.g
396 | 19363204 | 1021 | 0.000029 | uk.co.thetimes
397 | 19362122 | 2663 | 0.000012 | com.discovermagazine
398 | 19359920 | 182 | 0.000142 | us.zoom
399 | 19359492 | 889 | 0.000034 | org.fao
400 | 19359352 | 683 | 0.000043 | org.change
401 | 19357866 | 1469 | 0.000020 | com.salon
402 | 19356650 | 228 | 0.000114 | com.aliyuncs
403 | 19356280 | 997 | 0.000029 | com.thehill
404 | 19354818 | 973 | 0.000030 | gov.usgs
405 | 19351584 | 298 | 0.000085 | com.ebay
406 | 19350988 | 1222 | 0.000024 | com.nikkei
407 | 19350142 | 338 | 0.000078 | com.rawgit
408 | 19349660 | 578 | 0.000051 | it.placehold
409 | 19348824 | 157 | 0.000173 | com.wixsite
410 | 19348122 | 1238 | 0.000024 | com.smithsonianmag
411 | 19346552 | 758 | 0.000038 | org.oecd
412 | 19346514 | 1088 | 0.000027 | ee.linktr
413 | 19345254 | 3312 | 0.000011 | com.openai
414 | 19342288 | 1048 | 0.000028 | uk.co.mirror
415 | 19341656 | 679 | 0.000043 | com.deviantart
416 | 19341332 | 1576 | 0.000019 | org.phys
417 | 19340598 | 413 | 0.000070 | tv.twitch
418 | 19340138 | 404 | 0.000072 | com.mapbox
419 | 19335246 | 1546 | 0.000020 | ca.sfu
420 | 19332464 | 2754 | 0.000012 | com.instapaper
421 | 19330656 | 244 | 0.000100 | org.gnu
422 | 19330504 | 2115 | 0.000016 | au.edu.unimelb
423 | 19328724 | 1044 | 0.000028 | int.coe
424 | 19328320 | 2078 | 0.000016 | org.nobelprize
425 | 19328286 | 667 | 0.000043 | pl.google
426 | 19327680 | 1333 | 0.000022 | com.irishtimes
427 | 19327578 | 293 | 0.000086 | com.office
428 | 19327536 | 1962 | 0.000017 | org.torproject
429 | 19324936 | 484 | 0.000063 | net.imgix
430 | 19324628 | 1281 | 0.000023 | uk.ac.ucl
431 | 19320926 | 1054 | 0.000028 | org.ohchr
432 | 19318772 | 1213 | 0.000025 | com.strikingly
433 | 19315502 | 509 | 0.000059 | org.hbr
434 | 19315040 | 1411 | 0.000021 | uk.co.metro
435 | 19314304 | 123 | 0.000270 | com.statcounter
436 | 19313468 | 972 | 0.000030 | gov.dhs
437 | 19313380 | 287 | 0.000088 | com.thedailybeast
438 | 19313234 | 1811 | 0.000017 | com.bankofamerica
439 | 19312534 | 1265 | 0.000024 | com.buzzsprout
440 | 19311940 | 863 | 0.000035 | gov.nps
441 | 19309868 | 2426 | 0.000014 | au.com.theage
442 | 19307472 | 933 | 0.000032 | com.aweber
443 | 19306766 | 1557 | 0.000020 | blog.home
444 | 19305448 | 848 | 0.000035 | gov.bls
445 | 19305296 | 490 | 0.000062 | edu.nyu
446 | 19304346 | 2087 | 0.000016 | com.oxforddictionaries
447 | 19304074 | 1162 | 0.000025 | gov.nyc
448 | 19303568 | 93 | 0.000356 | org.reactjs
449 | 19302778 | 1382 | 0.000022 | au.com.news
450 | 19300882 | 2291 | 0.000014 | sg.edu.nus
451 | 19299900 | 1429 | 0.000021 | com.flipboard
452 | 19299896 | 481 | 0.000063 | com.scorecardresearch
453 | 19298010 | 2517 | 0.000013 | com.dummies
454 | 19295840 | 2465 | 0.000013 | org.rsc
455 | 19295472 | 1010 | 0.000029 | com.britannica
456 | 19294984 | 714 | 0.000040 | gov.state
457 | 19294216 | 1700 | 0.000018 | org.gutenberg
458 | 19292892 | 3565 | 0.000010 | fm.ask
459 | 19290866 | 2970 | 0.000011 | com.pearltrees
460 | 19289990 | 793 | 0.000037 | com.zapier
461 | 19286494 | 2562 | 0.000013 | com.mystrikingly
462 | 19284092 | 876 | 0.000034 | com.cctv
463 | 19283500 | 816 | 0.000036 | com.healthline
464 | 19283044 | 1955 | 0.000017 | com.chrome
465 | 19282638 | 1484 | 0.000020 | com.rt
466 | 19282550 | 967 | 0.000031 | com.newsweek
467 | 19280538 | 2362 | 0.000014 | com.biography
468 | 19279646 | 1005 | 0.000029 | ch.google
469 | 19270504 | 1412 | 0.000021 | com.ifttt
470 | 19270238 | 1584 | 0.000019 | com.axios
471 | 19270042 | 466 | 0.000065 | es.google
472 | 19269658 | 882 | 0.000034 | au.gov.nsw
473 | 19267444 | 3483 | 0.000010 | hk.edu.cuhk
474 | 19267150 | 862 | 0.000035 | com.stitcher
475 | 19267000 | 2520 | 0.000013 | com.boredpanda
476 | 19265582 | 1192 | 0.000025 | fr.lemonde
477 | 19263992 | 554 | 0.000053 | com.steampowered
478 | 19263878 | 1055 | 0.000028 | org.jstor
479 | 19262150 | 1335 | 0.000022 | org.imf
480 | 19261918 | 873 | 0.000034 | com.venturebeat
481 | 19261196 | 825 | 0.000036 | org.poynter
482 | 19259574 | 1684 | 0.000018 | com.straitstimes
483 | 19259452 | 3390 | 0.000010 | com.chosun
484 | 19259322 | 1502 | 0.000020 | edu.asu
485 | 19258762 | 2351 | 0.000014 | io.gitlab
486 | 19256810 | 956 | 0.000031 | ru.google
487 | 19255996 | 952 | 0.000031 | sg.com.google
488 | 19253798 | 1331 | 0.000022 | uk.co.standard
489 | 19252906 | 612 | 0.000048 | de.gesetze-im-internet
490 | 19251516 | 948 | 0.000031 | gov.archives
491 | 19250270 | 2385 | 0.000014 | th.co.google
492 | 19249730 | 423 | 0.000069 | io.codepen
493 | 19248930 | 3033 | 0.000011 | com.nola
494 | 19248894 | 2023 | 0.000016 | edu.gmu
495 | 19245246 | 2836 | 0.000012 | app.netlify
496 | 19245158 | 1116 | 0.000026 | com.wikia
497 | 19242656 | 1353 | 0.000022 | com.history
498 | 19242160 | 1007 | 0.000029 | com.thelancet
499 | 19241830 | 2918 | 0.000011 | com.coca-colacompany
500 | 19240640 | 2654 | 0.000012 | google.ai
501 | 19240600 | 856 | 0.000035 | com.freepik
502 | 19240430 | 1548 | 0.000020 | com.buzzfeednews
503 | 19238648 | 2894 | 0.000012 | org.cato
504 | 19237700 | 431 | 0.000068 | net.datatables
505 | 19237456 | 501 | 0.000060 | com.rackcdn
506 | 19236168 | 1590 | 0.000019 | gov.supremecourt
507 | 19233302 | 2534 | 0.000013 | edu.byu
508 | 19233268 | 642 | 0.000046 | fr.amazon
509 | 19232920 | 2872 | 0.000012 | tw.blogspot
510 | 19231944 | 803 | 0.000037 | in.co.google
511 | 19231530 | 1977 | 0.000017 | org.edx
512 | 19231228 | 1309 | 0.000023 | com.tunein
513 | 19231156 | 1779 | 0.000018 | org.ocks
514 | 19230478 | 522 | 0.000057 | nl.google
515 | 19228370 | 555 | 0.000053 | com.gmail
516 | 19227068 | 2398 | 0.000014 | com.nationalpost
517 | 19226910 | 1867 | 0.000017 | edu.ucsb
518 | 19226418 | 2383 | 0.000014 | edu.nd
519 | 19226392 | 1372 | 0.000022 | com.dw
520 | 19226256 | 127 | 0.000262 | com.jimdo
521 | 19225860 | 2412 | 0.000014 | no.uio
522 | 19225400 | 1006 | 0.000029 | google.blog
523 | 19222398 | 1409 | 0.000021 | cn.cntv
524 | 19222164 | 3285 | 0.000011 | cn.org.china
525 | 19221136 | 1639 | 0.000019 | org.unwomen
526 | 19218950 | 946 | 0.000031 | com.airtable
527 | 19217788 | 2510 | 0.000013 | edu.uoregon
528 | 19215376 | 2172 | 0.000015 | org.britishcouncil
529 | 19214674 | 2668 | 0.000012 | org.icrc
530 | 19214462 | 951 | 0.000031 | com.gallup
531 | 19213378 | 2265 | 0.000015 | ru.kremlin
532 | 19212894 | 1332 | 0.000022 | com.globalsign
533 | 19210850 | 875 | 0.000034 | gov.uspto
534 | 19210492 | 959 | 0.000031 | edu.psu
535 | 19210022 | 1509 | 0.000020 | com.penguinrandomhouse
536 | 19209318 | 1345 | 0.000022 | com.netdna-cdn
537 | 19208686 | 3269 | 0.000011 | is.archive
538 | 19208344 | 1531 | 0.000020 | uk.ac.lse
539 | 19207952 | 2503 | 0.000013 | fi.helsinki
540 | 19207620 | 2042 | 0.000016 | edu.pitt
541 | 19207236 | 2170 | 0.000015 | net.openid
542 | 19206256 | 1155 | 0.000026 | edu.brookings
543 | 19205290 | 786 | 0.000037 | com.imageshack
544 | 19204770 | 172 | 0.000152 | com.npmjs
545 | 19204486 | 3290 | 0.000011 | de.diplo
546 | 19204380 | 1956 | 0.000017 | edu.unl
547 | 19203832 | 1544 | 0.000020 | edu.georgetown
548 | 19203210 | 2125 | 0.000015 | org.metmuseum
549 | 19202750 | 1240 | 0.000024 | org.nejm
550 | 19202244 | 726 | 0.000040 | com.adage
551 | 19200434 | 1990 | 0.000017 | com.channel4
552 | 19200290 | 1511 | 0.000020 | com.findlaw
553 | 19200030 | 2224 | 0.000015 | com.france24
554 | 19198938 | 282 | 0.000089 | net.php
555 | 19198698 | 1784 | 0.000017 | com.csmonitor
556 | 19197866 | 419 | 0.000069 | com.proofpoint
557 | 19195320 | 192 | 0.000135 | com.iubenda
558 | 19194372 | 1011 | 0.000029 | gov.treasury
559 | 19194028 | 1708 | 0.000018 | com.euronews
560 | 19191446 | 2286 | 0.000014 | com.thoughtco
561 | 19190136 | 3742 | 0.000009 | com.doodlekit
562 | 19189862 | 107 | 0.000320 | com.godaddy
563 | 19189334 | 1298 | 0.000023 | edu.duke
564 | 19188652 | 2071 | 0.000016 | com.foreignpolicy
565 | 19185118 | 1996 | 0.000017 | org.documentcloud
566 | 19183756 | 1300 | 0.000023 | com.livescience
567 | 19183706 | 2508 | 0.000013 | com.upi
568 | 19183104 | 2085 | 0.000016 | com.gq
569 | 19182260 | 178 | 0.000148 | com.zendesk
570 | 19182074 | 3020 | 0.000011 | com.authorstream
571 | 19182074 | 3915 | 0.000009 | com.mysanantonio
572 | 19181694 | 4133 | 0.000008 | tw.edu.sinica
573 | 19177894 | 2719 | 0.000012 | org.wikisource
574 | 19177382 | 2220 | 0.000015 | com.insider
575 | 19177180 | 851 | 0.000035 | gov.nist
576 | 19177000 | 1625 | 0.000019 | com.thestar
577 | 19176642 | 181 | 0.000145 | jp.co.yahoo
578 | 19174546 | 1304 | 0.000023 | au.com.smh
579 | 19174028 | 2025 | 0.000016 | org.ncsl
580 | 19173800 | 4252 | 0.000008 | hk.edu.cityu
581 | 19173744 | 3349 | 0.000010 | com.sina
582 | 19173108 | 2197 | 0.000015 | ie.independent
583 | 19172266 | 2156 | 0.000015 | edu.uky
584 | 19171704 | 96 | 0.000349 | me.ogp
585 | 19170936 | 3413 | 0.000010 | uk.ac.sussex
586 | 19170792 | 1755 | 0.000018 | gov.doc
587 | 19170704 | 131 | 0.000250 | org.networkadvertising
588 | 19169566 | 320 | 0.000080 | io.shields
589 | 19168058 | 649 | 0.000045 | gov.usa
590 | 19166990 | 4291 | 0.000008 | org.china-embassy
591 | 19166810 | 3137 | 0.000011 | com.udn
592 | 19163774 | 161 | 0.000166 | ru.mail
593 | 19163712 | 3474 | 0.000010 | com.worldatlas
594 | 19163522 | 505 | 0.000060 | com.netflix
595 | 19163254 | 857 | 0.000035 | com.thinkwithgoogle
596 | 19162356 | 1441 | 0.000021 | gov.defense
597 | 19161952 | 1318 | 0.000023 | tw.com.google
598 | 19160826 | 1604 | 0.000019 | org.hrw
599 | 19159812 | 1495 | 0.000020 | com.asahi
600 | 19159570 | 785 | 0.000037 | io.readthedocs
601 | 19158768 | 2688 | 0.000012 | org.freedomhouse
602 | 19158654 | 1413 | 0.000021 | tv.ustream
603 | 19157822 | 893 | 0.000034 | org.mediawiki
604 | 19156446 | 1715 | 0.000018 | org.pypi
605 | 19151800 | 3028 | 0.000011 | org.adb
606 | 19151406 | 2099 | 0.000016 | fr.leparisien
607 | 19151152 | 2615 | 0.000013 | com.abc7news
608 | 19150650 | 2063 | 0.000016 | com.voanews
609 | 19150048 | 1019 | 0.000029 | com.pcmag
610 | 19148698 | 447 | 0.000067 | org.nodejs
611 | 19148554 | 4288 | 0.000008 | com.theundefeated
612 | 19147816 | 3860 | 0.000009 | org.gephi
613 | 19147176 | 1327 | 0.000023 | org.undp
614 | 19146462 | 3277 | 0.000011 | org.iucnredlist
615 | 19146454 | 2583 | 0.000013 | com.sacbee
616 | 19146204 | 1594 | 0.000019 | com.treehugger
617 | 19145608 | 2292 | 0.000014 | no.google
618 | 19144462 | 2471 | 0.000013 | co.ello
619 | 19143354 | 1986 | 0.000017 | com.msnbc
620 | 19143354 | 252 | 0.000097 | com.myshopify
621 | 19142810 | 981 | 0.000030 | uk.parliament
622 | 19142520 | 2287 | 0.000014 | co.pcdn
623 | 19141942 | 1255 | 0.000024 | gov.uscourts
624 | 19141896 | 1422 | 0.000021 | co.lpages
625 | 19140780 | 2344 | 0.000014 | org.fas
626 | 19139768 | 781 | 0.000037 | com.intel
627 | 19138740 | 807 | 0.000036 | com.marketwatch
628 | 19136914 | 2047 | 0.000016 | com.infogram
629 | 19133848 | 2538 | 0.000013 | com.sputniknews
630 | 19133704 | 2430 | 0.000014 | ie.google
631 | 19132582 | 1344 | 0.000022 | se.google
632 | 19131798 | 990 | 0.000030 | com.netlify
633 | 19131000 | 925 | 0.000032 | com.jekyllrb
634 | 19130612 | 3055 | 0.000011 | int.interpol
635 | 19130308 | 524 | 0.000056 | fr.free
636 | 19130180 | 1198 | 0.000025 | be.google
637 | 19129750 | 1575 | 0.000019 | uk.co.huffingtonpost
638 | 19129310 | 2323 | 0.000014 | ly.rebrand
639 | 19129104 | 1504 | 0.000020 | link.page
640 | 19128704 | 1794 | 0.000017 | com.sched
641 | 19127724 | 2218 | 0.000015 | jp.co.japantimes
642 | 19127254 | 2829 | 0.000012 | org.tigris
643 | 19127152 | 2839 | 0.000012 | org.pri
644 | 19127006 | 2319 | 0.000014 | nz.co.nzherald
645 | 19125622 | 1204 | 0.000025 | at.google
646 | 19125464 | 5292 | 0.000007 | org.arkive
647 | 19125326 | 222 | 0.000116 | com.salesforce
648 | 19123296 | 650 | 0.000045 | br.com.uol
649 | 19121018 | 4242 | 0.000008 | kr.co.kbs
650 | 19119374 | 1665 | 0.000018 | com.thebalance
651 | 19119126 | 1455 | 0.000021 | org.oxfordjournals
652 | 19118638 | 3738 | 0.000009 | com.encyclopedia
653 | 19117262 | 2204 | 0.000015 | org.eji
654 | 19116506 | 2818 | 0.000012 | org.heritage
655 | 19116298 | 2371 | 0.000014 | com.popsci
656 | 19114518 | 2199 | 0.000015 | com.snopes
657 | 19114098 | 2601 | 0.000013 | org.oas
658 | 19113348 | 156 | 0.000174 | com.aspnetcdn
659 | 19112712 | 1031 | 0.000028 | org.ilo
660 | 19109654 | 2263 | 0.000015 | com.insidehighered
661 | 19108980 | 1587 | 0.000019 | gov.usembassy
662 | 19108932 | 1622 | 0.000019 | dk.google
663 | 19108040 | 3392 | 0.000010 | org.jenkins-ci
664 | 19107388 | 2827 | 0.000012 | org.project-syndicate
665 | 19106556 | 1963 | 0.000017 | com.justia
666 | 19104120 | 1563 | 0.000019 | gov.govinfo
667 | 19103152 | 1699 | 0.000018 | com.firebaseapp
668 | 19102068 | 2093 | 0.000016 | edu.uga
669 | 19102028 | 3678 | 0.000010 | edu.wm
670 | 19101614 | 3284 | 0.000011 | com.cgtn
671 | 19101596 | 1881 | 0.000017 | org.worldcat
672 | 19101226 | 900 | 0.000033 | com.zoho
673 | 19100590 | 392 | 0.000074 | com.atlassian
674 | 19100290 | 2676 | 0.000012 | org.transparency
675 | 19099776 | 1317 | 0.000023 | org.aarp
676 | 19099686 | 1675 | 0.000018 | org.americanbar
677 | 19099164 | 2239 | 0.000015 | com.timeshighereducation
678 | 19097964 | 3270 | 0.000011 | com.pastemagazine
679 | 19095902 | 2598 | 0.000013 | org.csis
680 | 19094342 | 629 | 0.000047 | com.samsung
681 | 19094058 | 774 | 0.000038 | com.pexels
682 | 19093374 | 1964 | 0.000017 | com.washingtontimes
683 | 19092714 | 2016 | 0.000016 | gov.usaid
684 | 19090166 | 1334 | 0.000022 | org.heart
685 | 19088764 | 191 | 0.000136 | com.automattic
686 | 19088428 | 865 | 0.000035 | com.verisign
687 | 19087660 | 2108 | 0.000016 | com.motherjones
688 | 19087034 | 2944 | 0.000011 | org.vim
689 | 19086498 | 2062 | 0.000016 | edu.nap
690 | 19086172 | 924 | 0.000032 | com.webs
691 | 19084778 | 1593 | 0.000019 | org.amnesty
692 | 19084344 | 2101 | 0.000016 | ua.com.google
693 | 19083552 | 3988 | 0.000009 | org.globalnetworkinitiative
694 | 19083196 | 2546 | 0.000013 | org.globalcitizen
695 | 19082500 | 1754 | 0.000018 | com.surveygizmo
696 | 19082058 | 2262 | 0.000015 | org.wbur
697 | 19081048 | 2353 | 0.000014 | uk.gov.companieshouse
698 | 19080398 | 2468 | 0.000013 | jp.mainichi
699 | 19080286 | 3181 | 0.000011 | com.podomatic
700 | 19078116 | 1751 | 0.000018 | org.unhcr
701 | 19076276 | 2118 | 0.000016 | ca.ctvnews
702 | 19075310 | 2565 | 0.000013 | uk.co.bbci
703 | 19073812 | 968 | 0.000031 | uk.gov.legislation
704 | 19071522 | 2681 | 0.000012 | com.nationalreview
705 | 19070832 | 2523 | 0.000013 | com.cleveland
706 | 19070474 | 3814 | 0.000009 | org.neocities
707 | 19069884 | 1073 | 0.000027 | ly.snip
708 | 19068864 | 438 | 0.000067 | com.herokuapp
709 | 19068510 | 656 | 0.000045 | com.oreilly
710 | 19066730 | 1154 | 0.000026 | cz.google
711 | 19066464 | 2164 | 0.000015 | org.nrdc
712 | 19065768 | 2671 | 0.000012 | org.thinkprogress
713 | 19065654 | 1795 | 0.000017 | ca.globalnews
714 | 19065106 | 270 | 0.000093 | jp.co.amazon
715 | 19062840 | 1328 | 0.000023 | org.altervista
716 | 19061732 | 3119 | 0.000011 | uk.ac.nottingham
717 | 19061168 | 1267 | 0.000024 | uk.gov.nationalarchives
718 | 19060934 | 2106 | 0.000016 | au.edu.anu
719 | 19060236 | 3035 | 0.000011 | com.intensedebate
720 | 19060102 | 2734 | 0.000012 | de.hu-berlin
721 | 19059802 | 736 | 0.000039 | com.airbnb
722 | 19059800 | 2326 | 0.000014 | de.auswaertiges-amt
723 | 19059376 | 2316 | 0.000014 | nz.co.google
724 | 19059170 | 2672 | 0.000012 | org.unenvironment
725 | 19058978 | 3132 | 0.000011 | org.rsf
726 | 19057932 | 4110 | 0.000008 | com.koreaherald
727 | 19057778 | 1960 | 0.000017 | org.pewtrusts
728 | 19057678 | 2867 | 0.000012 | com.techinasia
729 | 19057488 | 2276 | 0.000014 | com.thecut
730 | 19056174 | 3700 | 0.000009 | com.viki
731 | 19056068 | 2724 | 0.000012 | org.gnupg
732 | 19054590 | 2469 | 0.000013 | ro.google
733 | 19054394 | 2057 | 0.000016 | edu.gwu
734 | 19054116 | 3057 | 0.000011 | com.bangkokpost
735 | 19053626 | 2572 | 0.000013 | fr.rfi
736 | 19052868 | 414 | 0.000070 | com.pubmatic
737 | 19051906 | 2309 | 0.000014 | com.tutsplus
738 | 19051648 | 1079 | 0.000027 | tr.com.google
739 | 19051516 | 248 | 0.000098 | com.getbootstrap
740 | 19050908 | 4424 | 0.000008 | com.wonderhowto
741 | 19050626 | 3619 | 0.000010 | com.upworthy
742 | 19050496 | 2883 | 0.000012 | org.sonatype
743 | 19050382 | 288 | 0.000087 | com.typeform
744 | 19049574 | 2806 | 0.000012 | il.co.google
745 | 19049384 | 2739 | 0.000012 | uk.ac.leeds
746 | 19048116 | 201 | 0.000127 | to.amzn
747 | 19047986 | 2703 | 0.000012 | vn.com.google
748 | 19047578 | 274 | 0.000092 | com.surveymonkey
749 | 19047380 | 922 | 0.000032 | int.wipo
750 | 19046288 | 1057 | 0.000028 | com.gizmodo
751 | 19046144 | 874 | 0.000034 | com.box
752 | 19045578 | 2298 | 0.000014 | com.oregonlive
753 | 19044916 | 547 | 0.000054 | gg.discord
754 | 19044444 | 3356 | 0.000010 | com.theepochtimes
755 | 19044400 | 2480 | 0.000013 | ar.com.google
756 | 19044144 | 2943 | 0.000011 | bg.google
757 | 19043632 | 2061 | 0.000016 | com.squarespace-cdn
758 | 19043400 | 3479 | 0.000010 | io.soup
759 | 19042778 | 2545 | 0.000013 | com.webbyawards
760 | 19042384 | 2744 | 0.000012 | io.fabric
761 | 19042298 | 1588 | 0.000019 | com.speakerdeck
762 | 19041684 | 136 | 0.000232 | info.aboutads
763 | 19040606 | 907 | 0.000033 | com.docker
764 | 19038814 | 1817 | 0.000017 | com.miamiherald
765 | 19037924 | 3191 | 0.000011 | ph.com.google
766 | 19037762 | 2463 | 0.000013 | com.channelnewsasia
767 | 19037556 | 3198 | 0.000011 | uk.co.vogue
768 | 19037554 | 2619 | 0.000013 | edu.fsu
769 | 19035870 | 485 | 0.000063 | com.staticflickr
770 | 19035284 | 2495 | 0.000013 | za.co.google
771 | 19033678 | 2696 | 0.000012 | com.thejakartapost
772 | 19032442 | 1236 | 0.000024 | edu.ucsd
773 | 19032258 | 487 | 0.000062 | com.fc2
774 | 19032038 | 5415 | 0.000007 | com.armorgames
775 | 19031944 | 2155 | 0.000015 | fi.google
776 | 19031234 | 3885 | 0.000009 | com.alamy
777 | 19030868 | 2221 | 0.000015 | id.co.google
778 | 19030462 | 2794 | 0.000012 | com.rd
779 | 19029712 | 2951 | 0.000011 | com.cartodb
780 | 19029584 | 2092 | 0.000016 | com.newrepublic
781 | 19029348 | 3436 | 0.000010 | com.benzinga
782 | 19028364 | 661 | 0.000044 | com.entrepreneur
783 | 19027960 | 5376 | 0.000007 | org.gwtproject
784 | 19026660 | 2988 | 0.000011 | com.sciencealert
785 | 19026538 | 2763 | 0.000012 | org.iaea
786 | 19026402 | 2376 | 0.000014 | com.thenation
787 | 19023692 | 3411 | 0.000010 | si.google
788 | 19023046 | 2400 | 0.000014 | pt.google
789 | 19020124 | 2965 | 0.000011 | au.gov.nla
790 | 19019838 | 3513 | 0.000010 | com.dailykos
791 | 19019756 | 494 | 0.000061 | com.aol
792 | 19019128 | 2519 | 0.000013 | edu.emory
793 | 19019012 | 3573 | 0.000010 | com.inhabitat
794 | 19018956 | 3415 | 0.000010 | uk.ac.soas
795 | 19018402 | 666 | 0.000044 | com.deloitte
796 | 19018230 | 1185 | 0.000025 | com.today
797 | 19016838 | 978 | 0.000030 | com.windowsphone
798 | 19016186 | 3659 | 0.000010 | org.cpj
799 | 19016164 | 2119 | 0.000016 | kr.co.google
800 | 19015906 | 2981 | 0.000011 | se.lu
801 | 19015780 | 2774 | 0.000012 | org.cfr
802 | 19014856 | 429 | 0.000068 | me.fb
803 | 19013678 | 3288 | 0.000011 | com.joins
804 | 19012980 | 4264 | 0.000008 | sa.com.google
805 | 19012878 | 2814 | 0.000012 | com.politifact
806 | 19012292 | 964 | 0.000031 | com.alexa
807 | 19011442 | 4131 | 0.000008 | edu.utm
808 | 19011068 | 2735 | 0.000012 | com.law360
809 | 19010546 | 983 | 0.000030 | com.engadget
810 | 19008662 | 3583 | 0.000010 | hr.google
811 | 19008538 | 2146 | 0.000015 | hu.google
812 | 19006860 | 631 | 0.000047 | fm.last
813 | 19006540 | 2476 | 0.000013 | eu.politico
814 | 19006248 | 4047 | 0.000009 | com.chinatimes
815 | 19006116 | 2521 | 0.000013 | mx.com.google
816 | 19006060 | 3141 | 0.000011 | com.jezebel
817 | 19005942 | 3868 | 0.000009 | com.iconarchive
818 | 19005318 | 3471 | 0.000010 | com.ogilvy
819 | 19004866 | 2399 | 0.000014 | gr.google
820 | 19004086 | 2816 | 0.000012 | com.monday
821 | 19003252 | 2738 | 0.000012 | com.digitaljournal
822 | 19003248 | 3149 | 0.000011 | com.nyt
823 | 19003220 | 3300 | 0.000011 | audio.breaker
824 | 19002640 | 2823 | 0.000012 | uk.co.guim
825 | 19002384 | 625 | 0.000047 | com.cisco
826 | 19002038 | 3391 | 0.000010 | cn.globaltimes
827 | 19001808 | 2648 | 0.000012 | com.instructure
828 | 19000646 | 3321 | 0.000011 | com.crashlytics
829 | 18999720 | 2723 | 0.000012 | au.com.businessinsider
830 | 18999338 | 3430 | 0.000010 | org.grist
831 | 18998280 | 1209 | 0.000025 | com.pastebin
832 | 18998118 | 315 | 0.000082 | ai.shortpixel
833 | 18998078 | 3990 | 0.000009 | org.constitutioncenter
834 | 18997960 | 4842 | 0.000007 | jp.hatenadiary
835 | 18996780 | 3770 | 0.000009 | edu.ttu
836 | 18996076 | 2997 | 0.000011 | uk.ac.york
837 | 18995936 | 1671 | 0.000018 | com.eater
838 | 18995084 | 90 | 0.000364 | com.livestream
839 | 18995036 | 2772 | 0.000012 | com.bepress
840 | 18994752 | 2898 | 0.000012 | org.wri
841 | 18992262 | 2043 | 0.000016 | my.com.thestar
842 | 18991122 | 3775 | 0.000009 | com.minds
843 | 18990592 | 2352 | 0.000014 | mp.j
844 | 18990570 | 3708 | 0.000009 | app.web
845 | 18990062 | 3410 | 0.000010 | org.carnegieendowment
846 | 18989786 | 3645 | 0.000010 | tr.com.aa
847 | 18989418 | 711 | 0.000041 | gov.sec
848 | 18987746 | 3812 | 0.000009 | com.hyperallergic
849 | 18987282 | 3408 | 0.000010 | com.foreignaffairs
850 | 18986640 | 3797 | 0.000009 | au.edu.uts
851 | 18985392 | 470 | 0.000064 | com.fastcompany
852 | 18985032 | 3560 | 0.000010 | org.hypotheses
853 | 18984468 | 3896 | 0.000009 | com.japantoday
854 | 18982752 | 3507 | 0.000010 | edu.wayne
855 | 18982048 | 3713 | 0.000009 | uk.ac.kent
856 | 18981988 | 3697 | 0.000009 | rs.google
857 | 18980532 | 4071 | 0.000009 | org.sourcewatch
858 | 18979366 | 832 | 0.000036 | com.symantec
859 | 18978424 | 2539 | 0.000013 | fr.paris
860 | 18977996 | 2942 | 0.000011 | com.prweek
861 | 18977902 | 1765 | 0.000018 | ch.ipcc
862 | 18976960 | 2217 | 0.000015 | com.kinstacdn
863 | 18976262 | 1046 | 0.000028 | edu.cmu
864 | 18975462 | 2039 | 0.000016 | int.unfccc
865 | 18975062 | 4196 | 0.000008 | eg.com.google
866 | 18974804 | 3180 | 0.000011 | org.nationalgeographic
867 | 18974548 | 2643 | 0.000013 | gov.doi
868 | 18973940 | 3406 | 0.000010 | de.uni-frankfurt
869 | 18973494 | 4243 | 0.000008 | by.google
870 | 18972022 | 5050 | 0.000007 | com.symbaloo
871 | 18971010 | 3417 | 0.000010 | nl.wur
872 | 18969950 | 2328 | 0.000014 | org.unodc
873 | 18968430 | 1599 | 0.000019 | com.routledge
874 | 18968412 | 4509 | 0.000008 | com.ipsos-mori
875 | 18966962 | 3658 | 0.000010 | ae.google
876 | 18966152 | 4482 | 0.000008 | com.etymonline
877 | 18965888 | 4982 | 0.000007 | build.bazel
878 | 18965566 | 3320 | 0.000011 | org.brainpickings
879 | 18964544 | 3143 | 0.000011 | com.scotsman
880 | 18963796 | 4295 | 0.000008 | com.oilprice
881 | 18963380 | 3597 | 0.000010 | uk.ac.westminster
882 | 18963266 | 4545 | 0.000008 | lk.google
883 | 18962576 | 1260 | 0.000024 | fr.blogspot
884 | 18961360 | 3412 | 0.000010 | org.rferl
885 | 18961310 | 3173 | 0.000011 | org.epi
886 | 18959900 | 4115 | 0.000008 | lv.google
887 | 18959812 | 3909 | 0.000009 | au.edu.griffith
888 | 18959422 | 4219 | 0.000008 | kr.ac.snu
889 | 18957280 | 1312 | 0.000023 | com.upwork
890 | 18957076 | 2436 | 0.000014 | com.html5rocks
891 | 18956714 | 5493 | 0.000007 | me.nimbusweb
892 | 18956502 | 2940 | 0.000011 | fr.archives-ouvertes
893 | 18956398 | 4293 | 0.000008 | com.delawareonline
894 | 18955462 | 1792 | 0.000017 | ru.rbc
895 | 18954968 | 745 | 0.000039 | com.gartner
896 | 18954930 | 1127 | 0.000026 | edu.utexas
897 | 18953642 | 2526 | 0.000013 | net.noscript
898 | 18953466 | 2717 | 0.000012 | ae.thenational
899 | 18953336 | 3380 | 0.000010 | com.study
900 | 18953092 | 427 | 0.000068 | com.hp
901 | 18953074 | 3641 | 0.000010 | uk.co.spectator
902 | 18952762 | 3869 | 0.000009 | com.cleantechnica
903 | 18952208 | 2803 | 0.000012 | org.unctad
904 | 18951200 | 4255 | 0.000008 | com.teslamotors
905 | 18950118 | 1614 | 0.000019 | com.billboard
906 | 18949366 | 3074 | 0.000011 | com.theculturetrip
907 | 18947896 | 2454 | 0.000013 | com.multiscreensite
908 | 18947738 | 704 | 0.000041 | com.visualstudio
909 | 18947588 | 3985 | 0.000009 | uk.ac.plymouth
910 | 18947454 | 2660 | 0.000012 | sk.google
911 | 18947312 | 3811 | 0.000009 | net.aljazeera
912 | 18947110 | 2413 | 0.000014 | com.theintercept
913 | 18946556 | 3421 | 0.000010 | uk.ac.exeter
914 | 18946494 | 3332 | 0.000010 | social.mastodon
915 | 18945876 | 2828 | 0.000012 | com.euractiv
916 | 18945864 | 3635 | 0.000010 | com.db
917 | 18942736 | 4447 | 0.000008 | org.mises
918 | 18942316 | 4680 | 0.000008 | ng.com.google
919 | 18942016 | 2795 | 0.000012 | org.panda
920 | 18941622 | 2466 | 0.000013 | uk.gov.justice
921 | 18941430 | 5602 | 0.000007 | net.chinadialogue
922 | 18940924 | 4118 | 0.000008 | cat.uab
923 | 18940746 | 4227 | 0.000008 | com.spokesman
924 | 18940082 | 3523 | 0.000010 | co.com.google
925 | 18939230 | 4473 | 0.000008 | lu.google
926 | 18938996 | 4189 | 0.000008 | pe.com.google
927 | 18938618 | 3366 | 0.000010 | com.nybooks
928 | 18938606 | 4381 | 0.000008 | uk.ac.core
929 | 18938206 | 2228 | 0.000015 | com.termsfeed
930 | 18938194 | 1669 | 0.000018 | com.pcworld
931 | 18938112 | 3846 | 0.000009 | kr.co.yna
932 | 18938002 | 4793 | 0.000007 | com.gust
933 | 18937788 | 3880 | 0.000009 | org.cgiar
934 | 18937300 | 4231 | 0.000008 | pk.com.google
935 | 18936530 | 3575 | 0.000010 | net.inquirer
936 | 18936008 | 3083 | 0.000011 | ru.lenta
937 | 18934000 | 1468 | 0.000020 | com.nokia
938 | 18933676 | 2932 | 0.000011 | tw.com.pchome
939 | 18933496 | 1223 | 0.000024 | com.ycombinator
940 | 18933350 | 2911 | 0.000011 | nl.volkskrant
941 | 18933194 | 78 | 0.000411 | com.oculus
942 | 18932612 | 3455 | 0.000010 | cl.google
943 | 18931862 | 3949 | 0.000009 | org.polymer-project
944 | 18930888 | 2637 | 0.000013 | com.washingtonexaminer
945 | 18930622 | 3945 | 0.000009 | sk.sme
946 | 18930534 | 3389 | 0.000010 | edu.monash
947 | 18930086 | 918 | 0.000032 | com.canva
948 | 18929552 | 454 | 0.000066 | org.opensource
949 | 18929398 | 3977 | 0.000009 | com.rappler
950 | 18928630 | 4000 | 0.000009 | org.plan-international
951 | 18926518 | 4561 | 0.000008 | cr.co.google
952 | 18926412 | 3587 | 0.000010 | lt.google
953 | 18925832 | 3810 | 0.000009 | ca.macleans
954 | 18925646 | 817 | 0.000036 | net.adform
955 | 18925046 | 4873 | 0.000007 | com.blogto
956 | 18924952 | 3508 | 0.000010 | uk.ac.nhm
957 | 18924928 | 3211 | 0.000011 | edu.ua
958 | 18923554 | 2815 | 0.000012 | com.articulate
959 | 18923288 | 249 | 0.000098 | com.sxsw
960 | 18922866 | 3993 | 0.000009 | org.wilsoncenter
961 | 18922676 | 4082 | 0.000009 | edu.lehigh
962 | 18922336 | 417 | 0.000070 | com.skype
963 | 18921546 | 4699 | 0.000008 | com.out
964 | 18920714 | 1085 | 0.000027 | com.redhat
965 | 18920680 | 3266 | 0.000011 | my.com.google
966 | 18919064 | 2031 | 0.000016 | gov.ecfr
967 | 18918900 | 4585 | 0.000008 | org.nsidc
968 | 18918778 | 412 | 0.000070 | net.secureservercdn
969 | 18918112 | 4536 | 0.000008 | kz.google
970 | 18917590 | 3295 | 0.000011 | org.osce
971 | 18917562 | 557 | 0.000053 | org.whatwg
972 | 18917418 | 4096 | 0.000009 | com.wsoctv
973 | 18917380 | 2587 | 0.000013 | uk.org.nationaltrust
974 | 18917220 | 3201 | 0.000011 | uk.gov.london
975 | 18917048 | 1973 | 0.000017 | scot.gov
976 | 18916982 | 3865 | 0.000009 | uk.ac.qub
977 | 18916460 | 3807 | 0.000009 | com.governing
978 | 18916430 | 528 | 0.000056 | com.businesswire
979 | 18916300 | 2253 | 0.000015 | wales.gov
980 | 18915066 | 3422 | 0.000010 | com.afp
981 | 18914982 | 3080 | 0.000011 | uk.ac.qmul
982 | 18914878 | 5154 | 0.000007 | com.ingress
983 | 18914540 | 4596 | 0.000008 | com.webcindario
984 | 18914316 | 3402 | 0.000010 | org.psychiatryonline
985 | 18913230 | 4148 | 0.000008 | org.marxists
986 | 18913096 | 4073 | 0.000009 | me.thinglink
987 | 18912970 | 1660 | 0.000018 | com.css-tricks
988 | 18912858 | 4732 | 0.000008 | ie.nuigalway
989 | 18912514 | 4348 | 0.000008 | com.asiaone
990 | 18912368 | 3354 | 0.000010 | com.kaspersky-labs
991 | 18912110 | 1249 | 0.000024 | com.smashingmagazine
992 | 18912064 | 3787 | 0.000009 | org.nationalinterest
993 | 18911848 | 556 | 0.000053 | com.adweek
994 | 18911436 | 4498 | 0.000008 | ec.com.google
995 | 18911404 | 4722 | 0.000008 | bd.com.google
996 | 18910006 | 4846 | 0.000007 | uy.com.google
997 | 18909998 | 4233 | 0.000008 | com.match
998 | 18909746 | 4021 | 0.000009 | ee.google
999 | 18909688 | 3962 | 0.000009 | com.adn
1000 | 18909474 | 4310 | 0.000008 | com.wnd
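
The names in the last column are in reverse domain name notation, so most tools will want them turned back into ordinary domain names first. The following is a minimal Python sketch of that step. It assumes, hypothetically, that each record of the downloadable rank file is one tab-separated line holding the same five columns as the table above, in the same order; the actual file layout may differ (it may contain additional columns or a header line), and the names `RankRow`, `unreverse` and `parse_rank_line` are illustrative only.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    """One record with the five columns shown in the table above."""
    hc_rank: int      # harmonic centrality rank
    hc_value: float   # harmonic centrality value
    pr_rank: int      # PageRank rank
    pr_value: float   # PageRank value
    domain: str       # domain name in normal (un-reversed) order


def unreverse(name: str) -> str:
    """Turn reverse domain name notation back into a normal name."""
    return ".".join(reversed(name.split(".")))


def parse_rank_line(line: str) -> RankRow:
    """Parse one record.

    Assumption: fields are tab-separated and the first five are
    hc rank, hc value, PageRank rank, PageRank value, reversed name.
    Any further fields are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    hc_rank, hc_value, pr_rank, pr_value, reversed_name = fields[:5]
    return RankRow(int(hc_rank), float(hc_value),
                   int(pr_rank), float(pr_value),
                   unreverse(reversed_name))


# Row 60 of the table above, written as a tab-separated record.
row = parse_rank_line("60\t20219612\t194\t0.000133\tuk.co.bbc")
print(row.domain, row.hc_rank, row.pr_rank)
```

Comparing `hc_rank` with `pr_rank` per record is one simple way to spot domains that the two measures rank very differently.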

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for your research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!