Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2020 and January 2021. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.

Host-level graph

The graph consists of 490 million nodes and 2.57 billion edges. It includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. There are 414 million dangling nodes (84.4%), and the largest strongly connected component contains 42.6 million nodes (8.7%).

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
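The notation can be illustrated with a short Python sketch. The function names `reverse_host` and `unreverse_host` are hypothetical helpers chosen for this example and are not part of the cc-webgraph tools; the sketch only covers the two rules stated above (strip a leading www., reverse the order of the labels).

```python
def reverse_host(host: str) -> str:
    """Return a host name in reverse domain name notation.

    A leading "www." is stripped before the labels are reversed.
    """
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Turn a reversed host name back into the usual notation.

    The stripped "www." cannot be recovered.
    """
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
print(unreverse_host("com.example.subdomain"))    # subdomain.example.com
```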

You can download the graph and the ranks of all 490 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 36 gzip-compressed files listed in two path listings – one for the nodes, one for the edges. First, download the path listing and decompress it with gzip. Then prepend the prefix s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line of the listing to get the list of URLs needed to download the entire graph.
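As a rough sketch of this step in Python: the function below decompresses a path listing and prepends the HTTPS prefix to every line. The function name and the two sample paths are made up for illustration; the real listings contain the actual file paths.

```python
import gzip
import io

HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def urls_from_path_listing(gz_bytes: bytes, prefix: str = HTTPS_PREFIX) -> list[str]:
    """Decompress a gzip-compressed path listing and prefix every line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]


# Stand-in for a downloaded *.paths.gz file (placeholder paths, not real ones).
sample_listing = gzip.compress(
    b"path/to/vertices/part-00000.txt.gz\n"
    b"path/to/vertices/part-00001.txt.gz\n"
)

for url in urls_from_path_listing(sample_listing):
    print(url)
```

Passing `prefix="s3://commoncrawl/"` instead yields S3 URIs for use with S3 tooling.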

Download files of the Common Crawl Oct/Nov/Jan 2020-2021 host-level webgraph

Size | File | Description
3.08 GB | cc-main-2020-21-oct-nov-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files
11.76 GB | cc-main-2020-21-oct-nov-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files
5.18 GB | cc-main-2020-21-oct-nov-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2020-21-oct-nov-jan-host.properties |
5.63 GB | cc-main-2020-21-oct-nov-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2020-21-oct-nov-jan-host-t.properties |
1 kB | cc-main-2020-21-oct-nov-jan-host.stats | WebGraph statistics
7.04 GB | cc-main-2020-21-oct-nov-jan-host-ranks.txt.gz | harmonic centrality and pagerank
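The text representation of nodes ⟨id, rev host⟩ and edges ⟨from_id, to_id⟩ can be read with a few lines of Python. This is a minimal sketch that assumes the fields are tab-separated (an assumption, not stated above) and uses a three-node toy graph; the real files are far too large to load into a dictionary like this. Listing nodes without outgoing edges only approximates the dangling nodes, since a crawled host without any outlinks would be listed as well.

```python
def parse_vertices(lines):
    """Map node id -> reversed host name from '<id>\\t<rev host>' lines (tab assumed)."""
    nodes = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        nodes[int(node_id)] = rev_host
    return nodes


def parse_edges(lines):
    """Parse '<from_id>\\t<to_id>' lines (tab assumed) into integer pairs."""
    edges = []
    for line in lines:
        from_id, to_id = line.rstrip("\n").split("\t")
        edges.append((int(from_id), int(to_id)))
    return edges


def nodes_without_outlinks(nodes, edges):
    """Return the ids of nodes that never appear as the source of an edge."""
    sources = {from_id for from_id, _ in edges}
    return sorted(node_id for node_id in nodes if node_id not in sources)


# Toy data for illustration only.
nodes = parse_vertices(["0\tcom.example\n", "1\tcom.example.subdomain\n", "2\torg.example\n"])
edges = parse_edges(["0\t1\n", "0\t2\n", "1\t2\n"])

for node_id in nodes_without_outlinks(nodes, edges):
    print(node_id, nodes[node_id])  # 2 org.example
```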

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
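The aggregation idea can be sketched in Python as follows. This is not the code used to build the graph: the suffix set is a three-entry toy stand-in for the public suffix list (whose wildcard and exception rules are ignored here), all names are hypothetical, and dropping links within the same domain is a choice made for this sketch only.

```python
# Toy stand-in for the public suffix list, written in reversed notation.
REVERSED_SUFFIXES = {"com", "org", "uk.co"}


def reversed_pld(rev_host: str) -> str:
    """Return the pay-level domain of a reversed host name (still reversed).

    The PLD is the longest matching public suffix plus one more label.
    """
    labels = rev_host.split(".")
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in REVERSED_SUFFIXES:
            return ".".join(labels[: n + 1])
    return rev_host  # no known suffix: keep the name unchanged


def aggregate_edges(host_edges):
    """Collapse host-level edges into distinct domain-level edges."""
    domain_edges = set()
    for from_host, to_host in host_edges:
        from_pld, to_pld = reversed_pld(from_host), reversed_pld(to_host)
        if from_pld != to_pld:  # sketch-only choice: skip links within one domain
            domain_edges.add((from_pld, to_pld))
    return sorted(domain_edges)


host_edges = [
    ("com.example.subdomain", "uk.co.bbc.news"),
    ("com.example", "uk.co.bbc"),
    ("com.example.subdomain", "com.example"),
]
print(aggregate_edges(host_edges))  # [('com.example', 'uk.co.bbc')]
```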

The domain-level graph has 86 million nodes and 1.47 billion edges. 43 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-21-oct-nov-jan/domain/.

Download files of the Common Crawl Oct/Nov/Jan 2020-2021 domain-level webgraph

Size | File | Description
0.59 GB | cc-main-2020-21-oct-nov-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.00 GB | cc-main-2020-21-oct-nov-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.40 GB | cc-main-2020-21-oct-nov-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2020-21-oct-nov-jan-domain.properties |
3.26 GB | cc-main-2020-21-oct-nov-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2020-21-oct-nov-jan-domain-t.properties |
1 kB | cc-main-2020-21-oct-nov-jan-domain.stats | WebGraph statistics
1.85 GB | cc-main-2020-21-oct-nov-jan-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 86 million domain ranks is available for download.
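Because every row carries both ranks, a list ordered by harmonic centrality can be re-sorted by PageRank. The Python sketch below assumes tab-separated rows with the five columns of the top-1000 table, in that order; the exact layout of the downloadable ranks file is an assumption here, and the type and function names are hypothetical. The sample values are the first five entries of the harmonic centrality ranking.

```python
from typing import NamedTuple


class DomainRank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_domain: str


def parse_rank_line(line: str) -> DomainRank:
    """Parse one row; tab-separated columns in this order are an assumption."""
    hc_rank, hc_value, pr_rank, pr_value, rev_domain = line.rstrip("\n").split("\t")
    return DomainRank(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_domain)


sample_rows = [
    "1\t30355566\t1\t0.017956\tcom.googleapis",
    "2\t29427164\t3\t0.012871\tcom.facebook",
    "3\t28173562\t2\t0.012899\tcom.google",
    "4\t25702812\t5\t0.007348\tcom.twitter",
    "5\t25628314\t4\t0.007628\torg.w",
]

ranks = [parse_rank_line(row) for row in sample_rows]
by_pagerank = sorted(ranks, key=lambda r: r.pr_rank)

for r in by_pagerank:
    # Undo the reverse domain name notation for display.
    print(r.pr_rank, ".".join(reversed(r.rev_domain.split("."))))
```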

Top 1000 domains ranked by harmonic centrality (Oct/Nov/Jan 2020-2021)

harmonic centrality rank | hc value | page rank | page rank value | reversed domain name
1 | 30355566 | 1 | 0.017956 | com.googleapis
2 | 29427164 | 3 | 0.012871 | com.facebook
3 | 28173562 | 2 | 0.012899 | com.google
4 | 25702812 | 5 | 0.007348 | com.twitter
5 | 25628314 | 4 | 0.007628 | org.w
6 | 25297808 | 6 | 0.007231 | com.youtube
7 | 24195466 | 9 | 0.005352 | com.instagram
8 | 23355356 | 8 | 0.005532 | org.gmpg
9 | 23233674 | 7 | 0.006500 | com.googletagmanager
10 | 22492432 | 11 | 0.003277 | com.linkedin
11 | 21576402 | 10 | 0.004076 | com.cloudflare
12 | 21468510 | 14 | 0.002649 | com.gravatar
13 | 21395642 | 13 | 0.003020 | org.wordpress
14 | 21353798 | 22 | 0.001726 | com.pinterest
15 | 20946722 | 30 | 0.001242 | org.wikipedia
16 | 20926308 | 19 | 0.001834 | com.wordpress
17 | 20877776 | 16 | 0.002056 | com.gstatic
18 | 20799472 | 15 | 0.002451 | com.bootstrapcdn
19 | 20795402 | 18 | 0.001943 | com.apple
20 | 20626472 | 32 | 0.001165 | com.vimeo
21 | 20527986 | 41 | 0.000886 | be.youtu
22 | 20419038 | 21 | 0.001769 | com.jquery
23 | 20391686 | 28 | 0.001246 | com.microsoft
24 | 20327544 | 24 | 0.001500 | com.wp
25 | 20314602 | 45 | 0.000769 | com.blogspot
26 | 20231490 | 37 | 0.001025 | com.amazonaws
27 | 20208912 | 51 | 0.000691 | com.amazon
28 | 20199388 | 47 | 0.000740 | gl.goo
29 | 20093688 | 71 | 0.000448 | com.tumblr
30 | 20070176 | 35 | 0.001070 | com.google-analytics
31 | 20050256 | 61 | 0.000598 | ly.bit
32 | 20030452 | 20 | 0.001794 | com.adobe
33 | 19998314 | 17 | 0.002005 | com.github
34 | 19989010 | 50 | 0.000715 | org.mozilla
35 | 19962834 | 58 | 0.000639 | eu.europa
36 | 19945306 | 34 | 0.001103 | net.cloudfront
37 | 19849112 | 52 | 0.000682 | com.flickr
38 | 19843288 | 40 | 0.000909 | net.jsdelivr
39 | 19833032 | 91 | 0.000369 | com.googleusercontent
40 | 19823560 | 105 | 0.000347 | com.yahoo
41 | 19752300 | 56 | 0.000650 | co.t
42 | 19722088 | 33 | 0.001114 | com.googlesyndication
43 | 19712406 | 23 | 0.001517 | com.fontawesome
44 | 19708354 | 81 | 0.000392 | com.weebly
45 | 19706054 | 55 | 0.000653 | com.paypal
46 | 19695288 | 109 | 0.000308 | com.reddit
47 | 19641534 | 31 | 0.001231 | me.wp
48 | 19640398 | 73 | 0.000435 | com.medium
49 | 19635162 | 67 | 0.000491 | io.github
50 | 19590444 | 137 | 0.000225 | com.nytimes
51 | 19587880 | 121 | 0.000280 | com.soundcloud
52 | 19585192 | 27 | 0.001262 | ru.yandex
53 | 19583494 | 43 | 0.000786 | com.addthis
54 | 19582250 | 44 | 0.000776 | com.macromedia
55 | 19560416 | 66 | 0.000504 | org.w3
56 | 19549714 | 70 | 0.000451 | com.shopify
57 | 19518672 | 146 | 0.000201 | com.forbes
58 | 19502448 | 144 | 0.000205 | org.archive
59 | 19496300 | 90 | 0.000371 | org.creativecommons
60 | 19490348 | 194 | 0.000131 | uk.co.bbc
61 | 19482926 | 59 | 0.000630 | org.schema
62 | 19479528 | 39 | 0.000910 | com.baidu
63 | 19464572 | 36 | 0.001035 | net.doubleclick
64 | 19459966 | 200 | 0.000129 | com.cnn
65 | 19451100 | 53 | 0.000677 | com.whatsapp
66 | 19449068 | 60 | 0.000611 | com.vk
67 | 19444966 | 206 | 0.000126 | net.slideshare
68 | 19443956 | 158 | 0.000169 | com.bing
69 | 19419878 | 174 | 0.000152 | com.imdb
70 | 19385956 | 186 | 0.000140 | com.imgur
71 | 19372520 | 236 | 0.000112 | com.washingtonpost
72 | 19371076 | 176 | 0.000150 | com.theguardian
73 | 19356952 | 254 | 0.000102 | com.wsj
74 | 19356474 | 210 | 0.000123 | org.wikimedia
75 | 19352128 | 219 | 0.000117 | com.businessinsider
76 | 19347698 | 209 | 0.000123 | com.stackoverflow
77 | 19342712 | 409 | 0.000065 | com.msn
78 | 19326654 | 327 | 0.000079 | com.appspot
79 | 19324334 | 157 | 0.000172 | int.who
80 | 19321112 | 216 | 0.000119 | edu.stanford
81 | 19316796 | 179 | 0.000148 | org.apache
82 | 19310390 | 333 | 0.000078 | com.ibm
83 | 19309354 | 337 | 0.000077 | edu.mit
84 | 19304938 | 225 | 0.000116 | net.sourceforge
85 | 19292932 | 116 | 0.000288 | com.ytimg
86 | 19287812 | 57 | 0.000649 | net.fbcdn
87 | 19282486 | 285 | 0.000091 | com.techcrunch
88 | 19276500 | 269 | 0.000094 | com.bbc
89 | 19275480 | 155 | 0.000181 | com.wixsite
90 | 19275222 | 152 | 0.000189 | gov.nih
91 | 19275200 | 220 | 0.000117 | com.livejournal
92 | 19270650 | 233 | 0.000113 | uk.co.google
93 | 19270610 | 440 | 0.000062 | gov.nasa
94 | 19263354 | 54 | 0.000666 | com.googleadservices
95 | 19243404 | 262 | 0.000097 | edu.harvard
96 | 19243154 | 270 | 0.000094 | com.oracle
97 | 19243126 | 276 | 0.000093 | org.acm
98 | 19238650 | 218 | 0.000117 | org.ietf
99 | 19238450 | 185 | 0.000142 | com.blogger
100 | 19238426 | 223 | 0.000116 | gov.ca
101 | 19234630 | 465 | 0.000059 | fr.free
102 | 19232058 | 259 | 0.000098 | com.bloomberg
103 | 19221844 | 275 | 0.000093 | com.android
104 | 19218636 | 304 | 0.000085 | com.live
105 | 19210812 | 126 | 0.000271 | com.jimdo
106 | 19208896 | 169 | 0.000159 | com.issuu
107 | 19205802 | 166 | 0.000162 | com.giphy
108 | 19194156 | 438 | 0.000062 | com.ted
109 | 19190178 | 348 | 0.000075 | com.huffingtonpost
110 | 19187782 | 130 | 0.000254 | com.weibo
111 | 19186862 | 154 | 0.000186 | us.zoom
112 | 19185794 | 252 | 0.000103 | org.gnu
113 | 19176332 | 403 | 0.000066 | com.myspace
114 | 19162122 | 1039 | 0.000030 | com.wikia
115 | 19152582 | 373 | 0.000071 | net.researchgate
116 | 19150058 | 343 | 0.000075 | com.usatoday
117 | 19148332 | 309 | 0.000084 | com.reuters
118 | 19143988 | 400 | 0.000067 | uk.co.telegraph
119 | 19141202 | 446 | 0.000061 | com.latimes
120 | 19130976 | 372 | 0.000071 | com.example
121 | 19129552 | 345 | 0.000075 | com.githubusercontent
122 | 19127344 | 93 | 0.000366 | com.unpkg
123 | 19127116 | 384 | 0.000069 | com.nature
124 | 19125396 | 336 | 0.000077 | com.wired
125 | 19124320 | 25 | 0.001485 | com.wixstatic
126 | 19114842 | 299 | 0.000087 | org.npr
127 | 19111018 | 308 | 0.000084 | com.cnbc
128 | 19107772 | 328 | 0.000079 | com.ebay
129 | 19103704 | 293 | 0.000088 | com.wiley
130 | 19102814 | 111 | 0.000299 | de.google
131 | 19097732 | 191 | 0.000135 | com.npmjs
132 | 19095454 | 344 | 0.000075 | com.hp
133 | 19088550 | 539 | 0.000050 | com.cisco
134 | 19084048 | 932 | 0.000034 | com.stackexchange
135 | 19081736 | 132 | 0.000251 | com.youtube-nocookie
136 | 19080638 | 134 | 0.000250 | com.ft
137 | 19078814 | 213 | 0.000120 | org.ampproject
138 | 19077232 | 532 | 0.000051 | com.steampowered
139 | 19074638 | 365 | 0.000072 | com.patreon
140 | 19072918 | 455 | 0.000061 | com.theatlantic
141 | 19072880 | 476 | 0.000057 | com.gitlab
142 | 19072344 | 890 | 0.000035 | com.pcmag
143 | 19068436 | 195 | 0.000131 | com.unsplash
144 | 19065494 | 877 | 0.000036 | edu.psu
145 | 19063926 | 376 | 0.000070 | com.time
146 | 19061142 | 208 | 0.000125 | com.twimg
147 | 19061064 | 164 | 0.000165 | com.yelp
148 | 19059332 | 873 | 0.000036 | edu.washington
149 | 19057196 | 533 | 0.000051 | edu.cornell
150 | 19054152 | 148 | 0.000197 | com.dropbox
151 | 19051738 | 603 | 0.000046 | org.arxiv
152 | 19047626 | 379 | 0.000070 | com.statista
153 | 19043050 | 324 | 0.000080 | org.un
154 | 19042602 | 249 | 0.000104 | com.bandcamp
155 | 19040914 | 824 | 0.000038 | com.venturebeat
156 | 19040684 | 75 | 0.000432 | me.fb
157 | 19039882 | 841 | 0.000037 | org.chromium
158 | 19033464 | 65 | 0.000519 | com.wix
159 | 19026244 | 284 | 0.000092 | com.sciencedirect
160 | 19019766 | 629 | 0.000045 | edu.yale
161 | 19016326 | 584 | 0.000047 | com.pexels
162 | 19015230 | 826 | 0.000038 | org.bitbucket
163 | 19010452 | 832 | 0.000038 | org.ieee
164 | 19007636 | 388 | 0.000068 | com.springer
165 | 19001810 | 765 | 0.000041 | com.evernote
166 | 18997506 | 855 | 0.000037 | edu.upenn
167 | 18994926 | 258 | 0.000098 | jp.ameblo
168 | 18993772 | 149 | 0.000195 | me.t
169 | 18992834 | 416 | 0.000065 | org.hbr
170 | 18992028 | 296 | 0.000088 | com.outlook
171 | 18985954 | 168 | 0.000160 | jp.co.yahoo
172 | 18983238 | 577 | 0.000048 | com.cbsnews
173 | 18982546 | 792 | 0.000040 | me.about
174 | 18981228 | 891 | 0.000035 | com.git-scm
175 | 18980336 | 829 | 0.000038 | com.economist
176 | 18980328 | 150 | 0.000193 | com.opera
177 | 18978056 | 138 | 0.000223 | me.line
178 | 18974996 | 450 | 0.000061 | com.goodreads
179 | 18973364 | 645 | 0.000044 | com.mysql
180 | 18973114 | 842 | 0.000037 | com.docker
181 | 18969708 | 562 | 0.000048 | com.buzzfeed
182 | 18969566 | 565 | 0.000048 | com.mashable
183 | 18968398 | 587 | 0.000047 | com.mozilla
184 | 18964540 | 951 | 0.000034 | com.about
185 | 18962632 | 797 | 0.000040 | org.worldbank
186 | 18956128 | 815 | 0.000039 | com.newyorker
187 | 18954668 | 342 | 0.000076 | com.dribbble
188 | 18954236 | 265 | 0.000096 | net.behance
189 | 18951876 | 390 | 0.000068 | com.theverge
190 | 18951838 | 501 | 0.000054 | gov.whitehouse
191 | 18950142 | 456 | 0.000061 | uk.co.dailymail
192 | 18943890 | 347 | 0.000075 | com.xinhuanet
193 | 18942812 | 320 | 0.000080 | com.w3schools
194 | 18941124 | 378 | 0.000070 | com.fc2
195 | 18936488 | 1151 | 0.000027 | edu.wisc
196 | 18935074 | 764 | 0.000041 | gov.noaa
197 | 18932396 | 294 | 0.000088 | com.disqus
198 | 18931228 | 1337 | 0.000023 | co.elastic
199 | 18927646 | 38 | 0.000956 | com.qq
200 | 18926694 | 448 | 0.000061 | com.bigcommerce
201 | 18926442 | 624 | 0.000045 | gov.loc
202 | 18925620 | 156 | 0.000179 | gov.cdc
203 | 18924632 | 929 | 0.000035 | gov.fcc
204 | 18922816 | 136 | 0.000228 | info.aboutads
205 | 18921630 | 821 | 0.000039 | com.qz
206 | 18921308 | 2295 | 0.000015 | com.wikidot
207 | 18919240 | 385 | 0.000069 | com.scribd
208 | 18915104 | 748 | 0.000042 | org.unesco
209 | 18914418 | 959 | 0.000033 | com.apnews
210 | 18912426 | 375 | 0.000070 | com.digg
211 | 18911082 | 779 | 0.000040 | com.vox
212 | 18910370 | 180 | 0.000147 | com.amazon-adsystem
213 | 18910110 | 272 | 0.000094 | com.squareup
214 | 18907410 | 495 | 0.000054 | uk.co.independent
215 | 18906224 | 256 | 0.000100 | org.iana
216 | 18905608 | 1251 | 0.000025 | edu.uchicago
217 | 18901398 | 420 | 0.000064 | com.force
218 | 18898702 | 646 | 0.000044 | com.usnews
219 | 18898108 | 647 | 0.000044 | com.gartner
220 | 18894918 | 295 | 0.000088 | com.nbcnews
221 | 18890160 | 470 | 0.000058 | com.dailymotion
222 | 18883488 | 1004 | 0.000031 | com.dropboxusercontent
223 | 18878276 | 617 | 0.000045 | org.pbs
224 | 18876454 | 181 | 0.000147 | jp.co.google
225 | 18876164 | 113 | 0.000292 | com.sharethis
226 | 18875824 | 467 | 0.000059 | com.nationalgeographic
227 | 18874112 | 811 | 0.000039 | uk.co.blogspot
228 | 18873340 | 844 | 0.000037 | au.net.abc
229 | 18868000 | 934 | 0.000034 | com.foxnews
230 | 18865322 | 1559 | 0.000020 | org.eclipse
231 | 18859464 | 399 | 0.000067 | com.getpocket
232 | 18859228 | 947 | 0.000034 | com.slate
233 | 18859062 | 266 | 0.000095 | org.doi
234 | 18858866 | 63 | 0.000541 | com.fb
235 | 18856638 | 968 | 0.000033 | com.politico
236 | 18849992 | 907 | 0.000035 | com.playstation
237 | 18849334 | 600 | 0.000046 | org.semver
238 | 18848468 | 1565 | 0.000020 | gd.is
239 | 18847004 | 1311 | 0.000024 | edu.unc
240 | 18846758 | 1523 | 0.000021 | org.kernel
241 | 18846310 | 839 | 0.000037 | org.sciencemag
242 | 18846038 | 257 | 0.000099 | com.typepad
243 | 18844998 | 1152 | 0.000027 | com.hatenablog
244 | 18844004 | 1981 | 0.000018 | com.googlesource
245 | 18842180 | 202 | 0.000128 | com.naver
246 | 18840548 | 248 | 0.000104 | com.feedburner
247 | 18839830 | 1028 | 0.000030 | edu.umn
248 | 18837518 | 421 | 0.000064 | com.ecwid
249 | 18833048 | 332 | 0.000078 | net.windows
250 | 18831042 | 914 | 0.000035 | com.trello
251 | 18829176 | 554 | 0.000049 | com.tandfonline
252 | 18829172 | 1369 | 0.000023 | cn.com.chinadaily
253 | 18828382 | 189 | 0.000138 | org.allaboutcookies
254 | 18825844 | 746 | 0.000042 | gov.senate
255 | 18823946 | 119 | 0.000286 | com.paypalobjects
256 | 18819980 | 1005 | 0.000031 | ly.ow
257 | 18818724 | 2014 | 0.000017 | org.tensorflow
258 | 18818710 | 901 | 0.000035 | edu.umich
259 | 18817936 | 291 | 0.000089 | com.tinyurl
260 | 18817212 | 479 | 0.000056 | org.pewresearch
261 | 18815000 | 76 | 0.000423 | com.list-manage
262 | 18811132 | 239 | 0.000111 | com.wpengine
263 | 18806908 | 834 | 0.000038 | ca.cbc
264 | 18805144 | 740 | 0.000043 | co.ibb
265 | 18804044 | 477 | 0.000057 | gov.fda
266 | 18802934 | 222 | 0.000117 | com.eepurl
267 | 18802462 | 318 | 0.000081 | it.google
268 | 18798744 | 79 | 0.000413 | net.facebook
269 | 18797046 | 2019 | 0.000017 | com.instructables
270 | 18795562 | 1200 | 0.000026 | edu.northwestern
271 | 18794710 | 752 | 0.000042 | org.change
272 | 18793610 | 394 | 0.000068 | es.google
273 | 18793484 | 893 | 0.000035 | org.cambridge
274 | 18790202 | 251 | 0.000103 | com.calendly
275 | 18784862 | 962 | 0.000033 | gov.congress
276 | 18784862 | 1022 | 0.000030 | uk.co.guardian
277 | 18782014 | 555 | 0.000049 | com.bigcartel
278 | 18777808 | 1348 | 0.000023 | org.semanticscholar
279 | 18776340 | 1006 | 0.000031 | com.gumroad
280 | 18775690 | 637 | 0.000044 | org.plos
281 | 18774956 | 1341 | 0.000023 | com.nikkei
282 | 18773712 | 313 | 0.000083 | com.optimizely
283 | 18772988 | 405 | 0.000066 | com.googlecode
284 | 18766674 | 896 | 0.000035 | gov.justice
285 | 18764788 | 1044 | 0.000029 | com.huffpost
286 | 18764312 | 153 | 0.000186 | com.addtoany
287 | 18763408 | 398 | 0.000067 | me.m
288 | 18761658 | 80 | 0.000403 | com.wsimg
289 | 18760046 | 411 | 0.000065 | com.tripod
290 | 18754884 | 957 | 0.000033 | ee.linktr
291 | 18754526 | 1021 | 0.000030 | gov.usgs
292 | 18753164 | 1459 | 0.000021 | uk.co.wired
293 | 18752728 | 338 | 0.000077 | fr.google
294 | 18751846 | 1059 | 0.000029 | com.500px
295 | 18751636 | 452 | 0.000061 | ca.google
296 | 18749418 | 1996 | 0.000017 | com.amd
297 | 18744444 | 1944 | 0.000018 | com.azure
298 | 18742964 | 777 | 0.000040 | au.com.google
299 | 18742506 | 481 | 0.000056 | com.163
300 | 18741292 | 1091 | 0.000028 | com.ssrn
301 | 18740758 | 1065 | 0.000029 | com.newsweek
302 | 18734910 | 1688 | 0.000019 | ca.utoronto
303 | 18734620 | 139 | 0.000218 | com.spotify
304 | 18731112 | 744 | 0.000042 | cn.com.people
305 | 18730384 | 334 | 0.000078 | page.g
306 | 18730074 | 2751 | 0.000012 | com.nabble
307 | 18728400 | 1454 | 0.000021 | com.howstuffworks
308 | 18722938 | 2107 | 0.000016 | com.lego
309 | 18719762 | 1675 | 0.000019 | com.storify
310 | 18719332 | 1140 | 0.000027 | uk.co.thetimes
311 | 18717930 | 801 | 0.000039 | site.business
312 | 18717726 | 884 | 0.000036 | uk.ac.ox
313 | 18716206 | 311 | 0.000083 | com.bitly
314 | 18715060 | 1218 | 0.000026 | com.scmp
315 | 18713618 | 798 | 0.000040 | com.adage
316 | 18713552 | 654 | 0.000044 | com.indiatimes
317 | 18712564 | 1908 | 0.000018 | de.mpg
318 | 18712368 | 1057 | 0.000029 | com.thehill
319 | 18705466 | 519 | 0.000052 | com.criteo
320 | 18704754 | 1078 | 0.000028 | org.ohchr
321 | 18704474 | 1531 | 0.000020 | com.aljazeera
322 | 18703348 | 802 | 0.000039 | uk.gov.service
323 | 18701482 | 1545 | 0.000020 | org.greenpeace
324 | 18699064 | 331 | 0.000078 | com.netdna-ssl
325 | 18698378 | 967 | 0.000033 | ch.google
326 | 18693994 | 784 | 0.000040 | us.icio
327 | 18693690 | 1153 | 0.000027 | int.coe
328 | 18692556 | 933 | 0.000034 | org.d3js
329 | 18690456 | 1499 | 0.000021 | com.history
330 | 18689794 | 1018 | 0.000030 | com.netlify
331 | 18688064 | 1320 | 0.000023 | com.nymag
332 | 18687064 | 1363 | 0.000023 | org.wiktionary
333 | 18684868 | 287 | 0.000091 | ru.ok
334 | 18683792 | 1293 | 0.000024 | com.intuit
335 | 18682796 | 1419 | 0.000022 | uk.co.standard
336 | 18681388 | 1995 | 0.000017 | edu.arizona
337 | 18679058 | 944 | 0.000034 | gov.archives
338 | 18678794 | 953 | 0.000034 | ru.google
339 | 18677084 | 1054 | 0.000029 | sg.com.google
340 | 18675890 | 900 | 0.000035 | br.com.google
341 | 18674402 | 85 | 0.000385 | co.g
342 | 18674068 | 1975 | 0.000018 | com.wattpad
343 | 18673754 | 526 | 0.000051 | ru.gov
344 | 18673370 | 1351 | 0.000023 | com.ikea
345 | 18668598 | 1461 | 0.000021 | com.reverbnation
346 | 18668444 | 2681 | 0.000013 | edu.drexel
347 | 18668276 | 1121 | 0.000027 | edu.si
348 | 18666994 | 1174 | 0.000027 | uk.co.mirror
349 | 18666846 | 2572 | 0.000013 | org.maven
350 | 18666724 | 412 | 0.000065 | com.cnet
351 | 18664542 | 580 | 0.000048 | org.openstreetmap
352 | 18663710 | 1373 | 0.000023 | com.jetbrains
353 | 18663688 | 1032 | 0.000030 | com.theconversation
354 | 18663542 | 1921 | 0.000018 | com.newscientist
355 | 18661472 | 847 | 0.000037 | gov.state
356 | 18661146 | 1572 | 0.000020 | ms.1drv
357 | 18660152 | 2644 | 0.000013 | com.mystrikingly
358 | 18655360 | 973 | 0.000032 | org.fao
359 | 18654458 | 590 | 0.000047 | cn.google
360 | 18653472 | 235 | 0.000112 | com.etsy
361 | 18652232 | 1485 | 0.000021 | com.flipboard
362 | 18651820 | 767 | 0.000041 | com.deviantart
363 | 18651514 | 1375 | 0.000023 | com.thedailybeast
364 | 18651404 | 1220 | 0.000026 | org.jstor
365 | 18649024 | 1270 | 0.000024 | com.strikingly
366 | 18647422 | 2045 | 0.000017 | blog.home
367 | 18646812 | 634 | 0.000044 | com.zdnet
368 | 18644828 | 325 | 0.000079 | tv.twitch
369 | 18642272 | 2781 | 0.000012 | com.diigo
370 | 18640482 | 1123 | 0.000027 | com.britannica
371 | 18639254 | 1904 | 0.000018 | ca.ubc
372 | 18638840 | 367 | 0.000072 | com.jotform
373 | 18635188 | 1959 | 0.000018 | com.gettyimages
374 | 18634254 | 1685 | 0.000019 | com.channel4
375 | 18631278 | 1494 | 0.000021 | org.pypi
376 | 18630386 | 813 | 0.000039 | in.co.google
377 | 18627814 | 417 | 0.000064 | com.ssl-images-amazon
378 | 18626978 | 161 | 0.000166 | gle.forms
379 | 18623310 | 1982 | 0.000018 | org.hrw
380 | 18623132 | 281 | 0.000092 | com.cloudinary
381 | 18618612 | 1382 | 0.000022 | au.com.smh
382 | 18617234 | 1566 | 0.000020 | uk.co.metro
383 | 18617180 | 2031 | 0.000017 | hk.com.google
384 | 18617072 | 1599 | 0.000020 | edu.ufl
385 | 18613590 | 2332 | 0.000015 | ly.rebrand
386 | 18612786 | 457 | 0.000061 | net.imgix
387 | 18609746 | 418 | 0.000064 | com.webflow
388 | 18609050 | 2311 | 0.000015 | com.shutterfly
389 | 18607782 | 568 | 0.000048 | com.feedly
390 | 18603850 | 538 | 0.000050 | gov.epa
391 | 18602470 | 104 | 0.000348 | com.stripe
392 | 18601118 | 83 | 0.000391 | net.jsfiddle
393 | 18599796 | 3423 | 0.000010 | org.aclweb
394 | 18597166 | 2348 | 0.000014 | com.yarnpkg
395 | 18596278 | 69 | 0.000461 | net.akamaihd
396 | 18596202 | 1907 | 0.000018 | gov.supremecourt
397 | 18595244 | 2344 | 0.000014 | com.thefreedictionary
398 | 18593816 | 468 | 0.000058 | nl.google
399 | 18592072 | 1578 | 0.000020 | com.dw
400 | 18588294 | 2955 | 0.000012 | com.upi
401 | 18587932 | 981 | 0.000032 | com.thelancet
402 | 18587926 | 425 | 0.000064 | com.slack
403 | 18587680 | 396 | 0.000067 | com.kickstarter
404 | 18587378 | 787 | 0.000040 | com.urldefense
405 | 18585950 | 1713 | 0.000019 | ca.sfu
406 | 18583582 | 460 | 0.000060 | com.livechatinc
407 | 18581082 | 623 | 0.000045 | com.quora
408 | 18580964 | 428 | 0.000063 | com.rackcdn
409 | 18580620 | 1967 | 0.000018 | com.euronews
410 | 18580552 | 451 | 0.000061 | com.go
411 | 18580130 | 1368 | 0.000023 | com.tunein
412 | 18578076 | 594 | 0.000046 | ru.liveinternet
413 | 18576712 | 475 | 0.000057 | com.googleblog
414 | 18571776 | 2597 | 0.000013 | pt.sapo
415 | 18571212 | 2109 | 0.000016 | com.itv
416 | 18570630 | 1945 | 0.000018 | uk.co.huffingtonpost
417 | 18570542 | 1286 | 0.000024 | edu.brookings
418 | 18570528 | 4423 | 0.000008 | tl.page
419 | 18570058 | 2369 | 0.000014 | com.angelfire
420 | 18568882 | 2614 | 0.000013 | org.wikibooks
421 | 18567302 | 1692 | 0.000019 | com.ifttt
422 | 18564134 | 861 | 0.000036 | com.freepik
423 | 18563246 | 2244 | 0.000015 | com.netvibes
424 | 18562602 | 133 | 0.000251 | com.mailchimp
425 | 18562564 | 364 | 0.000072 | me.telegram
426 | 18562400 | 561 | 0.000048 | com.microsoftonline
427 | 18562224 | 1976 | 0.000018 | uk.co.express
428 | 18559206 | 2888 | 0.000012 | sg.edu.nus
429 | 18559092 | 1928 | 0.000018 | io.webflow
430 | 18557292 | 772 | 0.000041 | pl.google
431 | 18555900 | 480 | 0.000056 | com.meetup
432 | 18555482 | 4752 | 0.000007 | com.newgrounds
433 | 18554944 | 2397 | 0.000014 | google.ai
434 | 18554512 | 2439 | 0.000014 | com.yolasite
435 | 18553912 | 2124 | 0.000016 | jp.geocities
436 | 18552986 | 3394 | 0.000011 | com.instapaper
437 | 18551338 | 362 | 0.000072 | com.proofpoint
438 | 18548844 | 1358 | 0.000023 | com.people
439 | 18546296 | 64 | 0.000531 | net.typekit
440 | 18543694 | 2104 | 0.000016 | org.c-span
441 | 18541918 | 159 | 0.000169 | ru.mail
442 | 18541834 | 2043 | 0.000017 | com.avg
443 | 18540650 | 2249 | 0.000015 | app.netlify
444 | 18539394 | 3004 | 0.000011 | com.000webhostapp
445 | 18539316 | 485 | 0.000055 | com.elsevier
446 | 18538008 | 3494 | 0.000010 | cn.edu.pku
447 | 18536872 | 1609 | 0.000020 | com.asahi
448 | 18535422 | 876 | 0.000036 | org.worldwildlife
449 | 18535204 | 1127 | 0.000027 | uk.parliament
450 | 18534822 | 1956 | 0.000018 | uk.gov.ons
451 | 18533694 | 188 | 0.000138 | com.iubenda
452 | 18532790 | 2113 | 0.000016 | org.documentcloud
453 | 18532338 | 3074 | 0.000011 | uk.co.timesonline
454 | 18531118 | 264 | 0.000096 | com.office
455 | 18527764 | 237 | 0.000112 | com.eventbrite
456 | 18527012 | 2699 | 0.000013 | com.self
457 | 18526172 | 2511 | 0.000013 | com.foreignpolicy
458 | 18524804 | 2421 | 0.000014 | org.sundance
459 | 18524702 | 214 | 0.000120 | com.aliyuncs
460 | 18524140 | 1213 | 0.000026 | be.google
461 | 18523242 | 2200 | 0.000016 | ie.google
462 | 18523000 | 1432 | 0.000022 | gov.weather
463 | 18522694 | 3136 | 0.000011 | com.openai
464 | 18522588 | 879 | 0.000036 | org.mediawiki
465 | 18521124 | 2806 | 0.000012 | com.pearltrees
466 | 18520306 | 1704 | 0.000019 | com.firebaseapp
467 | 18516520 | 3620 | 0.000010 | com.dailycaller
468 | 18514512 | 498 | 0.000054 | it.placehold
469 | 18514168 | 2695 | 0.000013 | com.france24
470 | 18513026 | 644 | 0.000044 | edu.berkeley
471 | 18512138 | 492 | 0.000055 | cn.360
472 | 18511428 | 2296 | 0.000015 | com.msnbc
473 | 18510986 | 2089 | 0.000017 | com.thestar
474 | 18510258 | 3732 | 0.000009 | me.site123
475 | 18509392 | 2133 | 0.000016 | com.gfycat
476 | 18508906 | 341 | 0.000076 | com.rawgit
477 | 18507920 | 521 | 0.000052 | com.gmail
478 | 18507686 | 1952 | 0.000018 | org.ocks
479 | 18506872 | 2739 | 0.000012 | org.rsc
480 | 18504310 | 2486 | 0.000014 | edu.hawaii
481 | 18503766 | 2366 | 0.000014 | de.br
482 | 18503250 | 2447 | 0.000014 | edu.colostate
483 | 18502578 | 171 | 0.000154 | com.zendesk
484 | 18501424 | 2222 | 0.000015 | org.nobelprize
485 | 18501096 | 3293 | 0.000011 | net.pixnet
486 | 18500188 | 1528 | 0.000020 | net.seesaa
487 | 18500164 | 2471 | 0.000014 | com.motherjones
488 | 18499720 | 756 | 0.000042 | com.vice
489 | 18499378 | 4234 | 0.000008 | com.masslive
490 | 18496634 | 2355 | 0.000014 | com.cision
491 | 18495058 | 101 | 0.000361 | com.godaddy
492 | 18492104 | 886 | 0.000036 | gov.nist
493 | 18491956 | 1249 | 0.000025 | org.ilo
494 | 18490654 | 2070 | 0.000017 | com.surveygizmo
495 | 18490628 | 3378 | 0.000011 | com.minds
496 | 18490576 | 635 | 0.000044 | com.matterport
497 | 18489858 | 2656 | 0.000013 | ph.com.google
498 | 18488106 | 369 | 0.000071 | org.python
499 | 18487032 | 980 | 0.000032 | gov.va
500 | 18485800 | 1166 | 0.000027 | at.google
501 | 18485152 | 1318 | 0.000023 | se.google
502 | 18483644 | 1961 | 0.000018 | ru.ucoz
503 | 18482996 | 2401 | 0.000014 | com.freep
504 | 18482190 | 3874 | 0.000009 | com.wizards
505 | 18481738 | 3583 | 0.000010 | edu.uvm
506 | 18478142 | 3711 | 0.000010 | org.tvtropes
507 | 18476988 | 1506 | 0.000021 | com.cognitoforms
508 | 18476516 | 1493 | 0.000021 | gov.uscourts
509 | 18476024 | 3530 | 0.000010 | org.oxfam
510 | 18473992 | 2235 | 0.000015 | cn.t
511 | 18473054 | 4331 | 0.000008 | fm.ask
512 | 18473034 | 1708 | 0.000019 | dk.google
513 | 18470526 | 3122 | 0.000011 | de.dw
514 | 18467204 | 2009 | 0.000017 | ua.com.google
515 | 18467126 | 3935 | 0.000009 | com.youdao
516 | 18464016 | 128 | 0.000262 | org.networkadvertising
517 | 18462968 | 1031 | 0.000030 | com.arstechnica
518 | 18462674 | 2310 | 0.000015 | int.unfccc
519 | 18461844 | 3323 | 0.000011 | ch.nzz
520 | 18460156 | 123 | 0.000276 | com.statcounter
521 | 18460126 | 3757 | 0.000009 | net.hinet
522 | 18460018 | 2484 | 0.000014 | com.washingtontimes
523 | 18459778 | 3391 | 0.000011 | edu.miami
524 | 18459648 | 5025 | 0.000007 | tw.com.gamer
525 | 18459120 | 4313 | 0.000008 | ch.qos
526 | 18458774 | 788 | 0.000040 | com.intel
527 | 18456584 | 2220 | 0.000015 | mx.com.google
528 | 18455734 | 2241 | 0.000015 | gov.ky
529 | 18455504 | 3426 | 0.000010 | com.nwsource
530 | 18454948 | 856 | 0.000037 | io.readthedocs
531 | 18453730 | 2187 | 0.000016 | gov.cisa
532 | 18451988 | 2256 | 0.000015 | com.straitstimes
533 | 18449466 | 371 | 0.000071 | io.codepen
534 | 18447006 | 361 | 0.000072 | com.prnewswire
535 | 18446224 | 4097 | 0.000009 | com.smore
536 | 18446132 | 2188 | 0.000016 | pt.google
537 | 18445920 | 2719 | 0.000012 | net.bplaced
538 | 18445802 | 5349 | 0.000007 | net.wargaming
539 | 18445232 | 3272 | 0.000011 | org.csis
540 | 18444732 | 1435 | 0.000022 | org.aarp
541 | 18444080 | 289 | 0.000090 | net.php
542 | 18443758 | 2282 | 0.000015 | no.google
543 | 18443228 | 3924 | 0.000009 | com.steemit
544 | 18443146 | 1304 | 0.000024 | tw.com.google
545 | 18442018 | 314 | 0.000083 | com.squarespace
546 | 18440872 | 743 | 0.000043 | com.oreilly
547 | 18440596 | 199 | 0.000130 | com.hubspot
548 | 18439354 | 4877 | 0.000007 | com.bonanza
549 | 18438802 | 2020 | 0.000017 | co.lpages
550 | 18438606 | 1079 | 0.000028 | net.ovh
551 | 18438208 | 835 | 0.000037 | com.imageshack
552 | 18437874 | 4023 | 0.000009 | com.doodlekit
553 | 18436818 | 2425 | 0.000014 | com.voanews
554 | 18436680 | 358 | 0.000073 | ru.rambler
555 | 18436048 | 2805 | 0.000012 | com.nationalpost
556 | 18435420 | 4534 | 0.000008 | by.google
557 | 18435256 | 614 | 0.000045 | org.nodejs
558 | 18435200 | 397 | 0.000067 | com.onesignal
559 | 18434470 | 3374 | 0.000011 | fr.rfi
560 | 18434466 | 463 | 0.000060 | gov.irs
561 | 18434444 | 2584 | 0.000013 | com.snopes
562 | 18434230 | 1899 | 0.000018 | link.page
563 | 18434190 | 3637 | 0.000010 | org.vim
564 | 18434018 | 2240 | 0.000015 | th.co.google
565 | 18433782 | 3395 | 0.000010 | org.scala-lang
566 | 18432434 | 3142 | 0.000011 | com.inquirer
567 | 18430898 | 2887 | 0.000012 | org.ballotpedia
568 | 18430888 | 3324 | 0.000011 | com.real
569 | 18428600 | 649 | 0.000044 | br.com.uol
570 | 18428004 | 513 | 0.000052 | com.pixabay
571 | 18426658 | 2142 | 0.000016 | uk.co.which
572 | 18426634 | 4070 | 0.000009 | com.viki
573 | 18425674 | 1038 | 0.000030 | com.thenextweb
574 | 18424302 | 3146 | 0.000011 | org.aps
575 | 18424050 | 2764 | 0.000012 | com.post-gazette
576 | 18423516 | 2499 | 0.000014 | net.openid
577 | 18422702 | 2627 | 0.000013 | edu.usf
578 | 18421138 | 82 | 0.000391 | com.livestream
579 | 18420414 | 961 | 0.000033 | jp.shinobi
580 | 18420272 | 956 | 0.000033 | int.wipo
581 | 18417146 | 4450 | 0.000008 | com.bravesites
582 | 18415542 | 2881 | 0.000012 | ru.aif
583 | 18414574 | 2906 | 0.000012 | io.gitlab
584 | 18414284 | 3387 | 0.000011 | org.pri
585 | 18414276 | 1932 | 0.000018 | gov.ct
586 | 18413984 | 2602 | 0.000013 | il.co.google
587 | 18413906 | 1910 | 0.000018 | org.oxfordjournals
588 | 18413218 | 4664 | 0.000008 | com.ucoz
589 | 18412422 | 566 | 0.000048 | com.photobucket
590 | 18412344 | 2191 | 0.000016 | com.xrea
591 | 18412198 | 2234 | 0.000015 | nz.co.google
592 | 18410920 | 2088 | 0.000017 | net.cnki
593 | 18410828 | 2847 | 0.000012 | com.webbyawards
594 | 18410164 | 433 | 0.000063 | com.staticflickr
595 | 18409934 | 3675 | 0.000010 | org.heritage
596 | 18408908 | 1993 | 0.000018 | tr.com.google
597 | 18408574 | 2053 | 0.000017 | com.treehugger
598 | 18406062 | 1695 | 0.000019 | net.leadpages
599 | 18405282 | 2112 | 0.000016 | fi.google
600 | 18402764 | 5153 | 0.000007 | kz.google
601 | 18402708 | 211 | 0.000121 | to.amzn
602 | 18402670 | 569 | 0.000048 | com.deloitte
603 | 18402662 | 1100 | 0.000028 | cz.google
604 | 18402526 | 4562 | 0.000008 | com.freehostia
605 | 18402334 | 2156 | 0.000016 | gov.faa
606 | 18402326 | 2724 | 0.000012 | com.detroitnews
607 | 18402220 | 2774 | 0.000012 | com.slidesharecdn
608 | 18402102 | 346 | 0.000075 | com.adnxs
609 | 18396726 | 812 | 0.000039 | com.thinkwithgoogle
610 | 18392816 | 1471 | 0.000021 | com.trustwave
611 | 18392376 | 2640 | 0.000013 | org.iea
612 | 18392262 | 2883 | 0.000012 | jp.blog
613 | 18391148 | 4426 | 0.000008 | com.goal
614 | 18390184 | 3284 | 0.000011 | com.financialpost
615 | 18389140 | 3636 | 0.000010 | net.alarabiya
616 | 18389082 | 3570 | 0.000010 | org.neocities
617 | 18388580 | 3784 | 0.000009 | co.ello
618 | 18388256 | 207 | 0.000126 | com.salesforce
619 | 18386478 | 3500 | 0.000010 | com.archdaily
620 | 18385984 | 4517 | 0.000008 | com.alamy
621 | 18385924 | 2297 | 0.000015 | gr.google
622 | 18385398 | 160 | 0.000168 | gov.privacyshield
623 | 18385020 | 2569 | 0.000013 | org.kqed
624 | 18383196 | 277 | 0.000093 | org.drupal
625 | 18382110 | 354 | 0.000074 | com.snapchat
626 | 18381496 | 2338 | 0.000015 | ro.google
627 | 18381392 | 3367 | 0.000011 | uk.ac.leeds
628 | 18381316 | 271 | 0.000094 | com.mapbox
629 | 18380144 | 3907 | 0.000009 | uk.gov.scotland
630 | 18379620 | 1946 | 0.000018 | hu.google
631 | 18378224 | 4399 | 0.000008 | co.aeon
632 | 18377446 | 374 | 0.000070 | com.cdninstagram
633 | 18376062 | 3545 | 0.000010 | gov.fec
634 | 18376022 | 3312 | 0.000011 | com.virgin
635 | 18375628 | 2219 | 0.000015 | ar.com.google
636 | 18375060 | 4128 | 0.000009 | cn.globaltimes
637 | 18374688 | 4333 | 0.000008 | com.corel
638 | 18374066 | 464 | 0.000059 | com.herokuapp
639 | 18373200 | 4062 | 0.000009 | jp.go.ndl
640 | 18373110 | 791 | 0.000040 | google.blog
641 | 18372316 | 2208 | 0.000016 | com.justia
642 | 18372216 | 2320 | 0.000015 | za.co.google
643 | 18370616 | 2216 | 0.000016 | ru.ria
644 | 18370232 | 3694 | 0.000010 | com.intensedebate
645 | 18369794 | 3793 | 0.000009 | com.visualcapitalist
646 | 18369094 | 2722 | 0.000012 | si.google
647 | 18368512 | 4182 | 0.000008 | com.rediff
648 | 18367604 | 3834 | 0.000009 | ca.uvic
649 | 18367236 | 2577 | 0.000013 | ru.rosminzdrav
650 | 18365918 | 439 | 0.000062 | com.nypost
651 | 18365880 | 4678 | 0.000008 | org.wikimapia
652 | 18365350 | 3439 | 0.000010 | com.nationalreview
653 | 18364962 | 2134 | 0.000016 | uk.org.asa
654 | 18364282 | 3850 | 0.000009 | tw.edu.ntu
655 | 18363974 | 598 | 0.000046 | com.samsung
656 | 18363190 | 2703 | 0.000012 | is.google
657 | 18362598 | 3869 | 0.000009 | com.podomatic
658 | 18361242 | 316 | 0.000082 | cn.bshare
659 | 18360424 | 3484 | 0.000010 | org.wri
660 | 18360028 | 4160 | 0.000009 | uk.co.spectator
661 | 18359858 | 1711 | 0.000019 | ly.cutt
662 | 18358316 | 4989 | 0.000007 | to.gplus
663 | 18358086 | 4908 | 0.000007 | com.atwebpages
664 | 18357826 | 177 | 0.000150 | com.tripadvisor
665 | 18357438 | 5003 | 0.000007 | org.scala-sbt
666 | 18356488 | 4276 | 0.000008 | ru.msu
667 | 18356450 | 1161 | 0.000027 | com.udemy
668 | 18355358 | 2973 | 0.000011 | com.timesofisrael
669 | 18352506 | 5213 | 0.000007 | edu.csulb
670 | 18351622 | 4744 | 0.000007 | com.authorstream
671 | 18350944 | 4127 | 0.000009 | gy.rb
672 | 18350110 | 3204 | 0.000011 | us.ny.state
673 | 18349876 | 3644 | 0.000010 | com.linuxquota
674 | 18349798 | 3563 | 0.000010 | com.udn
675 | 18349578 | 3845 | 0.000009 | org.jenkins-ci
676 | 18349508 | 1686 | 0.000019 | com.pcworld
677 | 18349104 | 2481 | 0.000014 | uk.ac.imperial
678 | 18348784 | 5238 | 0.000007 | com.etymonline
679 | 18348026 | 3492 | 0.000010 | eg.com.google
680 | 18347774 | 3363 | 0.000011 | uk.co.bbci
681 | 18347338 | 2386 | 0.000014 | com.name
682 | 18346938 | 3745 | 0.000009 | com.novell
683 | 18345924 | 1487 | 0.000021 | com.digitaloceanspaces
684 | 18345376 | 6040 | 0.000006 | net.vingle
685 | 18345350 | 2615 | 0.000013 | us.pa.state
686 | 18345040 | 642 | 0.000044 | com.xiti
687 | 18345006 | 2302 | 0.000015 | fr.pagesjaunes
688 | 18344246 | 4604 | 0.000008 | by.tut
689 | 18341982 | 78 | 0.000417 | com.messenger
690 | 18341502 | 1672 | 0.000019 | id.co.google
691 | 18341492 | 4012 | 0.000009 | com.donaldjtrump
692 | 18339724 | 2359 | 0.000014 | co.pcdn
693 | 18338674 | 606 | 0.000046 | com.indeed
694 | 18338446 | 459 | 0.000060 | com.sxsw
695 | 18337870 | 2379 | 0.000014 | sk.google
696 | 18337126 | 246 | 0.000105 | uk.co.amazon
697 | 18336826 | 351 | 0.000074 | com.atlassian
698 | 18336810 | 1225 | 0.000025 | com.dell
699 | 18336442 | 4947 | 0.000007 | fr.online
700 | 18336226 | 1933 | 0.000018 | com.law
701 | 18335648 | 3783 | 0.000009 | com.wmtransfer
702 | 18335422 | 2242 | 0.000015 | kr.co.google
703 | 18335402 | 4709 | 0.000008 | edu.odu
704 | 18335130 | 2971 | 0.000011 | cl.google
705 | 18335024 | 4300 | 0.000008 | il.ac.huji
706 | 18334782 | 4271 | 0.000008 | tw.gov.cdc
707 | 18333794 | 2886 | 0.000012 | my.com.google
708 | 18333014 | 3385 | 0.000011 | com.scotsman
709 | 18332864 | 3322 | 0.000011 | com.instructure
710 | 18332832 | 4563 | 0.000008 | com.hackaday
711 | 18332194 | 2131 | 0.000016 | gov.pa
712 | 18332054 | 627 | 0.000045 | com.withgoogle
713 | 18331108 | 1997 | 0.000017 | scot.gov
714 | 18330912 | 3178 | 0.000011 | com.broadwayworld
715 | 18330804 | 858 | 0.000036 | com.canva
716 | 18330694 | 4525 | 0.000008 | com.mongabay
717 | 18329802 | 4508 | 0.000008 | com.macobserver
718 | 18329686 | 3725 | 0.000010 | org.sonatype
719 | 18328118 | 2391 | 0.000014 | gov.wi
720 | 18327736 | 2683 | 0.000013 | org.usgbc
721 | 18327662 | 4113 | 0.000009 | gov.peacecorps
722 | 18327624 | 4652 | 0.000008 | cn.tianya
723 | 18326710 | 3495 | 0.000010 | pk.com.google
724 | 18326302 | 870 | 0.000036 | com.marketwatch
725 | 18326164 | 1490 | 0.000021 | com.billboard
726 | 18324976 | 107 | 0.000316 | net.gandi
727 | 18324878 | 2845 | 0.000012 | com.thecut
728 | 18324686 | 89 | 0.000372 | me.ogp
729 | 18323980 | 4585 | 0.000008 | io.meduza
730 | 18323898 | 2827 | 0.000012 | uk.org.nationaltrust
731 | 18323758 | 3911 | 0.000009 | au.edu.adelaide
732 | 18323398 | 4766 | 0.000007 | de.uni-erlangen
733 | 18322482 | 3759 | 0.000009 | uk.org.rspb
734 | 18322376 | 3773 | 0.000009 | cv.google
735 | 18321256 | 5135 | 0.000007 | cat.bcn
736 | 18319736 | 3728 | 0.000009 | com.ipage
737 | 18319726 | 5311 | 0.000007 | com.brother
738 | 18318148 | 2410 | 0.000014 | my.com.thestar
739 | 18317872 | 3401 | 0.000010 | uk.ac.york
740 | 18317504 | 3315 | 0.000011 | com.politifact
741 | 18317408 | 3128 | 0.000011 | ee.google
742 | 18317178 | 3326 | 0.000011 | org.thinkprogress
743 | 18317034 | 2102 | 0.000016 | se.haxx
744 | 18316764 | 4554 | 0.000008 | au.edu.rmit
745 | 18316272 | 2959 | 0.000011 | hr.google
746 | 18315296 | 5212 | 0.000007 | com.selfridges
747 | 18315244 | 3772 | 0.000009 | au.com.telstra
748 | 18313746 | 1436 | 0.000022 | com.fiverr
749 | 18313044 | 3420 | 0.000010 | de.hu-berlin
750 | 18311516 | 3572 | 0.000010 | com.nola
751 | 18311094 | 3458 | 0.000010 | sa.com.google
752 | 18310436 | 4145 | 0.000009 | ca.dal
753 | 18310126 | 6237 | 0.000006 | org.arkive
754 | 18309422 | 2759 | 0.000012 | bg.google
755 | 18308696 | 3429 | 0.000010 | com.monday
756 | 18308664 | 4635 | 0.000008 | at.tugraz
757 | 18308432 | 3508 | 0.000010 | com.eiseverywhere
758 | 18308298 | 3764 | 0.000009 | uk.co.cfdr
759 | 18308102 | 3298 | 0.000011 | org.iucn
760 | 18307444 | 3571 | 0.000010 | app.web
761 | 18306932 | 3702 | 0.000010 | org.iucnredlist
762 | 18306908 | 292 | 0.000088 | com.surveymonkey
763 | 18306390 | 3806 | 0.000009 | gi.com.google
764 | 18306038 | 5056 | 0.000007 | ec.com.google
765 | 18305962 | 3875 | 0.000009 | de.uni-freiburg
766 | 18305528 | 4244 | 0.000008 | au.com.heraldsun
767 | 18305222 | 515 | 0.000052 | io.shields
768 | 18304914 | 610 | 0.000046 | org.eff
769 | 18304878 | 3829 | 0.000009 | com.psmag
770 | 18304506 | 4721 | 0.000007 | ua.at
771 | 18302798 | 930 | 0.000034 | gov.uspto
772 | 18302648 | 190 | 0.000137 | com.automattic
773 | 18301286 | 3948 | 0.000009 | com.mozello
774 | 18300612 | 1108 | 0.000028 | com.gizmodo
775 | 18300418 | 3596 | 0.000010 | pl.wp
776 | 18300322 | 3471 | 0.000010 | org.royalsociety
777 | 18299622 | 2819 | 0.000012 | org.unep
778 | 18299452 | 3606 | 0.000010 | com.realclearpolitics
779 | 18298298 | 3531 | 0.000010 | jp.coocan
780 | 18298296 | 2613 | 0.000013 | vn.com.google
781 | 18298218 | 4434 | 0.000008 | jp.hatenablog
782 | 18297896 | 4281 | 0.000008 | com.waitrose
783 | 18297876 | 4676 | 0.000008 | info.webry
784 | 18297852 | 4427 | 0.000008 | net.inquirer
785 | 18297704 | 4274 | 0.000008 | jp.gree
786 | 18297178 | 4611 | 0.000008 | org.nationalinterest
787 | 18296330 | 2981 | 0.000011 | edu.uconn
788 | 18295610 | 946 | 0.000034 | edu.columbia
789 | 18295554 | 5531 | 0.000006 | org.mises
790 | 18295452 | 1274 | 0.000024 | com.smashingmagazine
791 | 18295224 | 3303 | 0.000011 | uk.gov.companieshouse
792 | 18294866 | 4442 | 0.000008 | gov.ourdocuments
793 | 18294666 | 3894 | 0.000009 | sl.com.google
794 | 18292912 | 6218 | 0.000006 | com.rhino3d
795 | 18292842 | 3435 | 0.000010 | org.cfr
796 | 18292780 | 790 | 0.000040 | com.airbnb
797 | 18292712 | 283 | 0.000092 | jp.co.amazon
798 | 18291570 | 413 | 0.000065 | com.pubmatic
799 | 18290920 | 878 | 0.000036 | com.box
800 | 18290426 | 5610 | 0.000006 | com.coroflot
801 | 18290346 | 4348 | 0.000008 | com.thediplomat
802 | 18286902 | 4066 | 0.000009 | com.inhabitat
803 | 18286668 | 3277 | 0.000011 | com.bp
804 | 18286522 | 4592 | 0.000008 | cat.uab
805 | 18283480 | 3827 | 0.000009 | uk.co.villiers-london
806 | 18283014 | 4140 | 0.000009 | org.grist
807 | 18282452 | 4016 | 0.000009 | com.foreignaffairs
808 | 18281324 | 1081 | 0.000028 | com.tapad
809 | 18280378 | 1347 | 0.000023 | org.altervista
810 | 18280358 | 382 | 0.000069 | com.skype
811 | 18280324 | 4349 | 0.000008 | com.worldsecuresystems
812 | 18279680 | 2409 | 0.000014 | com.volusion
813 | 18279516 | 2907 | 0.000012 | ru.nethouse
814 | 18279480 | 3527 | 0.000010 | pe.com.google
815 | 18279438 | 4779 | 0.000007 | be.lesoir
816 | 18278874 | 3288 | 0.000011 | co.com.google
817 | 18278816 | 3885 | 0.000009 | de.uni-koeln
818 | 18278778 | 2910 | 0.000012 | org.gnupg
819 | 18278022 | 4656 | 0.000008 | com.mihanblog
820 | 18277554 | 3360 | 0.000011 | org.panda
821 | 18277186 | 3440 | 0.000010 | lv.google
822 | 18276674 | 5300 | 0.000007 | lu.google
823 | 18276442 | 484 | 0.000055 | com.inc
824 | 18275676 | 5103 | 0.000007 | cn.com.caijing
825 | 18275134 | 3331 | 0.000011 | uk.gov.metoffice
826 | 18274258 | 68 | 0.000471 | com.oculus
827 | 18273732 | 2364 | 0.000014 | org.donorbox
828 | 18273312 | 3038 | 0.000011 | rs.google
829 | 18273256 | 1197 | 0.000026 | com.merriam-webster
830 | 18271448 | 5051 | 0.000007 | ee.ut
831 | 18271060 | 2519 | 0.000013 | com.amebaownd
832 | 18270922 | 4482 | 0.000008 | com.marksandspencer
833 | 18270780 | 6447 | 0.000006 | su.clan
834 | 18269948 | 4096 | 0.000009 | ru.interfax
835 | 18269620 | 3852 | 0.000009 | org.rferl
836 | 18268756 | 2904 | 0.000012 | gov.nd
837 | 18267994 | 548 | 0.000049 | com.fortune
838 | 18267776 | 4693 | 0.000008 | it.unitn
839 | 18267714 | 5665 | 0.000006 | am.google
840 | 18266762 | 3502 | 0.000010 | org.iaea
841 | 18263748 | 3893 | 0.000009 | pr.com.google
842 | 18262158 | 5045 | 0.000007 | com.tok2
843 | 18261938 | 1901 | 0.000018 | ch.ethz
844 | 18261922 | 3342 | 0.000011 | gov.la
845 | 18261182 | 4507 | 0.000008 | org.democracynow
846 | 18261176 | 2593 | 0.000013 | net.noscript
847 | 18260216 | 836 | 0.000037 | com.mix
848 | 18259862 | 408 | 0.000066 | net.adform
849 | 18259608 | 5208 | 0.000007 | tn.google
850 | 18257978 | 4212 | 0.000008 | jp.hateblo
851 | 18257888 | 6029 | 0.000006 | hk.edu.hkbu
852 | 18257680 | 3884 | 0.000009 | nl.wur
853 | 18257594 | 5009 | 0.000007 | gr.auth
854 | 18257406 | 997 | 0.000031 | com.webs
855 | 18256760 | 4512 | 0.000008 | com.mnn
856 | 18256702 | 5759 | 0.000006 | ru.nnov
857 | 18256238 | 3954 | 0.000009 | com.afp
858 | 18255744 | 1365 | 0.000023 | com.format
859 | 18255662 | 5209 | 0.000007 | nf.co
860 | 18253954 | 329 | 0.000079 | com.getbootstrap
861 | 18252988 | 4961 | 0.000007 | jp.hatenadiary
862 | 18252154 | 4728 | 0.000007 | hk.com.hkex
863 | 18251258 | 1193 | 0.000026 | com.redhat
864 | 18250974 | 5600 | 0.000006 | com.gust
865 | 18250088 | 1067 | 0.000029 | com.symantec
866 | 18249466 | 2562 | 0.000013 | net.ucoz
867 | 18249320 | 268 | 0.000095 | com.typeform
868 | 18248694 | 6327 | 0.000006 | com.x10host
869 | 18248332 | 3547 | 0.000010 | uk.co.saveourschools
870 | 18247898 | 2934 | 0.000012 | com.squarespace-cdn
8711824729229770.000011lt.google
872182468725250.000051com.adweek
8731824684442950.000008com.scienceblogs
8741824647248480.000007de.uni-konstanz
8751824556263620.000006com.ueuo
8761824504838560.000009uk.gov.data
8771824475640050.000009tr.com.hurriyet
8781824365230700.000011ae.google
8791824357018910.000019com.speakerdeck
8801824333050790.000007com.blogsky
8811824313420440.000017tv.ustream
8821824037467110.000006su.moy
883182392987610.000041gov.copyright
8841823909652920.000007ru.novayagazeta
8851823904427890.000012gov.nh
8861823899040570.000009org.hathitrust
8871823894836480.000010org.annualreviews
8881823893211540.000027pl.home
8891823888238150.000009com.businesscatalyst
890182377404720.000058com.ea
8911823772630870.000011uk.gov.hmrc
8921823694039300.000009cc.uxdesign
8931823689460150.000006com.artfire
894182367043660.000072org.opensource
8951823653034670.000010it.beniculturali
8961823613225070.000014gov.mn
8971823607610190.000030com.engadget
8981823590236820.000010ve.co.google
8991823545249730.000007com.teslamotors
9001823403874750.000005com.hangame
901182339664270.000063com.fastcompany
9021823360042630.000008com.hsbc
9031823307424620.000014com.netsolhost
9041823258255560.000006me.google
9051823234456430.000006mu.google
9061823157055290.000006com.yam
9071823124239690.000009tz.co.google
908182309989740.000032com.verisign
9091823091633640.000011tw.com.pchome
9101823066272930.000005com.addr
9111823062826360.000013com.shell
9121823060265990.000006com.dropmark
9131822970856350.000006li.google
9141822911650020.000007com.gab
9151822910644930.000008com.tapatalk
9161822819413250.000023edu.ucla
9171822795835570.000010uk.co.newmedianow
9181822793849880.000007edu.whoi
9191822781037380.000009ng.com.google
9201822763054440.000007ni.com.google
9211822607641100.000009uk.co.sainsburys
9221822545844120.000008com.iconarchive
9231822508053800.000007gr.ntua
9241822492461520.000006com.epochtimes
9251822471651980.000007org.birdlife
9261822461035320.000010uk.co.intersol
9271822417856150.000006id.co.kaskus
928182237629500.000034com.zoho
9291822316654030.000007cr.co.google
9301822304656950.000006sv.com.google
9311822288240740.000009vn.zing
9321822271445370.000008uk.co.zoopla
9331822248040390.000009uk.ac.jisc
9341822103438360.000009com.prweek
9351822042230980.000011int.wmo
9361822041054660.000006mz.co.google
9371822020249660.000007edu.umb
9381822019612900.000024uk.co.freeukbusinessdirectory
9391822006814760.000021org.owasp
9401821972666690.000006net.comunidades
9411821897641410.000009com.scotusblog
9421821884056360.000006com.cyberlink
9431821873838280.000009do.com.google
9441821867229660.000011io.termly
9451821826247350.000007com.fatcow
9461821817238510.000009mt.com.google
9471821811035890.000010uk.org.oxonaa
9481821795837740.000009gt.com.google
9491821690837370.000009com.solidworks
9501821678236410.000010uk.co.profilebusiness
9511821627036250.000010uk.co.heatall
9521821603445060.000008com.theringer
9531821538825580.000013nl.jouwweb
954182153208000.000039com.wikihow
9551821506059530.000006com.symbaloo
9561821476851710.000007pl.cba
9571821416257400.000006kg.google
9581821359423210.000015com.freeprivacypolicy
9591821285012220.000026com.att
9601821268052030.000007pl.lublin
9611821267215410.000020edu.umd
9621821217454850.000006uk.org.labour
9631821207442880.000008us.ms.state
9641821182834490.000010com.wantedly
9651821157043960.000008org.ametsoc
9661821154237010.000010uy.com.google
9671821148655530.000006jp.ifdef
9681821143852180.000007es.usal
969182113987690.000041com.netflix
9701821119663290.000006org.cgsociety
9711821085438970.000009hn.google
9721821054456020.000006org.svoboda
9731820782844320.000008org.ascd
9741820778445000.000008uk.co.dailystar
9751820771236510.000010uk.co.articlelistings
976182073705030.000054com.dmca
977182071149160.000035com.ggpht
9781820703251990.000007com.curseforge
9791820643252650.000007org.nsidc
9801820634015200.000021com.technologyreview
9811820590856680.000006ug.co.google
9821820582240300.000009org.lacity
9831820534848430.000007com.cbn
984182047164340.000063com.businesswire
9851820471258600.000006mn.google
9861820439468680.000005kr.ac.postech
9871820433256130.000006it.unige
9881820352633140.000011uk.gov.food
9891820331463530.000006com.skepticalscience
990182030529090.000035org.weforum
9911820243449070.000007com.globalpost
9921820241651720.000007com.weightwatchers
9931820200034030.000010com.lexology
9941820073859440.000006tt.google
9951820021052820.000007com.betfair
9961819996854280.000007py.com.google
9971819892848150.000007com.abcnews
998181986987630.000041com.psychologytoday
9991819851269740.000005org.toile-libre
10001819841432910.000011net.vnexpress

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

January 2021 crawl archive now available

The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content. It includes page captures of 1.15 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-04/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2021-04/segment.paths.gz | 100 |
WARC files | CC-MAIN-2021-04/warc.paths.gz | 79840 | 78.98
WAT files | CC-MAIN-2021-04/wat.paths.gz | 79840 | 22.92
WET files | CC-MAIN-2021-04/wet.paths.gz | 79840 | 10.04
Robots.txt files | CC-MAIN-2021-04/robotstxt.paths.gz | 79840 | 0.23
Non-200 responses files | CC-MAIN-2021-04/non200responses.paths.gz | 79840 | 2.11
URL index files | CC-MAIN-2021-04/cc-index.paths.gz | 302 | 0.26
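
Turning a path listing into download URLs can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function name expand_paths and the two sample paths are hypothetical, and the listing is built in memory instead of being downloaded.

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(listing_gz: bytes, prefix: str = HTTP_PREFIX) -> list[str]:
    """Decompress a *.paths.gz listing and prefix every non-empty line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical two-line listing standing in for a downloaded warc.paths.gz.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2021-04/segments/EXAMPLE/warc/a.warc.gz\n"
    b"crawl-data/CC-MAIN-2021-04/segments/EXAMPLE/warc/b.warc.gz\n"
)
urls = expand_paths(sample)                 # HTTP URLs
s3_urls = expand_paths(sample, S3_PREFIX)   # S3 URLs
```

In practice the bytes passed to expand_paths would come from one of the listing files named in the table, for example the WARC listing.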

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-04/. The columnar index has also been updated to include this crawl.
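
As a rough sketch of how the URL index might be queried programmatically, the snippet below only builds a query URL and makes no network request. It assumes the index server exposes a per-crawl endpoint named `<crawl>-index` that accepts `url` and `output` parameters; the helper name index_query_url is hypothetical.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"


def index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a query URL for one crawl's index (endpoint layout is an assumption)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"


# Look up captures under commoncrawl.org in the January 2021 crawl.
query_url = index_query_url("CC-MAIN-2021-04", "commoncrawl.org/*")
```

The resulting URL can be fetched with any HTTP client; with JSON output, each line of the response is expected to describe one capture.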

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November/December 2020 crawl archive now available

The crawl archive for November/December 2020 is now available! The data was crawled between November 23 and December 6 and contains 2.64 billion web pages or 270 TiB of uncompressed content. It includes page captures of 1.4 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The November/December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-50/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-50/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-50/warc.paths.gz | 72000 | 59.95
WAT files | CC-MAIN-2020-50/wat.paths.gz | 72000 | 17.82
WET files | CC-MAIN-2020-50/wet.paths.gz | 72000 | 7.89
Robots.txt files | CC-MAIN-2020-50/robotstxt.paths.gz | 72000 | 0.2
Non-200 responses files | CC-MAIN-2020-50/non200responses.paths.gz | 72000 | 1.71
URL index files | CC-MAIN-2020-50/cc-index.paths.gz | 302 | 0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-50/. The columnar index has also been updated to include this crawl.

October 2020 crawl archive now available

The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-45/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefix each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the S3 or HTTP path, respectively.

File type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-45/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-45/warc.paths.gz | 72000 | 63.79
WAT files | CC-MAIN-2020-45/wat.paths.gz | 72000 | 18.39
WET files | CC-MAIN-2020-45/wet.paths.gz | 72000 | 8.23
Robots.txt files | CC-MAIN-2020-45/robotstxt.paths.gz | 72000 | 0.2
Non-200 responses files | CC-MAIN-2020-45/non200responses.paths.gz | 72000 | 1.75
URL index files | CC-MAIN-2020-45/cc-index.paths.gz | 302 | 0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-45/. The columnar index has also been updated to include this crawl.

Interactive Webgraph Statistics Notebook Released

We are pleased to announce the release of an interactive Jupyter notebook that is used to provide:

  • Visualization of web graph statistics
  • An interface for interacting with the webgraph

The visualization of the web graph statistics leverages the WebGraph framework, which provides means of gathering many interesting data points of a web graph, such as the frequency distribution of indegrees/outdegrees in the graph, or the size distributions of the connected components. We then use pandas and matplotlib to visualize the data provided by WebGraph. This effort was largely inspired by the Topology of the 2012 WDC Hyperlink Graph document. Further details on installing and using the WebGraph tools, and on the data visualization, can be found in the cc-notebooks repository.
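
To make the notion of an indegree frequency distribution concrete, here is a minimal sketch in plain Python. It does not use WebGraph, pandas, or matplotlib; the function name and the toy edge list are illustrative assumptions, and the notebook itself obtains these numbers from WebGraph.

```python
from collections import Counter


def indegree_distribution(edges, num_nodes):
    """Return {indegree: number of nodes with that indegree}."""
    indegree = Counter(to_id for _, to_id in edges)
    # Nodes that never appear as a link target have indegree 0.
    return Counter(indegree.get(node, 0) for node in range(num_nodes))


# Hypothetical toy graph: 4 nodes, edges given as (from_id, to_id).
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 2)]
dist = indegree_distribution(toy_edges, num_nodes=4)
```

On real graphs such a distribution is usually plotted on log-log axes, since a few nodes collect a very large share of the inlinks.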

The interface for interacting with the webgraph is built on pyWebGraph, a front end that connects Jython with WebGraph. Before using this interface, we must rebuild the string maps, which map each node ID (a numerical value) to its domain name and vice versa. Once the maps are in place, you can load the graph into pyWebGraph and traverse it interactively.
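
The idea behind the string maps can be illustrated with two plain Python dictionaries. This is only a sketch: it assumes the vertices file is tab-separated with the node ID in the first column and the reversed name in the second, and it is not how WebGraph or pyWebGraph store their maps internally.

```python
def build_string_maps(vertex_lines):
    """Build id -> name and name -> id lookups from tab-separated vertex lines."""
    id_to_name, name_to_id = {}, {}
    for line in vertex_lines:
        fields = line.rstrip("\n").split("\t")
        node_id, name = int(fields[0]), fields[1]
        id_to_name[node_id] = name
        name_to_id[name] = node_id
    return id_to_name, name_to_id


# Hypothetical vertices excerpt; any further columns are ignored.
sample_lines = ["0\tcom.example\t3\n", "1\torg.example\t1\n"]
id_to_name, name_to_id = build_string_maps(sample_lines)
```

Dictionaries are fine for a small excerpt; for graphs with tens of millions of nodes, more compact structures are needed to keep memory use manageable.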

Further details on installing and using pyWebGraph, and on how to rebuild the string maps, can be found in the interactive webgraph README of the cc-notebooks repository.

The Jupyter notebook is available on GitHub in the same repository. More details about how to navigate the repository can be found in the notebook itself, as well as in the README.

We hope that users will be able to use these notebooks to gain more insight into the web graph in a numerical and practical sense.

We are grateful to the WebGraph project for providing extremely useful tools for processing the web graph itself, and to Massimo Santini for developing pyWebGraph.

Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 539 million nodes and 3.02 billion edges and includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. There are 467 million dangling nodes (86.7%), and the largest strongly connected component contains 46 million (8.5%) nodes.

You can download the graph and the ranks of all 539 million hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/ as a prefix to access the files from anywhere.

Size | File | Description
3.32 GB | cc-main-2020-jul-aug-sep-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files
13.7 GB | cc-main-2020-jul-aug-sep-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files
5.95 GB | cc-main-2020-jul-aug-sep-host.graph | graph in BVGraph format
2 kB | cc-main-2020-jul-aug-sep-host.properties |
6.76 GB | cc-main-2020-jul-aug-sep-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2020-jul-aug-sep-host-t.properties |
1 kB | cc-main-2020-jul-aug-sep-host.stats | WebGraph statistics
7.77 GB | cc-main-2020-jul-aug-sep-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
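
The reversal rule can be written as a short Python function. This is a minimal sketch with hypothetical function names; it covers only what is stated above (strip one leading www., reverse the dot-separated labels) and does no other normalization.

```python
def reverse_host(host: str) -> str:
    """Strip one leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Turn reversed notation back into the usual order (no 'www.' is restored)."""
    return ".".join(reversed(rev_host.split(".")))


rev = reverse_host("www.subdomain.example.com")  # com.example.subdomain
```

Because the leading www. is dropped, the mapping is not fully invertible: unreverse_host returns subdomain.example.com for the example above.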

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
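
The aggregation step can be sketched as follows. This is an illustration under loud assumptions, not the cc-webgraph implementation: TOY_SUFFIXES is a three-entry stand-in for the real public suffix list, host names are in normal (not reversed) order, and links within the same domain are dropped.

```python
# Tiny stand-in for the public suffix list; the real list is far larger.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host: str, suffixes=TOY_SUFFIXES) -> str:
    """Return the longest matching public suffix plus one more label."""
    labels = host.split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # labels[i:] is the suffix; keep one label to its left if present.
            return ".".join(labels[max(i - 1, 0):])
    return host  # no known suffix: leave the host unchanged


def aggregate_edges(host_edges, suffixes=TOY_SUFFIXES):
    """Collapse host-to-host links into distinct domain-to-domain links."""
    domain_edges = set()
    for src, dst in host_edges:
        a, b = pay_level_domain(src, suffixes), pay_level_domain(dst, suffixes)
        if a != b:  # assumption: links within one domain are dropped
            domain_edges.add((a, b))
    return sorted(domain_edges)


host_edges = [
    ("blog.example.com", "news.bbc.co.uk"),
    ("shop.example.com", "www.bbc.co.uk"),
    ("blog.example.com", "shop.example.com"),
]
domain_edges = aggregate_edges(host_edges)
```

Here three host-level links collapse into a single domain-level link from example.com to bbc.co.uk, which is why the domain graph has far fewer nodes and edges than the host graph.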

The domain-level graph has 89 million nodes and 1.71 billion edges. 45 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 35 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/.

Download files of the Common Crawl Jul/Aug/Sep 2020 domain-level webgraph

Size | File | Description
0.61 GB | cc-main-2020-jul-aug-sep-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.80 GB | cc-main-2020-jul-aug-sep-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.75 GB | cc-main-2020-jul-aug-sep-domain.graph | graph in BVGraph format
2 kB | cc-main-2020-jul-aug-sep-domain.properties |
3.69 GB | cc-main-2020-jul-aug-sep-domain-t.graph | transpose of the graph
2 kB | cc-main-2020-jul-aug-sep-domain-t.properties |
1 kB | cc-main-2020-jul-aug-sep-domain.stats | WebGraph statistics
1.91 GB | cc-main-2020-jul-aug-sep-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 89 million domain ranks is available for download.
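
For working with the ranks programmatically, a small parser along the following lines may help. It is a sketch that assumes whitespace-separated lines whose first five columns follow the order of the table in this post (harmonic centrality rank and value, PageRank rank and value, reversed name); the actual layout of the ranks file should be checked before relying on it. The names RankRow, parse_rank_line and top_by_pagerank are hypothetical.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str


def parse_rank_line(line: str) -> RankRow:
    """Parse one whitespace-separated ranks line (column order is an assumption)."""
    hc_rank, hc_value, pr_rank, pr_value, rev_name = line.split()[:5]
    return RankRow(int(hc_rank), float(hc_value), int(pr_rank), float(pr_value), rev_name)


def top_by_pagerank(rows, n):
    """Return the n rows with the best (smallest) PageRank position."""
    return sorted(rows, key=lambda r: r.pr_rank)[:n]


# Values taken from the top three entries of the ranking table in this post.
sample = [
    "1\t32027928\t1\t0.018888\tcom.googleapis",
    "2\t30312944\t3\t0.012001\tcom.facebook",
    "3\t29025948\t2\t0.013237\tcom.google",
]
rows = [parse_rank_line(line) for line in sample]
best = top_by_pagerank(rows, 2)
```

Sorting by pr_rank instead of hc_rank shows how the two rankings can disagree: com.google is third by harmonic centrality but second by PageRank.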

Top 1000 domains ranked by harmonic centrality (Jul/Aug/Sep 2020)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 32027928 | 1 | 0.018888 | com.googleapis
2 | 30312944 | 3 | 0.012001 | com.facebook
3 | 29025948 | 2 | 0.013237 | com.google
4 | 26560472 | 4 | 0.007343 | org.w
5 | 26516534 | 5 | 0.007172 | com.twitter
6 | 26016464 | 6 | 0.006600 | com.youtube
7 | 24614190 | 9 | 0.004795 | com.instagram
8 | 24220712 | 8 | 0.005190 | org.gmpg
9 | 23572970 | 7 | 0.005599 | com.googletagmanager
10 | 23188190 | 11 | 0.003202 | com.linkedin
11 | 22457894 | 15 | 0.002590 | com.gravatar
12 | 22451350 | 10 | 0.003967 | com.cloudflare
13 | 22364152 | 14 | 0.002726 | com.gstatic
14 | 22350042 | 12 | 0.003105 | org.wordpress
15 | 21926906 | 22 | 0.001505 | com.pinterest
16 | 21699168 | 21 | 0.001752 | com.wordpress
17 | 21599006 | 26 | 0.001181 | org.wikipedia
18 | 21538264 | 16 | 0.002431 | com.bootstrapcdn
19 | 21497526 | 18 | 0.001836 | com.apple
20 | 21314410 | 30 | 0.001106 | com.vimeo
21 | 21248994 | 41 | 0.000830 | be.youtu
22 | 21186566 | 20 | 0.001794 | com.jquery
23 | 21081822 | 23 | 0.001444 | com.microsoft
24 | 21073240 | 45 | 0.000773 | com.blogspot
25 | 20994964 | 39 | 0.000952 | com.amazonaws
26 | 20975988 | 46 | 0.000732 | gl.goo
27 | 20971574 | 25 | 0.001384 | com.wp
28 | 20921220 | 47 | 0.000723 | com.amazon
29 | 20788608 | 72 | 0.000439 | com.tumblr
30 | 20716256 | 19 | 0.001804 | com.adobe
31 | 20694562 | 67 | 0.000535 | ly.bit
32 | 20675418 | 34 | 0.001018 | com.google-analytics
33 | 20627694 | 53 | 0.000673 | org.mozilla
34 | 20618998 | 17 | 0.001975 | com.github
35 | 20617620 | 31 | 0.001059 | net.cloudfront
36 | 20579928 | 71 | 0.000449 | com.yahoo
37 | 20571130 | 29 | 0.001127 | com.googlesyndication
38 | 20570586 | 60 | 0.000612 | eu.europa
39 | 20562028 | 52 | 0.000679 | com.flickr
40 | 20560188 | 42 | 0.000818 | net.jsdelivr
41 | 20526264 | 97 | 0.000347 | com.googleusercontent
42 | 20481758 | 62 | 0.000606 | co.t
43 | 20480218 | 109 | 0.000313 | com.reddit
44 | 20451670 | 24 | 0.001419 | com.fontawesome
45 | 20436180 | 83 | 0.000389 | com.weebly
46 | 20387228 | 56 | 0.000628 | com.paypal
47 | 20375802 | 40 | 0.000910 | com.macromedia
48 | 20372972 | 70 | 0.000450 | com.medium
49 | 20370180 | 43 | 0.000808 | com.addthis
50 | 20360678 | 28 | 0.001156 | ru.yandex
51 | 20338498 | 27 | 0.001156 | me.wp
52 | 20331252 | 64 | 0.000559 | org.w3
53 | 20326560 | 79 | 0.000411 | io.github
54 | 20292836 | 138 | 0.000223 | com.nytimes
55 | 20275824 | 76 | 0.000414 | org.creativecommons
56 | 20274244 | 59 | 0.000615 | org.schema
57 | 20255326 | 150 | 0.000192 | com.forbes
58 | 20246068 | 173 | 0.000151 | com.imgur
59 | 20227930 | 36 | 0.000979 | net.doubleclick
60 | 20219612 | 194 | 0.000133 | uk.co.bbc
61 | 20210924 | 114 | 0.000285 | com.soundcloud
62 | 20171070 | 66 | 0.000548 | com.vk
63 | 20155222 | 195 | 0.000133 | com.cnn
64 | 20142696 | 44 | 0.000803 | org.apache
65 | 20134806 | 63 | 0.000587 | com.whatsapp
66 | 20129582 | 314 | 0.000082 | edu.mit
67 | 20123032 | 180 | 0.000146 | com.imdb
68 | 20118310 | 208 | 0.000124 | net.slideshare
69 | 20116626 | 243 | 0.000101 | com.wsj
70 | 20115768 | 197 | 0.000128 | org.wikimedia
71 | 20089462 | 85 | 0.000388 | com.shopify
72 | 20082204 | 215 | 0.000120 | edu.stanford
73 | 20076684 | 154 | 0.000181 | gov.cdc
74 | 20075632 | 328 | 0.000079 | com.wired
75 | 20069724 | 268 | 0.000094 | com.techcrunch
76 | 20057066 | 255 | 0.000096 | edu.harvard
77 | 20051336 | 353 | 0.000076 | com.appspot
78 | 20051292 | 207 | 0.000124 | net.sourceforge
79 | 20051264 | 257 | 0.000096 | com.oracle
80 | 20051250 | 155 | 0.000177 | int.who
81 | 20050888 | 206 | 0.000124 | com.businessinsider
82 | 20046050 | 137 | 0.000227 | org.archive
83 | 20038198 | 230 | 0.000113 | com.washingtonpost
84 | 20035810 | 250 | 0.000097 | com.live
85 | 20029940 | 164 | 0.000163 | com.bing
86 | 20028210 | 549 | 0.000054 | com.livejournal
87 | 20027622 | 424 | 0.000069 | com.go
88 | 20024666 | 456 | 0.000066 | com.msn
89 | 20019992 | 407 | 0.000072 | uk.co.telegraph
90 | 20009306 | 170 | 0.000154 | com.theguardian
91 | 20002514 | 527 | 0.000056 | edu.cornell
92 | 19997146 | 199 | 0.000128 | org.ietf
93 | 19996714 | 486 | 0.000063 | gov.nasa
94 | 19995476 | 259 | 0.000096 | com.android
95 | 19986252 | 302 | 0.000084 | com.reuters
96 | 19983946 | 51 | 0.000702 | net.fbcdn
97 | 19974890 | 240 | 0.000102 | com.bloomberg
98 | 19966464 | 162 | 0.000164 | com.giphy
99 | 19960428 | 77 | 0.000414 | com.list-manage
100 | 19959046 | 520 | 0.000057 | com.googleblog
101 | 19956558 | 269 | 0.000093 | com.bbc
102 | 19955204 | 409 | 0.000071 | com.slack
103 | 19942056 | 143 | 0.000205 | com.spotify
104 | 19938828 | 591 | 0.000049 | com.zdnet
105 | 19936894 | 48 | 0.000721 | net.facebook
106 | 19935010 | 586 | 0.000050 | com.quora
107 | 19931072 | 126 | 0.000265 | com.ytimg
108 | 19922774 | 444 | 0.000067 | com.myspace
109 | 19922046 | 757 | 0.000038 | edu.umich
110 | 19920178 | 715 | 0.000040 | edu.upenn
111 | 19917482 | 151 | 0.000185 | gov.nih
112 | 19907886 | 344 | 0.000077 | com.usatoday
113 | 19903896 | 654 | 0.000045 | com.economist
114 | 19903722 | 313 | 0.000082 | com.cnbc
115 | 19902700 | 308 | 0.000083 | com.example
116 | 19896552 | 525 | 0.000056 | com.pixabay
117 | 19895014 | 418 | 0.000070 | net.researchgate
118 | 19882790 | 449 | 0.000066 | com.latimes
119 | 19881164 | 188 | 0.000138 | com.blogger
120 | 19870046 | 387 | 0.000075 | org.python
121 | 19864804 | 65 | 0.000555 | com.wix
122 | 19860760 | 433 | 0.000068 | com.githubusercontent
123 | 19858732 | 693 | 0.000042 | org.ieee
124 | 19854254 | 499 | 0.000061 | com.mashable
125 | 19850918 | 571 | 0.000052 | edu.berkeley
126 | 19847554 | 135 | 0.000241 | com.youtube-nocookie
127 | 19845130 | 160 | 0.000167 | com.issuu
128 | 19843068 | 218 | 0.000118 | org.acm
129 | 19839736 | 834 | 0.000036 | org.chromium
130 | 19839550 | 235 | 0.000106 | uk.co.google
131 | 19835790 | 551 | 0.000054 | org.arxiv
132 | 19833020 | 246 | 0.000099 | net.behance
133 | 19832682 | 291 | 0.000086 | org.npr
134 | 19831994 | 108 | 0.000320 | com.unpkg
135 | 19831136 | 884 | 0.000034 | com.arstechnica
136 | 19826840 | 213 | 0.000121 | com.unsplash
137 | 19822884 | 341 | 0.000078 | com.outlook
138 | 19822670 | 110 | 0.000303 | de.google
139 | 19812430 | 54 | 0.000654 | com.googleadservices
140 | 19810872 | 347 | 0.000077 | com.prnewswire
141 | 19806458 | 678 | 0.000043 | edu.columbia
142 | 19805382 | 171 | 0.000153 | me.t
143 | 19804886 | 297 | 0.000085 | com.dribbble
144 | 19804142 | 256 | 0.000096 | com.squarespace
145 | 19799032 | 139 | 0.000215 | gov.privacyshield
146 | 19798806 | 306 | 0.000083 | com.huffingtonpost
147 | 19797964 | 260 | 0.000096 | com.bandcamp
148 | 19795112 | 398 | 0.000074 | com.time
149 | 19793874 | 37 | 0.000975 | com.baidu
150 | 19792082 | 616 | 0.000048 | com.gitlab
151 | 19790406 | 334 | 0.000079 | com.nationalgeographic
152 | 19788214 | 443 | 0.000067 | com.nature
153 | 19785178 | 794 | 0.000037 | com.stackexchange
154 | 19782114 | 179 | 0.000147 | gle.forms
155 | 19781676 | 258 | 0.000096 | org.ampproject
156 | 19778534 | 548 | 0.000054 | com.fortune
157 | 19777902 | 813 | 0.000036 | com.git-scm
158 | 19776608 | 33 | 0.001030 | com.wixstatic
159 | 19774030 | 771 | 0.000038 | com.qz
160 | 19772390 | 281 | 0.000089 | com.wiley
161 | 19772268 | 646 | 0.000046 | au.net.abc
162 | 19770930 | 638 | 0.000046 | edu.yale
163 | 19769582 | 428 | 0.000068 | com.meetup
164 | 19767876 | 468 | 0.000064 | com.ted
165 | 19761386 | 1160 | 0.000026 | com.hatenablog
166 | 19759052 | 448 | 0.000066 | com.patreon
167 | 19757472 | 283 | 0.000089 | com.disqus
168 | 19756748 | 936 | 0.000032 | edu.ucla
169 | 19753998 | 147 | 0.000195 | com.dropbox
170 | 19753380 | 168 | 0.000158 | com.yelp
171 | 19750678 | 271 | 0.000093 | org.un
172 | 19746384 | 212 | 0.000122 | com.twimg
173 | 19743118 | 254 | 0.000096 | org.drupal
174 | 19741474 | 689 | 0.000042 | org.bitbucket
175 | 19736540 | 422 | 0.000069 | com.statista
176 | 19735440 | 903 | 0.000033 | uk.ac.cam
177 | 19731940 | 718 | 0.000040 | com.evernote
178 | 19731916 | 682 | 0.000043 | com.newyorker
179 | 19725638 | 603 | 0.000049 | com.buzzfeed
180 | 19719544 | 606 | 0.000049 | me.about
181 | 19718654 | 722 | 0.000040 | com.mysql
182 | 19716804 | 850 | 0.000035 | com.thenextweb
183 | 19715420 | 495 | 0.000061 | com.theatlantic
184 | 19710920 | 279 | 0.000091 | com.sciencedirect
185 | 19710826 | 403 | 0.000073 | com.getpocket
186 | 19705326 | 669 | 0.000043 | uk.co.blogspot
187 | 19702126 | 1293 | 0.000023 | com.tinypic
188 | 19696730 | 450 | 0.000066 | com.booking
189 | 19695652 | 514 | 0.000058 | com.xinhuanet
190 | 19694904 | 743 | 0.000039 | org.weforum
191 | 19694268 | 247 | 0.000098 | gov.ca
192 | 19692322 | 602 | 0.000049 | gov.loc
193 | 19690998 | 1282 | 0.000023 | org.postgresql
194 | 19689908 | 828 | 0.000036 | edu.princeton
195 | 19687954 | 239 | 0.000103 | uk.co.amazon
196 | 19685942 | 480 | 0.000063 | com.dailymotion
197 | 19679672 | 1452 | 0.000021 | ru.narod
198 | 19678926 | 189 | 0.000138 | com.xing
199 | 19675914 | 879 | 0.000034 | edu.jhu
200 | 19673670 | 500 | 0.000060 | gov.whitehouse
201 | 19671846 | 665 | 0.000044 | org.worldbank
202 | 19668706 | 1365 | 0.000022 | org.eclipse
203 | 19667770 | 400 | 0.000073 | com.springer
204 | 19667684 | 445 | 0.000067 | com.nypost
205 | 19665872 | 316 | 0.000081 | com.ft
206 | 19660930 | 61 | 0.000606 | com.fb
207 | 19658986 | 204 | 0.000125 | com.feedburner
208 | 19658394 | 826 | 0.000036 | org.cambridge
209 | 19654762 | 476 | 0.000063 | uk.co.dailymail
210 | 19654386 | 766 | 0.000038 | edu.washington
211 | 19654242 | 496 | 0.000061 | org.eff
212 | 19653044 | 32 | 0.001054 | com.qq
213 | 19650144 | 473 | 0.000064 | com.goodreads
214 | 19649524 | 264 | 0.000095 | org.doi
215 | 19649502 | 512 | 0.000058 | com.w3schools
216 | 19641242 | 1311 | 0.000023 | edu.virginia
217 | 19641212 | 440 | 0.000067 | com.googlecode
218 | 19638348 | 633 | 0.000047 | com.vice
219 | 19633128 | 506 | 0.000059 | com.force
220 | 19632976 | 723 | 0.000040 | com.trello
221 | 19632780 | 836 | 0.000035 | com.about
222 | 19630562 | 523 | 0.000056 | com.inc
223 | 19629482 | 453 | 0.000066 | com.scribd
224 | 19629368 | 2053 | 0.000016 | com.wikidot
225 | 19628436 | 619 | 0.000048 | org.semver
226 | 19614496 | 607 | 0.000049 | com.cbsnews
227 | 19607794 | 651 | 0.000045 | com.withgoogle
228 | 19605512 | 146 | 0.000196 | me.line
229 | 19603410 | 2089 | 0.000016 | com.googlesource
230 | 19601476 | 219 | 0.000118 | org.iana
231 | 19601452 | 546 | 0.000054 | gov.usda
232 | 19599800 | 309 | 0.000083 | com.tinyurl
233 | 19598290 | 1090 | 0.000027 | com.techradar
234 | 19597674 | 858 | 0.000035 | com.dropboxusercontent
235 | 19597446 | 384 | 0.000076 | com.ibm
236 | 19595200 | 1284 | 0.000023 | co.elastic
237 | 19594024 | 289 | 0.000087 | com.squareup
238 | 19593336 | 1434 | 0.000021 | org.linuxfoundation
239 | 19592388 | 1134 | 0.000026 | org.coursera
240 | 19589830 | 1027 | 0.000029 | gov.fbi
241 | 19588284 | 1158 | 0.000026 | edu.unc
242 | 19586008 | 705 | 0.000041 | com.vox
243 | 19583350 | 193 | 0.000134 | de.amazon
244 | 19583096 | 550 | 0.000054 | uk.co.independent
245 | 19580554 | 1423 | 0.000021 | ms.1drv
246 | 19578950 | 383 | 0.000076 | com.digg
247 | 19567612 | 1393 | 0.000022 | org.kernel
248 | 19563948 | 113 | 0.000287 | com.sharethis
249 | 19563468 | 751 | 0.000039 | org.d3js
250 | 19557490 | 801 | 0.000037 | gov.fcc
251 | 19557292 | 1026 | 0.000029 | com.hollywoodreporter
252 | 19556258 | 1369 | 0.000022 | com.howstuffworks
253 | 19553700 | 430 | 0.000068 | com.cnet
254 | 19552068 | 804 | 0.000037 | com.foxnews
255 | 19547134 | 152 | 0.000183 | com.addtoany
256 | 19547006 | 644 | 0.000046 | com.indiatimes
257 | 19546928 | 995 | 0.000029 | com.steamcommunity
258 | 19546864 | 1105 | 0.000026 | cn.com.chinadaily
259 | 19545628 | 584 | 0.000050 | com.psychologytoday
260 | 19544130 | 823 | 0.000036 | uk.co.guardian
261 | 19543920 | 1463 | 0.000021 | it.scoop
262 | 19543754 | 133 | 0.000247 | com.mailchimp
263 | 19542234 | 837 | 0.000035 | com.slate
264 | 19542214 | 153 | 0.000182 | com.opera
265 | 19538412 | 589 | 0.000050 | com.mckinsey
266 | 19536816 | 1020 | 0.000029 | com.sap
267 | 19536418 | 2605 | 0.000013 | org.wikiquote
268 | 19534334 | 307 | 0.000083 | com.bitly
269 | 19533308 | 627 | 0.000047 | com.mozilla
270 | 19533054 | 262 | 0.000095 | jp.ameblo
271 | 19531260 | 735 | 0.000039 | org.sciencemag
272 | 19528246 | 116 | 0.000284 | com.paypalobjects
273 | 19528108 | 2345 | 0.000014 | org.wikibooks
274 | 19527104 | 176 | 0.000151 | com.amazon-adsystem
275 | 19526948 | 688 | 0.000042 | gov.noaa
276 | 19524868 | 305 | 0.000083 | com.netdna-ssl
277 | 19524544 | 310 | 0.000083 | com.nbcnews
278 | 19523330 | 989 | 0.000030 | com.target
279 | 19522776 | 1523 | 0.000020 | com.instructables
280 | 19517526 | 975 | 0.000030 | edu.umn
281 | 19516530 | 965 | 0.000031 | com.merriam-webster
282 | 19516260 | 1431 | 0.000021 | hk.com.google
283 | 19514852 | 185 | 0.000140 | com.tripadvisor
284 | 19514608 | 2377 | 0.000014 | com.diigo
285 | 19503916 | 497 | 0.000061 | ca.google
286 | 19499262 | 236 | 0.000106 | com.wpengine
287 | 19499246 | 1029 | 0.000028 | com.sun
288 | 19496562 | 1189 | 0.000025 | com.digitaltrends
289 | 19496340 | 391 | 0.000075 | com.stumbleupon
290 | 19491846 | 115 | 0.000284 | com.weibo
291 | 19491638 | 1626 | 0.000019 | com.ign
292 | 19491210 | 1314 | 0.000023 | com.mercurynews
293 | 19490964 | 1352 | 0.000022 | de.zeit
294 | 19490636 | 229 | 0.000114 | com.etsy
295 | 19489106 | 797 | 0.000037 | uk.ac.ox
296 | 19487454 | 284 | 0.000089 | com.optimizely
297 | 19485106 | 73 | 0.000425 | net.akamaihd
298 | 19484368 | 1207 | 0.000025 | net.speedtest
299 | 19484284 | 1522 | 0.000020 | org.greenpeace
300 | 19483622 | 1553 | 0.000020 | net.seesaa
301 | 19479450 | 720 | 0.000040 | au.com.google
302 | 19478604 | 904 | 0.000033 | de.spiegel
303 | 19476336 | 1077 | 0.000027 | com.podbean
304 | 19475142 | 628 | 0.000047 | org.pbs
305 | 19474722 | 516 | 0.000058 | com.gofundme
306 | 19474484 | 416 | 0.000070 | com.kickstarter
307 | 19473590 | 1340 | 0.000022 | com.urbandictionary
308 | 19472422 | 472 | 0.000064 | org.pewresearch
309 | 19471320 | 519 | 0.000057 | com.bigcommerce
310 | 19467912 | 2137 | 0.000015 | de.bild
311 | 19467240 | 231 | 0.000112 | com.eepurl
312 | 19465300 | 515 | 0.000058 | com.theverge
313 | 19464792 | 273 | 0.000092 | com.stackoverflow
314 | 19464598 | 926 | 0.000032 | com.politico
315 | 19463036 | 811 | 0.000036 | co.ibb
316 | 19462394 | 332 | 0.000079 | it.google
317 | 19462162 | 2110 | 0.000016 | ly.visual
318 | 19461840 | 955 | 0.000031 | org.unicef
319 | 19460932 | 2020 | 0.000016 | org.tensorflow
320 | 19457592 | 1688 | 0.000018 | com.itv
321 | 19457150 | 1013 | 0.000029 | com.lifehacker
322 | 19456512 | 106 | 0.000334 | com.stripe
323 | 19456272 | 1349 | 0.000022 | edu.msu
324 | 19455412 | 312 | 0.000083 | net.windows
325 | 19453374 | 805 | 0.000037 | edu.academia
326 | 19450284 | 1391 | 0.000022 | com.storify
327 | 19449638 | 1257 | 0.000024 | com.crunchbase
328 | 19449386 | 595 | 0.000049 | com.tandfonline
329 | 19449132 | 1958 | 0.000017 | com.lego
330 | 19444682 | 1187 | 0.000025 | com.jetbrains
331 | 19443796 | 677 | 0.000043 | gov.senate
332 | 19443664 | 855 | 0.000035 | com.chicagotribune
333 | 19443234 | 2301 | 0.000014 | com.rottentomatoes
334 | 19440224 | 770 | 0.000038 | ca.cbc
335 | 19439934 | 205 | 0.000125 | com.eventbrite
336 | 19439496 | 1273 | 0.000023 | hk.hku
337 | 19436402 | 1035 | 0.000028 | edu.wisc
338 | 19436104 | 691 | 0.000042 | com.libsyn
339 | 19435742 | 1051 | 0.000028 | edu.northwestern
340 | 19433212 | 944 | 0.000031 | com.scientificamerican
341 | 19432798 | 1043 | 0.000028 | edu.uchicago
342 | 19431182 | 1288 | 0.000023 | uk.co.wired
343 | 19425546 | 190 | 0.000137 | jp.co.google
344 | 19424346 | 2002 | 0.000016 | org.maven
345 | 19423732 | 1030 | 0.000028 | com.mediafire
346 | 19423350 | 415 | 0.000070 | me.telegram
347 | 19418440 | 396 | 0.000074 | com.criteo
348 | 19417208 | 357 | 0.000076 | fr.google
349 | 19417038 | 664 | 0.000044 | us.icio
350 | 19416402 | 1477 | 0.000020 | com.deadline
351 | 19415808 | 640 | 0.000046 | com.sagepub
352 | 19414256 | 730 | 0.000039 | com.ecwid
353 | 19413466 | 1275 | 0.000023 | org.aclu
354 | 19413258 | 576 | 0.000051 | com.typepad
355 | 19412168 | 471 | 0.000064 | com.photobucket
356 | 19407294 | 533 | 0.000055 | com.oup
357 | 19407168 | 1199 | 0.000025 | com.reverbnation
358 | 19406968 | 1514 | 0.000020 | de.mpg
359 | 19405330 | 1389 | 0.000022 | edu.rutgers
360 | 19404790 | 1067 | 0.000027 | com.scmp
361 | 19403976 | 81 | 0.000392 | net.jsfiddle
362 | 19403692 | 421 | 0.000069 | com.calendly
363 | 19403618 | 844 | 0.000035 | com.sciencedaily
364 | 19403468 | 727 | 0.000039 | gov.justice
365 | 19400830 | 575 | 0.000051 | gov.hhs
366 | 19398258 | 919 | 0.000032 | com.theconversation
367 | 19397596 | 991 | 0.000030 | com.apnews
368 | 19397442 | 938 | 0.000032 | com.huffpost
369 | 19394934 | 1518 | 0.000020 | com.newscientist
370 | 19394656 | 608 | 0.000049 | org.openstreetmap
371 | 19393300 | 1287 | 0.000023 | com.aljazeera
372 | 19393230 | 216 | 0.000119 | com.hubspot
373 | 19390018 | 645 | 0.000046 | gov.house
374 | 19388118 | 2682 | 0.000012 | uk.co.timesonline
375 | 19388034 | 2564 | 0.000013 | com.space
376 | 19383910 | 700 | 0.000041 | com.pinimg
377 | 19383504 | 432 | 0.000068 | page.g
378 | 19381990 | 1241 | 0.000024 | com.sky
379 | 19381844 | 866 | 0.000035 | gov.congress
380 | 19381026 | 912 | 0.000033 | com.500px
381 | 19380632 | 1217 | 0.000024 | org.wiktionary
382 | 19380340 | 958 | 0.000031 | com.ssrn
383 | 19379742 | 1709 | 0.000018 | edu.bu
384 | 19377640 | 1757 | 0.000018 | gov.cia
385 | 19375740 | 214 | 0.000120 | org.bbb
386 | 19375634 | 1438 | 0.000021 | com.foxbusiness
387 | 19371814 | 624 | 0.000047 | ru.gov
388 | 19371056 | 1598 | 0.000019 | ca.mcgill
389 | 19367926 | 790 | 0.000037 | com.qualtrics
390 | 19366054 | 1290 | 0.000023 | org.semanticscholar
391 | 19365778 | 761 | 0.000038 | site.business
392 | 19365760 | 267 | 0.000094 | ru.ok
393 | 19363798 | 977 | 0.000030 | edu.si
394 | 19363758 | 887 | 0.000034 | br.com.google
395 | 19363688 | 847 | 0.000035 | co.g
396 | 19363204 | 1021 | 0.000029 | uk.co.thetimes
397 | 19362122 | 2663 | 0.000012 | com.discovermagazine
398 | 19359920 | 182 | 0.000142 | us.zoom
399 | 19359492 | 889 | 0.000034 | org.fao
400 | 19359352 | 683 | 0.000043 | org.change
401 | 19357866 | 1469 | 0.000020 | com.salon
402 | 19356650 | 228 | 0.000114 | com.aliyuncs
403 | 19356280 | 997 | 0.000029 | com.thehill
404 | 19354818 | 973 | 0.000030 | gov.usgs
405 | 19351584 | 298 | 0.000085 | com.ebay
406 | 19350988 | 1222 | 0.000024 | com.nikkei
407 | 19350142 | 338 | 0.000078 | com.rawgit
408 | 19349660 | 578 | 0.000051 | it.placehold
409 | 19348824 | 157 | 0.000173 | com.wixsite
410 | 19348122 | 1238 | 0.000024 | com.smithsonianmag
411 | 19346552 | 758 | 0.000038 | org.oecd
412 | 19346514 | 1088 | 0.000027 | ee.linktr
413 | 19345254 | 3312 | 0.000011 | com.openai
414 | 19342288 | 1048 | 0.000028 | uk.co.mirror
415 | 19341656 | 679 | 0.000043 | com.deviantart
416 | 19341332 | 1576 | 0.000019 | org.phys
417 | 19340598 | 413 | 0.000070 | tv.twitch
418 | 19340138 | 404 | 0.000072 | com.mapbox
419 | 19335246 | 1546 | 0.000020 | ca.sfu
420 | 19332464 | 2754 | 0.000012 | com.instapaper
421 | 19330656 | 244 | 0.000100 | org.gnu
422 | 19330504 | 2115 | 0.000016 | au.edu.unimelb
423 | 19328724 | 1044 | 0.000028 | int.coe
424 | 19328320 | 2078 | 0.000016 | org.nobelprize
425 | 19328286 | 667 | 0.000043 | pl.google
426 | 19327680 | 1333 | 0.000022 | com.irishtimes
427 | 19327578 | 293 | 0.000086 | com.office
428 | 19327536 | 1962 | 0.000017 | org.torproject
429 | 19324936 | 484 | 0.000063 | net.imgix
430 | 19324628 | 1281 | 0.000023 | uk.ac.ucl
431 | 19320926 | 1054 | 0.000028 | org.ohchr
432 | 19318772 | 1213 | 0.000025 | com.strikingly
433 | 19315502 | 509 | 0.000059 | org.hbr
434 | 19315040 | 1411 | 0.000021 | uk.co.metro
435 | 19314304 | 123 | 0.000270 | com.statcounter
436 | 19313468 | 972 | 0.000030 | gov.dhs
437 | 19313380 | 287 | 0.000088 | com.thedailybeast
438 | 19313234 | 1811 | 0.000017 | com.bankofamerica
439 | 19312534 | 1265 | 0.000024 | com.buzzsprout
440 | 19311940 | 863 | 0.000035 | gov.nps
441 | 19309868 | 2426 | 0.000014 | au.com.theage
442 | 19307472 | 933 | 0.000032 | com.aweber
443 | 19306766 | 1557 | 0.000020 | blog.home
444 | 19305448 | 848 | 0.000035 | gov.bls
445 | 19305296 | 490 | 0.000062 | edu.nyu
446 | 19304346 | 2087 | 0.000016 | com.oxforddictionaries
447 | 19304074 | 1162 | 0.000025 | gov.nyc
448 | 19303568 | 93 | 0.000356 | org.reactjs
449 | 19302778 | 1382 | 0.000022 | au.com.news
450 | 19300882 | 2291 | 0.000014 | sg.edu.nus
451 | 19299900 | 1429 | 0.000021 | com.flipboard
452 | 19299896 | 481 | 0.000063 | com.scorecardresearch
453 | 19298010 | 2517 | 0.000013 | com.dummies
454 | 19295840 | 2465 | 0.000013 | org.rsc
455 | 19295472 | 1010 | 0.000029 | com.britannica
456 | 19294984 | 714 | 0.000040 | gov.state
457 | 19294216 | 1700 | 0.000018 | org.gutenberg
458 | 19292892 | 3565 | 0.000010 | fm.ask
459 | 19290866 | 2970 | 0.000011 | com.pearltrees
460 | 19289990 | 793 | 0.000037 | com.zapier
461 | 19286494 | 2562 | 0.000013 | com.mystrikingly
462 | 19284092 | 876 | 0.000034 | com.cctv
463 | 19283500 | 816 | 0.000036 | com.healthline
464 | 19283044 | 1955 | 0.000017 | com.chrome
465 | 19282638 | 1484 | 0.000020 | com.rt
466 | 19282550 | 967 | 0.000031 | com.newsweek
467 | 19280538 | 2362 | 0.000014 | com.biography
468 | 19279646 | 1005 | 0.000029 | ch.google
469 | 19270504 | 1412 | 0.000021 | com.ifttt
470 | 19270238 | 1584 | 0.000019 | com.axios
471 | 19270042 | 466 | 0.000065 | es.google
472 | 19269658 | 882 | 0.000034 | au.gov.nsw
473 | 19267444 | 3483 | 0.000010 | hk.edu.cuhk
474 | 19267150 | 862 | 0.000035 | com.stitcher
475 | 19267000 | 2520 | 0.000013 | com.boredpanda
476 | 19265582 | 1192 | 0.000025 | fr.lemonde
477 | 19263992 | 554 | 0.000053 | com.steampowered
478 | 19263878 | 1055 | 0.000028 | org.jstor
479 | 19262150 | 1335 | 0.000022 | org.imf
480 | 19261918 | 873 | 0.000034 | com.venturebeat
481 | 19261196 | 825 | 0.000036 | org.poynter
482 | 19259574 | 1684 | 0.000018 | com.straitstimes
483 | 19259452 | 3390 | 0.000010 | com.chosun
484 | 19259322 | 1502 | 0.000020 | edu.asu
485 | 19258762 | 2351 | 0.000014 | io.gitlab
486 | 19256810 | 956 | 0.000031 | ru.google
487 | 19255996 | 952 | 0.000031 | sg.com.google
488 | 19253798 | 1331 | 0.000022 | uk.co.standard
489 | 19252906 | 612 | 0.000048 | de.gesetze-im-internet
490 | 19251516 | 948 | 0.000031 | gov.archives
491 | 19250270 | 2385 | 0.000014 | th.co.google
492 | 19249730 | 423 | 0.000069 | io.codepen
493 | 19248930 | 3033 | 0.000011 | com.nola
494 | 19248894 | 2023 | 0.000016 | edu.gmu
495 | 19245246 | 2836 | 0.000012 | app.netlify
496 | 19245158 | 1116 | 0.000026 | com.wikia
497 | 19242656 | 1353 | 0.000022 | com.history
498 | 19242160 | 1007 | 0.000029 | com.thelancet
4991924183029180.000011com.coca-colacompany
5001924064026540.000012google.ai
501192406008560.000035com.freepik
5021924043015480.000020com.buzzfeednews
5031923864828940.000012org.cato
504192377004310.000068net.datatables
505192374565010.000060com.rackcdn
5061923616815900.000019gov.supremecourt
5071923330225340.000013edu.byu
508192332686420.000046fr.amazon
5091923292028720.000012tw.blogspot
510192319448030.000037in.co.google
5111923153019770.000017org.edx
5121923122813090.000023com.tunein
5131923115617790.000018org.ocks
514192304785220.000057nl.google
515192283705550.000053com.gmail
5161922706823980.000014com.nationalpost
5171922691018670.000017edu.ucsb
5181922641823830.000014edu.nd
5191922639213720.000022com.dw
520192262561270.000262com.jimdo
5211922586024120.000014no.uio
5221922540010060.000029google.blog
5231922239814090.000021cn.cntv
5241922216432850.000011cn.org.china
5251922113616390.000019org.unwomen
526192189509460.000031com.airtable
5271921778825100.000013edu.uoregon
5281921537621720.000015org.britishcouncil
5291921467426680.000012org.icrc
530192144629510.000031com.gallup
5311921337822650.000015ru.kremlin
5321921289413320.000022com.globalsign
533192108508750.000034gov.uspto
534192104929590.000031edu.psu
5351921002215090.000020com.penguinrandomhouse
5361920931813450.000022com.netdna-cdn
5371920868632690.000011is.archive
5381920834415310.000020uk.ac.lse
5391920795225030.000013fi.helsinki
5401920762020420.000016edu.pitt
5411920723621700.000015net.openid
5421920625611550.000026edu.brookings
543192052907860.000037com.imageshack
544192047701720.000152com.npmjs
5451920448632900.000011de.diplo
5461920438019560.000017edu.unl
5471920383215440.000020edu.georgetown
5481920321021250.000015org.metmuseum
5491920275012400.000024org.nejm
550192022447260.000040com.adage
5511920043419900.000017com.channel4
5521920029015110.000020com.findlaw
5531920003022240.000015com.france24
554191989382820.000089net.php
5551919869817840.000017com.csmonitor
556191978664190.000069com.proofpoint
557191953201920.000135com.iubenda
5581919437210110.000029gov.treasury
5591919402817080.000018com.euronews
5601919144622860.000014com.thoughtco
5611919013637420.000009com.doodlekit
562191898621070.000320com.godaddy
5631918933412980.000023edu.duke
5641918865220710.000016com.foreignpolicy
5651918511819960.000017org.documentcloud
5661918375613000.000023com.livescience
5671918370625080.000013com.upi
5681918310420850.000016com.gq
569191822601780.000148com.zendesk
5701918207430200.000011com.authorstream
5711918207439150.000009com.mysanantonio
5721918169441330.000008tw.edu.sinica
5731917789427190.000012org.wikisource
5741917738222200.000015com.insider
575191771808510.000035gov.nist
5761917700016250.000019com.thestar
577191766421810.000145jp.co.yahoo
5781917454613040.000023au.com.smh
5791917402820250.000016org.ncsl
5801917380042520.000008hk.edu.cityu
5811917374433490.000010com.sina
5821917310821970.000015ie.independent
5831917226621560.000015edu.uky
58419171704960.000349me.ogp
5851917093634130.000010uk.ac.sussex
5861917079217550.000018gov.doc
587191707041310.000250org.networkadvertising
588191695663200.000080io.shields
589191680586490.000045gov.usa
5901916699042910.000008org.china-embassy
5911916681031370.000011com.udn
592191637741610.000166ru.mail
5931916371234740.000010com.worldatlas
594191635225050.000060com.netflix
595191632548570.000035com.thinkwithgoogle
5961916235614410.000021gov.defense
5971916195213180.000023tw.com.google
5981916082616040.000019org.hrw
5991915981214950.000020com.asahi
600191595707850.000037io.readthedocs
6011915876826880.000012org.freedomhouse
6021915865414130.000021tv.ustream
603191578228930.000034org.mediawiki
6041915644617150.000018org.pypi
6051915180030280.000011org.adb
6061915140620990.000016fr.leparisien
6071915115226150.000013com.abc7news
6081915065020630.000016com.voanews
6091915004810190.000029com.pcmag
610191486984470.000067org.nodejs
6111914855442880.000008com.theundefeated
6121914781638600.000009org.gephi
6131914717613270.000023org.undp
6141914646232770.000011org.iucnredlist
6151914645425830.000013com.sacbee
6161914620415940.000019com.treehugger
6171914560822920.000014no.google
6181914446224710.000013co.ello
6191914335419860.000017com.msnbc
620191433542520.000097com.myshopify
621191428109810.000030uk.parliament
6221914252022870.000014co.pcdn
6231914194212550.000024gov.uscourts
6241914189614220.000021co.lpages
6251914078023440.000014org.fas
626191397687810.000037com.intel
627191387408070.000036com.marketwatch
6281913691420470.000016com.infogram
6291913384825380.000013com.sputniknews
6301913370424300.000014ie.google
6311913258213440.000022se.google
632191317989900.000030com.netlify
633191310009250.000032com.jekyllrb
6341913061230550.000011int.interpol
635191303085240.000056fr.free
6361913018011980.000025be.google
6371912975015750.000019uk.co.huffingtonpost
6381912931023230.000014ly.rebrand
6391912910415040.000020link.page
6401912870417940.000017com.sched
6411912772422180.000015jp.co.japantimes
6421912725428290.000012org.tigris
6431912715228390.000012org.pri
6441912700623190.000014nz.co.nzherald
6451912562212040.000025at.google
6461912546452920.000007org.arkive
647191253262220.000116com.salesforce
648191232966500.000045br.com.uol
6491912101842420.000008kr.co.kbs
6501911937416650.000018com.thebalance
6511911912614550.000021org.oxfordjournals
6521911863837380.000009com.encyclopedia
6531911726222040.000015org.eji
6541911650628180.000012org.heritage
6551911629823710.000014com.popsci
6561911451821990.000015com.snopes
6571911409826010.000013org.oas
658191133481560.000174com.aspnetcdn
6591911271210310.000028org.ilo
6601910965422630.000015com.insidehighered
6611910898015870.000019gov.usembassy
6621910893216220.000019dk.google
6631910804033920.000010org.jenkins-ci
6641910738828270.000012org.project-syndicate
6651910655619630.000017com.justia
6661910412015630.000019gov.govinfo
6671910315216990.000018com.firebaseapp
6681910206820930.000016edu.uga
6691910202836780.000010edu.wm
6701910161432840.000011com.cgtn
6711910159618810.000017org.worldcat
672191012269000.000033com.zoho
673191005903920.000074com.atlassian
6741910029026760.000012org.transparency
6751909977613170.000023org.aarp
6761909968616750.000018org.americanbar
6771909916422390.000015com.timeshighereducation
6781909796432700.000011com.pastemagazine
6791909590225980.000013org.csis
680190943426290.000047com.samsung
681190940587740.000038com.pexels
6821909337419640.000017com.washingtontimes
6831909271420160.000016gov.usaid
6841909016613340.000022org.heart
685190887641910.000136com.automattic
686190884288650.000035com.verisign
6871908766021080.000016com.motherjones
6881908703429440.000011org.vim
6891908649820620.000016edu.nap
690190861729240.000032com.webs
6911908477815930.000019org.amnesty
6921908434421010.000016ua.com.google
6931908355239880.000009org.globalnetworkinitiative
6941908319625460.000013org.globalcitizen
6951908250017540.000018com.surveygizmo
6961908205822620.000015org.wbur
6971908104823530.000014uk.gov.companieshouse
6981908039824680.000013jp.mainichi
6991908028631810.000011com.podomatic
7001907811617510.000018org.unhcr
7011907627621180.000016ca.ctvnews
7021907531025650.000013uk.co.bbci
703190738129680.000031uk.gov.legislation
7041907152226810.000012com.nationalreview
7051907083225230.000013com.cleveland
7061907047438140.000009org.neocities
7071906988410730.000027ly.snip
708190688644380.000067com.herokuapp
709190685106560.000045com.oreilly
7101906673011540.000026cz.google
7111906646421640.000015org.nrdc
7121906576826710.000012org.thinkprogress
7131906565417950.000017ca.globalnews
714190651062700.000093jp.co.amazon
7151906284013280.000023org.altervista
7161906173231190.000011uk.ac.nottingham
7171906116812670.000024uk.gov.nationalarchives
7181906093421060.000016au.edu.anu
7191906023630350.000011com.intensedebate
7201906010227340.000012de.hu-berlin
721190598027360.000039com.airbnb
7221905980023260.000014de.auswaertiges-amt
7231905937623160.000014nz.co.google
7241905917026720.000012org.unenvironment
7251905897831320.000011org.rsf
7261905793241100.000008com.koreaherald
7271905777819600.000017org.pewtrusts
7281905767828670.000012com.techinasia
7291905748822760.000014com.thecut
7301905617437000.000009com.viki
7311905606827240.000012org.gnupg
7321905459024690.000013ro.google
7331905439420570.000016edu.gwu
7341905411630570.000011com.bangkokpost
7351905362625720.000013fr.rfi
736190528684140.000070com.pubmatic
7371905190623090.000014com.tutsplus
7381905164810790.000027tr.com.google
739190515162480.000098com.getbootstrap
7401905090844240.000008com.wonderhowto
7411905062636190.000010com.upworthy
7421905049628830.000012org.sonatype
743190503822880.000087com.typeform
7441904957428060.000012il.co.google
7451904938427390.000012uk.ac.leeds
746190481162010.000127to.amzn
7471904798627030.000012vn.com.google
748190475782740.000092com.surveymonkey
749190473809220.000032int.wipo
7501904628810570.000028com.gizmodo
751190461448740.000034com.box
7521904557822980.000014com.oregonlive
753190449165470.000054gg.discord
7541904444433560.000010com.theepochtimes
7551904440024800.000013ar.com.google
7561904414429430.000011bg.google
7571904363220610.000016com.squarespace-cdn
7581904340034790.000010io.soup
7591904277825450.000013com.webbyawards
7601904238427440.000012io.fabric
7611904229815880.000019com.speakerdeck
762190416841360.000232info.aboutads
763190406069070.000033com.docker
7641903881418170.000017com.miamiherald
7651903792431910.000011ph.com.google
7661903776224630.000013com.channelnewsasia
7671903755631980.000011uk.co.vogue
7681903755426190.000013edu.fsu
769190358704850.000063com.staticflickr
7701903528424950.000013za.co.google
7711903367826960.000012com.thejakartapost
7721903244212360.000024edu.ucsd
773190322584870.000062com.fc2
7741903203854150.000007com.armorgames
7751903194421550.000015fi.google
7761903123438850.000009com.alamy
7771903086822210.000015id.co.google
7781903046227940.000012com.rd
7791902971229510.000011com.cartodb
7801902958420920.000016com.newrepublic
7811902934834360.000010com.benzinga
782190283646610.000044com.entrepreneur
7831902796053760.000007org.gwtproject
7841902666029880.000011com.sciencealert
7851902653827630.000012org.iaea
7861902640223760.000014com.thenation
7871902369234110.000010si.google
7881902304624000.000014pt.google
7891902012429650.000011au.gov.nla
7901901983835130.000010com.dailykos
791190197564940.000061com.aol
7921901912825190.000013edu.emory
7931901901235730.000010com.inhabitat
7941901895634150.000010uk.ac.soas
795190184026660.000044com.deloitte
7961901823011850.000025com.today
797190168389780.000030com.windowsphone
7981901618636590.000010org.cpj
7991901616421190.000016kr.co.google
8001901590629810.000011se.lu
8011901578027740.000012org.cfr
802190148564290.000068me.fb
8031901367832880.000011com.joins
8041901298042640.000008sa.com.google
8051901287828140.000012com.politifact
806190122929640.000031com.alexa
8071901144241310.000008edu.utm
8081901106827350.000012com.law360
809190105469830.000030com.engadget
8101900866235830.000010hr.google
8111900853821460.000015hu.google
812190068606310.000047fm.last
8131900654024760.000013eu.politico
8141900624840470.000009com.chinatimes
8151900611625210.000013mx.com.google
8161900606031410.000011com.jezebel
8171900594238680.000009com.iconarchive
8181900531834710.000010com.ogilvy
8191900486623990.000014gr.google
8201900408628160.000012com.monday
8211900325227380.000012com.digitaljournal
8221900324831490.000011com.nyt
8231900322033000.000011audio.breaker
8241900264028230.000012uk.co.guim
825190023846250.000047com.cisco
8261900203833910.000010cn.globaltimes
8271900180826480.000012com.instructure
8281900064633210.000011com.crashlytics
8291899972027230.000012au.com.businessinsider
8301899933834300.000010org.grist
8311899828012090.000025com.pastebin
832189981183150.000082ai.shortpixel
8331899807839900.000009org.constitutioncenter
8341899796048420.000007jp.hatenadiary
8351899678037700.000009edu.ttu
8361899607629970.000011uk.ac.york
8371899593616710.000018com.eater
83818995084900.000364com.livestream
8391899503627720.000012com.bepress
8401899475228980.000012org.wri
8411899226220430.000016my.com.thestar
8421899112237750.000009com.minds
8431899059223520.000014mp.j
8441899057037080.000009app.web
8451899006234100.000010org.carnegieendowment
8461898978636450.000010tr.com.aa
847189894187110.000041gov.sec
8481898774638120.000009com.hyperallergic
8491898728234080.000010com.foreignaffairs
8501898664037970.000009au.edu.uts
851189853924700.000064com.fastcompany
8521898503235600.000010org.hypotheses
8531898446838960.000009com.japantoday
8541898275235070.000010edu.wayne
8551898204837130.000009uk.ac.kent
8561898198836970.000009rs.google
8571898053240710.000009org.sourcewatch
858189793668320.000036com.symantec
8591897842425390.000013fr.paris
8601897799629420.000011com.prweek
8611897790217650.000018ch.ipcc
8621897696022170.000015com.kinstacdn
8631897626210460.000028edu.cmu
8641897546220390.000016int.unfccc
8651897506241960.000008eg.com.google
8661897480431800.000011org.nationalgeographic
8671897454826430.000013gov.doi
8681897394034060.000010de.uni-frankfurt
8691897349442430.000008by.google
8701897202250500.000007com.symbaloo
8711897101034170.000010nl.wur
8721896995023280.000014org.unodc
8731896843015990.000019com.routledge
8741896841245090.000008com.ipsos-mori
8751896696236580.000010ae.google
8761896615244820.000008com.etymonline
8771896588849820.000007build.bazel
8781896556633200.000011org.brainpickings
8791896454431430.000011com.scotsman
8801896379642950.000008com.oilprice
8811896338035970.000010uk.ac.westminster
8821896326645450.000008lk.google
8831896257612600.000024fr.blogspot
8841896136034120.000010org.rferl
8851896131031730.000011org.epi
8861895990041150.000008lv.google
8871895981239090.000009au.edu.griffith
8881895942242190.000008kr.ac.snu
8891895728013120.000023com.upwork
8901895707624360.000014com.html5rocks
8911895671454930.000007me.nimbusweb
8921895650229400.000011fr.archives-ouvertes
8931895639842930.000008com.delawareonline
8941895546217920.000017ru.rbc
895189549687450.000039com.gartner
8961895493011270.000026edu.utexas
8971895364225260.000013net.noscript
8981895346627170.000012ae.thenational
8991895333633800.000010com.study
900189530924270.000068com.hp
9011895307436410.000010uk.co.spectator
9021895276238690.000009com.cleantechnica
9031895220828030.000012org.unctad
9041895120042550.000008com.teslamotors
9051895011816140.000019com.billboard
9061894936630740.000011com.theculturetrip
9071894789624540.000013com.multiscreensite
908189477387040.000041com.visualstudio
9091894758839850.000009uk.ac.plymouth
9101894745426600.000012sk.google
9111894731238110.000009net.aljazeera
9121894711024130.000014com.theintercept
9131894655634210.000010uk.ac.exeter
9141894649433320.000010social.mastodon
9151894587628280.000012com.euractiv
9161894586436350.000010com.db
9171894273644470.000008org.mises
9181894231646800.000008ng.com.google
9191894201627950.000012org.panda
9201894162224660.000013uk.gov.justice
9211894143056020.000007net.chinadialogue
9221894092441180.000008cat.uab
9231894074642270.000008com.spokesman
9241894008235230.000010co.com.google
9251893923044730.000008lu.google
9261893899641890.000008pe.com.google
9271893861833660.000010com.nybooks
9281893860643810.000008uk.ac.core
9291893820622280.000015com.termsfeed
9301893819416690.000018com.pcworld
9311893811238460.000009kr.co.yna
9321893800247930.000007com.gust
9331893778838800.000009org.cgiar
9341893730042310.000008pk.com.google
9351893653035750.000010net.inquirer
9361893600830830.000011ru.lenta
9371893400014680.000020com.nokia
9381893367629320.000011tw.com.pchome
9391893349612230.000024com.ycombinator
9401893335029110.000011nl.volkskrant
94118933194780.000411com.oculus
9421893261234550.000010cl.google
9431893186239490.000009org.polymer-project
9441893088826370.000013com.washingtonexaminer
9451893062239450.000009sk.sme
9461893053433890.000010edu.monash
947189300869180.000032com.canva
948189295524540.000066org.opensource
9491892939839770.000009com.rappler
9501892863040000.000009org.plan-international
9511892651845610.000008cr.co.google
9521892641235870.000010lt.google
9531892583238100.000009ca.macleans
954189256468170.000036net.adform
9551892504648730.000007com.blogto
9561892495235080.000010uk.ac.nhm
9571892492832110.000011edu.ua
9581892355428150.000012com.articulate
959189232882490.000098com.sxsw
9601892286639930.000009org.wilsoncenter
9611892267640820.000009edu.lehigh
962189223364170.000070com.skype
9631892154646990.000008com.out
9641892071410850.000027com.redhat
9651892068032660.000011my.com.google
9661891906420310.000016gov.ecfr
9671891890045850.000008org.nsidc
968189187784120.000070net.secureservercdn
9691891811245360.000008kz.google
9701891759032950.000011org.osce
971189175625570.000053org.whatwg
9721891741840960.000009com.wsoctv
9731891738025870.000013uk.org.nationaltrust
9741891722032010.000011uk.gov.london
9751891704819730.000017scot.gov
9761891698238650.000009uk.ac.qub
9771891646038070.000009com.governing
978189164305280.000056com.businesswire
9791891630022530.000015wales.gov
9801891506634220.000010com.afp
9811891498230800.000011uk.ac.qmul
9821891487851540.000007com.ingress
9831891454045960.000008com.webcindario
9841891431634020.000010org.psychiatryonline
9851891323041480.000008org.marxists
9861891309640730.000009me.thinglink
9871891297016600.000018com.css-tricks
9881891285847320.000008ie.nuigalway
9891891251443480.000008com.asiaone
9901891236833540.000010com.kaspersky-labs
9911891211012490.000024com.smashingmagazine
9921891206437870.000009org.nationalinterest
993189118485560.000053com.adweek
9941891143644980.000008ec.com.google
9951891140447220.000008bd.com.google
9961891000648460.000007uy.com.google
9971890999842330.000008com.match
9981890974640210.000009ee.google
9991890968839620.000009com.adn
10001890947443100.000008com.wnd

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

September 2020 crawl archive now available

The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-40/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-40/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-40/warc.paths.gz | 79600 | 81.8
WAT files | CC-MAIN-2020-40/wat.paths.gz | 79600 | 23.14
WET files | CC-MAIN-2020-40/wet.paths.gz | 79600 | 10.28
Robots.txt files | CC-MAIN-2020-40/robotstxt.paths.gz | 79600 | 0.22
Non-200 responses files | CC-MAIN-2020-40/non200responses.paths.gz | 79600 | 2.36
URL index files | CC-MAIN-2020-40/cc-index.paths.gz | 302 | 0.27
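
The prefixing step described above can be sketched in a few lines of Python. This is only an illustration: the function name `paths_to_urls` and the sample path are made up for this example, and the listing is assumed to contain one relative path per line.

```python
import gzip
import io

HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"
S3_PREFIX = "s3://commoncrawl/"


def paths_to_urls(listing_bytes, prefix=HTTP_PREFIX):
    """Decompress a gzipped path listing and prepend the prefix to each non-empty line."""
    with gzip.open(io.BytesIO(listing_bytes), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Stand-in for a downloaded *.paths.gz file (hypothetical path, not real data).
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2020-40/segments/EXAMPLE/warc/example-00000.warc.gz\n"
)

for url in paths_to_urls(sample):
    print(url)
```

Passing `S3_PREFIX` as the second argument yields the S3 paths instead of the HTTP ones.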

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-40/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

August 2020 crawl archive now available

The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th. It includes page captures of 940 million URLs unknown in any of our prior crawl archives.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-34/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-34/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-34/warc.paths.gz | 60000 | 48.9
WAT files | CC-MAIN-2020-34/wat.paths.gz | 60000 | 16.9
WET files | CC-MAIN-2020-34/wet.paths.gz | 60000 | 7.56
Robots.txt files | CC-MAIN-2020-34/robotstxt.paths.gz | 60000 | 0.19
Non-200 responses files | CC-MAIN-2020-34/non200responses.paths.gz | 60000 | 1.94
URL index files | CC-MAIN-2020-34/cc-index.paths.gz | 302 | 0.19

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-34/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

July 2020 crawl archive now available

The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th. It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives.

Bug Fixes and Improvements

The URL index fields "redirect" and "mime" were not filled when the corresponding HTTP headers Location and Content-Type were written in lower-case letters or in any other variant whose case did not match. This bug was detected during the crawl and fixed for 90 out of 100 segments. It also affects the columnar index fields "fetch_redirect" and "content_mime_type", respectively. To a minor extent, it may affect the detection of character set and content language, because the value of the Content-Type header is used as an additional hint for the detection. Additional information about this bug fix is given in the corresponding issue report.
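
To illustrate the class of bug described above, here is a minimal Python sketch of a case-insensitive header lookup. It is not the crawler's actual code; the helper `get_header` and the sample headers are hypothetical. HTTP header names are case-insensitive, so an exact-case lookup silently misses variants such as content-type.

```python
def get_header(headers, name):
    """Return the value of the first header whose name matches case-insensitively, or None."""
    wanted = name.lower()
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return None


# Hypothetical response headers using non-canonical case.
response_headers = [
    ("content-type", "text/html; charset=utf-8"),
    ("LOCATION", "https://example.com/"),
]

# An exact-case lookup misses the header ...
exact_match = dict(response_headers).get("Content-Type")

# ... while the case-insensitive lookup finds both.
mime = get_header(response_headers, "Content-Type")
redirect = get_header(response_headers, "Location")
```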

Archive Location and Download

The July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-29/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-29/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-29/warc.paths.gz | 60000 | 62.64
WAT files | CC-MAIN-2020-29/wat.paths.gz | 60000 | 22.23
WET files | CC-MAIN-2020-29/wet.paths.gz | 60000 | 9.87
Robots.txt files | CC-MAIN-2020-29/robotstxt.paths.gz | 60000 | 0.21
Non-200 responses files | CC-MAIN-2020-29/non200responses.paths.gz | 60000 | 2.52
URL index files | CC-MAIN-2020-29/cc-index.paths.gz | 302 | 0.24

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-29/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Feb/Mar/May 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The host-level graph now includes hosts visited by the crawler but not linking to any other host. Why is this possible – isn’t any host found via links the crawler is following? Yes, but some links were already detected in a prior crawl, not in one of the 3 crawls used to build the web graphs. More details about the issue are given in cc-pyspark#15. The impact of this fix on the graph size is minimal: the recent crawl now includes 1 million nodes (0.1% of all nodes) which are not connected to any other node.

Host-level graph

The graph consists of 927 million nodes and 3.88 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 857 million dangling nodes (92.5%), and the largest strongly connected component contains 47 million nodes (5.1%).
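
As a rough illustration of these node categories, the following Python sketch classifies nodes from a small edge list. It assumes that dangling nodes can be identified as nodes without outgoing edges and that unconnected nodes have neither incoming nor outgoing edges. The function `classify_nodes` and the toy graph are hypothetical, and at the scale of the real graph a dedicated tool such as the WebGraph framework would be needed.

```python
def classify_nodes(nodes, edges):
    """Return (dangling, isolated): nodes without outlinks, and nodes without any links."""
    has_out = {src for src, _ in edges}
    has_in = {dst for _, dst in edges}
    dangling = {n for n in nodes if n not in has_out}
    isolated = {n for n in dangling if n not in has_in}
    return dangling, isolated


# Toy graph: node 2 is only linked to, node 3 is not connected at all.
toy_nodes = {0, 1, 2, 3}
toy_edges = [(0, 1), (0, 2), (1, 0)]

dangling, isolated = classify_nodes(toy_nodes, toy_edges)
share = 100.0 * len(dangling) / len(toy_nodes)
print(f"{len(dangling)} dangling nodes ({share:.1f}%), {len(isolated)} unconnected")
```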

You can download the graph and the ranks of all 927 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/May 2020 host-level webgraph

Size | File | Description
5.67 GB | cc-main-2020-feb-mar-may-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files
17.26 GB | cc-main-2020-feb-mar-may-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files
7.40 GB | cc-main-2020-feb-mar-may-host.graph | graph in BVGraph format
2 kB | cc-main-2020-feb-mar-may-host.properties |
8.57 GB | cc-main-2020-feb-mar-may-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2020-feb-mar-may-host-t.properties |
1 kB | cc-main-2020-feb-mar-may-host.stats | WebGraph statistics
12.16 GB | cc-main-2020-feb-mar-may-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
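
The reversal rule can be sketched as follows. This is a minimal illustration, not the code used in the pipeline, and the function name `reverse_host` is made up for this example.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```

Only a leading www. is removed; a www label elsewhere in the name is kept and reversed like any other label.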

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
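
The aggregation idea can be sketched in Python as below. This is a simplified illustration under several assumptions: the suffix set is a tiny hand-written stand-in for the public suffix list (whose wildcard and exception rules are ignored), host names are already in reversed notation, and links within the same domain are dropped, which may not match the actual pipeline. The names `pay_level_domain` and `aggregate_edges` are hypothetical.

```python
def pay_level_domain(rev_host, rev_suffixes):
    """Return the reversed pay-level domain: longest matching public suffix plus one label."""
    labels = rev_host.split(".")
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in rev_suffixes:
            return ".".join(labels[: n + 1])
    return None  # the name is itself a suffix or has no known suffix


def aggregate_edges(host_edges, rev_suffixes):
    """Map host-level edges to a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_dom = pay_level_domain(src, rev_suffixes)
        dst_dom = pay_level_domain(dst, rev_suffixes)
        # Assumption of this sketch: skip unknown suffixes and intra-domain links.
        if src_dom and dst_dom and src_dom != dst_dom:
            domain_edges.add((src_dom, dst_dom))
    return domain_edges


# Tiny stand-in for the public suffix list, in reversed notation.
suffixes = {"com", "uk", "uk.co"}
host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.shop", "uk.co.bbc"),
    ("com.example.blog", "com.example.shop"),
]
domain_edges = aggregate_edges(host_edges, suffixes)
```

In this toy input, three host-level links collapse into a single domain-level edge from com.example to uk.co.bbc.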

The domain-level graph has 91 million nodes and 1.96 billion edges. 46 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 36 million nodes (39%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/.

Download files of the Common Crawl Feb/Mar/May 2020 domain-level webgraph

Size | File | Description
0.62 GB | cc-main-2020-feb-mar-may-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
7.79 GB | cc-main-2020-feb-mar-may-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.23 GB | cc-main-2020-feb-mar-may-domain.graph | graph in BVGraph format
2 kB | cc-main-2020-feb-mar-may-domain.properties |
4.16 GB | cc-main-2020-feb-mar-may-domain-t.graph | transpose of the graph
2 kB | cc-main-2020-feb-mar-may-domain-t.properties |
1 kB | cc-main-2020-feb-mar-may-domain.stats | WebGraph statistics
1.96 GB | cc-main-2020-feb-mar-may-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Feb/Mar/May 2020)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 32667618 | 1 | 0.018180 | com.googleapis
2 | 30552772 | 3 | 0.011873 | com.facebook
3 | 29569088 | 2 | 0.013789 | com.google
4 | 26920460 | 4 | 0.007145 | com.twitter
5 | 26883128 | 5 | 0.007106 | org.w
6 | 26360448 | 6 | 0.006483 | com.youtube
7 | 24719396 | 9 | 0.004210 | com.instagram
8 | 24251942 | 8 | 0.005125 | org.gmpg
9 | 23841332 | 7 | 0.005329 | com.googletagmanager
10 | 23606890 | 13 | 0.002940 | com.linkedin
11 | 22741292 | 10 | 0.003621 | com.cloudflare
12 | 22732960 | 12 | 0.002974 | org.wordpress
13 | 22661910 | 14 | 0.002515 | com.gravatar
14 | 22577680 | 15 | 0.002438 | com.gstatic
15 | 22378134 | 22 | 0.001529 | com.pinterest
16 | 22196962 | 27 | 0.001192 | org.wikipedia
17 | 22189650 | 19 | 0.001864 | com.wordpress
18 | 22066028 | 16 | 0.002404 | com.bootstrapcdn
19 | 21967760 | 18 | 0.001884 | com.apple
20 | 21751768 | 20 | 0.001863 | com.jquery
21 | 21589606 | 24 | 0.001461 | com.microsoft
22 | 21568908 | 44 | 0.000785 | be.youtu
23 | 21568474 | 43 | 0.000806 | com.blogspot
24 | 21533280 | 31 | 0.001104 | com.vimeo
25 | 21415938 | 46 | 0.000761 | gl.goo
26 | 21399120 | 35 | 0.001040 | com.amazonaws
27 | 21358048 | 53 | 0.000665 | com.amazon
28 | 21331634 | 21 | 0.001737 | com.adobe
29 | 21324666 | 23 | 0.001506 | com.wp
30 | 21209012 | 70 | 0.000452 | com.tumblr
31 | 21184360 | 17 | 0.001949 | com.github
32 | 21150652 | 37 | 0.001008 | com.google-analytics
33 | 21110976 | 30 | 0.001152 | com.baidu
34 | 21096692 | 87 | 0.000387 | com.yahoo
35 | 21081268 | 59 | 0.000547 | ly.bit
36 | 21060360 | 33 | 0.001072 | com.macromedia
37 | 21046916 | 36 | 0.001035 | net.cloudfront
38 | 21036258 | 45 | 0.000763 | com.flickr
39 | 20997926 | 32 | 0.001101 | com.googlesyndication
40 | 20993476 | 26 | 0.001277 | me.wp
41 | 20980462 | 97 | 0.000340 | com.googleusercontent
42 | 20966446 | 56 | 0.000624 | eu.europa
43 | 20960242 | 42 | 0.000807 | net.jsdelivr
44 | 20959910 | 52 | 0.000677 | co.t
45 | 20901872 | 29 | 0.001163 | ru.yandex
46 | 20846092 | 50 | 0.000742 | net.doubleclick
47 | 20843032 | 41 | 0.000869 | com.addthis
48 | 20823518 | 69 | 0.000457 | io.github
49 | 20817952 | 76 | 0.000433 | com.medium
50 | 20810030 | 25 | 0.001287 | com.fontawesome
51 | 20809120 | 139 | 0.000189 | com.forbes
52 | 20796434 | 61 | 0.000510 | org.w3
53 | 20759102 | 55 | 0.000640 | com.paypal
54 | 20757266 | 109 | 0.000282 | com.soundcloud
55 | 20754514 | 90 | 0.000368 | org.creativecommons
56 | 20747472 | 57 | 0.000619 | com.vk
57 | 20711184 | 54 | 0.000658 | org.mozilla
58 | 20710182 | 88 | 0.000382 | com.weebly
59 | 20698442 | 84 | 0.000410 | com.wix
60 | 20675372 | 102 | 0.000317 | com.weibo
61 | 20663930 | 58 | 0.000604 | org.schema
62 | 20650202 | 164 | 0.000151 | com.imgur
63 | 20644452 | 147 | 0.000177 | org.apache
64 | 20642282 | 178 | 0.000138 | uk.co.bbc
65 | 20625560 | 129 | 0.000210 | org.archive
66 | 20610354 | 274 | 0.000089 | com.ibm
67 | 20609614 | 154 | 0.000169 | com.bing
68 | 20602380 | 191 | 0.000125 | net.sourceforge
69 | 20579012 | 130 | 0.000207 | com.nytimes
70 | 20578626 | 150 | 0.000174 | int.who
71 | 20571012 | 183 | 0.000131 | com.cnn
72 | 20561674 | 174 | 0.000140 | net.slideshare
73 | 20547634 | 158 | 0.000164 | gov.cdc
74 | 20542546 | 202 | 0.000116 | com.android
75 | 20527230 | 228 | 0.000104 | com.wsj
76 | 20518548 | 194 | 0.000122 | edu.stanford
77 | 20505546 | 205 | 0.000115 | com.businessinsider
78 | 20495034 | 254 | 0.000095 | com.oracle
79 | 20489434 | 34 | 0.001049 | net.fbcdn
80 | 20488868 | 373 | 0.000067 | com.msn
81 | 20488282 | 261 | 0.000093 | edu.harvard
82 | 20483384 | 310 | 0.000080 | com.go
83 | 20478152 | 99 | 0.000335 | com.shopify
84 | 20471424 | 267 | 0.000093 | com.bbc
85 | 20464434 | 297 | 0.000083 | edu.mit
86 | 20461340 | 330 | 0.000076 | com.myspace
87 | 20458776 | 62 | 0.000497 | com.whatsapp
88 | 20457206 | 289 | 0.000085 | com.appspot
89 | 20454466 | 307 | 0.000080 | com.wired
90 | 20446300 | 292 | 0.000085 | com.reuters
91 | 20442004 | 101 | 0.000323 | com.godaddy
92 | 20435550 | 171 | 0.000147 | com.theguardian
93 | 20417770 | 143 | 0.000182 | gov.nih
94 | 20412536 | 196 | 0.000120 | org.ietf
95 | 20401330 | 388 | 0.000065 | gov.nasa
96 | 20397298 | 423 | 0.000061 | com.theverge
97 | 20394736 | 149 | 0.000175 | com.giphy
98 | 20394276 | 382 | 0.000066 | net.researchgate
99 | 20384930 | 270 | 0.000092 | com.bloomberg
100 | 20377778 | 108 | 0.000285 | com.unpkg
101 | 20376394 | 114 | 0.000271 | com.reddit
102 | 20373856 | 337 | 0.000075 | com.xinhuanet
103 | 20366736 | 215 | 0.000108 | org.gnu
104 | 20363506 | 318 | 0.000079 | com.usatoday
105 | 20352660 | 813 | 0.000037 | org.chromium
106 | 20344996 | 356 | 0.000071 | com.springer
107 | 20343678 | 98 | 0.000335 | de.google
108 | 20342420 | 28 | 0.001184 | com.qq
109 | 20341824 | 345 | 0.000073 | com.example
110 | 20336510 | 744 | 0.000041 | edu.psu
111 | 20324536 | 468 | 0.000055 | edu.cornell
112 | 20324378 | 184 | 0.000131 | com.blogger
113 | 20314024 | 60 | 0.000516 | net.akamaihd
114 | 20304242 | 375 | 0.000067 | org.hbr
115 | 20302310 | 750 | 0.000040 | com.git-scm
116 | 20300014 | 937 | 0.000032 | com.wikia
117 | 20298546 | 137 | 0.000191 | com.spotify
118 | 20296012 | 485 | 0.000053 | edu.yale
119 | 20295516 | 113 | 0.000271 | com.jimdo
120 | 20293140 | 554 | 0.000047 | com.cbsnews
121 | 20291946 | 717 | 0.000043 | com.economist
122 | 20290574 | 214 | 0.000109 | com.washingtonpost
123 | 20288504 | 140 | 0.000188 | jp.co.yahoo
124 | 20286470 | 285 | 0.000086 | com.huffingtonpost
125 | 20284558 | 316 | 0.000080 | org.un
126 | 20281874 | 410 | 0.000063 | fr.free
127 | 20279946 | 473 | 0.000054 | edu.berkeley
128 | 20275446 | 287 | 0.000086 | com.cnbc
129 | 20273280 | 245 | 0.000099 | com.dribbble
130 | 20271584 | 576 | 0.000046 | org.arxiv
131 | 20269716 | 151 | 0.000172 | com.issuu
132 | 20257038 | 545 | 0.000047 | com.mysql
133 | 20256262 | 160 | 0.000157 | com.twimg
134 | 20252532 | 107 | 0.000285 | com.statcounter
135 | 20251682 | 338 | 0.000075 | uk.co.telegraph
136 | 20247478 | 305 | 0.000081 | com.w3schools
137 | 20246682 | 561 | 0.000047 | com.gitlab
138 | 20242210 | 802 | 0.000038 | edu.columbia
139 | 20240978 | 524 | 0.000049 | gov.noaa
140 | 20230666 | 122 | 0.000230 | com.ytimg
141 | 20229900 | 119 | 0.000233 | com.youtube-nocookie
142202276567310.000042org.ieee
143202271263330.000075org.npr
144202255287290.000042io.readthedocs
145202252062860.000086org.acm
146202223143390.000074com.time
1472022043011800.000025org.eclipse
148202203822410.000100org.ampproject
149202186163440.000074com.fc2
150202157301420.000185com.wixsite
151202136927550.000040edu.washington
152202101224210.000061com.force
153202098642760.000089com.prnewswire
154202091305000.000052com.buzzfeed
155202071364340.000060com.nationalgeographic
156202064024030.000063com.nature
157202038262000.000118gle.forms
158202024907990.000038org.sciencemag
159202011444280.000061com.theatlantic
160202001048710.000035com.stackexchange
161201981422800.000088com.sciencedirect
162201854003320.000075com.staticflickr
163201845284950.000052uk.co.independent
164201822562630.000093gov.ca
165201809726870.000043org.worldbank
166201759944350.000060com.mozilla
167201754007340.000041com.marketwatch
1682016809810870.000027com.hatenablog
169201670403640.000069com.nypost
170201640166460.000043org.bitbucket
171201611922190.000107com.ft
172201511164630.000056com.pixabay
173201437963540.000071jp.co.rakuten
174201426527430.000041edu.upenn
175201401262770.000089org.doi
176201393769660.000031jp.livedoor
177201365461980.000120uk.co.google
178201349324070.000063uk.co.dailymail
179201344047240.000042org.pbs
180201339362580.000094net.behance
181201329141920.000124org.wikimedia
182201278609170.000033edu.jhu
183201278284540.000057gov.whitehouse
184201223528560.000035org.weforum
185201221704160.000062com.dailymotion
1862011705414870.000020com.warnerbros
187201118983260.000077org.opensource
1882011079810910.000027cn.com.chinadaily
189201099165480.000047me.about
190201098202320.000103jp.ameblo
191201089405580.000047com.oup
192201034283250.000077com.digg
193200974184550.000056com.entrepreneur
194200951086310.000044com.vice
195200941427490.000040com.qz
1962009269212590.000024com.discovery
197200911544440.000058com.goodreads
198200910524470.000057gg.discord
1992008291011090.000027com.sap
200200821863530.000071com.scribd
201200794121880.000128com.feedburner
202200761464660.000055com.fortune
203200755565800.000045com.gartner
2042007259810120.000029com.500px
205200721364580.000056jp.ne.sakura
206200674001760.000139com.imdb
207200609507320.000042uk.co.blogspot
2082005905417350.000018com.amd
209200582289470.000032edu.princeton
210200566668900.000034org.cambridge
21120056572510.000714com.fb
212200562728480.000036com.evernote
213200544721440.000180com.dropbox
21420053532390.000951com.wixstatic
215200516626170.000044org.unesco
2162005094014610.000020com.fandom
217200481522940.000084com.wiley
218200461347680.000039com.withgoogle
2192003942610150.000029org.altervista
2202003901023370.000014com.wolfram
221200379207980.000038com.slate
2222003148412010.000025org.kernel
2232002816410490.000028edu.purdue
224200252825690.000046page.g
225200213407860.000038com.trello
226200170182300.000103com.disqus
227200127967570.000040org.eff
228200104309510.000031com.merriam-webster
229200046864930.000052gov.usda
230200042409810.000030com.netlify
2312000399421790.000015com.diigo
232200029188070.000038com.vox
233200026901800.000135org.allaboutcookies
2342000222012060.000025com.jetbrains
2351999941814160.000021edu.arizona
236199943845420.000047com.tandfonline
237199930308440.000036com.foxnews
238199921842910.000085com.live
239199911421750.000140com.xing
240199898749090.000033com.politico
241199885703200.000079com.outlook
2421998503611350.000026jp.ne.goo
243199833407540.000040au.net.abc
2441998268019450.000016com.wikidot
245199779347930.000038com.investopedia
2461997757410660.000028edu.uchicago
2471997682010090.000029edu.wisc
248199759221970.000120com.eepurl
2491997256010390.000028com.bostonglobe
250199720967750.000039org.semver
251199695946190.000044com.sagepub
252199691824970.000052gov.fda
253199684423470.000073net.windows
2541996808415680.000019edu.osu
255199653863190.000079com.nbcnews
256199639462440.000099com.myshopify
257199628925850.000045cn.google
258199625306080.000044site.business
259199610668320.000036com.sciencedaily
2601996038010440.000028com.strikingly
2611995636612360.000024edu.unc
2621995626814460.000021edu.virginia
2631995603412040.000025co.elastic
2641995296011940.000025com.nymag
2651995050022060.000015com.renren
266199504907420.000041gov.house
2671995044821630.000015sg.edu.nus
2681994797622850.000014org.wikibooks
2691994728419610.000016com.googlesource
270199405982350.000103com.wpengine
271199401583230.000078com.googlecode
272199392127610.000040gov.senate
273199380085130.000051com.herokuapp
274199377384520.000057org.pewresearch
275199374925670.000046org.iana
2761993695410930.000027com.podbean
277199358189820.000030com.alexa
2781993474216290.000019gd.is
279199338041030.000301com.paypalobjects
280199327408050.000038org.unicef
281199324167180.000043com.newyorker
282199308589690.000031uk.co.thetimes
283199293244040.000063com.patreon
2841992826610600.000028com.lifehacker
285199259403810.000066com.criteo
286199245249970.000030com.huffpost
287199225763030.000081com.squareup
288199225108390.000036ca.cbc
2891992180811450.000026org.wiktionary
290199188441460.000178com.addtoany
291199181742010.000117com.optimizely
2921991805213420.000022edu.msu
2931991598613710.000022com.history
294199133844180.000062com.calendly
2951990586011810.000025com.udemy
296199033648090.000037uk.ac.ox
297199029201720.000145com.amazon-adsystem
29819899332490.000743com.googleadservices
299198969241550.000167com.opera
300198909708870.000034org.fao
3011989083210170.000029com.ecwid
302198908264760.000054com.googleblog
303198871422110.000110com.stackoverflow
3041988619014190.000021uk.ac.lse
305198853123600.000070com.getpocket
3061988445616670.000018org.maven
307198838009150.000033uk.co.guardian
308198833581690.000148org.bbb
3091988108413370.000022com.aljazeera
310198807902550.000095com.aliyuncs
3111987993827230.000013net.pixnet
3121987438431800.000011net.hinet
3131986902811700.000025com.smithsonianmag
3141986883213470.000022edu.ucdavis
315198682588940.000034gov.congress
3161986719013200.000023edu.illinois
3171986516811200.000026com.theglobeandmail
3181986330610360.000029gov.archives
319198624144920.000052it.placehold
32019861934930.000359net.facebook
3211986137616150.000019hk.com.google
3221986092214730.000020ca.sfu
3231985635216760.000018blog.home
3241985529010730.000027com.apnews
325198548929630.000031com.ssrn
3261985368233830.000010com.wizards
3271985110219970.000016com.nabble
328198510327600.000040com.chinaz
3291985041236670.000010cn.edu.sjtu
3301984814014840.000020com.urbandictionary
3311984443611360.000026com.scmp
3321984232614890.000020ms.1drv
3331984179643610.000008tw.com.gamer
3341983858213920.000021com.flipboard
335198381669190.000033co.g
336198375425470.000047com.gofundme
3371983699620970.000015com.france24
3381983563614050.000021jp.geocities
3391983365413700.000022com.ibtimes
340198313625810.000045com.biomedcentral
3411983005611280.000026com.britannica
3421982942021740.000015com.oregonlive
343198270624120.000062com.kickstarter
344198262149620.000031com.adjust
345198241888670.000035gov.fcc
346198240487150.000043uk.co.mirror
347198232665890.000045us.icio
3481982317211290.000026com.mediafire
3491982176814320.000021edu.tamu
350198213105870.000045com.usnews
3511982044213140.000023org.greenpeace
352198202529850.000030edu.academia
3531981948613810.000021com.livescience
3541981597216840.000018gov.cia
3551981456413250.000023com.akamai
356198132669300.000032com.chicagotribune
357198115381560.000167com.npmjs
3581981110014290.000021net.seesaa
359198101203290.000076es.google
3601980971012380.000024com.reverbnation
361198094905500.000047com.quora
3621980831434810.000010com.proboards
3631980626810400.000028com.thehill
364198038403210.000078org.python
3651980147611320.000026org.jstor
3661980101817220.000018ca.mcgill
367197999821670.000149com.zendesk
368197928909990.000030com.thelancet
3691979224610940.000027com.jamanetwork
3701978859419350.000016uk.ac.manchester
371197852145400.000048com.udacity
3721978332813720.000021ca.utoronto
373197830825790.000046com.bigcartel
3741978223024870.000013org.wikiquote
3751978118613570.000022edu.rutgers
376197800288960.000034org.apa
377197797184390.000059com.newsweek
378197785389200.000033com.healthline
3791977798222040.000015com.knowyourmeme
380197756103280.000077com.tinyurl
381197755587260.000042gov.state
382197750922160.000108com.unsplash
3831977370217080.000018ca.ualberta
384197723784060.000063com.githubusercontent
3851977190014710.000020com.asahi
386197712202590.000094org.nodejs
387197694364750.000054com.latimes
3881976925810270.000029com.timeanddate
389197686864320.000060com.slack
390197684107690.000039jp.shinobi
3911976797616740.000018com.buzzfeednews
392197650384150.000062com.elsevier
3931976472213350.000022edu.gatech
3941976429828610.000012com.youdao
395197612568950.000034com.brightcove
3961975973017740.000017com.bankofamerica
3971975953025690.000013edu.byu
3981975876019180.000016com.voanews
3991975758631640.000011com.opendns
4001975681614250.000021com.sky
4011975578023360.000014com.slides
4021975446213730.000021com.dw
4031975445811580.000026com.nikkei
404197525909040.000033com.cbslocal
4051974876622360.000014net.earthlink
406197486783910.000064com.cnet
4071974815016420.000018com.xrea
4081974743013540.000022uk.co.huffingtonpost
409197464241820.000133com.eventbrite
4101974637010710.000027com.nydailynews
4111974409013050.000023me.vk
412197431949180.000033gov.bls
4131974154214580.000020org.ap
414197409363840.000066net.imgix
4151973986024140.000014org.aclweb
4161973975016410.000018com.axios
417197389409870.000030com.wattpad
4181973753017130.000018com.straitstimes
419197374124740.000054com.ted
4201973687412940.000023edu.brookings
421197286349670.000031int.coe
422197275802120.000109com.etsy
4231972711223920.000014com.biography
424197260808650.000035gov.va
425197257102170.000107com.typepad
4261972462819320.000016com.cocolog-nifty
4271972358016080.000019com.reference
428197207405530.000047com.livejournal
4291971740620960.000015ru.kremlin
430197163548150.000037uk.gov.service
431197153782980.000083com.techcrunch
4321971235824620.000013org.wikisource
4331971229615530.000019com.foxbusiness
4341971162012810.000023mil.army
4351971124417610.000017com.itv
436197102607330.000041com.deviantart
4371970595213110.000023de.mpg
438197052888450.000036gov.justice
4391970457419930.000016cn.people
4401970324812620.000024au.com.smh
4411970165617630.000017org.tensorflow
4421970163412230.000024org.ohchr
443197010005680.000046ru.gov
444197001364000.000064com.technorati
4451969959621340.000015jp.co.japantimes
44619697954830.000413com.list-manage
4471969708810680.000028com.thedrum
4481969675415380.000019uk.co.standard
449196954301850.000131com.rawgit
4501969421621200.000015com.oxforddictionaries
4511969300622410.000014com.shutterfly
4521969208231470.000011tw.edu.ntu
4531969156425500.000013com.smashwords
4541968986218620.000016edu.unl
4551968876824020.000014org.fas
456196886462960.000084uk.org.ico
4571968813827100.000013tv.blip
458196860669570.000031com.bandsintown
4591968444835160.000010cn.org.china
4601968296015500.000019uk.co.express
4611967970810820.000027jp.jugem
4621967915836560.000010info.webry
4631967873014030.000021gov.uscourts
4641967794421570.000015au.edu.unimelb
46519675766920.000363com.wsimg
466196748682830.000086ru.rambler
4671967373819210.000016com.washingtontimes
468196717543510.000072com.proofpoint
46919669412740.000441net.jsfiddle
470196683527880.000038org.mediawiki
4711966815828510.000012jp.blog
4721966774014790.000020com.firebaseapp
4731966741816180.000019com.webnode
4741966594021730.000015com.pbworks
4751966574833740.000011com.patheos
4761966568431350.000011uk.co.timesonline
4771966398021710.000015google.ai
478196633542330.000103com.squarespace
4791966218829040.000012fr.rfi
4801966098414540.000020gov.supremecourt
4811965920018890.000016int.unfccc
482196585343310.000076com.office
483196565265770.000046pl.google
484196540989910.000030gov.wa
485196527968040.000038gov.sba
4861965262612670.000023com.cognitoforms
4871965006622070.000015org.csis
488196490083660.000068io.codepen
4891964875023440.000014com.kobo
490196465121100.000281com.mailchimp
4911964342816710.000018edu.wustl
4921964257227340.000013edu.kit
4931964233414800.000020org.hrw
494196422769530.000031edu.umich
4951964185613890.000021com.dictionary
496196415448360.000036com.mapquest
4971964083617470.000017org.worldcat
4981964027636210.000010net.aljazeera
499196401443570.000071com.photobucket
5001963994820460.000015net.cnki
5011963851017050.000018com.secondlife
5021963841624210.000014int.wmo
5031963788810890.000027org.ilo
5041963745011000.000027google.blog
505196366923780.000067com.meetup
506196346349950.000030uk.co.pinterest
5071963377033970.000010com.freehostia
5081963041232560.000011com.doodlekit
509196297469360.000032com.arstechnica
5101962837037300.000009com.colourlovers
5111962835616960.000018ru.ucoz
512196282989520.000031com.thenextweb
5131962445822860.000014org.unep
5141962234222520.000014org.icrc
5151962180814240.000021com.findlaw
5161962113423340.000014com.similarweb
517196206964810.000054com.gmail
5181961930430400.000012io.soup
5191961624614370.000021com.imageshack
5201961595627850.000013com.sputniknews
5211961407830800.000012com.smore
5221961323232460.000011org.iucnredlist
5231961176631170.000011com.kinja
5241961176018830.000016com.csmonitor
525196116041450.000180ru.mail
5261961008813390.000022gov.uscis
527196085544460.000058net.secureservercdn
5281960631430040.000012sh.now
529196057484270.000061tv.twitch
5301960499415800.000019link.app
531196008144400.000059com.statista
5321959916036760.000010jp.hatenablog
5331959555043560.000008com.coroflot
5341959526431770.000011org.jenkins-ci
5351959515817570.000017gov.oregon
5361959313032000.000011li.paper
5371959310638470.000009com.pixar
5381958987830950.000011com.shell
5391958819440350.000009com.scienceblogs
5401958618816250.000019org.amnesty
541195848248920.000034com.thedailybeast
5421958246417670.000017org.pypi
5431958234621490.000015com.foreignpolicy
5441958031028490.000012com.instapaper
5451957967229100.000012org.accessnow
5461957861416020.000019com.surveygizmo
5471957778017330.000018ca.globalnews
5481957620031750.000011de.uni-koeln
549195761982390.000101io.shields
5501957618433770.000011org.lds
5511957590222380.000014org.rand
552195747902070.000114com.salesforce
5531957454434380.000010net.mootools
5541957442823570.000014at.ac.univie
5551957418240500.000009org.marxists
5561957166428600.000012org.panda
5571957119428060.000013com.oprah
5581956857618740.000016com.justia
5591956797034710.000010org.avaaz
5601956785428800.000012com.openai
5611956776435970.000010org.neocities
5621956726037530.000009cn.edu.sdu
563195649607620.000040com.netflix
564195641204980.000052com.oreilly
5651956308644050.000008com.yam
566195622482270.000105uk.co.amazon
567195622048660.000035com.zoho
568195609566290.000044com.zdnet
5691955996612980.000023ly.snip
5701955879017900.000017ch.ipcc
571195586649930.000030uk.parliament
5721955850837870.000009com.nestle
5731955630412540.000024se.google
5741955629229970.000012com.treehugger
5751955518410110.000029net.nocookie
5761955509646440.000008com.x0
5771955336836310.000010org.tvtropes
5781955099211410.000026org.sphinx-doc
5791954999421220.000015ru.mos
5801954882030440.000012es.csic
5811954853029130.000012uk.gov.companieshouse
5821954657610340.000029com.engadget
5831954623011830.000025com.here
5841954549250600.000007com.dbs
5851954543841030.000009br.ufrj
5861954420421590.000015edu.colostate
5871954339827060.000013de.uni-heidelberg
5881954050030590.000012com.pearltrees
5891953926821760.000015net.openid
5901953788026000.000013com.mystrikingly
5911953784438800.000009com.chinatimes
5921953583424000.000014link.page
5931953418223540.000014com.real
5941953343218360.000017org.ncsl
595195322883010.000082com.surveymonkey
596195319303620.000070com.hp
5971953141211930.000025org.js
5981953070021350.000015com.123formbuilder
5991952884224260.000014org.vim
6001952810432050.000011pl.wp
6011952801826020.000013au.com.sbs
602195267801700.000148com.yelp
6031952621624990.000013uk.ac.kcl
6041952434613380.000022org.aarp
6051952369226210.000013th.co.google
6061952315610060.000029uk.gov.legislation
607195230422600.000094com.getbootstrap
6081952285636630.000010com.magcloud
6091952227439900.000009com.zynga
6101952194212680.000023tw.com.google
6111952192228290.000013com.kaggle
612195201309480.000031gov.gpo
613195197429460.000032com.about
6141951971432730.000011org.rsf
6151951874029760.000012org.tigris
6161951822427270.000013uk.ac.leeds
6171951551235350.000010de.dw
6181951543430190.000012org.cfr
6191951457432530.000011de.uni-freiburg
6201951357036400.000010de.uni-konstanz
6211951271438810.000009ua.at
6221951125421170.000015info.worldometers
6231951031446570.000008com.embarcadero
6241950937029990.000012vn.zing
6251950913432290.000011com.bangkokpost
6261950880436150.000010ly.rebrand
6271950854820080.000016gov.ky
6281950842640090.000009org.wilsoncenter
6291950677440590.000009jp.hatenadiary
6301950628443740.000008com.musictoday
6311950538838240.000009org.constitutioncenter
632195051863720.000067com.booking
6331950440225790.000013com.eiseverywhere
6341950380040380.000009com.itsnicethat
6351950377633310.000011il.ac.tau
6361950209623590.000014mx.com.google
6371950080637360.000009com.db
638194989283120.000080com.ebay
6391949858835780.000010jp.hateblo
6401949816633480.000011org.democracynow
6411949729639750.000009edu.odu
6421949681228150.000013dk.au
6431949662642200.000008com.etymonline
6441949618428850.000012uk.gov.metoffice
645194957563610.000070com.skype
6461949556635700.000010com.hsbc
6471949484422280.000015com.bankrate
6481949410422400.000014gov.wi
6491949335218150.000017fi.google
6501949330644260.000008com.x10host
6511949213632240.000011org.royalsociety
652194910968170.000037com.pexels
653194903585320.000048com.mashable
6541949028246140.000008com.epochtimes
6551949001811740.000025edu.ucla
6561948965632260.000011cc.reurl
6571948941434300.000010com.dailykos
6581948936037420.000009uk.ac.uea
6591948805037050.000010ca.shaw
6601948610419680.000016uk.gov.tfl
6611948598834340.000010uk.ac.nhm
6621948503230600.000012com.ipage
6631948475424980.000013com.prweek
6641948459818190.000017gov.usembassy
6651948396648610.000007am.do
6661948363630860.000011com.viki
6671948351832520.000011se.liu
6681948271830660.000012com.coca-colacompany
6691948258042320.000008br.ufrgs
6701948249836390.000010de.uni-kiel
6711948134014530.000020com.speakerdeck
6721948071830770.000012net.openreview
6731948066022080.000015de.auswaertiges-amt
674194802482080.000113com.hubspot
6751947976220260.000016com.lexisnexis
6761947870021060.000015net.ucoz
6771947755234940.000010com.iconarchive
678194775328190.000037com.steampowered
679194772867560.000040com.xiti
6801947713224860.000013com.post-gazette
6811947689833690.000011com.eklablog
6821947663229370.000012uk.co.bbci
6831947637819110.000016hu.google
6841947616043990.000008com.jacobinmag
6851947597433230.000011uk.ac.sussex
6861947436830680.000012uk.ac.qmul
6871947421239300.000009nf.co
6881947301441140.000009com.collinsdictionary
6891947289652150.000007com.evaair
6901947284625720.000013com.marketwire
6911947258031380.000011au.com.telstra
6921947211439160.000009it.unitn
693194716468980.000034com.visualstudio
6941947133038070.000009in.ernet
6951947099429060.000012nl.rug
6961946870852970.000007org.arkive
697194682522520.000096org.drupal
6981946705034600.000010ca.dal
6991946704636930.000010com.canada
7001946564214510.000021com.tinypic
7011946530431360.000011org.wri
7021946503436980.000010com.la-croix
7031946410845570.000008com.mitsubishielectric
7041946382847480.000008com.gamejolt
7051946297627890.000013gr.google
7061946288248820.000007cz.webgarden
7071946240430790.000012my.com.thestar
708194618302690.000092net.php
7091946164043290.000008au.gov.fairwork
7101946077022790.000014co.pcdn
7111946017639430.000009uk.ac.essex
712194599841210.000231org.networkadvertising
7131945968433960.000010org.rferl
7141945906842110.000008com.sc
7151945902032920.000011com.blogfa
7161945879433820.000010ca.yelp
7171945758041020.000009edu.utm
7181945724856940.000007com.anghami
7191945653252100.000007su.clan
7201945614440950.000009it.justpaste
721194560064140.000062com.sxsw
7221945591432580.000011com.waterstones
7231945460239600.000009com.jigsy
724194545168380.000036com.intel
7251945439440320.000009ee.ut
726194532429160.000033com.docker
727194529887380.000041com.samsung
7281945180234220.000010es.ucm
7291945071825030.000013com.washingtonexaminer
7301945034239510.000009tl.page
7311945020622090.000015org.wbur
7321944903641120.000009site.negocio
7331944892227730.000013com.yell
7341944851639880.000009com.fatcow
7351944826632820.000011pl.poznan
736194481981350.000194com.youku
7371944793028780.000012ae.thenational
7381944776647050.000008id.co.kaskus
7391944766834070.000010com.afp
7401944760253360.000007net.manilatimes
741194467344190.000062com.caniuse
7421944616814700.000020com.pastebin
7431944591033870.000010uk.org.rspb
744194457367650.000039com.moz
7451944437640270.000009lv.draugiem
7461944160425080.000013gov.dni
7471944087425930.000013ro.google
7481944014429460.000012com.broadwayworld
7491943957437500.000009ru.msu
7501943937437660.000009pl.cba
7511943933241370.000009org.rfa
7521943928055620.000007org.bukkit
7531943908620130.000016scot.gov
754194388681330.000200com.constantcontact
7551943882656380.000007org.adbusters
7561943809445170.000008google.design
7571943765441540.000008com.macobserver
7581943708816490.000018fr.pagesjaunes
7591943702025020.000013com.thenation
7601943677639730.000009com.bbcamerica
7611943455648570.000007com.orgfree
7621943381029780.000012com.channelnewsasia
763194325067350.000041gov.sec
7641943250240080.000009com.teamspeak
7651943243028000.000013org.gnupg
7661943226037800.000009com.the-scientist
7671943225230150.000012com.laweekly
7681943144629210.000012au.edu.sydney
7691943008435770.000010uk.co.yougov
7701943000031400.000011vn.com.google
7711942994244170.000008com.50webs
7721942900431240.000011org.repec
7731942893832150.000011org.ourworldindata
7741942789035060.000010com.tradingeconomics
7751942735231020.000011tw.com.pchome
7761942658233320.000011com.monday
7771942655635560.000010org.project-syndicate
7781942555223310.000014com.amebaownd
7791942489015960.000019org.whatbrowser
7801942475019560.000016org.americanbar
7811942468037390.000009ie.thejournal
782194241521040.000298com.stripe
7831942414040140.000009com.hatenadiary
7841942406029330.000012org.thinkprogress
7851942371230730.000012uk.gov.london
7861942305439270.000009com.thesaurus
7871942300634750.000010net.webself
7881942296434320.000010io.pantheon
7891942171234200.000010uk.ac.exeter
7901942150843430.000008com.appledaily
7911942111835280.000010com.bravesites
7921942081651780.000007com.bambuser
7931942059233790.000011com.foreignaffairs
7941941937824320.000013com.instructables
7951941638821850.000015vn.vietnamnet
7961941473639940.000009com.webcindario
7971941432828230.000013org.ewg
7981941393445340.000008ws.nimb
7991941377828330.000013org.fullfact
800194133522560.000095us.zoom
8011941255636850.000010com.encyclopedia
8021941247438970.000009de.uni-erlangen
8031941082253410.000007net.boards
804194095983410.000074com.histats
8051940953442010.000008is.pse
806194094367480.000040fm.last
8071940780836610.000010com.mongabay
8081940704032200.000011me.site123
8091940633834360.000010com.seetickets
8101940555058380.000007com.gamigo
8111940440016660.000018com.materialdesignicons
8121940410851400.000007bd.com.google
813194032427900.000038com.venturebeat
8141940121846010.000008uk.org.phrases
8151940078032130.000011com.instructure
8161940029828170.000013gov.arkansas
81719399890720.000444com.livestream
8181939955440810.000009cat.uab
8191939948635460.000010org.lacity
8201939937236120.000010com.heraldscotland
8211939837014990.000020com.teachable
8221939667228950.000012com.foodandwine
8231939575212330.000024com.createjs
8241939427422660.000014com.ajc
8251939417239500.000009com.rappler
8261939403023550.000014net.noscript
8271939398241400.000009jp.doorblog
8281939288228730.000012com.timeshighereducation
829193922382750.000089com.bandcamp
8301938933239690.000009jp.ne.hi-ho
8311938809436290.000010net.inquirer
832193878825520.000047com.cisco
8331938731840760.000009pl.lublin
8341938637016570.000018com.pcworld
835193834042660.000093com.typeform
836193828862030.000116com.naver
8371938269837230.000010gov.bts
8381938219218160.000017jp.makeshop
8391938210244620.000008com.tor
8401938207245130.000008com.weightwatchers
8411938134614380.000021org.khanacademy
842193812749540.000031com.thinkwithgoogle
8431938102033850.000010uk.ac.jisc
8441938023840880.000009ly.genial
8451937998640070.000009com.themoscowtimes
8461937850032720.000011com.nyt
8471937843437600.000009com.springernature
8481937835633900.000010int.cbd
8491937785460450.000006es.xurl
8501937689817560.000017com.netsolhost
8511937659838520.000009au.edu.griffith
8521937605447400.000008co.edu.unal
8531937604040740.000009kr.co.koreatimes
854193745887270.000042com.deloitte
8551937430049860.000007org.edc
8561937394041490.000008vn.tienphong
8571937347635150.000010com.thediplomat
8581937293240990.000009uk.ac.lancs
8591937279850060.000007com.inoreader
8601937274649220.000007com.ueuo
8611937259415850.000019tv.ustream
8621937257632340.000011com.tapatalk
8631937235634160.000010nl.wur
8641937210648480.000007net.hypermart
8651937163622930.000014org.kff
866193693563980.000064com.pubmatic
8671936898236250.000010org.grist
8681936848030880.000011tw.gov.cdc
8691936828833890.000010com.gothamist
8701936813011060.000027com.gizmodo
8711936811641010.000009com.globalpost
872193676768140.000037gov.nist
8731936753645630.000008org.globalsecurity
8741936645445470.000008build.bazel
8751936638437820.000009us.ms.state
8761936587842560.000008gr.ntua
8771936577644440.000008se.thelocal
8781936537229630.000012com.politifact
8791936512813170.000023com.ensighten
8801936358850970.000007ru.my1
8811936268034680.000010com.rabbitmq
8821935969841380.000009com.elasticbeanstalk
8831935957413640.000022com.billboard
8841935912247660.000008cc.dict
8851935877456870.000007fi.mbnet
886193573908790.000035com.aliexpress
887193569182100.000111to.amzn
8881935566842750.000008edu.ohio
8891935554634520.000010com.thejakartapost
8901935535032770.000011vn.com.dantri
8911935508052850.000007com.galvanize
8921935488034840.000010jp.go.ndl
8931935479047100.000008com.kiwibox
8941935451421400.000015org.linuxfoundation
8951935450048010.000007ru.nnov
8961935316642880.000008gr.auth
8971935297022570.000014net.vnexpress
8981935177029000.000012com.crashlytics
8991935159410450.000028com.dropboxusercontent
9001935082834390.000010com.scotusblog
9011935071240900.000009org.carnegieendowment
902193502783950.000064com.atlassian
9031934972634650.000010com.study
904193487243500.000072com.mapbox
9051934853210460.000028com.redhat
9061934788617990.000017com.bravenet
9071934746042840.000008uk.org.npg
9081934715244630.000008com.btplc
9091934714852890.000007ru.drom
9101934654224300.000013com.vimeopro
9111934590044190.000008edu.marquette
912193456444260.000061com.adweek
913193451449140.000033com.shutterstock
9141934509010160.000029com.ubuntu
9151934196057120.000007in.ac.nptel
9161934148812270.000024com.msdn
9171934071447070.000008com.vocabulary
9181934068039290.000009edu.uaf
9191933965839190.000009com.atavist
9201933945632010.000011com.healthgrades
9211933909225460.000013com.kinstacdn
9221933838423450.000014com.gazhall
9231933793853980.000007com.asmallorange
9241933780037970.000009com.generalmills
9251933617645850.000008vn.vtc
9261933590815190.000020cn.gov.mofcom
927193337787970.000038com.box
9281933360639660.000009si.uni-lj
9291933332241700.000008az.president
9301933319417880.000017org.reactjs
9311933241236050.000010com.postaffiliatepro
9321933192251920.000007edu.uah
9331933128035990.000010org.openedition
9341933069648380.000007com.kapook
9351933038241530.000008org.caringbridge
936193303744830.000053com.aol
9371932961423030.000014org.nfpa
9381932953859560.000006com.glosbe
9391932919441240.000009com.mcall
9401932762242890.000008ru.tmweb
9411932687641260.000009uk.co.liverpoolecho
9421932642242440.000008com.atwebpages
9431932598010670.000028com.freepik
9441932479040850.000009org.specialolympics
9451932386848450.000007net.freeforums
9461932367647440.000008uk.ac.westminster
9471932353240920.000009com.tok2
9481932346010250.000029com.elpais
9491932315049460.000007tw.com.sina
9501932250832960.000011com.wowza
951193223063170.000079com.webs
9521932202446970.000008com.warriorplus
9531932191834140.000010com.cityam
9541932181244820.000008org.fee
9551932152048540.000007tw.edu.ntnu
9561932129649620.000007com.sparknotes
9571932020245160.000008com.newspapers
9581931963421920.000015com.tutsplus
9591931960058680.000007com.ananova
9601931927438180.000009org.opensecrets
961193191346330.000044gov.uspto
9621931872256800.000007su.moy
9631931836610130.000029com.uk
9641931826649360.000007ru.pr-cy
9651931805838270.000009cz.centrum
9661931778041580.000008edu.niu
9671931532016650.000018org.webkit
9681931501446920.000008pl.edu.amu
9691931408451860.000007com.artfire
9701931389438000.000009org.ascd
9711931210638010.000009edu.scu
9721931174243070.000008com.taipeitimes
9731931156843510.000008edu.whoi
9741931085459490.000006com.voatiengviet
9751931074831000.000011com.broadcastingcable
9761931072046550.000008hk.rthk
9771931024657030.000007com.enotes
978193099104880.000053com.indiatimes
979193096608600.000035com.playstation
9801930904048660.000007com.brothersoft
9811930894827080.000013uk.gov.defra
982193076062310.000103org.whatwg
9831930717844510.000008com.batchgeo
984193071187510.000040com.psychologytoday
9851930636842630.000008uk.co.lrb
9861930635050340.000007ca.pe.gov
9871930588441590.000008com.ecowatch
9881930382041950.000008com.williamhill
9891930354857670.000007pt.ipp
9901930297248430.000007uk.org.38degrees
9911930162413030.000023com.technologyreview
9921930146440910.000009org.spie
993193010689590.000031com.libsyn
9941930057247950.000007com.storeboard
9951930054832600.000011de.bmel
9961929944847490.000008net.onlinewebshop
9971929927438720.000009ru.1gb
998192986542790.000088com.automattic
9991929850238700.000009com.piie
10001929744053060.000007com.allthatsinteresting
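
The names in the table are in reverse domain name notation (for example, com.bootstrapcdn or uk.co.bbc). The following is a minimal sketch, not part of the cc-webgraph tooling, of how such a name could be converted back to its usual form by reversing the dot-separated labels. The function name `unreverse_name` is hypothetical and chosen only for illustration.

```python
def unreverse_name(rev_name: str) -> str:
    """Convert a reversed name such as 'uk.co.bbc' back to 'bbc.co.uk'.

    Hypothetical helper for illustration: it only reverses the order of
    the dot-separated labels and does not restore a stripped 'www.' prefix.
    """
    return ".".join(reversed(rev_name.split(".")))


if __name__ == "__main__":
    # A few reversed names taken from the table above.
    for rev in ["com.bootstrapcdn", "uk.co.bbc", "jp.ne.sakura"]:
        print(rev, "->", unreverse_name(rev))
```

Applying the function twice returns the original reversed name, so the same helper also converts ordinary names into the notation used in the graph files.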

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!