The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th. It includes page captures of 850 million URLs not contained in any crawl archive before.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively.
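The prefixing step can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function names are made up for this example, and the relative path used in the usage note is a hypothetical placeholder rather than a real file in the archive.

```python
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def full_paths(lines, prefix=HTTP_PREFIX):
    """Prepend the chosen prefix to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


def read_listing(listing_file, prefix=HTTP_PREFIX):
    """Read a locally downloaded, gzipped path listing and return full paths."""
    with gzip.open(listing_file, "rt", encoding="utf-8") as handle:
        return full_paths(handle, prefix)
```

For example, `full_paths(["crawl-data/EXAMPLE/file.warc.gz"], S3_PREFIX)` returns a one-element list holding the same path with `s3://commoncrawl/` in front.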
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact us for sponsorship information.
The crawl archive for November 2019 is now available! It contains 2.55 billion web pages or 250 TiB of uncompressed content, crawled between November 11th and 23rd with a short operational break on Nov 16th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.
What’s new?
We’ve added two new fields to the URL indexes (CDX and columnar):
The redirect target location is stored in the CDX JSON field “redirect” and in the column “fetch_redirect” of the columnar index. The value is extracted from the HTTP header field “Location” if the HTTP status code indicates an HTTP redirect. A relative URL path is converted to an absolute URL using the page URL as the base URL. If the “Location” value is missing, or is neither a valid URL nor a valid relative URL path, the key is absent from the CDX JSON and the column value is null.
Truncation of the WARC record payload is indicated by the CDX JSON key “truncated” and the column “content_truncated”. The reason for the truncation is given only for truncated records and follows the WARC header field “WARC-Truncated”.
Additional details and examples can be found in the corresponding PR #15.
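The redirect handling described above can be illustrated with a short Python sketch. It is not the crawler's actual code: the function name, the set of status codes treated as redirects, and the validity check (an http or https URL with a host) are assumptions made for this example.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: these status codes are treated as HTTP redirects.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(page_url, status, location):
    """Return the absolute redirect target, or None if there is no usable one."""
    if status not in REDIRECT_STATUSES or not location:
        return None
    try:
        # A relative path is resolved against the page URL as the base URL.
        target = urljoin(page_url, location.strip())
        parts = urlsplit(target)
    except ValueError:
        return None
    # Assumption: only http(s) URLs with a host count as valid targets.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return target
```

A `None` result corresponds to the absent “redirect” key or the null “fetch_redirect” column.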
We’ve fixed a bug affecting the capture time (WARC-Date) in the robots.txt subset: the value had been extracted from the “Date” field of the HTTP header and was occasionally wrong. Please see issue #14 for further details.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively.
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.
What’s new?
The following improvements have been made for this webgraph release:
the graphs now also include edges stemming from HTTP 303 “See Other” redirects (in addition to other HTTP redirect status codes)
the Common Crawl robots.txt WARC files are used to get additional host-level redirects, including hosts which exclude their entire content in their robots.txt
links from robots.txt files to sitemaps are now extracted directly from the robots.txt WARC files; see the Feb/Mar/Apr 2018 web graph announcement for more details about this type of host-level link
Host-level graph
The graph consists of 820 million nodes and 4.55 billion edges. It includes dangling nodes, i.e., hosts that have not been crawled but are linked to from a crawled page. There are 752 million dangling nodes (92%), and the largest strongly connected component contains 50 million nodes (6%).
You can download the graph and the ranks of all 820 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/. Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/ as a prefix to access the files from anywhere.
Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
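The reversal can be reproduced with a small Python helper. This is a sketch for illustration only; the function names are invented here, and the graph files may apply further normalization that this example does not cover.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(reversed_host):
    """Restore the usual label order; a stripped 'www.' cannot be recovered."""
    return ".".join(reversed(reversed_host.split(".")))
```

With these helpers, `reverse_host("www.subdomain.example.com")` gives `com.example.subdomain`, and `unreverse_host` turns that back into `subdomain.example.com`.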
Domain-level graph
The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
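To make the aggregation step concrete, here is a toy Python sketch that maps reversed host names to pay-level domains and collapses host-level edges accordingly. It is not the cc-webgraph implementation: the function names are hypothetical, the suffix set passed in is a tiny hand-written stand-in for the real public suffix list (whose wildcard and exception rules are ignored), and dropping duplicate and intra-domain edges is an assumption of this example.

```python
def pay_level_domain(rev_host, rev_suffixes):
    """Return the reversed pay-level domain: longest known suffix plus one label."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first, always leaving at least one label.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in rev_suffixes:
            return ".".join(labels[: n + 1])
    return rev_host  # no known suffix matched


def aggregate_edges(host_edges, rev_suffixes):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for source, target in host_edges:
        s = pay_level_domain(source, rev_suffixes)
        t = pay_level_domain(target, rev_suffixes)
        if s != t:  # assumption: links within one domain are dropped
            domain_edges.add((s, t))
    return domain_edges
```

With the toy suffix set `{"com", "uk", "uk.co"}` (suffixes written in reversed form), the hosts `uk.co.bbc.news` and `uk.co.bbc` both map to the domain `uk.co.bbc`.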
The domain-level graph has 92.7 million nodes and 2.4 billion edges. 48 million nodes (52%) are dangling nodes, and the largest strongly connected component covers 36 million nodes (40%).
All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/ or via https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/.
Download files of the Common Crawl Aug/Sep/Oct 2019 domain-level webgraph
Top 1000 domains ranked by harmonic centrality (Aug/Sept/Oct 2019)
Columns (one value per line for each domain, in this order): harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed hostname
1
32193222
1
0.020989
com.googleapis
2
29939490
3
0.012691
com.facebook
3
29202762
2
0.012925
com.google
4
26823584
4
0.007369
com.twitter
5
26304804
5
0.006660
org.w
6
26023106
6
0.006435
com.youtube
7
24330930
9
0.003914
com.instagram
8
23954698
7
0.004993
org.gmpg
9
23537216
8
0.004863
com.googletagmanager
10
23456362
13
0.002913
com.linkedin
11
22601398
12
0.003086
org.wordpress
12
22511672
10
0.003602
com.cloudflare
13
22484028
22
0.001698
com.gravatar
14
22366260
23
0.001509
com.pinterest
15
22337722
19
0.002143
com.wordpress
16
22168804
14
0.002422
com.bootstrapcdn
17
22152656
32
0.001134
org.wikipedia
18
21939876
20
0.001777
com.apple
19
21666694
42
0.000842
com.blogspot
20
21595338
21
0.001736
com.jquery
21
21576018
49
0.000713
be.youtu
22
21574068
35
0.001064
com.vimeo
23
21525328
30
0.001154
com.microsoft
24
21517514
18
0.002164
com.gstatic
25
21444512
17
0.002176
com.adobe
26
21427572
39
0.000964
com.amazonaws
27
21426674
50
0.000695
com.wp
28
21393434
51
0.000681
com.amazon
29
21314258
65
0.000516
com.tumblr
30
21291646
46
0.000767
gl.goo
31
21256658
25
0.001309
com.macromedia
32
21253410
29
0.001173
com.baidu
33
21136800
67
0.000501
ly.bit
34
21087412
27
0.001200
com.flickr
35
21083642
24
0.001391
com.github
36
21068510
89
0.000381
com.yahoo
37
21063868
41
0.000928
com.google-analytics
38
21063204
31
0.001139
com.googlesyndication
39
21053852
57
0.000608
eu.europa
40
21051522
61
0.000541
org.mozilla
41
20997812
106
0.000300
com.reddit
42
20934970
37
0.001002
net.cloudfront
43
20930738
28
0.001184
ru.yandex
44
20907836
38
0.000964
com.addthis
45
20875048
48
0.000734
co.t
46
20863874
47
0.000744
net.doubleclick
47
20860222
70
0.000482
org.w3
48
20822794
98
0.000329
com.googleusercontent
49
20819594
43
0.000814
com.squarespace
50
20815248
75
0.000462
com.medium
51
20812946
91
0.000376
org.creativecommons
52
20811076
175
0.000140
org.wikimedia
53
20788398
86
0.000417
com.weebly
54
20786242
63
0.000534
me.wp
55
20764144
129
0.000221
com.nytimes
56
20754576
88
0.000400
io.github
57
20744406
55
0.000625
com.paypal
58
20731050
165
0.000148
uk.co.bbc
59
20729776
58
0.000557
net.jsdelivr
60
20723608
108
0.000297
com.soundcloud
61
20720614
172
0.000141
com.imgur
62
20696474
130
0.000210
com.dropbox
63
20676226
137
0.000181
com.forbes
64
20641078
173
0.000141
net.slideshare
65
20640886
54
0.000634
org.schema
66
20637016
153
0.000163
com.theguardian
67
20619410
187
0.000136
com.cnn
68
20614482
204
0.000118
com.businessinsider
69
20589376
217
0.000109
com.wsj
70
20575384
281
0.000086
edu.harvard
71
20572806
167
0.000147
com.bing
72
20571044
241
0.000098
com.techcrunch
73
20567292
290
0.000084
edu.mit
74
20557144
285
0.000084
com.reuters
75
20554074
375
0.000067
com.msn
76
20549852
329
0.000075
com.cnet
77
20542104
140
0.000178
org.archive
78
20538104
250
0.000094
com.bloomberg
79
20522694
33
0.001120
com.fontawesome
80
20521374
141
0.000175
gov.nih
81
20513408
93
0.000355
com.shopify
82
20513140
271
0.000089
com.myspace
83
20507048
207
0.000116
edu.stanford
84
20496152
53
0.000647
com.wix
85
20489102
200
0.000120
com.stackoverflow
86
20487938
434
0.000057
com.googleblog
87
20484632
154
0.000163
org.apache
88
20478194
229
0.000102
com.oracle
89
20475392
214
0.000110
com.washingtonpost
90
20472386
260
0.000091
com.android
91
20469298
267
0.000090
com.bbc
92
20466440
194
0.000123
org.ietf
93
20438542
310
0.000079
com.time
94
20436352
298
0.000081
uk.co.telegraph
95
20417976
369
0.000067
com.ted
96
20415662
372
0.000067
gov.nasa
97
20412316
368
0.000067
com.githubusercontent
98
20409888
185
0.000136
com.npmjs
99
20401604
394
0.000063
com.quora
100
20396146
601
0.000042
com.thenextweb
101
20395674
161
0.000156
com.giphy
102
20388778
726
0.000037
com.wikia
103
20380728
343
0.000072
uk.co.dailymail
104
20379652
294
0.000082
com.usatoday
105
20378396
371
0.000067
com.latimes
106
20370212
713
0.000037
org.chromium
107
20369700
306
0.000079
org.un
108
20368148
144
0.000174
com.wixsite
109
20366610
493
0.000050
com.economist
110
20361312
26
0.001226
com.qq
111
20343426
268
0.000090
com.appspot
112
20339266
480
0.000052
com.pixabay
113
20337398
491
0.000050
com.zdnet
114
20328308
315
0.000079
com.example
115
20325422
358
0.000070
com.livejournal
116
20322334
380
0.000066
com.mashable
117
20308200
302
0.000080
com.cnbc
118
20308066
253
0.000093
org.ampproject
119
20306984
442
0.000056
com.nationalgeographic
120
20293426
505
0.000049
com.venturebeat
121
20292380
404
0.000062
com.dailymotion
122
20285502
139
0.000178
com.twimg
123
20284164
476
0.000052
org.bitbucket
124
20282368
547
0.000046
com.pexels
125
20280714
327
0.000075
com.springer
126
20279992
218
0.000108
com.huffingtonpost
127
20279190
94
0.000355
com.whatsapp
128
20277928
459
0.000054
com.cisco
129
20268416
146
0.000170
com.blogger
130
20267684
123
0.000234
com.ytimg
131
20264730
413
0.000061
com.fortune
132
20263014
641
0.000040
uk.ac.ox
133
20262258
231
0.000100
com.getbootstrap
134
20261648
847
0.000035
org.cambridge
135
20261268
629
0.000040
org.weforum
136
20250854
197
0.000123
com.typepad
137
20250698
279
0.000086
com.sciencedirect
138
20250162
512
0.000048
com.about
139
20247192
286
0.000084
com.wired
140
20240130
317
0.000078
com.skype
141
20235202
558
0.000045
org.worldbank
142
20230192
134
0.000186
com.issuu
143
20225004
504
0.000049
com.mysql
144
20220996
650
0.000039
org.sciencemag
145
20220972
531
0.000047
org.arxiv
146
20218296
624
0.000041
uk.co.guardian
147
20216194
407
0.000062
com.nature
148
20214012
127
0.000226
com.unpkg
149
20213638
143
0.000175
com.spotify
150
20195500
824
0.000036
com.playstation
151
20195352
177
0.000139
uk.co.google
152
20195260
440
0.000057
gov.noaa
153
20193574
323
0.000077
com.staticflickr
154
20193512
366
0.000068
com.gmail
155
20191934
1037
0.000028
org.eclipse
156
20191832
395
0.000063
net.researchgate
157
20185934
342
0.000072
com.fc2
158
20179194
603
0.000042
org.ieee
159
20177140
132
0.000201
com.zendesk
160
20177108
383
0.000065
com.theatlantic
161
20173850
590
0.000043
com.git-scm
162
20173722
182
0.000136
me.t
163
20169446
282
0.000085
com.googlecode
164
20167964
212
0.000113
net.behance
165
20166960
364
0.000068
com.w3schools
166
20165408
657
0.000039
com.stackexchange
167
20147566
128
0.000222
com.youtube-nocookie
168
20144266
430
0.000058
com.buzzfeed
169
20143168
573
0.000043
br.com.uol
170
20141222
828
0.000036
ca.blogspot
171
20138528
592
0.000042
com.evernote
172
20137536
854
0.000034
com.scientificamerican
173
20123000
227
0.000102
com.dribbble
174
20122966
495
0.000049
com.vice
175
20119812
180
0.000137
com.feedburner
176
20118786
574
0.000043
net.azurewebsites
177
20113370
536
0.000046
com.alexa
178
20110780
418
0.000059
com.outlook
179
20103382
424
0.000059
com.gitlab
180
20092588
422
0.000059
me.about
181
20092232
409
0.000061
com.goodreads
182
20091842
1102
0.000026
com.nvidia
183
20082450
419
0.000059
com.mozilla
184
20078524
447
0.000056
com.entrepreneur
185
20073740
236
0.000099
com.ft
186
20071534
452
0.000055
com.wikihow
187
20066124
245
0.000096
com.disqus
188
20064942
1092
0.000026
com.jetbrains
189
20063756
1327
0.000023
org.phys
190
20062066
602
0.000042
org.greenpeace
191
20061474
386
0.000065
org.hbr
192
20059468
178
0.000139
com.salesforce
193
20058532
537
0.000046
com.adage
194
20056012
300
0.000080
org.doi
195
20055914
1106
0.000026
org.ap
196
20054068
860
0.000034
com.500px
197
20051824
488
0.000051
gov.loc
198
20051342
957
0.000030
com.sap
199
20050500
626
0.000041
com.marketwatch
200
20049824
1265
0.000024
com.siemens
201
20049584
1173
0.000025
ca.utoronto
202
20049300
428
0.000058
uk.co.independent
203
20048034
222
0.000104
com.hubspot
204
20045788
593
0.000042
com.slate
205
20042018
349
0.000071
gg.discord
206
20024956
1435
0.000021
com.hackernoon
207
20022096
487
0.000051
uk.co.blogspot
208
20012130
1451
0.000021
org.tensorflow
209
20007682
401
0.000062
com.indiatimes
210
20007486
1035
0.000028
org.kernel
211
20001698
530
0.000047
com.trello
212
19999034
666
0.000038
com.searchengineland
213
19997084
1009
0.000029
com.unity3d
214
19996940
473
0.000052
com.computerworld
215
19996232
549
0.000045
com.withgoogle
216
19993078
1369
0.000022
edu.osu
217
19991880
949
0.000030
edu.si
218
19990236
612
0.000041
au.net.abc
219
19988088
1428
0.000021
com.lego
220
19987532
287
0.000084
com.nbcnews
221
19977482
1356
0.000022
com.angelfire
222
19976080
499
0.000049
com.moz
223
19975358
199
0.000122
net.sourceforge
224
19969236
667
0.000038
co.ibb
225
19968114
1618
0.000019
org.edx
226
19967072
515
0.000048
com.box
227
19961458
986
0.000029
com.huffpost
228
19961370
598
0.000042
gov.state
229
19956418
1563
0.000019
blog.home
230
19955608
1678
0.000018
com.oregonlive
231
19954284
631
0.000040
com.pinimg
232
19953180
863
0.000034
gov.usgs
233
19949892
2048
0.000016
com.sputniknews
234
19948950
1047
0.000027
co.elastic
235
19947460
1196
0.000025
edu.rutgers
236
19946614
211
0.000115
com.optimizely
237
19945418
1409
0.000021
org.maven
238
19942668
1373
0.000022
net.seesaa
239
19939512
237
0.000099
com.aliyuncs
240
19939300
291
0.000083
com.tinyurl
241
19939182
188
0.000134
com.eepurl
242
19938152
224
0.000103
com.wpengine
243
19936538
2235
0.000014
com.slides
244
19935586
659
0.000039
com.sciencedaily
245
19933262
136
0.000183
com.addtoany
246
19933088
946
0.000031
com.storify
247
19932194
142
0.000175
com.yimg
248
19927032
354
0.000070
com.getpocket
249
19925642
715
0.000037
com.vox
250
19922530
60
0.000546
com.vk
251
19920994
171
0.000142
org.allaboutcookies
252
19919990
1165
0.000025
com.vogue
253
19918364
335
0.000074
com.wufoo
254
19914676
1282
0.000023
ms.1drv
255
19906484
1481
0.000020
io.itch
256
19906312
834
0.000035
com.techtarget
257
19905162
600
0.000042
org.change
258
19901530
597
0.000042
com.uk
259
19901258
421
0.000059
com.squareup
260
19897576
1408
0.000021
com.itv
261
19896802
954
0.000030
com.thehill
262
19896772
1291
0.000023
com.scmp
263
19894514
1777
0.000017
com.diigo
264
19893192
316
0.000079
es.google
265
19890244
651
0.000039
com.lifehacker
266
19888786
671
0.000038
gov.fcc
267
19886980
739
0.000037
com.chicagotribune
268
19886180
2309
0.000014
com.pearltrees
269
19885516
1554
0.000019
org.unep
270
19881960
313
0.000079
net.windows
271
19881842
248
0.000094
ru.rambler
272
19880642
506
0.000049
us.icio
273
19877580
92
0.000358
com.weibo
274
19876556
109
0.000290
com.paypalobjects
275
19874826
891
0.000033
com.strikingly
276
19873598
1178
0.000025
com.netlify
277
19867654
456
0.000055
gov.epa
278
19866350
292
0.000083
com.criteo
279
19864080
714
0.000037
org.pewresearch
280
19861136
533
0.000047
org.plos
281
19860954
1225
0.000024
com.newscientist
282
19860836
849
0.000035
uk.co.mirror
283
19860700
1010
0.000029
com.mediafire
284
19860298
1072
0.000027
com.sky
285
19859946
928
0.000031
com.buffer
286
19858910
1228
0.000024
com.aljazeera
287
19858168
1339
0.000022
it.scoop
288
19858040
209
0.000116
org.iana
289
19857260
2070
0.000016
com.coca-colacompany
290
19856912
683
0.000038
com.flipboard
291
19853900
1801
0.000017
jp.ac.u-tokyo
292
19853116
1018
0.000028
uk.co.metro
293
19851054
309
0.000079
com.ibm
294
19846968
322
0.000077
com.go
295
19846838
1552
0.000019
uk.bl
296
19841556
1264
0.000024
com.nikkei
297
19840090
52
0.000667
com.fb
298
19839844
2506
0.000013
it.unimi
299
19836858
1595
0.000019
com.googlesource
300
19834504
474
0.000052
com.udacity
301
19834024
835
0.000035
uk.co.thetimes
302
19832262
168
0.000144
com.imdb
303
19831660
843
0.000035
gov.congress
304
19828142
668
0.000038
org.fao
305
19826656
1191
0.000025
org.acs
306
19825238
1728
0.000018
com.toptal
307
19824736
1065
0.000027
edu.duke
308
19823982
621
0.000041
site.business
309
19820920
1133
0.000026
com.trendmicro
310
19817822
955
0.000030
com.theconversation
311
19814258
983
0.000029
co.g
312
19813034
851
0.000034
com.bmj
313
19812202
170
0.000143
com.amazon-adsystem
314
19808398
1045
0.000027
com.searchenginewatch
315
19806128
1376
0.000022
edu.gatech
316
19803474
2207
0.000015
com.viki
317
19803388
1135
0.000026
edu.brookings
318
19803178
971
0.000030
com.reverbnation
319
19798960
1069
0.000027
au.com.smh
320
19797938
44
0.000797
com.googleadservices
321
19796164
475
0.000052
org.freecodecamp
322
19792806
658
0.000039
br.com.google
323
19791896
1766
0.000017
jp.co.japantimes
324
19791234
400
0.000063
me.telegram
325
19790208
1332
0.000022
com.msnbc
326
19789672
1915
0.000016
org.wikibooks
327
19789356
1296
0.000023
com.dw
328
19787622
1366
0.000022
com.hostgator
329
19784154
477
0.000052
com.theverge
330
19780916
1574
0.000019
com.bankofamerica
331
19776986
994
0.000029
com.yoast
332
19775742
997
0.000029
com.socialmediaexaminer
333
19774146
841
0.000035
org.apa
334
19772748
426
0.000058
com.elsevier
335
19771404
458
0.000055
com.bigcartel
336
19770190
2240
0.000014
com.kinja
337
19770024
1701
0.000018
com.mediaplex
338
19769080
1058
0.000027
uk.co.huffingtonpost
339
19766820
1682
0.000018
org.bitcoin
340
19765668
1430
0.000021
com.grammarly
341
19765220
2071
0.000016
com.mathworks
342
19764662
1253
0.000024
com.livescience
343
19764202
249
0.000094
com.live
344
19763516
2265
0.000014
org.biorxiv
345
19762024
1794
0.000017
com.makeuseof
346
19760700
942
0.000031
com.econsultancy
347
19759296
518
0.000047
com.bigcommerce
348
19759088
953
0.000030
com.searchenginejournal
349
19757028
62
0.000537
net.akamaihd
350
19755866
1764
0.000017
com.colourlovers
351
19751232
314
0.000079
com.rackcdn
352
19749162
1834
0.000017
com.sas
353
19746962
223
0.000104
org.gnu
354
19742390
2490
0.000013
com.itsnicethat
355
19741694
2296
0.000014
uk.ac.sussex
356
19739212
820
0.000036
com.neilpatel
357
19738554
162
0.000156
com.opera
358
19738540
951
0.000030
com.gumroad
359
19733434
868
0.000034
com.business2community
360
19731138
909
0.000032
uk.co.pinterest
361
19730570
617
0.000041
uk.parliament
362
19729560
898
0.000032
com.ecwid
363
19729024
526
0.000047
me.m
364
19728302
1186
0.000025
com.thelancet
365
19727476
1677
0.000018
uk.co.timesonline
366
19725568
1662
0.000018
edu.iastate
367
19720948
890
0.000033
com.thedrum
368
19718200
1234
0.000024
com.seattletimes
369
19716772
116
0.000258
com.jimdo
370
19715158
1748
0.000018
org.rsc
371
19713318
318
0.000078
me.wa
372
19713052
2312
0.000014
io.soup
373
19712174
240
0.000098
net.php
374
19710524
996
0.000029
com.healthline
375
19706662
103
0.000317
net.facebook
376
19700662
389
0.000064
com.meetup
377
19698168
1397
0.000021
int.unfccc
378
19697842
2364
0.000014
com.autoblog
379
19697184
1147
0.000026
uk.co.ebay
380
19696284
1512
0.000020
com.channel4
381
19696102
345
0.000072
int.who
382
19695842
856
0.000034
com.photoshelter
383
19693426
297
0.000081
org.python
384
19693168
2103
0.000016
edu.miami
385
19693108
2445
0.000013
com.mysanantonio
386
19693052
1314
0.000023
com.bustle
387
19693004
2416
0.000013
com.smore
388
19690872
1244
0.000024
uk.co.express
389
19689496
1882
0.000016
com.smashwords
390
19689346
1454
0.000021
com.gawker
391
19689266
1492
0.000020
org.hrc
392
19688570
1378
0.000022
uk.gov.blog
393
19688210
266
0.000090
com.rawgit
394
19685386
251
0.000094
uk.org.ico
395
19684372
2229
0.000015
org.vim
396
19683694
2148
0.000015
uk.ac.york
397
19683048
1902
0.000016
com.discovermagazine
398
19682466
2017
0.000016
com.dummies
399
19682270
2811
0.000012
com.iht
400
19678702
1498
0.000020
fr.lesechos
401
19677190
1643
0.000019
org.amnesty
402
19677184
1087
0.000026
org.aarp
403
19675912
836
0.000035
uk.gov.legislation
404
19675666
1582
0.000019
com.pbworks
405
19675228
1197
0.000025
com.cio
406
19675036
1541
0.000020
com.googlegroups
407
19673696
888
0.000033
uk.gov.nationalarchives
408
19671786
489
0.000051
com.nwsource
409
19669190
1344
0.000022
com.thestar
410
19668328
1993
0.000016
com.treehugger
411
19668276
1602
0.000019
com.brainyquote
412
19667868
513
0.000048
com.livechatinc
413
19667262
1195
0.000025
org.heart
414
19666046
259
0.000091
com.unsplash
415
19665938
1475
0.000020
ie.independent
416
19665662
2444
0.000013
org.sciencenews
417
19664008
1478
0.000020
fi.google
418
19662896
1201
0.000025
uk.co.standard
419
19662404
163
0.000156
com.eventbrite
420
19661850
1997
0.000016
com.timesofisrael
421
19661340
1304
0.000023
com.surveygizmo
422
19659778
1245
0.000024
org.ohchr
423
19656716
1989
0.000016
com.nationalreview
424
19654260
2022
0.000016
com.gucci
425
19653254
605
0.000041
org.mediawiki
426
19651234
972
0.000029
com.wordstream
427
19651102
1584
0.000019
com.netvibes
428
19649566
1976
0.000016
org.bitcointalk
429
19648228
2372
0.000014
com.deepmind
430
19648124
1773
0.000017
org.iucn
431
19647904
1496
0.000020
com.startribune
432
19646024
293
0.000082
com.ebay
433
19639388
1355
0.000022
com.convinceandconvert
434
19637100
522
0.000047
edu.yale
435
19636614
384
0.000065
com.kickstarter
436
19635776
100
0.000321
com.godaddy
437
19634912
2157
0.000015
com.instapaper
438
19633830
1767
0.000017
uk.co.ibtimes
439
19631378
1261
0.000024
com.imageshack
440
19630146
110
0.000284
com.mailchimp
441
19627008
2887
0.000011
net.openreview
442
19626924
481
0.000052
gov.whitehouse
443
19626884
1301
0.000023
ch.ipcc
444
19625858
959
0.000030
com.bandsintown
445
19625598
388
0.000064
com.office
446
19624032
2039
0.000016
edu.udel
447
19623636
1818
0.000017
uk.ac.kcl
448
19619866
988
0.000029
org.ilo
449
19618636
1880
0.000016
tl.we
450
19618128
2092
0.000016
io.gitlab
451
19616698
1975
0.000016
com.digitaljournal
452
19615278
84
0.000440
com.list-manage
453
19614194
15
0.002224
com.wixstatic
454
19611928
1791
0.000017
com.secondlife
455
19604998
1171
0.000025
uk.gov.tfl
456
19603646
1994
0.000016
org.peta
457
19602880
1252
0.000024
com.medicalnewstoday
458
19601844
1744
0.000018
com.teenvogue
459
19601126
45
0.000773
net.fbcdn
460
19600768
1813
0.000017
com.upi
461
19600410
205
0.000117
com.etsy
462
19598800
1577
0.000019
no.google
463
19597706
2097
0.000016
com.shell
464
19596732
1535
0.000020
com.quicksprout
465
19596622
406
0.000062
com.fastcompany
466
19596226
1324
0.000023
org.hrw
467
19596164
559
0.000045
edu.berkeley
468
19595736
826
0.000036
com.intel
469
19593408
1911
0.000016
com.tomsguide
470
19592762
1655
0.000018
ca.pinterest
471
19591462
365
0.000068
com.hp
472
19590312
649
0.000039
org.nodejs
473
19589296
2135
0.000015
com.politifact
474
19588516
2400
0.000013
com.towardsdatascience
475
19588356
2292
0.000014
com.dailykos
476
19588058
1749
0.000018
com.oprah
477
19585238
3039
0.000011
org.arkive
478
19584732
859
0.000034
com.engadget
479
19584238
1740
0.000018
com.shareholder
480
19584228
967
0.000030
ly.snip
481
19577646
1359
0.000022
com.smallbiztrends
482
19577604
2384
0.000014
com.hsbc
483
19577414
104
0.000312
com.statcounter
484
19577334
566
0.000044
com.photobucket
485
19576468
2161
0.000015
org.jenkins-ci
486
19574024
1017
0.000028
com.contentmarketinginstitute
487
19569238
2447
0.000013
uk.co.spectator
488
19567958
1966
0.000016
com.thecut
489
19567398
2655
0.000012
uk.ac.mmu
490
19563030
1458
0.000021
net.convio
491
19562626
1897
0.000016
org.project-syndicate
492
19562602
857
0.000034
com.deviantart
493
19562312
1658
0.000018
google.ai
494
19560912
1921
0.000016
com.ogilvy
495
19560528
1775
0.000017
com.csoonline
496
19559434
990
0.000029
com.cognitoforms
497
19558398
2029
0.000016
link.page
498
19557452
2224
0.000015
com.upworthy
499
19555356
1670
0.000018
com.kinsta
500
19551574
393
0.000063
com.getclicky
501
19548794
1907
0.000016
ms.nyti
502
19548294
1951
0.000016
uk.ac.leeds
503
19546822
1247
0.000024
st.po
504
19546690
359
0.000069
com.mapbox
505
19545958
2341
0.000014
com.sciencealert
506
19545120
2387
0.000013
com.instructure
507
19543894
2343
0.000014
org.theiet
508
19543292
2620
0.000012
com.ksl
509
19540054
2168
0.000015
com.webbyawards
510
19537886
2852
0.000011
com.brandyourself
511
19535564
2735
0.000012
jp.hatenablog
512
19534552
2741
0.000012
com.zynga
513
19533780
382
0.000066
org.acm
514
19532322
1841
0.000017
com.cmswire
515
19531950
431
0.000058
io.codepen
516
19531032
1343
0.000022
org.pocoo
517
19530112
2931
0.000011
uk.co.autocar
518
19529900
160
0.000158
com.tripadvisor
519
19529372
234
0.000099
org.drupal
520
19528028
991
0.000029
com.gizmodo
521
19525144
2317
0.000014
org.aei
522
19524148
864
0.000034
com.matterport
523
19523142
1871
0.000017
uk.co.thesundaytimes
524
19521230
1041
0.000027
com.tinypic
525
19520944
813
0.000036
com.netflix
526
19520420
2439
0.000013
com.newatlas
527
19518764
2410
0.000013
com.triplepundit
528
19518666
381
0.000066
com.booking
529
19518320
2978
0.000011
fr.hellocoton
530
19517366
2201
0.000015
org.unfpa
531
19516300
1603
0.000019
pt.google
532
19514002
1715
0.000018
net.openid
533
19511332
3080
0.000011
com.blogsky
534
19511240
1763
0.000017
com.bloglines
535
19508014
272
0.000089
com.adnxs
536
19507232
2106
0.000015
org.royalsociety
537
19506586
2659
0.000012
com.asiaone
538
19504284
2308
0.000014
com.waterstones
539
19503858
2342
0.000014
com.financialexpress
540
19503212
1639
0.000019
uk.org.nationaltrust
541
19502772
1646
0.000019
org.pypi
542
19501202
899
0.000032
com.highcharts
543
19500790
1889
0.000016
org.panda
544
19500702
2898
0.000011
org.ifaw
545
19500700
1828
0.000017
org.thinkprogress
546
19499724
901
0.000032
com.arstechnica
547
19498236
2203
0.000015
com.kaggle
548
19497656
1948
0.000016
org.wri
549
19494804
2693
0.000012
co.electrek
550
19493786
2306
0.000014
uk.org.wwf
551
19493426
2436
0.000013
com.mongabay
552
19493282
3319
0.000010
com.carscoops
553
19492162
1082
0.000027
com.mixpanel
554
19486550
1502
0.000020
io.fabric
555
19486258
1269
0.000023
com.firebaseapp
556
19485830
906
0.000032
edu.psu
557
19484868
1848
0.000017
com.infolinks
558
19484056
1647
0.000018
com.coschedule
559
19481940
1672
0.000018
us.pa.state
560
19480200
2209
0.000015
uk.ac.nhm
561
19479650
1302
0.000023
com.clicky
562
19477726
500
0.000049
tv.twitch
563
19477544
532
0.000047
edu.cornell
564
19477084
872
0.000033
edu.washington
565
19476626
71
0.000478
com.livestream
566
19475600
2307
0.000014
com.autonews
567
19474520
2660
0.000012
pt.publico
568
19474486
1929
0.000016
org.americanprogress
569
19474190
2578
0.000012
com.nordvpn
570
19473972
2206
0.000015
org.sonatype
571
19471930
1457
0.000021
com.activecampaign
572
19471612
625
0.000041
com.samsung
573
19471306
2730
0.000012
com.delawareonline
574
19470860
2848
0.000011
com.topgear
575
19468240
999
0.000029
edu.upenn
576
19465494
1760
0.000017
uk.gov.metoffice
577
19464352
2733
0.000012
com.sc
578
19464298
2573
0.000013
br.inpe
579
19460386
1873
0.000017
com.prweek
580
19460086
2589
0.000012
com.ecowatch
581
19459484
72
0.000477
net.jsfiddle
582
19458590
3293
0.000010
com.algorithmia
583
19457214
2027
0.000016
com.scotsman
584
19457126
429
0.000058
com.slack
585
19455372
1887
0.000016
com.impactbnd
586
19453748
1008
0.000029
uk.ac.cam
587
19453316
2263
0.000014
com.articulate
588
19453140
2780
0.000012
com.nouw
589
19451266
2896
0.000011
com.flock
590
19449038
2571
0.000013
org.globalcitizen
591
19447006
538
0.000046
com.proofpoint
592
19445998
2335
0.000014
com.googledrive
593
19444262
2434
0.000013
nz.co.radionz
594
19444224
2763
0.000012
jp.riken
595
19443690
2388
0.000013
de.greenpeace
596
19443190
119
0.000244
com.youku
597
19442118
174
0.000141
jp.co.yahoo
598
19441598
2831
0.000011
com.mumsnet
599
19439924
1874
0.000017
com.crashlytics
600
19439174
965
0.000030
edu.umich
601
19439028
2114
0.000015
uk.org.rspb
602
19438028
208
0.000116
uk.co.amazon
603
19437448
101
0.000321
de.google
604
19435790
2748
0.000012
com.quickanddirtytips
605
19431834
2668
0.000012
au.com.huffingtonpost
606
19431216
1896
0.000016
uk.gov.london
607
19430698
2541
0.000013
com.thejakartapost
608
19429486
3097
0.000011
com.shanghaidaily
609
19428860
415
0.000061
com.xinhuanet
610
19428614
3069
0.000011
com.theminimalists
611
19428486
1271
0.000023
com.sprinklr
612
19426496
1208
0.000025
org.iea
613
19426466
2512
0.000013
ie.thejournal
614
19426152
1785
0.000017
com.jeffbullas
615
19424902
2979
0.000011
com.art
616
19424640
2837
0.000011
it.polito
617
19423008
1808
0.000017
com.martechtoday
618
19422426
2599
0.000012
uk.co.profilebusiness
619
19421492
2534
0.000013
com.db
620
19420756
2851
0.000011
org.onegreenplanet
621
19418396
2340
0.000014
net.opendemocracy
622
19416952
1869
0.000017
org.iucnredlist
623
19413908
2688
0.000012
uk.org.savethechildren
624
19412614
2379
0.000014
com.theyworkforyou
625
19411666
695
0.000037
com.xiti
626
19409198
2661
0.000012
org.oceanconservancy
627
19408718
2683
0.000012
com.dreamgrow
628
19407976
2254
0.000014
com.rabbitmq
629
19407372
2568
0.000013
com.shoutmeloud
630
19407170
1028
0.000028
com.mcafeesecure
631
19406866
449
0.000055
fr.free
632
19403640
362
0.000069
org.npr
633
19402072
1865
0.000017
com.copyscape
634
19401308
2791
0.000012
com.sitesell
635
19400880
312
0.000079
gov.cdc
636
19399828
2423
0.000013
com.cleantechnica
637
19399686
2809
0.000012
pl.edu.uw
638
19397274
399
0.000063
com.nypost
639
19396828
569
0.000044
com.aol
640
19396446
3167
0.000010
com.seeker
641
19396390
2760
0.000012
uk.org.amnesty
642
19396212
265
0.000090
com.sohu
643
19395962
1613
0.000019
com.flashtalking
644
19395308
2516
0.000013
com.generalmills
645
19393472
2049
0.000016
com.cityam
646
19392474
3380
0.000010
com.dremel
647
19392370
396
0.000063
com.163
648
19391762
3020
0.000011
com.brothersoft
649
19391670
2061
0.000016
org.gnupg
650
19388022
36
0.001003
com.createjs
651
19387660
1027
0.000028
edu.ucla
652
19386630
511
0.000048
com.dmca
653
19385442
1495
0.000020
scot.gov
654
19383806
2347
0.000014
org.grist
655
19383592
2474
0.000013
uk.org.oxfam
656
19381766
2457
0.000013
uk.co.thisismoney
657
19380480
3259
0.000010
org.aqicn
658
19379848
2566
0.000013
uk.org.rspca
659
19379190
1169
0.000025
com.hollywoodreporter
660
19378746
2726
0.000012
org.irena
661
19377826
2908
0.000011
org.kuow
662
19375866
2934
0.000011
eu.i-scoop
663
19375282
3137
0.000011
com.winefolly
664
19374230
244
0.000096
com.bandcamp
665
19373806
1350
0.000022
net.leadpages
666
19371298
1855
0.000017
net.noscript
667
19370726
1438
0.000021
com.pastebin
668
19370120
2692
0.000012
com.targetmarketingmag
669
19368524
3516
0.000010
co.edureka
670
19368376
2773
0.000012
com.ipsos-mori
671
19368284
2546
0.000013
org.zsl
672
19368044
2393
0.000013
com.moodys
673
19367896
1170
0.000025
gov.fbi
674
19367686
2182
0.000015
com.thermofisher
675
19366198
2800
0.000012
uk.ac.ceh
676
19365484
273
0.000089
com.surveymonkey
677
19364456
1703
0.000018
uk.co.which
678
19363118
1431
0.000021
uk.gov.defra
679
19362092
2626
0.000012
com.wikidot
680
19361864
2112
0.000015
com.problogger
681
19361432
2794
0.000012
com.pnsegypt
682
19360486
3132
0.000011
com.hatenadiary
683
19359572
169
0.000143
com.taobao
684
19359506
333
0.000074
com.pubmatic
685
19358770
377
0.000066
com.scribd
686
19358748
2985
0.000011
org.storyofstuff
687
19358106
3168
0.000010
org.heartland
688
19356998
2902
0.000011
com.nationalgrid
689
19355728
352
0.000070
com.wiley
690
19355014
886
0.000033
com.windowsphone
691
19351528
2511
0.000013
uk.gov.forestry
692
19349818
2746
0.000012
org.spie
693
19349596
816
0.000036
com.mobirise
694
19346822
2963
0.000011
uk.ac.mdx
695
19345936
463
0.000054
com.oreilly
696
19345228
2298
0.000014
com.iconarchive
697
19344974
3213
0.000010
edu.uah
698
19344130
893
0.000032
edu.columbia
699
19343846
2196
0.000015
uk.gov.food
700
19342492
2770
0.000012
edu.dukeupress
701
19341928
2518
0.000013
com.wral
702
19337306
1239
0.000024
google.blog
703
19337180
453
0.000055
com.sxsw
704
19337108
686
0.000038
com.steampowered
705
19332972
2891
0.000011
com.almanac
706
19332496
915
0.000031
com.docker
707
19332138
433
0.000057
com.force
708
19330890
913
0.000032
org.reactjs
709
19330434
3158
0.000011
com.dbs
710
19330012
3320
0.000010
uk.org.bornfree
711
19329944
1283
0.000023
uk.org.greenpeace
712
19328328
1100
0.000026
com.redhat
713
19328004
1248
0.000024
com.elpais
714
19327924
785
0.000036
com.webs
715
19324934
3401
0.000010
org.sciencenewsforstudents
716
19324548
3476
0.000010
org.sharktrust
717
19323678
3447
0.000010
uk.org.caat
718
19322218
305
0.000080
com.digg
719
19320384
325
0.000076
com.typeform
720
19320196
2756
0.000012
com.batchgeo
721
19319558
2116
0.000015
com.fifa
722
19317480
2389
0.000013
org.chathamhouse
723
19317116
1322
0.000023
org.whatbrowser
724
19317094
2098
0.000016
org.fsc
725
19316024
1706
0.000018
com.nike
726
19315926
2357
0.000014
uk.co.inews
727
19315824
1362
0.000022
edu.ucsd
728
19315458
3400
0.000010
com.artstation
729
19315386
855
0.000034
org.unesco
730
19315260
2654
0.000012
com.ingress
731
19313414
1561
0.000019
com.technologyreview
732
19312758
2375
0.000014
io.pantheon
733
19311846
2952
0.000011
com.climatechangenews
734
19311082
2981
0.000011
org.c2es
735
19309714
1771
0.000017
com.ikea
736
19309506
3010
0.000011
com.foodsafetynews
737
19306598
2574
0.000012
uk.org.38degrees
738
19305744
2676
0.000012
com.thecvf
739
19305478
2588
0.000012
org.carbonbrief
740
19305458
2990
0.000011
org.sourcewatch
741
19304968
571
0.000043
com.cbsnews
742
19304594
2986
0.000011
com.moneysupermarket
743
19304168
469
0.000053
com.statista
744
19304094
3414
0.000010
me.start
745
19301508
2844
0.000011
com.tiddlywiki
746
19299692
2645
0.000012
com.bnef
747
19298620
3095
0.000011
uk.co.bristolpost
748
19297446
198
0.000122
io.polyfill
749
19297002
3059
0.000011
jp.ac.kobe-u
750
19296802
122
0.000238
org.networkadvertising
751
19296318
502
0.000049
com.atlassian
752
19294076
338
0.000073
com.prnewswire
753
19291522
1128
0.000026
com.canva
754
19288978
3012
0.000011
org.twinery
755
19288828
2737
0.000012
com.adcolony
756
19288458
3117
0.000011
no.forskning
757
19286246
2785
0.000012
com.doctoroz
758
19284850
3556
0.000010
com.cmgdigital
759
19284678
3143
0.000011
com.sunherald
760
19284062
3172
0.000010
com.ibmbigdatahub
761
19283992
3517
0.000010
com.2createawebsite
762
19283716
2996
0.000011
net.organicfacts
763
19282858
2243
0.000014
com.privacypolicies
764
19282122
2905
0.000011
com.winemag
765
19281746
1056
0.000027
com.ubuntu
766
19281512
1419
0.000021
uk.co.thesun
767
19281086
470
0.000053
com.inc
768
19281010
2143
0.000015
org.cites
769
19280990
2290
0.000014
uk.gov.dft
770
19279280
3146
0.000011
com.insideevs
771
19279174
2734
0.000012
de.ksta
772
19278422
2684
0.000012
com.e-activist
773
19278376
1412
0.000021
com.speakerdeck
774
19276894
2747
0.000012
com.chubb
775
19273916
2608
0.000012
org.rspo
776
19273894
964
0.000030
net.2mdn
777
19273142
3265
0.000010
com.jordantimes
778
19272034
319
0.000078
gov.ca
779
19268910
3506
0.000010
com.idt
780
19268426
2757
0.000012
com.theinnovationenterprise
781
19267542
2349
0.000014
uk.gov.environment-agency
782
19267478
3496
0.000010
com.sutori
783
19266406
151
0.000163
ru.mail
784
19266224
164
0.000152
com.yelp
785
19265510
3184
0.000010
com.galvanize
786
19264800
3425
0.000010
com.thewritepractice
787
19264778
3212
0.000010
org.carbontracker
788
19264570
3464
0.000010
org.earthworksaction
789
19263548
1713
0.000018
com.martechseries
790
19262638
981
0.000029
com.visualstudio
791
19262168
3383
0.000010
com.nutraingredients
792
19261694
3222
0.000010
com.quandl
793
19261452
1484
0.000020
uk.co.foe
794
19260924
232
0.000100
to.amzn
795
19260174
1731
0.000018
org.khanacademy
796
19260130
2699
0.000012
com.businessgreen
797
19259920
524
0.000047
com.airbnb
798
19259634
3200
0.000010
com.thedrinksbusiness
799
19258704
3384
0.000010
com.monbiot
800
19258488
2685
0.000012
au.com.mumbrella
801
19257102
3072
0.000011
fr.thelocal
802
19256728
3330
0.000010
org.cnduk
803
19256628
660
0.000039
org.eff
804
19256476
1441
0.000021
com.tutsplus
805
19255922
3090
0.000011
ai.fast
806
19255422
2723
0.000012
com.goinswriter
807
19255170
3358
0.000010
org.thechicagocouncil
808
19253936
3029
0.000011
jp.hatenadiary
809
19252730
2743
0.000012
gov.ferc
810
19252634
1384
0.000022
com.uber
811
19252094
3444
0.000010
com.visitdublin
812
19250954
2582
0.000012
nz.govt.mfat
813
19249844
2223
0.000015
uk.gov.charitycommission
814
19249406
1192
0.000025
edu.utexas
815
19249112
3273
0.000010
com.chemistryworld
816
19248998
3300
0.000010
org.alaskapublic
817
19248984
1418
0.000021
fr.lemonde
818
19248812
3144
0.000011
com.tuck
819
19247226
3156
0.000011
com.marksdailyapple
820
19246284
1005
0.000029
com.americanexpress
821
19246204
579
0.000043
com.patreon
822
19245062
2814
0.000012
com.ing
823
19245032
166
0.000147
jp.co.google
824
19244244
1932
0.000016
uk.gov.education
825
19242896
2753
0.000012
com.webestools
826
19242502
2504
0.000013
com.instructables
827
19242460
1185
0.000025
edu.princeton
828
19240552
3645
0.000010
com.theppk
829
19240536
3305
0.000010
com.machinelearningmastery
830
19238864
1716
0.000018
se.haxx
831
19238712
1149
0.000026
com.digiday
832
19238462
896
0.000032
com.zoho
833
19238268
4669
0.000009
com.9to5mac
834
19237602
3761
0.000010
org.muslimaid
835
19235836
541
0.000046
com.alibaba
836
19235736
2817
0.000012
uk.ac.rcplondon
837
19233882
556
0.000045
gov.sec
838
19232880
3043
0.000011
com.platts
839
19232688
2651
0.000012
com.recyclenow
840
19232618
3441
0.000010
org.thebestschools
841
19231994
3352
0.000010
com.beruby
842
19231826
202
0.000119
com.constantcontact
843
19231002
2354
0.000014
net.privacypolicytemplate
844
19230142
3207
0.000010
com.gpsvisualizer
845
19227774
3104
0.000011
com.rabobank
846
19227216
3306
0.000010
com.seat61
847
19227198
3412
0.000010
uk.co.lep
848
19226122
311
0.000079
com.marriott
849
19224666
239
0.000098
cn.com.sina
850
19224282
753
0.000036
com.css-tricks
851
19223532
246
0.000095
jp.co.amazon
852
19222846
1299
0.000023
gd.is
853
19221828
2350
0.000014
uk.co.vogue
854
19221424
1381
0.000022
com.dell
855
19221118
722
0.000037
fm.last
856
19221104
2009
0.000016
io.getmdl
857
19220430
3756
0.000010
uk.org.stopwar
858
19220196
2627
0.000012
org.ramsar
859
19217988
1987
0.000016
com.instapage
860
19217434
595
0.000042
com.psychologytoday
861
19217202
3592
0.000010
com.fox13memphis
862
19216396
3134
0.000011
uk.org.sja
863
19216342
3538
0.000010
com.breakingenergy
864
19216070
3436
0.000010
com.star2
865
19215784
3103
0.000011
org.scielo
866
19215692
97
0.000332
com.sharethis
867
19215686
881
0.000033
com.aliexpress
868
19215320
3613
0.000010
it.diggita
869
19214836
210
0.000116
jp.ne.hatena
870
19214614
1125
0.000026
com.firefox
871
19214492
634
0.000040
gov.nist
872
19212940
3252
0.000010
org.beatthemicrobead
873
19212372
3603
0.000010
nl.zoom
874
19212310
1232
0.000024
com.convertkit
875
19207820
545
0.000046
uk.co.eventbrite
876
19207334
3145
0.000011
com.abnamro
877
19206384
2904
0.000011
org.wildlifetrusts
878
19206088
1637
0.000019
org.whales
879
19205750
1068
0.000027
com.shutterstock
880
19204676
3981
0.000009
com.visitguatemala
881
19203846
3128
0.000011
uk.org.scope
882
19203288
1030
0.000028
com.foxnews
883
19203148
2675
0.000012
org.soilassociation
884
19202842
1019
0.000028
com.cbslocal
885
19200848
3702
0.000010
no.haugenbok
886
19199586
2914
0.000011
com.ironsrc
887
19199426
952
0.000030
com.variety
888
19199344
2622
0.000012
com.feedreader
889
19198876
517
0.000048
com.ea
890
19198322
3595
0.000010
uk.co.theboltonnews
891
19198052
1190
0.000025
com.globo
892
19196818
2863
0.000011
com.itsma
893
19196098
1564
0.000019
org.freecsstemplates
894
19195978
2078
0.000016
com.hulu
895
19195672
3148
0.000011
com.rebekahradice
896
19195262
289
0.000084
com.discordapp
897
19194762
3543
0.000010
info.e-ir
898
19194546
3364
0.000010
org.swi-prolog
899
19192288
3182
0.000010
com.wpxi
900
19191698
486
0.000051
com.nasdaq
901
19190864
3170
0.000010
uk.co.dennis
902
19190668
3551
0.000010
com.alaskadispatch
903
19190460
1150
0.000026
com.java
904
19190276
230
0.000100
com.googletagservices
905
19189676
3504
0.000010
es.ree
906
19189562
3162
0.000010
com.sgx
907
19188626
3721
0.000010
br.org.imazon
908
19188334
3776
0.000010
com.citymayors
909
19188182
3582
0.000010
au.com.hotfrog
910
19187066
3272
0.000010
uk.org.cat
911
19186280
3279
0.000010
aq.ats
912
19186188
678
0.000038
com.newyorker
913
19184690
4002
0.000009
net.politicalscrapbook
914
19184344
3606
0.000010
com.southernfriedscience
915
19183484
3125
0.000011
app.web
916
19182578
220
0.000106
com.naver
917
19181770
1736
0.000018
com.techrepublic
918
19180978
3632
0.000010
com.theoildrum
919
19180728
3728
0.000010
org.worldnuclearreport
920
19180674
156
0.000162
gov.privacyshield
921
19179118
2871
0.000011
uk.co.realbusiness
922
19177254
1463
0.000021
edu.uchicago
923
19176726
1453
0.000021
tv.ustream
924
19175164
1807
0.000017
com.nba
925
19173352
3271
0.000010
uk.org.cpre
926
19173188
1788
0.000017
org.golang
927
19172402
2955
0.000011
com.writetothem
928
19172368
2041
0.000016
com.howstuffworks
929
19170896
1407
0.000021
uk.co.theregister
930
19170668
464
0.000054
com.adweek
931
19170630
243
0.000096
com.stumbleupon
932
19170480
1579
0.000019
edu.unc
933
19169444
2211
0.000015
edu.virginia
934
19168860
3619
0.000010
com.renewablesnow
935
19168518
1390
0.000022
com.over-blog
936
19167800
1443
0.000021
com.digitaltrends
937
19167782
4073
0.000009
uk.co.moblog
938
19165140
1406
0.000021
us.imageshack
939
19164724
3604
0.000010
com.at0086
940
19164492
2144
0.000015
org.coursera
941
19164428
3799
0.000010
com.avivaromm
942
19162582
984
0.000029
com.thinkwithgoogle
943
19162450
3629
0.000010
com.eremnews
944
19161660
466
0.000053
com.snapchat
945
19159918
1442
0.000021
com.billboard
946
19159904
3394
0.000010
uk.gov.peterborough
947
19159506
3530
0.000010
org.campaigncc
948
19158600
642
0.000039
org.pbs
949
19157590
3299
0.000010
uk.co.siemens
950
19157574
3470
0.000010
org.ilga-europe
951
19156258
978
0.000029
com.dropboxusercontent
952
19154380
894
0.000032
com.uservoice
953
19154252
1578
0.000019
com.ssllabs
954
19153992
3367
0.000010
com.trafficgenerationcafe
955
19152256
1614
0.000019
com.warnerbros
956
19152042
922
0.000031
com.libsyn
957
19151852
3665
0.000010
uk.org.biofuelwatch
958
19151718
3617
0.000010
uk.org.garyhall
959
19151548
2399
0.000013
com.ehow
960
19150820
3771
0.000010
no.universitetsforlaget
961
19148430
3559
0.000010
br.org.idec
962
19148290
839
0.000035
com.qz
963
19148164
2911
0.000011
net.nend
964
19147422
690
0.000038
com.webmd
965
19147238
1694
0.000018
com.codeplex
966
19144852
1374
0.000022
com.fiverr
967
19144584
3922
0.000009
net.kjokkenutstyr
968
19144572
497
0.000049
edu.cmu
969
19144158
4006
0.000009
org.freedom-now
970
19143932
1167
0.000025
com.smashingmagazine
971
19143604
3147
0.000011
uk.org.refill
972
19143372
1971
0.000016
com.invisionapp
973
19142256
2864
0.000011
com.dzone
974
19142158
3490
0.000010
io.dataquest
975
19141538
3984
0.000009
org.alqaws
976
19141230
3242
0.000010
io.dropwizard
977
19140662
3821
0.000010
com.superiorthreads
978
19140420
3398
0.000010
uk.co.firstnews
979
19138994
355
0.000070
org.debian
980
19138154
2134
0.000015
com.w3layouts
981
19134388
877
0.000033
com.foursquare
982
19134040
3035
0.000011
com.vungle
983
19133716
3205
0.000010
org.corporateeurope
984
19133644
866
0.000034
gov.census
985
19133496
3995
0.000009
com.tinnedtomatoes
986
19133382
1400
0.000021
com.blackberry
987
19133336
1335
0.000022
jp.livedoor
988
19132476
3334
0.000010
com.drillordrop
989
19131836
3329
0.000010
com.ovoenergy
990
19131172
3804
0.000010
com.descarteslabs
991
19130778
1064
0.000027
com.politico
992
19128888
3788
0.000010
org.ianfairlie
993
19128728
1866
0.000017
com.nokia
994
19127786
2669
0.000012
in.bbc
995
19127512
2987
0.000011
org.vegsoc
996
19127108
3387
0.000010
com.figure-eight
997
19125818
357
0.000070
gov.ftc
998
19125596
242
0.000097
org.icann
999
19125358
1401
0.000021
com.xkcd
1000
19125216
3609
0.000010
br.com.ambev
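The ranking table above lists one value per line, so each row spans five consecutive lines: harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, and the reversed name. A minimal sketch of regrouping such lines into rows follows; the names `RankRow` and `parse_rank_rows` are hypothetical and not part of any Common Crawl tool.

```python
from typing import Iterable, Iterator, NamedTuple


class RankRow(NamedTuple):
    hc_rank: int      # harmonic centrality rank
    hc_value: float   # harmonic centrality value
    pr_rank: int      # PageRank rank
    pr_value: float   # PageRank value
    name_rev: str     # reversed host or domain name


def parse_rank_rows(lines: Iterable[str]) -> Iterator[RankRow]:
    """Group every five non-empty lines into one RankRow.

    Assumes the input starts at a row boundary; a trailing
    incomplete row is silently dropped.
    """
    cells = [line.strip() for line in lines if line.strip()]
    usable = len(cells) - len(cells) % 5
    for i in range(0, usable, 5):
        hc_rank, hc_value, pr_rank, pr_value, name = cells[i:i + 5]
        yield RankRow(int(hc_rank), float(hc_value),
                      int(pr_rank), float(pr_value), name)


# The last row of the table above, one value per line.
sample = ["1000", "19125216", "3609", "0.000010", "br.com.ambev"]
rows = list(parse_rank_rows(sample))
```

The full rank files published with the graphs may use a different layout; this sketch only covers the one-value-per-line form shown here.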
Credits
Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.
We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!
The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.
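As a rough illustration of the prefixing step, the sketch below reads a gzipped path listing and builds both URL forms. The two prefixes come from the text above; the function names and the sample path are hypothetical placeholders, not real archive paths.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def read_gzipped_listing(raw: bytes) -> list:
    """Decompress a gzipped listing and return its lines."""
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


def expand_paths(listing, prefix: str) -> list:
    """Prepend the prefix to every non-empty line of the listing."""
    return [prefix + line.strip() for line in listing if line.strip()]


# Hypothetical one-line listing, compressed in memory for the demo.
raw = gzip.compress(b"crawl-data/EXAMPLE-CRAWL/segments/0000/warc/example.warc.gz\n")
listing = read_gzipped_listing(raw)
http_urls = expand_paths(listing, HTTP_PREFIX)
s3_urls = expand_paths(listing, S3_PREFIX)
```

In practice the bytes would come from one of the downloaded listing files rather than from an in-memory sample.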
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.
The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th. It includes page captures of 1.0 billion URLs not contained in any crawl archive before. The other 1.5 billion pages had already been captured in prior crawls and were revisited.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.
The crawl archive for August 2019 is now available! It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th.
The August crawl contains page captures of 1.1 billion URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the May/Jun/Jul 2019 webgraph data set from the following sources:
a random sample of 2.1 billion outlinks extracted from July crawl WAT files
1.8 billion URLs mined in a breadth-first side crawl within a maximum of 6 links (“hops”), started from the homepages of the top 60 million hosts and domains and from randomly selected samples of 2 million human-readable sitemap pages (HTML format) and 3 million URLs of pages written in 130 less-represented languages (cf. language distributions)
1 billion URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds
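The breadth-first side crawl limited to 6 hops can be pictured with the sketch below. It is an illustration only, not the crawler's actual code: the `outlinks` dictionary stands in for fetching a page and extracting its links, and the function name is hypothetical.

```python
from collections import deque


def bfs_within_hops(seeds, outlinks, max_hops=6):
    """Return {url: hops} for every URL reachable within max_hops links.

    `outlinks` maps a URL to the URLs it links to; in a real crawl
    this lookup would be a fetch plus link extraction.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(hops)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in outlinks.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Toy link chain a -> b -> c -> d, explored with a limit of 2 hops.
toy_links = {"a": ["b"], "b": ["c"], "c": ["d"]}
reached = bfs_within_hops(["a"], toy_links, max_hops=2)
```

Because the traversal is breadth-first, the recorded hop count of each URL is its shortest link distance from any seed.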
Starting with this crawl the following fixes and improvements are applied to the provided data formats:
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you end up with the S3 and HTTP paths respectively.
We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark on GitHub which host all scripts and tools required to construct the graphs.
What’s new?
Links from Content-Location and Link HTTP headers are now also used to build the web graphs. This is in accordance with RFC 5988, which defines the Link HTTP header as semantically equivalent to the <link> element in HTML. It is also consistent with previous web graph releases, which included all kinds of links, including technical ones and redirects.
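To make the header-based links concrete, here is a simplified sketch of pulling link targets out of the Content-Location and Link headers and resolving them against the page URL. It is not the cc-pyspark implementation: the function name is hypothetical, header names are assumed to be stored in exactly this capitalization, and the regular expression ignores link parameters such as rel.

```python
import re
from urllib.parse import urljoin

# Matches the <URI-Reference> part of each link value in a Link header.
_LINK_TARGET = re.compile(r"<([^>]*)>")


def header_links(page_url: str, headers: dict) -> list:
    """Collect absolute link targets from Content-Location and Link headers."""
    links = []
    location = headers.get("Content-Location")
    if location:
        links.append(urljoin(page_url, location.strip()))
    for target in _LINK_TARGET.findall(headers.get("Link", "")):
        links.append(urljoin(page_url, target.strip()))
    return links


example_headers = {
    "Content-Location": "/en/index.html",
    "Link": '<https://example.org/style.css>; rel="stylesheet", </feed>; rel="alternate"',
}
found = header_links("https://example.com/", example_headers)
```

Relative targets are resolved with the page URL as base, so `/feed` becomes a link within the same host.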
Host-level graph
The graph consists of 445 million nodes and 3.14 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 382 million dangling nodes (86%), and the largest strongly connected component contains 48 million nodes (11%).
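The notion of a dangling node used here can be sketched on a toy edge list: a host counts as dangling when it appears as a link target but is not in the set of crawled hosts. The function names and the toy data are hypothetical; the two percentages are recomputed from the node counts given above.

```python
def dangling_hosts(edges, crawled):
    """Return link targets that are not among the crawled hosts."""
    crawled = set(crawled)
    return {dst for _src, dst in edges if dst not in crawled}


def share_percent(part: float, total: float) -> int:
    """Share of `part` in `total`, rounded to a whole percent."""
    return round(100 * part / total)


# Toy graph: two crawled hosts, one of which links to an uncrawled host.
toy_edges = [("com.example", "org.example"), ("org.example", "net.example")]
toy_dangling = dangling_hosts(toy_edges, crawled={"com.example", "org.example"})

# Node counts in millions, taken from the paragraph above.
dangling_share = share_percent(382, 445)
scc_share = share_percent(48, 445)
```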
You can download the graph and the ranks of all 445 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/host/. Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/host/ as a prefix to access the files from anywhere.
Download files of the Common Crawl May/June/July 2019 host-level webgraph
Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
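The reversal rule can be expressed in a few lines. The sketch below follows the description above (strip a leading www., then reverse the dot-separated labels); the function names are hypothetical, and any further normalization the real pipeline may apply is not covered.

```python
def reverse_host(host: str) -> str:
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(host_rev: str) -> str:
    """Turn a reversed name back into the usual order (without 'www.')."""
    return ".".join(reversed(host_rev.split(".")))


reversed_name = reverse_host("www.subdomain.example.com")
restored_name = unreverse_host("uk.co.bbc")
```

Note that the stripped www. prefix cannot be recovered from the reversed form.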
Domain-level graph
The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
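The aggregation to pay-level domains can be sketched as follows. This is a simplified illustration, not the cc-webgraph implementation: `SUFFIXES` is a tiny hand-picked stand-in for the public suffix list, wildcard and exception rules are ignored, and dropping links within the same domain is an assumption made for the example.

```python
# Tiny illustrative subset; a real run would load the public suffix list.
SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host: str, suffixes=SUFFIXES):
    """Return the matching public suffix plus one more label, or None."""
    labels = host.split(".")
    # Checking from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_edges(host_edges, suffixes=SUFFIXES):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s = pay_level_domain(src, suffixes)
        d = pay_level_domain(dst, suffixes)
        if s and d and s != d:  # assumption: skip links within one domain
            domain_edges.add((s, d))
    return domain_edges


toy_host_edges = [
    ("blog.example.com", "news.bbc.co.uk"),
    ("shop.example.com", "bbc.co.uk"),
    ("blog.example.com", "example.com"),
]
toy_domain_edges = aggregate_edges(toy_host_edges)
```

Host names are given in normal (not reversed) order here to keep the suffix matching easy to follow.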
The domain-level graph has 88 million nodes and 1.9 billion edges. 46 million nodes (52%) are dangling nodes; the largest strongly connected component covers 35 million nodes (40%).
All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/domain/ or via HTTP under https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/domain/.
Download files of the Common Crawl May/June/July 2019 domain-level webgraph
Top 1000 domains ranked by harmonic centrality (May/June/July 2019)
Columns, one value per line in this order: harmonic centrality rank, hc value, page rank, page rank value, reversed hostname
1
29977668
1
0.020841
com.googleapis
2
27867704
3
0.011812
com.facebook
3
27419980
2
0.012857
com.google
4
25196030
4
0.007273
com.twitter
5
24558836
5
0.006439
org.w
6
24533702
6
0.005984
com.youtube
7
22592098
9
0.003799
com.instagram
8
22060650
7
0.004857
org.gmpg
9
21829028
13
0.002863
com.linkedin
10
21595446
8
0.004481
com.googletagmanager
11
20930920
22
0.001704
com.gravatar
12
20912076
24
0.001531
com.pinterest
13
20730700
11
0.003384
com.cloudflare
14
20698732
17
0.002180
com.wordpress
15
20613210
12
0.003087
org.wordpress
16
20607942
26
0.001241
org.wikipedia
17
20408594
14
0.002452
com.bootstrapcdn
18
20351540
20
0.001823
com.apple
19
20148418
41
0.000904
com.blogspot
20
20103846
30
0.001124
com.vimeo
21
20036764
21
0.001719
com.jquery
22
19874716
50
0.000673
com.wp
23
19870332
29
0.001130
com.microsoft
24
19839912
43
0.000816
gl.goo
25
19828406
45
0.000769
com.amazon
26
19793040
18
0.002021
com.gstatic
27
19790998
19
0.002015
com.adobe
28
19788744
57
0.000573
com.tumblr
29
19754126
31
0.001104
com.amazonaws
30
19619798
25
0.001407
com.macromedia
31
19616602
34
0.001057
com.googlesyndication
32
19585788
47
0.000744
be.youtu
33
19585670
39
0.000937
com.google-analytics
34
19583342
62
0.000531
ly.bit
35
19572994
68
0.000440
com.yahoo
36
19549710
33
0.001080
com.flickr
37
19526876
35
0.001023
net.cloudfront
38
19526762
23
0.001676
com.github
39
19503814
60
0.000553
me.wp
40
19467672
27
0.001170
ru.yandex
41
19467424
58
0.000568
org.mozilla
42
19454890
106
0.000305
com.googleusercontent
43
19411724
49
0.000725
net.doubleclick
44
19374766
52
0.000658
co.t
45
19366860
44
0.000776
com.baidu
46
19322188
70
0.000401
com.weebly
47
19321754
105
0.000310
com.reddit
48
19317094
123
0.000234
com.nytimes
49
19313908
46
0.000749
com.paypal
50
19308094
104
0.000312
com.soundcloud
51
19278436
67
0.000448
com.medium
52
19268558
66
0.000451
io.github
53
19266970
63
0.000517
org.w3
54
19255616
80
0.000379
org.creativecommons
55
19228034
184
0.000143
uk.co.bbc
56
19219470
175
0.000151
com.imgur
57
19191484
137
0.000194
com.forbes
58
19184320
168
0.000154
net.slideshare
59
19169524
56
0.000588
org.schema
60
19166214
162
0.000162
com.bing
61
19163388
180
0.000144
net.sourceforge
62
19155882
182
0.000143
org.wikimedia
63
19145618
48
0.000738
com.googleadservices
64
19143840
215
0.000109
com.businessinsider
65
19136040
233
0.000104
com.techcrunch
66
19125198
273
0.000089
com.reuters
67
19112730
152
0.000169
com.theguardian
68
19091708
177
0.000147
com.imdb
69
19081148
64
0.000496
net.jsdelivr
70
19076642
145
0.000177
org.apache
71
19068912
202
0.000120
org.gnu
72
19067720
250
0.000097
com.ibm
73
19065904
274
0.000089
com.cnet
74
19060402
194
0.000124
com.washingtonpost
75
19056162
164
0.000159
com.blogger
76
19049622
336
0.000073
gov.nasa
77
19043406
271
0.000090
com.android
78
19038878
32
0.001080
com.fontawesome
79
19030824
196
0.000123
com.huffingtonpost
80
19022764
243
0.000100
com.oracle
81
19022114
99
0.000323
com.shopify
82
19010092
178
0.000147
com.stackoverflow
83
19008866
264
0.000092
com.bbc
84
18991504
138
0.000194
com.wixsite
85
18979794
193
0.000128
org.ampproject
86
18979606
331
0.000074
com.latimes
87
18966924
334
0.000073
com.livejournal
88
18954352
148
0.000171
com.eventbrite
89
18952914
406
0.000061
com.zdnet
90
18951470
38
0.000950
com.addthis
91
18941168
260
0.000093
com.usatoday
92
18930306
261
0.000093
com.wired
93
18929948
473
0.000052
com.economist
94
18924894
122
0.000237
com.ytimg
95
18915820
295
0.000083
com.prnewswire
96
18907784
107
0.000304
com.whatsapp
97
18905562
241
0.000101
com.appspot
98
18903750
289
0.000086
org.npr
99
18899826
605
0.000046
com.thenextweb
100
18898732
139
0.000192
com.issuu
101
18897130
198
0.000122
org.ietf
102
18893188
181
0.000143
jp.co.yahoo
103
18889096
142
0.000183
com.spotify
104
18888760
449
0.000055
com.venturebeat
105
18888186
55
0.000590
eu.europa
106
18886240
382
0.000064
com.goodreads
107
18880882
37
0.000994
com.qq
108
18880880
601
0.000046
org.ieee
109
18876988
209
0.000114
com.bandcamp
110
18874448
359
0.000068
com.quora
111
18872666
426
0.000058
com.cisco
112
18869640
211
0.000112
net.behance
113
18866560
474
0.000052
org.arxiv
114
18852080
394
0.000062
com.buzzfeed
115
18844806
95
0.000330
com.sharethis
116
18834502
427
0.000058
com.deviantart
117
18834166
899
0.000031
com.ibtimes
118
18829762
185
0.000141
com.giphy
119
18828960
96
0.000328
com.statcounter
120
18825074
649
0.000043
com.stackexchange
121
18823624
170
0.000152
uk.co.google
122
18818848
283
0.000087
com.cnbc
123
18817384
825
0.000034
org.eclipse
124
18814566
333
0.000074
com.aol
125
18814392
485
0.000051
com.pixabay
126
18806944
206
0.000117
com.disqus
127
18800912
458
0.000054
com.about
128
18793968
42
0.000849
com.squarespace
129
18793572
522
0.000048
com.mysql
130
18792740
144
0.000180
com.yelp
131
18790794
355
0.000068
com.theatlantic
132
18787424
417
0.000059
me.about
133
18787006
317
0.000077
com.skype
134
18782636
476
0.000052
com.visualstudio
135
18780538
232
0.000104
me.t
136
18772666
948
0.000030
com.nvidia
137
18772560
468
0.000053
com.wikihow
138
18768358
276
0.000089
com.sciencedirect
139
18767822
224
0.000106
com.dribbble
140
18762266
324
0.000075
com.scribd
141
18759236
712
0.000039
google.blog
142
18756886
183
0.000143
com.salesforce
143
18756236
551
0.000048
com.slate
144
18753968
131
0.000208
com.dropbox
145
18751696
407
0.000061
uk.co.independent
146
18751242
299
0.000081
com.fastcompany
147
18746590
257
0.000094
com.googlecode
148
18746142
213
0.000111
com.hubspot
149
18744470
440
0.000057
com.newyorker
150
18744452
430
0.000058
com.box
151
18743332
120
0.000249
org.networkadvertising
152
18736956
667
0.000042
org.chromium
153
18735918
463
0.000053
gov.loc
154
18734190
297
0.000082
com.example
155
18733762
200
0.000121
com.cnn
156
18731428
671
0.000041
com.tinypic
157
18728160
269
0.000090
com.fc2
158
18726104
790
0.000035
com.nymag
159
18723184
707
0.000039
com.smashingmagazine
160
18719704
616
0.000045
com.evernote
161
18718576
272
0.000090
com.nbcnews
162
18716396
548
0.000048
net.azurewebsites
163
18710606
219
0.000108
com.npmjs
164
18709770
155
0.000167
org.archive
165
18708768
306
0.000079
com.w3schools
166
18705090
1024
0.000028
ca.utoronto
167
18703880
191
0.000130
jp.ne.hatena
168
18699974
477
0.000052
io.codepen
169
18699212
61
0.000544
com.vk
170
18699016
969
0.000029
com.ign
171
18694692
703
0.000039
com.speakerdeck
172
18694256
853
0.000033
com.mediafire
173
18691628
506
0.000049
com.foursquare
174
18686102
894
0.000031
com.nike
175
18684136
608
0.000046
com.trello
176
18679302
119
0.000251
info.aboutads
177
18676168
376
0.000066
com.mozilla
178
18670790
53
0.000604
com.wix
179
18669780
639
0.000044
uk.ac.ox
180
18664264
146
0.000174
com.amazon-adsystem
181
18661162
103
0.000317
com.paypalobjects
182
18658320
84
0.000366
com.bizjournals
183
18653438
342
0.000072
com.getpocket
184
18639078
316
0.000077
ca.google
185
18636324
498
0.000050
com.indiatimes
186
18628316
596
0.000047
com.pinimg
187
18626162
624
0.000045
com.cbslocal
188
18624278
311
0.000078
edu.mit
189
18623878
942
0.000030
com.chron
190
18622420
114
0.000272
net.windows
191
18618786
1158
0.000025
org.tensorflow
192
18618326
726
0.000038
ca.blogspot
193
18617602
842
0.000033
com.sap
194
18615678
841
0.000033
com.css-tricks
195
18612144
360
0.000068
com.entrepreneur
196
18606050
623
0.000045
com.libsyn
197
18603340
134
0.000205
com.unpkg
198
18602302
117
0.000253
com.stripe
199
18600352
308
0.000079
edu.harvard
200
18597464
226
0.000106
com.wsj
201
18595214
1070
0.000026
com.hackernoon
202
18594174
836
0.000033
com.thehill
203
18592786
59
0.000557
com.fb
204
18590510
625
0.000045
ca.cbc
205
18590172
912
0.000031
org.unicode
206
18586610
792
0.000035
com.buffer
207
18585880
369
0.000067
com.elsevier
208
18581126
794
0.000035
com.theglobeandmail
209
18580570
15
0.002238
com.wixstatic
210
18579326
363
0.000068
me.telegram
211
18578858
662
0.000042
com.searchengineland
212
18576402
179
0.000147
org.bbb
213
18574900
656
0.000043
site.business
214
18574476
481
0.000051
com.withgoogle
215
18574346
252
0.000097
es.google
216
18572616
874
0.000032
org.kernel
217
18572498
644
0.000044
com.flipboard
218
18571106
725
0.000038
co.ibb
219
18565014
658
0.000042
com.huffpost
220
18563588
1005
0.000028
edu.rutgers
221
18562788
848
0.000033
uk.co.wired
222
18560744
759
0.000036
com.ssrn
223
18560606
113
0.000272
com.weibo
224
18557646
1039
0.000027
com.aljazeera
225
18555860
736
0.000037
gov.archives
226
18554338
346
0.000071
com.mapbox
227
18554008
637
0.000044
org.d3js
228
18553278
151
0.000170
com.yimg
229
18551098
1093
0.000026
org.hrw
230
18549104
603
0.000046
gg.discord
231
18546834
1468
0.000020
com.hm
232
18546708
1146
0.000025
ly.visual
233
18545968
985
0.000029
com.geekwire
234
18545464
201
0.000120
com.optimizely
235
18544754
1251
0.000023
ca.huffingtonpost
236
18544114
212
0.000111
edu.stanford
237
18542860
943
0.000030
uk.co.huffingtonpost
238
18542334
618
0.000045
co.elastic
239
18539242
1877
0.000017
com.pearltrees
240
18535880
1829
0.000017
cn.people
241
18530690
1402
0.000021
com.diigo
242
18528804
296
0.000082
com.tinyurl
243
18528034
455
0.000054
com.mapquest
244
18525550
979
0.000029
org.slashdot
245
18524276
1106
0.000025
edu.osu
246
18523478
65
0.000473
net.akamaihd
247
18522316
882
0.000032
com.theconversation
248
18517436
278
0.000089
org.purl
249
18517362
375
0.000066
com.mashable
250
18514272
1097
0.000026
com.dw
251
18513770
934
0.000030
com.bt
252
18511512
844
0.000033
com.today
253
18511490
877
0.000032
com.marketwired
254
18510324
1267
0.000023
jp.co.ntv
255
18509842
1099
0.000026
com.mentalfloss
256
18509588
986
0.000029
com.computerworld
257
18505276
1376
0.000021
jp.ac.u-tokyo
258
18505224
840
0.000033
co.g
259
18503072
847
0.000033
com.healthline
260
18502704
782
0.000035
com.ecwid
261
18501610
1428
0.000021
com.sas
262
18501526
465
0.000053
com.yoast
263
18499006
1101
0.000025
edu.gatech
264
18490078
454
0.000054
com.moz
265
18487976
1636
0.000019
com.kaggle
266
18487772
1371
0.000021
com.makeuseof
267
18487400
501
0.000050
me.m
268
18487164
280
0.000088
com.bloomberg
269
18487154
771
0.000036
com.econsultancy
270
18486838
880
0.000032
uk.parliament
271
18483770
820
0.000034
com.newsweek
272
18483730
1300
0.000022
com.googlesource
273
18480640
1426
0.000021
blog.home
274
18480254
761
0.000036
com.outbrain
275
18478306
1077
0.000026
com.sfchronicle
276
18475570
204
0.000119
org.iana
277
18473642
313
0.000077
com.scorecardresearch
278
18471466
154
0.000169
gov.nih
279
18461762
1279
0.000022
com.avg
280
18459822
460
0.000054
com.theverge
281
18457732
636
0.000044
jp.shinobi
282
18455036
1009
0.000028
org.postgresql
283
18452056
1415
0.000021
com.dailydot
284
18449572
940
0.000030
com.foxbusiness
285
18448330
814
0.000034
com.adjust
286
18447054
764
0.000036
edu.brookings
287
18446074
767
0.000036
com.business2community
288
18437928
1861
0.000017
com.uniqlo
289
18435866
1446
0.000020
com.dezeen
290
18433310
347
0.000071
com.trustpilot
291
18432722
805
0.000035
com.contentmarketinginstitute
292
18431968
1036
0.000027
com.trendmicro
293
18431222
909
0.000031
org.aarp
294
18430922
850
0.000033
com.searchenginewatch
295
18428042
309
0.000078
org.python
296
18427984
160
0.000163
com.twimg
297
18427454
508
0.000049
edu.berkeley
298
18425200
768
0.000036
uk.co.pinterest
299
18423396
446
0.000055
com.bigcommerce
300
18418266
1358
0.000021
edu.iastate
301
18417172
1307
0.000022
com.motherjones
302
18416930
715
0.000039
com.techtarget
303
18414068
281
0.000087
com.myspace
304
18413522
1154
0.000025
com.hostgator
305
18411314
1046
0.000027
com.medicalnewstoday
306
18410696
1025
0.000028
com.bustle
307
18410084
69
0.000420
com.list-manage
308
18409562
323
0.000076
uk.co.telegraph
309
18409362
330
0.000074
com.meetup
310
18408690
1168
0.000024
org.openoffice
311
18406012
1296
0.000022
com.contently
312
18403532
720
0.000038
com.cdbaby
313
18402004
514
0.000049
com.adage
314
18401404
1337
0.000022
org.wnyc
315
18400286
714
0.000039
com.neilpatel
316
18398610
1527
0.000020
com.mathworks
317
18397518
478
0.000052
net.researchgate
318
18394598
938
0.000030
co.apple
319
18394586
329
0.000074
com.go
320
18393232
94
0.000347
com.godaddy
321
18392026
411
0.000060
com.msn
322
18391674
404
0.000061
com.ted
323
18390148
1484
0.000020
io.material
324
18389994
817
0.000034
com.arstechnica
325
18389508
860
0.000033
com.wikia
326
18388188
970
0.000029
com.vogue
327
18384912
377
0.000066
me.wa
328
18382020
1475
0.000020
se.blogspot
329
18380670
789
0.000035
edu.washington
330
18380362
165
0.000158
com.opera
331
18377106
248
0.000098
com.rawgit
332
18376938
819
0.000034
com.bandsintown
333
18374558
1066
0.000026
com.convinceandconvert
334
18374154
1238
0.000023
com.convertkit
335
18373850
1871
0.000017
io.soup
336
18370310
1438
0.000020
com.secondlife
337
18366152
1721
0.000018
com.zara
338
18364438
287
0.000086
com.live
339
18362628
238
0.000102
com.surveymonkey
340
18358654
188
0.000132
com.etsy
341
18356388
169
0.000153
com.feedburner
342
18356094
1944
0.000016
edu.uark
343
18355048
1911
0.000017
com.mysanantonio
344
18354726
266
0.000092
uk.org.ico
345
18352630
429
0.000058
org.hbr
346
18352406
602
0.000046
com.livechatinc
347
18352058
1493
0.000020
com.thenation
348
18351586
750
0.000037
com.yellowpages
349
18349922
112
0.000281
com.mailchimp
350
18349126
815
0.000034
com.wordstream
351
18349062
1506
0.000020
com.toptal
352
18347036
1216
0.000024
io.itch
353
18342626
494
0.000050
com.kickstarter
354
18341572
235
0.000104
com.typepad
355
18340608
420
0.000059
com.googleblog
356
18338046
366
0.000068
com.aliyuncs
357
18337760
1670
0.000018
com.manta
358
18337600
1463
0.000020
com.amcharts
359
18336652
1419
0.000021
com.indiewire
360
18335378
456
0.000054
com.fortune
361
18333310
51
0.000663
net.fbcdn
362
18333286
231
0.000105
uk.co.amazon
363
18331092
1587
0.000019
ly.adobe
364
18329760
922
0.000030
com.searchenginejournal
365
18328602
1569
0.000019
ms.nyti
366
18325744
374
0.000066
com.ft
367
18323516
2004
0.000016
com.zoominfo
368
18323442
1179
0.000024
com.grammarly
369
18321780
1620
0.000019
li.paper
370
18321750
1210
0.000024
com.csmonitor
371
18321512
2148
0.000015
com.brandyourself
372
18320716
2076
0.000015
me.websta
373
18310842
340
0.000072
com.getclicky
374
18310400
911
0.000031
uk.gov.nationalarchives
375
18306846
863
0.000033
com.engadget
376
18304066
159
0.000165
com.zendesk
377
18300980
962
0.000029
com.cio
378
18300896
87
0.000360
de.google
379
18300852
1625
0.000019
id.co.blogspot
380
18300752
1735
0.000018
org.unfpa
381
18299478
686
0.000040
com.intel
382
18297716
431
0.000058
com.nationalgeographic
383
18297684
1942
0.000016
com.cinemablend
384
18295492
1939
0.000016
com.wral
385
18295372
663
0.000042
com.vice
386
18294668
443
0.000056
com.oreilly
387
18294554
1020
0.000028
com.weddingwire
388
18293034
461
0.000053
com.nature
389
18292998
1440
0.000020
com.harpercollins
390
18291570
290
0.000085
gov.cdc
391
18290658
364
0.000068
com.githubusercontent
392
18290368
520
0.000048
com.photobucket
393
18290364
926
0.000030
com.socialmediaexaminer
394
18290020
998
0.000028
com.firebaseapp
395
18289150
875
0.000032
com.angieslist
396
18288842
901
0.000031
com.sendpulse
397
18288628
822
0.000034
edu.columbia
398
18287750
823
0.000034
com.pexels
399
18286600
1541
0.000019
com.mindbodygreen
400
18279452
1516
0.000020
com.mailjet
401
18278356
149
0.000171
com.tripadvisor
402
18278168
319
0.000077
com.wiley
403
18276930
1850
0.000017
com.merchantcircle
404
18276454
268
0.000090
com.digg
405
18276088
1890
0.000017
fr.huffingtonpost
406
18275746
1695
0.000018
com.thoughtworks
407
18273760
1014
0.000028
org.ocks
408
18273220
2062
0.000015
jp.pinterest
409
18272768
484
0.000051
com.cbsnews
410
18271878
352
0.000069
int.who
411
18270528
816
0.000034
com.format
412
18270108
255
0.000096
net.php
413
18269924
1464
0.000020
com.thecut
414
18268658
2055
0.000015
org.spie
415
18264554
214
0.000110
org.aboutcookies
416
18263300
1231
0.000023
com.mynewsdesk
417
18261732
409
0.000060
com.office
418
18261624
1071
0.000026
com.fastcodesign
419
18260856
1452
0.000020
fr.liberation
420
18260774
335
0.000073
com.time
421
18260366
444
0.000056
org.freecodecamp
422
18260020
1606
0.000019
com.dummies
423
18259400
1778
0.000018
com.instapaper
424
18258930
755
0.000036
com.mediapost
425
18255842
630
0.000044
com.proofpoint
426
18254118
1878
0.000017
it.binged
427
18254086
1321
0.000022
ly.snip
428
18252858
416
0.000059
uk.co.dailymail
429
18249260
604
0.000046
org.nodejs
430
18248590
392
0.000062
fr.free
431
18248492
464
0.000053
com.statista
432
18247356
879
0.000032
com.gizmodo
433
18246646
315
0.000077
com.st-hatena
434
18245388
1660
0.000018
com.superpages
435
18244078
1120
0.000025
com.theknot
436
18243678
357
0.000068
com.unsplash
437
18241494
1397
0.000021
com.jeffbullas
438
18236208
1522
0.000020
com.biography
439
18235946
2146
0.000015
de.huffingtonpost
440
18234820
1432
0.000021
com.csoonline
441
18234726
1486
0.000020
com.louisvuitton
442
18233512
121
0.000246
com.jimdo
443
18232920
1040
0.000027
uk.ac.cam
444
18232348
1338
0.000022
google.ai
445
18231586
2190
0.000014
com.mango
446
18230902
1227
0.000023
com.activecampaign
447
18226336
964
0.000029
com.netlify
448
18226172
953
0.000030
com.eater
449
18223984
1004
0.000028
com.smallbiztrends
450
18223564
2105
0.000015
site.negocio
451
18223100
277
0.000089
com.ebay
452
18221778
1301
0.000022
ca.yellowpages
453
18220422
689
0.000040
com.windowsphone
454
18220366
775
0.000035
com.marketwatch
455
18219714
1147
0.000025
com.redhat
456
18217972
2170
0.000015
edu.scad
457
18217660
1290
0.000022
com.digitaltrends
458
18217318
1123
0.000025
org.mathjax
459
18216670
1658
0.000019
com.politifact
460
18215546
2225
0.000014
com.dexknows
461
18214790
490
0.000050
gov.whitehouse
462
18210044
1225
0.000023
com.quicksprout
463
18207494
176
0.000150
com.slack
464
18205208
1655
0.000019
uk.co.bbci
465
18203194
1336
0.000022
com.cmswire
466
18202308
79
0.000382
net.jsfiddle
467
18199674
1683
0.000018
com.nyt
468
18198490
1928
0.000016
com.itsnicethat
469
18197492
835
0.000034
edu.psu
470
18196856
354
0.000068
com.booking
471
18196796
688
0.000040
com.webs
472
18195852
960
0.000030
edu.ucla
473
18191364
701
0.000039
gov.nist
474
18191138
945
0.000030
com.sprinklr
475
18191102
307
0.000079
gov.ca
476
18188332
76
0.000389
com.livestream
477
18186908
1375
0.000021
net.openid
478
18186750
1131
0.000025
gov.fbi
479
18185834
475
0.000052
tv.twitch
480
18183498
1982
0.000016
google.design
481
18176950
1790
0.000017
com.psmag
482
18175788
774
0.000036
com.oath
483
18173816
1498
0.000020
org.gnupg
484
18172144
351
0.000069
com.hp
485
18171630
291
0.000085
org.acm
486
18167296
2488
0.000013
org.travelblog
487
18167032
2243
0.000014
com.ingress
488
18165578
1264
0.000023
com.coschedule
489
18164766
1746
0.000018
com.financialexpress
490
18164648
1868
0.000017
com.allafrica
491
18164360
1110
0.000025
edu.princeton
492
18163672
2305
0.000014
com.tommy
493
18163376
1629
0.000019
org.whatbrowser
494
18162344
1299
0.000022
com.kinsta
495
18161312
2441
0.000014
com.algorithmia
496
18161140
530
0.000048
net.brightcove
497
18158432
2024
0.000016
jp.riken
498
18157666
831
0.000034
com.msdn
499
18157640
511
0.000049
edu.cornell
500
18156568
2279
0.000014
com.theminimalists
501
18153940
236
0.000103
to.amzn
502
18153840
1460
0.000020
net.noscript
503
18147056
305
0.000079
com.typeform
504
18146838
1727
0.000018
com.iconarchive
505
18145360
928
0.000030
org.weforum
506
18144622
838
0.000033
com.git-scm
507
18143540
2201
0.000014
net.organicfacts
508
18140242
1724
0.000018
com.gap
509
18138900
709
0.000039
org.bitbucket
510
18136630
403
0.000061
com.dailymotion
511
18134648
393
0.000062
com.nypost
512
18134424
2244
0.000014
com.bonfire
513
18133832
2095
0.000015
it.polito
514
18132572
903
0.000031
com.sfgate
515
18130544
239
0.000101
com.stumbleupon
516
18130386
2273
0.000014
net.brownbook
517
18129594
2073
0.000015
com.zynga
518
18127296
523
0.000048
edu.yale
519
18126134
1656
0.000019
com.wayfair
520
18125998
254
0.000096
org.drupal
521
18125926
381
0.000065
org.un
522
18123812
2300
0.000014
com.23hq
523
18122836
614
0.000045
gov.sec
524
18120156
415
0.000059
com.gmail
525
18119684
1196
0.000024
com.playstation
526
18118234
1717
0.000018
org.polymer-project
527
18113328
1772
0.000018
za.co.iol
528
18112548
1994
0.000016
au.com.huffingtonpost
529
18110922
2344
0.000014
com.marksdailyapple
530
18110894
1507
0.000020
com.impactbnd
531
18109740
648
0.000043
com.jwplatform
532
18109236
1513
0.000020
com.instapage
533
18107292
1374
0.000021
com.ning
534
18106990
2035
0.000015
com.dreamgrow
535
18106122
292
0.000085
cn.com.sina
536
18105010
2188
0.000014
net.openreview
537
18104806
1604
0.000019
com.aolcdn
538
18104004
210
0.000113
com.constantcontact
539
18103874
1889
0.000017
uk.ac.jisc
540
18103656
1993
0.000016
com.towardsdatascience
541
18101830
1669
0.000018
com.thermofisher
542
18100270
1924
0.000016
com.city-data
543
18099900
941
0.000030
uk.co.guardian
544
18099826
2136
0.000015
com.whitepages
545
18099500
1891
0.000017
com.deepmind
546
18098496
611
0.000046
com.mobirise
547
18097440
356
0.000068
com.springer
548
18096278
1929
0.000016
org.elasticsearch
549
18094390
743
0.000037
com.steampowered
550
18092048
1091
0.000026
com.auth0
551
18092008
192
0.000128
com.eepurl
552
18091694
1214
0.000024
kr.or.kisa
553
18090800
832
0.000034
gov.senate
554
18090404
71
0.000398
me.fb
555
18090188
3930
0.000010
com.artstation
556
18090110
684
0.000040
org.eff
557
18088506
2075
0.000015
com.quickanddirtytips
558
18088220
1856
0.000017
com.googledrive
559
18087890
2267
0.000014
lb.com.dailystar
560
18087370
1083
0.000026
de.spiegel
561
18087184
2189
0.000014
com.oilprice
562
18086598
1377
0.000021
io.bower
563
18086586
1997
0.000016
com.batchgeo
564
18086360
1060
0.000027
com.clicky
565
18085990
1172
0.000024
com.merriam-webster
566
18084746
1927
0.000016
com.nytco
567
18084272
284
0.000087
com.histats
568
18083856
1613
0.000019
org.jenkins-ci
569
18083580
1966
0.000016
com.underconsideration
570
18083090
2221
0.000014
com.swatch
571
18081868
617
0.000045
uk.co.blogspot
572
18078936
343
0.000071
com.sxsw
573
18078574
619
0.000045
com.patreon
574
18077388
1471
0.000020
io.getmdl
575
18076506
1081
0.000026
com.hollywoodreporter
576
18075394
610
0.000046
com.163
577
18075348
156
0.000166
ru.mail
578
18074940
1845
0.000017
com.rabbitmq
579
18074636
1783
0.000017
com.lexology
580
18074550
1665
0.000018
com.invisionapp
581
18074272
1987
0.000016
com.lightreading
582
18073906
1351
0.000021
edu.northwestern
583
18073556
996
0.000028
com.ubuntu
584
18073454
2111
0.000015
edu.dukeupress
585
18071852
2173
0.000015
org.onegreenplanet
586
18071480
2141
0.000015
com.hotfrog
587
18070592
2361
0.000014
edu.uah
588
18068872
1369
0.000021
org.khanacademy
589
18068438
1461
0.000020
uk.co.thesun
590
18066904
3487
0.000012
com.wikidot
591
18066370
1614
0.000019
com.digitaloceanspaces
592
18065602
2432
0.000014
net.sott
593
18065500
1547
0.000019
com.technologyreview
594
18065282
349
0.000070
com.staticflickr
595
18063590
78
0.000383
org.reactjs
596
18061954
660
0.000042
com.xinhuanet
597
18061776
2555
0.000013
com.idt
598
18061678
247
0.000098
de.amazon
599
18061268
739
0.000037
com.qz
600
18057936
1931
0.000016
com.googleapps
601
18057884
1753
0.000018
io.pantheon
602
18057768
2132
0.000015
net.eenews
603
18057116
779
0.000035
com.deloitte
604
18057038
1651
0.000019
com.checkatrade
605
18054622
657
0.000043
com.psychologytoday
606
18054596
900
0.000031
gov.nps
607
18051078
1973
0.000016
com.shoutmeloud
608
18049174
2761
0.000013
ca.411
609
18048620
1496
0.000020
com.citysearch
610
18048184
1748
0.000018
com.tutsplus
611
18044990
2126
0.000015
io.flutter
612
18044036
2064
0.000015
com.vanguardngr
613
18043242
1473
0.000020
edu.unc
614
18043174
1934
0.000016
com.gimletmedia
615
18042626
1627
0.000019
com.fifa
616
18041436
2463
0.000013
org.simile-widgets
617
18041230
932
0.000030
edu.upenn
618
18040930
2098
0.000015
com.designobserver
619
18040834
661
0.000042
org.pbs
620
18040552
2410
0.000014
com.ubu
621
18040422
1118
0.000025
net.recode
622
18039566
1208
0.000024
jobs.amazon
623
18038446
345
0.000071
com.tripod
624
18036562
1315
0.000022
edu.purdue
625
18035196
921
0.000030
com.variety
626
18034502
980
0.000029
com.alexa
627
18034150
1211
0.000024
us.imageshack
628
18033174
2198
0.000014
edu.arizona
629
18032862
2008
0.000016
in.huffingtonpost
630
18030734
1592
0.000019
com.yell
631
18030278
974
0.000029
org.sciencemag
632
18029728
1320
0.000022
uk.co.theregister
633
18028246
1679
0.000018
com.verywellmind
634
18025954
852
0.000033
org.worldbank
635
18025638
865
0.000033
io.readthedocs
636
18025104
130
0.000208
com.youku
637
18024714
2178
0.000015
com.epochtimes
638
18024306
2186
0.000015
info.bem
639
18023398
221
0.000107
com.taobao
640
18022288
1144
0.000025
com.elpais
641
18021800
1963
0.000016
org.dartlang
642
18021566
1088
0.000026
org.altervista
643
18021294
358
0.000068
org.debian
644
18020992
445
0.000056
com.force
645
18020940
1275
0.000023
com.ifttt
646
18020466
2209
0.000014
com.youm7
647
18019640
1073
0.000026
com.vox
648
18019568
1933
0.000016
com.hulu
649
18019032
2256
0.000014
au.com.yellowpages
650
18018980
2505
0.000013
com.pushwoosh
651
18016612
1177
0.000024
com.nydailynews
652
18016130
698
0.000039
gov.noaa
653
18014600
1657
0.000019
com.yext
654
18014022
958
0.000030
com.shutterstock
655
18013628
2320
0.000014
com.gifyu
656
18013320
1262
0.000023
com.storify
657
18013256
676
0.000041
com.samsung
658
18012944
1095
0.000026
edu.ucsd
659
18011978
422
0.000058
edu.nyu
660
18009736
696
0.000040
com.tandfonline
661
18009582
447
0.000055
com.atlassian
662
18009246
896
0.000031
com.geocities
663
18008812
439
0.000057
edu.cmu
664
18008746
2433
0.000014
com.yelloyello
665
18008602
780
0.000035
com.netflix
666
18007440
1291
0.000022
tv.ustream
667
18007104
620
0.000045
us.icio
668
18006812
1138
0.000025
edu.utexas
669
18005924
448
0.000055
com.gitlab
670
18005790
2093
0.000015
com.targetmarketingmag
671
18004306
2166
0.000015
com.cargurus
672
18004206
886
0.000032
com.docker
673
18002932
1191
0.000024
com.trustedshops
674
18002218
2479
0.000013
com.analyticsvidhya
675
18001434
2445
0.000013
com.2findlocal
676
17998520
845
0.000033
com.foxnews
677
17997146
2080
0.000015
jp.huffingtonpost
678
17995736
2637
0.000013
com.instructables
679
17995238
1945
0.000016
com.nokia
680
17995100
1197
0.000024
edu.academia
681
17992664
756
0.000036
com.gettyimages
682
17991230
245
0.000099
com.wpengine
683
17991084
2212
0.000014
ca.uwaterloo
684
17988686
2547
0.000013
com.cmgdigital
685
17987090
866
0.000033
edu.umich
686
17986974
693
0.000040
com.symantec
687
17986634
810
0.000034
net.2mdn
688
17986626
2129
0.000015
com.mondaq
689
17986164
952
0.000030
com.ycombinator
690
17985794
2206
0.000014
com.keepersecurity
691
17985096
388
0.000063
com.newrelic
692
17984746
2054
0.000015
com.doctoroz
693
17984534
908
0.000031
com.uservoice
694
17983862
207
0.000115
com.naver
695
17982684
1557
0.000019
com.pastebin
696
17980416
189
0.000132
com.xing
697
17978736
1857
0.000017
com.duckduckgo
698
17978312
956
0.000030
com.thinkwithgoogle
699
17978128
1907
0.000017
se.haxx
700
17976984
2007
0.000016
com.thecvf
701
17975926
2255
0.000014
au.com.truelocal
702
17974640
3534
0.000012
com.9to5mac
703
17974534
2130
0.000015
uk.co.yelp
704
17974162
1444
0.000020
fm.last
705
17974086
891
0.000032
com.dropboxusercontent
706
17973354
1554
0.000019
com.sankei
707
17972924
2205
0.000014
com.tiddlywiki
708
17971858
2315
0.000014
com.galvanize
709
17971240
2149
0.000015
es.huffingtonpost
710
17971108
249
0.000098
com.automattic
711
17969728
920
0.000031
com.investopedia
712
17967994
2235
0.000014
com.bizcommunity
713
17967458
1156
0.000025
org.cambridge
714
17967296
1220
0.000023
com.freeprivacypolicy
715
17967286
917
0.000031
org.change
716
17966542
2145
0.000015
com.winemag
717
17966324
2444
0.000014
com.maritime-executive
718
17965424
1052
0.000027
gov.uspto
719
17964464
2556
0.000013
com.alternion
720
17963358
1834
0.000017
com.autodesk
721
17963124
2411
0.000014
com.communitywalk
722
17962726
1839
0.000017
org.coursera
723
17962202
1255
0.000023
com.upwork
724
17960682
2341
0.000014
net.futurecdn
725
17959974
2089
0.000015
com.kudzu
726
17959858
2352
0.000014
com.ericsson
727
17958320
1832
0.000017
com.adespresso
728
17956922
2527
0.000013
edu.alamo
729
17956784
1260
0.000023
com.irishtimes
730
17956778
2342
0.000014
com.filedn
731
17956660
1353
0.000021
edu.usc
732
17956158
1041
0.000027
com.wunderground
733
17955722
864
0.000033
br.com.uol
734
17955718
697
0.000039
com.gartner
735
17955384
2254
0.000014
com.gamespot
736
17955254
2074
0.000015
com.btplc
737
17954432
2058
0.000015
com.showmelocal
738
17954256
2386
0.000014
com.massimodutti
739
17953774
2020
0.000016
edu.virginia
740
17953600
1731
0.000018
com.ikea
741
17953476
2260
0.000014
com.insiderpages
742
17953416
1274
0.000023
com.indiegogo
743
17952676
2030
0.000016
com.goinswriter
744
17949694
2440
0.000014
com.bershka
745
17949350
2184
0.000015
com.almanac
746
17949160
770
0.000036
gov.census
747
17946880
1233
0.000023
com.intuit
748
17945914
413
0.000060
com.inc
749
17944532
4347
0.000009
com.programmableweb
750
17943268
1132
0.000025
com.pcmag
751
17942666
2194
0.000014
com.writersdigest
752
17942134
2283
0.000014
com.citysquares
753
17941646
1535
0.000020
com.fiverr
754
17941316
1872
0.000017
com.csswizardry
755
17941296
1373
0.000021
com.vanityfair
756
17941172
1903
0.000017
jp.sankeibiz
757
17940796
2456
0.000013
com.live5news
758
17939234
1117
0.000025
gov.usgs
759
17938916
914
0.000031
com.zoho
760
17938282
3400
0.000012
com.freep
761
17937408
830
0.000034
com.blackberry
762
17937256
2163
0.000015
jp.booklog
763
17936844
2351
0.000014
com.thedrinksbusiness
764
17935426
1057
0.000027
com.politico
765
17935388
2197
0.000014
com.winefolly
766
17934768
687
0.000040
com.alibaba
767
17934366
2970
0.000013
com.jeeran
768
17934080
2494
0.000013
io.stackedit
769
17933862
1958
0.000016
ca.ubc
770
17933764
161
0.000163
me.line
771
17933538
3940
0.000010
org.greenpeace
772
17933062
2381
0.000014
com.yellowbook
773
17932336
2459
0.000013
za.co.bdlive
774
17932212
2396
0.000014
com.asianage
775
17932090
1631
0.000019
com.udemy
776
17932058
3415
0.000012
com.glamour
777
17931830
1635
0.000019
com.chrome
778
17931642
1194
0.000024
com.techrepublic
779
17931614
849
0.000033
com.unity3d
780
17931590
2033
0.000015
mp.j
781
17931536
598
0.000047
gov.usda
782
17931004
2365
0.000014
net.islamweb
783
17929334
808
0.000034
int.wipo
784
17928466
2355
0.000014
com.wsoctv
785
17927592
665
0.000042
com.marketo
786
17927048
1049
0.000027
edu.umn
787
17926932
414
0.000060
mp.mailchi
788
17926354
968
0.000029
com.aliexpress
789
17925900
2608
0.000013
org.torproject
790
17925360
2322
0.000014
com.utah
791
17925228
954
0.000030
com.sciencedaily
792
17924502
1932
0.000016
org.ap
793
17924098
724
0.000038
gov.house
794
17923976
2208
0.000014
com.chamberofcommerce
795
17923594
1953
0.000016
com.urbandictionary
796
17923558
2570
0.000013
com.spoke
797
17922794
2807
0.000013
com.salespider
798
17921176
2369
0.000014
com.ibmbigdatahub
799
17921126
981
0.000029
au.net.abc
800
17921074
1565
0.000019
com.problogger
801
17921008
533
0.000048
com.snapchat
802
17920922
1425
0.000021
fr.lemonde
803
17919346
141
0.000185
jp.co.google
804
17917592
5539
0.000006
cc.co
805
17917020
2017
0.000016
com.posterous
806
17916882
1129
0.000025
com.canva
807
17916104
1638
0.000019
com.britannica
808
17915782
2379
0.000014
com.wpxi
809
17915468
2040
0.000015
edu.cuny
810
17915450
2519
0.000013
com.americantowns
811
17914364
675
0.000041
gov.hhs
812
17913996
2287
0.000014
org.themoth
813
17913328
1420
0.000021
com.rollingstone
814
17913022
1245
0.000023
com.xkcd
815
17912984
3582
0.000011
edu.brown
816
17912880
634
0.000044
com.feedly
817
17912458
2447
0.000013
com.hdnux
818
17912396
2609
0.000013
com.zionsbank
819
17912316
2515
0.000013
com.pacegallery
820
17911434
2645
0.000013
com.tupalo
821
17911136
640
0.000044
au.com.google
822
17911060
843
0.000033
com.uk
823
17909056
128
0.000215
com.youtube-nocookie
824
17908862
1322
0.000022
com.vmware
825
17908568
2094
0.000015
org.semanticscholar
826
17908342
1970
0.000016
com.sanspo
827
17908248
1013
0.000028
com.java
828
17908178
2239
0.000014
it.scoop
829
17907842
470
0.000053
com.adweek
830
17907550
2301
0.000014
uk.co.dennis
831
17907474
2108
0.000015
jp.co.sankei
832
17906950
2576
0.000013
za.co.sowetanlive
833
17906748
504
0.000049
gov.copyright
834
17906250
362
0.000068
com.wufoo
835
17905562
2310
0.000014
edu.uci
836
17904908
2052
0.000015
jp.ne.iza
837
17904810
2543
0.000013
org.foodrevolution
838
17904456
2491
0.000013
com.thewritepractice
839
17904454
2147
0.000015
com.parksassociates
840
17904410
1111
0.000025
fr.blogspot
841
17904034
2636
0.000013
au.com.whitepages
842
17903878
1329
0.000022
com.billboard
843
17903360
1272
0.000023
com.prezi
844
17902216
2172
0.000015
com.local
845
17901414
320
0.000076
gov.ftc
846
17899970
1831
0.000017
edu.illinois
847
17899586
975
0.000029
com.indeed
848
17899066
811
0.000034
org.unesco
849
17898986
1852
0.000017
com.hatenablog
850
17898184
2497
0.000013
dk.brics
851
17898006
2118
0.000015
uk.ac.ed
852
17897618
1173
0.000024
org.unicef
853
17897232
425
0.000058
com.criteo
854
17896398
2151
0.000015
org.linuxfoundation
855
17896068
2215
0.000014
com.vendio
856
17895718
1981
0.000016
uk.ac.ucl
857
17894996
279
0.000089
com.marriott
858
17894196
6753
0.000005
com.blog
859
17893762
1187
0.000024
com.steamcommunity
860
17893582
834
0.000034
com.gofundme
861
17893354
2022
0.000016
net.privacypolicytemplate
862
17893038
4067
0.000009
com.virustotal
863
17892018
467
0.000053
com.iconfinder
864
17891616
2541
0.000013
com.lacartes
865
17891274
2269
0.000014
ai.fast
866
17891232
1796
0.000017
com.howstuffworks
867
17888984
1222
0.000023
com.dell
868
17888866
2473
0.000013
com.ibegin
869
17888166
1470
0.000020
com.over-blog
870
17887404
350
0.000069
net.themeforest
871
17887376
302
0.000080
com.netdna-ssl
872
17887272
3576
0.000011
edu.tufts
873
17886572
2487
0.000013
za.co.moneyweb
874
17886136
1689
0.000018
com.twilio
875
17886042
887
0.000032
com.hootsuite
876
17884576
1230
0.000023
com.gallup
877
17884332
2394
0.000014
com.machinelearningmastery
878
17882906
2389
0.000014
io.dropwizard
879
17882512
992
0.000028
com.att
880
17881416
2308
0.000014
com.ehow
881
17880788
3660
0.000011
com.discogs
882
17880724
2333
0.000014
com.blogs
883
17880688
2139
0.000015
com.dandb
884
17879696
486
0.000051
com.squareup
885
17879532
1037
0.000027
gov.bls
886
17879478
401
0.000061
com.bitly
887
17878540
3665
0.000011
com.twitpic
888
17878358
3064
0.000013
com.invoicesherpa
889
17877578
650
0.000043
com.herokuapp
890
17877102
2265
0.000014
ru.narod
891
17876570
1875
0.000017
com.tunein
892
17875252
1570
0.000019
com.com
893
17874650
1980
0.000016
jp.co.zakzak
894
17874052
613
0.000046
com.airbnb
895
17873488
2123
0.000015
uk.co.realbusiness
896
17872208
837
0.000033
gov.justice
897
17871838
1951
0.000016
co.gcdn
898
17871618
267
0.000091
com.myshopify
899
17870844
3504
0.000012
de.bild
900
17870478
234
0.000104
jp.co.amazon
901
17869976
1905
0.000017
org.filezilla-project
902
17869722
2574
0.000013
com.growtix
903
17869710
1922
0.000016
com.newsfactor
904
17868626
2775
0.000013
org.earthmagazine
905
17868240
3595
0.000011
cc.tiny
906
17868102
339
0.000072
org.opensource
907
17867628
1710
0.000018
org.owasp
908
17867434
1678
0.000018
org.cancer
909
17865220
370
0.000067
org.doi
910
17864752
1215
0.000024
ly.ow
911
17864458
2820
0.000013
co.iglobal
912
17863556
1330
0.000022
edu.uchicago
913
17863262
133
0.000206
de.bund
914
17862854
259
0.000094
com.getbootstrap
915
17862186
499
0.000050
com.nasdaq
916
17861824
1000
0.000028
com.lifehacker
917
17861748
1271
0.000023
org.pnas
918
17861644
395
0.000062
io.atom
919
17861324
1458
0.000020
in.blogspot
920
17860306
2644
0.000013
ai.becominghuman
921
17860248
2680
0.000013
com.googlemaps
922
17858530
2009
0.000016
net.nend
923
17856868
4730
0.000008
com.colourlovers
924
17856798
1413
0.000021
com.splashthat
925
17856676
982
0.000029
com.jetbrains
926
17856176
915
0.000031
jp.livedoor
927
17856152
303
0.000080
com.ssl-images-amazon
928
17854224
2643
0.000013
nl.zeelandnet
929
17853620
869
0.000032
com.pingdom
930
17853534
3078
0.000013
com.sophos
931
17852908
2525
0.000013
gr.huffingtonpost
932
17852142
1002
0.000028
de.blogspot
933
17850682
2760
0.000013
com.fox13memphis
934
17850488
2114
0.000015
com.richmediagallery
935
17850378
1821
0.000017
com.hotmail
936
17850366
72
0.000395
com.messenger
937
17850276
2231
0.000014
edu.asu
938
17850052
995
0.000028
org.iso
939
17849976
1389
0.000021
com.imimg
940
17849372
1145
0.000025
com.uber
941
17849120
2356
0.000014
com.tuck
942
17848556
1726
0.000018
com.nba
943
17848232
2404
0.000014
jp.news24
944
17847732
1559
0.000019
com.ogilvy
945
17847318
2772
0.000013
com.addustour
946
17846800
2831
0.000013
org.grayarea
947
17846714
2103
0.000015
com.homestars
948
17846650
1136
0.000025
com.seattletimes
949
17846580
265
0.000092
ru.rambler
950
17845988
2362
0.000014
edu.utah
951
17845868
3862
0.000010
com.starwars
952
17845640
479
0.000051
jp.ne.sakura
953
17844718
1063
0.000027
gov.congress
954
17843102
1410
0.000021
dk.datatilsynet
955
17842932
859
0.000033
com.stitcher
956
17842798
2971
0.000013
com.oilandgas360
957
17842486
1785
0.000017
edu.umd
958
17842430
758
0.000036
com.yandex
959
17840100
1885
0.000017
com.wetransfer
960
17839628
2457
0.000013
ms.1drv
961
17838212
977
0.000029
com.prweb
962
17838086
423
0.000058
com.smugmug
963
17837702
2414
0.000014
com.delta
964
17836356
2306
0.000014
edu.bu
965
17836156
1141
0.000025
com.500px
966
17834668
2796
0.000013
org.cmlibrary
967
17834248
2565
0.000013
com.fixr
968
17833764
1312
0.000022
com.firefox
969
17833368
2050
0.000015
edu.ufl
970
17831610
2409
0.000014
ca.ualberta
971
17831386
3977
0.000010
com.thingiverse
972
17830888
400
0.000061
com.discordapp
973
17830714
3817
0.000010
edu.unl
974
17829748
2746
0.000013
tw.com.ibon
975
17829134
2490
0.000013
au.com.hotfrog
976
17828966
2350
0.000014
de.mpg
977
17828928
1160
0.000025
com.timeanddate
978
17828580
2495
0.000013
com.figure-eight
979
17828574
2370
0.000014
com.codecademy
980
17827964
890
0.000032
gov.usa
981
17827518
256
0.000096
it.google
982
17827064
2427
0.000014
com.outboundengine
983
17826874
1308
0.000022
com.strikingly
984
17826840
1243
0.000023
com.target
985
17825758
2455
0.000013
com.theblogpress
986
17825300
2585
0.000013
com.expressbusinessdirectory
987
17825288
2216
0.000014
com.nfl
988
17825192
2607
0.000013
com.elocal
989
17825120
2628
0.000013
au.com.news
990
17824314
1116
0.000025
com.scientificamerican
991
17824168
1325
0.000022
co.vine
992
17823710
747
0.000037
com.cargocollective
993
17823530
691
0.000040
com.caniuse
994
17821930
2107
0.000015
com.angelfire
995
17820788
3005
0.000013
com.hbo
996
17820648
1639
0.000019
uk.co.screamingfrog
997
17820304
2617
0.000013
com.ovoenergy
998
17820010
737
0.000037
uk.co.eventbrite
999
17819726
2691
0.000013
com.normacomics
1000
17819576
752
0.000037
com.sagepub
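The listing above carries one value per line. The following is a minimal Python sketch for regrouping it into records. It assumes (the listing itself does not say so) that every record spans five consecutive lines: a rank position, an integer score (presumably the harmonic centrality value), a second rank position, a fractional score (presumably PageRank), and the domain name in reversed notation. The field names and helper functions are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class RankRow(NamedTuple):
    # Field names are assumptions; the listing has no header in this section.
    hc_pos: int      # first rank position
    hc_val: int      # integer score (presumably harmonic centrality)
    pr_pos: int      # second rank position
    pr_val: float    # fractional score (presumably PageRank)
    domain: str      # domain name in normal (un-reversed) notation


def unreverse(host_rev: str) -> str:
    """Turn 'uk.co.telegraph' into 'telegraph.co.uk'."""
    return ".".join(reversed(host_rev.split(".")))


def parse_rows(lines: Iterable[str]) -> Iterator[RankRow]:
    """Group non-empty lines into five-field records; drop a trailing partial record."""
    values = [ln.strip() for ln in lines if ln.strip()]
    usable = len(values) - len(values) % 5
    for i in range(0, usable, 5):
        hc_pos, hc_val, pr_pos, pr_val, host_rev = values[i:i + 5]
        yield RankRow(int(hc_pos), int(hc_val), int(pr_pos),
                      float(pr_val), unreverse(host_rev))


# Two records copied from the listing above.
sample = [
    "302", "18416930", "715", "0.000039", "com.techtarget",
    "308", "18409562", "323", "0.000076", "uk.co.telegraph",
]
rows = list(parse_rows(sample))
for row in rows:
    print(row)
```

Reversing the dot-separated labels is enough to recover the usual domain spelling, since the reversed notation only changes the label order.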
Credits
Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.
We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!
The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th.
The July crawl contains page captures of 810 million URLs not contained in any crawl archive before. New URLs come from the following sources and are sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:
a random sample of 2.0 billion outlinks taken from June crawl WAT files
1.8 billion URLs mined in a breadth-first side crawl within a maximum of 6 links (“hops”), started from the homepages of the top 60 million hosts and domains and from randomly selected samples of 2 million human-readable sitemap pages (HTML format) and of 2 million URLs of pages written in 130 less-represented languages (cf. language distributions)
900 million URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds
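The hop limit of the breadth-first side crawl can be illustrated with a small sketch. This is a toy example over an in-memory link graph, not the crawler actually used; the function name `crawl_within_hops`, the graph, and the node names are made up for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List


def crawl_within_hops(links: Dict[str, List[str]],
                      seeds: Iterable[str],
                      max_hops: int) -> Dict[str, int]:
    """Breadth-first traversal that records, for every page reached,
    the number of links ("hops") separating it from the nearest seed."""
    hops: Dict[str, int] = {}
    queue = deque()
    for seed in seeds:
        if seed not in hops:
            hops[seed] = 0
            queue.append(seed)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: "home" plays the role of a seed homepage.
toy_links = {
    "home": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
}
found = crawl_within_hops(toy_links, ["home"], max_hops=2)
print(found)
```

With `max_hops=2` the traversal stops at `c`; `d` and `e` are never visited because they lie three and four links away from the seed.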
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 and HTTP paths, respectively.
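As a minimal sketch of that step in Python: the two prefixes are the ones named above, while the helper `full_paths` and the example paths are placeholders invented for illustration and do not refer to real files.

```python
import gzip
import io
from typing import Iterator

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def full_paths(gz_bytes: bytes, prefix: str) -> Iterator[str]:
    """Decompress a gzipped path listing and prepend `prefix` to every line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:
                yield prefix + path


# Placeholder listing built in memory; a real listing would be read from
# one of the provided gzipped files instead.
example_listing = gzip.compress(
    b"example/segment-1/file-0.warc.gz\n"
    b"example/segment-1/file-1.warc.gz\n"
)
http_urls = list(full_paths(example_listing, HTTP_PREFIX))
s3_urls = list(full_paths(example_listing, S3_PREFIX))
print(http_urls[0])
print(s3_urls[0])
```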
The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.
The June crawl contains page captures of 880 million URLs not contained in any crawl archive before. New URLs come from the following sources and are sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:
a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
a random sample of 2.0 billion outlinks taken from May crawl WAT files
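For readers unfamiliar with the ranking measure mentioned above: the harmonic centrality of a node is the sum of the reciprocal shortest-path distances from every other node that can reach it. The sketch below is a naive exact computation on a toy graph, meant only to show the definition. It is not the method used to produce the published ranks, and the graph and names are invented.

```python
from collections import deque
from typing import Dict, List


def harmonic_centrality(links: Dict[str, List[str]]) -> Dict[str, float]:
    """For each node x, sum 1/d(y, x) over all nodes y != x that can reach x."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Reverse the edges so a BFS from x walks towards the nodes linking to x.
    incoming: Dict[str, List[str]] = {node: [] for node in nodes}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    scores: Dict[str, float] = {}
    for node in nodes:
        dist = {node: 0}
        queue = deque([node])
        total = 0.0
        while queue:
            current = queue.popleft()
            for pred in incoming[current]:
                if pred not in dist:
                    dist[pred] = dist[current] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores[node] = total
    return scores


# Hypothetical graph: "a" and "b" link to "hub", and "c" links to "a".
toy = {"a": ["hub"], "b": ["hub"], "c": ["a"], "hub": []}
scores = harmonic_centrality(toy)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores[ranking[0]])
```

Here `hub` is reached from `a` and `b` at distance 1 and from `c` at distance 2, so its score is 1 + 1 + 0.5. Nodes that nothing links to score 0.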
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 and HTTP paths, respectively.
The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th.
The May crawl contains page captures of 825 million URLs not contained in any crawl archive before. New URLs come from the following sources and are sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:
a breadth-first side crawl within a maximum of 4 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
a random sample of 1.6 billion outlinks taken from WAT files of the April crawl
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 and HTTP paths, respectively.