November 2019 crawl archive now available

The crawl archive for November 2019 is now available! It contains 2.55 billion web pages or 250 TiB of uncompressed content, crawled between November 11th and 23rd with a short operational break on Nov 16th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.

What’s new?

We’ve added two new fields to the URL indexes (CDX and columnar):

  • The redirect target location is stored in the CDX JSON field "redirect" and in the columnar index column "fetch_redirect". The value is taken from the HTTP header field "Location" if the HTTP status code indicates an HTTP redirect. A relative URL path is converted to an absolute URL using the page URL as the base URL. The key is absent (or, in the columnar index, the column value is null) if the "Location" value is missing or is neither a valid URL nor a valid relative URL path.
  • Truncation of the WARC record payload is indicated by the CDX JSON key "truncated" and the columnar index column "content_truncated". The reason for the truncation is given only for truncated records and follows the WARC header field "WARC-Truncated".
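
The relative-to-absolute conversion described in the first bullet can be sketched with Python's standard library. This is only an illustration, not the code used by the crawler or indexer: the helper name resolve_redirect and the validity rule (an http or https scheme plus a host after resolution) are assumptions made for this example.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(page_url, location):
    """Return an absolute redirect target, or None if it cannot be resolved.

    Hypothetical helper: the validity check below is an assumption,
    not the rule applied when the index was built.
    """
    if not location or not location.strip():
        return None  # missing "Location" header: no redirect value
    try:
        # urljoin resolves a relative path against the page URL
        target = urljoin(page_url, location.strip())
        parts = urlsplit(target)
    except ValueError:
        return None  # malformed URL
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return target


print(resolve_redirect("https://example.com/a/b.html", "../c.html"))
print(resolve_redirect("https://example.com/a/b.html", None))
```

Returning None mirrors the absent key in the CDX index and the null value in the columnar index.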

Additional details and examples can be found in the corresponding PR #15.

We’ve fixed a bug affecting the capture time (WARC-Date) in the robots.txt subset: the time was extracted from the "Date" field of the HTTP header and was occasionally wrong. Please see issue #14 for further details.

Archive Location and Download

The November crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-47/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line yields the S3 or HTTP path, respectively.
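
As a minimal sketch of that step, the following Python snippet reads a gzipped path listing and adds the chosen prefix to every line. The function name full_paths and the sample path in the demo are made up for illustration; a real listing would be one of the *.paths.gz files from the table below.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing_gz_bytes, prefix):
    """Decompress a *.paths.gz listing and prefix every non-empty line."""
    with gzip.open(io.BytesIO(listing_gz_bytes), "rt", encoding="utf-8") as f:
        return [prefix + line.strip() for line in f if line.strip()]


# Demo with an in-memory listing; the path below is a placeholder,
# not an actual file name from the crawl.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2019-47/segments/EXAMPLE/warc/example.warc.gz\n"
)
for url in full_paths(sample, HTTP_PREFIX):
    print(url)
```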

 | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2019-47/segment.paths.gz | 100 |
WARC files | CC-MAIN-2019-47/warc.paths.gz | 56000 | 53.95
WAT files | CC-MAIN-2019-47/wat.paths.gz | 56000 | 18.50
WET files | CC-MAIN-2019-47/wet.paths.gz | 56000 | 8.34
Robots.txt files | CC-MAIN-2019-47/robotstxt.paths.gz | 56000 | 0.24
Non-200 responses files | CC-MAIN-2019-47/non200responses.paths.gz | 56000 | 3.05
URL index files | CC-MAIN-2019-47/cc-index.paths.gz | 302 | 0.20

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-47/. The columnar index has also been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The following improvements have been made for this webgraph release:

  • the graphs now also include edges stemming from HTTP 303 “See Other” redirects (in addition to other HTTP redirect status codes)
  • the Common Crawl robots.txt WARC files are used to get additional host-level redirects including hosts which exclude the entire content in their robots.txt
  • links from robots.txt files to sitemaps are now extracted directly from the robots.txt WARC files, see the Feb/Mar/Apr 2018 web graph announcement for more details about this type of host-level links

Host-level graph

The graph consists of 820 million nodes and 4.55 billion edges and includes dangling nodes, i.e. hosts that have not been crawled but are pointed to by a link on a crawled page. There are 752 million dangling nodes (92%), and the largest strongly connected component contains 50 million nodes (6%).

You can download the graph and the ranks of all 820 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/host/ as a prefix to access the files from anywhere.

Size | File | Description
5.29 GB | cc-main-2019-aug-sep-oct-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files
20.73 GB | cc-main-2019-aug-sep-oct-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 56 edges files
8.15 GB | cc-main-2019-aug-sep-oct-host.graph | graph in BVGraph format
2 kB | cc-main-2019-aug-sep-oct-host.properties |
10.00 GB | cc-main-2019-aug-sep-oct-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2019-aug-sep-oct-host-t.properties |
1 kB | cc-main-2019-aug-sep-oct-host.stats | WebGraph statistics
11.59 GB | cc-main-2019-aug-sep-oct-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
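
The reversal can be reproduced with a few lines of Python. This is a minimal sketch of the rule as stated above; the function name reverse_host is hypothetical, and lowercasing the input and dropping a trailing dot are assumptions added for this example.

```python
def reverse_host(host):
    """Reverse the dot-separated labels of a host name, dropping a leading 'www.'."""
    host = host.lower().rstrip(".")  # assumption: normalize case and trailing dot
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```

Because the leading www. is dropped, the original host name cannot always be recovered exactly from its reversed form.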

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
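
The aggregation idea can be illustrated with a toy Python sketch. It is not the pipeline used to build the graph: TOY_SUFFIXES is a tiny stand-in for the public suffix list (whose wildcard and exception rules are ignored here), the function names are hypothetical, and dropping links within the same domain and counting parallel links are assumptions made for the example.

```python
from collections import Counter

# Tiny stand-in for the public suffix list (assumption for this example)
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(rev_host):
    """Map a reversed host name, e.g. 'uk.co.bbc.news', to its reversed PLD."""
    labels = rev_host.split(".")
    best = 0  # number of labels covered by the longest matching suffix
    for n in range(1, len(labels) + 1):
        suffix = ".".join(reversed(labels[:n]))
        if suffix in TOY_SUFFIXES:
            best = n
    if best == 0 or best == len(labels):
        return None  # unknown suffix, or the host is itself a public suffix
    return ".".join(labels[:best + 1])  # suffix plus one more label


def aggregate(host_edges):
    """Collapse host-level edges into counted domain-level edges."""
    counts = Counter()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s and d and s != d:  # assumption: skip links inside one domain
            counts[(s, d)] += 1
    return counts


edges = [
    ("com.example.a", "org.wikipedia.en"),
    ("com.example.b", "org.wikipedia.de"),
    ("com.example.a", "com.example.b"),
]
print(aggregate(edges))
```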

The domain-level graph has 92.7 million nodes and 2.4 billion edges. 48 million nodes (52%) are dangling nodes; the largest strongly connected component covers 36 million nodes (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-aug-sep-oct/domain/.

Download files of the Common Crawl Aug/Sep/Oct 2019 domain-level webgraph

Size | File | Description
0.64 GB | cc-main-2019-aug-sep-oct-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
9.06 GB | cc-main-2019-aug-sep-oct-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.64 GB | cc-main-2019-aug-sep-oct-domain.graph | graph in BVGraph format
2 kB | cc-main-2019-aug-sep-oct-domain.properties |
4.82 GB | cc-main-2019-aug-sep-oct-domain-t.graph | transpose of the graph
2 kB | cc-main-2019-aug-sep-oct-domain-t.properties |
1 kB | cc-main-2019-aug-sep-oct-domain.stats | WebGraph statistics
1.97 GB | cc-main-2019-aug-sep-oct-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 92 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Aug/Sept/Oct 2019)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 32193222 | 1 | 0.020989 | com.googleapis
2 | 29939490 | 3 | 0.012691 | com.facebook
3 | 29202762 | 2 | 0.012925 | com.google
4 | 26823584 | 4 | 0.007369 | com.twitter
5 | 26304804 | 5 | 0.006660 | org.w
6 | 26023106 | 6 | 0.006435 | com.youtube
7 | 24330930 | 9 | 0.003914 | com.instagram
8 | 23954698 | 7 | 0.004993 | org.gmpg
9 | 23537216 | 8 | 0.004863 | com.googletagmanager
10 | 23456362 | 13 | 0.002913 | com.linkedin
11 | 22601398 | 12 | 0.003086 | org.wordpress
12 | 22511672 | 10 | 0.003602 | com.cloudflare
13 | 22484028 | 22 | 0.001698 | com.gravatar
14 | 22366260 | 23 | 0.001509 | com.pinterest
15 | 22337722 | 19 | 0.002143 | com.wordpress
16 | 22168804 | 14 | 0.002422 | com.bootstrapcdn
17 | 22152656 | 32 | 0.001134 | org.wikipedia
18 | 21939876 | 20 | 0.001777 | com.apple
19 | 21666694 | 42 | 0.000842 | com.blogspot
20 | 21595338 | 21 | 0.001736 | com.jquery
21 | 21576018 | 49 | 0.000713 | be.youtu
22 | 21574068 | 35 | 0.001064 | com.vimeo
23 | 21525328 | 30 | 0.001154 | com.microsoft
24 | 21517514 | 18 | 0.002164 | com.gstatic
25 | 21444512 | 17 | 0.002176 | com.adobe
26 | 21427572 | 39 | 0.000964 | com.amazonaws
27 | 21426674 | 50 | 0.000695 | com.wp
28 | 21393434 | 51 | 0.000681 | com.amazon
29 | 21314258 | 65 | 0.000516 | com.tumblr
30 | 21291646 | 46 | 0.000767 | gl.goo
31 | 21256658 | 25 | 0.001309 | com.macromedia
32 | 21253410 | 29 | 0.001173 | com.baidu
33 | 21136800 | 67 | 0.000501 | ly.bit
34 | 21087412 | 27 | 0.001200 | com.flickr
35 | 21083642 | 24 | 0.001391 | com.github
36 | 21068510 | 89 | 0.000381 | com.yahoo
37 | 21063868 | 41 | 0.000928 | com.google-analytics
38 | 21063204 | 31 | 0.001139 | com.googlesyndication
39 | 21053852 | 57 | 0.000608 | eu.europa
40 | 21051522 | 61 | 0.000541 | org.mozilla
41 | 20997812 | 106 | 0.000300 | com.reddit
42 | 20934970 | 37 | 0.001002 | net.cloudfront
43 | 20930738 | 28 | 0.001184 | ru.yandex
44 | 20907836 | 38 | 0.000964 | com.addthis
45 | 20875048 | 48 | 0.000734 | co.t
46 | 20863874 | 47 | 0.000744 | net.doubleclick
47 | 20860222 | 70 | 0.000482 | org.w3
48 | 20822794 | 98 | 0.000329 | com.googleusercontent
49 | 20819594 | 43 | 0.000814 | com.squarespace
50 | 20815248 | 75 | 0.000462 | com.medium
51 | 20812946 | 91 | 0.000376 | org.creativecommons
52 | 20811076 | 175 | 0.000140 | org.wikimedia
53 | 20788398 | 86 | 0.000417 | com.weebly
54 | 20786242 | 63 | 0.000534 | me.wp
55 | 20764144 | 129 | 0.000221 | com.nytimes
56 | 20754576 | 88 | 0.000400 | io.github
57 | 20744406 | 55 | 0.000625 | com.paypal
58 | 20731050 | 165 | 0.000148 | uk.co.bbc
59 | 20729776 | 58 | 0.000557 | net.jsdelivr
60 | 20723608 | 108 | 0.000297 | com.soundcloud
61 | 20720614 | 172 | 0.000141 | com.imgur
62 | 20696474 | 130 | 0.000210 | com.dropbox
63 | 20676226 | 137 | 0.000181 | com.forbes
64 | 20641078 | 173 | 0.000141 | net.slideshare
65 | 20640886 | 54 | 0.000634 | org.schema
66 | 20637016 | 153 | 0.000163 | com.theguardian
67 | 20619410 | 187 | 0.000136 | com.cnn
68 | 20614482 | 204 | 0.000118 | com.businessinsider
69 | 20589376 | 217 | 0.000109 | com.wsj
70 | 20575384 | 281 | 0.000086 | edu.harvard
71 | 20572806 | 167 | 0.000147 | com.bing
72 | 20571044 | 241 | 0.000098 | com.techcrunch
73 | 20567292 | 290 | 0.000084 | edu.mit
74 | 20557144 | 285 | 0.000084 | com.reuters
75 | 20554074 | 375 | 0.000067 | com.msn
76 | 20549852 | 329 | 0.000075 | com.cnet
77 | 20542104 | 140 | 0.000178 | org.archive
78 | 20538104 | 250 | 0.000094 | com.bloomberg
79 | 20522694 | 33 | 0.001120 | com.fontawesome
80 | 20521374 | 141 | 0.000175 | gov.nih
81 | 20513408 | 93 | 0.000355 | com.shopify
82 | 20513140 | 271 | 0.000089 | com.myspace
83 | 20507048 | 207 | 0.000116 | edu.stanford
84 | 20496152 | 53 | 0.000647 | com.wix
85 | 20489102 | 200 | 0.000120 | com.stackoverflow
86 | 20487938 | 434 | 0.000057 | com.googleblog
87 | 20484632 | 154 | 0.000163 | org.apache
88 | 20478194 | 229 | 0.000102 | com.oracle
89 | 20475392 | 214 | 0.000110 | com.washingtonpost
90 | 20472386 | 260 | 0.000091 | com.android
91 | 20469298 | 267 | 0.000090 | com.bbc
92 | 20466440 | 194 | 0.000123 | org.ietf
93 | 20438542 | 310 | 0.000079 | com.time
94 | 20436352 | 298 | 0.000081 | uk.co.telegraph
95 | 20417976 | 369 | 0.000067 | com.ted
96 | 20415662 | 372 | 0.000067 | gov.nasa
97 | 20412316 | 368 | 0.000067 | com.githubusercontent
98 | 20409888 | 185 | 0.000136 | com.npmjs
99 | 20401604 | 394 | 0.000063 | com.quora
100 | 20396146 | 601 | 0.000042 | com.thenextweb
101 | 20395674 | 161 | 0.000156 | com.giphy
102 | 20388778 | 726 | 0.000037 | com.wikia
103 | 20380728 | 343 | 0.000072 | uk.co.dailymail
104 | 20379652 | 294 | 0.000082 | com.usatoday
105 | 20378396 | 371 | 0.000067 | com.latimes
106 | 20370212 | 713 | 0.000037 | org.chromium
107 | 20369700 | 306 | 0.000079 | org.un
108 | 20368148 | 144 | 0.000174 | com.wixsite
109 | 20366610 | 493 | 0.000050 | com.economist
110 | 20361312 | 26 | 0.001226 | com.qq
111 | 20343426 | 268 | 0.000090 | com.appspot
112 | 20339266 | 480 | 0.000052 | com.pixabay
113 | 20337398 | 491 | 0.000050 | com.zdnet
114 | 20328308 | 315 | 0.000079 | com.example
115 | 20325422 | 358 | 0.000070 | com.livejournal
116 | 20322334 | 380 | 0.000066 | com.mashable
117 | 20308200 | 302 | 0.000080 | com.cnbc
118 | 20308066 | 253 | 0.000093 | org.ampproject
119 | 20306984 | 442 | 0.000056 | com.nationalgeographic
120 | 20293426 | 505 | 0.000049 | com.venturebeat
121 | 20292380 | 404 | 0.000062 | com.dailymotion
122 | 20285502 | 139 | 0.000178 | com.twimg
123 | 20284164 | 476 | 0.000052 | org.bitbucket
124 | 20282368 | 547 | 0.000046 | com.pexels
125 | 20280714 | 327 | 0.000075 | com.springer
126 | 20279992 | 218 | 0.000108 | com.huffingtonpost
127 | 20279190 | 94 | 0.000355 | com.whatsapp
128 | 20277928 | 459 | 0.000054 | com.cisco
129 | 20268416 | 146 | 0.000170 | com.blogger
130 | 20267684 | 123 | 0.000234 | com.ytimg
131 | 20264730 | 413 | 0.000061 | com.fortune
132 | 20263014 | 641 | 0.000040 | uk.ac.ox
133 | 20262258 | 231 | 0.000100 | com.getbootstrap
134 | 20261648 | 847 | 0.000035 | org.cambridge
135 | 20261268 | 629 | 0.000040 | org.weforum
136 | 20250854 | 197 | 0.000123 | com.typepad
137 | 20250698 | 279 | 0.000086 | com.sciencedirect
138 | 20250162 | 512 | 0.000048 | com.about
139 | 20247192 | 286 | 0.000084 | com.wired
140 | 20240130 | 317 | 0.000078 | com.skype
141 | 20235202 | 558 | 0.000045 | org.worldbank
142 | 20230192 | 134 | 0.000186 | com.issuu
143 | 20225004 | 504 | 0.000049 | com.mysql
144 | 20220996 | 650 | 0.000039 | org.sciencemag
145 | 20220972 | 531 | 0.000047 | org.arxiv
146 | 20218296 | 624 | 0.000041 | uk.co.guardian
147 | 20216194 | 407 | 0.000062 | com.nature
148 | 20214012 | 127 | 0.000226 | com.unpkg
149 | 20213638 | 143 | 0.000175 | com.spotify
150 | 20195500 | 824 | 0.000036 | com.playstation
151 | 20195352 | 177 | 0.000139 | uk.co.google
152 | 20195260 | 440 | 0.000057 | gov.noaa
153 | 20193574 | 323 | 0.000077 | com.staticflickr
154 | 20193512 | 366 | 0.000068 | com.gmail
155 | 20191934 | 1037 | 0.000028 | org.eclipse
156 | 20191832 | 395 | 0.000063 | net.researchgate
157 | 20185934 | 342 | 0.000072 | com.fc2
158 | 20179194 | 603 | 0.000042 | org.ieee
159 | 20177140 | 132 | 0.000201 | com.zendesk
160 | 20177108 | 383 | 0.000065 | com.theatlantic
161 | 20173850 | 590 | 0.000043 | com.git-scm
162 | 20173722 | 182 | 0.000136 | me.t
163 | 20169446 | 282 | 0.000085 | com.googlecode
164 | 20167964 | 212 | 0.000113 | net.behance
165 | 20166960 | 364 | 0.000068 | com.w3schools
166 | 20165408 | 657 | 0.000039 | com.stackexchange
167 | 20147566 | 128 | 0.000222 | com.youtube-nocookie
168 | 20144266 | 430 | 0.000058 | com.buzzfeed
169 | 20143168 | 573 | 0.000043 | br.com.uol
170 | 20141222 | 828 | 0.000036 | ca.blogspot
171 | 20138528 | 592 | 0.000042 | com.evernote
172 | 20137536 | 854 | 0.000034 | com.scientificamerican
173 | 20123000 | 227 | 0.000102 | com.dribbble
174 | 20122966 | 495 | 0.000049 | com.vice
175 | 20119812 | 180 | 0.000137 | com.feedburner
176 | 20118786 | 574 | 0.000043 | net.azurewebsites
177 | 20113370 | 536 | 0.000046 | com.alexa
178 | 20110780 | 418 | 0.000059 | com.outlook
179 | 20103382 | 424 | 0.000059 | com.gitlab
180 | 20092588 | 422 | 0.000059 | me.about
181 | 20092232 | 409 | 0.000061 | com.goodreads
182 | 20091842 | 1102 | 0.000026 | com.nvidia
183 | 20082450 | 419 | 0.000059 | com.mozilla
184 | 20078524 | 447 | 0.000056 | com.entrepreneur
185 | 20073740 | 236 | 0.000099 | com.ft
186 | 20071534 | 452 | 0.000055 | com.wikihow
187 | 20066124 | 245 | 0.000096 | com.disqus
188 | 20064942 | 1092 | 0.000026 | com.jetbrains
189 | 20063756 | 1327 | 0.000023 | org.phys
190 | 20062066 | 602 | 0.000042 | org.greenpeace
191 | 20061474 | 386 | 0.000065 | org.hbr
192 | 20059468 | 178 | 0.000139 | com.salesforce
193 | 20058532 | 537 | 0.000046 | com.adage
194 | 20056012 | 300 | 0.000080 | org.doi
195 | 20055914 | 1106 | 0.000026 | org.ap
196 | 20054068 | 860 | 0.000034 | com.500px
197 | 20051824 | 488 | 0.000051 | gov.loc
198 | 20051342 | 957 | 0.000030 | com.sap
199 | 20050500 | 626 | 0.000041 | com.marketwatch
200 | 20049824 | 1265 | 0.000024 | com.siemens
201 | 20049584 | 1173 | 0.000025 | ca.utoronto
202 | 20049300 | 428 | 0.000058 | uk.co.independent
203 | 20048034 | 222 | 0.000104 | com.hubspot
204 | 20045788 | 593 | 0.000042 | com.slate
205 | 20042018 | 349 | 0.000071 | gg.discord
206 | 20024956 | 1435 | 0.000021 | com.hackernoon
207 | 20022096 | 487 | 0.000051 | uk.co.blogspot
208 | 20012130 | 1451 | 0.000021 | org.tensorflow
209 | 20007682 | 401 | 0.000062 | com.indiatimes
210 | 20007486 | 1035 | 0.000028 | org.kernel
211 | 20001698 | 530 | 0.000047 | com.trello
212 | 19999034 | 666 | 0.000038 | com.searchengineland
213 | 19997084 | 1009 | 0.000029 | com.unity3d
214 | 19996940 | 473 | 0.000052 | com.computerworld
215 | 19996232 | 549 | 0.000045 | com.withgoogle
216 | 19993078 | 1369 | 0.000022 | edu.osu
217 | 19991880 | 949 | 0.000030 | edu.si
218 | 19990236 | 612 | 0.000041 | au.net.abc
219 | 19988088 | 1428 | 0.000021 | com.lego
220 | 19987532 | 287 | 0.000084 | com.nbcnews
221 | 19977482 | 1356 | 0.000022 | com.angelfire
222 | 19976080 | 499 | 0.000049 | com.moz
223 | 19975358 | 199 | 0.000122 | net.sourceforge
224 | 19969236 | 667 | 0.000038 | co.ibb
225 | 19968114 | 1618 | 0.000019 | org.edx
226 | 19967072 | 515 | 0.000048 | com.box
227 | 19961458 | 986 | 0.000029 | com.huffpost
228 | 19961370 | 598 | 0.000042 | gov.state
229 | 19956418 | 1563 | 0.000019 | blog.home
230 | 19955608 | 1678 | 0.000018 | com.oregonlive
231 | 19954284 | 631 | 0.000040 | com.pinimg
232 | 19953180 | 863 | 0.000034 | gov.usgs
233 | 19949892 | 2048 | 0.000016 | com.sputniknews
234 | 19948950 | 1047 | 0.000027 | co.elastic
235 | 19947460 | 1196 | 0.000025 | edu.rutgers
236 | 19946614 | 211 | 0.000115 | com.optimizely
237 | 19945418 | 1409 | 0.000021 | org.maven
238 | 19942668 | 1373 | 0.000022 | net.seesaa
239 | 19939512 | 237 | 0.000099 | com.aliyuncs
240 | 19939300 | 291 | 0.000083 | com.tinyurl
241 | 19939182 | 188 | 0.000134 | com.eepurl
242 | 19938152 | 224 | 0.000103 | com.wpengine
243 | 19936538 | 2235 | 0.000014 | com.slides
244 | 19935586 | 659 | 0.000039 | com.sciencedaily
245 | 19933262 | 136 | 0.000183 | com.addtoany
246 | 19933088 | 946 | 0.000031 | com.storify
247 | 19932194 | 142 | 0.000175 | com.yimg
248 | 19927032 | 354 | 0.000070 | com.getpocket
249 | 19925642 | 715 | 0.000037 | com.vox
250 | 19922530 | 60 | 0.000546 | com.vk
251 | 19920994 | 171 | 0.000142 | org.allaboutcookies
252 | 19919990 | 1165 | 0.000025 | com.vogue
253 | 19918364 | 335 | 0.000074 | com.wufoo
254 | 19914676 | 1282 | 0.000023 | ms.1drv
255 | 19906484 | 1481 | 0.000020 | io.itch
256 | 19906312 | 834 | 0.000035 | com.techtarget
257 | 19905162 | 600 | 0.000042 | org.change
258 | 19901530 | 597 | 0.000042 | com.uk
259 | 19901258 | 421 | 0.000059 | com.squareup
260 | 19897576 | 1408 | 0.000021 | com.itv
261 | 19896802 | 954 | 0.000030 | com.thehill
262 | 19896772 | 1291 | 0.000023 | com.scmp
263 | 19894514 | 1777 | 0.000017 | com.diigo
264 | 19893192 | 316 | 0.000079 | es.google
265 | 19890244 | 651 | 0.000039 | com.lifehacker
266 | 19888786 | 671 | 0.000038 | gov.fcc
267 | 19886980 | 739 | 0.000037 | com.chicagotribune
268 | 19886180 | 2309 | 0.000014 | com.pearltrees
269 | 19885516 | 1554 | 0.000019 | org.unep
270 | 19881960 | 313 | 0.000079 | net.windows
271 | 19881842 | 248 | 0.000094 | ru.rambler
272 | 19880642 | 506 | 0.000049 | us.icio
273 | 19877580 | 92 | 0.000358 | com.weibo
274 | 19876556 | 109 | 0.000290 | com.paypalobjects
275 | 19874826 | 891 | 0.000033 | com.strikingly
276 | 19873598 | 1178 | 0.000025 | com.netlify
277 | 19867654 | 456 | 0.000055 | gov.epa
278 | 19866350 | 292 | 0.000083 | com.criteo
279 | 19864080 | 714 | 0.000037 | org.pewresearch
280 | 19861136 | 533 | 0.000047 | org.plos
281 | 19860954 | 1225 | 0.000024 | com.newscientist
282 | 19860836 | 849 | 0.000035 | uk.co.mirror
283 | 19860700 | 1010 | 0.000029 | com.mediafire
284 | 19860298 | 1072 | 0.000027 | com.sky
285 | 19859946 | 928 | 0.000031 | com.buffer
286 | 19858910 | 1228 | 0.000024 | com.aljazeera
287 | 19858168 | 1339 | 0.000022 | it.scoop
288 | 19858040 | 209 | 0.000116 | org.iana
289 | 19857260 | 2070 | 0.000016 | com.coca-colacompany
290 | 19856912 | 683 | 0.000038 | com.flipboard
291 | 19853900 | 1801 | 0.000017 | jp.ac.u-tokyo
292 | 19853116 | 1018 | 0.000028 | uk.co.metro
293 | 19851054 | 309 | 0.000079 | com.ibm
294 | 19846968 | 322 | 0.000077 | com.go
295 | 19846838 | 1552 | 0.000019 | uk.bl
296 | 19841556 | 1264 | 0.000024 | com.nikkei
297 | 19840090 | 52 | 0.000667 | com.fb
298 | 19839844 | 2506 | 0.000013 | it.unimi
299 | 19836858 | 1595 | 0.000019 | com.googlesource
300 | 19834504 | 474 | 0.000052 | com.udacity
301 | 19834024 | 835 | 0.000035 | uk.co.thetimes
302 | 19832262 | 168 | 0.000144 | com.imdb
303 | 19831660 | 843 | 0.000035 | gov.congress
304 | 19828142 | 668 | 0.000038 | org.fao
305 | 19826656 | 1191 | 0.000025 | org.acs
306 | 19825238 | 1728 | 0.000018 | com.toptal
307 | 19824736 | 1065 | 0.000027 | edu.duke
308 | 19823982 | 621 | 0.000041 | site.business
309 | 19820920 | 1133 | 0.000026 | com.trendmicro
310 | 19817822 | 955 | 0.000030 | com.theconversation
311 | 19814258 | 983 | 0.000029 | co.g
312 | 19813034 | 851 | 0.000034 | com.bmj
313 | 19812202 | 170 | 0.000143 | com.amazon-adsystem
314 | 19808398 | 1045 | 0.000027 | com.searchenginewatch
315 | 19806128 | 1376 | 0.000022 | edu.gatech
316 | 19803474 | 2207 | 0.000015 | com.viki
317 | 19803388 | 1135 | 0.000026 | edu.brookings
318 | 19803178 | 971 | 0.000030 | com.reverbnation
319 | 19798960 | 1069 | 0.000027 | au.com.smh
320 | 19797938 | 44 | 0.000797 | com.googleadservices
321 | 19796164 | 475 | 0.000052 | org.freecodecamp
322 | 19792806 | 658 | 0.000039 | br.com.google
323 | 19791896 | 1766 | 0.000017 | jp.co.japantimes
324 | 19791234 | 400 | 0.000063 | me.telegram
325 | 19790208 | 1332 | 0.000022 | com.msnbc
326 | 19789672 | 1915 | 0.000016 | org.wikibooks
327 | 19789356 | 1296 | 0.000023 | com.dw
328 | 19787622 | 1366 | 0.000022 | com.hostgator
329 | 19784154 | 477 | 0.000052 | com.theverge
330 | 19780916 | 1574 | 0.000019 | com.bankofamerica
331 | 19776986 | 994 | 0.000029 | com.yoast
332 | 19775742 | 997 | 0.000029 | com.socialmediaexaminer
333 | 19774146 | 841 | 0.000035 | org.apa
334 | 19772748 | 426 | 0.000058 | com.elsevier
335 | 19771404 | 458 | 0.000055 | com.bigcartel
336 | 19770190 | 2240 | 0.000014 | com.kinja
337 | 19770024 | 1701 | 0.000018 | com.mediaplex
338 | 19769080 | 1058 | 0.000027 | uk.co.huffingtonpost
339 | 19766820 | 1682 | 0.000018 | org.bitcoin
340 | 19765668 | 1430 | 0.000021 | com.grammarly
341 | 19765220 | 2071 | 0.000016 | com.mathworks
342 | 19764662 | 1253 | 0.000024 | com.livescience
343 | 19764202 | 249 | 0.000094 | com.live
344 | 19763516 | 2265 | 0.000014 | org.biorxiv
345 | 19762024 | 1794 | 0.000017 | com.makeuseof
346 | 19760700 | 942 | 0.000031 | com.econsultancy
347 | 19759296 | 518 | 0.000047 | com.bigcommerce
348 | 19759088 | 953 | 0.000030 | com.searchenginejournal
349 | 19757028 | 62 | 0.000537 | net.akamaihd
350 | 19755866 | 1764 | 0.000017 | com.colourlovers
351 | 19751232 | 314 | 0.000079 | com.rackcdn
352 | 19749162 | 1834 | 0.000017 | com.sas
353 | 19746962 | 223 | 0.000104 | org.gnu
354 | 19742390 | 2490 | 0.000013 | com.itsnicethat
355 | 19741694 | 2296 | 0.000014 | uk.ac.sussex
356 | 19739212 | 820 | 0.000036 | com.neilpatel
357 | 19738554 | 162 | 0.000156 | com.opera
358 | 19738540 | 951 | 0.000030 | com.gumroad
359 | 19733434 | 868 | 0.000034 | com.business2community
360 | 19731138 | 909 | 0.000032 | uk.co.pinterest
361 | 19730570 | 617 | 0.000041 | uk.parliament
362 | 19729560 | 898 | 0.000032 | com.ecwid
363 | 19729024 | 526 | 0.000047 | me.m
364 | 19728302 | 1186 | 0.000025 | com.thelancet
365 | 19727476 | 1677 | 0.000018 | uk.co.timesonline
366 | 19725568 | 1662 | 0.000018 | edu.iastate
367 | 19720948 | 890 | 0.000033 | com.thedrum
368 | 19718200 | 1234 | 0.000024 | com.seattletimes
369 | 19716772 | 116 | 0.000258 | com.jimdo
370 | 19715158 | 1748 | 0.000018 | org.rsc
371 | 19713318 | 318 | 0.000078 | me.wa
372 | 19713052 | 2312 | 0.000014 | io.soup
373 | 19712174 | 240 | 0.000098 | net.php
374 | 19710524 | 996 | 0.000029 | com.healthline
375 | 19706662 | 103 | 0.000317 | net.facebook
376 | 19700662 | 389 | 0.000064 | com.meetup
377 | 19698168 | 1397 | 0.000021 | int.unfccc
378 | 19697842 | 2364 | 0.000014 | com.autoblog
379 | 19697184 | 1147 | 0.000026 | uk.co.ebay
380 | 19696284 | 1512 | 0.000020 | com.channel4
381 | 19696102 | 345 | 0.000072 | int.who
382 | 19695842 | 856 | 0.000034 | com.photoshelter
383 | 19693426 | 297 | 0.000081 | org.python
384 | 19693168 | 2103 | 0.000016 | edu.miami
385 | 19693108 | 2445 | 0.000013 | com.mysanantonio
386 | 19693052 | 1314 | 0.000023 | com.bustle
387 | 19693004 | 2416 | 0.000013 | com.smore
388 | 19690872 | 1244 | 0.000024 | uk.co.express
389 | 19689496 | 1882 | 0.000016 | com.smashwords
390 | 19689346 | 1454 | 0.000021 | com.gawker
391 | 19689266 | 1492 | 0.000020 | org.hrc
392 | 19688570 | 1378 | 0.000022 | uk.gov.blog
393 | 19688210 | 266 | 0.000090 | com.rawgit
394 | 19685386 | 251 | 0.000094 | uk.org.ico
395 | 19684372 | 2229 | 0.000015 | org.vim
396 | 19683694 | 2148 | 0.000015 | uk.ac.york
397 | 19683048 | 1902 | 0.000016 | com.discovermagazine
398 | 19682466 | 2017 | 0.000016 | com.dummies
399 | 19682270 | 2811 | 0.000012 | com.iht
400 | 19678702 | 1498 | 0.000020 | fr.lesechos
401 | 19677190 | 1643 | 0.000019 | org.amnesty
402 | 19677184 | 1087 | 0.000026 | org.aarp
403 | 19675912 | 836 | 0.000035 | uk.gov.legislation
404 | 19675666 | 1582 | 0.000019 | com.pbworks
405 | 19675228 | 1197 | 0.000025 | com.cio
406 | 19675036 | 1541 | 0.000020 | com.googlegroups
407 | 19673696 | 888 | 0.000033 | uk.gov.nationalarchives
408 | 19671786 | 489 | 0.000051 | com.nwsource
409 | 19669190 | 1344 | 0.000022 | com.thestar
410 | 19668328 | 1993 | 0.000016 | com.treehugger
411 | 19668276 | 1602 | 0.000019 | com.brainyquote
412 | 19667868 | 513 | 0.000048 | com.livechatinc
413 | 19667262 | 1195 | 0.000025 | org.heart
414 | 19666046 | 259 | 0.000091 | com.unsplash
415 | 19665938 | 1475 | 0.000020 | ie.independent
416 | 19665662 | 2444 | 0.000013 | org.sciencenews
417 | 19664008 | 1478 | 0.000020 | fi.google
418 | 19662896 | 1201 | 0.000025 | uk.co.standard
419 | 19662404 | 163 | 0.000156 | com.eventbrite
420 | 19661850 | 1997 | 0.000016 | com.timesofisrael
421 | 19661340 | 1304 | 0.000023 | com.surveygizmo
422 | 19659778 | 1245 | 0.000024 | org.ohchr
423 | 19656716 | 1989 | 0.000016 | com.nationalreview
424 | 19654260 | 2022 | 0.000016 | com.gucci
425 | 19653254 | 605 | 0.000041 | org.mediawiki
426 | 19651234 | 972 | 0.000029 | com.wordstream
427 | 19651102 | 1584 | 0.000019 | com.netvibes
428 | 19649566 | 1976 | 0.000016 | org.bitcointalk
429 | 19648228 | 2372 | 0.000014 | com.deepmind
430 | 19648124 | 1773 | 0.000017 | org.iucn
431 | 19647904 | 1496 | 0.000020 | com.startribune
432 | 19646024 | 293 | 0.000082 | com.ebay
433 | 19639388 | 1355 | 0.000022 | com.convinceandconvert
434 | 19637100 | 522 | 0.000047 | edu.yale
435 | 19636614 | 384 | 0.000065 | com.kickstarter
436 | 19635776 | 100 | 0.000321 | com.godaddy
437 | 19634912 | 2157 | 0.000015 | com.instapaper
438 | 19633830 | 1767 | 0.000017 | uk.co.ibtimes
439 | 19631378 | 1261 | 0.000024 | com.imageshack
440 | 19630146 | 110 | 0.000284 | com.mailchimp
441 | 19627008 | 2887 | 0.000011 | net.openreview
442 | 19626924 | 481 | 0.000052 | gov.whitehouse
443 | 19626884 | 1301 | 0.000023 | ch.ipcc
444 | 19625858 | 959 | 0.000030 | com.bandsintown
445 | 19625598 | 388 | 0.000064 | com.office
446 | 19624032 | 2039 | 0.000016 | edu.udel
447 | 19623636 | 1818 | 0.000017 | uk.ac.kcl
448 | 19619866 | 988 | 0.000029 | org.ilo
449 | 19618636 | 1880 | 0.000016 | tl.we
450 | 19618128 | 2092 | 0.000016 | io.gitlab
451 | 19616698 | 1975 | 0.000016 | com.digitaljournal
452 | 19615278 | 84 | 0.000440 | com.list-manage
453 | 19614194 | 15 | 0.002224 | com.wixstatic
454 | 19611928 | 1791 | 0.000017 | com.secondlife
455 | 19604998 | 1171 | 0.000025 | uk.gov.tfl
456 | 19603646 | 1994 | 0.000016 | org.peta
457 | 19602880 | 1252 | 0.000024 | com.medicalnewstoday
458 | 19601844 | 1744 | 0.000018 | com.teenvogue
459 | 19601126 | 45 | 0.000773 | net.fbcdn
460 | 19600768 | 1813 | 0.000017 | com.upi
461 | 19600410 | 205 | 0.000117 | com.etsy
462 | 19598800 | 1577 | 0.000019 | no.google
463 | 19597706 | 2097 | 0.000016 | com.shell
464 | 19596732 | 1535 | 0.000020 | com.quicksprout
465 | 19596622 | 406 | 0.000062 | com.fastcompany
466 | 19596226 | 1324 | 0.000023 | org.hrw
467 | 19596164 | 559 | 0.000045 | edu.berkeley
468 | 19595736 | 826 | 0.000036 | com.intel
469 | 19593408 | 1911 | 0.000016 | com.tomsguide
470 | 19592762 | 1655 | 0.000018 | ca.pinterest
471 | 19591462 | 365 | 0.000068 | com.hp
472 | 19590312 | 649 | 0.000039 | org.nodejs
473 | 19589296 | 2135 | 0.000015 | com.politifact
474 | 19588516 | 2400 | 0.000013 | com.towardsdatascience
475 | 19588356 | 2292 | 0.000014 | com.dailykos
476 | 19588058 | 1749 | 0.000018 | com.oprah
477 | 19585238 | 3039 | 0.000011 | org.arkive
478 | 19584732 | 859 | 0.000034 | com.engadget
479 | 19584238 | 1740 | 0.000018 | com.shareholder
480 | 19584228 | 967 | 0.000030 | ly.snip
481 | 19577646 | 1359 | 0.000022 | com.smallbiztrends
482 | 19577604 | 2384 | 0.000014 | com.hsbc
483 | 19577414 | 104 | 0.000312 | com.statcounter
484 | 19577334 | 566 | 0.000044 | com.photobucket
485 | 19576468 | 2161 | 0.000015 | org.jenkins-ci
486 | 19574024 | 1017 | 0.000028 | com.contentmarketinginstitute
487 | 19569238 | 2447 | 0.000013 | uk.co.spectator
488 | 19567958 | 1966 | 0.000016 | com.thecut
489 | 19567398 | 2655 | 0.000012 | uk.ac.mmu
490 | 19563030 | 1458 | 0.000021 | net.convio
491 | 19562626 | 1897 | 0.000016 | org.project-syndicate
492 | 19562602 | 857 | 0.000034 | com.deviantart
493 | 19562312 | 1658 | 0.000018 | google.ai
494 | 19560912 | 1921 | 0.000016 | com.ogilvy
495 | 19560528 | 1775 | 0.000017 | com.csoonline
496 | 19559434 | 990 | 0.000029 | com.cognitoforms
497 | 19558398 | 2029 | 0.000016 | link.page
498 | 19557452 | 2224 | 0.000015 | com.upworthy
499 | 19555356 | 1670 | 0.000018 | com.kinsta
500 | 19551574 | 393 | 0.000063 | com.getclicky
501 | 19548794 | 1907 | 0.000016 | ms.nyti
502 | 19548294 | 1951 | 0.000016 | uk.ac.leeds
503 | 19546822 | 1247 | 0.000024 | st.po
504 | 19546690 | 359 | 0.000069 | com.mapbox
505 | 19545958 | 2341 | 0.000014 | com.sciencealert
506 | 19545120 | 2387 | 0.000013 | com.instructure
507 | 19543894 | 2343 | 0.000014 | org.theiet
508 | 19543292 | 2620 | 0.000012 | com.ksl
509 | 19540054 | 2168 | 0.000015 | com.webbyawards
510 | 19537886 | 2852 | 0.000011 | com.brandyourself
511 | 19535564 | 2735 | 0.000012 | jp.hatenablog
512 | 19534552 | 2741 | 0.000012 | com.zynga
513 | 19533780 | 382 | 0.000066 | org.acm
514 | 19532322 | 1841 | 0.000017 | com.cmswire
515 | 19531950 | 431 | 0.000058 | io.codepen
516 | 19531032 | 1343 | 0.000022 | org.pocoo
517 | 19530112 | 2931 | 0.000011 | uk.co.autocar
518 | 19529900 | 160 | 0.000158 | com.tripadvisor
519 | 19529372 | 234 | 0.000099 | org.drupal
520 | 19528028 | 991 | 0.000029 | com.gizmodo
521 | 19525144 | 2317 | 0.000014 | org.aei
522 | 19524148 | 864 | 0.000034 | com.matterport
523 | 19523142 | 1871 | 0.000017 | uk.co.thesundaytimes
524 | 19521230 | 1041 | 0.000027 | com.tinypic
525 | 19520944 | 813 | 0.000036 | com.netflix
526 | 19520420 | 2439 | 0.000013 | com.newatlas
527 | 19518764 | 2410 | 0.000013 | com.triplepundit
528 | 19518666 | 381 | 0.000066 | com.booking
529 | 19518320 | 2978 | 0.000011 | fr.hellocoton
530 | 19517366 | 2201 | 0.000015 | org.unfpa
531 | 19516300 | 1603 | 0.000019 | pt.google
532 | 19514002 | 1715 | 0.000018 | net.openid
533 | 19511332 | 3080 | 0.000011 | com.blogsky
534 | 19511240 | 1763 | 0.000017 | com.bloglines
535 | 19508014 | 272 | 0.000089 | com.adnxs
536 | 19507232 | 2106 | 0.000015 | org.royalsociety
537 | 19506586 | 2659 | 0.000012 | com.asiaone
538 | 19504284 | 2308 | 0.000014 | com.waterstones
539 | 19503858 | 2342 | 0.000014 | com.financialexpress
540 | 19503212 | 1639 | 0.000019 | uk.org.nationaltrust
541 | 19502772 | 1646 | 0.000019 | org.pypi
542 | 19501202 | 899 | 0.000032 | com.highcharts
543 | 19500790 | 1889 | 0.000016 | org.panda
544 | 19500702 | 2898 | 0.000011 | org.ifaw
545 | 19500700 | 1828 | 0.000017 | org.thinkprogress
546 | 19499724 | 901 | 0.000032 | com.arstechnica
547 | 19498236 | 2203 | 0.000015 | com.kaggle
548 | 19497656 | 1948 | 0.000016 | org.wri
549 | 19494804 | 2693 | 0.000012 | co.electrek
550 | 19493786 | 2306 | 0.000014 | uk.org.wwf
551 | 19493426 | 2436 | 0.000013 | com.mongabay
552 | 19493282 | 3319 | 0.000010 | com.carscoops
553 | 19492162 | 1082 | 0.000027 | com.mixpanel
554 | 19486550 | 1502 | 0.000020 | io.fabric
555 | 19486258 | 1269 | 0.000023 | com.firebaseapp
556 | 19485830 | 906 | 0.000032 | edu.psu
557 | 19484868 | 1848 | 0.000017 | com.infolinks
558 | 19484056 | 1647 | 0.000018 | com.coschedule
559 | 19481940 | 1672 | 0.000018 | us.pa.state
560 | 19480200 | 2209 | 0.000015 | uk.ac.nhm
561 | 19479650 | 1302 | 0.000023 | com.clicky
562 | 19477726 | 500 | 0.000049 | tv.twitch
563 | 19477544 | 532 | 0.000047 | edu.cornell
564 | 19477084 | 872 | 0.000033 | edu.washington
565 | 19476626 | 71 | 0.000478 | com.livestream
566 | 19475600 | 2307 | 0.000014 | com.autonews
567 | 19474520 | 2660 | 0.000012 | pt.publico
568 | 19474486 | 1929 | 0.000016 | org.americanprogress
569 | 19474190 | 2578 | 0.000012 | com.nordvpn
570 | 19473972 | 2206 | 0.000015 | org.sonatype
571 | 19471930 | 1457 | 0.000021 | com.activecampaign
572 | 19471612 | 625 | 0.000041 | com.samsung
573 | 19471306 | 2730 | 0.000012 | com.delawareonline
574 | 19470860 | 2848 | 0.000011 | com.topgear
575 | 19468240 | 999 | 0.000029 | edu.upenn
576 | 19465494 | 1760 | 0.000017 | uk.gov.metoffice
577 | 19464352 | 2733 | 0.000012 | com.sc
578 | 19464298 | 2573 | 0.000013 | br.inpe
579 | 19460386 | 1873 | 0.000017 | com.prweek
580 | 19460086 | 2589 | 0.000012 | com.ecowatch
581 | 19459484 | 72 | 0.000477 | net.jsfiddle
582 | 19458590 | 3293 | 0.000010 | com.algorithmia
583 | 19457214 | 2027 | 0.000016 | com.scotsman
584 | 19457126 | 429 | 0.000058 | com.slack
585 | 19455372 | 1887 | 0.000016 | com.impactbnd
586 | 19453748 | 1008 | 0.000029 | uk.ac.cam
587 | 19453316 | 2263 | 0.000014 | com.articulate
588 | 19453140 | 2780 | 0.000012 | com.nouw
589 | 19451266 | 2896 | 0.000011 | com.flock
590 | 19449038 | 2571 | 0.000013 | org.globalcitizen
591 | 19447006 | 538 | 0.000046 | com.proofpoint
592 | 19445998 | 2335 | 0.000014 | com.googledrive
593 | 19444262 | 2434 | 0.000013 | nz.co.radionz
594 | 19444224 | 2763 | 0.000012 | jp.riken
595 | 19443690 | 2388 | 0.000013 | de.greenpeace
596 | 19443190 | 119 | 0.000244 | com.youku
597 | 19442118 | 174 | 0.000141 | jp.co.yahoo
598 | 19441598 | 2831 | 0.000011 | com.mumsnet
599 | 19439924 | 1874 | 0.000017 | com.crashlytics
600 | 19439174 | 965 | 0.000030 | edu.umich
601 | 19439028 | 2114 | 0.000015 | uk.org.rspb
602 | 19438028 | 208 | 0.000116 | uk.co.amazon
603 | 19437448 | 101 | 0.000321 | de.google
604 | 19435790 | 2748 | 0.000012 | com.quickanddirtytips
605 | 19431834 | 2668 | 0.000012 | au.com.huffingtonpost
606 | 19431216 | 1896 | 0.000016 | uk.gov.london
607 | 19430698 | 2541 | 0.000013 | com.thejakartapost
608 | 19429486 | 3097 | 0.000011 | com.shanghaidaily
609 | 19428860 | 415 | 0.000061 | com.xinhuanet
610 | 19428614 | 3069 | 0.000011 | com.theminimalists
611 | 19428486 | 1271 | 0.000023 | com.sprinklr
612 | 19426496 | 1208 | 0.000025 | org.iea
613 | 19426466 | 2512 | 0.000013 | ie.thejournal
614 | 19426152 | 1785 | 0.000017 | com.jeffbullas
615 | 19424902 | 2979 | 0.000011 | com.art
616 | 19424640 | 2837 | 0.000011 | it.polito
617 | 19423008 | 1808 | 0.000017 | com.martechtoday
618 | 19422426 | 2599 | 0.000012 | uk.co.profilebusiness
619 | 19421492 | 2534 | 0.000013 | com.db
620 | 19420756 | 2851 | 0.000011 | org.onegreenplanet
621 | 19418396 | 2340 | 0.000014 | net.opendemocracy
622 | 19416952 | 1869 | 0.000017 | org.iucnredlist
623 | 19413908 | 2688 | 0.000012 | uk.org.savethechildren
624 | 19412614 | 2379 | 0.000014 | com.theyworkforyou
625 | 19411666 | 695 | 0.000037 | com.xiti
626 | 19409198 | 2661 | 0.000012 | org.oceanconservancy
627 | 19408718 | 2683 | 0.000012 | com.dreamgrow
628 | 19407976 | 2254 | 0.000014 | com.rabbitmq
629 | 19407372 | 2568 | 0.000013 | com.shoutmeloud
630 | 19407170 | 1028 | 0.000028 | com.mcafeesecure
631 | 19406866 | 449 | 0.000055 | fr.free
632 | 19403640 | 362 | 0.000069 | org.npr
633 | 19402072 | 1865 | 0.000017 | com.copyscape
634 | 19401308 | 2791 | 0.000012 | com.sitesell
635 | 19400880 | 312 | 0.000079 | gov.cdc
636 | 19399828 | 2423 | 0.000013 | com.cleantechnica
637 | 19399686 | 2809 | 0.000012 | pl.edu.uw
638 | 19397274 | 399 | 0.000063 | com.nypost
639 | 19396828 | 569 | 0.000044 | com.aol
640 | 19396446 | 3167 | 0.000010 | com.seeker
641 | 19396390 | 2760 | 0.000012 | uk.org.amnesty
642 | 19396212 | 265 | 0.000090 | com.sohu
643 | 19395962 | 1613 | 0.000019 | com.flashtalking
644 | 19395308 | 2516 | 0.000013 | com.generalmills
645 | 19393472 | 2049 | 0.000016 | com.cityam
646 | 19392474 | 3380 | 0.000010 | com.dremel
647 | 19392370 | 396 | 0.000063 | com.163
648 | 19391762 | 3020 | 0.000011 | com.brothersoft
649 | 19391670 | 2061 | 0.000016 | org.gnupg
650 | 19388022 | 36 | 0.001003 | com.createjs
651 | 19387660 | 1027 | 0.000028 | edu.ucla
652 | 19386630 | 511 | 0.000048 | com.dmca
653 | 19385442 | 1495 | 0.000020 | scot.gov
654 | 19383806 | 2347 | 0.000014 | org.grist
655 | 19383592 | 2474 | 0.000013 | uk.org.oxfam
656 | 19381766 | 2457 | 0.000013 | uk.co.thisismoney
657 | 19380480 | 3259 | 0.000010 | org.aqicn
658 | 19379848 | 2566 | 0.000013 | uk.org.rspca
659 | 19379190 | 1169 | 0.000025 | com.hollywoodreporter
660 | 19378746 | 2726 | 0.000012 | org.irena
661 | 19377826 | 2908 | 0.000011 | org.kuow
662 | 19375866 | 2934 | 0.000011 | eu.i-scoop
663 | 19375282 | 3137 | 0.000011 | com.winefolly
664 | 19374230 | 244 | 0.000096 | com.bandcamp
665 | 19373806 | 1350 | 0.000022 | net.leadpages
666 | 19371298 | 1855 | 0.000017 | net.noscript
667 | 19370726 | 1438 | 0.000021 | com.pastebin
668 | 19370120 | 2692 | 0.000012 | com.targetmarketingmag
669 | 19368524 | 3516 | 0.000010 | co.edureka
670 | 19368376 | 2773 | 0.000012 | com.ipsos-mori
671 | 19368284 | 2546 | 0.000013 | org.zsl
672 | 19368044 | 2393 | 0.000013 | com.moodys
673 | 19367896 | 1170 | 0.000025 | gov.fbi
674 | 19367686 | 2182 | 0.000015 | com.thermofisher
675 | 19366198 | 2800 | 0.000012 | uk.ac.ceh
676 | 19365484 | 273 | 0.000089 | com.surveymonkey
677 | 19364456 | 1703 | 0.000018 | uk.co.which
678 | 19363118 | 1431 | 0.000021 | uk.gov.defra
679 | 19362092 | 2626 | 0.000012 | com.wikidot
680 | 19361864 | 2112 | 0.000015 | com.problogger
681 | 19361432 | 2794 | 0.000012 | com.pnsegypt
682 | 19360486 | 3132 | 0.000011 | com.hatenadiary
683 | 19359572 | 169 | 0.000143 | com.taobao
684 | 19359506 | 333 | 0.000074 | com.pubmatic
685 | 19358770 | 377 | 0.000066 | com.scribd
686 | 19358748 | 2985 | 0.000011 | org.storyofstuff
687 | 19358106 | 3168 | 0.000010 | org.heartland
688 | 19356998 | 2902 | 0.000011 | com.nationalgrid
689 | 19355728 | 352 | 0.000070 | com.wiley
690 | 19355014 | 886 | 0.000033 | com.windowsphone
691 | 19351528 | 2511 | 0.000013 | uk.gov.forestry
692 | 19349818 | 2746 | 0.000012 | org.spie
693 | 19349596 | 816 | 0.000036 | com.mobirise
694 | 19346822 | 2963 | 0.000011 | uk.ac.mdx
695 | 19345936 | 463 | 0.000054 | com.oreilly
696 | 19345228 | 2298 | 0.000014 | com.iconarchive
697 | 19344974 | 3213 | 0.000010 | edu.uah
698 | 19344130 | 893 | 0.000032 | edu.columbia
699 | 19343846 | 2196 | 0.000015 | uk.gov.food
700 | 19342492 | 2770 | 0.000012 | edu.dukeupress
701 | 19341928 | 2518 | 0.000013 | com.wral
702 | 19337306 | 1239 | 0.000024 | google.blog
703 | 19337180 | 453 | 0.000055 | com.sxsw
704 | 19337108 | 686 | 0.000038 | com.steampowered
705 | 19332972 | 2891 | 0.000011 | com.almanac
706 | 19332496 | 915 | 0.000031 | com.docker
707 | 19332138 | 433 | 0.000057 | com.force
708 | 19330890 | 913 | 0.000032 | org.reactjs
709 | 19330434 | 3158 | 0.000011 | com.dbs
710 | 19330012 | 3320 | 0.000010 | uk.org.bornfree
711 | 19329944 | 1283 | 0.000023 | uk.org.greenpeace
712 | 19328328 | 1100 | 0.000026 | com.redhat
713 | 19328004 | 1248 | 0.000024 | com.elpais
714 | 19327924 | 785 | 0.000036 | com.webs
715 | 19324934 | 3401 | 0.000010 | org.sciencenewsforstudents
716 | 19324548 | 3476 | 0.000010 | org.sharktrust
717 | 19323678 | 3447 | 0.000010 | uk.org.caat
718 | 19322218 | 305 | 0.000080 | com.digg
719 | 19320384 | 325 | 0.000076 | com.typeform
720 | 19320196 | 2756 | 0.000012 | com.batchgeo
721 | 19319558 | 2116 | 0.000015 | com.fifa
722 | 19317480 | 2389 | 0.000013 | org.chathamhouse
723 | 19317116 | 1322 | 0.000023 | org.whatbrowser
724 | 19317094 | 2098 | 0.000016 | org.fsc
725 | 19316024 | 1706 | 0.000018 | com.nike
726 | 19315926 | 2357 | 0.000014 | uk.co.inews
727 | 19315824 | 1362 | 0.000022 | edu.ucsd
728 | 19315458 | 3400 | 0.000010 | com.artstation
729 | 19315386 | 855 | 0.000034 | org.unesco
730 | 19315260 | 2654 | 0.000012 | com.ingress
731 | 19313414 | 1561 | 0.000019 | com.technologyreview
732 | 19312758 | 2375 | 0.000014 | io.pantheon
733 | 19311846 | 2952 | 0.000011 | com.climatechangenews
734 | 19311082 | 2981 | 0.000011 | org.c2es
735 | 19309714 | 1771 | 0.000017 | com.ikea
736 | 19309506 | 3010 | 0.000011 | com.foodsafetynews
737 | 19306598 | 2574 | 0.000012 | uk.org.38degrees
738 | 19305744 | 2676 | 0.000012 | com.thecvf
739 | 19305478 | 2588 | 0.000012 | org.carbonbrief
740 | 19305458 | 2990 | 0.000011 | org.sourcewatch
741 | 19304968 | 571 | 0.000043 | com.cbsnews
742 | 19304594 | 2986 | 0.000011 | com.moneysupermarket
743 | 19304168 | 469 | 0.000053 | com.statista
744 | 19304094 | 3414 | 0.000010 | me.start
745 | 19301508 | 2844 | 0.000011 | com.tiddlywiki
746 | 19299692 | 2645 | 0.000012 | com.bnef
747 | 19298620 | 3095 | 0.000011 | uk.co.bristolpost
748 | 19297446 | 198 | 0.000122 | io.polyfill
749 | 19297002 | 3059 | 0.000011 | jp.ac.kobe-u
750 | 19296802 | 122 | 0.000238 | org.networkadvertising
751 | 19296318 | 502 | 0.000049 | com.atlassian
752 | 19294076 | 338 | 0.000073 | com.prnewswire
753 | 19291522 | 1128 | 0.000026 | com.canva
754 | 19288978 | 3012 | 0.000011 | org.twinery
755 | 19288828 | 2737 | 0.000012 | com.adcolony
756 | 19288458 | 3117 | 0.000011 | no.forskning
757 | 19286246 | 2785 | 0.000012 | com.doctoroz
758 | 19284850 | 3556 | 0.000010 | com.cmgdigital
759 | 19284678 | 3143 | 0.000011 | com.sunherald
760 | 19284062 | 3172 | 0.000010 | com.ibmbigdatahub
761 | 19283992 | 3517 | 0.000010 | com.2createawebsite
762 | 19283716 | 2996 | 0.000011 | net.organicfacts
763 | 19282858 | 2243 | 0.000014 | com.privacypolicies
764 | 19282122 | 2905 | 0.000011 | com.winemag
765 | 19281746 | 1056 | 0.000027 | com.ubuntu
766 | 19281512 | 1419 | 0.000021 | uk.co.thesun
767 | 19281086 | 470 | 0.000053 | com.inc
768 | 19281010 | 2143 | 0.000015 | org.cites
769 | 19280990 | 2290 | 0.000014 | uk.gov.dft
770 | 19279280 | 3146 | 0.000011 | com.insideevs
771 | 19279174 | 2734 | 0.000012 | de.ksta
772 | 19278422 | 2684 | 0.000012 | com.e-activist
773 | 19278376 | 1412 | 0.000021 | com.speakerdeck
774 | 19276894 | 2747 | 0.000012 | com.chubb
775 | 19273916 | 2608 | 0.000012 | org.rspo
776 | 19273894 | 964 | 0.000030 | net.2mdn
777 | 19273142 | 3265 | 0.000010 | com.jordantimes
778 | 19272034 | 319 | 0.000078 | gov.ca
779 | 19268910 | 3506 | 0.000010 | com.idt
780 | 19268426 | 2757 | 0.000012 | com.theinnovationenterprise
7811926754223490.000014uk.gov.environment-agency
7821926747834960.000010com.sutori
783192664061510.000163ru.mail
784192662241640.000152com.yelp
7851926551031840.000010com.galvanize
7861926480034250.000010com.thewritepractice
7871926477832120.000010org.carbontracker
7881926457034640.000010org.earthworksaction
7891926354817130.000018com.martechseries
790192626389810.000029com.visualstudio
7911926216833830.000010com.nutraingredients
7921926169432220.000010com.quandl
7931926145214840.000020uk.co.foe
794192609242320.000100to.amzn
7951926017417310.000018org.khanacademy
7961926013026990.000012com.businessgreen
797192599205240.000047com.airbnb
7981925963432000.000010com.thedrinksbusiness
7991925870433840.000010com.monbiot
8001925848826850.000012au.com.mumbrella
8011925710230720.000011fr.thelocal
8021925672833300.000010org.cnduk
803192566286600.000039org.eff
8041925647614410.000021com.tutsplus
8051925592230900.000011ai.fast
8061925542227230.000012com.goinswriter
8071925517033580.000010org.thechicagocouncil
8081925393630290.000011jp.hatenadiary
8091925273027430.000012gov.ferc
8101925263413840.000022com.uber
8111925209434440.000010com.visitdublin
8121925095425820.000012nz.govt.mfat
8131924984422230.000015uk.gov.charitycommission
8141924940611920.000025edu.utexas
8151924911232730.000010com.chemistryworld
8161924899833000.000010org.alaskapublic
8171924898414180.000021fr.lemonde
8181924881231440.000011com.tuck
8191924722631560.000011com.marksdailyapple
8201924628410050.000029com.americanexpress
821192462045790.000043com.patreon
8221924506228140.000012com.ing
823192450321660.000147jp.co.google
8241924424419320.000016uk.gov.education
8251924289627530.000012com.webestools
8261924250225040.000013com.instructables
8271924246011850.000025edu.princeton
8281924055236450.000010com.theppk
8291924053633050.000010com.machinelearningmastery
8301923886417160.000018se.haxx
8311923871211490.000026com.digiday
832192384628960.000032com.zoho
8331923826846690.000009com.9to5mac
8341923760237610.000010org.muslimaid
835192358365410.000046com.alibaba
8361923573628170.000012uk.ac.rcplondon
837192338825560.000045gov.sec
8381923288030430.000011com.platts
8391923268826510.000012com.recyclenow
8401923261834410.000010org.thebestschools
8411923199433520.000010com.beruby
842192318262020.000119com.constantcontact
8431923100223540.000014net.privacypolicytemplate
8441923014232070.000010com.gpsvisualizer
8451922777431040.000011com.rabobank
8461922721633060.000010com.seat61
8471922719834120.000010uk.co.lep
848192261223110.000079com.marriott
849192246662390.000098cn.com.sina
850192242827530.000036com.css-tricks
851192235322460.000095jp.co.amazon
8521922284612990.000023gd.is
8531922182823500.000014uk.co.vogue
8541922142413810.000022com.dell
855192211187220.000037fm.last
8561922110420090.000016io.getmdl
8571922043037560.000010uk.org.stopwar
8581922019626270.000012org.ramsar
8591921798819870.000016com.instapage
860192174345950.000042com.psychologytoday
8611921720235920.000010com.fox13memphis
8621921639631340.000011uk.org.sja
8631921634235380.000010com.breakingenergy
8641921607034360.000010com.star2
8651921578431030.000011org.scielo
86619215692970.000332com.sharethis
867192156868810.000033com.aliexpress
8681921532036130.000010it.diggita
869192148362100.000116jp.ne.hatena
8701921461411250.000026com.firefox
871192144926340.000040gov.nist
8721921294032520.000010org.beatthemicrobead
8731921237236030.000010nl.zoom
8741921231012320.000024com.convertkit
875192078205450.000046uk.co.eventbrite
8761920733431450.000011com.abnamro
8771920638429040.000011org.wildlifetrusts
8781920608816370.000019org.whales
8791920575010680.000027com.shutterstock
8801920467639810.000009com.visitguatemala
8811920384631280.000011uk.org.scope
8821920328810300.000028com.foxnews
8831920314826750.000012org.soilassociation
8841920284210190.000028com.cbslocal
8851920084837020.000010no.haugenbok
8861919958629140.000011com.ironsrc
887191994269520.000030com.variety
8881919934426220.000012com.feedreader
889191988765170.000048com.ea
8901919832235950.000010uk.co.theboltonnews
8911919805211900.000025com.globo
8921919681828630.000011com.itsma
8931919609815640.000019org.freecsstemplates
8941919597820780.000016com.hulu
8951919567231480.000011com.rebekahradice
896191952622890.000084com.discordapp
8971919476235430.000010info.e-ir
8981919454633640.000010org.swi-prolog
8991919228831820.000010com.wpxi
900191916984860.000051com.nasdaq
9011919086431700.000010uk.co.dennis
9021919066835510.000010com.alaskadispatch
9031919046011500.000026com.java
904191902762300.000100com.googletagservices
9051918967635040.000010es.ree
9061918956231620.000010com.sgx
9071918862637210.000010br.org.imazon
9081918833437760.000010com.citymayors
9091918818235820.000010au.com.hotfrog
9101918706632720.000010uk.org.cat
9111918628032790.000010aq.ats
912191861886780.000038com.newyorker
9131918469040020.000009net.politicalscrapbook
9141918434436060.000010com.southernfriedscience
9151918348431250.000011app.web
916191825782200.000106com.naver
9171918177017360.000018com.techrepublic
9181918097836320.000010com.theoildrum
9191918072837280.000010org.worldnuclearreport
920191806741560.000162gov.privacyshield
9211917911828710.000011uk.co.realbusiness
9221917725414630.000021edu.uchicago
9231917672614530.000021tv.ustream
9241917516418070.000017com.nba
9251917335232710.000010uk.org.cpre
9261917318817880.000017org.golang
9271917240229550.000011com.writetothem
9281917236820410.000016com.howstuffworks
9291917089614070.000021uk.co.theregister
930191706684640.000054com.adweek
931191706302430.000096com.stumbleupon
9321917048015790.000019edu.unc
9331916944422110.000015edu.virginia
9341916886036190.000010com.renewablesnow
9351916851813900.000022com.over-blog
9361916780014430.000021com.digitaltrends
9371916778240730.000009uk.co.moblog
9381916514014060.000021us.imageshack
9391916472436040.000010com.at0086
9401916449221440.000015org.coursera
9411916442837990.000010com.avivaromm
942191625829840.000029com.thinkwithgoogle
9431916245036290.000010com.eremnews
944191616604660.000053com.snapchat
9451915991814420.000021com.billboard
9461915990433940.000010uk.gov.peterborough
9471915950635300.000010org.campaigncc
948191586006420.000039org.pbs
9491915759032990.000010uk.co.siemens
9501915757434700.000010org.ilga-europe
951191562589780.000029com.dropboxusercontent
952191543808940.000032com.uservoice
9531915425215780.000019com.ssllabs
9541915399233670.000010com.trafficgenerationcafe
9551915225616140.000019com.warnerbros
956191520429220.000031com.libsyn
9571915185236650.000010uk.org.biofuelwatch
9581915171836170.000010uk.org.garyhall
9591915154823990.000013com.ehow
9601915082037710.000010no.universitetsforlaget
9611914843035590.000010br.org.idec
962191482908390.000035com.qz
9631914816429110.000011net.nend
964191474226900.000038com.webmd
9651914723816940.000018com.codeplex
9661914485213740.000022com.fiverr
9671914458439220.000009net.kjokkenutstyr
968191445724970.000049edu.cmu
9691914415840060.000009org.freedom-now
9701914393211670.000025com.smashingmagazine
9711914360431470.000011uk.org.refill
9721914337219710.000016com.invisionapp
9731914225628640.000011com.dzone
9741914215834900.000010io.dataquest
9751914153839840.000009org.alqaws
9761914123032420.000010io.dropwizard
9771914066238210.000010com.superiorthreads
9781914042033980.000010uk.co.firstnews
979191389943550.000070org.debian
9801913815421340.000015com.w3layouts
981191343888770.000033com.foursquare
9821913404030350.000011com.vungle
9831913371632050.000010org.corporateeurope
984191336448660.000034gov.census
9851913349639950.000009com.tinnedtomatoes
9861913338214000.000021com.blackberry
9871913333613350.000022jp.livedoor
9881913247633340.000010com.drillordrop
9891913183633290.000010com.ovoenergy
9901913117238040.000010com.descarteslabs
9911913077810640.000027com.politico
9921912888837880.000010org.ianfairlie
9931912872818660.000017com.nokia
9941912778626690.000012in.bbc
9951912751229870.000011org.vegsoc
9961912710833870.000010com.figure-eight
997191258183570.000070gov.ftc
998191255962420.000097org.icann
9991912535814010.000021com.xkcd
10001912521636090.000010br.com.ambev

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

October 2019 crawl archive now available

The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-43/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-43/segment.paths.gz          100
WARC files               CC-MAIN-2019-43/warc.paths.gz             56000   59.56
WAT files                CC-MAIN-2019-43/wat.paths.gz              56000   21.7
WET files                CC-MAIN-2019-43/wet.paths.gz              56000   9.94
Robots.txt files         CC-MAIN-2019-43/robotstxt.paths.gz        56000   0.15
Non-200 responses files  CC-MAIN-2019-43/non200responses.paths.gz  56000   1.69
URL index files          CC-MAIN-2019-43/cc-index.paths.gz         302     0.22
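
The prefixing described above can be sketched in a few lines of Python. This is only an illustration: the helper name `expand_paths` and the two sample file names are made up, and a real listing would be downloaded first rather than built in memory.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix):
    """Prepend prefix to every non-empty line of a gzipped path listing."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Stand-in for the content of a downloaded warc.paths.gz (hypothetical file names).
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2019-43/segments/0000000000000.00/warc/example-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2019-43/segments/0000000000000.00/warc/example-00001.warc.gz\n"
)

http_urls = expand_paths(sample_listing, HTTP_PREFIX)
s3_urls = expand_paths(sample_listing, S3_PREFIX)
print(http_urls[0])
print(s3_urls[0])
```

The same function works for the WAT, WET and other listings, since they share the one-path-per-line layout.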

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-43/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact us for sponsorship information.

September 2019 crawl archive now available

The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th. It includes page captures of 1.0 billion URLs not contained in any crawl archive before. The other 1.5 billion pages had already been captured in prior crawls and were revisited in this one.

Archive Location and Download

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-39/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-39/segment.paths.gz          100
WARC files               CC-MAIN-2019-39/warc.paths.gz             56000   55.99
WAT files                CC-MAIN-2019-39/wat.paths.gz              56000   18.01
WET files                CC-MAIN-2019-39/wet.paths.gz              56000   8.16
Robots.txt files         CC-MAIN-2019-39/robotstxt.paths.gz        56000   0.14
Non-200 responses files  CC-MAIN-2019-39/non200responses.paths.gz  56000   1.63
URL index files          CC-MAIN-2019-39/cc-index.paths.gz         302     0.19

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-39/. The columnar index has also been updated to include this crawl.

August 2019 crawl archive now available

The crawl archive for August 2019 is now available! It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th.

The August crawl contains page captures of 1.1 billion URLs not contained in any crawl archive before. New URLs are sampled based on the host and domain ranks (harmonic centrality) published as part of the May/Jun/Jul 2019 webgraph data set from the following sources:

  • a random sample of 2.1 billion outlinks extracted from July crawl WAT files
  • 1.8 billion URLs mined in a breadth-first side crawl within a maximum of 6 links (“hops”), started from
    • the homepages of the top 60 million hosts and domains
    • randomly selected samples of 2 million human-readable sitemap pages (HTML format) and of 3 million URLs of pages written in 130 less-represented languages (cf. language distributions)
  • 1 billion URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds

Starting with this crawl the following fixes and improvements are applied to the provided data formats:

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-35/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-35/segment.paths.gz          100
WARC files               CC-MAIN-2019-35/warc.paths.gz             56000   53.53
WAT files                CC-MAIN-2019-35/wat.paths.gz              56000   20.85
WET files                CC-MAIN-2019-35/wet.paths.gz              56000   9.29
Robots.txt files         CC-MAIN-2019-35/robotstxt.paths.gz        56000   0.18
Non-200 responses files  CC-MAIN-2019-35/non200responses.paths.gz  56000   1.79
URL index files          CC-MAIN-2019-35/cc-index.paths.gz         302     0.22

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-35/. The columnar index has also been updated to include this crawl.

Host- and Domain-Level Web Graphs May/June/July 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark on GitHub which host all scripts and tools required to construct the graphs.

What’s new?

Links from Content-Location and Link HTTP headers are now also used to build the web graphs. This is in accordance with RFC 5988, which defines the Link HTTP header as semantically equivalent to the <link> element in HTML. It is also consistent with previous web graph releases, which included all kinds of links, including technical ones and redirects.
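
As a rough illustration of what collecting links from these two headers involves, here is a minimal Python sketch. It is not the extraction code used for the web graphs: the function name `header_links`, the sample headers and the simple regular expression are assumptions, and the sketch ignores link parameters such as `rel`.

```python
import re
from urllib.parse import urljoin

# In a Link header every target is enclosed in angle brackets.
LINK_TARGET = re.compile(r"<([^>]*)>")


def header_links(page_url, headers):
    """Return absolute link targets found in Link and Content-Location headers."""
    targets = []
    for name, value in headers:
        name = name.lower()
        if name == "link":
            targets.extend(LINK_TARGET.findall(value))
        elif name == "content-location":
            targets.append(value.strip())
    # Relative targets are resolved against the URL of the page.
    return [urljoin(page_url, target) for target in targets]


headers = [
    ("Link", '<https://example.org/feed.xml>; rel="alternate", </page/2>; rel="next"'),
    ("Content-Location", "/index.en.html"),
]
links = header_links("https://example.com/index", headers)
print(links)
```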

Host-level graph

The graph consists of 445 million nodes and 3.14 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled but are pointed to by a link on a crawled page. There are 382 million dangling nodes (86%), and the largest strongly connected component contains 48 million nodes (11%).
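
The dangling-node share can be reproduced from an edge list with a short Python sketch. It assumes node ids run from 0 to n-1 and treats every node without an outgoing edge as dangling; the function name and the toy graph are made up for illustration.

```python
def dangling_nodes(num_nodes, edges):
    """Return ids of nodes without outgoing edges in a graph with ids 0..num_nodes-1."""
    has_outlink = [False] * num_nodes
    for from_id, _to_id in edges:
        has_outlink[from_id] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]


# Toy graph: hosts 0 and 1 were crawled and link out,
# hosts 2, 3 and 4 only appear as link targets.
edges = [(0, 1), (0, 2), (1, 0), (1, 3), (1, 4)]
dangling = dangling_nodes(5, edges)
share = len(dangling) / 5
print(dangling, f"{share:.0%}")

# The percentages quoted for the host-level graph follow the same arithmetic.
dangling_pct = round(100 * 382 / 445)
largest_scc_pct = round(100 * 48 / 445)
print(dangling_pct, largest_scc_pct)
```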

You can download the graph and the ranks of all 445 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl May/June/July 2019 host-level webgraph

Size      File                                             Description
3.02 GB   cc-main-2019-may-jun-jul-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 28 vertices files
14.75 GB  cc-main-2019-may-jun-jul-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 56 edges files
6.42 GB   cc-main-2019-may-jun-jul-host.graph              graph in BVGraph format
2 kB      cc-main-2019-may-jun-jul-host.properties
7.01 GB   cc-main-2019-may-jun-jul-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2019-may-jun-jul-host-t.properties
1 kB      cc-main-2019-may-jun-jul-host.stats              WebGraph statistics
7.22 GB   cc-main-2019-may-jun-jul-host-ranks.txt.gz       harmonic centrality and pagerank
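
To show how the vertices ⟨id, rev host⟩ and edges ⟨from_id, to_id⟩ files fit together, the following Python sketch joins a tiny made-up sample. It assumes that both file types are gzipped text with one record per line and tab-separated fields; the helper `read_tsv` and the sample hosts are illustrative only.

```python
import gzip
import io


def read_tsv(gz_bytes):
    """Yield the tab-separated fields of each line in gzipped text."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            yield line.rstrip("\n").split("\t")


# Tiny stand-ins for one vertices file and one edges file (made-up content).
vertices_gz = gzip.compress(b"0\tcom.example\n1\tcom.example.blog\n2\torg.example\n")
edges_gz = gzip.compress(b"0\t1\n1\t2\n")

# Map node ids to reversed host names, then translate each edge.
host_by_id = {int(node_id): rev_host for node_id, rev_host in read_tsv(vertices_gz)}
named_edges = [
    (host_by_id[int(from_id)], host_by_id[int(to_id)])
    for from_id, to_id in read_tsv(edges_gz)
]
print(named_edges)
```

An in-memory dictionary is fine for a sample like this. For the full graph with 445 million hosts, a tool built for large joins is likely the more practical choice.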

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
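
A minimal Python sketch of this naming convention is shown below. The function names are made up, and the sketch only covers the rule as stated here (one leading www. removed, labels reversed).

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed host name back into the usual notation."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Since the www. prefix is dropped before reversing, the reverse mapping cannot tell whether the original host name carried it.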

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
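
The aggregation step can be sketched as follows. This is a simplified Python illustration, not the pipeline used to build the graph: the suffix set is a tiny hypothetical excerpt in reversed notation (the real public suffix list is much longer and has wildcard and exception rules), hosts without a known suffix are skipped, and dropping links within the same domain is an assumption of the sketch.

```python
# Hypothetical, tiny excerpt of public suffixes in reversed notation.
REVERSED_SUFFIXES = {"com", "org", "uk", "uk.co"}


def pay_level_domain(rev_host):
    """Return the reversed pay-level domain, or None if none can be derived."""
    labels = rev_host.split(".")
    suffix_len = 0
    # Find the longest leading run of labels that is a known public suffix.
    for n in range(1, len(labels) + 1):
        if ".".join(labels[:n]) in REVERSED_SUFFIXES:
            suffix_len = n
    if suffix_len == 0 or suffix_len == len(labels):
        return None
    # The pay-level domain is the suffix plus one more label.
    return ".".join(labels[: suffix_len + 1])


def aggregate(host_edges):
    """Collapse host-level edges to unique domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_dom, dst_dom = pay_level_domain(src), pay_level_domain(dst)
        if src_dom and dst_dom and src_dom != dst_dom:
            domain_edges.add((src_dom, dst_dom))
    return sorted(domain_edges)


host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.shop", "uk.co.bbc"),
    ("com.example", "com.example.blog"),
]
domain_edges = aggregate(host_edges)
print(domain_edges)
```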

The domain-level graph has 88 million nodes and 1.9 billion edges. 46 million nodes (52%) are dangling nodes, and the largest strongly connected component covers 35 million nodes (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/domain/ or, via HTTPS, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-may-jun-jul/domain/.

Download files of the Common Crawl May/June/July 2019 domain-level webgraph

Size      File                                             Description
0.61 GB   cc-main-2019-may-jun-jul-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
7.50 GB   cc-main-2019-may-jun-jul-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
4.06 GB   cc-main-2019-may-jun-jul-domain.graph            graph in BVGraph format
2 kB      cc-main-2019-may-jun-jul-domain.properties
3.99 GB   cc-main-2019-may-jun-jul-domain-t.graph          transpose of the graph
2 kB      cc-main-2019-may-jun-jul-domain-t.properties
1 kB      cc-main-2019-may-jun-jul-domain.stats            WebGraph statistics
1.91 GB   cc-main-2019-may-jun-jul-domain-ranks.txt.gz     harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 90 million domain ranks is available for download.
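
For working with the full ranks file, a small Python sketch along the following lines may help. It assumes one record per line with five tab-separated columns in the same order as the table below (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name) and a header line starting with `#`. The header text and the function name are hypothetical; the three sample records are the first rows of the table.

```python
import heapq

sample_lines = [
    "#hc_rank\thc_value\tpr_rank\tpr_value\trev_host",  # hypothetical header line
    "1\t29977668\t1\t0.020841\tcom.googleapis",
    "2\t27867704\t3\t0.011812\tcom.facebook",
    "3\t27419980\t2\t0.012857\tcom.google",
]


def parse_rank_line(line):
    """Split one ranks line into typed fields."""
    hc_rank, hc_value, pr_rank, pr_value, rev_host = line.rstrip("\n").split("\t")
    return {
        "hc_rank": int(hc_rank),
        "hc_value": float(hc_value),
        "pr_rank": int(pr_rank),
        "pr_value": float(pr_value),
        "rev_host": rev_host,
    }


records = [parse_rank_line(line) for line in sample_lines if not line.startswith("#")]

# Re-rank by PageRank instead of harmonic centrality and keep the top 2.
top_by_pagerank = heapq.nsmallest(2, records, key=lambda r: r["pr_rank"])
top_names = [r["rev_host"] for r in top_by_pagerank]
print(top_names)
```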

Top 1000 domains ranked by harmonic centrality (May/June/July 2019)

harmonic centrality rank   hc value   page rank   page rank value   reversed hostname
12997766810.020841com.googleapis
22786770430.011812com.facebook
32741998020.012857com.google
42519603040.007273com.twitter
52455883650.006439org.w
62453370260.005984com.youtube
72259209890.003799com.instagram
82206065070.004857org.gmpg
921829028130.002863com.linkedin
102159544680.004481com.googletagmanager
1120930920220.001704com.gravatar
1220912076240.001531com.pinterest
1320730700110.003384com.cloudflare
1420698732170.002180com.wordpress
1520613210120.003087org.wordpress
1620607942260.001241org.wikipedia
1720408594140.002452com.bootstrapcdn
1820351540200.001823com.apple
1920148418410.000904com.blogspot
2020103846300.001124com.vimeo
2120036764210.001719com.jquery
2219874716500.000673com.wp
2319870332290.001130com.microsoft
2419839912430.000816gl.goo
2519828406450.000769com.amazon
2619793040180.002021com.gstatic
2719790998190.002015com.adobe
2819788744570.000573com.tumblr
2919754126310.001104com.amazonaws
3019619798250.001407com.macromedia
3119616602340.001057com.googlesyndication
3219585788470.000744be.youtu
3319585670390.000937com.google-analytics
3419583342620.000531ly.bit
3519572994680.000440com.yahoo
3619549710330.001080com.flickr
3719526876350.001023net.cloudfront
3819526762230.001676com.github
3919503814600.000553me.wp
4019467672270.001170ru.yandex
4119467424580.000568org.mozilla
42194548901060.000305com.googleusercontent
4319411724490.000725net.doubleclick
4419374766520.000658co.t
4519366860440.000776com.baidu
4619322188700.000401com.weebly
47193217541050.000310com.reddit
48193170941230.000234com.nytimes
4919313908460.000749com.paypal
50193080941040.000312com.soundcloud
5119278436670.000448com.medium
5219268558660.000451io.github
5319266970630.000517org.w3
5419255616800.000379org.creativecommons
55192280341840.000143uk.co.bbc
56192194701750.000151com.imgur
57191914841370.000194com.forbes
58191843201680.000154net.slideshare
5919169524560.000588org.schema
60191662141620.000162com.bing
61191633881800.000144net.sourceforge
62191558821820.000143org.wikimedia
6319145618480.000738com.googleadservices
64191438402150.000109com.businessinsider
65191360402330.000104com.techcrunch
66191251982730.000089com.reuters
67191127301520.000169com.theguardian
68190917081770.000147com.imdb
6919081148640.000496net.jsdelivr
70190766421450.000177org.apache
71190689122020.000120org.gnu
72190677202500.000097com.ibm
73190659042740.000089com.cnet
74190604021940.000124com.washingtonpost
75190561621640.000159com.blogger
76190496223360.000073gov.nasa
77190434062710.000090com.android
7819038878320.001080com.fontawesome
79190308241960.000123com.huffingtonpost
80190227642430.000100com.oracle
8119022114990.000323com.shopify
82190100921780.000147com.stackoverflow
83190088662640.000092com.bbc
84189915041380.000194com.wixsite
85189797941930.000128org.ampproject
86189796063310.000074com.latimes
87189669243340.000073com.livejournal
88189543521480.000171com.eventbrite
89189529144060.000061com.zdnet
9018951470380.000950com.addthis
91189411682600.000093com.usatoday
92189303062610.000093com.wired
93189299484730.000052com.economist
94189248941220.000237com.ytimg
95189158202950.000083com.prnewswire
96189077841070.000304com.whatsapp
97189055622410.000101com.appspot
98189037502890.000086org.npr
99188998266050.000046com.thenextweb
100188987321390.000192com.issuu
101188971301980.000122org.ietf
102188931881810.000143jp.co.yahoo
103188890961420.000183com.spotify
104188887604490.000055com.venturebeat
10518888186550.000590eu.europa
106188862403820.000064com.goodreads
10718880882370.000994com.qq
108188808806010.000046org.ieee
109188769882090.000114com.bandcamp
110188744483590.000068com.quora
111188726664260.000058com.cisco
112188696402110.000112net.behance
113188665604740.000052org.arxiv
114188520803940.000062com.buzzfeed
11518844806950.000330com.sharethis
116188345024270.000058com.deviantart
117188341668990.000031com.ibtimes
118188297621850.000141com.giphy
11918828960960.000328com.statcounter
120188250746490.000043com.stackexchange
121188236241700.000152uk.co.google
122188188482830.000087com.cnbc
123188173848250.000034org.eclipse
124188145663330.000074com.aol
125188143924850.000051com.pixabay
126188069442060.000117com.disqus
127188009124580.000054com.about
12818793968420.000849com.squarespace
129187935725220.000048com.mysql
130187927401440.000180com.yelp
131187907943550.000068com.theatlantic
132187874244170.000059me.about
133187870063170.000077com.skype
134187826364760.000052com.visualstudio
135187805382320.000104me.t
136187726669480.000030com.nvidia
137187725604680.000053com.wikihow
138187683582760.000089com.sciencedirect
139187678222240.000106com.dribbble
140187622663240.000075com.scribd
141187592367120.000039google.blog
142187568861830.000143com.salesforce
143187562365510.000048com.slate
144187539681310.000208com.dropbox
145187516964070.000061uk.co.independent
146187512422990.000081com.fastcompany
147187465902570.000094com.googlecode
148187461422130.000111com.hubspot
149187444704400.000057com.newyorker
150187444524300.000058com.box
151187433321200.000249org.networkadvertising
152187369566670.000042org.chromium
153187359184630.000053gov.loc
154187341902970.000082com.example
155187337622000.000121com.cnn
156187314286710.000041com.tinypic
157187281602690.000090com.fc2
158187261047900.000035com.nymag
159187231847070.000039com.smashingmagazine
160187197046160.000045com.evernote
161187185762720.000090com.nbcnews
162187163965480.000048net.azurewebsites
163187106062190.000108com.npmjs
164187097701550.000167org.archive
165187087683060.000079com.w3schools
1661870509010240.000028ca.utoronto
167187038801910.000130jp.ne.hatena
168186999744770.000052io.codepen
16918699212610.000544com.vk
170186990169690.000029com.ign
171186946927030.000039com.speakerdeck
172186942568530.000033com.mediafire
173186916285060.000049com.foursquare
174186861028940.000031com.nike
175186841366080.000046com.trello
176186793021190.000251info.aboutads
177186761683760.000066com.mozilla
17818670790530.000604com.wix
179186697806390.000044uk.ac.ox
180186642641460.000174com.amazon-adsystem
181186611621030.000317com.paypalobjects
18218658320840.000366com.bizjournals
183186534383420.000072com.getpocket
184186390783160.000077ca.google
185186363244980.000050com.indiatimes
186186283165960.000047com.pinimg
187186261626240.000045com.cbslocal
188186242783110.000078edu.mit
189186238789420.000030com.chron
190186224201140.000272net.windows
1911861878611580.000025org.tensorflow
192186183267260.000038ca.blogspot
193186176028420.000033com.sap
194186156788410.000033com.css-tricks
195186121443600.000068com.entrepreneur
196186060506230.000045com.libsyn
197186033401340.000205com.unpkg
198186023021170.000253com.stripe
199186003523080.000079edu.harvard
200185974642260.000106com.wsj
2011859521410700.000026com.hackernoon
202185941748360.000033com.thehill
20318592786590.000557com.fb
204185905106250.000045ca.cbc
205185901729120.000031org.unicode
206185866107920.000035com.buffer
207185858803690.000067com.elsevier
208185811267940.000035com.theglobeandmail
20918580570150.002238com.wixstatic
210185793263630.000068me.telegram
211185788586620.000042com.searchengineland
212185764021790.000147org.bbb
213185749006560.000043site.business
214185744764810.000051com.withgoogle
215185743462520.000097es.google
216185726168740.000032org.kernel
217185724986440.000044com.flipboard
218185711067250.000038co.ibb
219185650146580.000042com.huffpost
2201856358810050.000028edu.rutgers
221185627888480.000033uk.co.wired
222185607447590.000036com.ssrn
223185606061130.000272com.weibo
2241855764610390.000027com.aljazeera
225185558607360.000037gov.archives
226185543383460.000071com.mapbox
227185540086370.000044org.d3js
228185532781510.000170com.yimg
2291855109810930.000026org.hrw
230185491046030.000046gg.discord
2311854683414680.000020com.hm
2321854670811460.000025ly.visual
233185459689850.000029com.geekwire
234185454642010.000120com.optimizely
2351854475412510.000023ca.huffingtonpost
236185441142120.000111edu.stanford
237185428609430.000030uk.co.huffingtonpost
238185423346180.000045co.elastic
2391853924218770.000017com.pearltrees
2401853588018290.000017cn.people
2411853069014020.000021com.diigo
242185288042960.000082com.tinyurl
243185280344550.000054com.mapquest
244185255509790.000029org.slashdot
2451852427611060.000025edu.osu
24618523478650.000473net.akamaihd
247185223168820.000032com.theconversation
248185174362780.000089org.purl
249185173623750.000066com.mashable
2501851427210970.000026com.dw
251185137709340.000030com.bt
252185115128440.000033com.today
253185114908770.000032com.marketwired
2541851032412670.000023jp.co.ntv
2551850984210990.000026com.mentalfloss
256185095889860.000029com.computerworld
2571850527613760.000021jp.ac.u-tokyo
258185052248400.000033co.g
259185030728470.000033com.healthline
260185027047820.000035com.ecwid
2611850161014280.000021com.sas
262185015264650.000053com.yoast
2631849900611010.000025edu.gatech
264184900784540.000054com.moz
2651848797616360.000019com.kaggle
2661848777213710.000021com.makeuseof
267184874005010.000050me.m
268184871642800.000088com.bloomberg
269184871547710.000036com.econsultancy
270184868388800.000032uk.parliament
271184837708200.000034com.newsweek
2721848373013000.000022com.googlesource
2731848064014260.000021blog.home
274184802547610.000036com.outbrain
2751847830610770.000026com.sfchronicle
276184755702040.000119org.iana
277184736423130.000077com.scorecardresearch
278184714661540.000169gov.nih
2791846176212790.000022com.avg
280184598224600.000054com.theverge
281184577326360.000044jp.shinobi
2821845503610090.000028org.postgresql
2831845205614150.000021com.dailydot
284184495729400.000030com.foxbusiness
285184483308140.000034com.adjust
286184470547640.000036edu.brookings
287184460747670.000036com.business2community
2881843792818610.000017com.uniqlo
2891843586614460.000020com.dezeen
290184333103470.000071com.trustpilot
291184327228050.000035com.contentmarketinginstitute
2921843196810360.000027com.trendmicro
293184312229090.000031org.aarp
294184309228500.000033com.searchenginewatch
295184280423090.000078org.python
296184279841600.000163com.twimg
297184274545080.000049edu.berkeley
298184252007680.000036uk.co.pinterest
299184233964460.000055com.bigcommerce
3001841826613580.000021edu.iastate
3011841717213070.000022com.motherjones
302184169307150.000039com.techtarget
303184140682810.000087com.myspace
3041841352211540.000025com.hostgator
3051841131410460.000027com.medicalnewstoday
3061841069610250.000028com.bustle
30718410084690.000420com.list-manage
308184095623230.000076uk.co.telegraph
309184093623300.000074com.meetup
3101840869011680.000024org.openoffice
3111840601212960.000022com.contently
312184035327200.000038com.cdbaby
313184020045140.000049com.adage
3141840140413370.000022org.wnyc
315184002867140.000039com.neilpatel
3161839861015270.000020com.mathworks
317183975184780.000052net.researchgate
318183945989380.000030co.apple
319183945863290.000074com.go
32018393232940.000347com.godaddy
321183920264110.000060com.msn
322183916744040.000061com.ted
3231839014814840.000020io.material
324183899948170.000034com.arstechnica
325183895088600.000033com.wikia
326183881889700.000029com.vogue
327183849123770.000066me.wa
3281838202014750.000020se.blogspot
329183806707890.000035edu.washington
330183803621650.000158com.opera
331183771062480.000098com.rawgit
332183769388190.000034com.bandsintown
3331837455810660.000026com.convinceandconvert
3341837415412380.000023com.convertkit
3351837385018710.000017io.soup
3361837031014380.000020com.secondlife
3371836615217210.000018com.zara
338183644382870.000086com.live
339183626282380.000102com.surveymonkey
340183586541880.000132com.etsy
341183563881690.000153com.feedburner
3421835609419440.000016edu.uark
3431835504819110.000017com.mysanantonio
344183547262660.000092uk.org.ico
345183526304290.000058org.hbr
346183524066020.000046com.livechatinc
3471835205814930.000020com.thenation
348183515867500.000037com.yellowpages
349183499221120.000281com.mailchimp
350183491268150.000034com.wordstream
3511834906215060.000020com.toptal
3521834703612160.000024io.itch
353183426264940.000050com.kickstarter
354183415722350.000104com.typepad
355183406084200.000059com.googleblog
356183380463660.000068com.aliyuncs
3571833776016700.000018com.manta
3581833760014630.000020com.amcharts
3591833665214190.000021com.indiewire
360183353784560.000054com.fortune
36118333310510.000663net.fbcdn
362183332862310.000105uk.co.amazon
3631833109215870.000019ly.adobe
364183297609220.000030com.searchenginejournal
3651832860215690.000019ms.nyti
366183257443740.000066com.ft
3671832351620040.000016com.zoominfo
3681832344211790.000024com.grammarly
3691832178016200.000019li.paper
3701832175012100.000024com.csmonitor
3711832151221480.000015com.brandyourself
3721832071620760.000015me.websta
373183108423400.000072com.getclicky
374183104009110.000031uk.gov.nationalarchives
375183068468630.000033com.engadget
376183040661590.000165com.zendesk
377183009809620.000029com.cio
37818300896870.000360de.google
3791830085216250.000019id.co.blogspot
3801830075217350.000018org.unfpa
381182994786860.000040com.intel
382182977164310.000058com.nationalgeographic
3831829768419420.000016com.cinemablend
3841829549219390.000016com.wral
385182953726630.000042com.vice
386182946684430.000056com.oreilly
3871829455410200.000028com.weddingwire
388182930344610.000053com.nature
3891829299814400.000020com.harpercollins
390182915702900.000085gov.cdc
391182906583640.000068com.githubusercontent
392182903685200.000048com.photobucket
393182903649260.000030com.socialmediaexaminer
394182900209980.000028com.firebaseapp
395182891508750.000032com.angieslist
396182888429010.000031com.sendpulse
397182886288220.000034edu.columbia
398182877508230.000034com.pexels
3991828660015410.000019com.mindbodygreen
4001827945215160.000020com.mailjet
401182783561490.000171com.tripadvisor
402182781683190.000077com.wiley
4031827693018500.000017com.merchantcircle
404182764542680.000090com.digg
4051827608818900.000017fr.huffingtonpost
4061827574616950.000018com.thoughtworks
4071827376010140.000028org.ocks
4081827322020620.000015jp.pinterest
409182727684840.000051com.cbsnews
410182718783520.000069int.who
411182705288160.000034com.format
412182701082550.000096net.php
4131826992414640.000020com.thecut
4141826865820550.000015org.spie
415182645542140.000110org.aboutcookies
4161826330012310.000023com.mynewsdesk
417182617324090.000060com.office
4181826162410710.000026com.fastcodesign
4191826085614520.000020fr.liberation
420182607743350.000073com.time
421182603664440.000056org.freecodecamp
4221826002016060.000019com.dummies
4231825940017780.000018com.instapaper
424182589307550.000036com.mediapost
425182558426300.000044com.proofpoint
4261825411818780.000017it.binged
4271825408613210.000022ly.snip
428182528584160.000059uk.co.dailymail
429182492606040.000046org.nodejs
430182485903920.000062fr.free
431182484924640.000053com.statista
432182473568790.000032com.gizmodo
433182466463150.000077com.st-hatena
4341824538816600.000018com.superpages
4351824407811200.000025com.theknot
436182436783570.000068com.unsplash
4371824149413970.000021com.jeffbullas
4381823620815220.000020com.biography
4391823594621460.000015de.huffingtonpost
4401823482014320.000021com.csoonline
4411823472614860.000020com.louisvuitton
442182335121210.000246com.jimdo
4431823292010400.000027uk.ac.cam
4441823234813380.000022google.ai
4451823158621900.000014com.mango
4461823090212270.000023com.activecampaign
447182263369640.000029com.netlify
448182261729530.000030com.eater
4491822398410040.000028com.smallbiztrends
4501822356421050.000015site.negocio
451182231002770.000089com.ebay
4521822177813010.000022ca.yellowpages
453182204226890.000040com.windowsphone
454182203667750.000035com.marketwatch
4551821971411470.000025com.redhat
4561821797221700.000015edu.scad
4571821766012900.000022com.digitaltrends
4581821731811230.000025org.mathjax
4591821667016580.000019com.politifact
4601821554622250.000014com.dexknows
461182147904900.000050gov.whitehouse
4621821004412250.000023com.quicksprout
463182074941760.000150com.slack
4641820520816550.000019uk.co.bbci
4651820319413360.000022com.cmswire
46618202308790.000382net.jsfiddle
4671819967416830.000018com.nyt
4681819849019280.000016com.itsnicethat
469181974928350.000034edu.psu
470181968563540.000068com.booking
471181967966880.000040com.webs
472181958529600.000030edu.ucla
473181913647010.000039gov.nist
474181911389450.000030com.sprinklr
475181911023070.000079gov.ca
47618188332760.000389com.livestream
4771818690813750.000021net.openid
4781818675011310.000025gov.fbi
479181858344750.000052tv.twitch
4801818349819820.000016google.design
4811817695017900.000017com.psmag
482181757887740.000036com.oath
4831817381614980.000020org.gnupg
484181721443510.000069com.hp
485181716302910.000085org.acm
4861816729624880.000013org.travelblog
4871816703222430.000014com.ingress
4881816557812640.000023com.coschedule
4891816476617460.000018com.financialexpress
4901816464818680.000017com.allafrica
4911816436011100.000025edu.princeton
4921816367223050.000014com.tommy
4931816337616290.000019org.whatbrowser
4941816234412990.000022com.kinsta
4951816131224410.000014com.algorithmia
496181611405300.000048net.brightcove
4971815843220240.000016jp.riken
498181576668310.000034com.msdn
499181576405110.000049edu.cornell
5001815656822790.000014com.theminimalists
501181539402360.000103to.amzn
5021815384014600.000020net.noscript
503181470563050.000079com.typeform
5041814683817270.000018com.iconarchive
505181453609280.000030org.weforum
506181446228380.000033com.git-scm
5071814354022010.000014net.organicfacts
5081814024217240.000018com.gap
509181389007090.000039org.bitbucket
510181366304030.000061com.dailymotion
511181346483930.000062com.nypost
5121813442422440.000014com.bonfire
5131813383220950.000015it.polito
514181325729030.000031com.sfgate
515181305442390.000101com.stumbleupon
5161813038622730.000014net.brownbook
5171812959420730.000015com.zynga
518181272965230.000048edu.yale
5191812613416560.000019com.wayfair
520181259982540.000096org.drupal
521181259263810.000065org.un
5221812381223000.000014com.23hq
523181228366140.000045gov.sec
524181201564150.000059com.gmail
5251811968411960.000024com.playstation
5261811823417170.000018org.polymer-project
5271811332817720.000018za.co.iol
5281811254819940.000016au.com.huffingtonpost
5291811092223440.000014com.marksdailyapple
5301811089415070.000020com.impactbnd
531181097406480.000043com.jwplatform
5321810923615130.000020com.instapage
5331810729213740.000021com.ning
5341810699020350.000015com.dreamgrow
535181061222920.000085cn.com.sina
5361810501021880.000014net.openreview
5371810480616040.000019com.aolcdn
538181040042100.000113com.constantcontact
5391810387418890.000017uk.ac.jisc
5401810365619930.000016com.towardsdatascience
5411810183016690.000018com.thermofisher
5421810027019240.000016com.city-data
543180999009410.000030uk.co.guardian
5441809982621360.000015com.whitepages
5451809950018910.000017com.deepmind
546180984966110.000046com.mobirise
547180974403560.000068com.springer
5481809627819290.000016org.elasticsearch
549180943907430.000037com.steampowered
5501809204810910.000026com.auth0
551180920081920.000128com.eepurl
5521809169412140.000024kr.or.kisa
553180908008320.000034gov.senate
55418090404710.000398me.fb
5551809018839300.000010com.artstation
556180901106840.000040org.eff
5571808850620750.000015com.quickanddirtytips
5581808822018560.000017com.googledrive
5591808789022670.000014lb.com.dailystar
5601808737010830.000026de.spiegel
5611808718421890.000014com.oilprice
5621808659813770.000021io.bower
5631808658619970.000016com.batchgeo
5641808636010600.000027com.clicky
5651808599011720.000024com.merriam-webster
5661808474619270.000016com.nytco
567180842722840.000087com.histats
5681808385616130.000019org.jenkins-ci
5691808358019660.000016com.underconsideration
5701808309022210.000014com.swatch
571180818686170.000045uk.co.blogspot
572180789363430.000071com.sxsw
573180785746190.000045com.patreon
5741807738814710.000020io.getmdl
5751807650610810.000026com.hollywoodreporter
576180753946100.000046com.163
577180753481560.000166ru.mail
5781807494018450.000017com.rabbitmq
5791807463617830.000017com.lexology
5801807455016650.000018com.invisionapp
5811807427219870.000016com.lightreading
5821807390613510.000021edu.northwestern
583180735569960.000028com.ubuntu
5841807345421110.000015edu.dukeupress
5851807185221730.000015org.onegreenplanet
5861807148021410.000015com.hotfrog
5871807059223610.000014edu.uah
5881806887213690.000021org.khanacademy
5891806843814610.000020uk.co.thesun
5901806690434870.000012com.wikidot
5911806637016140.000019com.digitaloceanspaces
5921806560224320.000014net.sott
5931806550015470.000019com.technologyreview
594180652823490.000070com.staticflickr
59518063590780.000383org.reactjs
596180619546600.000042com.xinhuanet
5971806177625550.000013com.idt
598180616782470.000098de.amazon
599180612687390.000037com.qz
6001805793619310.000016com.googleapps
6011805788417530.000018io.pantheon
6021805776821320.000015net.eenews
603180571167790.000035com.deloitte
6041805703816510.000019com.checkatrade
605180546226570.000043com.psychologytoday
606180545969000.000031gov.nps
6071805107819730.000016com.shoutmeloud
6081804917427610.000013ca.411
6091804862014960.000020com.citysearch
6101804818417480.000018com.tutsplus
6111804499021260.000015io.flutter
6121804403620640.000015com.vanguardngr
6131804324214730.000020edu.unc
6141804317419340.000016com.gimletmedia
6151804262616270.000019com.fifa
6161804143624630.000013org.simile-widgets
617180412309320.000030edu.upenn
6181804093020980.000015com.designobserver
619180408346610.000042org.pbs
6201804055224100.000014com.ubu
6211804042211180.000025net.recode
6221803956612080.000024jobs.amazon
623180384463450.000071com.tripod
6241803656213150.000022edu.purdue
625180351969210.000030com.variety
626180345029800.000029com.alexa
6271803415012110.000024us.imageshack
6281803317421980.000014edu.arizona
6291803286220080.000016in.huffingtonpost
6301803073415920.000019com.yell
631180302789740.000029org.sciencemag
6321802972813200.000022uk.co.theregister
6331802824616790.000018com.verywellmind
634180259548520.000033org.worldbank
635180256388650.000033io.readthedocs
636180251041300.000208com.youku
6371802471421780.000015com.epochtimes
6381802430621860.000015info.bem
639180233982210.000107com.taobao
6401802228811440.000025com.elpais
6411802180019630.000016org.dartlang
6421802156610880.000026org.altervista
643180212943580.000068org.debian
644180209924450.000056com.force
6451802094012750.000023com.ifttt
6461802046622090.000014com.youm7
6471801964010730.000026com.vox
6481801956819330.000016com.hulu
6491801903222560.000014au.com.yellowpages
6501801898025050.000013com.pushwoosh
6511801661211770.000024com.nydailynews
652180161306980.000039gov.noaa
6531801460016570.000019com.yext
654180140229580.000030com.shutterstock
6551801362823200.000014com.gifyu
6561801332012620.000023com.storify
657180132566760.000041com.samsung
6581801294410950.000026edu.ucsd
659180119784220.000058edu.nyu
660180097366960.000040com.tandfonline
661180095824470.000055com.atlassian
662180092468960.000031com.geocities
663180088124390.000057edu.cmu
6641800874624330.000014com.yelloyello
665180086027800.000035com.netflix
6661800744012910.000022tv.ustream
667180071046200.000045us.icio
6681800681211380.000025edu.utexas
669180059244480.000055com.gitlab
6701800579020930.000015com.targetmarketingmag
6711800430621660.000015com.cargurus
672180042068860.000032com.docker
6731800293211910.000024com.trustedshops
6741800221824790.000013com.analyticsvidhya
6751800143424450.000013com.2findlocal
676179985208450.000033com.foxnews
6771799714620800.000015jp.huffingtonpost
6781799573626370.000013com.instructables
6791799523819450.000016com.nokia
6801799510011970.000024edu.academia
681179926647560.000036com.gettyimages
682179912302450.000099com.wpengine
6831799108422120.000014ca.uwaterloo
6841798868625470.000013com.cmgdigital
685179870908660.000033edu.umich
686179869746930.000040com.symantec
687179866348100.000034net.2mdn
6881798662621290.000015com.mondaq
689179861649520.000030com.ycombinator
6901798579422060.000014com.keepersecurity
691179850963880.000063com.newrelic
6921798474620540.000015com.doctoroz
693179845349080.000031com.uservoice
694179838622070.000115com.naver
6951798268415570.000019com.pastebin
696179804161890.000132com.xing
6971797873618570.000017com.duckduckgo
698179783129560.000030com.thinkwithgoogle
6991797812819070.000017se.haxx
7001797698420070.000016com.thecvf
7011797592622550.000014au.com.truelocal
7021797464035340.000012com.9to5mac
7031797453421300.000015uk.co.yelp
7041797416214440.000020fm.last
705179740868910.000032com.dropboxusercontent
7061797335415540.000019com.sankei
7071797292422050.000014com.tiddlywiki
7081797185823150.000014com.galvanize
7091797124021490.000015es.huffingtonpost
710179711082490.000098com.automattic
711179697289200.000031com.investopedia
7121796799422350.000014com.bizcommunity
7131796745811560.000025org.cambridge
7141796729612200.000023com.freeprivacypolicy
715179672869170.000031org.change
7161796654221450.000015com.winemag
7171796632424440.000014com.maritime-executive
7181796542410520.000027gov.uspto
7191796446425560.000013com.alternion
7201796335818340.000017com.autodesk
7211796312424110.000014com.communitywalk
7221796272618390.000017org.coursera
7231796220212550.000023com.upwork
7241796068223410.000014net.futurecdn
7251795997420890.000015com.kudzu
7261795985823520.000014com.ericsson
7271795832018320.000017com.adespresso
7281795692225270.000013edu.alamo
7291795678412600.000023com.irishtimes
7301795677823420.000014com.filedn
7311795666013530.000021edu.usc
7321795615810410.000027com.wunderground
733179557228640.000033br.com.uol
734179557186970.000039com.gartner
7351795538422540.000014com.gamespot
7361795525420740.000015com.btplc
7371795443220580.000015com.showmelocal
7381795425623860.000014com.massimodutti
7391795377420200.000016edu.virginia
7401795360017310.000018com.ikea
7411795347622600.000014com.insiderpages
7421795341612740.000023com.indiegogo
7431795267620300.000016com.goinswriter
7441794969424400.000014com.bershka
7451794935021840.000015com.almanac
746179491607700.000036gov.census
7471794688012330.000023com.intuit
748179459144130.000060com.inc
7491794453243470.000009com.programmableweb
7501794326811320.000025com.pcmag
7511794266621940.000014com.writersdigest
7521794213422830.000014com.citysquares
7531794164615350.000020com.fiverr
7541794131618720.000017com.csswizardry
7551794129613730.000021com.vanityfair
7561794117219030.000017jp.sankeibiz
7571794079624560.000013com.live5news
7581793923411170.000025gov.usgs
759179389169140.000031com.zoho
7601793828234000.000012com.freep
761179374088300.000034com.blackberry
7621793725621630.000015jp.booklog
7631793684423510.000014com.thedrinksbusiness
7641793542610570.000027com.politico
7651793538821970.000014com.winefolly
766179347686870.000040com.alibaba
7671793436629700.000013com.jeeran
7681793408024940.000013io.stackedit
7691793386219580.000016ca.ubc
770179337641610.000163me.line
7711793353839400.000010org.greenpeace
7721793306223810.000014com.yellowbook
7731793233624590.000013za.co.bdlive
7741793221223960.000014com.asianage
7751793209016310.000019com.udemy
7761793205834150.000012com.glamour
7771793183016350.000019com.chrome
7781793164211940.000024com.techrepublic
779179316148490.000033com.unity3d
7801793159020330.000015mp.j
781179315365980.000047gov.usda
7821793100423650.000014net.islamweb
783179293348080.000034int.wipo
7841792846623550.000014com.wsoctv
785179275926650.000042com.marketo
7861792704810490.000027edu.umn
787179269324140.000060mp.mailchi
788179263549680.000029com.aliexpress
7891792590026080.000013org.torproject
7901792536023220.000014com.utah
791179252289540.000030com.sciencedaily
7921792450219320.000016org.ap
793179240987240.000038gov.house
7941792397622080.000014com.chamberofcommerce
7951792359419530.000016com.urbandictionary
7961792355825700.000013com.spoke
7971792279428070.000013com.salespider
7981792117623690.000014com.ibmbigdatahub
799179211269810.000029au.net.abc
8001792107415650.000019com.problogger
801179210085330.000048com.snapchat
8021792092214250.000021fr.lemonde
803179193461410.000185jp.co.google
8041791759255390.000006cc.co
8051791702020170.000016com.posterous
8061791688211290.000025com.canva
8071791610416380.000019com.britannica
8081791578223790.000014com.wpxi
8091791546820400.000015edu.cuny
8101791545025190.000013com.americantowns
811179143646750.000041gov.hhs
8121791399622870.000014org.themoth
8131791332814200.000021com.rollingstone
8141791302212450.000023com.xkcd
8151791298435820.000011edu.brown
816179128806340.000044com.feedly
8171791245824470.000013com.hdnux
8181791239626090.000013com.zionsbank
8191791231625150.000013com.pacegallery
8201791143426450.000013com.tupalo
821179111366400.000044au.com.google
822179110608430.000033com.uk
823179090561280.000215com.youtube-nocookie
8241790886213220.000022com.vmware
8251790856820940.000015org.semanticscholar
8261790834219700.000016com.sanspo
8271790824810130.000028com.java
8281790817822390.000014it.scoop
829179078424700.000053com.adweek
8301790755023010.000014uk.co.dennis
8311790747421080.000015jp.co.sankei
8321790695025760.000013za.co.sowetanlive
833179067485040.000049gov.copyright
834179062503620.000068com.wufoo
8351790556223100.000014edu.uci
8361790490820520.000015jp.ne.iza
8371790481025430.000013org.foodrevolution
8381790445624910.000013com.thewritepractice
8391790445421470.000015com.parksassociates
8401790441011110.000025fr.blogspot
8411790403426360.000013au.com.whitepages
8421790387813290.000022com.billboard
8431790336012720.000023com.prezi
8441790221621720.000015com.local
845179014143200.000076gov.ftc
8461789997018310.000017edu.illinois
847178995869750.000029com.indeed
848178990668110.000034org.unesco
8491789898618520.000017com.hatenablog
8501789818424970.000013dk.brics
8511789800621180.000015uk.ac.ed
8521789761811730.000024org.unicef
853178972324250.000058com.criteo
8541789639821510.000015org.linuxfoundation
8551789606822150.000014com.vendio
8561789571819810.000016uk.ac.ucl
857178949962790.000089com.marriott
8581789419667530.000005com.blog
8591789376211870.000024com.steamcommunity
860178935828340.000034com.gofundme
8611789335420220.000016net.privacypolicytemplate
8621789303840670.000009com.virustotal
863178920184670.000053com.iconfinder
8641789161625410.000013com.lacartes
8651789127422690.000014ai.fast
8661789123217960.000017com.howstuffworks
8671788898412220.000023com.dell
8681788886624730.000013com.ibegin
8691788816614700.000020com.over-blog
870178874043500.000069net.themeforest
871178873763020.000080com.netdna-ssl
8721788727235760.000011edu.tufts
8731788657224870.000013za.co.moneyweb
8741788613616890.000018com.twilio
875178860428870.000032com.hootsuite
8761788457612300.000023com.gallup
8771788433223940.000014com.machinelearningmastery
8781788290623890.000014io.dropwizard
879178825129920.000028com.att
8801788141623080.000014com.ehow
8811788078836600.000011com.discogs
8821788072423330.000014com.blogs
8831788068821390.000015com.dandb
884178796964860.000051com.squareup
8851787953210370.000027gov.bls
886178794784010.000061com.bitly
8871787854036650.000011com.twitpic
8881787835830640.000013com.invoicesherpa
889178775786500.000043com.herokuapp
8901787710222650.000014ru.narod
8911787657018750.000017com.tunein
8921787525215700.000019com.com
8931787465019800.000016jp.co.zakzak
894178740526130.000046com.airbnb
8951787348821230.000015uk.co.realbusiness
896178722088370.000033gov.justice
8971787183819510.000016co.gcdn
898178716182670.000091com.myshopify
8991787084435040.000012de.bild
900178704782340.000104jp.co.amazon
9011786997619050.000017org.filezilla-project
9021786972225740.000013com.growtix
9031786971019220.000016com.newsfactor
9041786862627750.000013org.earthmagazine
9051786824035950.000011cc.tiny
906178681023390.000072org.opensource
9071786762817100.000018org.owasp
9081786743416780.000018org.cancer
909178652203700.000067org.doi
9101786475212150.000024ly.ow
9111786445828200.000013co.iglobal
9121786355613300.000022edu.uchicago
913178632621330.000206de.bund
914178628542590.000094com.getbootstrap
915178621864990.000050com.nasdaq
9161786182410000.000028com.lifehacker
9171786174812710.000023org.pnas
918178616443950.000062io.atom
9191786132414580.000020in.blogspot
9201786030626440.000013ai.becominghuman
9211786024826800.000013com.googlemaps
9221785853020090.000016net.nend
9231785686847300.000008com.colourlovers
9241785679814130.000021com.splashthat
925178566769820.000029com.jetbrains
926178561769150.000031jp.livedoor
927178561523030.000080com.ssl-images-amazon
9281785422426430.000013nl.zeelandnet
929178536208690.000032com.pingdom
9301785353430780.000013com.sophos
9311785290825250.000013gr.huffingtonpost
9321785214210020.000028de.blogspot
9331785068227600.000013com.fox13memphis
9341785048821140.000015com.richmediagallery
9351785037818210.000017com.hotmail
93617850366720.000395com.messenger
9371785027622310.000014edu.asu
938178500529950.000028org.iso
9391784997613890.000021com.imimg
9401784937211450.000025com.uber
9411784912023560.000014com.tuck
9421784855617260.000018com.nba
9431784823224040.000014jp.news24
9441784773215590.000019com.ogilvy
9451784731827720.000013com.addustour
9461784680028310.000013org.grayarea
9471784671421030.000015com.homestars
9481784665011360.000025com.seattletimes
949178465802650.000092ru.rambler
9501784598823620.000014edu.utah
9511784586838620.000010com.starwars
952178456404790.000051jp.ne.sakura
9531784471810630.000027gov.congress
9541784310214100.000021dk.datatilsynet
955178429328590.000033com.stitcher
9561784279829710.000013com.oilandgas360
9571784248617850.000017edu.umd
958178424307580.000036com.yandex
9591784010018850.000017com.wetransfer
9601783962824570.000013ms.1drv
961178382129770.000029com.prweb
962178380864230.000058com.smugmug
9631783770224140.000014com.delta
9641783635623060.000014edu.bu
9651783615611410.000025com.500px
9661783466827960.000013org.cmlibrary
9671783424825650.000013com.fixr
9681783376413120.000022com.firefox
9691783336820500.000015edu.ufl
9701783161024090.000014ca.ualberta
9711783138639770.000010com.thingiverse
972178308884000.000061com.discordapp
9731783071438170.000010edu.unl
9741782974827460.000013tw.com.ibon
9751782913424900.000013au.com.hotfrog
9761782896623500.000014de.mpg
9771782892811600.000025com.timeanddate
9781782858024950.000013com.figure-eight
9791782857423700.000014com.codecademy
980178279648900.000032gov.usa
981178275182560.000096it.google
9821782706424270.000014com.outboundengine
9831782687413080.000022com.strikingly
9841782684012430.000023com.target
9851782575824550.000013com.theblogpress
9861782530025850.000013com.expressbusinessdirectory
9871782528822160.000014com.nfl
9881782519226070.000013com.elocal
9891782512026280.000013au.com.news
9901782431411160.000025com.scientificamerican
9911782416813250.000022co.vine
992178237107470.000037com.cargocollective
993178235306910.000040com.caniuse
9941782193021070.000015com.angelfire
9951782078830050.000013com.hbo
9961782064816390.000019uk.co.screamingfrog
9971782030426170.000013com.ovoenergy
998178200107370.000037uk.co.eventbrite
9991781972626910.000013com.normacomics
1000178195767520.000037com.sagepub

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

July 2019 crawl archive now available

The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th.

The July crawl contains page captures of 810 million URLs not contained in any crawl archive before. New URLs are drawn from the following sources and sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:

  • a random sample of 2.0 billion outlinks taken from June crawl WAT files
  • 1.8 billion URLs mined in a breadth-first side crawl within a maximum of 6 links (“hops”), started from
    • the homepages of the top 60 million hosts and domains
    • a random sample of 2 million human-readable sitemap pages (HTML format)
    • a random sample of 2 million URLs of pages written in 130 less-represented languages (cf. language distributions)
  • 900 million URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds
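
The hop limit of the breadth-first side crawl can be pictured with a minimal sketch. The toy link graph, the function name `bounded_bfs`, and the in-memory dictionary are illustrative assumptions; the sketch does not describe how the actual crawler is implemented.

```python
from collections import deque

def bounded_bfs(links, seeds, max_hops):
    """Return {url: hops} for every URL reachable from `seeds` within `max_hops` links.

    `links` maps a URL to the URLs it links to (a hypothetical in-memory
    stand-in for fetching a page and extracting its outlinks).
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:  # in BFS the first visit is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Made-up link graph: a.example/3 is three hops away from the seed.
toy_links = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/2"],
    "a.example/2": ["a.example/3"],
}
print(bounded_bfs(toy_links, ["a.example/"], max_hops=2))
```

With `max_hops=2` the page three links away from the seed is never queued, which is the effect a hop limit has on the set of discovered URLs.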

Archive Location and Download

The July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-30/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/, you get the S3 and HTTP paths, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-30/segment.paths.gz          100
WARC files               CC-MAIN-2019-30/warc.paths.gz             56000   46.10
WAT files                CC-MAIN-2019-30/wat.paths.gz              56000   17.62
WET files                CC-MAIN-2019-30/wet.paths.gz              56000   7.80
Robots.txt files         CC-MAIN-2019-30/robotstxt.paths.gz        56000   0.14
Non-200 responses files  CC-MAIN-2019-30/non200responses.paths.gz  56000   1.63
URL index files          CC-MAIN-2019-30/cc-index.paths.gz         302     0.19
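
As a minimal sketch of turning one of the path listings above into full locations: the function name `full_paths` is hypothetical, and the sample listing is built in memory with a made-up file name rather than downloaded.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def full_paths(listing_gz_bytes, prefix):
    """Decompress a *.paths.gz listing and prefix every non-empty line."""
    with gzip.open(io.BytesIO(listing_gz_bytes), "rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]

# Stand-in for a downloaded listing; the path below is made up.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2019-30/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"
)
print(full_paths(sample, HTTP_PREFIX))
print(full_paths(sample, S3_PREFIX))
```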

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-30/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

June 2019 crawl archive now available

The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from the 21st to the 24th.

The June crawl contains page captures of 880 million URLs not contained in any crawl archive before. New URLs are drawn from the following sources and sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
  • a random sample of 2.0 billion outlinks taken from May crawl WAT files

Starting with this crawl, the WAT extraction has been improved to properly decode HTML character entities in URLs and strings. For details, please see the issue report “WAT: unescape XML/HTML character entities”.
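
To illustrate what decoding character entities means for a link, here is a small sketch using Python's standard `html.unescape`. It is not the WAT extractor's own code, and the helper name `unescape_link` and the sample URLs are made up.

```python
import html

def unescape_link(raw_value):
    """Decode HTML character entities in an attribute value copied from markup."""
    return html.unescape(raw_value)

# In HTML source, "&" inside an attribute is usually written as "&amp;".
print(unescape_link("https://example.com/search?q=crawl&amp;page=2"))
print(unescape_link("Caf&eacute; &#38; Bar"))
```

Without this step, a link copied verbatim from the markup would keep the literal `&amp;` and point to a different URL than the one a browser would request.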

Archive Location and Download

The June crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-26/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/, you get the S3 and HTTP paths, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-26/segment.paths.gz          100
WARC files               CC-MAIN-2019-26/warc.paths.gz             56000   49.42
WAT files                CC-MAIN-2019-26/wat.paths.gz              56000   17.24
WET files                CC-MAIN-2019-26/wet.paths.gz              56000   7.59
Robots.txt files         CC-MAIN-2019-26/robotstxt.paths.gz        56000   0.14
Non-200 responses files  CC-MAIN-2019-26/non200responses.paths.gz  56000   1.52
URL index files          CC-MAIN-2019-26/cc-index.paths.gz         302     0.19

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-26/. The columnar index has also been updated to include this crawl.
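
A lookup against the URL index could be prepared as sketched below. The `-index` endpoint suffix and the query parameters `url`, `output` and `limit` are assumptions about the index server's query interface and are not stated in this post; the helper only builds the URL and sends no request.

```python
from urllib.parse import urlencode

def cdx_query_url(crawl_id, url_pattern, limit=None):
    """Build a query URL for the URL index of one crawl (nothing is fetched)."""
    # Parameter names are assumed, see the note above the code.
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"https://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"

print(cdx_query_url("CC-MAIN-2019-26", "commoncrawl.org/*", limit=5))
```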

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

May 2019 crawl archive now available

The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th.

The May crawl contains page captures of 825 million URLs not contained in any crawl archive before. New URLs are drawn from the following sources and sampled based on the host and domain ranks (harmonic centrality) published as part of the Feb/Mar/Apr 2019 webgraph data set:

  • sitemaps, RSS and Atom feeds
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million human-readable sitemap pages (HTML format)
  • a random sample of 1.6 billion outlinks taken from WAT files of the April crawl

Archive Location and Download

The May crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-22/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/, you get the S3 and HTTP paths, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-22/segment.paths.gz          100
WARC files               CC-MAIN-2019-22/warc.paths.gz             56000   51.48
WAT files                CC-MAIN-2019-22/wat.paths.gz              56000   17.75
WET files                CC-MAIN-2019-22/wet.paths.gz              56000   7.7
Robots.txt files         CC-MAIN-2019-22/robotstxt.paths.gz        56000   0.17
Non-200 responses files  CC-MAIN-2019-22/non200responses.paths.gz  56000   1.84
URL index files          CC-MAIN-2019-22/cc-index.paths.gz         302     0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-22/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The software which builds the graph from WAT and WARC files has been extended to extract more links from the HTML <head> element:

  • more links are taken from <meta> elements, e.g., the thumbnail meta name, Open Graph or twitter:* properties
  • links from <script> elements are now included
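
The two kinds of links listed above can be illustrated with a short sketch based on Python's standard `html.parser`. This is not the extraction code used to build the graphs; the class name `HeadLinkCollector` and the set `META_LINK_KEYS` are hypothetical and cover only a few example properties.

```python
from html.parser import HTMLParser

# Hypothetical selection of <meta> names/properties whose content is a URL.
META_LINK_KEYS = {"thumbnail", "og:image", "og:url", "twitter:image"}

class HeadLinkCollector(HTMLParser):
    """Collect link targets from <meta> and <script> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = attributes.get("property") or attributes.get("name") or ""
            if key.lower() in META_LINK_KEYS and attributes.get("content"):
                self.links.append(attributes["content"])
        elif tag == "script" and attributes.get("src"):
            self.links.append(attributes["src"])

page = """<html><head>
<meta property="og:image" content="https://example.com/cover.png">
<meta name="description" content="not a link">
<script src="/static/app.js"></script>
</head><body></body></html>"""

collector = HeadLinkCollector()
collector.feed(page)
print(collector.links)
```

The relative script path is kept as found; resolving it against the page URL would be a separate step.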

Note that previous web graph releases already include all kinds of links: not only <a href="..."> but also links to images and multi-media content, links from <form> elements, canonical links, and many more.

While the domain-level graph shows almost the same size and metrics as the previous one released three months ago, the host-level graph has increased in size by 85 million nodes but is less densely connected. The growth in the number of nodes is mainly caused by a link spam cluster of 190 million hosts distributed over 15k domains. Thanks to the webgraph, these domains (e.g., 24340.tw) are detected, and the crawler is advised not to visit them again.

Host-level graph

The graph consists of 492 million nodes and 3.0 billion edges and includes dangling nodes, i.e., hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 426 million dangling nodes (87%), and the largest strongly connected component contains 52 million (10.5%) nodes.

You can download the graph and the ranks of all 492 million hosts from AWS S3 at s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/Apr 2019 host-level webgraph

Size      File                                             Description
3.36 GB   cc-main-2019-feb-mar-apr-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 28 vertices files
14.40 GB  cc-main-2019-feb-mar-apr-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 56 edges files
6.33 GB   cc-main-2019-feb-mar-apr-host.graph              graph in BVGraph format
2 kB      cc-main-2019-feb-mar-apr-host.properties
7.02 GB   cc-main-2019-feb-mar-apr-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2019-feb-mar-apr-host-t.properties
1 kB      cc-main-2019-feb-mar-apr-host.stats              WebGraph statistics
7.85 GB   cc-main-2019-feb-mar-apr-host-ranks.txt.gz       harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
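The reversal described above can be sketched in a few lines. The function names are hypothetical, and lowercasing plus removal of a trailing dot are assumptions of this example rather than documented behavior.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the host name labels."""
    host = host.lower().rstrip(".")  # normalization assumed for this sketch
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed host name back into the usual notation."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```

Note that the stripped www. prefix cannot be recovered when reversing back.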

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
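The aggregation step can be illustrated on reversed host names with a tiny stand-in for the public suffix list. This is a simplified sketch, not the actual pipeline: TOY_SUFFIXES is a hypothetical excerpt, wildcard and exception rules of the real list are ignored, and keeping links inside the same domain is a choice made here for simplicity.

```python
# Hypothetical excerpt of the public suffix list (normal, non-reversed notation).
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(rev_host, suffixes=TOY_SUFFIXES):
    """Map a reversed host name to its reversed pay-level domain.

    The PLD is the longest matching public suffix plus one more label.
    Returns None if no suffix matches or the host is itself a suffix.
    """
    labels = rev_host.split(".")
    best = 0
    for n in range(1, len(labels)):
        suffix = ".".join(reversed(labels[:n]))
        if suffix in suffixes:
            best = n
    if best == 0:
        return None
    return ".".join(labels[:best + 1])


def aggregate_edges(host_edges, suffixes=TOY_SUFFIXES):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_dom = pay_level_domain(src, suffixes)
        dst_dom = pay_level_domain(dst, suffixes)
        if src_dom and dst_dom:
            domain_edges.add((src_dom, dst_dom))
    return domain_edges


host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.shop", "uk.co.bbc"),
    ("com.example.blog", "com.example.shop"),
]
print(sorted(aggregate_edges(host_edges)))
```

In the toy input, both links to hosts under bbc.co.uk collapse into a single domain-level edge.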

The domain-level graph has 91 million nodes and 1.89 billion edges. 46 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 38 million nodes (42%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/domain/ or, via HTTPS, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-feb-mar-apr/domain/.

Download files of the Common Crawl Feb/Mar/Apr 2019 domain-level webgraph

Size | File | Description
0.63 GB | cc-main-2019-feb-mar-apr-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
7.48 GB | cc-main-2019-feb-mar-apr-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.01 GB | cc-main-2019-feb-mar-apr-domain.graph | graph in BVGraph format
2 kB | cc-main-2019-feb-mar-apr-domain.properties |
4.02 GB | cc-main-2019-feb-mar-apr-domain-t.graph | transpose of the graph
2 kB | cc-main-2019-feb-mar-apr-domain-t.properties |
1 kB | cc-main-2019-feb-mar-apr-domain.stats | WebGraph statistics
1.98 GB | cc-main-2019-feb-mar-apr-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by harmonic centrality or PageRank. The full list of all 91 million domain ranks is available for download.
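For working with the full ranks file, a sketch along the following lines may help. It assumes the file is tab-separated with the same five leading columns as the table below and that header lines start with #; the record type and function names are made up for this example, so please check the actual file before relying on it.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class RankRecord(NamedTuple):
    hc_rank: int      # position by harmonic centrality
    hc_value: float   # harmonic centrality value
    pr_rank: int      # position by PageRank
    pr_value: float   # PageRank value
    rev_name: str     # reversed host or domain name


def parse_ranks(lines: Iterable[str]) -> Iterator[RankRecord]:
    """Parse tab-separated rank lines, skipping blank and '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Only the first five columns are used; extra columns are ignored.
        hc_rank, hc_value, pr_rank, pr_value, rev_name = line.split("\t")[:5]
        yield RankRecord(int(hc_rank), float(hc_value),
                         int(pr_rank), float(pr_value), rev_name)


def read_ranks(path: str) -> Iterator[RankRecord]:
    """Stream records from a gzip-compressed ranks file on local disk."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        yield from parse_ranks(lines)
```

Because read_ranks is a generator, the first records of a downloaded file can be inspected with itertools.islice without loading the whole file into memory.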

Top 1000 domains ranked by harmonic centrality (Feb/Mar/Apr 2019)

harmonic centrality rank | harmonic centrality value | PageRank rank | PageRank value | reversed hostname
1 | 29096444 | 1 | 0.020470 | com.googleapis
2 | 28007982 | 3 | 0.012308 | com.facebook
3 | 26801494 | 2 | 0.013202 | com.google
4 | 24862042 | 4 | 0.006929 | com.twitter
5 | 24696428 | 5 | 0.006794 | com.youtube
6 | 24174048 | 6 | 0.006211 | org.w
7 | 22325176 | 9 | 0.003651 | com.instagram
8 | 22277546 | 7 | 0.004565 | org.gmpg
9 | 21663616 | 13 | 0.002903 | com.linkedin
10 | 21253740 | 8 | 0.003880 | com.googletagmanager
11 | 20994008 | 22 | 0.001629 | com.gravatar
12 | 20770844 | 11 | 0.003144 | com.cloudflare
13 | 20763426 | 12 | 0.002915 | org.wordpress
14 | 20723350 | 15 | 0.002103 | com.wordpress
15 | 20597380 | 19 | 0.001856 | com.pinterest
16 | 20589106 | 26 | 0.001344 | org.wikipedia
17 | 20538196 | 14 | 0.002455 | com.bootstrapcdn
18 | 20340120 | 18 | 0.001857 | com.apple
19 | 20168332 | 28 | 0.001208 | com.vimeo
20 | 20100244 | 40 | 0.000945 | com.blogspot
21 | 20076842 | 21 | 0.001721 | com.jquery
22 | 19900496 | 44 | 0.000861 | gl.goo
23 | 19874514 | 49 | 0.000756 | be.youtu
24 | 19845858 | 24 | 0.001528 | com.adobe
25 | 19808478 | 27 | 0.001240 | com.microsoft
26 | 19798758 | 50 | 0.000749 | com.amazon
27 | 19710384 | 60 | 0.000607 | com.tumblr
28 | 19689216 | 53 | 0.000667 | com.wp
29 | 19584236 | 34 | 0.001016 | com.amazonaws
30 | 19567898 | 25 | 0.001456 | com.macromedia
31 | 19563206 | 88 | 0.000433 | com.yahoo
32 | 19557094 | 51 | 0.000734 | com.flickr
33 | 19526652 | 42 | 0.000880 | com.google-analytics
34 | 19499686 | 80 | 0.000502 | ly.bit
35 | 19489648 | 32 | 0.001034 | com.googlesyndication
36 | 19472580 | 62 | 0.000580 | org.mozilla
37 | 19466578 | 23 | 0.001557 | com.gstatic
38 | 19459466 | 31 | 0.001093 | net.cloudfront
39 | 19428390 | 20 | 0.001823 | com.github
40 | 19302390 | 66 | 0.000558 | me.wp
41 | 19278286 | 39 | 0.000949 | net.doubleclick
42 | 19253848 | 46 | 0.000802 | com.paypal
43 | 19222312 | 99 | 0.000316 | com.googleusercontent
44 | 19214440 | 82 | 0.000487 | com.medium
45 | 19194374 | 41 | 0.000882 | com.squarespace
46 | 19181944 | 85 | 0.000440 | com.weebly
47 | 19164390 | 79 | 0.000520 | org.w3
48 | 19162880 | 127 | 0.000234 | com.nytimes
49 | 19140860 | 86 | 0.000440 | io.github
50 | 19138696 | 102 | 0.000307 | com.reddit
51 | 19125448 | 92 | 0.000375 | org.creativecommons
52 | 19052088 | 154 | 0.000166 | net.slideshare
53 | 19050126 | 162 | 0.000162 | com.theguardian
54 | 19047964 | 139 | 0.000189 | com.imgur
55 | 19010700 | 57 | 0.000626 | com.bing
56 | 19007804 | 136 | 0.000202 | com.forbes
57 | 18975024 | 166 | 0.000158 | net.sourceforge
58 | 18969344 | 217 | 0.000110 | com.businessinsider
59 | 18964518 | 64 | 0.000566 | org.schema
60 | 18930562 | 202 | 0.000115 | com.myspace
61 | 18929412 | 161 | 0.000162 | com.blogger
62 | 18929098 | 206 | 0.000113 | com.techcrunch
63 | 18929096 | 188 | 0.000132 | com.android
64 | 18907716 | 101 | 0.000313 | com.mailchimp
65 | 18887038 | 251 | 0.000097 | com.tinyurl
66 | 18886912 | 54 | 0.000649 | com.baidu
67 | 18881598 | 249 | 0.000098 | com.wired
68 | 18879314 | 91 | 0.000411 | de.google
69 | 18872146 | 354 | 0.000068 | com.photobucket
70 | 18870082 | 182 | 0.000140 | com.stackoverflow
71 | 18844718 | 100 | 0.000316 | org.ampproject
72 | 18842202 | 38 | 0.000953 | org.apache
73 | 18829206 | 266 | 0.000090 | com.bbc
74 | 18824202 | 103 | 0.000307 | com.shopify
75 | 18821492 | 361 | 0.000068 | com.quora
76 | 18818304 | 315 | 0.000076 | com.appspot
77 | 18801506 | 37 | 0.000974 | com.fontawesome
78 | 18797872 | 113 | 0.000275 | com.ytimg
79 | 18796778 | 36 | 0.000976 | com.addthis
80 | 18776040 | 209 | 0.000112 | com.oracle
81 | 18766484 | 558 | 0.000045 | org.chromium
82 | 18761466 | 353 | 0.000069 | com.googleblog
83 | 18753718 | 380 | 0.000064 | com.theverge
84 | 18728722 | 526 | 0.000047 | org.ieee
85 | 18726166 | 510 | 0.000048 | edu.washington
86 | 18724898 | 462 | 0.000053 | com.economist
87 | 18724374 | 96 | 0.000330 | com.statcounter
88 | 18720414 | 98 | 0.000317 | com.soundcloud
89 | 18717340 | 151 | 0.000171 | org.ietf
90 | 18714060 | 553 | 0.000046 | edu.yale
91 | 18706362 | 319 | 0.000075 | com.githubusercontent
92 | 18703214 | 300 | 0.000078 | com.ted
93 | 18695606 | 61 | 0.000589 | eu.europa
94 | 18694532 | 441 | 0.000056 | com.venturebeat
95 | 18691888 | 235 | 0.000103 | com.hubspot
96 | 18688542 | 655 | 0.000043 | com.tinypic
97 | 18680252 | 144 | 0.000180 | com.spotify
98 | 18673738 | 141 | 0.000185 | com.yelp
99 | 18671830 | 133 | 0.000213 | com.issuu
100 | 18662516 | 395 | 0.000063 | com.cisco
101 | 18657986 | 93 | 0.000354 | co.t
102 | 18652360 | 95 | 0.000340 | com.sharethis
103 | 18648440 | 438 | 0.000056 | com.deviantart
104 | 18644944 | 702 | 0.000040 | edu.princeton
105 | 18644036 | 265 | 0.000090 | com.sciencedirect
106 | 18635732 | 360 | 0.000068 | me.about
107 | 18631710 | 460 | 0.000053 | org.arxiv
108 | 18627650 | 279 | 0.000086 | org.npr
109 | 18615844 | 179 | 0.000141 | org.wikimedia
110 | 18608530 | 751 | 0.000038 | google.blog
111 | 18606402 | 341 | 0.000071 | com.theatlantic
112 | 18596892 | 345 | 0.000071 | com.mozilla
113 | 18592502 | 572 | 0.000044 | edu.ucla
114 | 18587904 | 454 | 0.000054 | com.mysql
115 | 18584416 | 134 | 0.000211 | com.dropbox
116 | 18581890 | 963 | 0.000033 | com.jetbrains
117 | 18579386 | 121 | 0.000250 | com.whatsapp
118 | 18576676 | 294 | 0.000081 | com.example
119 | 18575790 | 81 | 0.000499 | net.jsdelivr
120 | 18574812 | 271 | 0.000089 | com.fastcompany
121 | 18567024 | 331 | 0.000072 | com.typeform
122 | 18560472 | 400 | 0.000062 | com.zdnet
123 | 18556406 | 468 | 0.000052 | com.wikihow
124 | 18554544 | 30 | 0.001112 | ru.yandex
125 | 18552784 | 540 | 0.000046 | com.thenextweb
126 | 18551678 | 502 | 0.000049 | com.git-scm
127 | 18550824 | 1063 | 0.000030 | com.chrome
128 | 18546650 | 169 | 0.000156 | com.salesforce
129 | 18542982 | 375 | 0.000065 | uk.co.blogspot
130 | 18537770 | 443 | 0.000055 | com.about
131 | 18535658 | 117 | 0.000263 | org.networkadvertising
132 | 18535608 | 488 | 0.000050 | com.pixabay
133 | 18526084 | 212 | 0.000112 | com.dribbble
134 | 18525790 | 201 | 0.000116 | com.stumbleupon
135 | 18524658 | 1434 | 0.000021 | com.diigo
136 | 18509334 | 499 | 0.000049 | com.ubuntu
137 | 18501930 | 741 | 0.000038 | org.eclipse
138 | 18497596 | 505 | 0.000049 | com.slate
139 | 18497252 | 208 | 0.000112 | com.googlecode
140 | 18490168 | 58 | 0.000611 | com.wix
141 | 18486604 | 425 | 0.000058 | com.moz
142 | 18481186 | 191 | 0.000127 | com.cnn
143 | 18475472 | 122 | 0.000242 | com.stripe
144 | 18475424 | 181 | 0.000141 | uk.co.bbc
145 | 18464340 | 652 | 0.000043 | com.stackexchange
146 | 18462864 | 369 | 0.000066 | com.entrepreneur
147 | 18458960 | 275 | 0.000087 | com.nbcnews
148 | 18453376 | 253 | 0.000095 | gov.ca
149 | 18444504 | 595 | 0.000044 | com.withgoogle
150 | 18440430 | 518 | 0.000048 | com.qz
151 | 18439414 | 542 | 0.000046 | com.trello
152 | 18429892 | 214 | 0.000111 | edu.stanford
153 | 18426238 | 1071 | 0.000030 | edu.illinois
154 | 18424824 | 1040 | 0.000030 | edu.gatech
155 | 18424290 | 293 | 0.000081 | com.foursquare
156 | 18421718 | 1509 | 0.000020 | org.wikibooks
157 | 18418178 | 573 | 0.000044 | com.searchengineland
158 | 18417420 | 516 | 0.000048 | com.unity3d
159 | 18414768 | 670 | 0.000042 | org.sciencemag
160 | 18412040 | 267 | 0.000090 | com.npmjs
161 | 18401668 | 463 | 0.000053 | gov.loc
162 | 18397856 | 926 | 0.000034 | com.sap
163 | 18397328 | 16 | 0.002068 | com.wixstatic
164 | 18396224 | 1097 | 0.000029 | edu.rutgers
165 | 18394270 | 156 | 0.000165 | org.bbb
166 | 18392954 | 213 | 0.000111 | es.google
167 | 18392682 | 661 | 0.000042 | com.variety
168 | 18391296 | 155 | 0.000166 | com.twimg
169 | 18381936 | 538 | 0.000046 | com.libsyn
170 | 18380340 | 546 | 0.000046 | com.evernote
171 | 18380064 | 174 | 0.000152 | com.imdb
172 | 18378896 | 211 | 0.000112 | com.wsj
173 | 18377090 | 33 | 0.001018 | net.fbcdn
174 | 18373124 | 142 | 0.000185 | gov.privacyshield
175 | 18369136 | 705 | 0.000040 | com.techtarget
176 | 18368040 | 45 | 0.000851 | com.fb
177 | 18365978 | 1190 | 0.000026 | edu.utah
178 | 18365960 | 146 | 0.000180 | org.archive
179 | 18365486 | 377 | 0.000065 | com.getpocket
180 | 18358592 | 439 | 0.000056 | gov.fda
181 | 18357482 | 194 | 0.000125 | com.optimizely
182 | 18350542 | 419 | 0.000060 | au.com.google
183 | 18346246 | 904 | 0.000034 | com.econsultancy
184 | 18346130 | 210 | 0.000112 | net.windows
185 | 18345860 | 1315 | 0.000024 | com.douban
186 | 18345148 | 449 | 0.000055 | org.freecodecamp
187 | 18338274 | 1321 | 0.000023 | com.discogs
188 | 18338174 | 613 | 0.000043 | uk.ac.ox
189 | 18335780 | 1019 | 0.000031 | com.nike
190 | 18332678 | 1207 | 0.000026 | org.tensorflow
191 | 18325660 | 75 | 0.000531 | com.vk
192 | 18324804 | 287 | 0.000082 | edu.mit
193 | 18323908 | 953 | 0.000033 | com.buffer
194 | 18323654 | 1203 | 0.000026 | com.aljazeera
195 | 18321184 | 1133 | 0.000028 | ca.utoronto
196 | 18317982 | 813 | 0.000036 | com.netlify
197 | 18315894 | 1079 | 0.000030 | com.nvidia
198 | 18314254 | 531 | 0.000047 | net.azurewebsites
199 | 18311612 | 350 | 0.000069 | com.msn
200 | 18310748 | 984 | 0.000032 | org.kernel
201 | 18307064 | 1123 | 0.000028 | it.scoop
202 | 18305946 | 94 | 0.000346 | com.paypalobjects
203 | 18299178 | 723 | 0.000039 | com.indeed
204 | 18298870 | 807 | 0.000036 | com.mixcloud
205 | 18296584 | 236 | 0.000103 | com.live
206 | 18291116 | 1121 | 0.000028 | org.postgresql
207 | 18289064 | 810 | 0.000036 | com.neilpatel
208 | 18280540 | 365 | 0.000067 | com.discordapp
209 | 18269300 | 1217 | 0.000026 | ms.1drv
210 | 18269146 | 942 | 0.000034 | com.business2community
211 | 18267622 | 303 | 0.000078 | com.reuters
212 | 18266166 | 389 | 0.000064 | gov.nasa
213 | 18263344 | 1452 | 0.000021 | com.makeuseof
214 | 18261880 | 153 | 0.000168 | gov.nih
215 | 18261660 | 444 | 0.000055 | com.udacity
216 | 18257262 | 1289 | 0.000024 | com.hostgator
217 | 18253994 | 996 | 0.000032 | com.chron
218 | 18252416 | 264 | 0.000091 | com.ibm
219 | 18243366 | 1024 | 0.000031 | com.socialmediaexaminer
220 | 18240594 | 1126 | 0.000028 | com.trendmicro
221 | 18238016 | 203 | 0.000114 | com.washingtonpost
222 | 18236948 | 1117 | 0.000028 | com.computerworld
223 | 18235154 | 568 | 0.000045 | com.images-amazon
224 | 18232544 | 180 | 0.000141 | com.etsy
225 | 18231292 | 1340 | 0.000023 | io.itch
226 | 18226226 | 773 | 0.000037 | co.g
227 | 18218104 | 1206 | 0.000026 | edu.osu
228 | 18215060 | 968 | 0.000033 | com.yoast
229 | 18211670 | 1277 | 0.000024 | com.hbo
230 | 18210326 | 190 | 0.000130 | com.ebay
231 | 18208488 | 304 | 0.000077 | com.cnet
232 | 18207542 | 291 | 0.000082 | edu.harvard
233 | 18207312 | 1766 | 0.000017 | com.pearltrees
234 | 18206394 | 956 | 0.000033 | com.mediafire
235 | 18206036 | 715 | 0.000039 | site.business
236 | 18201166 | 56 | 0.000629 | net.akamaihd
237 | 18200054 | 1055 | 0.000030 | com.healthline
238 | 18198490 | 483 | 0.000051 | com.usnews
239 | 18196662 | 196 | 0.000120 | com.huffingtonpost
240 | 18192554 | 1144 | 0.000027 | com.bustle
241 | 18190242 | 1001 | 0.000032 | com.me
242 | 18188000 | 557 | 0.000045 | org.d3js
243 | 18186074 | 165 | 0.000159 | com.eventbrite
244 | 18185850 | 87 | 0.000437 | com.list-manage
245 | 18185730 | 1737 | 0.000018 | com.panoramio
246 | 18184450 | 384 | 0.000064 | com.mashable
247 | 18181712 | 465 | 0.000053 | edu.berkeley
248 | 18181104 | 803 | 0.000036 | co.ibb
249 | 18180796 | 277 | 0.000087 | com.bloomberg
250 | 18177922 | 760 | 0.000037 | com.adjust
251 | 18177008 | 822 | 0.000036 | com.ecwid
252 | 18174004 | 349 | 0.000070 | com.mapbox
253 | 18172194 | 891 | 0.000034 | gov.wa
254 | 18172152 | 1028 | 0.000031 | org.aarp
255 | 18170224 | 1158 | 0.000027 | edu.brookings
256 | 18169502 | 192 | 0.000127 | org.iana
257 | 18165772 | 1250 | 0.000025 | com.dw
258 | 18165654 | 1245 | 0.000025 | com.medicalnewstoday
259 | 18163584 | 218 | 0.000109 | net.php
260 | 18163574 | 420 | 0.000059 | me.telegram
261 | 18163048 | 296 | 0.000081 | org.acm
262 | 18162354 | 207 | 0.000112 | org.gnu
263 | 18158198 | 1476 | 0.000021 | com.sas
264 | 18152116 | 611 | 0.000043 | me.paypal
265 | 18150248 | 1561 | 0.000020 | com.dezeen
266 | 18150072 | 1088 | 0.000029 | com.cio
267 | 18149100 | 799 | 0.000036 | co.elastic
268 | 18148524 | 1293 | 0.000024 | uk.org.tate
269 | 18147684 | 402 | 0.000062 | com.latimes
270 | 18146412 | 199 | 0.000118 | uk.co.amazon
271 | 18146172 | 455 | 0.000054 | com.bigcommerce
272 | 18145158 | 1565 | 0.000020 | be.blogspot
273 | 18144468 | 1260 | 0.000025 | com.hackernoon
274 | 18143080 | 316 | 0.000076 | uk.co.telegraph
275 | 18141808 | 1428 | 0.000022 | com.googlesource
276 | 18141690 | 1448 | 0.000021 | edu.iastate
277 | 18138884 | 1698 | 0.000018 | org.edublogs
278 | 18137392 | 1559 | 0.000020 | com.mathworks
279 | 18134024 | 1234 | 0.000025 | gov.michigan
280 | 18132964 | 372 | 0.000066 | com.livejournal
281 | 18132828 | 1140 | 0.000028 | com.xrea
282 | 18132092 | 1623 | 0.000019 | li.paper
283 | 18128158 | 47 | 0.000767 | com.qq
284 | 18127880 | 1679 | 0.000018 | com.dummies
285 | 18126586 | 143 | 0.000183 | com.unpkg
286 | 18123856 | 1015 | 0.000031 | com.searchenginejournal
287 | 18122824 | 939 | 0.000034 | com.searchenginewatch
288 | 18120806 | 1802 | 0.000017 | fr.unblog
289 | 18119164 | 286 | 0.000083 | com.go
290 | 18114598 | 658 | 0.000042 | com.livechatinc
291 | 18113078 | 149 | 0.000173 | com.opera
292 | 18112360 | 757 | 0.000037 | au.gov.nsw
293 | 18110326 | 1162 | 0.000027 | va.vatican
294 | 18104616 | 1474 | 0.000021 | jp.ac.u-tokyo
295 | 18104360 | 1016 | 0.000031 | uk.co.pinterest
296 | 18103034 | 374 | 0.000065 | com.elsevier
297 | 18097712 | 1412 | 0.000022 | com.activecampaign
298 | 18097614 | 311 | 0.000076 | com.meetup
299 | 18097106 | 1925 | 0.000016 | com.jigsy
300 | 18094268 | 1084 | 0.000029 | uk.gov.nationalarchives
301 | 18092632 | 1219 | 0.000026 | us.mn.state
302 | 18091718 | 1246 | 0.000025 | com.firebaseapp
303 | 18091416 | 1247 | 0.000025 | com.convinceandconvert
304 | 18090158 | 1367 | 0.000023 | us.fl.state
305 | 18083166 | 1700 | 0.000018 | org.emojipedia
306 | 18079266 | 529 | 0.000047 | com.adage
307 | 18079076 | 1192 | 0.000026 | org.maven
308 | 18077604 | 1341 | 0.000023 | gov.mo
309 | 18073450 | 43 | 0.000878 | net.facebook
310 | 18070040 | 712 | 0.000039 | gov.dot
311 | 18069866 | 164 | 0.000160 | uk.co.google
312 | 18069354 | 89 | 0.000415 | com.godaddy
313 | 18068256 | 172 | 0.000155 | com.zendesk
314 | 18066418 | 222 | 0.000106 | com.typepad
315 | 18066074 | 278 | 0.000087 | com.usatoday
316 | 18062078 | 324 | 0.000074 | com.mapquest
317 | 18057670 | 1423 | 0.000022 | gov.ky
318 | 18052106 | 1687 | 0.000018 | com.manta
319 | 18050286 | 426 | 0.000058 | org.hbr
320 | 18050186 | 490 | 0.000050 | net.researchgate
321 | 18049408 | 327 | 0.000073 | com.getclicky
322 | 18049396 | 1398 | 0.000022 | com.convertkit
323 | 18048286 | 2155 | 0.000014 | it.justpaste
324 | 18048048 | 1348 | 0.000023 | com.creativebloq
325 | 18047592 | 1544 | 0.000020 | org.aclweb
326 | 18045082 | 889 | 0.000034 | com.wordstream
327 | 18043064 | 1539 | 0.000020 | ly.snip
328 | 18043014 | 227 | 0.000105 | com.giphy
329 | 18040924 | 2014 | 0.000015 | me.websta
330 | 18040570 | 356 | 0.000068 | com.sxsw
331 | 18040058 | 840 | 0.000035 | edu.psu
332 | 18037266 | 1365 | 0.000023 | gov.maryland
333 | 18032816 | 1813 | 0.000017 | ca.yelp
334 | 18031110 | 1187 | 0.000026 | com.fastcodesign
335 | 18030678 | 1685 | 0.000018 | io.material
336 | 18030052 | 1494 | 0.000021 | org.amnesty
337 | 18027220 | 269 | 0.000089 | org.python
338 | 18026066 | 522 | 0.000048 | org.mediawiki
339 | 18026002 | 479 | 0.000051 | com.buzzfeed
340 | 18024772 | 1150 | 0.000027 | com.findlaw
341 | 18023106 | 783 | 0.000037 | com.arstechnica
342 | 18022710 | 382 | 0.000064 | com.oreilly
343 | 18022180 | 1667 | 0.000019 | edu.toronto
344 | 18019428 | 1703 | 0.000018 | com.healthgrades
345 | 18018232 | 2078 | 0.000015 | tl.page
346 | 18016586 | 477 | 0.000051 | edu.cornell
347 | 18013106 | 342 | 0.000071 | com.springer
348 | 18008962 | 404 | 0.000062 | it.placehold
349 | 18006368 | 1566 | 0.000020 | com.raywenderlich
350 | 18005638 | 383 | 0.000064 | com.nypost
351 | 18005080 | 937 | 0.000034 | com.contentmarketinginstitute
352 | 18002754 | 325 | 0.000073 | int.who
353 | 18001954 | 513 | 0.000048 | org.nodejs
354 | 17999682 | 1524 | 0.000020 | gov.mt
355 | 17997974 | 1345 | 0.000023 | us.pa.state
356 | 17996652 | 329 | 0.000073 | com.cnbc
357 | 17993466 | 1363 | 0.000023 | gov.oregon
358 | 17990558 | 1066 | 0.000030 | com.bandsintown
359 | 17988684 | 392 | 0.000063 | com.gmail
360 | 17988550 | 1734 | 0.000018 | com.wayfair
361 | 17984480 | 332 | 0.000072 | fr.free
362 | 17980878 | 237 | 0.000102 | org.drupal
363 | 17979326 | 910 | 0.000034 | com.angieslist
364 | 17979146 | 450 | 0.000055 | com.kickstarter
365 | 17979024 | 2154 | 0.000014 | com.brandyourself
366 | 17978800 | 399 | 0.000062 | uk.co.dailymail
367 | 17978276 | 1352 | 0.000023 | com.quicksprout
368 | 17977606 | 260 | 0.000093 | uk.org.ico
369 | 17977242 | 469 | 0.000052 | gov.whitehouse
370 | 17974632 | 1014 | 0.000031 | com.speakerdeck
371 | 17972780 | 240 | 0.000102 | com.rawgit
372 | 17964500 | 659 | 0.000042 | com.intel
373 | 17953938 | 899 | 0.000034 | com.wikia
374 | 17951944 | 83 | 0.000482 | com.googleadservices
375 | 17951496 | 506 | 0.000049 | com.box
376 | 17945642 | 1418 | 0.000022 | com.huffpost
377 | 17944878 | 1086 | 0.000029 | net.leadpages
378 | 17943278 | 508 | 0.000049 | com.cbsnews
379 | 17943212 | 323 | 0.000075 | com.time
380 | 17942606 | 1976 | 0.000015 | com.zynga
381 | 17940634 | 242 | 0.000101 | com.getbootstrap
382 | 17940514 | 811 | 0.000036 | com.superpages
383 | 17940204 | 1583 | 0.000019 | com.impactbnd
384 | 17940144 | 177 | 0.000144 | jp.co.yahoo
385 | 17937886 | 67 | 0.000548 | net.jsfiddle
386 | 17936062 | 1271 | 0.000024 | com.smallbiztrends
387 | 17936026 | 1511 | 0.000020 | org.gnupg
388 | 17934886 | 1461 | 0.000021 | co.leadpages
389 | 17934500 | 314 | 0.000076 | com.staticflickr
390 | 17933798 | 1225 | 0.000025 | com.googlegroups
391 | 17933250 | 1488 | 0.000021 | com.thumbtack
392 | 17932302 | 367 | 0.000066 | com.ft
393 | 17931464 | 2051 | 0.000015 | com.ewtn
394 | 17929878 | 370 | 0.000066 | com.office
395 | 17928686 | 1576 | 0.000019 | com.kaggle
396 | 17926934 | 137 | 0.000200 | com.wixsite
397 | 17922076 | 1996 | 0.000015 | org.spie
398 | 17919510 | 1641 | 0.000019 | com.thecut
399 | 17918918 | 1261 | 0.000025 | com.ebayimg
400 | 17917840 | 1806 | 0.000017 | com.googledrive
401 | 17917748 | 391 | 0.000063 | com.aol
402 | 17917512 | 1573 | 0.000019 | org.jenkins-ci
403 | 17915334 | 434 | 0.000056 | com.fortune
404 | 17911152 | 2275 | 0.000014 | net.organicfacts
405 | 17910680 | 357 | 0.000068 | com.unsplash
406 | 17910196 | 2118 | 0.000015 | it.polito
407 | 17902706 | 1753 | 0.000018 | com.mindbodygreen
408 | 17899942 | 599 | 0.000043 | com.proofpoint
409 | 17899484 | 1161 | 0.000027 | edu.ucsd
410 | 17898182 | 2239 | 0.000014 | net.furaffinity
411 | 17895926 | 766 | 0.000037 | com.engadget
412 | 17895860 | 131 | 0.000218 | com.weibo
413 | 17895616 | 229 | 0.000104 | com.surveymonkey
414 | 17895124 | 1556 | 0.000020 | com.crashlytics
415 | 17891646 | 1522 | 0.000020 | com.toptal
416 | 17891064 | 298 | 0.000079 | com.skype
417 | 17890434 | 1310 | 0.000024 | com.avvo
418 | 17889754 | 2012 | 0.000015 | com.doctoroz
419 | 17889278 | 1351 | 0.000023 | io.fabric
420 | 17888404 | 1652 | 0.000019 | com.thoughtworks
421 | 17888344 | 119 | 0.000253 | com.jimdo
422 | 17884296 | 339 | 0.000071 | com.w3schools
423 | 17882826 | 381 | 0.000064 | org.un
424 | 17882360 | 1912 | 0.000016 | com.mysanantonio
425 | 17880012 | 1426 | 0.000022 | com.carto
426 | 17877832 | 1478 | 0.000021 | com.grammarly
427 | 17876928 | 933 | 0.000034 | com.pexels
428 | 17875774 | 1485 | 0.000021 | org.sqlite
429 | 17875586 | 132 | 0.000214 | com.youtube-nocookie
430 | 17873060 | 986 | 0.000032 | com.gizmodo
431 | 17868426 | 1747 | 0.000018 | gov.arts
432 | 17868138 | 992 | 0.000032 | edu.upenn
433 | 17866058 | 1074 | 0.000030 | org.vim
434 | 17864842 | 1812 | 0.000017 | com.instapaper
435 | 17862354 | 690 | 0.000040 | com.vice
436 | 17861898 | 674 | 0.000041 | gov.nist
437 | 17857890 | 70 | 0.000536 | org.reactjs
438 | 17857428 | 1877 | 0.000016 | gov.la
439 | 17857390 | 1712 | 0.000018 | com.politifact
440 | 17856846 | 826 | 0.000036 | com.blackberry
441 | 17856382 | 1584 | 0.000019 | com.ogilvy
442 | 17856122 | 719 | 0.000039 | com.msdn
443 | 17855302 | 2249 | 0.000014 | edu.utep
444 | 17855124 | 1578 | 0.000019 | com.citysearch
445 | 17854150 | 893 | 0.000034 | edu.umich
446 | 17852538 | 223 | 0.000106 | net.behance
447 | 17850120 | 2156 | 0.000014 | com.dynamics
448 | 17849038 | 390 | 0.000063 | com.booking
449 | 17846000 | 2273 | 0.000014 | com.asmallorange
450 | 17843806 | 1311 | 0.000024 | com.curbed
451 | 17842712 | 478 | 0.000051 | com.herokuapp
452 | 17842298 | 216 | 0.000111 | com.automattic
453 | 17842098 | 1541 | 0.000020 | org.aiga
454 | 17842080 | 923 | 0.000034 | org.worldbank
455 | 17841888 | 147 | 0.000176 | com.aspnetcdn
456 | 17841278 | 1850 | 0.000017 | com.deepmind
457 | 17841020 | 1228 | 0.000025 | com.sprinklr
458 | 17840682 | 1058 | 0.000030 | com.thinkwithgoogle
459 | 17839270 | 2360 | 0.000013 | it.clyp
460 | 17838612 | 1540 | 0.000020 | com.instapage
461 | 17837952 | 272 | 0.000088 | com.digg
462 | 17837540 | 1614 | 0.000019 | com.cmswire
463 | 17836636 | 447 | 0.000055 | com.goodreads
464 | 17835202 | 2033 | 0.000015 | au.com.huffingtonpost
465 | 17834194 | 681 | 0.000041 | com.symantec
466 | 17832794 | 385 | 0.000064 | com.dailymotion
467 | 17832500 | 1606 | 0.000019 | com.vendio
468 | 17832000 | 2204 | 0.000014 | net.openreview
469 | 17831708 | 865 | 0.000035 | net.openid
470 | 17830532 | 2206 | 0.000014 | com.kvue
471 | 17830032 | 171 | 0.000155 | com.feedburner
472 | 17829348 | 1479 | 0.000021 | gov.wi
473 | 17826692 | 2056 | 0.000015 | com.kudzu
474 | 17826526 | 2047 | 0.000015 | com.stamen
475 | 17826378 | 1275 | 0.000024 | com.merriam-webster
476 | 17824996 | 1510 | 0.000020 | com.csoonline
477 | 17819228 | 1901 | 0.000016 | it.binged
478 | 17818128 | 1394 | 0.000022 | com.coschedule
479 | 17817378 | 2187 | 0.000014 | com.writersdigest
480 | 17817048 | 699 | 0.000040 | org.bitbucket
481 | 17815720 | 876 | 0.000035 | edu.columbia
482 | 17815700 | 1897 | 0.000016 | google.ai
483 | 17814606 | 1244 | 0.000025 | com.auth0
484 | 17814108 | 1112 | 0.000029 | edu.utexas
485 | 17813510 | 1103 | 0.000029 | org.weforum
486 | 17811808 | 1757 | 0.000018 | com.merchantcircle
487 | 17811776 | 2732 | 0.000013 | com.bitballoon
488 | 17811618 | 2121 | 0.000015 | edu.dukeupress
489 | 17810846 | 2018 | 0.000015 | com.ingress
490 | 17808694 | 148 | 0.000175 | com.tripadvisor
491 | 17808548 | 1858 | 0.000016 | com.king5
492 | 17808180 | 307 | 0.000077 | com.wiley
493 | 17803782 | 1958 | 0.000016 | com.nngroup
494 | 17803738 | 1457 | 0.000021 | com.vanityfair
495 | 17801028 | 337 | 0.000072 | com.hp
496 | 17797994 | 125 | 0.000236 | jp.co.google
497 | 17797558 | 320 | 0.000075 | com.scribd
498 | 17795684 | 336 | 0.000072 | com.tripod
499 | 17794742 | 701 | 0.000040 | io.codepen
500 | 17794630 | 2162 | 0.000014 | io.prototypr
501 | 17794528 | 446 | 0.000055 | com.aliyuncs
502 | 17794052 | 972 | 0.000033 | uk.co.guardian
503 | 17793662 | 566 | 0.000045 | com.samsung
504 | 17793286 | 451 | 0.000055 | com.slack
505 | 17793034 | 685 | 0.000041 | org.eff
506 | 17791936 | 547 | 0.000046 | com.webs
507 | 17789994 | 474 | 0.000052 | com.atlassian
508 | 17789800 | 198 | 0.000119 | de.amazon
509 | 17789764 | 2815 | 0.000012 | edu.alamo
510 | 17788872 | 1520 | 0.000020 | com.jeffbullas
511 | 17785572 | 1844 | 0.000017 | ca.ubc
512 | 17783666 | 452 | 0.000054 | com.newrelic
513 | 17778998 | 1776 | 0.000017 | com.financialexpress
514 | 17777484 | 1051 | 0.000030 | com.yellowpages
515 | 17777164 | 1647 | 0.000019 | org.owasp
516 | 17776950 | 1209 | 0.000026 | org.whatbrowser
517 | 17772704 | 1422 | 0.000022 | org.tigris
518 | 17771794 | 1723 | 0.000018 | com.thermofisher
519 | 17771104 | 429 | 0.000057 | com.businesswire
520 | 17769374 | 1664 | 0.000019 | org.wikidata
521 | 17769220 | 205 | 0.000113 | com.bandcamp
522 | 17768404 | 195 | 0.000122 | com.constantcontact
523 | 17767070 | 1444 | 0.000021 | com.pcworld
524 | 17766282 | 861 | 0.000035 | com.dropboxusercontent
525 | 17763526 | 1233 | 0.000025 | edu.purdue
526 | 17762444 | 297 | 0.000080 | com.wufoo
527 | 17762030 | 734 | 0.000038 | com.createjs
528 | 17761810 | 396 | 0.000063 | com.force
529 | 17759846 | 565 | 0.000045 | in.co.google
530 | 17759466 | 364 | 0.000067 | org.doi
531 | 17757706 | 2193 | 0.000014 | com.hotfrog
532 | 17757234 | 863 | 0.000035 | com.foxnews
533 | 17756726 | 1402 | 0.000022 | org.letsencrypt
534 | 17755956 | 200 | 0.000117 | org.icann
535 | 17755908 | 418 | 0.000060 | com.inc
536 | 17755824 | 1528 | 0.000020 | com.invisionapp
537 | 17755374 | 2322 | 0.000013 | com.yellowbook
538 | 17755084 | 295 | 0.000081 | gov.cdc
539 | 17752452 | 1135 | 0.000028 | org.altervista
540 | 17749954 | 2167 | 0.000014 | com.khou
541 | 17749580 | 2106 | 0.000015 | com.quickanddirtytips
542 | 17749254 | 1416 | 0.000022 | org.sonatype
543 | 17749176 | 2422 | 0.000013 | es.iac
544 | 17749162 | 170 | 0.000156 | ru.mail
545 | 17748102 | 1281 | 0.000024 | com.storify
546 | 17745564 | 1185 | 0.000026 | us.imageshack
547 | 17745434 | 2359 | 0.000013 | org.hg
548 | 17743856 | 696 | 0.000040 | com.psychologytoday
549 | 17743458 | 1251 | 0.000025 | com.upwork
550 | 17743324 | 1052 | 0.000030 | com.ycombinator
551 | 17742228 | 1646 | 0.000019 | com.kinsta
552 | 17742204 | 1027 | 0.000031 | com.hootsuite
553 | 17741772 | 1204 | 0.000026 | ca.blogspot
554 | 17741450 | 2284 | 0.000014 | com.theminimalists
555 | 17738328 | 1254 | 0.000025 | com.ifttt
556 | 17732728 | 299 | 0.000079 | com.prnewswire
557 | 17732640 | 2086 | 0.000015 | jp.riken
558 | 17730736 | 1938 | 0.000016 | at.tugraz
559 | 17730652 | 841 | 0.000035 | com.docker
560 | 17730110 | 1337 | 0.000023 | in.blogspot
561 | 17728102 | 2132 | 0.000014 | com.theoutline
562 | 17727422 | 1172 | 0.000027 | com.indiegogo
563 | 17724128 | 954 | 0.000033 | com.alexa
564 | 17723922 | 3550 | 0.000012 | com.twitpic
565 | 17723324 | 576 | 0.000044 | com.windowsphone
566 | 17723084 | 1173 | 0.000027 | com.homeadvisor
567 | 17722600 | 1694 | 0.000018 | uk.co.metro
568 | 17720038 | 2745 | 0.000013 | com.idt
569 | 17719404 | 2456 | 0.000013 | com.23hq
570 | 17717984 | 1471 | 0.000021 | org.khanacademy
571 | 17716196 | 1821 | 0.000017 | org.elasticsearch
572 | 17715978 | 767 | 0.000037 | com.indiatimes
573 | 17715462 | 1989 | 0.000015 | com.shoutmeloud
574 | 17714586 | 480 | 0.000051 | com.nature
575 | 17713698 | 414 | 0.000060 | edu.cmu
576 | 17713112 | 1856 | 0.000017 | com.city-data
577 | 17712356 | 2116 | 0.000015 | com.kgw
578 | 17711724 | 1238 | 0.000025 | org.pewresearch
579 | 17711380 | 975 | 0.000033 | com.sfgate
580 | 17711188 | 1672 | 0.000018 | gov.nh
581 | 17711070 | 1983 | 0.000015 | google.design
582 | 17708460 | 457 | 0.000053 | com.gitlab
583 | 17708384 | 507 | 0.000049 | uk.co.independent
584 | 17708180 | 1678 | 0.000018 | org.polymer-project
585 | 17708112 | 2129 | 0.000014 | org.designmuseum
586 | 17708080 | 219 | 0.000108 | jp.ne.hatena
587 | 17707174 | 224 | 0.000106 | to.amzn
588 | 17704286 | 1143 | 0.000027 | edu.wisc
589 | 17703708 | 527 | 0.000047 | com.statista
590 | 17702676 | 802 | 0.000036 | com.netflix
591 | 17702444 | 1208 | 0.000026 | com.firefox
592 | 17701640 | 2944 | 0.000012 | edu.brown
593 | 17700316 | 1777 | 0.000017 | com.tutsplus
594 | 17699332 | 2197 | 0.000014 | ca.uwaterloo
595 | 17697030 | 2189 | 0.000014 | com.company
596 | 17696512 | 1609 | 0.000019 | com.martechtoday
597 | 17696414 | 560 | 0.000045 | org.pbs
598 | 17696178 | 1632 | 0.000019 | com.fiverr
599 | 17693936 | 2353 | 0.000013 | com.instructables
600 | 17693836 | 1226 | 0.000025 | com.clicky
601 | 17693516 | 244 | 0.000101 | com.wpengine
602 | 17693384 | 1009 | 0.000031 | com.uservoice
603 | 17690248 | 3133 | 0.000012 | net.digitalcongo
604 | 17688104 | 496 | 0.000049 | us.icio
605 | 17688098 | 1724 | 0.000018 | us.nm.state
606 | 17685732 | 2920 | 0.000012 | com.wvec
607 | 17685630 | 2831 | 0.000012 | com.growtix
608 | 17684812 | 1786 | 0.000017 | us.ma.state
609 | 17684532 | 1116 | 0.000028 | uk.ac.cam
610 | 17684330 | 2372 | 0.000013 | com.warriorplus
611 | 17684134 | 957 | 0.000033 | com.shutterstock
612 | 17683636 | 1314 | 0.000024 | uk.co.theregister
613 | 17682188 | 1007 | 0.000032 | es.agpd
614 | 17682158 | 2168 | 0.000014 | com.what3words
615 | 17680382 | 1876 | 0.000016 | com.itsnicethat
616 | 17679870 | 326 | 0.000073 | org.joomla
617 | 17676374 | 2153 | 0.000014 | com.dreamgrow
618 | 17675740 | 1212 | 0.000026 | com.playstation
619 | 17674912 | 2325 | 0.000013 | org.webpagetest
620 | 17674664 | 1815 | 0.000017 | io.pantheon
621 | 17673952 | 3026 | 0.000012 | org.nalip
622 | 17673070 | 1382 | 0.000022 | com.digitaltrends
623 | 17672802 | 2256 | 0.000014 | com.googlelabs
624 | 17672798 | 106 | 0.000299 | net.2mdn
625 | 17671734 | 509 | 0.000049 | tv.twitch
626 | 17671174 | 1168 | 0.000027 | com.steamcommunity
627 | 17670820 | 2061 | 0.000015 | com.targetmarketingmag
628 | 17670692 | 178 | 0.000144 | me.line
629 | 17670586 | 2801 | 0.000013 | co.edureka
630 | 17670360 | 2230 | 0.000014 | eu.i-scoop
631 | 17670228 | 1929 | 0.000016 | com.wral
632 | 17669872 | 1974 | 0.000015 | us.wi.state
633 | 17668578 | 2229 | 0.000014 | net.wrightflyer
634 | 17666822 | 2355 | 0.000013 | gov.cabq
635 | 17666534 | 340 | 0.000071 | com.bitly
636 | 17666208 | 368 | 0.000066 | cn.com.sina
637 | 17665830 | 1376 | 0.000022 | com.intuit
638 | 17665486 | 1339 | 0.000023 | kr.or.kisa
639 | 17665464 | 1043 | 0.000030 | com.newsweek
640 | 17665278 | 1152 | 0.000027 | edu.northwestern
641 | 17664282 | 2384 | 0.000013 | edu.uah
642 | 17663658 | 1816 | 0.000017 | com.rabbitmq
643 | 17662888 | 2067 | 0.000015 | com.wfaa
644 | 17662822 | 1312 | 0.000024 | com.ning
645 | 17662498 | 1923 | 0.000016 | ch.ethz
646 | 17661652 | 1622 | 0.000019 | com.sharefile
647 | 17661252 | 1259 | 0.000025 | com.pcmag
648 | 17660468 | 407 | 0.000061 | edu.nyu
649 | 17659788 | 1029 | 0.000031 | gov.fcc
650 | 17658992 | 348 | 0.000070 | org.opensource
651 | 17658162 | 74 | 0.000531 | me.ogp
652 | 17658048 | 2400 | 0.000013 | com.wikidot
653 | 17657344 | 1610 | 0.000019 | com.com
654 | 17657246 | 187 | 0.000133 | com.eepurl
655 | 17657014 | 1227 | 0.000025 | com.ssrn
656 | 17656988 | 694 | 0.000040 | com.xinhuanet
657 | 17654064 | 1780 | 0.000017 | org.scala-lang
658 | 17653188 | 1408 | 0.000022 | edu.unc
659 | 17652568 | 1894 | 0.000016 | org.iihs
660 | 17652104 | 771 | 0.000037 | org.plos
661 | 17651732 | 1274 | 0.000024 | tv.ustream
662 | 17651382 | 1068 | 0.000030 | ly.ow
663 | 17650966 | 2194 | 0.000014 | com.almanac
664 | 17650526 | 2055 | 0.000015 | com.gamespot
665 | 17650220 | 2335 | 0.000013 | com.bibliocommons
666 | 17649444 | 660 | 0.000042 | com.feedly
667 | 17648694 | 797 | 0.000036 | com.deloitte
668 | 17646576 | 959 | 0.000033 | gov.senate
669 | 17646290 | 2218 | 0.000014 | org.onegreenplanet
670 | 17645346 | 2125 | 0.000014 | com.yourdomain
671 | 17645168 | 433 | 0.000057 | com.squareup
672 | 17644982 | 1886 | 0.000016 | com.mariadb
673 | 17643414 | 1548 | 0.000020 | org.postimg
674 | 17642978 | 1291 | 0.000024 | org.cambridge
675 | 17642506 | 2381 | 0.000013 | com.marksdailyapple
676 | 17642072 | 261 | 0.000091 | com.histats
677 | 17641504 | 1666 | 0.000019 | com.digitaloceanspaces
678 | 17641474 | 1267 | 0.000024 | com.canva
679 | 17641424 | 1392 | 0.000022 | im.gitter
680 | 17641326 | 1198 | 0.000026 | com.techrepublic
681 | 17640734 | 2749 | 0.000013 | com.themonitor
682 | 17640688 | 1577 | 0.000019 | uk.co.thesun
683 | 17640560 | 1618 | 0.000019 | com.nba
684 | 17639544 | 2170 | 0.000014 | com.winemag
685 | 17638276 | 1431 | 0.000022 | com.mcafee
686 | 17638268 | 913 | 0.000034 | gov.justice
687 | 17635694 | 722 | 0.000039 | com.steampowered
688 | 17633614 | 886 | 0.000035 | com.timeanddate
689 | 17633566 | 445 | 0.000055 | com.adweek
690 | 17631684 | 834 | 0.000035 | com.aliexpress
691 | 17630196 | 302 | 0.000078 | com.netdna-ssl
692 | 17630110 | 1764 | 0.000017 | us.oh.state
693 | 17629376 | 1241 | 0.000025 | com.optinmonster
694 | 17628644 | 1389 | 0.000022 | org.js
695 | 17628624 | 2240 | 0.000014 | jp.ac.kobe-u
696 | 17627756 | 657 | 0.000042 | gov.noaa
697 | 17626576 | 1782 | 0.000017 | org.openweathermap
698 | 17625866 | 853 | 0.000035 | com.marketwatch
699 | 17625214 | 2371 | 0.000013 | com.winefolly
700 | 17624654 | 1585 | 0.000019 | org.golang
701 | 17623884 | 343 | 0.000071 | ca.google
702 | 17623882 | 1171 | 0.000027 | com.hollywoodreporter
703 | 17623394 | 2642 | 0.000013 | org.travelblog
704 | 17621812 | 2915 | 0.000012 | me.pxlme
705 | 17621742 | 1718 | 0.000018 | com.crunchbase
706 | 17621104 | 2417 | 0.000013 | com.thedrinksbusiness
707 | 17620918 | 1253 | 0.000025 | com.mlb
708 | 17620844 | 2183 | 0.000014 | com.designobserver
709 | 17619798 | 1957 | 0.000016 | com.whitepages
710 | 17618860 | 1308 | 0.000024 | fr.lemonde
711 | 17617276 | 1650 | 0.000019 | com.pastebin
712 | 17616020 | 2667 | 0.000013 | com.backyardchickens
713 | 17615996 | 378 | 0.000065 | com.themeisle
714 | 17615324 | 247 | 0.000099 | io.polyfill
715 | 17614672 | 3674 | 0.000011 | org.torproject
716 | 17614462 | 1196 | 0.000026 | com.politico
717 | 17612598 | 965 | 0.000033 | de.blogspot
718 | 17612468 | 2143 | 0.000014 | com.programmableweb
719 | 17612380 | 777 | 0.000037 | gov.house
720 | 17612350 | 2378 | 0.000013 | uk.ac.hud
721 | 17612226 | 313 | 0.000076 | com.fc2
722 | 17609572 | 351 | 0.000069 | jp.co.rakuten
723 | 17609426 | 1284 | 0.000024 | se.haxx
724 | 17609170 | 401 | 0.000062 | com.smugmug
725 | 17609048 | 2191 | 0.000014 | com.azfamily
726 | 17607352 | 126 | 0.000236 | info.aboutads
727 | 17607032 | 5050 | 0.000007 | com.formula1
728 | 17606320 | 2948 | 0.000012 | com.locationrebel
729 | 17604020 | 252 | 0.000097 | com.marriott
730 | 17603354 | 185 | 0.000134 | com.xing
731 | 17603156 | 1543 | 0.000020 | org.doxygen
732 | 17602956 | 491 | 0.000050 | com.snapchat
733 | 17601902 | 2771 | 0.000013 | com.trendland
734 | 17600640 | 1073 | 0.000030 | com.americanexpress
735 | 17600636 | 1115 | 0.000028 | com.redhat
736 | 17600606 | 2394 | 0.000013 | com.sitejabber
737 | 17600436 | 2311 | 0.000014 | com.galvanize
738 | 17600090 | 4285 | 0.000009 | com.dreamstime
739 | 17599690 | 2035 | 0.000015 | com.insiderpages
740 | 17599126 | 1419 | 0.000022 | kr.flic
741 | 17599066 | 1110 | 0.000029 | gov.uspto
742 | 17599060 | 837 | 0.000035 | br.com.uol
743 | 17596014 | 530 | 0.000047 | com.163
744 | 17595876 | 290 | 0.000082 | gov.ftc
745 | 17595472 | 495 | 0.000049 | com.nasdaq
746 | 17595126 | 2753 | 0.000013 | com.lookuppage
747 | 17593550 | 1134 | 0.000028 | fr.blogspot
748 | 17592570 | 1278 | 0.000024 | com.prezi
749 | 17591712 | 2659 | 0.000013 | com.avsforum
750 | 17591328 | 410 | 0.000061 | mp.mailchi
751 | 17590608 | 2020 | 0.000015 | edu.arizona
752 | 17590230 | 793 | 0.000036 | com.nielsen
753 | 17589738 | 2364 | 0.000013 | com.chamberofcommerce
754 | 17589414 | 2147 | 0.000014 | com.towardsdatascience
755 | 17589090 | 1050 | 0.000030 | com.sciencedaily
756 | 17588114 | 978 | 0.000033 | io.readthedocs
757 | 17587844 | 283 | 0.000083 | com.dedecms
758 | 17587504 | 1549 | 0.000020 | uk.co.wired
759 | 17586578 | 1252 | 0.000025 | com.dell
760 | 17585810 | 1435 | 0.000021 | com.billboard
761 | 17585660 | 421 | 0.000059 | com.criteo
762 | 17585524 | 2283 | 0.000014 | org.zenit
763 | 17585188 | 1062 | 0.000030 | org.change
764 | 17584840 | 1304 | 0.000024 | edu.academia
765 | 17583818 | 588 | 0.000044 | com.newyorker
766 | 17582200 | 3591 | 0.000012 | com.sophos
767 | 17582180 | 1741 | 0.000018 | de.welt
768 | 17581488 | 352 | 0.000069 | net.themeforest
769 | 17581304 | 2293 | 0.000014 | org.gwtproject
770 | 17580662 | 2788 | 0.000013 | io.setosa
771 | 17580656 | 1276 | 0.000024 | st.prom
772 | 17580614 | 1433 | 0.000021 | fm.last
773 | 17580540 | 1730 | 0.000018 | com.fifa
774 | 17580530 | 2687 | 0.000013 | com.storeboard
775 | 17580282 | 2169 | 0.000014 | au.com.truelocal
776 | 17580194 | 2297 | 0.000014 | com.2findlocal
777 | 17580070 | 1093 | 0.000029 | com.visualstudio
778 | 17579740 | 1111 | 0.000029 | com.500px
779 | 17579538 | 250 | 0.000097 | jp.co.amazon
780 | 17578580 | 2264 | 0.000014 | net.webhostingsecretrevealed
781 | 17574976 | 1752 | 0.000018 | org.rubyonrails
782 | 17574924 | 59 | 0.000607 | com.messenger
783 | 17574886 | 1690 | 0.000018 | com.mtv
784 | 17574662 | 2277 | 0.000014 | com.newsbank
785 | 17573782 | 1095 | 0.000029 | de.heise
786 | 17573384 | 1911 | 0.000016 | com.ibtimes
787 | 17570610 | 1616 | 0.000019 | com.problogger
788 | 17570126 | 1995 | 0.000015 | com.ehow
789 | 17569784 | 1998 | 0.000015 | mp.j
790 | 17568580 | 1036 | 0.000031 | com.cbslocal
791 | 17568370 | 2398 | 0.000013 | com.wcnc
792 | 17568244 | 1096 | 0.000029 | com.investopedia
793 | 17567780 | 2930 | 0.000012 | edu.unl
794 | 17567104 | 2317 | 0.000014 | ly.cl
795 | 17566344 | 687 | 0.000041 | com.caniuse
796 | 17566302 | 431 | 0.000057 | com.verisign
797 | 17566120 | 1455 | 0.000021 | com.hotmail
798 | 17565914 | 2181 | 0.000014 | au.com.yellowpages
799 | 17565600 | 1453 | 0.000021 | com.rollingstone
800 | 17565572 | 2115 | 0.000015 | com.local
801 | 17564428 | 231 | 0.000104 | fr.google
802 | 17563688 | 215 | 0.000111 | it.google
803 | 17563288 | 2003 | 0.000015 | com.smartblogger
804 | 17562868 | 1663 | 0.000019 | org.coursera
805 | 17562472 | 2220 | 0.000014 | gov.louisvilleky
806 | 17562082 | 1202 | 0.000026 | com.domain
807 | 17560868 | 597 | 0.000043 | com.nationalgeographic
808 | 17560104 | 2123 | 0.000015 | com.theinnovationenterprise
809 | 17559640 | 2854 | 0.000012 | ke.co.blogspot
810 | 17558802 | 2265 | 0.000014 | io.kubernetes
811 | 17558702 | 2319 | 0.000014 | net.brownbook
812 | 17558172 | 1859 | 0.000016 | de.zeit
813 | 17558106 | 1344 | 0.000023 | com.freepik
814 | 17557620 | 2205 | 0.000014 | com.goinswriter
815 | 17557432 | 731 | 0.000039 | com.tandfonline
816 | 17556946 | 1480 | 0.000021 | edu.jhu
817 | 17556522 | 2071 | 0.000015 | com.riddle
818 | 17556376 | 1256 | 0.000025 | com.vox
819 | 17555602 | 1127 | 0.000028 | com.smashingmagazine
820 | 17554660 | 1756 | 0.000018 | edu.msu
821 | 17554422 | 838 | 0.000035 | com.uk
822 | 17554304 | 2958 | 0.000012 | org.dyndns
823 | 17553914 | 2403 | 0.000013 | com.wsoctv
824 | 17553776 | 2406 | 0.000013 | com.independent
825 | 17553776 | 1387 | 0.000022 | com.nymag
826 | 17552988 | 1809 | 0.000017 | com.posterous
827 | 17550834 | 1189 | 0.000026 | com.digitalocean
828 | 17550516 | 883 | 0.000035 | com.gofundme
829 | 17549804 | 255 | 0.000095 | com.myshopify
830 | 17549356 | 2746 | 0.000013 | com.spoke
831 | 17549122 | 2064 | 0.000015 | com.chambermaster
832 | 17548302 | 1179 | 0.000027 | de.spiegel
833 | 17548188 | 1784 | 0.000017 | com.ikea
834 | 17548154 | 2263 | 0.000014 | com.bizcommunity
835 | 17548094 | 2730 | 0.000013 | com.communitywalk
836 | 17547516 | 2399 | 0.000013 | com.ibmbigdatahub
837 | 17547486 | 1906 | 0.000016 | com.thewritepractice
838 | 17546846 | 1599 | 0.000019 | org.filezilla-project
839 | 17546810 | 1899 | 0.000016 | com.techradar
840 | 17546678 | 1963 | 0.000015 | com.visioncritical
841 | 17546154 | 1969 | 0.000015 | com.brafton
842 | 17545852 | 1627 | 0.000019 | com.codeplex
843 | 17545338 | 428 | 0.000057 | com.sohu
844 | 17544316 | 335 | 0.000072 | com.jotform
845 | 17543714 | 1779 | 0.000017 | com.lawyers
846 | 17543442 | 2316 | 0.000014 | edu.hbs
847 | 17543058 | 1401 | 0.000022 | edu.usc
848 | 17542964 | 152 | 0.000169 | com.addtoany
849 | 17542808 | 2705 | 0.000013 | com.nation2
850 | 17542602 | 1317 | 0.000023 | edu.uchicago
851 | 17542376 | 2182 | 0.000014 | com.w3techs
852 | 17541496 | 1371 | 0.000023 | sh.brew
853 | 17541272 | 1325 | 0.000023 | com.strikingly
854 | 17540262 | 1988 | 0.000015 | org.aclu
855 | 17540236 | 2579 | 0.000013 | com.kens5
856 | 17539906 | 453 | 0.000054 | jp.ne.sakura
857 | 17539894 | 1092 | 0.000029 | com.prweb
858 | 17539810 | 2428 | 0.000013 | com.tractorsupply
859 | 17539382 | 3540 | 0.000012 | com.gyazo
860 | 17539240 | 2717 | 0.000013 | com.yelloyello
861 | 17538820 | 1231 | 0.000025 | com.elpais
862 | 17538662 | 3864 | 0.000010 | com.rottentomatoes
863 | 17538296 | 2138 | 0.000014 | net.hockeyapp
864 | 17537912 | 1697 | 0.000018 | com.howstuffworks
865 | 17536686 | 2805 | 0.000012 | com.lacartes
866 | 17536288 | 1628 | 0.000019 | io.getmdl
867 | 17535392 | 2343 | 0.000013 | com.citysquares
868 | 17534222 | 761 | 0.000037 | net.daum
869 | 17533460 | 2407 | 0.000013 | com.kmov
870 | 17530710 | 2816 | 0.000012 | com.mothering
871 | 17530426 | 484 | 0.000051 | com.iconfinder
872 | 17529686 | 2873 | 0.000012 | org.rethinkingschools
873 | 17528810 | 1456 | 0.000021 | org.wiktionary
874 | 17528532 | 707 | 0.000040 | com.emarketer
875 | 17528512 | 259 | 0.000094 | me.t
876 | 17528368 | 2885 | 0.000012 | com.asus
877 | 17527494 | 1904 | 0.000016 | com.rt
878 | 17527482 | 993 | 0.000032 | com.oup
879 | 17527248 | 1383 | 0.000022 | com.theglobeandmail
880 | 17524712 | 1270 | 0.000024 | co.vine
881 | 17524420 | 2768 | 0.000013 | org.foodrevolution
882 | 17524032 | 2365 | 0.000013 | com.wpxi
883 | 17523232 | 973 | 0.000033 | com.airbnb
884 | 17522954 | 970 | 0.000033 | gov.usa
885 | 17522872 | 2702 | 0.000013 | com.njmonthly
886 | 17522602 | 1011 | 0.000031 | org.unesco
887 | 17522048 | 2800 | 0.000013 | org.thebestschools
888 | 17521268 | 2309 | 0.000014 | com.ezlocal
889 | 17521124 | 173 | 0.000153 | com.bluehost
890 | 17521040 | 228 | 0.000105 | com.maxcdn
891 | 17520736 | 2401 | 0.000013 | com.cbs
892 | 17519970 | 1405 | 0.000022 | org.example
893 | 17519746 | 2806 | 0.000012 | com.calmclinic
894 | 17519654 | 964 | 0.000033 | gov.copyright
895 | 17519130 | 2134 | 0.000014 | edu.ncsu
896 | 17517626 | 3900 | 0.000010 | com.domaintools
897 | 17517264 | 2799 | 0.000013 | com.trepup
898 | 17517190 | 1849 | 0.000017 | edu.indiana
899 | 17516236 | 1410 | 0.000022 | org.unicode
900 | 17514928 | 2734 | 0.000013 | com.mykaratestore
901 | 17514770 | 1860 | 0.000016 | com.adespresso
902 | 17514742 | 487 | 0.000050 | org.whatwg
903 | 17514450 | 3782 | 0.000011 | gd.is
904 | 17513384 | 1837 | 0.000017 | re.cli
905 | 17513134 | 3162 | 0.000012 | com.000webhostapp
906 | 17512942 | 907 | 0.000034 | com.alibaba
907 | 17512916 | 1624 | 0.000019 | com.britannica
908 | 17512736 | 1210 | 0.000026 | com.reverbnation
909 | 17512256 | 609 | 0.000043 | com.patreon
910 | 17511530 | 4022 | 0.000010 | edu.iu
911 | 17511124 | 794 | 0.000036 | com.yandex
912 | 17511074 | 525 | 0.000047 | com.outlook
9131751021011510.000027org.fao
9141750997627560.000013co.wanelo
9151750894616930.000018com.udemy
9161750881411880.000026gov.usgs
917175085889410.000034com.ggpht
9181750850613090.000024uk.co.mirror
9191750842211560.000027edu.umn
920175077303180.000075nl.google
921175050542580.000094com.disqus
9221750475413130.000024com.pwc
923175046389610.000033com.pinimg
9241750445818000.000017com.html5rocks
9251750414811240.000028com.sun
9261750346812000.000026com.uber
9271750179246430.000008com.mysite
9281750175623080.000014org.gimp
9291750172220660.000015com.packtpub
9301750169026760.000013com.pages10
9311750146824210.000013com.tuck
9321750099226150.000013org.swi-prolog
9331750021818350.000017edu.virginia
9341749988628970.000012be.brussels
9351749978411060.000029au.net.abc
936174996782260.000105com.googletagservices
9371749699635310.000012ch.cern
9381749628225800.000013com.ktvb
939174961025010.000049com.bigcartel
9401749579819300.000016com.nfl
9411749579413000.000024com.showmelocal
9421749451813580.000023org.pnas
9431749401421750.000014uk.co.realbusiness
9441749401027280.000013ly.visual
9451749244221090.000015com.discovery
9461749238426220.000013org.virginiadot
9471749197810690.000030com.us
9481749185218290.000017edu.cuny
9491749129816810.000018com.podbean
9501749117211820.000026com.accenture
9511749114027550.000013com.pushwoosh
9521749087625880.000013com.yellowbot
9531749065229030.000012com.watchuseek
9541749051014510.000021com.thehill
9551749044228340.000012com.callupcontact
9561749015623970.000013com.echelman
9571749000038530.000010org.greenpeace
9581748917417830.000017com.screencast
9591748847620430.000015com.webnode
9601748835611190.000028com.lifehacker
961174870609910.000032org.iso
962174870487280.000039com.gartner
9631748580416800.000018com.hulu
9641748578419270.000016co.gcdn
9651748460015250.000020com.windows
9661748386620010.000015com.birdeye
967174830349510.000034ru.google
9681748297039090.000010org.bitcoin
9691748132624130.000013com.topsy
9701748103624290.000013com.texasbar
971174804548820.000035com.stitcher
9721747894029240.000012com.talkbass
9731747858422780.000014ca.ualberta
974174785048430.000035gg.discord
9751747845028550.000012com.cylex-usa
9761747751436730.000011nl.xs4all
9771747739629250.000012info.ufacity
978174773821040.000304com.namecheap
9791747712228490.000012com.louisville
9801747631624600.000013uk.gov.westsussex
9811747568626680.000013com.salespider
9821747549015680.000019com.nokia
9831747547610340.000031com.digiday
9841747526228890.000012org.stnicholascenter
9851747505628500.000012au.com.hotfrog
9861747479812570.000025org.webkit
9871747449224300.000013net.blog5
9881747423622710.000014tv.periscope
989174739028770.000035uk.co.tripadvisor
9901747380229000.000012org.phys
9911747318816120.000019edu.umd
992174731048840.000035gov.ny
9931747212219810.000015ru.narod
994174716482330.000104jp.ameblo
9951747160651780.000007net.minecraft
996174707841630.000162com.youku
9971747048216770.000018org.gnome
998174701661840.000137com.nginx
9991747015614470.000021com.splashthat
10001746987826810.000013com.bleacherreport
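
Domain names in the table are written in reversed form (for example `uk.co.tripadvisor` stands for `tripadvisor.co.uk`). The following is a minimal Python sketch of how a few rows could be turned back into ordinary domain names and re-sorted by the second ranking. It assumes the five columns are, in order: harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value and reversed domain name. The names `RankRow`, `unreverse` and `ROWS` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    # Column meanings are an assumption based on the table layout.
    harmonic_rank: int
    harmonic_value: int
    pagerank_rank: int
    pagerank_value: float
    domain: str


def unreverse(name: str) -> str:
    """Turn a reversed domain name (com.example) into example.com."""
    return ".".join(reversed(name.split(".")))


# Three rows copied by hand from the table.
ROWS = [
    (978, 17477382, 104, 0.000304, "com.namecheap"),
    (989, 17473902, 877, 0.000035, "uk.co.tripadvisor"),
    (995, 17471606, 5178, 0.000007, "net.minecraft"),
]

rows = [RankRow(h, hv, p, pv, unreverse(d)) for h, hv, p, pv, d in ROWS]

# Re-sort the sample by PageRank rank instead of harmonic centrality rank.
by_pagerank = sorted(rows, key=lambda r: r.pagerank_rank)
for r in by_pagerank:
    print(r.domain, r.harmonic_rank, r.pagerank_rank)
```

Comparing the two rank columns this way shows how far the orderings can diverge for a single domain.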

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection and related topics. Let us know about your results via Common Crawl’s Google Group!