October 2020 crawl archive now available

The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-45/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2020-45/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-45/warc.paths.gz | 72000 | 63.79 |
| WAT files | CC-MAIN-2020-45/wat.paths.gz | 72000 | 18.39 |
| WET files | CC-MAIN-2020-45/wet.paths.gz | 72000 | 8.23 |
| Robots.txt files | CC-MAIN-2020-45/robotstxt.paths.gz | 72000 | 0.2 |
| Non-200 responses files | CC-MAIN-2020-45/non200responses.paths.gz | 72000 | 1.75 |
| URL index files | CC-MAIN-2020-45/cc-index.paths.gz | 302 | 0.21 |
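As a minimal sketch of the prefixing step described above, the following Python turns the lines of a decompressed path listing into full URLs. The function name `to_urls` and the sample path are hypothetical and only serve as illustration; the two prefixes are the ones given in the text.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_urls(path_lines, prefix=HTTP_PREFIX):
    """Prepend a bucket prefix to every non-empty line of a path listing."""
    return [prefix + line.strip() for line in path_lines if line.strip()]


# Hypothetical excerpt of a decompressed listing (placeholder file name).
sample = [
    "crawl-data/CC-MAIN-2020-45/EXAMPLE/file-00000.warc.gz\n",
    "\n",  # blank lines are skipped
]

for url in to_urls(sample, S3_PREFIX):
    print(url)
```

Reading the real listing would only require opening the downloaded `.paths.gz` file with `gzip.open(path, "rt")` and passing the file object to the same function.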

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-45/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Interactive Webgraph Statistics Notebook Released

We are pleased to announce the release of an interactive Jupyter notebook that is used to provide:

  • Visualization of web graph statistics
  • An interface for interacting with the webgraph

The visualization of the web graph statistics is done by leveraging the WebGraph framework, which provides means of gathering many interesting data points of a web graph, such as the frequency distribution of indegrees/outdegrees in the graph, or size distributions of the connected components. We then are able to use pandas and matplotlib to provide a visualization for the data provided by WebGraph. This effort was largely inspired by the Topology of the 2012 WDC Hyperlink Graph document. Further details of WebGraph tool installation/usage, and the data visualization may be found in the cc-notebooks repository.

The interface for interacting with the webgraph uses pyWebGraph, a front end that connects Jython with WebGraph. Before using this interface, the string maps must be rebuilt to create a mapping from node ID (a numerical value) to domain name, and vice versa. Once the maps are in place, you can load the graph into pyWebGraph and traverse it interactively.
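To illustrate what such a two-way mapping amounts to, here is a plain-Python sketch. It is not how pyWebGraph or WebGraph build their string maps; it assumes, purely for illustration, that each vertex line is tab-separated with the node ID first and the reversed name second, and the function name `build_string_maps` is hypothetical.

```python
def build_string_maps(vertex_lines):
    """Build ID -> name and name -> ID lookups from vertex lines.

    Assumes (hypothetically) tab-separated lines: "<id>\t<reversed name>[\t...]".
    """
    id_to_name = {}
    name_to_id = {}
    for line in vertex_lines:
        fields = line.rstrip("\n").split("\t")
        node_id, name = int(fields[0]), fields[1]
        id_to_name[node_id] = name
        name_to_id[name] = node_id
    return id_to_name, name_to_id


# Made-up sample lines, not taken from the published files.
sample = ["0\tcom.example\t3\n", "1\torg.example\t1\n"]
id_to_name, name_to_id = build_string_maps(sample)
print(id_to_name[1], name_to_id["com.example"])
```

In-memory dictionaries like these are only practical for small excerpts; the full graphs have tens to hundreds of millions of nodes.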

Further details of pyWebGraph installation/usage, and how to rebuild the string maps, may be found in the interactive webgraph README of the cc-notebooks repository.

The Jupyter notebook is available on GitHub in the same repository. More details about how to navigate the repository can be found in the notebook itself, as well as in the README.

We hope that users will be able to use these notebooks to gain more insight into the web graph in a numerical and practical sense.

We are grateful to WebGraph for providing extremely useful tools for processing the web graph itself, and to Massimo Santini for developing pyWebGraph.

Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 539 million nodes and 3.02 billion edges and includes dangling nodes, i.e. hosts that have not been crawled, yet are pointed to from a link on a crawled page. There are 467 million dangling nodes (86.7%), and the largest strongly connected component contains 46 million (8.5%) nodes.

You can download the graph and the ranks of all 539 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/ as a prefix to access the files from anywhere.

| Size | File | Description |
|---|---|---|
| 3.32 GB | cc-main-2020-jul-aug-sep-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files |
| 13.7 GB | cc-main-2020-jul-aug-sep-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files |
| 5.95 GB | cc-main-2020-jul-aug-sep-host.graph | graph in BVGraph format |
| 2 kB | cc-main-2020-jul-aug-sep-host.properties | |
| 6.76 GB | cc-main-2020-jul-aug-sep-host-t.graph | transpose of the graph (outlinks inverted to inlinks) |
| 2 kB | cc-main-2020-jul-aug-sep-host-t.properties | |
| 1 kB | cc-main-2020-jul-aug-sep-host.stats | WebGraph statistics |
| 7.77 GB | cc-main-2020-jul-aug-sep-host-ranks.txt.gz | harmonic centrality and pagerank |

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
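The reversal rule can be sketched in a few lines of Python. The function names are hypothetical, and the sketch covers only the two steps named above (strip a leading www., reverse the dot-separated labels); any further normalization the real pipeline may apply is not modeled here.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed name back into the usual order (a stripped 'www.' is lost)."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Note that the transformation is not fully invertible: once the leading www. has been stripped, the reversed name alone does not tell you whether it was there.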

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
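A rough sketch of this aggregation step is shown below. It is not the cc-webgraph implementation: the suffix set is a tiny hypothetical stand-in for the public suffix list (whose wildcard and exception rules are ignored), and merging duplicate edges and dropping links inside the same domain are assumptions made for the example.

```python
# Hypothetical, tiny stand-in for the public suffix list.
SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host):
    """Return the public suffix plus one label, or the host if nothing matches."""
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host


def aggregate_edges(host_edges):
    """Collapse host-level edges to domain-level edges (assumed: no self-loops)."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s != d:
            domain_edges.add((s, d))
    return domain_edges


host_edges = [
    ("a.example.com", "b.example.com"),   # same domain, dropped
    ("a.example.com", "www.bbc.co.uk"),
    ("b.example.com", "news.bbc.co.uk"),  # duplicate after aggregation
]
print(aggregate_edges(host_edges))
```

For real data, a maintained parser of the public suffix list would replace the hard-coded set.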

The domain-level graph has 89 million nodes and 1.71 billion edges. 51% or 45 million nodes are dangling nodes, and the largest strongly connected component covers 35 million or 39% of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/domain/.

Download files of the Common Crawl Jul/Aug/Sep 2020 domain-level webgraph

| Size | File | Description |
|---|---|---|
| 0.61 GB | cc-main-2020-jul-aug-sep-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩ |
| 6.80 GB | cc-main-2020-jul-aug-sep-domain-edges.txt.gz | edges ⟨from_id, to_id⟩ |
| 3.75 GB | cc-main-2020-jul-aug-sep-domain.graph | graph in BVGraph format |
| 2 kB | cc-main-2020-jul-aug-sep-domain.properties | |
| 3.69 GB | cc-main-2020-jul-aug-sep-domain-t.graph | transpose of the graph |
| 2 kB | cc-main-2020-jul-aug-sep-domain-t.properties | |
| 1 kB | cc-main-2020-jul-aug-sep-domain.stats | WebGraph statistics |
| 1.91 GB | cc-main-2020-jul-aug-sep-domain-ranks.txt.gz | harmonic centrality and pagerank |

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 89 million domain ranks is available for download.
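For readers unfamiliar with harmonic centrality, the textbook definition sums the reciprocal shortest-path distances from every other node that can reach a given node. The sketch below computes it by breadth-first search on a toy graph; it is only meant to show the idea, and it is an assumption that the published values follow this unnormalized form. The published ranks were computed with tooling designed for graphs of this size, not with code like this.

```python
from collections import deque


def harmonic_centrality(nodes, edges):
    """Sum of 1/d(y, x) over all nodes y that can reach x, for every node x."""
    # Reverse adjacency: for each node, the nodes linking to it.
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        incoming[dst].append(src)

    scores = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        # BFS along inverted links yields distances *to* the target.
        while queue:
            cur = queue.popleft()
            for prev in incoming[cur]:
                if prev not in dist:
                    dist[prev] = dist[cur] + 1
                    total += 1.0 / dist[prev]
                    queue.append(prev)
        scores[target] = total
    return scores


# Toy chain a -> b -> c.
print(harmonic_centrality(["a", "b", "c"], [("a", "b"), ("b", "c")]))
```

In the toy chain, c is reached from b at distance 1 and from a at distance 2, so it scores 1 + 1/2, while a has no inlinks and scores 0.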

Top 1000 domains ranked by harmonic centrality (Jul/Aug/Sep 2020)

| harmonic centrality rank | hc value | page rank | page rank value | reversed hostname |
|---|---|---|---|---|
| 1 | 32027928 | 1 | 0.018888 | com.googleapis |
| 2 | 30312944 | 3 | 0.012001 | com.facebook |
| 3 | 29025948 | 2 | 0.013237 | com.google |
| 4 | 26560472 | 4 | 0.007343 | org.w |
| 5 | 26516534 | 5 | 0.007172 | com.twitter |
| 6 | 26016464 | 6 | 0.006600 | com.youtube |
| 7 | 24614190 | 9 | 0.004795 | com.instagram |
| 8 | 24220712 | 8 | 0.005190 | org.gmpg |
| 9 | 23572970 | 7 | 0.005599 | com.googletagmanager |
| 10 | 23188190 | 11 | 0.003202 | com.linkedin |
| 11 | 22457894 | 15 | 0.002590 | com.gravatar |
| 12 | 22451350 | 10 | 0.003967 | com.cloudflare |
| 13 | 22364152 | 14 | 0.002726 | com.gstatic |
| 14 | 22350042 | 12 | 0.003105 | org.wordpress |
| 15 | 21926906 | 22 | 0.001505 | com.pinterest |
| 16 | 21699168 | 21 | 0.001752 | com.wordpress |
| 17 | 21599006 | 26 | 0.001181 | org.wikipedia |
| 18 | 21538264 | 16 | 0.002431 | com.bootstrapcdn |
| 19 | 21497526 | 18 | 0.001836 | com.apple |
| 20 | 21314410 | 30 | 0.001106 | com.vimeo |
| 21 | 21248994 | 41 | 0.000830 | be.youtu |
| 22 | 21186566 | 20 | 0.001794 | com.jquery |
| 23 | 21081822 | 23 | 0.001444 | com.microsoft |
| 24 | 21073240 | 45 | 0.000773 | com.blogspot |
| 25 | 20994964 | 39 | 0.000952 | com.amazonaws |
| 26 | 20975988 | 46 | 0.000732 | gl.goo |
| 27 | 20971574 | 25 | 0.001384 | com.wp |
| 28 | 20921220 | 47 | 0.000723 | com.amazon |
| 29 | 20788608 | 72 | 0.000439 | com.tumblr |
| 30 | 20716256 | 19 | 0.001804 | com.adobe |
| 31 | 20694562 | 67 | 0.000535 | ly.bit |
| 32 | 20675418 | 34 | 0.001018 | com.google-analytics |
| 33 | 20627694 | 53 | 0.000673 | org.mozilla |
| 34 | 20618998 | 17 | 0.001975 | com.github |
| 35 | 20617620 | 31 | 0.001059 | net.cloudfront |
| 36 | 20579928 | 71 | 0.000449 | com.yahoo |
| 37 | 20571130 | 29 | 0.001127 | com.googlesyndication |
| 38 | 20570586 | 60 | 0.000612 | eu.europa |
| 39 | 20562028 | 52 | 0.000679 | com.flickr |
| 40 | 20560188 | 42 | 0.000818 | net.jsdelivr |
| 41 | 20526264 | 97 | 0.000347 | com.googleusercontent |
| 42 | 20481758 | 62 | 0.000606 | co.t |
| 43 | 20480218 | 109 | 0.000313 | com.reddit |
| 44 | 20451670 | 24 | 0.001419 | com.fontawesome |
| 45 | 20436180 | 83 | 0.000389 | com.weebly |
| 46 | 20387228 | 56 | 0.000628 | com.paypal |
| 47 | 20375802 | 40 | 0.000910 | com.macromedia |
| 48 | 20372972 | 70 | 0.000450 | com.medium |
| 49 | 20370180 | 43 | 0.000808 | com.addthis |
| 50 | 20360678 | 28 | 0.001156 | ru.yandex |
| 51 | 20338498 | 27 | 0.001156 | me.wp |
| 52 | 20331252 | 64 | 0.000559 | org.w3 |
| 53 | 20326560 | 79 | 0.000411 | io.github |
| 54 | 20292836 | 138 | 0.000223 | com.nytimes |
| 55 | 20275824 | 76 | 0.000414 | org.creativecommons |
| 56 | 20274244 | 59 | 0.000615 | org.schema |
| 57 | 20255326 | 150 | 0.000192 | com.forbes |
| 58 | 20246068 | 173 | 0.000151 | com.imgur |
| 59 | 20227930 | 36 | 0.000979 | net.doubleclick |
| 60 | 20219612 | 194 | 0.000133 | uk.co.bbc |
| 61 | 20210924 | 114 | 0.000285 | com.soundcloud |
| 62 | 20171070 | 66 | 0.000548 | com.vk |
| 63 | 20155222 | 195 | 0.000133 | com.cnn |
| 64 | 20142696 | 44 | 0.000803 | org.apache |
| 65 | 20134806 | 63 | 0.000587 | com.whatsapp |
| 66 | 20129582 | 314 | 0.000082 | edu.mit |
| 67 | 20123032 | 180 | 0.000146 | com.imdb |
| 68 | 20118310 | 208 | 0.000124 | net.slideshare |
| 69 | 20116626 | 243 | 0.000101 | com.wsj |
| 70 | 20115768 | 197 | 0.000128 | org.wikimedia |
| 71 | 20089462 | 85 | 0.000388 | com.shopify |
| 72 | 20082204 | 215 | 0.000120 | edu.stanford |
| 73 | 20076684 | 154 | 0.000181 | gov.cdc |
| 74 | 20075632 | 328 | 0.000079 | com.wired |
| 75 | 20069724 | 268 | 0.000094 | com.techcrunch |
| 76 | 20057066 | 255 | 0.000096 | edu.harvard |
| 77 | 20051336 | 353 | 0.000076 | com.appspot |
| 78 | 20051292 | 207 | 0.000124 | net.sourceforge |
| 79 | 20051264 | 257 | 0.000096 | com.oracle |
| 80 | 20051250 | 155 | 0.000177 | int.who |
| 81 | 20050888 | 206 | 0.000124 | com.businessinsider |
| 82 | 20046050 | 137 | 0.000227 | org.archive |
| 83 | 20038198 | 230 | 0.000113 | com.washingtonpost |
| 84 | 20035810 | 250 | 0.000097 | com.live |
| 85 | 20029940 | 164 | 0.000163 | com.bing |
| 86 | 20028210 | 549 | 0.000054 | com.livejournal |
| 87 | 20027622 | 424 | 0.000069 | com.go |
| 88 | 20024666 | 456 | 0.000066 | com.msn |
| 89 | 20019992 | 407 | 0.000072 | uk.co.telegraph |
| 90 | 20009306 | 170 | 0.000154 | com.theguardian |
| 91 | 20002514 | 527 | 0.000056 | edu.cornell |
| 92 | 19997146 | 199 | 0.000128 | org.ietf |
| 93 | 19996714 | 486 | 0.000063 | gov.nasa |
| 94 | 19995476 | 259 | 0.000096 | com.android |
| 95 | 19986252 | 302 | 0.000084 | com.reuters |
| 96 | 19983946 | 51 | 0.000702 | net.fbcdn |
| 97 | 19974890 | 240 | 0.000102 | com.bloomberg |
| 98 | 19966464 | 162 | 0.000164 | com.giphy |
| 99 | 19960428 | 77 | 0.000414 | com.list-manage |
| 100 | 19959046 | 520 | 0.000057 | com.googleblog |
| 101 | 19956558 | 269 | 0.000093 | com.bbc |
| 102 | 19955204 | 409 | 0.000071 | com.slack |
| 103 | 19942056 | 143 | 0.000205 | com.spotify |
| 104 | 19938828 | 591 | 0.000049 | com.zdnet |
| 105 | 19936894 | 48 | 0.000721 | net.facebook |
| 106 | 19935010 | 586 | 0.000050 | com.quora |
| 107 | 19931072 | 126 | 0.000265 | com.ytimg |
| 108 | 19922774 | 444 | 0.000067 | com.myspace |
| 109 | 19922046 | 757 | 0.000038 | edu.umich |
| 110 | 19920178 | 715 | 0.000040 | edu.upenn |
| 111 | 19917482 | 151 | 0.000185 | gov.nih |
| 112 | 19907886 | 344 | 0.000077 | com.usatoday |
| 113 | 19903896 | 654 | 0.000045 | com.economist |
| 114 | 19903722 | 313 | 0.000082 | com.cnbc |
| 115 | 19902700 | 308 | 0.000083 | com.example |
| 116 | 19896552 | 525 | 0.000056 | com.pixabay |
| 117 | 19895014 | 418 | 0.000070 | net.researchgate |
| 118 | 19882790 | 449 | 0.000066 | com.latimes |
| 119 | 19881164 | 188 | 0.000138 | com.blogger |
| 120 | 19870046 | 387 | 0.000075 | org.python |
| 121 | 19864804 | 65 | 0.000555 | com.wix |
| 122 | 19860760 | 433 | 0.000068 | com.githubusercontent |
| 123 | 19858732 | 693 | 0.000042 | org.ieee |
| 124 | 19854254 | 499 | 0.000061 | com.mashable |
| 125 | 19850918 | 571 | 0.000052 | edu.berkeley |
| 126 | 19847554 | 135 | 0.000241 | com.youtube-nocookie |
| 127 | 19845130 | 160 | 0.000167 | com.issuu |
| 128 | 19843068 | 218 | 0.000118 | org.acm |
| 129 | 19839736 | 834 | 0.000036 | org.chromium |
| 130 | 19839550 | 235 | 0.000106 | uk.co.google |
| 131 | 19835790 | 551 | 0.000054 | org.arxiv |
| 132 | 19833020 | 246 | 0.000099 | net.behance |
| 133 | 19832682 | 291 | 0.000086 | org.npr |
| 134 | 19831994 | 108 | 0.000320 | com.unpkg |
| 135 | 19831136 | 884 | 0.000034 | com.arstechnica |
| 136 | 19826840 | 213 | 0.000121 | com.unsplash |
| 137 | 19822884 | 341 | 0.000078 | com.outlook |
| 138 | 19822670 | 110 | 0.000303 | de.google |
| 139 | 19812430 | 54 | 0.000654 | com.googleadservices |
| 140 | 19810872 | 347 | 0.000077 | com.prnewswire |
| 141 | 19806458 | 678 | 0.000043 | edu.columbia |
| 142 | 19805382 | 171 | 0.000153 | me.t |
| 143 | 19804886 | 297 | 0.000085 | com.dribbble |
| 144 | 19804142 | 256 | 0.000096 | com.squarespace |
| 145 | 19799032 | 139 | 0.000215 | gov.privacyshield |
| 146 | 19798806 | 306 | 0.000083 | com.huffingtonpost |
| 147 | 19797964 | 260 | 0.000096 | com.bandcamp |
| 148 | 19795112 | 398 | 0.000074 | com.time |
| 149 | 19793874 | 37 | 0.000975 | com.baidu |
| 150 | 19792082 | 616 | 0.000048 | com.gitlab |
| 151 | 19790406 | 334 | 0.000079 | com.nationalgeographic |
| 152 | 19788214 | 443 | 0.000067 | com.nature |
| 153 | 19785178 | 794 | 0.000037 | com.stackexchange |
| 154 | 19782114 | 179 | 0.000147 | gle.forms |
| 155 | 19781676 | 258 | 0.000096 | org.ampproject |
| 156 | 19778534 | 548 | 0.000054 | com.fortune |
| 157 | 19777902 | 813 | 0.000036 | com.git-scm |
| 158 | 19776608 | 33 | 0.001030 | com.wixstatic |
| 159 | 19774030 | 771 | 0.000038 | com.qz |
| 160 | 19772390 | 281 | 0.000089 | com.wiley |
| 161 | 19772268 | 646 | 0.000046 | au.net.abc |
| 162 | 19770930 | 638 | 0.000046 | edu.yale |
| 163 | 19769582 | 428 | 0.000068 | com.meetup |
| 164 | 19767876 | 468 | 0.000064 | com.ted |
| 165 | 19761386 | 1160 | 0.000026 | com.hatenablog |
| 166 | 19759052 | 448 | 0.000066 | com.patreon |
| 167 | 19757472 | 283 | 0.000089 | com.disqus |
| 168 | 19756748 | 936 | 0.000032 | edu.ucla |
| 169 | 19753998 | 147 | 0.000195 | com.dropbox |
| 170 | 19753380 | 168 | 0.000158 | com.yelp |
| 171 | 19750678 | 271 | 0.000093 | org.un |
| 172 | 19746384 | 212 | 0.000122 | com.twimg |
| 173 | 19743118 | 254 | 0.000096 | org.drupal |
| 174 | 19741474 | 689 | 0.000042 | org.bitbucket |
| 175 | 19736540 | 422 | 0.000069 | com.statista |
| 176 | 19735440 | 903 | 0.000033 | uk.ac.cam |
| 177 | 19731940 | 718 | 0.000040 | com.evernote |
| 178 | 19731916 | 682 | 0.000043 | com.newyorker |
| 179 | 19725638 | 603 | 0.000049 | com.buzzfeed |
| 180 | 19719544 | 606 | 0.000049 | me.about |
| 181 | 19718654 | 722 | 0.000040 | com.mysql |
| 182 | 19716804 | 850 | 0.000035 | com.thenextweb |
| 183 | 19715420 | 495 | 0.000061 | com.theatlantic |
| 184 | 19710920 | 279 | 0.000091 | com.sciencedirect |
| 185 | 19710826 | 403 | 0.000073 | com.getpocket |
| 186 | 19705326 | 669 | 0.000043 | uk.co.blogspot |
| 187 | 19702126 | 1293 | 0.000023 | com.tinypic |
| 188 | 19696730 | 450 | 0.000066 | com.booking |
| 189 | 19695652 | 514 | 0.000058 | com.xinhuanet |
| 190 | 19694904 | 743 | 0.000039 | org.weforum |
| 191 | 19694268 | 247 | 0.000098 | gov.ca |
| 192 | 19692322 | 602 | 0.000049 | gov.loc |
| 193 | 19690998 | 1282 | 0.000023 | org.postgresql |
| 194 | 19689908 | 828 | 0.000036 | edu.princeton |
| 195 | 19687954 | 239 | 0.000103 | uk.co.amazon |
| 196 | 19685942 | 480 | 0.000063 | com.dailymotion |
| 197 | 19679672 | 1452 | 0.000021 | ru.narod |
| 198 | 19678926 | 189 | 0.000138 | com.xing |
| 199 | 19675914 | 879 | 0.000034 | edu.jhu |
| 200 | 19673670 | 500 | 0.000060 | gov.whitehouse |
| 201 | 19671846 | 665 | 0.000044 | org.worldbank |
| 202 | 19668706 | 1365 | 0.000022 | org.eclipse |
| 203 | 19667770 | 400 | 0.000073 | com.springer |
| 204 | 19667684 | 445 | 0.000067 | com.nypost |
| 205 | 19665872 | 316 | 0.000081 | com.ft |
| 206 | 19660930 | 61 | 0.000606 | com.fb |
| 207 | 19658986 | 204 | 0.000125 | com.feedburner |
| 208 | 19658394 | 826 | 0.000036 | org.cambridge |
| 209 | 19654762 | 476 | 0.000063 | uk.co.dailymail |
| 210 | 19654386 | 766 | 0.000038 | edu.washington |
| 211 | 19654242 | 496 | 0.000061 | org.eff |
| 212 | 19653044 | 32 | 0.001054 | com.qq |
| 213 | 19650144 | 473 | 0.000064 | com.goodreads |
| 214 | 19649524 | 264 | 0.000095 | org.doi |
| 215 | 19649502 | 512 | 0.000058 | com.w3schools |
| 216 | 19641242 | 1311 | 0.000023 | edu.virginia |
| 217 | 19641212 | 440 | 0.000067 | com.googlecode |
| 218 | 19638348 | 633 | 0.000047 | com.vice |
| 219 | 19633128 | 506 | 0.000059 | com.force |
| 220 | 19632976 | 723 | 0.000040 | com.trello |
| 221 | 19632780 | 836 | 0.000035 | com.about |
| 222 | 19630562 | 523 | 0.000056 | com.inc |
| 223 | 19629482 | 453 | 0.000066 | com.scribd |
| 224 | 19629368 | 2053 | 0.000016 | com.wikidot |
| 225 | 19628436 | 619 | 0.000048 | org.semver |
| 226 | 19614496 | 607 | 0.000049 | com.cbsnews |
| 227 | 19607794 | 651 | 0.000045 | com.withgoogle |
| 228 | 19605512 | 146 | 0.000196 | me.line |
| 229 | 19603410 | 2089 | 0.000016 | com.googlesource |
| 230 | 19601476 | 219 | 0.000118 | org.iana |
| 231 | 19601452 | 546 | 0.000054 | gov.usda |
| 232 | 19599800 | 309 | 0.000083 | com.tinyurl |
| 233 | 19598290 | 1090 | 0.000027 | com.techradar |
| 234 | 19597674 | 858 | 0.000035 | com.dropboxusercontent |
| 235 | 19597446 | 384 | 0.000076 | com.ibm |
| 236 | 19595200 | 1284 | 0.000023 | co.elastic |
| 237 | 19594024 | 289 | 0.000087 | com.squareup |
| 238 | 19593336 | 1434 | 0.000021 | org.linuxfoundation |
| 239 | 19592388 | 1134 | 0.000026 | org.coursera |
| 240 | 19589830 | 1027 | 0.000029 | gov.fbi |
| 241 | 19588284 | 1158 | 0.000026 | edu.unc |
| 242 | 19586008 | 705 | 0.000041 | com.vox |
| 243 | 19583350 | 193 | 0.000134 | de.amazon |
| 244 | 19583096 | 550 | 0.000054 | uk.co.independent |
| 245 | 19580554 | 1423 | 0.000021 | ms.1drv |
| 246 | 19578950 | 383 | 0.000076 | com.digg |
| 247 | 19567612 | 1393 | 0.000022 | org.kernel |
| 248 | 19563948 | 113 | 0.000287 | com.sharethis |
| 249 | 19563468 | 751 | 0.000039 | org.d3js |
| 250 | 19557490 | 801 | 0.000037 | gov.fcc |
| 251 | 19557292 | 1026 | 0.000029 | com.hollywoodreporter |
| 252 | 19556258 | 1369 | 0.000022 | com.howstuffworks |
| 253 | 19553700 | 430 | 0.000068 | com.cnet |
| 254 | 19552068 | 804 | 0.000037 | com.foxnews |
| 255 | 19547134 | 152 | 0.000183 | com.addtoany |
| 256 | 19547006 | 644 | 0.000046 | com.indiatimes |
| 257 | 19546928 | 995 | 0.000029 | com.steamcommunity |
| 258 | 19546864 | 1105 | 0.000026 | cn.com.chinadaily |
| 259 | 19545628 | 584 | 0.000050 | com.psychologytoday |
| 260 | 19544130 | 823 | 0.000036 | uk.co.guardian |
| 261 | 19543920 | 1463 | 0.000021 | it.scoop |
| 262 | 19543754 | 133 | 0.000247 | com.mailchimp |
| 263 | 19542234 | 837 | 0.000035 | com.slate |
| 264 | 19542214 | 153 | 0.000182 | com.opera |
| 265 | 19538412 | 589 | 0.000050 | com.mckinsey |
| 266 | 19536816 | 1020 | 0.000029 | com.sap |
| 267 | 19536418 | 2605 | 0.000013 | org.wikiquote |
| 268 | 19534334 | 307 | 0.000083 | com.bitly |
| 269 | 19533308 | 627 | 0.000047 | com.mozilla |
| 270 | 19533054 | 262 | 0.000095 | jp.ameblo |
| 271 | 19531260 | 735 | 0.000039 | org.sciencemag |
| 272 | 19528246 | 116 | 0.000284 | com.paypalobjects |
| 273 | 19528108 | 2345 | 0.000014 | org.wikibooks |
| 274 | 19527104 | 176 | 0.000151 | com.amazon-adsystem |
| 275 | 19526948 | 688 | 0.000042 | gov.noaa |
| 276 | 19524868 | 305 | 0.000083 | com.netdna-ssl |
| 277 | 19524544 | 310 | 0.000083 | com.nbcnews |
| 278 | 19523330 | 989 | 0.000030 | com.target |
| 279 | 19522776 | 1523 | 0.000020 | com.instructables |
| 280 | 19517526 | 975 | 0.000030 | edu.umn |
| 281 | 19516530 | 965 | 0.000031 | com.merriam-webster |
| 282 | 19516260 | 1431 | 0.000021 | hk.com.google |
| 283 | 19514852 | 185 | 0.000140 | com.tripadvisor |
| 284 | 19514608 | 2377 | 0.000014 | com.diigo |
| 285 | 19503916 | 497 | 0.000061 | ca.google |
| 286 | 19499262 | 236 | 0.000106 | com.wpengine |
| 287 | 19499246 | 1029 | 0.000028 | com.sun |
| 288 | 19496562 | 1189 | 0.000025 | com.digitaltrends |
| 289 | 19496340 | 391 | 0.000075 | com.stumbleupon |
| 290 | 19491846 | 115 | 0.000284 | com.weibo |
| 291 | 19491638 | 1626 | 0.000019 | com.ign |
| 292 | 19491210 | 1314 | 0.000023 | com.mercurynews |
| 293 | 19490964 | 1352 | 0.000022 | de.zeit |
| 294 | 19490636 | 229 | 0.000114 | com.etsy |
| 295 | 19489106 | 797 | 0.000037 | uk.ac.ox |
| 296 | 19487454 | 284 | 0.000089 | com.optimizely |
| 297 | 19485106 | 73 | 0.000425 | net.akamaihd |
| 298 | 19484368 | 1207 | 0.000025 | net.speedtest |
| 299 | 19484284 | 1522 | 0.000020 | org.greenpeace |
| 300 | 19483622 | 1553 | 0.000020 | net.seesaa |
| 301 | 19479450 | 720 | 0.000040 | au.com.google |
| 302 | 19478604 | 904 | 0.000033 | de.spiegel |
| 303 | 19476336 | 1077 | 0.000027 | com.podbean |
| 304 | 19475142 | 628 | 0.000047 | org.pbs |
| 305 | 19474722 | 516 | 0.000058 | com.gofundme |
| 306 | 19474484 | 416 | 0.000070 | com.kickstarter |
| 307 | 19473590 | 1340 | 0.000022 | com.urbandictionary |
| 308 | 19472422 | 472 | 0.000064 | org.pewresearch |
| 309 | 19471320 | 519 | 0.000057 | com.bigcommerce |
| 310 | 19467912 | 2137 | 0.000015 | de.bild |
| 311 | 19467240 | 231 | 0.000112 | com.eepurl |
| 312 | 19465300 | 515 | 0.000058 | com.theverge |
| 313 | 19464792 | 273 | 0.000092 | com.stackoverflow |
| 314 | 19464598 | 926 | 0.000032 | com.politico |
| 315 | 19463036 | 811 | 0.000036 | co.ibb |
| 316 | 19462394 | 332 | 0.000079 | it.google |
| 317 | 19462162 | 2110 | 0.000016 | ly.visual |
| 318 | 19461840 | 955 | 0.000031 | org.unicef |
| 319 | 19460932 | 2020 | 0.000016 | org.tensorflow |
| 320 | 19457592 | 1688 | 0.000018 | com.itv |
| 321 | 19457150 | 1013 | 0.000029 | com.lifehacker |
| 322 | 19456512 | 106 | 0.000334 | com.stripe |
| 323 | 19456272 | 1349 | 0.000022 | edu.msu |
| 324 | 19455412 | 312 | 0.000083 | net.windows |
| 325 | 19453374 | 805 | 0.000037 | edu.academia |
| 326 | 19450284 | 1391 | 0.000022 | com.storify |
| 327 | 19449638 | 1257 | 0.000024 | com.crunchbase |
| 328 | 19449386 | 595 | 0.000049 | com.tandfonline |
| 329 | 19449132 | 1958 | 0.000017 | com.lego |
| 330 | 19444682 | 1187 | 0.000025 | com.jetbrains |
| 331 | 19443796 | 677 | 0.000043 | gov.senate |
| 332 | 19443664 | 855 | 0.000035 | com.chicagotribune |
| 333 | 19443234 | 2301 | 0.000014 | com.rottentomatoes |
| 334 | 19440224 | 770 | 0.000038 | ca.cbc |
| 335 | 19439934 | 205 | 0.000125 | com.eventbrite |
| 336 | 19439496 | 1273 | 0.000023 | hk.hku |
| 337 | 19436402 | 1035 | 0.000028 | edu.wisc |
| 338 | 19436104 | 691 | 0.000042 | com.libsyn |
| 339 | 19435742 | 1051 | 0.000028 | edu.northwestern |
| 340 | 19433212 | 944 | 0.000031 | com.scientificamerican |
| 341 | 19432798 | 1043 | 0.000028 | edu.uchicago |
| 342 | 19431182 | 1288 | 0.000023 | uk.co.wired |
| 343 | 19425546 | 190 | 0.000137 | jp.co.google |
| 344 | 19424346 | 2002 | 0.000016 | org.maven |
| 345 | 19423732 | 1030 | 0.000028 | com.mediafire |
| 346 | 19423350 | 415 | 0.000070 | me.telegram |
| 347 | 19418440 | 396 | 0.000074 | com.criteo |
| 348 | 19417208 | 357 | 0.000076 | fr.google |
| 349 | 19417038 | 664 | 0.000044 | us.icio |
| 350 | 19416402 | 1477 | 0.000020 | com.deadline |
| 351 | 19415808 | 640 | 0.000046 | com.sagepub |
| 352 | 19414256 | 730 | 0.000039 | com.ecwid |
| 353 | 19413466 | 1275 | 0.000023 | org.aclu |
| 354 | 19413258 | 576 | 0.000051 | com.typepad |
| 355 | 19412168 | 471 | 0.000064 | com.photobucket |
| 356 | 19407294 | 533 | 0.000055 | com.oup |
| 357 | 19407168 | 1199 | 0.000025 | com.reverbnation |
| 358 | 19406968 | 1514 | 0.000020 | de.mpg |
| 359 | 19405330 | 1389 | 0.000022 | edu.rutgers |
| 360 | 19404790 | 1067 | 0.000027 | com.scmp |
| 361 | 19403976 | 81 | 0.000392 | net.jsfiddle |
| 362 | 19403692 | 421 | 0.000069 | com.calendly |
| 363 | 19403618 | 844 | 0.000035 | com.sciencedaily |
| 364 | 19403468 | 727 | 0.000039 | gov.justice |
| 365 | 19400830 | 575 | 0.000051 | gov.hhs |
| 366 | 19398258 | 919 | 0.000032 | com.theconversation |
| 367 | 19397596 | 991 | 0.000030 | com.apnews |
| 368 | 19397442 | 938 | 0.000032 | com.huffpost |
| 369 | 19394934 | 1518 | 0.000020 | com.newscientist |
| 370 | 19394656 | 608 | 0.000049 | org.openstreetmap |
| 371 | 19393300 | 1287 | 0.000023 | com.aljazeera |
| 372 | 19393230 | 216 | 0.000119 | com.hubspot |
| 373 | 19390018 | 645 | 0.000046 | gov.house |
| 374 | 19388118 | 2682 | 0.000012 | uk.co.timesonline |
| 375 | 19388034 | 2564 | 0.000013 | com.space |
| 376 | 19383910 | 700 | 0.000041 | com.pinimg |
| 377 | 19383504 | 432 | 0.000068 | page.g |
| 378 | 19381990 | 1241 | 0.000024 | com.sky |
| 379 | 19381844 | 866 | 0.000035 | gov.congress |
| 380 | 19381026 | 912 | 0.000033 | com.500px |
| 381 | 19380632 | 1217 | 0.000024 | org.wiktionary |
| 382 | 19380340 | 958 | 0.000031 | com.ssrn |
| 383 | 19379742 | 1709 | 0.000018 | edu.bu |
| 384 | 19377640 | 1757 | 0.000018 | gov.cia |
| 385 | 19375740 | 214 | 0.000120 | org.bbb |
| 386 | 19375634 | 1438 | 0.000021 | com.foxbusiness |
| 387 | 19371814 | 624 | 0.000047 | ru.gov |
| 388 | 19371056 | 1598 | 0.000019 | ca.mcgill |
| 389 | 19367926 | 790 | 0.000037 | com.qualtrics |
| 390 | 19366054 | 1290 | 0.000023 | org.semanticscholar |
| 391 | 19365778 | 761 | 0.000038 | site.business |
| 392 | 19365760 | 267 | 0.000094 | ru.ok |
| 393 | 19363798 | 977 | 0.000030 | edu.si |
| 394 | 19363758 | 887 | 0.000034 | br.com.google |
| 395 | 19363688 | 847 | 0.000035 | co.g |
| 396 | 19363204 | 1021 | 0.000029 | uk.co.thetimes |
| 397 | 19362122 | 2663 | 0.000012 | com.discovermagazine |
| 398 | 19359920 | 182 | 0.000142 | us.zoom |
| 399 | 19359492 | 889 | 0.000034 | org.fao |
| 400 | 19359352 | 683 | 0.000043 | org.change |
| 401 | 19357866 | 1469 | 0.000020 | com.salon |
| 402 | 19356650 | 228 | 0.000114 | com.aliyuncs |
| 403 | 19356280 | 997 | 0.000029 | com.thehill |
| 404 | 19354818 | 973 | 0.000030 | gov.usgs |
| 405 | 19351584 | 298 | 0.000085 | com.ebay |
| 406 | 19350988 | 1222 | 0.000024 | com.nikkei |
| 407 | 19350142 | 338 | 0.000078 | com.rawgit |
| 408 | 19349660 | 578 | 0.000051 | it.placehold |
| 409 | 19348824 | 157 | 0.000173 | com.wixsite |
| 410 | 19348122 | 1238 | 0.000024 | com.smithsonianmag |
| 411 | 19346552 | 758 | 0.000038 | org.oecd |
| 412 | 19346514 | 1088 | 0.000027 | ee.linktr |
| 413 | 19345254 | 3312 | 0.000011 | com.openai |
| 414 | 19342288 | 1048 | 0.000028 | uk.co.mirror |
| 415 | 19341656 | 679 | 0.000043 | com.deviantart |
| 416 | 19341332 | 1576 | 0.000019 | org.phys |
| 417 | 19340598 | 413 | 0.000070 | tv.twitch |
| 418 | 19340138 | 404 | 0.000072 | com.mapbox |
| 419 | 19335246 | 1546 | 0.000020 | ca.sfu |
| 420 | 19332464 | 2754 | 0.000012 | com.instapaper |
| 421 | 19330656 | 244 | 0.000100 | org.gnu |
| 422 | 19330504 | 2115 | 0.000016 | au.edu.unimelb |
| 423 | 19328724 | 1044 | 0.000028 | int.coe |
| 424 | 19328320 | 2078 | 0.000016 | org.nobelprize |
| 425 | 19328286 | 667 | 0.000043 | pl.google |
| 426 | 19327680 | 1333 | 0.000022 | com.irishtimes |
| 427 | 19327578 | 293 | 0.000086 | com.office |
| 428 | 19327536 | 1962 | 0.000017 | org.torproject |
| 429 | 19324936 | 484 | 0.000063 | net.imgix |
| 430 | 19324628 | 1281 | 0.000023 | uk.ac.ucl |
| 431 | 19320926 | 1054 | 0.000028 | org.ohchr |
| 432 | 19318772 | 1213 | 0.000025 | com.strikingly |
| 433 | 19315502 | 509 | 0.000059 | org.hbr |
| 434 | 19315040 | 1411 | 0.000021 | uk.co.metro |
| 435 | 19314304 | 123 | 0.000270 | com.statcounter |
| 436 | 19313468 | 972 | 0.000030 | gov.dhs |
| 437 | 19313380 | 287 | 0.000088 | com.thedailybeast |
| 438 | 19313234 | 1811 | 0.000017 | com.bankofamerica |
| 439 | 19312534 | 1265 | 0.000024 | com.buzzsprout |
| 440 | 19311940 | 863 | 0.000035 | gov.nps |
| 441 | 19309868 | 2426 | 0.000014 | au.com.theage |
| 442 | 19307472 | 933 | 0.000032 | com.aweber |
| 443 | 19306766 | 1557 | 0.000020 | blog.home |
| 444 | 19305448 | 848 | 0.000035 | gov.bls |
| 445 | 19305296 | 490 | 0.000062 | edu.nyu |
| 446 | 19304346 | 2087 | 0.000016 | com.oxforddictionaries |
| 447 | 19304074 | 1162 | 0.000025 | gov.nyc |
| 448 | 19303568 | 93 | 0.000356 | org.reactjs |
| 449 | 19302778 | 1382 | 0.000022 | au.com.news |
| 450 | 19300882 | 2291 | 0.000014 | sg.edu.nus |
| 451 | 19299900 | 1429 | 0.000021 | com.flipboard |
| 452 | 19299896 | 481 | 0.000063 | com.scorecardresearch |
| 453 | 19298010 | 2517 | 0.000013 | com.dummies |
| 454 | 19295840 | 2465 | 0.000013 | org.rsc |
| 455 | 19295472 | 1010 | 0.000029 | com.britannica |
| 456 | 19294984 | 714 | 0.000040 | gov.state |
| 457 | 19294216 | 1700 | 0.000018 | org.gutenberg |
| 458 | 19292892 | 3565 | 0.000010 | fm.ask |
| 459 | 19290866 | 2970 | 0.000011 | com.pearltrees |
| 460 | 19289990 | 793 | 0.000037 | com.zapier |
| 461 | 19286494 | 2562 | 0.000013 | com.mystrikingly |
| 462 | 19284092 | 876 | 0.000034 | com.cctv |
| 463 | 19283500 | 816 | 0.000036 | com.healthline |
| 464 | 19283044 | 1955 | 0.000017 | com.chrome |
| 465 | 19282638 | 1484 | 0.000020 | com.rt |
| 466 | 19282550 | 967 | 0.000031 | com.newsweek |
| 467 | 19280538 | 2362 | 0.000014 | com.biography |
| 468 | 19279646 | 1005 | 0.000029 | ch.google |
| 469 | 19270504 | 1412 | 0.000021 | com.ifttt |
| 470 | 19270238 | 1584 | 0.000019 | com.axios |
| 471 | 19270042 | 466 | 0.000065 | es.google |
| 472 | 19269658 | 882 | 0.000034 | au.gov.nsw |
| 473 | 19267444 | 3483 | 0.000010 | hk.edu.cuhk |
| 474 | 19267150 | 862 | 0.000035 | com.stitcher |
| 475 | 19267000 | 2520 | 0.000013 | com.boredpanda |
| 476 | 19265582 | 1192 | 0.000025 | fr.lemonde |
| 477 | 19263992 | 554 | 0.000053 | com.steampowered |
| 478 | 19263878 | 1055 | 0.000028 | org.jstor |
| 479 | 19262150 | 1335 | 0.000022 | org.imf |
| 480 | 19261918 | 873 | 0.000034 | com.venturebeat |
| 481 | 19261196 | 825 | 0.000036 | org.poynter |
| 482 | 19259574 | 1684 | 0.000018 | com.straitstimes |
| 483 | 19259452 | 3390 | 0.000010 | com.chosun |
| 484 | 19259322 | 1502 | 0.000020 | edu.asu |
| 485 | 19258762 | 2351 | 0.000014 | io.gitlab |
| 486 | 19256810 | 956 | 0.000031 | ru.google |
| 487 | 19255996 | 952 | 0.000031 | sg.com.google |
| 488 | 19253798 | 1331 | 0.000022 | uk.co.standard |
| 489 | 19252906 | 612 | 0.000048 | de.gesetze-im-internet |
| 490 | 19251516 | 948 | 0.000031 | gov.archives |
| 491 | 19250270 | 2385 | 0.000014 | th.co.google |
| 492 | 19249730 | 423 | 0.000069 | io.codepen |
| 493 | 19248930 | 3033 | 0.000011 | com.nola |
| 494 | 19248894 | 2023 | 0.000016 | edu.gmu |
| 495 | 19245246 | 2836 | 0.000012 | app.netlify |
| 496 | 19245158 | 1116 | 0.000026 | com.wikia |
| 497 | 19242656 | 1353 | 0.000022 | com.history |
| 498 | 19242160 | 1007 | 0.000029 | com.thelancet |
| 499 | 19241830 | 2918 | 0.000011 | com.coca-colacompany |
| 500 | 19240640 | 2654 | 0.000012 | google.ai |
| 501 | 19240600 | 856 | 0.000035 | com.freepik |
| 502 | 19240430 | 1548 | 0.000020 | com.buzzfeednews |
| 503 | 19238648 | 2894 | 0.000012 | org.cato |
| 504 | 19237700 | 431 | 0.000068 | net.datatables |
| 505 | 19237456 | 501 | 0.000060 | com.rackcdn |
| 506 | 19236168 | 1590 | 0.000019 | gov.supremecourt |
| 507 | 19233302 | 2534 | 0.000013 | edu.byu |
| 508 | 19233268 | 642 | 0.000046 | fr.amazon |
| 509 | 19232920 | 2872 | 0.000012 | tw.blogspot |
| 510 | 19231944 | 803 | 0.000037 | in.co.google |
| 511 | 19231530 | 1977 | 0.000017 | org.edx |
| 512 | 19231228 | 1309 | 0.000023 | com.tunein |
| 513 | 19231156 | 1779 | 0.000018 | org.ocks |
| 514 | 19230478 | 522 | 0.000057 | nl.google |
| 515 | 19228370 | 555 | 0.000053 | com.gmail |
| 516 | 19227068 | 2398 | 0.000014 | com.nationalpost |
| 517 | 19226910 | 1867 | 0.000017 | edu.ucsb |
| 518 | 19226418 | 2383 | 0.000014 | edu.nd |
| 519 | 19226392 | 1372 | 0.000022 | com.dw |
| 520 | 19226256 | 127 | 0.000262 | com.jimdo |
| 521 | 19225860 | 2412 | 0.000014 | no.uio |
| 522 | 19225400 | 1006 | 0.000029 | google.blog |
| 523 | 19222398 | 1409 | 0.000021 | cn.cntv |
| 524 | 19222164 | 3285 | 0.000011 | cn.org.china |
| 525 | 19221136 | 1639 | 0.000019 | org.unwomen |
| 526 | 19218950 | 946 | 0.000031 | com.airtable |
| 527 | 19217788 | 2510 | 0.000013 | edu.uoregon |
| 528 | 19215376 | 2172 | 0.000015 | org.britishcouncil |
| 529 | 19214674 | 2668 | 0.000012 | org.icrc |
| 530 | 19214462 | 951 | 0.000031 | com.gallup |
| 531 | 19213378 | 2265 | 0.000015 | ru.kremlin |
| 532 | 19212894 | 1332 | 0.000022 | com.globalsign |
| 533 | 19210850 | 875 | 0.000034 | gov.uspto |
| 534 | 19210492 | 959 | 0.000031 | edu.psu |
| 535 | 19210022 | 1509 | 0.000020 | com.penguinrandomhouse |
| 536 | 19209318 | 1345 | 0.000022 | com.netdna-cdn |
| 537 | 19208686 | 3269 | 0.000011 | is.archive |
| 538 | 19208344 | 1531 | 0.000020 | uk.ac.lse |
| 539 | 19207952 | 2503 | 0.000013 | fi.helsinki |
| 540 | 19207620 | 2042 | 0.000016 | edu.pitt |
| 541 | 19207236 | 2170 | 0.000015 | net.openid |
| 542 | 19206256 | 1155 | 0.000026 | edu.brookings |
| 543 | 19205290 | 786 | 0.000037 | com.imageshack |
| 544 | 19204770 | 172 | 0.000152 | com.npmjs |
| 545 | 19204486 | 3290 | 0.000011 | de.diplo |
| 546 | 19204380 | 1956 | 0.000017 | edu.unl |
| 547 | 19203832 | 1544 | 0.000020 | edu.georgetown |
| 548 | 19203210 | 2125 | 0.000015 | org.metmuseum |
| 549 | 19202750 | 1240 | 0.000024 | org.nejm |
| 550 | 19202244 | 726 | 0.000040 | com.adage |
| 551 | 19200434 | 1990 | 0.000017 | com.channel4 |
| 552 | 19200290 | 1511 | 0.000020 | com.findlaw |
| 553 | 19200030 | 2224 | 0.000015 | com.france24 |
| 554 | 19198938 | 282 | 0.000089 | net.php |
| 555 | 19198698 | 1784 | 0.000017 | com.csmonitor |
| 556 | 19197866 | 419 | 0.000069 | com.proofpoint |
| 557 | 19195320 | 192 | 0.000135 | com.iubenda |
| 558 | 19194372 | 1011 | 0.000029 | gov.treasury |
| 559 | 19194028 | 1708 | 0.000018 | com.euronews |
| 560 | 19191446 | 2286 | 0.000014 | com.thoughtco |
| 561 | 19190136 | 3742 | 0.000009 | com.doodlekit |
| 562 | 19189862 | 107 | 0.000320 | com.godaddy |
| 563 | 19189334 | 1298 | 0.000023 | edu.duke |
| 564 | 19188652 | 2071 | 0.000016 | com.foreignpolicy |
| 565 | 19185118 | 1996 | 0.000017 | org.documentcloud |
| 566 | 19183756 | 1300 | 0.000023 | com.livescience |
| 567 | 19183706 | 2508 | 0.000013 | com.upi |
| 568 | 19183104 | 2085 | 0.000016 | com.gq |
| 569 | 19182260 | 178 | 0.000148 | com.zendesk |
| 570 | 19182074 | 3020 | 0.000011 | com.authorstream |
| 571 | 19182074 | 3915 | 0.000009 | com.mysanantonio |
| 572 | 19181694 | 4133 | 0.000008 | tw.edu.sinica |
| 573 | 19177894 | 2719 | 0.000012 | org.wikisource |
| 574 | 19177382 | 2220 | 0.000015 | com.insider |
| 575 | 19177180 | 851 | 0.000035 | gov.nist |
| 576 | 19177000 | 1625 | 0.000019 | com.thestar |
| 577 | 19176642 | 181 | 0.000145 | jp.co.yahoo |
| 578 | 19174546 | 1304 | 0.000023 | au.com.smh |
| 579 | 19174028 | 2025 | 0.000016 | org.ncsl |
| 580 | 19173800 | 4252 | 0.000008 | hk.edu.cityu |
| 581 | 19173744 | 3349 | 0.000010 | com.sina |
| 582 | 19173108 | 2197 | 0.000015 | ie.independent |
| 583 | 19172266 | 2156 | 0.000015 | edu.uky |
| 584 | 19171704 | 96 | 0.000349 | me.ogp |
| 585 | 19170936 | 3413 | 0.000010 | uk.ac.sussex |
| 586 | 19170792 | 1755 | 0.000018 | gov.doc |
| 587 | 19170704 | 131 | 0.000250 | org.networkadvertising |
| 588 | 19169566 | 320 | 0.000080 | io.shields |
| 589 | 19168058 | 649 | 0.000045 | gov.usa |
| 590 | 19166990 | 4291 | 0.000008 | org.china-embassy |
| 591 | 19166810 | 3137 | 0.000011 | com.udn |
| 592 | 19163774 | 161 | 0.000166 | ru.mail |
| 593 | 19163712 | 3474 | 0.000010 | com.worldatlas |
| 594 | 19163522 | 505 | 0.000060 | com.netflix |
| 595 | 19163254 | 857 | 0.000035 | com.thinkwithgoogle |
| 596 | 19162356 | 1441 | 0.000021 | gov.defense |
| 597 | 19161952 | 1318 | 0.000023 | tw.com.google |
| 598 | 19160826 | 1604 | 0.000019 | org.hrw |
| 599 | 19159812 | 1495 | 0.000020 | com.asahi |
| 600 | 19159570 | 785 | 0.000037 | io.readthedocs |
| 601 | 19158768 | 2688 | 0.000012 | org.freedomhouse |
| 602 | 19158654 | 1413 | 0.000021 | tv.ustream |
| 603 | 19157822 | 893 | 0.000034 | org.mediawiki |
| 604 | 19156446 | 1715 | 0.000018 | org.pypi |
| 605 | 19151800 | 3028 | 0.000011 | org.adb |
| 606 | 19151406 | 2099 | 0.000016 | fr.leparisien |
| 607 | 19151152 | 2615 | 0.000013 | com.abc7news |
| 608 | 19150650 | 2063 | 0.000016 | com.voanews |
| 609 | 19150048 | 1019 | 0.000029 | com.pcmag |
| 610 | 19148698 | 447 | 0.000067 | org.nodejs |
| 611 | 19148554 | 4288 | 0.000008 | com.theundefeated |
| 612 | 19147816 | 3860 | 0.000009 | org.gephi |
| 613 | 19147176 | 1327 | 0.000023 | org.undp |
| 614 | 19146462 | 3277 | 0.000011 | org.iucnredlist |
| 615 | 19146454 | 2583 | 0.000013 | com.sacbee |
| 616 | 19146204 | 1594 | 0.000019 | com.treehugger |
| 617 | 19145608 | 2292 | 0.000014 | no.google |
| 618 | 19144462 | 2471 | 0.000013 | co.ello |
| 619 | 19143354 | 1986 | 0.000017 | com.msnbc |
| 620 | 19143354 | 252 | 0.000097 | com.myshopify |
| 621 | 19142810 | 981 | 0.000030 | uk.parliament |
| 622 | 19142520 | 2287 | 0.000014 | co.pcdn |
| 623 | 19141942 | 1255 | 0.000024 | gov.uscourts |
| 624 | 19141896 | 1422 | 0.000021 | co.lpages |
| 625 | 19140780 | 2344 | 0.000014 | org.fas |
| 626 | 19139768 | 781 | 0.000037 | com.intel |
| 627 | 19138740 | 807 | 0.000036 | com.marketwatch |
| 628 | 19136914 | 2047 | 0.000016 | com.infogram |
| 629 | 19133848 | 2538 | 0.000013 | com.sputniknews |
| 630 | 19133704 | 2430 | 0.000014 | ie.google |
| 631 | 19132582 | 1344 | 0.000022 | se.google |
| 632 | 19131798 | 990 | 0.000030 | com.netlify |
| 633 | 19131000 | 925 | 0.000032 | com.jekyllrb |
| 634 | 19130612 | 3055 | 0.000011 | int.interpol |
| 635 | 19130308 | 524 | 0.000056 | fr.free |
| 636 | 19130180 | 1198 | 0.000025 | be.google |
| 637 | 19129750 | 1575 | 0.000019 | uk.co.huffingtonpost |
| 638 | 19129310 | 2323 | 0.000014 | ly.rebrand |
| 639 | 19129104 | 1504 | 0.000020 | link.page |
| 640 | 19128704 | 1794 | 0.000017 | com.sched |
| 641 | 19127724 | 2218 | 0.000015 | jp.co.japantimes |
| 642 | 19127254 | 2829 | 0.000012 | org.tigris |
| 643 | 19127152 | 2839 | 0.000012 | org.pri |
| 644 | 19127006 | 2319 | 0.000014 | nz.co.nzherald |
| 645 | 19125622 | 1204 | 0.000025 | at.google |
| 646 | 19125464 | 5292 | 0.000007 | org.arkive |
| 647 | 19125326 | 222 | 0.000116 | com.salesforce |
| 648 | 19123296 | 650 | 0.000045 | br.com.uol |
| 649 | 19121018 | 4242 | 0.000008 | kr.co.kbs |
| 650 | 19119374 | 1665 | 0.000018 | com.thebalance |
| 651 | 19119126 | 1455 | 0.000021 | org.oxfordjournals |
| 652 | 19118638 | 3738 | 0.000009 | com.encyclopedia |
| 653 | 19117262 | 2204 | 0.000015 | org.eji |
| 654 | 19116506 | 2818 | 0.000012 | org.heritage |
| 655 | 19116298 | 2371 | 0.000014 | com.popsci |
| 656 | 19114518 | 2199 | 0.000015 | com.snopes |
| 657 | 19114098 | 2601 | 0.000013 | org.oas |
| 658 | 19113348 | 156 | 0.000174 | com.aspnetcdn |
| 659 | 19112712 | 1031 | 0.000028 | org.ilo |
| 660 | 19109654 | 2263 | 0.000015 | com.insidehighered |
| 661 | 19108980 | 1587 | 0.000019 | gov.usembassy |
| 662 | 19108932 | 1622 | 0.000019 | dk.google |
| 663 | 19108040 | 3392 | 0.000010 | org.jenkins-ci |
| 664 | 19107388 | 2827 | 0.000012 | org.project-syndicate |
| 665 | 19106556 | 1963 | 0.000017 | com.justia |
| 666 | 19104120 | 1563 | 0.000019 | gov.govinfo |
| 667 | 19103152 | 1699 | 0.000018 | com.firebaseapp |
| 668 | 19102068 | 2093 | 0.000016 | edu.uga |
| 669 | 19102028 | 3678 | 0.000010 | edu.wm |
| 670 | 19101614 | 3284 | 0.000011 | com.cgtn |
| 671 | 19101596 | 1881 | 0.000017 | org.worldcat |
| 672 | 19101226 | 900 | 0.000033 | com.zoho |
| 673 | 19100590 | 392 | 0.000074 | com.atlassian |
| 674 | 19100290 | 2676 | 0.000012 | org.transparency |
| 675 | 19099776 | 1317 | 0.000023 | org.aarp |
| 676 | 19099686 | 1675 | 0.000018 | org.americanbar |
| 677 | 19099164 | 2239 | 0.000015 | com.timeshighereducation |
| 678 | 19097964 | 3270 | 0.000011 | com.pastemagazine |
| 679 | 19095902 | 2598 | 0.000013 | org.csis |
| 680 | 19094342 | 629 | 0.000047 | com.samsung |
| 681 | 19094058 | 774 | 0.000038 | com.pexels |
| 682 | 19093374 | 1964 | 0.000017 | com.washingtontimes |
| 683 | 19092714 | 2016 | 0.000016 | gov.usaid |
| 684 | 19090166 | 1334 | 0.000022 | org.heart |
| 685 | 19088764 | 191 | 0.000136 | com.automattic |
| 686 | 19088428 | 865 | 0.000035 | com.verisign |
| 687 | 19087660 | 2108 | 0.000016 | com.motherjones |
| 688 | 19087034 | 2944 | 0.000011 | org.vim |
| 689 | 19086498 | 2062 | 0.000016 | edu.nap |
| 690 | 19086172 | 924 | 0.000032 | com.webs |
| 691 | 19084778 | 1593 | 0.000019 | org.amnesty |
| 692 | 19084344 | 2101 | 0.000016 | ua.com.google |
| 693 | 19083552 | 3988 | 0.000009 | org.globalnetworkinitiative |
| 694 | 19083196 | 2546 | 0.000013 | org.globalcitizen |
| 695 | 19082500 | 1754 | 0.000018 | com.surveygizmo |
| 696 | 19082058 | 2262 | 0.000015 | org.wbur |
| 697 | 19081048 | 2353 | 0.000014 | uk.gov.companieshouse |
| 698 | 19080398 | 2468 | 0.000013 | jp.mainichi |
| 699 | 19080286 | 3181 | 0.000011 | com.podomatic |
| 700 | 19078116 | 1751 | 0.000018 | org.unhcr |
| 701 | 19076276 | 2118 | 0.000016 | ca.ctvnews |
| 702 | 19075310 | 2565 | 0.000013 | uk.co.bbci |
| 703 | 19073812 | 968 | 0.000031 | uk.gov.legislation |
| 704 | 19071522 | 2681 | 0.000012 | com.nationalreview |
| 705 | 19070832 | 2523 | 0.000013 | com.cleveland |
| 706 | 19070474 | 3814 | 0.000009 | org.neocities |
| 707 | 19069884 | 1073 | 0.000027 | ly.snip |
| 708 | 19068864 | 438 | 0.000067 | com.herokuapp |
| 709 | 19068510 | 656 | 0.000045 | com.oreilly |
| 710 | 19066730 | 1154 | 0.000026 | cz.google |
| 711 | 19066464 | 2164 | 0.000015 | org.nrdc |
| 712 | 19065768 | 2671 | 0.000012 | org.thinkprogress |
| 713 | 19065654 | 1795 | 0.000017 | ca.globalnews |
| 714 | 19065106 | 270 | 0.000093 | jp.co.amazon |
| 715 | 19062840 | 1328 | 0.000023 | org.altervista |
| 716 | 19061732 | 3119 | 0.000011 | uk.ac.nottingham |
| 717 | 19061168 | 1267 | 0.000024 | uk.gov.nationalarchives |
| 718 | 19060934 | 2106 | 0.000016 | au.edu.anu |
| 719 | 19060236 | 3035 | 0.000011 | com.intensedebate |
| 720 | 19060102 | 2734 | 0.000012 | de.hu-berlin |
| 721 | 19059802 | 736 | 0.000039 | com.airbnb |
| 722 | 19059800 | 2326 | 0.000014 | de.auswaertiges-amt |
| 723 | 19059376 | 2316 | 0.000014 | nz.co.google |
| 724 | 19059170 | 2672 | 0.000012 | org.unenvironment |
| 725 | 19058978 | 3132 | 0.000011 | org.rsf |
| 726 | 19057932 | 4110 | 0.000008 | com.koreaherald |
| 727 | 19057778 | 1960 | 0.000017 | org.pewtrusts |
| 728 | 19057678 | 2867 | 0.000012 | com.techinasia |
| 729 | 19057488 | 2276 | 0.000014 | com.thecut |
| 730 | 19056174 | 3700 | 0.000009 | com.viki |
| 731 | 19056068 | 2724 | 0.000012 | org.gnupg |
| 732 | 19054590 | 2469 | 0.000013 | ro.google |
| 733 | 19054394 | 2057 | 0.000016 | edu.gwu |
| 734 | 19054116 | 3057 | 0.000011 | com.bangkokpost |
| 735 | 19053626 | 2572 | 0.000013 | fr.rfi |
| 736 | 19052868 | 414 | 0.000070 | com.pubmatic |
| 737 | 19051906 | 2309 | 0.000014 | com.tutsplus |
| 738 | 19051648 | 1079 | 0.000027 | tr.com.google |
| 739 | 19051516 | 248 | 0.000098 | com.getbootstrap |
| 740 | 19050908 | 4424 | 0.000008 | com.wonderhowto |
| 741 | 19050626 | 3619 | 0.000010 | com.upworthy |
| 742 | 19050496 | 2883 | 0.000012 | org.sonatype |
| 743 | 19050382 | 288 | 0.000087 | com.typeform |
| 744 | 19049574 | 2806 | 0.000012 | il.co.google |
| 745 | 19049384 | 2739 | 0.000012 | uk.ac.leeds |
| 746 | 19048116 | 201 | 0.000127 | to.amzn |
| 747 | 19047986 | 2703 | 0.000012 | vn.com.google |
| 748 | 19047578 | 274 | 0.000092 | com.surveymonkey |
| 749 | 19047380 | 922 | 0.000032 | int.wipo |
| 750 | 19046288 | 1057 | 0.000028 | com.gizmodo |
| 751 | 19046144 | 874 | 0.000034 | com.box |
| 752 | 19045578 | 2298 | 0.000014 | com.oregonlive |
| 753 | 19044916 | 547 | 0.000054 | gg.discord |
| 754 | 19044444 | 3356 | 0.000010 | com.theepochtimes |
| 755 | 19044400 | 2480 | 0.000013 | ar.com.google |
| 756 | 19044144 | 2943 | 0.000011 | bg.google |
| 757 | 19043632 | 2061 | 0.000016 | com.squarespace-cdn |
| 758 | 19043400 | 3479 | 0.000010 | io.soup |
| 759 | 19042778 | 2545 | 0.000013 | com.webbyawards |
| 760 | 19042384 | 2744 | 0.000012 | io.fabric |
| 761 | 19042298 | 1588 | 0.000019 | com.speakerdeck |
| 762 | 19041684 | 136 | 0.000232 | info.aboutads |
| 763 | 19040606 | 907 | 0.000033 | com.docker |
| 764 | 19038814 | 1817 | 0.000017 | com.miamiherald |
| 765 | 19037924 | 3191 | 0.000011 | ph.com.google |
| 766 | 19037762 | 2463 | 0.000013 | com.channelnewsasia |
| 767 | 19037556 | 3198 | 0.000011 | uk.co.vogue |
| 768 | 19037554 | 2619 | 0.000013 | edu.fsu |
| 769 | 19035870 | 485 | 0.000063 | com.staticflickr |
| 770 | 19035284 | 2495 | 0.000013 | za.co.google |
| 771 | 19033678 | 2696 | 0.000012 | com.thejakartapost |
| 772 | 19032442 | 1236 | 0.000024 | edu.ucsd |
| 773 | 19032258 | 487 | 0.000062 | com.fc2 |
| 774 | 19032038 | 5415 | 0.000007 | com.armorgames |
| 775 | 19031944 | 2155 | 0.000015 | fi.google |
| 776 | 19031234 | 3885 | 0.000009 | com.alamy |
| 777 | 19030868 | 2221 | 0.000015 | id.co.google |
| 778 | 19030462 | 2794 | 0.000012 | com.rd |
| 779 | 19029712 | 2951 | 0.000011 | com.cartodb |
| 780 | 19029584 | 2092 | 0.000016 | com.newrepublic |
| 781 | 19029348 | 3436 | 0.000010 | com.benzinga |
| 782 | 19028364 | 661 | 0.000044 | com.entrepreneur |
| 783 | 19027960 | 5376 | 0.000007 | org.gwtproject |
| 784 | 19026660 | 2988 | 0.000011 | com.sciencealert |
7851902653827630.000012org.iaea
7861902640223760.000014com.thenation
7871902369234110.000010si.google
7881902304624000.000014pt.google
7891902012429650.000011au.gov.nla
7901901983835130.000010com.dailykos
791190197564940.000061com.aol
7921901912825190.000013edu.emory
7931901901235730.000010com.inhabitat
7941901895634150.000010uk.ac.soas
795190184026660.000044com.deloitte
7961901823011850.000025com.today
797190168389780.000030com.windowsphone
7981901618636590.000010org.cpj
7991901616421190.000016kr.co.google
8001901590629810.000011se.lu
8011901578027740.000012org.cfr
802190148564290.000068me.fb
8031901367832880.000011com.joins
8041901298042640.000008sa.com.google
8051901287828140.000012com.politifact
806190122929640.000031com.alexa
8071901144241310.000008edu.utm
8081901106827350.000012com.law360
809190105469830.000030com.engadget
8101900866235830.000010hr.google
8111900853821460.000015hu.google
812190068606310.000047fm.last
8131900654024760.000013eu.politico
8141900624840470.000009com.chinatimes
8151900611625210.000013mx.com.google
8161900606031410.000011com.jezebel
8171900594238680.000009com.iconarchive
8181900531834710.000010com.ogilvy
8191900486623990.000014gr.google
8201900408628160.000012com.monday
8211900325227380.000012com.digitaljournal
8221900324831490.000011com.nyt
8231900322033000.000011audio.breaker
8241900264028230.000012uk.co.guim
825190023846250.000047com.cisco
8261900203833910.000010cn.globaltimes
8271900180826480.000012com.instructure
8281900064633210.000011com.crashlytics
8291899972027230.000012au.com.businessinsider
8301899933834300.000010org.grist
8311899828012090.000025com.pastebin
832189981183150.000082ai.shortpixel
8331899807839900.000009org.constitutioncenter
8341899796048420.000007jp.hatenadiary
8351899678037700.000009edu.ttu
8361899607629970.000011uk.ac.york
8371899593616710.000018com.eater
83818995084900.000364com.livestream
8391899503627720.000012com.bepress
8401899475228980.000012org.wri
8411899226220430.000016my.com.thestar
8421899112237750.000009com.minds
8431899059223520.000014mp.j
8441899057037080.000009app.web
8451899006234100.000010org.carnegieendowment
8461898978636450.000010tr.com.aa
847189894187110.000041gov.sec
8481898774638120.000009com.hyperallergic
8491898728234080.000010com.foreignaffairs
8501898664037970.000009au.edu.uts
851189853924700.000064com.fastcompany
8521898503235600.000010org.hypotheses
8531898446838960.000009com.japantoday
8541898275235070.000010edu.wayne
8551898204837130.000009uk.ac.kent
8561898198836970.000009rs.google
8571898053240710.000009org.sourcewatch
858189793668320.000036com.symantec
8591897842425390.000013fr.paris
8601897799629420.000011com.prweek
8611897790217650.000018ch.ipcc
8621897696022170.000015com.kinstacdn
8631897626210460.000028edu.cmu
8641897546220390.000016int.unfccc
8651897506241960.000008eg.com.google
8661897480431800.000011org.nationalgeographic
8671897454826430.000013gov.doi
8681897394034060.000010de.uni-frankfurt
8691897349442430.000008by.google
8701897202250500.000007com.symbaloo
8711897101034170.000010nl.wur
8721896995023280.000014org.unodc
8731896843015990.000019com.routledge
8741896841245090.000008com.ipsos-mori
8751896696236580.000010ae.google
8761896615244820.000008com.etymonline
8771896588849820.000007build.bazel
8781896556633200.000011org.brainpickings
8791896454431430.000011com.scotsman
8801896379642950.000008com.oilprice
8811896338035970.000010uk.ac.westminster
8821896326645450.000008lk.google
8831896257612600.000024fr.blogspot
8841896136034120.000010org.rferl
8851896131031730.000011org.epi
8861895990041150.000008lv.google
8871895981239090.000009au.edu.griffith
8881895942242190.000008kr.ac.snu
8891895728013120.000023com.upwork
8901895707624360.000014com.html5rocks
8911895671454930.000007me.nimbusweb
8921895650229400.000011fr.archives-ouvertes
8931895639842930.000008com.delawareonline
8941895546217920.000017ru.rbc
895189549687450.000039com.gartner
8961895493011270.000026edu.utexas
8971895364225260.000013net.noscript
8981895346627170.000012ae.thenational
8991895333633800.000010com.study
900189530924270.000068com.hp
9011895307436410.000010uk.co.spectator
9021895276238690.000009com.cleantechnica
9031895220828030.000012org.unctad
9041895120042550.000008com.teslamotors
9051895011816140.000019com.billboard
9061894936630740.000011com.theculturetrip
9071894789624540.000013com.multiscreensite
908189477387040.000041com.visualstudio
9091894758839850.000009uk.ac.plymouth
9101894745426600.000012sk.google
9111894731238110.000009net.aljazeera
9121894711024130.000014com.theintercept
9131894655634210.000010uk.ac.exeter
9141894649433320.000010social.mastodon
9151894587628280.000012com.euractiv
9161894586436350.000010com.db
9171894273644470.000008org.mises
9181894231646800.000008ng.com.google
9191894201627950.000012org.panda
9201894162224660.000013uk.gov.justice
9211894143056020.000007net.chinadialogue
9221894092441180.000008cat.uab
9231894074642270.000008com.spokesman
9241894008235230.000010co.com.google
9251893923044730.000008lu.google
9261893899641890.000008pe.com.google
9271893861833660.000010com.nybooks
9281893860643810.000008uk.ac.core
9291893820622280.000015com.termsfeed
9301893819416690.000018com.pcworld
9311893811238460.000009kr.co.yna
9321893800247930.000007com.gust
9331893778838800.000009org.cgiar
9341893730042310.000008pk.com.google
9351893653035750.000010net.inquirer
9361893600830830.000011ru.lenta
9371893400014680.000020com.nokia
9381893367629320.000011tw.com.pchome
9391893349612230.000024com.ycombinator
9401893335029110.000011nl.volkskrant
94118933194780.000411com.oculus
9421893261234550.000010cl.google
9431893186239490.000009org.polymer-project
9441893088826370.000013com.washingtonexaminer
9451893062239450.000009sk.sme
9461893053433890.000010edu.monash
947189300869180.000032com.canva
948189295524540.000066org.opensource
9491892939839770.000009com.rappler
9501892863040000.000009org.plan-international
9511892651845610.000008cr.co.google
9521892641235870.000010lt.google
9531892583238100.000009ca.macleans
954189256468170.000036net.adform
9551892504648730.000007com.blogto
9561892495235080.000010uk.ac.nhm
9571892492832110.000011edu.ua
9581892355428150.000012com.articulate
959189232882490.000098com.sxsw
9601892286639930.000009org.wilsoncenter
9611892267640820.000009edu.lehigh
962189223364170.000070com.skype
9631892154646990.000008com.out
9641892071410850.000027com.redhat
9651892068032660.000011my.com.google
9661891906420310.000016gov.ecfr
9671891890045850.000008org.nsidc
968189187784120.000070net.secureservercdn
9691891811245360.000008kz.google
9701891759032950.000011org.osce
971189175625570.000053org.whatwg
9721891741840960.000009com.wsoctv
9731891738025870.000013uk.org.nationaltrust
9741891722032010.000011uk.gov.london
9751891704819730.000017scot.gov
9761891698238650.000009uk.ac.qub
9771891646038070.000009com.governing
978189164305280.000056com.businesswire
9791891630022530.000015wales.gov
9801891506634220.000010com.afp
9811891498230800.000011uk.ac.qmul
9821891487851540.000007com.ingress
9831891454045960.000008com.webcindario
9841891431634020.000010org.psychiatryonline
9851891323041480.000008org.marxists
9861891309640730.000009me.thinglink
9871891297016600.000018com.css-tricks
9881891285847320.000008ie.nuigalway
9891891251443480.000008com.asiaone
9901891236833540.000010com.kaspersky-labs
9911891211012490.000024com.smashingmagazine
9921891206437870.000009org.nationalinterest
993189118485560.000053com.adweek
9941891143644980.000008ec.com.google
9951891140447220.000008bd.com.google
9961891000648460.000007uy.com.google
9971890999842330.000008com.match
9981890974640210.000009ee.google
9991890968839620.000009com.adn
10001890947443100.000008com.wnd

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

September 2020 crawl archive now available

The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-40/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-40/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-40/warc.paths.gz | 79600 | 81.8 |
| WAT files | CC-MAIN-2020-40/wat.paths.gz | 79600 | 23.14 |
| WET files | CC-MAIN-2020-40/wet.paths.gz | 79600 | 10.28 |
| Robots.txt files | CC-MAIN-2020-40/robotstxt.paths.gz | 79600 | 0.22 |
| Non-200 responses files | CC-MAIN-2020-40/non200responses.paths.gz | 79600 | 2.36 |
| URL index files | CC-MAIN-2020-40/cc-index.paths.gz | 302 | 0.27 |
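
As a rough illustration of the prefixing step described above, the following Python sketch decompresses a path listing and turns each line into a full URL. The helper name `to_urls` and the two sample paths are made up for this example; only the two prefixes are taken from the text.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_urls(paths_gz, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and prepend `prefix` to every non-empty line."""
    with gzip.open(io.BytesIO(paths_gz), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical listing content, used only to exercise the function.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2020-40/example-1.warc.gz\n"
    b"crawl-data/CC-MAIN-2020-40/example-2.warc.gz\n"
)
urls = to_urls(sample)
```

Passing `S3_PREFIX` as the second argument yields the S3 form of the same paths.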

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-40/. Also the columnar index has been updated to contain this crawl.
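
For readers who prefer to query the URL index programmatically, here is a minimal Python sketch that only builds a query URL and performs no network request. It assumes the index server exposes a per-crawl endpoint named `CC-MAIN-2020-40-index` that accepts `url` and `output` query parameters; the endpoint name, the parameters, and the helper `index_query_url` are assumptions of this example and are not stated in this post.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"


def index_query_url(crawl_id, url_pattern, output="json"):
    """Build a query URL for one crawl (assumed '<crawl_id>-index' endpoint)."""
    query = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"


# example.com is a placeholder domain, not a recommendation of what to query.
query_url = index_query_url("CC-MAIN-2020-40", "example.com/*")
```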

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

August 2020 crawl archive now available

The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th. It includes page captures of 940 million URLs unknown in any of our prior crawl archives.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-34/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-34/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-34/warc.paths.gz | 60000 | 48.9 |
| WAT files | CC-MAIN-2020-34/wat.paths.gz | 60000 | 16.9 |
| WET files | CC-MAIN-2020-34/wet.paths.gz | 60000 | 7.56 |
| Robots.txt files | CC-MAIN-2020-34/robotstxt.paths.gz | 60000 | 0.19 |
| Non-200 responses files | CC-MAIN-2020-34/non200responses.paths.gz | 60000 | 1.94 |
| URL index files | CC-MAIN-2020-34/cc-index.paths.gz | 302 | 0.19 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-34/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

July 2020 crawl archive now available

The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th. It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives.

Bug Fixes and Improvements

The URL index fields "redirect" and "mime" were not filled when the corresponding HTTP headers Location and Content-Type were written in lower-case letters or in any other variant that did not match the expected case. This bug was detected during the crawl and fixed for 90 out of 100 segments. It also affects the columnar index, namely the fields "fetch_redirect" and "content_mime_type", respectively. To a minor extent, it may also affect the detection of character set and content language, because the value of the Content-Type header is used as an additional hint for the detection. Additional information about this bug fix is given in the corresponding issue report.
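
To illustrate the class of bug described above, here is a small Python sketch contrasting a case-sensitive header lookup with a case-insensitive one. It is not the crawler's actual code; the helper `get_header` and the sample headers are hypothetical.

```python
def get_header(headers, name, default=None):
    """Return the value of the first header whose name matches `name`, ignoring case."""
    wanted = name.lower()
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return default


# Hypothetical response headers, written in lower case by the server.
headers = [
    ("content-type", "text/html; charset=utf-8"),
    ("location", "https://example.com/new"),
]

exact = dict(headers).get("Content-Type")      # case-sensitive lookup misses the header
relaxed = get_header(headers, "Content-Type")  # case-insensitive lookup finds it
```

HTTP header names are case-insensitive, so a lookup should normalize the name before comparing, as `get_header` does.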

Archive Location and Download

The July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-29/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-29/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-29/warc.paths.gz | 60000 | 62.64 |
| WAT files | CC-MAIN-2020-29/wat.paths.gz | 60000 | 22.23 |
| WET files | CC-MAIN-2020-29/wet.paths.gz | 60000 | 9.87 |
| Robots.txt files | CC-MAIN-2020-29/robotstxt.paths.gz | 60000 | 0.21 |
| Non-200 responses files | CC-MAIN-2020-29/non200responses.paths.gz | 60000 | 2.52 |
| URL index files | CC-MAIN-2020-29/cc-index.paths.gz | 302 | 0.24 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-29/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Feb/Mar/May 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The host-level graph now includes hosts visited by the crawler but not linking to any other host. How is this possible? Isn’t every host found via links the crawler follows? Yes, but some links were detected in a prior crawl rather than in one of the 3 crawls used to build the web graphs. More details about the issue are given in cc-pyspark#15. The impact of this fix on the graph size is minimal: the recent crawl now includes 1 million nodes (0.1% of all nodes) which are not connected to any other node.

Host-level graph

The graph consists of 927 million nodes and 3.88 billion edges and includes dangling nodes, i.e. hosts that have not been crawled but are pointed to from a link on a crawled page. There are 857 million dangling nodes (92.5%), and the largest strongly connected component contains 47 million (5.1%) nodes.
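
As a small illustration of the dangling-node notion used above, the following Python sketch finds nodes that are linked to but have no outgoing links in an edge list. It assumes that "dangling" can be read as "appears as a link target but never as a link source"; the function name and the toy edge list are hypothetical and unrelated to the released graph.

```python
def dangling_nodes(edges):
    """Return nodes that occur as a link target but never as a link source."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return targets - sources


# Toy edge list of (from_id, to_id) pairs; node 3 is linked to but has no outlinks.
edges = [(0, 1), (1, 2), (0, 3), (2, 0)]
dangling = dangling_nodes(edges)
```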

You can download the graph and the ranks of all 927 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/May 2020 host-level webgraph

| Size | File | Description |
| 5.67 GB | cc-main-2020-feb-mar-may-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files |
| 17.26 GB | cc-main-2020-feb-mar-may-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files |
| 7.40 GB | cc-main-2020-feb-mar-may-host.graph | graph in BVGraph format |
| 2 kB | cc-main-2020-feb-mar-may-host.properties | |
| 8.57 GB | cc-main-2020-feb-mar-may-host-t.graph | transpose of the graph (outlinks inverted to inlinks) |
| 2 kB | cc-main-2020-feb-mar-may-host-t.properties | |
| 1 kB | cc-main-2020-feb-mar-may-host.stats | WebGraph statistics |
| 12.16 GB | cc-main-2020-feb-mar-may-host-ranks.txt.gz | harmonic centrality and pagerank |

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
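
The reversal described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not the code used to build the graphs; the function names are made up for the example.

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed name back into the usual order (a stripped 'www.' is not restored)."""
    return ".".join(reversed(rev_host.split(".")))


reversed_name = reverse_host("www.subdomain.example.com")
```

Reversed names sort so that all hosts under the same domain end up next to each other, which is convenient when scanning the vertices files.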

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
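
A minimal sketch of this kind of aggregation, written in Python, is shown below. It uses a toy suffix set instead of the real public suffix list, works on reversed host names, and drops links between hosts of the same domain; the toy suffixes, the function names, and the handling of self-links are all assumptions of this example rather than a description of the actual pipeline.

```python
# Toy stand-in for the public suffix list, in reversed notation. The real list on
# publicsuffix.org is far larger and also has wildcard and exception rules.
TOY_SUFFIXES = {"com", "org", "uk", "uk.co"}


def pay_level_domain(rev_host, suffixes=TOY_SUFFIXES):
    """Return the reversed pay-level domain: longest matching suffix plus one label."""
    labels = rev_host.split(".")
    longest = 0
    for n in range(1, len(labels) + 1):
        if ".".join(labels[:n]) in suffixes:
            longest = n
    if longest == 0 or longest == len(labels):
        return None  # unknown suffix, or the host is itself a public suffix
    return ".".join(labels[: longest + 1])


def aggregate_edges(host_edges, suffixes=TOY_SUFFIXES):
    """Map host-level edges to a set of domain-level edges, dropping self-links."""
    domain_edges = set()
    for src, dst in host_edges:
        src_dom = pay_level_domain(src, suffixes)
        dst_dom = pay_level_domain(dst, suffixes)
        if src_dom and dst_dom and src_dom != dst_dom:
            domain_edges.add((src_dom, dst_dom))
    return domain_edges


# Hypothetical host-level edges between reversed host names.
host_edges = [
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example.shop", "uk.co.bbc"),
    ("com.example.blog", "com.example.shop"),
    ("org.wikipedia.en", "com.example"),
]
domain_edges = aggregate_edges(host_edges)
```

In this toy input, the two links from example.com hosts to bbc.co.uk hosts collapse into a single domain-level edge.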

The domain-level graph has 91 million nodes and 1.96 billion edges. 51% or 46 million nodes are dangling nodes; the largest strongly connected component covers 36 million or 39% of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/.

Download files of the Common Crawl Feb/Mar/May 2020 domain-level webgraph

| Size | File | Description |
| 0.62 GB | cc-main-2020-feb-mar-may-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩ |
| 7.79 GB | cc-main-2020-feb-mar-may-domain-edges.txt.gz | edges ⟨from_id, to_id⟩ |
| 4.23 GB | cc-main-2020-feb-mar-may-domain.graph | graph in BVGraph format |
| 2 kB | cc-main-2020-feb-mar-may-domain.properties | |
| 4.16 GB | cc-main-2020-feb-mar-may-domain-t.graph | transpose of the graph |
| 2 kB | cc-main-2020-feb-mar-may-domain-t.properties | |
| 1 kB | cc-main-2020-feb-mar-may-domain.stats | WebGraph statistics |
| 1.96 GB | cc-main-2020-feb-mar-may-domain-ranks.txt.gz | harmonic centrality and pagerank |
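
For working with the ranks file, a minimal Python sketch of a line parser might look as follows. It assumes the decompressed file is tab-separated, that lines starting with `#` are header lines, and that the first five columns follow the order used in the top-1000 table in this post (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name). That layout and the names `RankRow` and `parse_rank_line` are assumptions of this example and should be checked against the actual file.

```python
from typing import NamedTuple, Optional


class RankRow(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str


def parse_rank_line(line: str, sep: str = "\t") -> Optional[RankRow]:
    """Parse one line of the ranks file; return None for header or empty lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split(sep)
    return RankRow(
        int(fields[0]), float(fields[1]), int(fields[2]), float(fields[3]), fields[4]
    )


# Sample line built from the first row of the top-1000 table in this post.
row = parse_rank_line("1\t32667618\t1\t0.018180\tcom.googleapis\n")
```

Any additional trailing columns are ignored by this parser, since it only reads the first five fields.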

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Feb/Mar/May 2020)

| harmonic centrality rank | hc value | page rank | page rank value | reversed hostname |
13266761810.018180com.googleapis
23055277230.011873com.facebook
32956908820.013789com.google
42692046040.007145com.twitter
52688312850.007106org.w
62636044860.006483com.youtube
72471939690.004210com.instagram
82425194280.005125org.gmpg
92384133270.005329com.googletagmanager
1023606890130.002940com.linkedin
1122741292100.003621com.cloudflare
1222732960120.002974org.wordpress
1322661910140.002515com.gravatar
1422577680150.002438com.gstatic
1522378134220.001529com.pinterest
1622196962270.001192org.wikipedia
1722189650190.001864com.wordpress
1822066028160.002404com.bootstrapcdn
1921967760180.001884com.apple
2021751768200.001863com.jquery
2121589606240.001461com.microsoft
2221568908440.000785be.youtu
2321568474430.000806com.blogspot
2421533280310.001104com.vimeo
2521415938460.000761gl.goo
2621399120350.001040com.amazonaws
2721358048530.000665com.amazon
2821331634210.001737com.adobe
2921324666230.001506com.wp
3021209012700.000452com.tumblr
3121184360170.001949com.github
3221150652370.001008com.google-analytics
3321110976300.001152com.baidu
3421096692870.000387com.yahoo
3521081268590.000547ly.bit
3621060360330.001072com.macromedia
3721046916360.001035net.cloudfront
3821036258450.000763com.flickr
3920997926320.001101com.googlesyndication
4020993476260.001277me.wp
4120980462970.000340com.googleusercontent
4220966446560.000624eu.europa
4320960242420.000807net.jsdelivr
4420959910520.000677co.t
4520901872290.001163ru.yandex
4620846092500.000742net.doubleclick
4720843032410.000869com.addthis
4820823518690.000457io.github
4920817952760.000433com.medium
5020810030250.001287com.fontawesome
51208091201390.000189com.forbes
5220796434610.000510org.w3
5320759102550.000640com.paypal
54207572661090.000282com.soundcloud
5520754514900.000368org.creativecommons
5620747472570.000619com.vk
5720711184540.000658org.mozilla
5820710182880.000382com.weebly
5920698442840.000410com.wix
60206753721020.000317com.weibo
6120663930580.000604org.schema
62206502021640.000151com.imgur
63206444521470.000177org.apache
64206422821780.000138uk.co.bbc
65206255601290.000210org.archive
66206103542740.000089com.ibm
67206096141540.000169com.bing
68206023801910.000125net.sourceforge
69205790121300.000207com.nytimes
70205786261500.000174int.who
71205710121830.000131com.cnn
72205616741740.000140net.slideshare
73205476341580.000164gov.cdc
74205425462020.000116com.android
75205272302280.000104com.wsj
76205185481940.000122edu.stanford
77205055462050.000115com.businessinsider
78204950342540.000095com.oracle
7920489434340.001049net.fbcdn
80204888683730.000067com.msn
81204882822610.000093edu.harvard
82204833843100.000080com.go
8320478152990.000335com.shopify
84204714242670.000093com.bbc
85204644342970.000083edu.mit
86204613403300.000076com.myspace
8720458776620.000497com.whatsapp
88204572062890.000085com.appspot
89204544663070.000080com.wired
90204463002920.000085com.reuters
91204420041010.000323com.godaddy
92204355501710.000147com.theguardian
93204177701430.000182gov.nih
94204125361960.000120org.ietf
95204013303880.000065gov.nasa
96203972984230.000061com.theverge
97203947361490.000175com.giphy
98203942763820.000066net.researchgate
99203849302700.000092com.bloomberg
100203777781080.000285com.unpkg
101203763941140.000271com.reddit
102203738563370.000075com.xinhuanet
103203667362150.000108org.gnu
104203635063180.000079com.usatoday
105203526608130.000037org.chromium
106203449963560.000071com.springer
10720343678980.000335de.google
10820342420280.001184com.qq
109203418243450.000073com.example
110203365107440.000041edu.psu
111203245364680.000055edu.cornell
112203243781840.000131com.blogger
11320314024600.000516net.akamaihd
114203042423750.000067org.hbr
115203023107500.000040com.git-scm
116203000149370.000032com.wikia
117202985461370.000191com.spotify
118202960124850.000053edu.yale
119202955161130.000271com.jimdo
120202931405540.000047com.cbsnews
121202919467170.000043com.economist
122202905742140.000109com.washingtonpost
123202885041400.000188jp.co.yahoo
124202864702850.000086com.huffingtonpost
125202845583160.000080org.un
126202818744100.000063fr.free
127202799464730.000054edu.berkeley
128202754462870.000086com.cnbc
129202732802450.000099com.dribbble
130202715845760.000046org.arxiv
131202697161510.000172com.issuu
132202570385450.000047com.mysql
133202562621600.000157com.twimg
134202525321070.000285com.statcounter
135202516823380.000075uk.co.telegraph
136202474783050.000081com.w3schools
137202466825610.000047com.gitlab
138202422108020.000038edu.columbia
139202409785240.000049gov.noaa
140202306661220.000230com.ytimg
141202299001190.000233com.youtube-nocookie
142202276567310.000042org.ieee
143202271263330.000075org.npr
144202255287290.000042io.readthedocs
145202252062860.000086org.acm
146202223143390.000074com.time
1472022043011800.000025org.eclipse
148202203822410.000100org.ampproject
149202186163440.000074com.fc2
150202157301420.000185com.wixsite
151202136927550.000040edu.washington
152202101224210.000061com.force
153202098642760.000089com.prnewswire
154202091305000.000052com.buzzfeed
155202071364340.000060com.nationalgeographic
156202064024030.000063com.nature
157202038262000.000118gle.forms
158202024907990.000038org.sciencemag
159202011444280.000061com.theatlantic
160202001048710.000035com.stackexchange
161201981422800.000088com.sciencedirect
162201854003320.000075com.staticflickr
163201845284950.000052uk.co.independent
164201822562630.000093gov.ca
165201809726870.000043org.worldbank
166201759944350.000060com.mozilla
167201754007340.000041com.marketwatch
1682016809810870.000027com.hatenablog
169201670403640.000069com.nypost
170201640166460.000043org.bitbucket
171201611922190.000107com.ft
172201511164630.000056com.pixabay
173201437963540.000071jp.co.rakuten
174201426527430.000041edu.upenn
175201401262770.000089org.doi
176201393769660.000031jp.livedoor
177201365461980.000120uk.co.google
178201349324070.000063uk.co.dailymail
179201344047240.000042org.pbs
180201339362580.000094net.behance
181201329141920.000124org.wikimedia
182201278609170.000033edu.jhu
183201278284540.000057gov.whitehouse
184201223528560.000035org.weforum
185201221704160.000062com.dailymotion
1862011705414870.000020com.warnerbros
187201118983260.000077org.opensource
1882011079810910.000027cn.com.chinadaily
189201099165480.000047me.about
190201098202320.000103jp.ameblo
191201089405580.000047com.oup
192201034283250.000077com.digg
193200974184550.000056com.entrepreneur
194200951086310.000044com.vice
195200941427490.000040com.qz
1962009269212590.000024com.discovery
197200911544440.000058com.goodreads
198200910524470.000057gg.discord
1992008291011090.000027com.sap
200200821863530.000071com.scribd
201200794121880.000128com.feedburner
202200761464660.000055com.fortune
203200755565800.000045com.gartner
2042007259810120.000029com.500px
205200721364580.000056jp.ne.sakura
206200674001760.000139com.imdb
207200609507320.000042uk.co.blogspot
2082005905417350.000018com.amd
209200582289470.000032edu.princeton
210200566668900.000034org.cambridge
21120056572510.000714com.fb
212200562728480.000036com.evernote
213200544721440.000180com.dropbox
21420053532390.000951com.wixstatic
215200516626170.000044org.unesco
2162005094014610.000020com.fandom
217200481522940.000084com.wiley
218200461347680.000039com.withgoogle
2192003942610150.000029org.altervista
2202003901023370.000014com.wolfram
221200379207980.000038com.slate
2222003148412010.000025org.kernel
2232002816410490.000028edu.purdue
224200252825690.000046page.g
225200213407860.000038com.trello
226200170182300.000103com.disqus
227200127967570.000040org.eff
228200104309510.000031com.merriam-webster
229200046864930.000052gov.usda
230200042409810.000030com.netlify
2312000399421790.000015com.diigo
232200029188070.000038com.vox
233200026901800.000135org.allaboutcookies
2342000222012060.000025com.jetbrains
2351999941814160.000021edu.arizona
236199943845420.000047com.tandfonline
237199930308440.000036com.foxnews
238199921842910.000085com.live
239199911421750.000140com.xing
240199898749090.000033com.politico
241199885703200.000079com.outlook
2421998503611350.000026jp.ne.goo
243199833407540.000040au.net.abc
2441998268019450.000016com.wikidot
245199779347930.000038com.investopedia
2461997757410660.000028edu.uchicago
2471997682010090.000029edu.wisc
248199759221970.000120com.eepurl
2491997256010390.000028com.bostonglobe
250199720967750.000039org.semver
251199695946190.000044com.sagepub
252199691824970.000052gov.fda
253199684423470.000073net.windows
2541996808415680.000019edu.osu
255199653863190.000079com.nbcnews
256199639462440.000099com.myshopify
257199628925850.000045cn.google
258199625306080.000044site.business
259199610668320.000036com.sciencedaily
2601996038010440.000028com.strikingly
2611995636612360.000024edu.unc
2621995626814460.000021edu.virginia
2631995603412040.000025co.elastic
2641995296011940.000025com.nymag
2651995050022060.000015com.renren
266199504907420.000041gov.house
2671995044821630.000015sg.edu.nus
2681994797622850.000014org.wikibooks
2691994728419610.000016com.googlesource
270199405982350.000103com.wpengine
271199401583230.000078com.googlecode
272199392127610.000040gov.senate
273199380085130.000051com.herokuapp
274199377384520.000057org.pewresearch
275199374925670.000046org.iana
2761993695410930.000027com.podbean
277199358189820.000030com.alexa
2781993474216290.000019gd.is
279199338041030.000301com.paypalobjects
280199327408050.000038org.unicef
281199324167180.000043com.newyorker
282199308589690.000031uk.co.thetimes
283199293244040.000063com.patreon
2841992826610600.000028com.lifehacker
285199259403810.000066com.criteo
286199245249970.000030com.huffpost
287199225763030.000081com.squareup
288199225108390.000036ca.cbc
2891992180811450.000026org.wiktionary
290199188441460.000178com.addtoany
291199181742010.000117com.optimizely
2921991805213420.000022edu.msu
2931991598613710.000022com.history
294199133844180.000062com.calendly
2951990586011810.000025com.udemy
296199033648090.000037uk.ac.ox
297199029201720.000145com.amazon-adsystem
29819899332490.000743com.googleadservices
299198969241550.000167com.opera
300198909708870.000034org.fao
3011989083210170.000029com.ecwid
302198908264760.000054com.googleblog
303198871422110.000110com.stackoverflow
3041988619014190.000021uk.ac.lse
305198853123600.000070com.getpocket
3061988445616670.000018org.maven
307198838009150.000033uk.co.guardian
308198833581690.000148org.bbb
3091988108413370.000022com.aljazeera
310198807902550.000095com.aliyuncs
3111987993827230.000013net.pixnet
3121987438431800.000011net.hinet
3131986902811700.000025com.smithsonianmag
3141986883213470.000022edu.ucdavis
315198682588940.000034gov.congress
3161986719013200.000023edu.illinois
3171986516811200.000026com.theglobeandmail
3181986330610360.000029gov.archives
319198624144920.000052it.placehold
32019861934930.000359net.facebook
3211986137616150.000019hk.com.google
3221986092214730.000020ca.sfu
3231985635216760.000018blog.home
3241985529010730.000027com.apnews
325198548929630.000031com.ssrn
3261985368233830.000010com.wizards
3271985110219970.000016com.nabble
328198510327600.000040com.chinaz
3291985041236670.000010cn.edu.sjtu
3301984814014840.000020com.urbandictionary
3311984443611360.000026com.scmp
3321984232614890.000020ms.1drv
3331984179643610.000008tw.com.gamer
3341983858213920.000021com.flipboard
335198381669190.000033co.g
336198375425470.000047com.gofundme
3371983699620970.000015com.france24
3381983563614050.000021jp.geocities
3391983365413700.000022com.ibtimes
340198313625810.000045com.biomedcentral
3411983005611280.000026com.britannica
3421982942021740.000015com.oregonlive
343198270624120.000062com.kickstarter
344198262149620.000031com.adjust
345198241888670.000035gov.fcc
346198240487150.000043uk.co.mirror
347198232665890.000045us.icio
3481982317211290.000026com.mediafire
3491982176814320.000021edu.tamu
350198213105870.000045com.usnews
3511982044213140.000023org.greenpeace
352198202529850.000030edu.academia
3531981948613810.000021com.livescience
3541981597216840.000018gov.cia
3551981456413250.000023com.akamai
356198132669300.000032com.chicagotribune
357198115381560.000167com.npmjs
3581981110014290.000021net.seesaa
359198101203290.000076es.google
3601980971012380.000024com.reverbnation
361198094905500.000047com.quora
3621980831434810.000010com.proboards
3631980626810400.000028com.thehill
364198038403210.000078org.python
3651980147611320.000026org.jstor
3661980101817220.000018ca.mcgill
367197999821670.000149com.zendesk
368197928909990.000030com.thelancet
3691979224610940.000027com.jamanetwork
3701978859419350.000016uk.ac.manchester
371197852145400.000048com.udacity
3721978332813720.000021ca.utoronto
373197830825790.000046com.bigcartel
3741978223024870.000013org.wikiquote
3751978118613570.000022edu.rutgers
376197800288960.000034org.apa
377197797184390.000059com.newsweek
378197785389200.000033com.healthline
3791977798222040.000015com.knowyourmeme
380197756103280.000077com.tinyurl
381197755587260.000042gov.state
382197750922160.000108com.unsplash
3831977370217080.000018ca.ualberta
384197723784060.000063com.githubusercontent
3851977190014710.000020com.asahi
386197712202590.000094org.nodejs
387197694364750.000054com.latimes
3881976925810270.000029com.timeanddate
389197686864320.000060com.slack
390197684107690.000039jp.shinobi
3911976797616740.000018com.buzzfeednews
392197650384150.000062com.elsevier
3931976472213350.000022edu.gatech
3941976429828610.000012com.youdao
395197612568950.000034com.brightcove
3961975973017740.000017com.bankofamerica
3971975953025690.000013edu.byu
3981975876019180.000016com.voanews
3991975758631640.000011com.opendns
4001975681614250.000021com.sky
4011975578023360.000014com.slides
4021975446213730.000021com.dw
4031975445811580.000026com.nikkei
404197525909040.000033com.cbslocal
4051974876622360.000014net.earthlink
406197486783910.000064com.cnet
4071974815016420.000018com.xrea
4081974743013540.000022uk.co.huffingtonpost
409197464241820.000133com.eventbrite
4101974637010710.000027com.nydailynews
4111974409013050.000023me.vk
412197431949180.000033gov.bls
4131974154214580.000020org.ap
414197409363840.000066net.imgix
4151973986024140.000014org.aclweb
4161973975016410.000018com.axios
417197389409870.000030com.wattpad
4181973753017130.000018com.straitstimes
419197374124740.000054com.ted
4201973687412940.000023edu.brookings
421197286349670.000031int.coe
422197275802120.000109com.etsy
4231972711223920.000014com.biography
424197260808650.000035gov.va
425197257102170.000107com.typepad
4261972462819320.000016com.cocolog-nifty
4271972358016080.000019com.reference
428197207405530.000047com.livejournal
4291971740620960.000015ru.kremlin
430197163548150.000037uk.gov.service
431197153782980.000083com.techcrunch
4321971235824620.000013org.wikisource
4331971229615530.000019com.foxbusiness
4341971162012810.000023mil.army
4351971124417610.000017com.itv
436197102607330.000041com.deviantart
4371970595213110.000023de.mpg
438197052888450.000036gov.justice
4391970457419930.000016cn.people
4401970324812620.000024au.com.smh
4411970165617630.000017org.tensorflow
4421970163412230.000024org.ohchr
443197010005680.000046ru.gov
444197001364000.000064com.technorati
4451969959621340.000015jp.co.japantimes
44619697954830.000413com.list-manage
4471969708810680.000028com.thedrum
4481969675415380.000019uk.co.standard
449196954301850.000131com.rawgit
4501969421621200.000015com.oxforddictionaries
4511969300622410.000014com.shutterfly
4521969208231470.000011tw.edu.ntu
4531969156425500.000013com.smashwords
4541968986218620.000016edu.unl
4551968876824020.000014org.fas
456196886462960.000084uk.org.ico
4571968813827100.000013tv.blip
458196860669570.000031com.bandsintown
4591968444835160.000010cn.org.china
4601968296015500.000019uk.co.express
4611967970810820.000027jp.jugem
4621967915836560.000010info.webry
4631967873014030.000021gov.uscourts
4641967794421570.000015au.edu.unimelb
46519675766920.000363com.wsimg
466196748682830.000086ru.rambler
4671967373819210.000016com.washingtontimes
468196717543510.000072com.proofpoint
46919669412740.000441net.jsfiddle
470196683527880.000038org.mediawiki
4711966815828510.000012jp.blog
4721966774014790.000020com.firebaseapp
4731966741816180.000019com.webnode
4741966594021730.000015com.pbworks
4751966574833740.000011com.patheos
4761966568431350.000011uk.co.timesonline
4771966398021710.000015google.ai
478196633542330.000103com.squarespace
4791966218829040.000012fr.rfi
4801966098414540.000020gov.supremecourt
4811965920018890.000016int.unfccc
482196585343310.000076com.office
483196565265770.000046pl.google
484196540989910.000030gov.wa
485196527968040.000038gov.sba
4861965262612670.000023com.cognitoforms
4871965006622070.000015org.csis
488196490083660.000068io.codepen
4891964875023440.000014com.kobo
490196465121100.000281com.mailchimp
4911964342816710.000018edu.wustl
4921964257227340.000013edu.kit
4931964233414800.000020org.hrw
494196422769530.000031edu.umich
4951964185613890.000021com.dictionary
496196415448360.000036com.mapquest
4971964083617470.000017org.worldcat
4981964027636210.000010net.aljazeera
499196401443570.000071com.photobucket
5001963994820460.000015net.cnki
5011963851017050.000018com.secondlife
5021963841624210.000014int.wmo
5031963788810890.000027org.ilo
5041963745011000.000027google.blog
505196366923780.000067com.meetup
506196346349950.000030uk.co.pinterest
5071963377033970.000010com.freehostia
5081963041232560.000011com.doodlekit
509196297469360.000032com.arstechnica
5101962837037300.000009com.colourlovers
5111962835616960.000018ru.ucoz
512196282989520.000031com.thenextweb
5131962445822860.000014org.unep
5141962234222520.000014org.icrc
5151962180814240.000021com.findlaw
5161962113423340.000014com.similarweb
517196206964810.000054com.gmail
5181961930430400.000012io.soup
5191961624614370.000021com.imageshack
5201961595627850.000013com.sputniknews
5211961407830800.000012com.smore
5221961323232460.000011org.iucnredlist
5231961176631170.000011com.kinja
5241961176018830.000016com.csmonitor
525196116041450.000180ru.mail
5261961008813390.000022gov.uscis
527196085544460.000058net.secureservercdn
5281960631430040.000012sh.now
529196057484270.000061tv.twitch
5301960499415800.000019link.app
531196008144400.000059com.statista
5321959916036760.000010jp.hatenablog
5331959555043560.000008com.coroflot
5341959526431770.000011org.jenkins-ci
5351959515817570.000017gov.oregon
5361959313032000.000011li.paper
5371959310638470.000009com.pixar
5381958987830950.000011com.shell
5391958819440350.000009com.scienceblogs
5401958618816250.000019org.amnesty
541195848248920.000034com.thedailybeast
5421958246417670.000017org.pypi
5431958234621490.000015com.foreignpolicy
5441958031028490.000012com.instapaper
5451957967229100.000012org.accessnow
5461957861416020.000019com.surveygizmo
5471957778017330.000018ca.globalnews
5481957620031750.000011de.uni-koeln
549195761982390.000101io.shields
5501957618433770.000011org.lds
5511957590222380.000014org.rand
552195747902070.000114com.salesforce
5531957454434380.000010net.mootools
5541957442823570.000014at.ac.univie
5551957418240500.000009org.marxists
5561957166428600.000012org.panda
5571957119428060.000013com.oprah
5581956857618740.000016com.justia
5591956797034710.000010org.avaaz
5601956785428800.000012com.openai
5611956776435970.000010org.neocities
5621956726037530.000009cn.edu.sdu
563195649607620.000040com.netflix
564195641204980.000052com.oreilly
5651956308644050.000008com.yam
566195622482270.000105uk.co.amazon
567195622048660.000035com.zoho
568195609566290.000044com.zdnet
5691955996612980.000023ly.snip
5701955879017900.000017ch.ipcc
571195586649930.000030uk.parliament
5721955850837870.000009com.nestle
5731955630412540.000024se.google
5741955629229970.000012com.treehugger
5751955518410110.000029net.nocookie
5761955509646440.000008com.x0
5771955336836310.000010org.tvtropes
5781955099211410.000026org.sphinx-doc
5791954999421220.000015ru.mos
5801954882030440.000012es.csic
5811954853029130.000012uk.gov.companieshouse
5821954657610340.000029com.engadget
5831954623011830.000025com.here
5841954549250600.000007com.dbs
5851954543841030.000009br.ufrj
5861954420421590.000015edu.colostate
5871954339827060.000013de.uni-heidelberg
5881954050030590.000012com.pearltrees
5891953926821760.000015net.openid
5901953788026000.000013com.mystrikingly
5911953784438800.000009com.chinatimes
5921953583424000.000014link.page
5931953418223540.000014com.real
5941953343218360.000017org.ncsl
595195322883010.000082com.surveymonkey
596195319303620.000070com.hp
5971953141211930.000025org.js
5981953070021350.000015com.123formbuilder
5991952884224260.000014org.vim
6001952810432050.000011pl.wp
6011952801826020.000013au.com.sbs
602195267801700.000148com.yelp
6031952621624990.000013uk.ac.kcl
6041952434613380.000022org.aarp
6051952369226210.000013th.co.google
6061952315610060.000029uk.gov.legislation
607195230422600.000094com.getbootstrap
6081952285636630.000010com.magcloud
6091952227439900.000009com.zynga
6101952194212680.000023tw.com.google
6111952192228290.000013com.kaggle
612195201309480.000031gov.gpo
613195197429460.000032com.about
6141951971432730.000011org.rsf
6151951874029760.000012org.tigris
6161951822427270.000013uk.ac.leeds
6171951551235350.000010de.dw
6181951543430190.000012org.cfr
6191951457432530.000011de.uni-freiburg
6201951357036400.000010de.uni-konstanz
6211951271438810.000009ua.at
6221951125421170.000015info.worldometers
6231951031446570.000008com.embarcadero
6241950937029990.000012vn.zing
6251950913432290.000011com.bangkokpost
6261950880436150.000010ly.rebrand
6271950854820080.000016gov.ky
6281950842640090.000009org.wilsoncenter
6291950677440590.000009jp.hatenadiary
6301950628443740.000008com.musictoday
6311950538838240.000009org.constitutioncenter
632195051863720.000067com.booking
6331950440225790.000013com.eiseverywhere
6341950380040380.000009com.itsnicethat
6351950377633310.000011il.ac.tau
6361950209623590.000014mx.com.google
6371950080637360.000009com.db
638194989283120.000080com.ebay
6391949858835780.000010jp.hateblo
6401949816633480.000011org.democracynow
6411949729639750.000009edu.odu
6421949681228150.000013dk.au
6431949662642200.000008com.etymonline
6441949618428850.000012uk.gov.metoffice
645194957563610.000070com.skype
6461949556635700.000010com.hsbc
6471949484422280.000015com.bankrate
6481949410422400.000014gov.wi
6491949335218150.000017fi.google
6501949330644260.000008com.x10host
6511949213632240.000011org.royalsociety
652194910968170.000037com.pexels
653194903585320.000048com.mashable
6541949028246140.000008com.epochtimes
6551949001811740.000025edu.ucla
6561948965632260.000011cc.reurl
6571948941434300.000010com.dailykos
6581948936037420.000009uk.ac.uea
6591948805037050.000010ca.shaw
6601948610419680.000016uk.gov.tfl
6611948598834340.000010uk.ac.nhm
6621948503230600.000012com.ipage
6631948475424980.000013com.prweek
6641948459818190.000017gov.usembassy
6651948396648610.000007am.do
6661948363630860.000011com.viki
6671948351832520.000011se.liu
6681948271830660.000012com.coca-colacompany
6691948258042320.000008br.ufrgs
6701948249836390.000010de.uni-kiel
6711948134014530.000020com.speakerdeck
6721948071830770.000012net.openreview
6731948066022080.000015de.auswaertiges-amt
674194802482080.000113com.hubspot
6751947976220260.000016com.lexisnexis
6761947870021060.000015net.ucoz
6771947755234940.000010com.iconarchive
678194775328190.000037com.steampowered
679194772867560.000040com.xiti
6801947713224860.000013com.post-gazette
6811947689833690.000011com.eklablog
6821947663229370.000012uk.co.bbci
6831947637819110.000016hu.google
6841947616043990.000008com.jacobinmag
6851947597433230.000011uk.ac.sussex
6861947436830680.000012uk.ac.qmul
6871947421239300.000009nf.co
6881947301441140.000009com.collinsdictionary
6891947289652150.000007com.evaair
6901947284625720.000013com.marketwire
6911947258031380.000011au.com.telstra
6921947211439160.000009it.unitn
693194716468980.000034com.visualstudio
6941947133038070.000009in.ernet
6951947099429060.000012nl.rug
6961946870852970.000007org.arkive
697194682522520.000096org.drupal
6981946705034600.000010ca.dal
6991946704636930.000010com.canada
7001946564214510.000021com.tinypic
7011946530431360.000011org.wri
7021946503436980.000010com.la-croix
7031946410845570.000008com.mitsubishielectric
7041946382847480.000008com.gamejolt
7051946297627890.000013gr.google
7061946288248820.000007cz.webgarden
7071946240430790.000012my.com.thestar
708194618302690.000092net.php
7091946164043290.000008au.gov.fairwork
7101946077022790.000014co.pcdn
7111946017639430.000009uk.ac.essex
712194599841210.000231org.networkadvertising
7131945968433960.000010org.rferl
7141945906842110.000008com.sc
7151945902032920.000011com.blogfa
7161945879433820.000010ca.yelp
7171945758041020.000009edu.utm
7181945724856940.000007com.anghami
7191945653252100.000007su.clan
7201945614440950.000009it.justpaste
721194560064140.000062com.sxsw
7221945591432580.000011com.waterstones
7231945460239600.000009com.jigsy
724194545168380.000036com.intel
7251945439440320.000009ee.ut
726194532429160.000033com.docker
727194529887380.000041com.samsung
7281945180234220.000010es.ucm
7291945071825030.000013com.washingtonexaminer
7301945034239510.000009tl.page
7311945020622090.000015org.wbur
7321944903641120.000009site.negocio
7331944892227730.000013com.yell
7341944851639880.000009com.fatcow
7351944826632820.000011pl.poznan
736194481981350.000194com.youku
7371944793028780.000012ae.thenational
7381944776647050.000008id.co.kaskus
7391944766834070.000010com.afp
7401944760253360.000007net.manilatimes
741194467344190.000062com.caniuse
7421944616814700.000020com.pastebin
7431944591033870.000010uk.org.rspb
744194457367650.000039com.moz
7451944437640270.000009lv.draugiem
7461944160425080.000013gov.dni
7471944087425930.000013ro.google
7481944014429460.000012com.broadwayworld
7491943957437500.000009ru.msu
7501943937437660.000009pl.cba
7511943933241370.000009org.rfa
7521943928055620.000007org.bukkit
7531943908620130.000016scot.gov
754194388681330.000200com.constantcontact
7551943882656380.000007org.adbusters
7561943809445170.000008google.design
7571943765441540.000008com.macobserver
7581943708816490.000018fr.pagesjaunes
7591943702025020.000013com.thenation
7601943677639730.000009com.bbcamerica
7611943455648570.000007com.orgfree
7621943381029780.000012com.channelnewsasia
763194325067350.000041gov.sec
7641943250240080.000009com.teamspeak
7651943243028000.000013org.gnupg
7661943226037800.000009com.the-scientist
7671943225230150.000012com.laweekly
7681943144629210.000012au.edu.sydney
7691943008435770.000010uk.co.yougov
7701943000031400.000011vn.com.google
7711942994244170.000008com.50webs
7721942900431240.000011org.repec
7731942893832150.000011org.ourworldindata
7741942789035060.000010com.tradingeconomics
7751942735231020.000011tw.com.pchome
7761942658233320.000011com.monday
7771942655635560.000010org.project-syndicate
7781942555223310.000014com.amebaownd
7791942489015960.000019org.whatbrowser
7801942475019560.000016org.americanbar
7811942468037390.000009ie.thejournal
782194241521040.000298com.stripe
7831942414040140.000009com.hatenadiary
7841942406029330.000012org.thinkprogress
7851942371230730.000012uk.gov.london
7861942305439270.000009com.thesaurus
7871942300634750.000010net.webself
7881942296434320.000010io.pantheon
7891942171234200.000010uk.ac.exeter
7901942150843430.000008com.appledaily
7911942111835280.000010com.bravesites
7921942081651780.000007com.bambuser
7931942059233790.000011com.foreignaffairs
7941941937824320.000013com.instructables
7951941638821850.000015vn.vietnamnet
7961941473639940.000009com.webcindario
7971941432828230.000013org.ewg
7981941393445340.000008ws.nimb
7991941377828330.000013org.fullfact
800194133522560.000095us.zoom
8011941255636850.000010com.encyclopedia
8021941247438970.000009de.uni-erlangen
8031941082253410.000007net.boards
804194095983410.000074com.histats
8051940953442010.000008is.pse
806194094367480.000040fm.last
8071940780836610.000010com.mongabay
8081940704032200.000011me.site123
8091940633834360.000010com.seetickets
8101940555058380.000007com.gamigo
8111940440016660.000018com.materialdesignicons
8121940410851400.000007bd.com.google
813194032427900.000038com.venturebeat
8141940121846010.000008uk.org.phrases
8151940078032130.000011com.instructure
8161940029828170.000013gov.arkansas
81719399890720.000444com.livestream
8181939955440810.000009cat.uab
8191939948635460.000010org.lacity
8201939937236120.000010com.heraldscotland
8211939837014990.000020com.teachable
8221939667228950.000012com.foodandwine
8231939575212330.000024com.createjs
8241939427422660.000014com.ajc
8251939417239500.000009com.rappler
8261939403023550.000014net.noscript
8271939398241400.000009jp.doorblog
8281939288228730.000012com.timeshighereducation
829193922382750.000089com.bandcamp
8301938933239690.000009jp.ne.hi-ho
8311938809436290.000010net.inquirer
832193878825520.000047com.cisco
8331938731840760.000009pl.lublin
8341938637016570.000018com.pcworld
835193834042660.000093com.typeform
836193828862030.000116com.naver
8371938269837230.000010gov.bts
8381938219218160.000017jp.makeshop
8391938210244620.000008com.tor
8401938207245130.000008com.weightwatchers
8411938134614380.000021org.khanacademy
842193812749540.000031com.thinkwithgoogle
8431938102033850.000010uk.ac.jisc
8441938023840880.000009ly.genial
8451937998640070.000009com.themoscowtimes
8461937850032720.000011com.nyt
8471937843437600.000009com.springernature
8481937835633900.000010int.cbd
8491937785460450.000006es.xurl
8501937689817560.000017com.netsolhost
8511937659838520.000009au.edu.griffith
8521937605447400.000008co.edu.unal
8531937604040740.000009kr.co.koreatimes
854193745887270.000042com.deloitte
8551937430049860.000007org.edc
8561937394041490.000008vn.tienphong
8571937347635150.000010com.thediplomat
8581937293240990.000009uk.ac.lancs
8591937279850060.000007com.inoreader
8601937274649220.000007com.ueuo
8611937259415850.000019tv.ustream
8621937257632340.000011com.tapatalk
8631937235634160.000010nl.wur
8641937210648480.000007net.hypermart
8651937163622930.000014org.kff
866193693563980.000064com.pubmatic
8671936898236250.000010org.grist
8681936848030880.000011tw.gov.cdc
8691936828833890.000010com.gothamist
8701936813011060.000027com.gizmodo
8711936811641010.000009com.globalpost
872193676768140.000037gov.nist
8731936753645630.000008org.globalsecurity
8741936645445470.000008build.bazel
8751936638437820.000009us.ms.state
8761936587842560.000008gr.ntua
8771936577644440.000008se.thelocal
8781936537229630.000012com.politifact
8791936512813170.000023com.ensighten
8801936358850970.000007ru.my1
8811936268034680.000010com.rabbitmq
8821935969841380.000009com.elasticbeanstalk
8831935957413640.000022com.billboard
8841935912247660.000008cc.dict
8851935877456870.000007fi.mbnet
886193573908790.000035com.aliexpress
887193569182100.000111to.amzn
8881935566842750.000008edu.ohio
8891935554634520.000010com.thejakartapost
8901935535032770.000011vn.com.dantri
8911935508052850.000007com.galvanize
8921935488034840.000010jp.go.ndl
8931935479047100.000008com.kiwibox
8941935451421400.000015org.linuxfoundation
8951935450048010.000007ru.nnov
8961935316642880.000008gr.auth
8971935297022570.000014net.vnexpress
8981935177029000.000012com.crashlytics
8991935159410450.000028com.dropboxusercontent
9001935082834390.000010com.scotusblog
9011935071240900.000009org.carnegieendowment
902193502783950.000064com.atlassian
9031934972634650.000010com.study
904193487243500.000072com.mapbox
9051934853210460.000028com.redhat
9061934788617990.000017com.bravenet
9071934746042840.000008uk.org.npg
9081934715244630.000008com.btplc
9091934714852890.000007ru.drom
9101934654224300.000013com.vimeopro
9111934590044190.000008edu.marquette
912193456444260.000061com.adweek
913193451449140.000033com.shutterstock
9141934509010160.000029com.ubuntu
9151934196057120.000007in.ac.nptel
9161934148812270.000024com.msdn
9171934071447070.000008com.vocabulary
9181934068039290.000009edu.uaf
9191933965839190.000009com.atavist
9201933945632010.000011com.healthgrades
9211933909225460.000013com.kinstacdn
9221933838423450.000014com.gazhall
9231933793853980.000007com.asmallorange
9241933780037970.000009com.generalmills
9251933617645850.000008vn.vtc
9261933590815190.000020cn.gov.mofcom
927193337787970.000038com.box
9281933360639660.000009si.uni-lj
9291933332241700.000008az.president
9301933319417880.000017org.reactjs
9311933241236050.000010com.postaffiliatepro
9321933192251920.000007edu.uah
9331933128035990.000010org.openedition
9341933069648380.000007com.kapook
9351933038241530.000008org.caringbridge
936193303744830.000053com.aol
9371932961423030.000014org.nfpa
9381932953859560.000006com.glosbe
9391932919441240.000009com.mcall
9401932762242890.000008ru.tmweb
9411932687641260.000009uk.co.liverpoolecho
9421932642242440.000008com.atwebpages
9431932598010670.000028com.freepik
9441932479040850.000009org.specialolympics
9451932386848450.000007net.freeforums
9461932367647440.000008uk.ac.westminster
9471932353240920.000009com.tok2
9481932346010250.000029com.elpais
9491932315049460.000007tw.com.sina
9501932250832960.000011com.wowza
951193223063170.000079com.webs
9521932202446970.000008com.warriorplus
9531932191834140.000010com.cityam
9541932181244820.000008org.fee
9551932152048540.000007tw.edu.ntnu
9561932129649620.000007com.sparknotes
9571932020245160.000008com.newspapers
9581931963421920.000015com.tutsplus
9591931960058680.000007com.ananova
9601931927438180.000009org.opensecrets
961193191346330.000044gov.uspto
9621931872256800.000007su.moy
9631931836610130.000029com.uk
9641931826649360.000007ru.pr-cy
9651931805838270.000009cz.centrum
9661931778041580.000008edu.niu
9671931532016650.000018org.webkit
9681931501446920.000008pl.edu.amu
9691931408451860.000007com.artfire
9701931389438000.000009org.ascd
9711931210638010.000009edu.scu
9721931174243070.000008com.taipeitimes
9731931156843510.000008edu.whoi
9741931085459490.000006com.voatiengviet
9751931074831000.000011com.broadcastingcable
9761931072046550.000008hk.rthk
9771931024657030.000007com.enotes
978193099104880.000053com.indiatimes
979193096608600.000035com.playstation
9801930904048660.000007com.brothersoft
9811930894827080.000013uk.gov.defra
982193076062310.000103org.whatwg
9831930717844510.000008com.batchgeo
984193071187510.000040com.psychologytoday
9851930636842630.000008uk.co.lrb
9861930635050340.000007ca.pe.gov
9871930588441590.000008com.ecowatch
9881930382041950.000008com.williamhill
9891930354857670.000007pt.ipp
9901930297248430.000007uk.org.38degrees
9911930162413030.000023com.technologyreview
9921930146440910.000009org.spie
993193010689590.000031com.libsyn
9941930057247950.000007com.storeboard
9951930054832600.000011de.bmel
9961929944847490.000008net.onlinewebshop
9971929927438720.000009ru.1gb
998192986542790.000088com.automattic
9991929850238700.000009com.piie
10001929744053060.000007com.allthatsinteresting

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

May/June 2020 crawl archive now available

The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives.

Starting with this crawl, the WET files indicate the natural language(s) a text is written in. The language is detected using Compact Language Detector 2 (CLD2). Since August 2018 it has been available only in WARC and WAT files and in the URL indexes. It is now also provided in WET files in the WARC header "WARC-Identified-Content-Language". Up to three languages are detected per document and given as a comma-separated list of ISO-639-3 codes. Here is an example WET record fragment:

...
WARC-Identified-Content-Language: isl,eng
Content-Type: text/plain
Content-Length: 10494

Bananabrauð með Nutella – Ljúfmeti og lekkerheit
...

Additional information about this improvement is given in the corresponding issue report.
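
As a minimal sketch of how the new header could be read, the following Python snippet extracts the language codes from a block of WARC header lines such as the fragment above. The function name parse_content_languages is hypothetical and not part of any Common Crawl tooling; a real application would more likely use a WARC parsing library.

```python
from typing import List


def parse_content_languages(header_block: str) -> List[str]:
    """Return the codes listed in a WARC-Identified-Content-Language header.

    Hypothetical helper for illustration: scans plain header lines and
    returns an empty list if the header is absent.
    """
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-identified-content-language":
            # The value is a comma-separated list of up to three codes
            return [code.strip() for code in value.split(",") if code.strip()]
    return []


fragment = (
    "WARC-Identified-Content-Language: isl,eng\n"
    "Content-Type: text/plain\n"
    "Content-Length: 10494\n"
)
print(parse_content_languages(fragment))  # ['isl', 'eng']
```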

Archive Location and Download

The May/June crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-24/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-24/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-24/warc.paths.gz | 60000 | 53.16 |
| WAT files | CC-MAIN-2020-24/wat.paths.gz | 60000 | 19.02 |
| WET files | CC-MAIN-2020-24/wet.paths.gz | 60000 | 8.42 |
| Robots.txt files | CC-MAIN-2020-24/robotstxt.paths.gz | 60000 | 0.22 |
| Non-200 responses files | CC-MAIN-2020-24/non200responses.paths.gz | 60000 | 2.77 |
| URL index files | CC-MAIN-2020-24/cc-index.paths.gz | 302 | 0.22 |
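
Turning a path listing into full S3 or HTTP paths could be sketched in Python as follows. This is only an illustration: the helper name full_paths is hypothetical, and the sample listing is a made-up stand-in for the content of a downloaded *.paths.gz file, not a real file name from the crawl.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing_bytes, prefix):
    """Decompress a gzipped path listing and prepend prefix to each line."""
    with gzip.open(io.BytesIO(listing_bytes), "rt", encoding="utf-8") as lines:
        # Skip blank lines so a trailing newline does not yield a bare prefix
        return [prefix + line.strip() for line in lines if line.strip()]


# Made-up stand-in for the bytes of a downloaded path listing
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2020-24/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"
)

for path in full_paths(sample, HTTP_PREFIX):
    print(path)
```

Passing S3_PREFIX instead of HTTP_PREFIX yields the corresponding S3 paths.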

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-24/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March/April 2020 crawl archive now available

The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Archive Location and Download

The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-16/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-16/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-16/warc.paths.gz | 56000 | 62.67 |
| WAT files | CC-MAIN-2020-16/wat.paths.gz | 56000 | 20.37 |
| WET files | CC-MAIN-2020-16/wet.paths.gz | 56000 | 8.97 |
| Robots.txt files | CC-MAIN-2020-16/robotstxt.paths.gz | 56000 | 0.19 |
| Non-200 responses files | CC-MAIN-2020-16/non200responses.paths.gz | 56000 | 1.39 |
| URL index files | CC-MAIN-2020-16/cc-index.paths.gz | 302 | 0.21 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-16/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2020 crawl archive now available

The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Improvements and Fixes

The HTTP headers in WARC response records have been fixed: the HTTP response status line now has a space following the status code even if the reason phrase is empty. For example, if a server sends an empty reason phrase (instead of “OK”), the status line will include a trailing space character: “HTTP/1.1 200 ”. According to RFC 7230, the space between status code and reason phrase is mandatory. Please refer to the bug report NUTCH-2763 for further details.
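
To illustrate why the trailing space matters, here is a small Python sketch of a strict status line parser. The function parse_status_line is hypothetical and only approximates the status-line grammar of RFC 7230; it is not the parser used by the crawler or by any particular WARC library.

```python
def parse_status_line(status_line):
    """Split an HTTP/1.x status line into (version, status code, reason phrase).

    Raises ValueError if the space after the status code is missing.
    """
    # Strip only the line terminator, not the significant trailing space
    line = status_line.rstrip("\r\n")
    # maxsplit=2 keeps a reason phrase containing spaces in one piece
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 200 "))           # ('HTTP/1.1', 200, '')
print(parse_status_line("HTTP/1.1 404 Not Found"))  # ('HTTP/1.1', 404, 'Not Found')
```

With this strict approach, a status line without the space after the status code ("HTTP/1.1 200") splits into only two parts and raises a ValueError, which is the kind of failure the fix avoids for parsers that follow the grammar closely.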

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-10/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

| | File List | #Files | Total Size Compressed (TiB) |
| Segments | CC-MAIN-2020-10/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2020-10/warc.paths.gz | 56000 | 49.28 |
| WAT files | CC-MAIN-2020-10/wat.paths.gz | 56000 | 17.98 |
| WET files | CC-MAIN-2020-10/wet.paths.gz | 56000 | 7.97 |
| Robots.txt files | CC-MAIN-2020-10/robotstxt.paths.gz | 56000 | 0.22 |
| Non-200 responses files | CC-MAIN-2020-10/non200responses.paths.gz | 56000 | 2.21 |
| URL index files | CC-MAIN-2020-10/cc-index.paths.gz | 302 | 0.2 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-10/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.