August 2020 crawl archive now available

The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th. It includes page captures of 940 million URLs unknown in any of our prior crawl archives.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-34/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-34/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-34/warc.paths.gz | 60000 | 48.9
WAT files | CC-MAIN-2020-34/wat.paths.gz | 60000 | 16.9
WET files | CC-MAIN-2020-34/wet.paths.gz | 60000 | 7.56
Robots.txt files | CC-MAIN-2020-34/robotstxt.paths.gz | 60000 | 0.19
Non-200 responses files | CC-MAIN-2020-34/non200responses.paths.gz | 60000 | 1.94
URL index files | CC-MAIN-2020-34/cc-index.paths.gz | 302 | 0.19
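
As a minimal sketch of the prefixing step described above, the following Python snippet turns the lines of a decompressed path listing into full URLs. The function name `to_urls` and the sample path are made up for illustration; only the two prefixes come from this post.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_urls(path_lines, prefix=HTTP_PREFIX):
    """Prefix every non-empty line of a decompressed *.paths listing."""
    return [prefix + line.strip() for line in path_lines if line.strip()]


# Hypothetical sample line, not a real file name from the listing.
sample = ["crawl-data/CC-MAIN-2020-34/segments/example/warc/example.warc.gz\n"]

print(to_urls(sample))             # HTTP paths
print(to_urls(sample, S3_PREFIX))  # S3 paths
```

Passing an open text-mode file handle (for example from `gzip.open(..., "rt")`) instead of a list works the same way, since the function only iterates over lines.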

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-34/. The columnar index has also been updated to include this crawl.
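
The sketch below only builds a query URL for the index server; it does not send a request. It assumes that the server answers CDX-style queries at an endpoint named `<crawl id>-index` with `url` and `output` parameters, which is not stated in this post, so check the index server's own documentation before relying on it. The helper name `index_query_url` is hypothetical.

```python
from urllib.parse import urlencode


def index_query_url(crawl_id, url_pattern, output="json"):
    """Build a lookup URL for one crawl (endpoint layout is an assumption)."""
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": output})


print(index_query_url("CC-MAIN-2020-34", "example.com"))
```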

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

July 2020 crawl archive now available

The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th. It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives.

Bug Fixes and Improvements

The URL index fields "redirect" and "mime" were not filled if the corresponding HTTP headers Location and Content-Type were written in lower-case letters or in any other variant that does not match the expected case. This bug was detected during the crawl and fixed for 90 out of 100 segments. It also affects the columnar index, namely the fields "fetch_redirect" and "content_mime_type", respectively. To a minor extent, it may affect the detection of character set and content language, because the value of the Content-Type header is used as an additional hint for the detection. Additional information about this bug fix is given in the corresponding issue report.
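
To illustrate this class of bug (this is not the crawler's actual code), the sketch below contrasts an exact-case header lookup with a case-insensitive one. Both function names and the sample headers are made up for the example.

```python
def get_header_exact(headers, name):
    """Buggy variant: only matches the exact spelling of the header name."""
    return headers.get(name)


def get_header_ci(headers, name):
    """Case-insensitive variant: compares lower-cased header names."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# A server that sends its header names in lower case.
headers = {"content-type": "text/html; charset=utf-8", "location": "/new"}

print(get_header_exact(headers, "Content-Type"))  # lookup misses the header
print(get_header_ci(headers, "Content-Type"))     # lookup finds the header
```

HTTP header names are case-insensitive, so the second variant is the one that matches what servers may legitimately send.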

Archive Location and Download

The July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-29/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2020-29/segment.paths.gz | 100 |
WARC files | CC-MAIN-2020-29/warc.paths.gz | 60000 | 62.64
WAT files | CC-MAIN-2020-29/wat.paths.gz | 60000 | 22.23
WET files | CC-MAIN-2020-29/wet.paths.gz | 60000 | 9.87
Robots.txt files | CC-MAIN-2020-29/robotstxt.paths.gz | 60000 | 0.21
Non-200 responses files | CC-MAIN-2020-29/non200responses.paths.gz | 60000 | 2.52
URL index files | CC-MAIN-2020-29/cc-index.paths.gz | 302 | 0.24

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-29/. The columnar index has also been updated to include this crawl.


Host- and Domain-Level Web Graphs Feb/Mar/May 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

What’s new?

The host-level graph now includes hosts visited by the crawler but not linking to any other host. How is this possible, given that every host is found via links the crawler follows? Some of these links were detected in a prior crawl and not in one of the three crawls used to build the web graphs. More details about the issue are given in cc-pyspark#15. The impact of this fix on the graph size is minimal: the recent crawl now includes 1 million nodes (0.1% of all nodes) which are not connected to any other node.

Host-level graph

The graph consists of 927 million nodes and 3.88 billion edges and includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to by a link on a crawled page. There are 857 million dangling nodes (92.5%), and the largest strongly connected component contains 47 million nodes (5.1%).
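
The two kinds of nodes mentioned above can be told apart from an edge list alone. The toy sketch below approximates dangling nodes as nodes with incoming but no outgoing edges, and unconnected nodes as nodes with no edges at all. This purely structural definition is an assumption made for illustration (the post defines dangling nodes by crawl status), and `classify_nodes` is a hypothetical helper, not part of cc-webgraph.

```python
def classify_nodes(nodes, edges):
    """Return (dangling, isolated) node sets for a directed graph.

    dangling: nodes that are linked to but have no outgoing edge
    isolated: nodes without any incoming or outgoing edge
    """
    has_out = {src for src, _ in edges}
    has_in = {dst for _, dst in edges}
    dangling = {n for n in nodes if n in has_in and n not in has_out}
    isolated = {n for n in nodes if n not in has_in and n not in has_out}
    return dangling, isolated


# Toy graph: 0 and 1 link to each other, 0 links to 2, 3 has no links.
nodes = {0, 1, 2, 3}
edges = [(0, 1), (1, 0), (0, 2)]
dangling, isolated = classify_nodes(nodes, edges)
print(dangling, isolated)
```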

You can download the graph and the ranks of all 927 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/host/ as prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/May 2020 host-level webgraph

Size | File | Description
5.67 GB | cc-main-2020-feb-mar-may-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 12 vertices files
17.26 GB | cc-main-2020-feb-mar-may-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 24 edges files
7.40 GB | cc-main-2020-feb-mar-may-host.graph | graph in BVGraph format
2 kB | cc-main-2020-feb-mar-may-host.properties |
8.57 GB | cc-main-2020-feb-mar-may-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2020-feb-mar-may-host-t.properties |
1 kB | cc-main-2020-feb-mar-may-host.stats | WebGraph statistics
12.16 GB | cc-main-2020-feb-mar-may-host-ranks.txt.gz | harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
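
A small sketch of the transformation described in the note above. The function name `reverse_host` is made up, and the code covers only the two steps named in the note (strip a leading www., reverse the labels); any further normalization done by the actual pipeline is not reproduced here.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```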

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
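
The aggregation step can be sketched on a toy edge list as follows. `naive_pld` is a hypothetical stand-in that keeps the first two labels of a reversed host name; it does not consult the public suffix list and gives wrong results for suffixes such as co.uk. Dropping links within the same domain is likewise an assumption of this sketch, not something stated in this post.

```python
from collections import Counter


def naive_pld(rev_host):
    """Hypothetical stand-in for a public suffix lookup (first two labels)."""
    return ".".join(rev_host.split(".")[:2])


def aggregate(host_edges, to_domain=naive_pld):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = to_domain(src), to_domain(dst)
        if s != d:  # assumption: links within one domain are dropped
            domain_edges.add((s, d))
    return domain_edges


def hosts_per_domain(rev_hosts, to_domain=naive_pld):
    """Count distinct hosts per domain (cf. 'num hosts' in the vertices file)."""
    return Counter(to_domain(h) for h in set(rev_hosts))


host_edges = [
    ("com.example.blog", "org.example"),
    ("com.example.shop", "org.example.docs"),
    ("com.example.blog", "com.example.shop"),
]
print(aggregate(host_edges))
print(hosts_per_domain(h for edge in host_edges for h in edge))
```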

The domain-level graph has 91 million nodes and 1.96 billion edges. 51% or 46 million nodes are dangling nodes; the largest strongly connected component covers 36 million or 39% of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/ or, via HTTP, under https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/.

Download files of the Common Crawl Feb/Mar/May 2020 domain-level webgraph

Size | File | Description
0.62 GB | cc-main-2020-feb-mar-may-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
7.79 GB | cc-main-2020-feb-mar-may-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
4.23 GB | cc-main-2020-feb-mar-may-domain.graph | graph in BVGraph format
2 kB | cc-main-2020-feb-mar-may-domain.properties |
4.16 GB | cc-main-2020-feb-mar-may-domain-t.graph | transpose of the graph
2 kB | cc-main-2020-feb-mar-may-domain-t.properties |
1 kB | cc-main-2020-feb-mar-may-domain.stats | WebGraph statistics
1.96 GB | cc-main-2020-feb-mar-may-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domain ranks is available for download.
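
As a hedged sketch of working with such rank data, the snippet below parses rows into records and re-sorts them by PageRank. It assumes whitespace-separated columns in the order used by the table in this post (harmonic centrality rank, harmonic centrality value, PageRank rank, PageRank value, reversed name); the layout of the downloadable ranks file may differ. `DomainRank` and `parse_rank_line` are made-up names, and the three sample rows carry the values of the first three table rows.

```python
from typing import NamedTuple


class DomainRank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_host: str


def parse_rank_line(line):
    """Parse one whitespace-separated row (column order is an assumption)."""
    hc_rank, hc_value, pr_rank, pr_value, rev_host = line.split()
    return DomainRank(int(hc_rank), float(hc_value),
                      int(pr_rank), float(pr_value), rev_host)


rows = [
    "1 32667618 1 0.018180 com.googleapis",
    "2 30552772 3 0.011873 com.facebook",
    "3 29569088 2 0.013789 com.google",
]
ranks = [parse_rank_line(row) for row in rows]

# The rows arrive ordered by harmonic centrality; re-sort them by PageRank.
by_pagerank = sorted(ranks, key=lambda r: r.pr_rank)
print([r.rev_host for r in by_pagerank])
```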

Top 1000 domains ranked by harmonic centrality (Feb/Mar/May 2020)

harmonic centrality rank | hc value | PageRank rank | PageRank value | reversed hostname
1 | 32667618 | 1 | 0.018180 | com.googleapis
2 | 30552772 | 3 | 0.011873 | com.facebook
3 | 29569088 | 2 | 0.013789 | com.google
4 | 26920460 | 4 | 0.007145 | com.twitter
5 | 26883128 | 5 | 0.007106 | org.w
6 | 26360448 | 6 | 0.006483 | com.youtube
7 | 24719396 | 9 | 0.004210 | com.instagram
8 | 24251942 | 8 | 0.005125 | org.gmpg
9 | 23841332 | 7 | 0.005329 | com.googletagmanager
10 | 23606890 | 13 | 0.002940 | com.linkedin
11 | 22741292 | 10 | 0.003621 | com.cloudflare
12 | 22732960 | 12 | 0.002974 | org.wordpress
13 | 22661910 | 14 | 0.002515 | com.gravatar
14 | 22577680 | 15 | 0.002438 | com.gstatic
15 | 22378134 | 22 | 0.001529 | com.pinterest
16 | 22196962 | 27 | 0.001192 | org.wikipedia
17 | 22189650 | 19 | 0.001864 | com.wordpress
18 | 22066028 | 16 | 0.002404 | com.bootstrapcdn
19 | 21967760 | 18 | 0.001884 | com.apple
20 | 21751768 | 20 | 0.001863 | com.jquery
21 | 21589606 | 24 | 0.001461 | com.microsoft
22 | 21568908 | 44 | 0.000785 | be.youtu
23 | 21568474 | 43 | 0.000806 | com.blogspot
24 | 21533280 | 31 | 0.001104 | com.vimeo
25 | 21415938 | 46 | 0.000761 | gl.goo
26 | 21399120 | 35 | 0.001040 | com.amazonaws
27 | 21358048 | 53 | 0.000665 | com.amazon
28 | 21331634 | 21 | 0.001737 | com.adobe
29 | 21324666 | 23 | 0.001506 | com.wp
30 | 21209012 | 70 | 0.000452 | com.tumblr
31 | 21184360 | 17 | 0.001949 | com.github
32 | 21150652 | 37 | 0.001008 | com.google-analytics
33 | 21110976 | 30 | 0.001152 | com.baidu
34 | 21096692 | 87 | 0.000387 | com.yahoo
35 | 21081268 | 59 | 0.000547 | ly.bit
36 | 21060360 | 33 | 0.001072 | com.macromedia
37 | 21046916 | 36 | 0.001035 | net.cloudfront
38 | 21036258 | 45 | 0.000763 | com.flickr
39 | 20997926 | 32 | 0.001101 | com.googlesyndication
40 | 20993476 | 26 | 0.001277 | me.wp
41 | 20980462 | 97 | 0.000340 | com.googleusercontent
42 | 20966446 | 56 | 0.000624 | eu.europa
43 | 20960242 | 42 | 0.000807 | net.jsdelivr
44 | 20959910 | 52 | 0.000677 | co.t
45 | 20901872 | 29 | 0.001163 | ru.yandex
46 | 20846092 | 50 | 0.000742 | net.doubleclick
47 | 20843032 | 41 | 0.000869 | com.addthis
48 | 20823518 | 69 | 0.000457 | io.github
49 | 20817952 | 76 | 0.000433 | com.medium
50 | 20810030 | 25 | 0.001287 | com.fontawesome
51 | 20809120 | 139 | 0.000189 | com.forbes
52 | 20796434 | 61 | 0.000510 | org.w3
53 | 20759102 | 55 | 0.000640 | com.paypal
54 | 20757266 | 109 | 0.000282 | com.soundcloud
55 | 20754514 | 90 | 0.000368 | org.creativecommons
56 | 20747472 | 57 | 0.000619 | com.vk
57 | 20711184 | 54 | 0.000658 | org.mozilla
58 | 20710182 | 88 | 0.000382 | com.weebly
59 | 20698442 | 84 | 0.000410 | com.wix
60 | 20675372 | 102 | 0.000317 | com.weibo
61 | 20663930 | 58 | 0.000604 | org.schema
62 | 20650202 | 164 | 0.000151 | com.imgur
63 | 20644452 | 147 | 0.000177 | org.apache
64 | 20642282 | 178 | 0.000138 | uk.co.bbc
65 | 20625560 | 129 | 0.000210 | org.archive
66 | 20610354 | 274 | 0.000089 | com.ibm
67 | 20609614 | 154 | 0.000169 | com.bing
68 | 20602380 | 191 | 0.000125 | net.sourceforge
69 | 20579012 | 130 | 0.000207 | com.nytimes
70 | 20578626 | 150 | 0.000174 | int.who
71 | 20571012 | 183 | 0.000131 | com.cnn
72 | 20561674 | 174 | 0.000140 | net.slideshare
73 | 20547634 | 158 | 0.000164 | gov.cdc
74 | 20542546 | 202 | 0.000116 | com.android
75 | 20527230 | 228 | 0.000104 | com.wsj
76 | 20518548 | 194 | 0.000122 | edu.stanford
77 | 20505546 | 205 | 0.000115 | com.businessinsider
78 | 20495034 | 254 | 0.000095 | com.oracle
79 | 20489434 | 34 | 0.001049 | net.fbcdn
80 | 20488868 | 373 | 0.000067 | com.msn
81 | 20488282 | 261 | 0.000093 | edu.harvard
82 | 20483384 | 310 | 0.000080 | com.go
83 | 20478152 | 99 | 0.000335 | com.shopify
84 | 20471424 | 267 | 0.000093 | com.bbc
85 | 20464434 | 297 | 0.000083 | edu.mit
86 | 20461340 | 330 | 0.000076 | com.myspace
87 | 20458776 | 62 | 0.000497 | com.whatsapp
88 | 20457206 | 289 | 0.000085 | com.appspot
89 | 20454466 | 307 | 0.000080 | com.wired
90 | 20446300 | 292 | 0.000085 | com.reuters
91 | 20442004 | 101 | 0.000323 | com.godaddy
92 | 20435550 | 171 | 0.000147 | com.theguardian
93 | 20417770 | 143 | 0.000182 | gov.nih
94 | 20412536 | 196 | 0.000120 | org.ietf
95 | 20401330 | 388 | 0.000065 | gov.nasa
96 | 20397298 | 423 | 0.000061 | com.theverge
97 | 20394736 | 149 | 0.000175 | com.giphy
98 | 20394276 | 382 | 0.000066 | net.researchgate
99 | 20384930 | 270 | 0.000092 | com.bloomberg
100 | 20377778 | 108 | 0.000285 | com.unpkg
101 | 20376394 | 114 | 0.000271 | com.reddit
102 | 20373856 | 337 | 0.000075 | com.xinhuanet
103 | 20366736 | 215 | 0.000108 | org.gnu
104 | 20363506 | 318 | 0.000079 | com.usatoday
105 | 20352660 | 813 | 0.000037 | org.chromium
106 | 20344996 | 356 | 0.000071 | com.springer
107 | 20343678 | 98 | 0.000335 | de.google
108 | 20342420 | 28 | 0.001184 | com.qq
109 | 20341824 | 345 | 0.000073 | com.example
110 | 20336510 | 744 | 0.000041 | edu.psu
111 | 20324536 | 468 | 0.000055 | edu.cornell
112 | 20324378 | 184 | 0.000131 | com.blogger
113 | 20314024 | 60 | 0.000516 | net.akamaihd
114 | 20304242 | 375 | 0.000067 | org.hbr
115 | 20302310 | 750 | 0.000040 | com.git-scm
116 | 20300014 | 937 | 0.000032 | com.wikia
117 | 20298546 | 137 | 0.000191 | com.spotify
118 | 20296012 | 485 | 0.000053 | edu.yale
119 | 20295516 | 113 | 0.000271 | com.jimdo
120 | 20293140 | 554 | 0.000047 | com.cbsnews
121 | 20291946 | 717 | 0.000043 | com.economist
122 | 20290574 | 214 | 0.000109 | com.washingtonpost
123 | 20288504 | 140 | 0.000188 | jp.co.yahoo
124 | 20286470 | 285 | 0.000086 | com.huffingtonpost
125 | 20284558 | 316 | 0.000080 | org.un
126 | 20281874 | 410 | 0.000063 | fr.free
127 | 20279946 | 473 | 0.000054 | edu.berkeley
128 | 20275446 | 287 | 0.000086 | com.cnbc
129 | 20273280 | 245 | 0.000099 | com.dribbble
130 | 20271584 | 576 | 0.000046 | org.arxiv
131 | 20269716 | 151 | 0.000172 | com.issuu
132 | 20257038 | 545 | 0.000047 | com.mysql
133 | 20256262 | 160 | 0.000157 | com.twimg
134 | 20252532 | 107 | 0.000285 | com.statcounter
135 | 20251682 | 338 | 0.000075 | uk.co.telegraph
136 | 20247478 | 305 | 0.000081 | com.w3schools
137 | 20246682 | 561 | 0.000047 | com.gitlab
138 | 20242210 | 802 | 0.000038 | edu.columbia
139 | 20240978 | 524 | 0.000049 | gov.noaa
140 | 20230666 | 122 | 0.000230 | com.ytimg
141 | 20229900 | 119 | 0.000233 | com.youtube-nocookie
142 | 20227656 | 731 | 0.000042 | org.ieee
143 | 20227126 | 333 | 0.000075 | org.npr
144 | 20225528 | 729 | 0.000042 | io.readthedocs
145 | 20225206 | 286 | 0.000086 | org.acm
146 | 20222314 | 339 | 0.000074 | com.time
147 | 20220430 | 1180 | 0.000025 | org.eclipse
148 | 20220382 | 241 | 0.000100 | org.ampproject
149 | 20218616 | 344 | 0.000074 | com.fc2
150 | 20215730 | 142 | 0.000185 | com.wixsite
151 | 20213692 | 755 | 0.000040 | edu.washington
152 | 20210122 | 421 | 0.000061 | com.force
153 | 20209864 | 276 | 0.000089 | com.prnewswire
154 | 20209130 | 500 | 0.000052 | com.buzzfeed
155 | 20207136 | 434 | 0.000060 | com.nationalgeographic
156 | 20206402 | 403 | 0.000063 | com.nature
157 | 20203826 | 200 | 0.000118 | gle.forms
158 | 20202490 | 799 | 0.000038 | org.sciencemag
159 | 20201144 | 428 | 0.000061 | com.theatlantic
160 | 20200104 | 871 | 0.000035 | com.stackexchange
161 | 20198142 | 280 | 0.000088 | com.sciencedirect
162 | 20185400 | 332 | 0.000075 | com.staticflickr
163 | 20184528 | 495 | 0.000052 | uk.co.independent
164 | 20182256 | 263 | 0.000093 | gov.ca
165 | 20180972 | 687 | 0.000043 | org.worldbank
166 | 20175994 | 435 | 0.000060 | com.mozilla
167 | 20175400 | 734 | 0.000041 | com.marketwatch
168 | 20168098 | 1087 | 0.000027 | com.hatenablog
169 | 20167040 | 364 | 0.000069 | com.nypost
170 | 20164016 | 646 | 0.000043 | org.bitbucket
171 | 20161192 | 219 | 0.000107 | com.ft
172 | 20151116 | 463 | 0.000056 | com.pixabay
173 | 20143796 | 354 | 0.000071 | jp.co.rakuten
174 | 20142652 | 743 | 0.000041 | edu.upenn
175 | 20140126 | 277 | 0.000089 | org.doi
176 | 20139376 | 966 | 0.000031 | jp.livedoor
177 | 20136546 | 198 | 0.000120 | uk.co.google
178 | 20134932 | 407 | 0.000063 | uk.co.dailymail
179 | 20134404 | 724 | 0.000042 | org.pbs
180 | 20133936 | 258 | 0.000094 | net.behance
181 | 20132914 | 192 | 0.000124 | org.wikimedia
182 | 20127860 | 917 | 0.000033 | edu.jhu
183 | 20127828 | 454 | 0.000057 | gov.whitehouse
184 | 20122352 | 856 | 0.000035 | org.weforum
185 | 20122170 | 416 | 0.000062 | com.dailymotion
186 | 20117054 | 1487 | 0.000020 | com.warnerbros
187 | 20111898 | 326 | 0.000077 | org.opensource
188 | 20110798 | 1091 | 0.000027 | cn.com.chinadaily
189 | 20109916 | 548 | 0.000047 | me.about
190 | 20109820 | 232 | 0.000103 | jp.ameblo
191 | 20108940 | 558 | 0.000047 | com.oup
192 | 20103428 | 325 | 0.000077 | com.digg
193 | 20097418 | 455 | 0.000056 | com.entrepreneur
194 | 20095108 | 631 | 0.000044 | com.vice
195 | 20094142 | 749 | 0.000040 | com.qz
196 | 20092692 | 1259 | 0.000024 | com.discovery
197 | 20091154 | 444 | 0.000058 | com.goodreads
198 | 20091052 | 447 | 0.000057 | gg.discord
199 | 20082910 | 1109 | 0.000027 | com.sap
200 | 20082186 | 353 | 0.000071 | com.scribd
201 | 20079412 | 188 | 0.000128 | com.feedburner
202 | 20076146 | 466 | 0.000055 | com.fortune
203 | 20075556 | 580 | 0.000045 | com.gartner
204 | 20072598 | 1012 | 0.000029 | com.500px
205 | 20072136 | 458 | 0.000056 | jp.ne.sakura
206 | 20067400 | 176 | 0.000139 | com.imdb
207 | 20060950 | 732 | 0.000042 | uk.co.blogspot
208 | 20059054 | 1735 | 0.000018 | com.amd
209 | 20058228 | 947 | 0.000032 | edu.princeton
210 | 20056666 | 890 | 0.000034 | org.cambridge
211 | 20056572 | 51 | 0.000714 | com.fb
212 | 20056272 | 848 | 0.000036 | com.evernote
213 | 20054472 | 144 | 0.000180 | com.dropbox
214 | 20053532 | 39 | 0.000951 | com.wixstatic
215 | 20051662 | 617 | 0.000044 | org.unesco
216 | 20050940 | 1461 | 0.000020 | com.fandom
217 | 20048152 | 294 | 0.000084 | com.wiley
218 | 20046134 | 768 | 0.000039 | com.withgoogle
219 | 20039426 | 1015 | 0.000029 | org.altervista
220 | 20039010 | 2337 | 0.000014 | com.wolfram
221 | 20037920 | 798 | 0.000038 | com.slate
222 | 20031484 | 1201 | 0.000025 | org.kernel
223 | 20028164 | 1049 | 0.000028 | edu.purdue
224 | 20025282 | 569 | 0.000046 | page.g
225 | 20021340 | 786 | 0.000038 | com.trello
226 | 20017018 | 230 | 0.000103 | com.disqus
227 | 20012796 | 757 | 0.000040 | org.eff
228 | 20010430 | 951 | 0.000031 | com.merriam-webster
229 | 20004686 | 493 | 0.000052 | gov.usda
230 | 20004240 | 981 | 0.000030 | com.netlify
231 | 20003994 | 2179 | 0.000015 | com.diigo
232 | 20002918 | 807 | 0.000038 | com.vox
233 | 20002690 | 180 | 0.000135 | org.allaboutcookies
234 | 20002220 | 1206 | 0.000025 | com.jetbrains
235 | 19999418 | 1416 | 0.000021 | edu.arizona
236 | 19994384 | 542 | 0.000047 | com.tandfonline
237 | 19993030 | 844 | 0.000036 | com.foxnews
238 | 19992184 | 291 | 0.000085 | com.live
239 | 19991142 | 175 | 0.000140 | com.xing
240 | 19989874 | 909 | 0.000033 | com.politico
241 | 19988570 | 320 | 0.000079 | com.outlook
242 | 19985036 | 1135 | 0.000026 | jp.ne.goo
243 | 19983340 | 754 | 0.000040 | au.net.abc
244 | 19982680 | 1945 | 0.000016 | com.wikidot
245 | 19977934 | 793 | 0.000038 | com.investopedia
246 | 19977574 | 1066 | 0.000028 | edu.uchicago
247 | 19976820 | 1009 | 0.000029 | edu.wisc
248 | 19975922 | 197 | 0.000120 | com.eepurl
249 | 19972560 | 1039 | 0.000028 | com.bostonglobe
250 | 19972096 | 775 | 0.000039 | org.semver
251 | 19969594 | 619 | 0.000044 | com.sagepub
252 | 19969182 | 497 | 0.000052 | gov.fda
253 | 19968442 | 347 | 0.000073 | net.windows
254 | 19968084 | 1568 | 0.000019 | edu.osu
255 | 19965386 | 319 | 0.000079 | com.nbcnews
256 | 19963946 | 244 | 0.000099 | com.myshopify
257 | 19962892 | 585 | 0.000045 | cn.google
258 | 19962530 | 608 | 0.000044 | site.business
259 | 19961066 | 832 | 0.000036 | com.sciencedaily
260 | 19960380 | 1044 | 0.000028 | com.strikingly
261 | 19956366 | 1236 | 0.000024 | edu.unc
262 | 19956268 | 1446 | 0.000021 | edu.virginia
263 | 19956034 | 1204 | 0.000025 | co.elastic
264 | 19952960 | 1194 | 0.000025 | com.nymag
265 | 19950500 | 2206 | 0.000015 | com.renren
266 | 19950490 | 742 | 0.000041 | gov.house
267 | 19950448 | 2163 | 0.000015 | sg.edu.nus
268 | 19947976 | 2285 | 0.000014 | org.wikibooks
269 | 19947284 | 1961 | 0.000016 | com.googlesource
270 | 19940598 | 235 | 0.000103 | com.wpengine
271 | 19940158 | 323 | 0.000078 | com.googlecode
272 | 19939212 | 761 | 0.000040 | gov.senate
273 | 19938008 | 513 | 0.000051 | com.herokuapp
274 | 19937738 | 452 | 0.000057 | org.pewresearch
275 | 19937492 | 567 | 0.000046 | org.iana
276 | 19936954 | 1093 | 0.000027 | com.podbean
277 | 19935818 | 982 | 0.000030 | com.alexa
278 | 19934742 | 1629 | 0.000019 | gd.is
279 | 19933804 | 103 | 0.000301 | com.paypalobjects
280 | 19932740 | 805 | 0.000038 | org.unicef
281 | 19932416 | 718 | 0.000043 | com.newyorker
282 | 19930858 | 969 | 0.000031 | uk.co.thetimes
283 | 19929324 | 404 | 0.000063 | com.patreon
284 | 19928266 | 1060 | 0.000028 | com.lifehacker
285 | 19925940 | 381 | 0.000066 | com.criteo
286 | 19924524 | 997 | 0.000030 | com.huffpost
287 | 19922576 | 303 | 0.000081 | com.squareup
288 | 19922510 | 839 | 0.000036 | ca.cbc
289 | 19921808 | 1145 | 0.000026 | org.wiktionary
290 | 19918844 | 146 | 0.000178 | com.addtoany
291 | 19918174 | 201 | 0.000117 | com.optimizely
292 | 19918052 | 1342 | 0.000022 | edu.msu
293 | 19915986 | 1371 | 0.000022 | com.history
294 | 19913384 | 418 | 0.000062 | com.calendly
295 | 19905860 | 1181 | 0.000025 | com.udemy
296 | 19903364 | 809 | 0.000037 | uk.ac.ox
297 | 19902920 | 172 | 0.000145 | com.amazon-adsystem
298 | 19899332 | 49 | 0.000743 | com.googleadservices
299 | 19896924 | 155 | 0.000167 | com.opera
300 | 19890970 | 887 | 0.000034 | org.fao
301 | 19890832 | 1017 | 0.000029 | com.ecwid
302 | 19890826 | 476 | 0.000054 | com.googleblog
303 | 19887142 | 211 | 0.000110 | com.stackoverflow
304 | 19886190 | 1419 | 0.000021 | uk.ac.lse
305 | 19885312 | 360 | 0.000070 | com.getpocket
306 | 19884456 | 1667 | 0.000018 | org.maven
307 | 19883800 | 915 | 0.000033 | uk.co.guardian
308 | 19883358 | 169 | 0.000148 | org.bbb
309 | 19881084 | 1337 | 0.000022 | com.aljazeera
310 | 19880790 | 255 | 0.000095 | com.aliyuncs
311 | 19879938 | 2723 | 0.000013 | net.pixnet
312 | 19874384 | 3180 | 0.000011 | net.hinet
313 | 19869028 | 1170 | 0.000025 | com.smithsonianmag
314 | 19868832 | 1347 | 0.000022 | edu.ucdavis
315 | 19868258 | 894 | 0.000034 | gov.congress
316 | 19867190 | 1320 | 0.000023 | edu.illinois
317 | 19865168 | 1120 | 0.000026 | com.theglobeandmail
318 | 19863306 | 1036 | 0.000029 | gov.archives
319 | 19862414 | 492 | 0.000052 | it.placehold
320 | 19861934 | 93 | 0.000359 | net.facebook
321 | 19861376 | 1615 | 0.000019 | hk.com.google
322 | 19860922 | 1473 | 0.000020 | ca.sfu
323 | 19856352 | 1676 | 0.000018 | blog.home
324 | 19855290 | 1073 | 0.000027 | com.apnews
325 | 19854892 | 963 | 0.000031 | com.ssrn
326 | 19853682 | 3383 | 0.000010 | com.wizards
327 | 19851102 | 1997 | 0.000016 | com.nabble
328 | 19851032 | 760 | 0.000040 | com.chinaz
329 | 19850412 | 3667 | 0.000010 | cn.edu.sjtu
330 | 19848140 | 1484 | 0.000020 | com.urbandictionary
331 | 19844436 | 1136 | 0.000026 | com.scmp
332 | 19842326 | 1489 | 0.000020 | ms.1drv
333 | 19841796 | 4361 | 0.000008 | tw.com.gamer
334 | 19838582 | 1392 | 0.000021 | com.flipboard
335 | 19838166 | 919 | 0.000033 | co.g
336 | 19837542 | 547 | 0.000047 | com.gofundme
337 | 19836996 | 2097 | 0.000015 | com.france24
338 | 19835636 | 1405 | 0.000021 | jp.geocities
339 | 19833654 | 1370 | 0.000022 | com.ibtimes
340 | 19831362 | 581 | 0.000045 | com.biomedcentral
341 | 19830056 | 1128 | 0.000026 | com.britannica
342 | 19829420 | 2174 | 0.000015 | com.oregonlive
343 | 19827062 | 412 | 0.000062 | com.kickstarter
344 | 19826214 | 962 | 0.000031 | com.adjust
345 | 19824188 | 867 | 0.000035 | gov.fcc
346 | 19824048 | 715 | 0.000043 | uk.co.mirror
347 | 19823266 | 589 | 0.000045 | us.icio
348 | 19823172 | 1129 | 0.000026 | com.mediafire
349 | 19821768 | 1432 | 0.000021 | edu.tamu
350 | 19821310 | 587 | 0.000045 | com.usnews
351 | 19820442 | 1314 | 0.000023 | org.greenpeace
352 | 19820252 | 985 | 0.000030 | edu.academia
353 | 19819486 | 1381 | 0.000021 | com.livescience
354 | 19815972 | 1684 | 0.000018 | gov.cia
355 | 19814564 | 1325 | 0.000023 | com.akamai
356 | 19813266 | 930 | 0.000032 | com.chicagotribune
357 | 19811538 | 156 | 0.000167 | com.npmjs
358 | 19811100 | 1429 | 0.000021 | net.seesaa
359 | 19810120 | 329 | 0.000076 | es.google
360 | 19809710 | 1238 | 0.000024 | com.reverbnation
361 | 19809490 | 550 | 0.000047 | com.quora
362 | 19808314 | 3481 | 0.000010 | com.proboards
363 | 19806268 | 1040 | 0.000028 | com.thehill
364 | 19803840 | 321 | 0.000078 | org.python
365 | 19801476 | 1132 | 0.000026 | org.jstor
366 | 19801018 | 1722 | 0.000018 | ca.mcgill
367 | 19799982 | 167 | 0.000149 | com.zendesk
368 | 19792890 | 999 | 0.000030 | com.thelancet
369 | 19792246 | 1094 | 0.000027 | com.jamanetwork
370 | 19788594 | 1935 | 0.000016 | uk.ac.manchester
371 | 19785214 | 540 | 0.000048 | com.udacity
372 | 19783328 | 1372 | 0.000021 | ca.utoronto
373 | 19783082 | 579 | 0.000046 | com.bigcartel
374 | 19782230 | 2487 | 0.000013 | org.wikiquote
375 | 19781186 | 1357 | 0.000022 | edu.rutgers
376 | 19780028 | 896 | 0.000034 | org.apa
377 | 19779718 | 439 | 0.000059 | com.newsweek
378 | 19778538 | 920 | 0.000033 | com.healthline
379 | 19777982 | 2204 | 0.000015 | com.knowyourmeme
380 | 19775610 | 328 | 0.000077 | com.tinyurl
381 | 19775558 | 726 | 0.000042 | gov.state
382 | 19775092 | 216 | 0.000108 | com.unsplash
383 | 19773702 | 1708 | 0.000018 | ca.ualberta
384 | 19772378 | 406 | 0.000063 | com.githubusercontent
385 | 19771900 | 1471 | 0.000020 | com.asahi
386 | 19771220 | 259 | 0.000094 | org.nodejs
387 | 19769436 | 475 | 0.000054 | com.latimes
388 | 19769258 | 1027 | 0.000029 | com.timeanddate
389 | 19768686 | 432 | 0.000060 | com.slack
390 | 19768410 | 769 | 0.000039 | jp.shinobi
391 | 19767976 | 1674 | 0.000018 | com.buzzfeednews
392 | 19765038 | 415 | 0.000062 | com.elsevier
393 | 19764722 | 1335 | 0.000022 | edu.gatech
394 | 19764298 | 2861 | 0.000012 | com.youdao
395 | 19761256 | 895 | 0.000034 | com.brightcove
396 | 19759730 | 1774 | 0.000017 | com.bankofamerica
397 | 19759530 | 2569 | 0.000013 | edu.byu
398 | 19758760 | 1918 | 0.000016 | com.voanews
399 | 19757586 | 3164 | 0.000011 | com.opendns
400 | 19756816 | 1425 | 0.000021 | com.sky
401 | 19755780 | 2336 | 0.000014 | com.slides
402 | 19754462 | 1373 | 0.000021 | com.dw
403 | 19754458 | 1158 | 0.000026 | com.nikkei
404 | 19752590 | 904 | 0.000033 | com.cbslocal
405 | 19748766 | 2236 | 0.000014 | net.earthlink
406 | 19748678 | 391 | 0.000064 | com.cnet
407 | 19748150 | 1642 | 0.000018 | com.xrea
408 | 19747430 | 1354 | 0.000022 | uk.co.huffingtonpost
409 | 19746424 | 182 | 0.000133 | com.eventbrite
410 | 19746370 | 1071 | 0.000027 | com.nydailynews
411 | 19744090 | 1305 | 0.000023 | me.vk
412 | 19743194 | 918 | 0.000033 | gov.bls
413 | 19741542 | 1458 | 0.000020 | org.ap
414 | 19740936 | 384 | 0.000066 | net.imgix
415 | 19739860 | 2414 | 0.000014 | org.aclweb
416 | 19739750 | 1641 | 0.000018 | com.axios
417 | 19738940 | 987 | 0.000030 | com.wattpad
418 | 19737530 | 1713 | 0.000018 | com.straitstimes
419 | 19737412 | 474 | 0.000054 | com.ted
420 | 19736874 | 1294 | 0.000023 | edu.brookings
421 | 19728634 | 967 | 0.000031 | int.coe
422 | 19727580 | 212 | 0.000109 | com.etsy
423 | 19727112 | 2392 | 0.000014 | com.biography
424 | 19726080 | 865 | 0.000035 | gov.va
425 | 19725710 | 217 | 0.000107 | com.typepad
426 | 19724628 | 1932 | 0.000016 | com.cocolog-nifty
427 | 19723580 | 1608 | 0.000019 | com.reference
428 | 19720740 | 553 | 0.000047 | com.livejournal
429 | 19717406 | 2096 | 0.000015 | ru.kremlin
430 | 19716354 | 815 | 0.000037 | uk.gov.service
431 | 19715378 | 298 | 0.000083 | com.techcrunch
432 | 19712358 | 2462 | 0.000013 | org.wikisource
433 | 19712296 | 1553 | 0.000019 | com.foxbusiness
434 | 19711620 | 1281 | 0.000023 | mil.army
435 | 19711244 | 1761 | 0.000017 | com.itv
436 | 19710260 | 733 | 0.000041 | com.deviantart
437 | 19705952 | 1311 | 0.000023 | de.mpg
438 | 19705288 | 845 | 0.000036 | gov.justice
439 | 19704574 | 1993 | 0.000016 | cn.people
440 | 19703248 | 1262 | 0.000024 | au.com.smh
441 | 19701656 | 1763 | 0.000017 | org.tensorflow
442 | 19701634 | 1223 | 0.000024 | org.ohchr
443 | 19701000 | 568 | 0.000046 | ru.gov
444 | 19700136 | 400 | 0.000064 | com.technorati
445 | 19699596 | 2134 | 0.000015 | jp.co.japantimes
446 | 19697954 | 83 | 0.000413 | com.list-manage
447 | 19697088 | 1068 | 0.000028 | com.thedrum
448 | 19696754 | 1538 | 0.000019 | uk.co.standard
449 | 19695430 | 185 | 0.000131 | com.rawgit
450 | 19694216 | 2120 | 0.000015 | com.oxforddictionaries
451 | 19693006 | 2241 | 0.000014 | com.shutterfly
452 | 19692082 | 3147 | 0.000011 | tw.edu.ntu
453 | 19691564 | 2550 | 0.000013 | com.smashwords
454 | 19689862 | 1862 | 0.000016 | edu.unl
455 | 19688768 | 2402 | 0.000014 | org.fas
456 | 19688646 | 296 | 0.000084 | uk.org.ico
457 | 19688138 | 2710 | 0.000013 | tv.blip
458 | 19686066 | 957 | 0.000031 | com.bandsintown
459 | 19684448 | 3516 | 0.000010 | cn.org.china
460 | 19682960 | 1550 | 0.000019 | uk.co.express
461 | 19679708 | 1082 | 0.000027 | jp.jugem
462 | 19679158 | 3656 | 0.000010 | info.webry
463 | 19678730 | 1403 | 0.000021 | gov.uscourts
464 | 19677944 | 2157 | 0.000015 | au.edu.unimelb
465 | 19675766 | 92 | 0.000363 | com.wsimg
466 | 19674868 | 283 | 0.000086 | ru.rambler
467 | 19673738 | 1921 | 0.000016 | com.washingtontimes
468 | 19671754 | 351 | 0.000072 | com.proofpoint
469 | 19669412 | 74 | 0.000441 | net.jsfiddle
470 | 19668352 | 788 | 0.000038 | org.mediawiki
471 | 19668158 | 2851 | 0.000012 | jp.blog
472 | 19667740 | 1479 | 0.000020 | com.firebaseapp
473 | 19667418 | 1618 | 0.000019 | com.webnode
474 | 19665940 | 2173 | 0.000015 | com.pbworks
475 | 19665748 | 3374 | 0.000011 | com.patheos
476 | 19665684 | 3135 | 0.000011 | uk.co.timesonline
477 | 19663980 | 2171 | 0.000015 | google.ai
478 | 19663354 | 233 | 0.000103 | com.squarespace
479 | 19662188 | 2904 | 0.000012 | fr.rfi
480 | 19660984 | 1454 | 0.000020 | gov.supremecourt
481 | 19659200 | 1889 | 0.000016 | int.unfccc
482 | 19658534 | 331 | 0.000076 | com.office
483 | 19656526 | 577 | 0.000046 | pl.google
484 | 19654098 | 991 | 0.000030 | gov.wa
485 | 19652796 | 804 | 0.000038 | gov.sba
486 | 19652626 | 1267 | 0.000023 | com.cognitoforms
487 | 19650066 | 2207 | 0.000015 | org.csis
488 | 19649008 | 366 | 0.000068 | io.codepen
489 | 19648750 | 2344 | 0.000014 | com.kobo
490 | 19646512 | 110 | 0.000281 | com.mailchimp
491 | 19643428 | 1671 | 0.000018 | edu.wustl
492 | 19642572 | 2734 | 0.000013 | edu.kit
493 | 19642334 | 1480 | 0.000020 | org.hrw
494 | 19642276 | 953 | 0.000031 | edu.umich
495 | 19641856 | 1389 | 0.000021 | com.dictionary
496 | 19641544 | 836 | 0.000036 | com.mapquest
497 | 19640836 | 1747 | 0.000017 | org.worldcat
498 | 19640276 | 3621 | 0.000010 | net.aljazeera
499 | 19640144 | 357 | 0.000071 | com.photobucket
500 | 19639948 | 2046 | 0.000015 | net.cnki
501 | 19638510 | 1705 | 0.000018 | com.secondlife
502 | 19638416 | 2421 | 0.000014 | int.wmo
503 | 19637888 | 1089 | 0.000027 | org.ilo
504 | 19637450 | 1100 | 0.000027 | google.blog
505 | 19636692 | 378 | 0.000067 | com.meetup
506 | 19634634 | 995 | 0.000030 | uk.co.pinterest
507 | 19633770 | 3397 | 0.000010 | com.freehostia
508 | 19630412 | 3256 | 0.000011 | com.doodlekit
509 | 19629746 | 936 | 0.000032 | com.arstechnica
510 | 19628370 | 3730 | 0.000009 | com.colourlovers
511 | 19628356 | 1696 | 0.000018 | ru.ucoz
512 | 19628298 | 952 | 0.000031 | com.thenextweb
513 | 19624458 | 2286 | 0.000014 | org.unep
514 | 19622342 | 2252 | 0.000014 | org.icrc
515 | 19621808 | 1424 | 0.000021 | com.findlaw
516 | 19621134 | 2334 | 0.000014 | com.similarweb
517 | 19620696 | 481 | 0.000054 | com.gmail
518 | 19619304 | 3040 | 0.000012 | io.soup
519 | 19616246 | 1437 | 0.000021 | com.imageshack
520 | 19615956 | 2785 | 0.000013 | com.sputniknews
521 | 19614078 | 3080 | 0.000012 | com.smore
522 | 19613232 | 3246 | 0.000011 | org.iucnredlist
523 | 19611766 | 3117 | 0.000011 | com.kinja
524 | 19611760 | 1883 | 0.000016 | com.csmonitor
525 | 19611604 | 145 | 0.000180 | ru.mail
526 | 19610088 | 1339 | 0.000022 | gov.uscis
527 | 19608554 | 446 | 0.000058 | net.secureservercdn
528 | 19606314 | 3004 | 0.000012 | sh.now
529 | 19605748 | 427 | 0.000061 | tv.twitch
530 | 19604994 | 1580 | 0.000019 | link.app
531 | 19600814 | 440 | 0.000059 | com.statista
532 | 19599160 | 3676 | 0.000010 | jp.hatenablog
533 | 19595550 | 4356 | 0.000008 | com.coroflot
534 | 19595264 | 3177 | 0.000011 | org.jenkins-ci
535 | 19595158 | 1757 | 0.000017 | gov.oregon
536 | 19593130 | 3200 | 0.000011 | li.paper
537 | 19593106 | 3847 | 0.000009 | com.pixar
538 | 19589878 | 3095 | 0.000011 | com.shell
539 | 19588194 | 4035 | 0.000009 | com.scienceblogs
540 | 19586188 | 1625 | 0.000019 | org.amnesty
541 | 19584824 | 892 | 0.000034 | com.thedailybeast
542 | 19582464 | 1767 | 0.000017 | org.pypi
543 | 19582346 | 2149 | 0.000015 | com.foreignpolicy
544 | 19580310 | 2849 | 0.000012 | com.instapaper
545 | 19579672 | 2910 | 0.000012 | org.accessnow
546 | 19578614 | 1602 | 0.000019 | com.surveygizmo
547 | 19577780 | 1733 | 0.000018 | ca.globalnews
548 | 19576200 | 3175 | 0.000011 | de.uni-koeln
549 | 19576198 | 239 | 0.000101 | io.shields
550 | 19576184 | 3377 | 0.000011 | org.lds
551 | 19575902 | 2238 | 0.000014 | org.rand
552 | 19574790 | 207 | 0.000114 | com.salesforce
553 | 19574544 | 3438 | 0.000010 | net.mootools
554 | 19574428 | 2357 | 0.000014 | at.ac.univie
555 | 19574182 | 4050 | 0.000009 | org.marxists
556 | 19571664 | 2860 | 0.000012 | org.panda
557 | 19571194 | 2806 | 0.000013 | com.oprah
558 | 19568576 | 1874 | 0.000016 | com.justia
559 | 19567970 | 3471 | 0.000010 | org.avaaz
560 | 19567854 | 2880 | 0.000012 | com.openai
561 | 19567764 | 3597 | 0.000010 | org.neocities
562 | 19567260 | 3753 | 0.000009 | cn.edu.sdu
563 | 19564960 | 762 | 0.000040 | com.netflix
564 | 19564120 | 498 | 0.000052 | com.oreilly
565 | 19563086 | 4405 | 0.000008 | com.yam
566 | 19562248 | 227 | 0.000105 | uk.co.amazon
567 | 19562204 | 866 | 0.000035 | com.zoho
568 | 19560956 | 629 | 0.000044 | com.zdnet
569 | 19559966 | 1298 | 0.000023 | ly.snip
570 | 19558790 | 1790 | 0.000017 | ch.ipcc
571 | 19558664 | 993 | 0.000030 | uk.parliament
572 | 19558508 | 3787 | 0.000009 | com.nestle
573 | 19556304 | 1254 | 0.000024 | se.google
574 | 19556292 | 2997 | 0.000012 | com.treehugger
575 | 19555184 | 1011 | 0.000029 | net.nocookie
576 | 19555096 | 4644 | 0.000008 | com.x0
577 | 19553368 | 3631 | 0.000010 | org.tvtropes
578 | 19550992 | 1141 | 0.000026 | org.sphinx-doc
579 | 19549994 | 2122 | 0.000015 | ru.mos
580 | 19548820 | 3044 | 0.000012 | es.csic
581 | 19548530 | 2913 | 0.000012 | uk.gov.companieshouse
582 | 19546576 | 1034 | 0.000029 | com.engadget
583 | 19546230 | 1183 | 0.000025 | com.here
584 | 19545492 | 5060 | 0.000007 | com.dbs
585 | 19545438 | 4103 | 0.000009 | br.ufrj
586 | 19544204 | 2159 | 0.000015 | edu.colostate
587 | 19543398 | 2706 | 0.000013 | de.uni-heidelberg
588 | 19540500 | 3059 | 0.000012 | com.pearltrees
589 | 19539268 | 2176 | 0.000015 | net.openid
590 | 19537880 | 2600 | 0.000013 | com.mystrikingly
591 | 19537844 | 3880 | 0.000009 | com.chinatimes
592 | 19535834 | 2400 | 0.000014 | link.page
593 | 19534182 | 2354 | 0.000014 | com.real
594 | 19533432 | 1836 | 0.000017 | org.ncsl
595 | 19532288 | 301 | 0.000082 | com.surveymonkey
596 | 19531930 | 362 | 0.000070 | com.hp
597 | 19531412 | 1193 | 0.000025 | org.js
598 | 19530700 | 2135 | 0.000015 | com.123formbuilder
599 | 19528842 | 2426 | 0.000014 | org.vim
600 | 19528104 | 3205 | 0.000011 | pl.wp
601 | 19528018 | 2602 | 0.000013 | au.com.sbs
602 | 19526780 | 170 | 0.000148 | com.yelp
603 | 19526216 | 2499 | 0.000013 | uk.ac.kcl
604 | 19524346 | 1338 | 0.000022 | org.aarp
605 | 19523692 | 2621 | 0.000013 | th.co.google
606 | 19523156 | 1006 | 0.000029 | uk.gov.legislation
607 | 19523042 | 260 | 0.000094 | com.getbootstrap
608 | 19522856 | 3663 | 0.000010 | com.magcloud
609 | 19522274 | 3990 | 0.000009 | com.zynga
610 | 19521942 | 1268 | 0.000023 | tw.com.google
611 | 19521922 | 2829 | 0.000013 | com.kaggle
612 | 19520130 | 948 | 0.000031 | gov.gpo
613 | 19519742 | 946 | 0.000032 | com.about
614 | 19519714 | 3273 | 0.000011 | org.rsf
615 | 19518740 | 2976 | 0.000012 | org.tigris
616 | 19518224 | 2727 | 0.000013 | uk.ac.leeds
617 | 19515512 | 3535 | 0.000010 | de.dw
618 | 19515434 | 3019 | 0.000012 | org.cfr
619 | 19514574 | 3253 | 0.000011 | de.uni-freiburg
620 | 19513570 | 3640 | 0.000010 | de.uni-konstanz
621 | 19512714 | 3881 | 0.000009 | ua.at
622 | 19511254 | 2117 | 0.000015 | info.worldometers
623 | 19510314 | 4657 | 0.000008 | com.embarcadero
624 | 19509370 | 2999 | 0.000012 | vn.zing
625 | 19509134 | 3229 | 0.000011 | com.bangkokpost
626 | 19508804 | 3615 | 0.000010 | ly.rebrand
627 | 19508548 | 2008 | 0.000016 | gov.ky
628 | 19508426 | 4009 | 0.000009 | org.wilsoncenter
629 | 19506774 | 4059 | 0.000009 | jp.hatenadiary
630 | 19506284 | 4374 | 0.000008 | com.musictoday
631 | 19505388 | 3824 | 0.000009 | org.constitutioncenter
632 | 19505186 | 372 | 0.000067 | com.booking
633 | 19504402 | 2579 | 0.000013 | com.eiseverywhere
634 | 19503800 | 4038 | 0.000009 | com.itsnicethat
635 | 19503776 | 3331 | 0.000011 | il.ac.tau
636 | 19502096 | 2359 | 0.000014 | mx.com.google
637 | 19500806 | 3736 | 0.000009 | com.db
638 | 19498928 | 312 | 0.000080 | com.ebay
639 | 19498588 | 3578 | 0.000010 | jp.hateblo
640 | 19498166 | 3348 | 0.000011 | org.democracynow
641 | 19497296 | 3975 | 0.000009 | edu.odu
642 | 19496812 | 2815 | 0.000013 | dk.au
643 | 19496626 | 4220 | 0.000008 | com.etymonline
644 | 19496184 | 2885 | 0.000012 | uk.gov.metoffice
645 | 19495756 | 361 | 0.000070 | com.skype
646 | 19495566 | 3570 | 0.000010 | com.hsbc
647 | 19494844 | 2228 | 0.000015 | com.bankrate
648 | 19494104 | 2240 | 0.000014 | gov.wi
649 | 19493352 | 1815 | 0.000017 | fi.google
650 | 19493306 | 4426 | 0.000008 | com.x10host
651 | 19492136 | 3224 | 0.000011 | org.royalsociety
652 | 19491096 | 817 | 0.000037 | com.pexels
653 | 19490358 | 532 | 0.000048 | com.mashable
654 | 19490282 | 4614 | 0.000008 | com.epochtimes
655 | 19490018 | 1174 | 0.000025 | edu.ucla
656 | 19489656 | 3226 | 0.000011 | cc.reurl
657 | 19489414 | 3430 | 0.000010 | com.dailykos
658 | 19489360 | 3742 | 0.000009 | uk.ac.uea
659 | 19488050 | 3705 | 0.000010 | ca.shaw
660 | 19486104 | 1968 | 0.000016 | uk.gov.tfl
661 | 19485988 | 3434 | 0.000010 | uk.ac.nhm
662 | 19485032 | 3060 | 0.000012 | com.ipage
663 | 19484754 | 2498 | 0.000013 | com.prweek
664 | 19484598 | 1819 | 0.000017 | gov.usembassy
665 | 19483966 | 4861 | 0.000007 | am.do
666 | 19483636 | 3086 | 0.000011 | com.viki
667 | 19483518 | 3252 | 0.000011 | se.liu
668 | 19482718 | 3066 | 0.000012 | com.coca-colacompany
669 | 19482580 | 4232 | 0.000008 | br.ufrgs
670 | 19482498 | 3639 | 0.000010 | de.uni-kiel
671 | 19481340 | 1453 | 0.000020 | com.speakerdeck
672 | 19480718 | 3077 | 0.000012 | net.openreview
673 | 19480660 | 2208 | 0.000015 | de.auswaertiges-amt
674 | 19480248 | 208 | 0.000113 | com.hubspot
675 | 19479762 | 2026 | 0.000016 | com.lexisnexis
676 | 19478700 | 2106 | 0.000015 | net.ucoz
677 | 19477552 | 3494 | 0.000010 | com.iconarchive
678 | 19477532 | 819 | 0.000037 | com.steampowered
679 | 19477286 | 756 | 0.000040 | com.xiti
680 | 19477132 | 2486 | 0.000013 | com.post-gazette
681 | 19476898 | 3369 | 0.000011 | com.eklablog
682 | 19476632 | 2937 | 0.000012 | uk.co.bbci
683 | 19476378 | 1911 | 0.000016 | hu.google
684 | 19476160 | 4399 | 0.000008 | com.jacobinmag
685 | 19475974 | 3323 | 0.000011 | uk.ac.sussex
686 | 19474368 | 3068 | 0.000012 | uk.ac.qmul
687 | 19474212 | 3930 | 0.000009 | nf.co
688 | 19473014 | 4114 | 0.000009 | com.collinsdictionary
689 | 19472896 | 5215 | 0.000007 | com.evaair
690 | 19472846 | 2572 | 0.000013 | com.marketwire
691 | 19472580 | 3138 | 0.000011 | au.com.telstra
692 | 19472114 | 3916 | 0.000009 | it.unitn
693 | 19471646 | 898 | 0.000034 | com.visualstudio
694 | 19471330 | 3807 | 0.000009 | in.ernet
695 | 19470994 | 2906 | 0.000012 | nl.rug
696 | 19468708 | 5297 | 0.000007 | org.arkive
697 | 19468252 | 252 | 0.000096 | org.drupal
698 | 19467050 | 3460 | 0.000010 | ca.dal
699 | 19467046 | 3693 | 0.000010 | com.canada
700 | 19465642 | 1451 | 0.000021 | com.tinypic
701 | 19465304 | 3136 | 0.000011 | org.wri
702 | 19465034 | 3698 | 0.000010 | com.la-croix
703 | 19464108 | 4557 | 0.000008 | com.mitsubishielectric
704 | 19463828 | 4748 | 0.000008 | com.gamejolt
705 | 19462976 | 2789 | 0.000013 | gr.google
706 | 19462882 | 4882 | 0.000007 | cz.webgarden
707 | 19462404 | 3079 | 0.000012 | my.com.thestar
708 | 19461830 | 269 | 0.000092 | net.php
709 | 19461640 | 4329 | 0.000008 | au.gov.fairwork
710 | 19460770 | 2279 | 0.000014 | co.pcdn
711 | 19460176 | 3943 | 0.000009 | uk.ac.essex
712 | 19459984 | 121 | 0.000231 | org.networkadvertising
713 | 19459684 | 3396 | 0.000010 | org.rferl
714 | 19459068 | 4211 | 0.000008 | com.sc
715 | 19459020 | 3292 | 0.000011 | com.blogfa
716 | 19458794 | 3382 | 0.000010 | ca.yelp
717 | 19457580 | 4102 | 0.000009 | edu.utm
718 | 19457248 | 5694 | 0.000007 | com.anghami
719 | 19456532 | 5210 | 0.000007 | su.clan
720 | 19456144 | 4095 | 0.000009 | it.justpaste
721 | 19456006 | 414 | 0.000062 | com.sxsw
722 | 19455914 | 3258 | 0.000011 | com.waterstones
723 | 19454602 | 3960 | 0.000009 | com.jigsy
724 | 19454516 | 838 | 0.000036 | com.intel
725 | 19454394 | 4032 | 0.000009 | ee.ut
726 | 19453242 | 916 | 0.000033 | com.docker
727 | 19452988 | 738 | 0.000041 | com.samsung
728 | 19451802 | 3422 | 0.000010 | es.ucm
729 | 19450718 | 2503 | 0.000013 | com.washingtonexaminer
730 | 19450342 | 3951 | 0.000009 | tl.page
731 | 19450206 | 2209 | 0.000015 | org.wbur
732 | 19449036 | 4112 | 0.000009 | site.negocio
733 | 19448922 | 2773 | 0.000013 | com.yell
734 | 19448516 | 3988 | 0.000009 | com.fatcow
735 | 19448266 | 3282 | 0.000011 | pl.poznan
736 | 19448198 | 135 | 0.000194 | com.youku
737 | 19447930 | 2878 | 0.000012 | ae.thenational
738 | 19447766 | 4705 | 0.000008 | id.co.kaskus
739 | 19447668 | 3407 | 0.000010 | com.afp
740 | 19447602 | 5336 | 0.000007 | net.manilatimes
741 | 19446734 | 419 | 0.000062 | com.caniuse
742 | 19446168 | 1470 | 0.000020 | com.pastebin
743 | 19445910 | 3387 | 0.000010 | uk.org.rspb
744 | 19445736 | 765 | 0.000039 | com.moz
745 | 19444376 | 4027 | 0.000009 | lv.draugiem
746 | 19441604 | 2508 | 0.000013 | gov.dni
747 | 19440874 | 2593 | 0.000013 | ro.google
748 | 19440144 | 2946 | 0.000012 | com.broadwayworld
749 | 19439574 | 3750 | 0.000009 | ru.msu
750 | 19439374 | 3766 | 0.000009 | pl.cba
751 | 19439332 | 4137 | 0.000009 | org.rfa
752 | 19439280 | 5562 | 0.000007 | org.bukkit
753 | 19439086 | 2013 | 0.000016 | scot.gov
754 | 19438868 | 133 | 0.000200 | com.constantcontact
755 | 19438826 | 5638 | 0.000007 | org.adbusters
756 | 19438094 | 4517 | 0.000008 | google.design
757 | 19437654 | 4154 | 0.000008 | com.macobserver
758 | 19437088 | 1649 | 0.000018 | fr.pagesjaunes
759 | 19437020 | 2502 | 0.000013 | com.thenation
760 | 19436776 | 3973 | 0.000009 | com.bbcamerica
761 | 19434556 | 4857 | 0.000007 | com.orgfree
762 | 19433810 | 2978 | 0.000012 | com.channelnewsasia
763 | 19432506 | 735 | 0.000041 | gov.sec
764 | 19432502 | 4008 | 0.000009 | com.teamspeak
7651943243028000.000013org.gnupg
7661943226037800.000009com.the-scientist
7671943225230150.000012com.laweekly
7681943144629210.000012au.edu.sydney
7691943008435770.000010uk.co.yougov
7701943000031400.000011vn.com.google
7711942994244170.000008com.50webs
7721942900431240.000011org.repec
7731942893832150.000011org.ourworldindata
7741942789035060.000010com.tradingeconomics
7751942735231020.000011tw.com.pchome
7761942658233320.000011com.monday
7771942655635560.000010org.project-syndicate
7781942555223310.000014com.amebaownd
7791942489015960.000019org.whatbrowser
7801942475019560.000016org.americanbar
7811942468037390.000009ie.thejournal
782194241521040.000298com.stripe
7831942414040140.000009com.hatenadiary
7841942406029330.000012org.thinkprogress
7851942371230730.000012uk.gov.london
7861942305439270.000009com.thesaurus
7871942300634750.000010net.webself
7881942296434320.000010io.pantheon
7891942171234200.000010uk.ac.exeter
7901942150843430.000008com.appledaily
7911942111835280.000010com.bravesites
7921942081651780.000007com.bambuser
7931942059233790.000011com.foreignaffairs
7941941937824320.000013com.instructables
7951941638821850.000015vn.vietnamnet
7961941473639940.000009com.webcindario
7971941432828230.000013org.ewg
7981941393445340.000008ws.nimb
7991941377828330.000013org.fullfact
800194133522560.000095us.zoom
8011941255636850.000010com.encyclopedia
8021941247438970.000009de.uni-erlangen
8031941082253410.000007net.boards
804194095983410.000074com.histats
8051940953442010.000008is.pse
806194094367480.000040fm.last
8071940780836610.000010com.mongabay
8081940704032200.000011me.site123
8091940633834360.000010com.seetickets
8101940555058380.000007com.gamigo
8111940440016660.000018com.materialdesignicons
8121940410851400.000007bd.com.google
813194032427900.000038com.venturebeat
8141940121846010.000008uk.org.phrases
8151940078032130.000011com.instructure
8161940029828170.000013gov.arkansas
81719399890720.000444com.livestream
8181939955440810.000009cat.uab
8191939948635460.000010org.lacity
8201939937236120.000010com.heraldscotland
8211939837014990.000020com.teachable
8221939667228950.000012com.foodandwine
8231939575212330.000024com.createjs
8241939427422660.000014com.ajc
8251939417239500.000009com.rappler
8261939403023550.000014net.noscript
8271939398241400.000009jp.doorblog
8281939288228730.000012com.timeshighereducation
829193922382750.000089com.bandcamp
8301938933239690.000009jp.ne.hi-ho
8311938809436290.000010net.inquirer
832193878825520.000047com.cisco
8331938731840760.000009pl.lublin
8341938637016570.000018com.pcworld
835193834042660.000093com.typeform
836193828862030.000116com.naver
8371938269837230.000010gov.bts
8381938219218160.000017jp.makeshop
8391938210244620.000008com.tor
8401938207245130.000008com.weightwatchers
8411938134614380.000021org.khanacademy
842193812749540.000031com.thinkwithgoogle
8431938102033850.000010uk.ac.jisc
8441938023840880.000009ly.genial
8451937998640070.000009com.themoscowtimes
8461937850032720.000011com.nyt
8471937843437600.000009com.springernature
8481937835633900.000010int.cbd
8491937785460450.000006es.xurl
8501937689817560.000017com.netsolhost
8511937659838520.000009au.edu.griffith
8521937605447400.000008co.edu.unal
8531937604040740.000009kr.co.koreatimes
854193745887270.000042com.deloitte
8551937430049860.000007org.edc
8561937394041490.000008vn.tienphong
8571937347635150.000010com.thediplomat
8581937293240990.000009uk.ac.lancs
8591937279850060.000007com.inoreader
8601937274649220.000007com.ueuo
8611937259415850.000019tv.ustream
8621937257632340.000011com.tapatalk
8631937235634160.000010nl.wur
8641937210648480.000007net.hypermart
8651937163622930.000014org.kff
866193693563980.000064com.pubmatic
8671936898236250.000010org.grist
8681936848030880.000011tw.gov.cdc
8691936828833890.000010com.gothamist
8701936813011060.000027com.gizmodo
8711936811641010.000009com.globalpost
872193676768140.000037gov.nist
8731936753645630.000008org.globalsecurity
8741936645445470.000008build.bazel
8751936638437820.000009us.ms.state
8761936587842560.000008gr.ntua
8771936577644440.000008se.thelocal
8781936537229630.000012com.politifact
8791936512813170.000023com.ensighten
8801936358850970.000007ru.my1
8811936268034680.000010com.rabbitmq
8821935969841380.000009com.elasticbeanstalk
8831935957413640.000022com.billboard
8841935912247660.000008cc.dict
8851935877456870.000007fi.mbnet
886193573908790.000035com.aliexpress
887193569182100.000111to.amzn
8881935566842750.000008edu.ohio
8891935554634520.000010com.thejakartapost
8901935535032770.000011vn.com.dantri
8911935508052850.000007com.galvanize
8921935488034840.000010jp.go.ndl
8931935479047100.000008com.kiwibox
8941935451421400.000015org.linuxfoundation
8951935450048010.000007ru.nnov
8961935316642880.000008gr.auth
8971935297022570.000014net.vnexpress
8981935177029000.000012com.crashlytics
8991935159410450.000028com.dropboxusercontent
9001935082834390.000010com.scotusblog
9011935071240900.000009org.carnegieendowment
902193502783950.000064com.atlassian
9031934972634650.000010com.study
904193487243500.000072com.mapbox
9051934853210460.000028com.redhat
9061934788617990.000017com.bravenet
9071934746042840.000008uk.org.npg
9081934715244630.000008com.btplc
9091934714852890.000007ru.drom
9101934654224300.000013com.vimeopro
9111934590044190.000008edu.marquette
912193456444260.000061com.adweek
913193451449140.000033com.shutterstock
9141934509010160.000029com.ubuntu
9151934196057120.000007in.ac.nptel
9161934148812270.000024com.msdn
9171934071447070.000008com.vocabulary
9181934068039290.000009edu.uaf
9191933965839190.000009com.atavist
9201933945632010.000011com.healthgrades
9211933909225460.000013com.kinstacdn
9221933838423450.000014com.gazhall
9231933793853980.000007com.asmallorange
9241933780037970.000009com.generalmills
9251933617645850.000008vn.vtc
9261933590815190.000020cn.gov.mofcom
927193337787970.000038com.box
9281933360639660.000009si.uni-lj
9291933332241700.000008az.president
9301933319417880.000017org.reactjs
9311933241236050.000010com.postaffiliatepro
9321933192251920.000007edu.uah
9331933128035990.000010org.openedition
9341933069648380.000007com.kapook
9351933038241530.000008org.caringbridge
936193303744830.000053com.aol
9371932961423030.000014org.nfpa
9381932953859560.000006com.glosbe
9391932919441240.000009com.mcall
9401932762242890.000008ru.tmweb
9411932687641260.000009uk.co.liverpoolecho
9421932642242440.000008com.atwebpages
9431932598010670.000028com.freepik
9441932479040850.000009org.specialolympics
9451932386848450.000007net.freeforums
9461932367647440.000008uk.ac.westminster
9471932353240920.000009com.tok2
9481932346010250.000029com.elpais
9491932315049460.000007tw.com.sina
9501932250832960.000011com.wowza
951193223063170.000079com.webs
9521932202446970.000008com.warriorplus
9531932191834140.000010com.cityam
9541932181244820.000008org.fee
9551932152048540.000007tw.edu.ntnu
9561932129649620.000007com.sparknotes
9571932020245160.000008com.newspapers
9581931963421920.000015com.tutsplus
9591931960058680.000007com.ananova
9601931927438180.000009org.opensecrets
961193191346330.000044gov.uspto
9621931872256800.000007su.moy
9631931836610130.000029com.uk
9641931826649360.000007ru.pr-cy
9651931805838270.000009cz.centrum
9661931778041580.000008edu.niu
9671931532016650.000018org.webkit
9681931501446920.000008pl.edu.amu
9691931408451860.000007com.artfire
9701931389438000.000009org.ascd
9711931210638010.000009edu.scu
9721931174243070.000008com.taipeitimes
9731931156843510.000008edu.whoi
9741931085459490.000006com.voatiengviet
9751931074831000.000011com.broadcastingcable
9761931072046550.000008hk.rthk
9771931024657030.000007com.enotes
978193099104880.000053com.indiatimes
979193096608600.000035com.playstation
9801930904048660.000007com.brothersoft
9811930894827080.000013uk.gov.defra
982193076062310.000103org.whatwg
9831930717844510.000008com.batchgeo
984193071187510.000040com.psychologytoday
9851930636842630.000008uk.co.lrb
9861930635050340.000007ca.pe.gov
9871930588441590.000008com.ecowatch
9881930382041950.000008com.williamhill
9891930354857670.000007pt.ipp
9901930297248430.000007uk.org.38degrees
9911930162413030.000023com.technologyreview
9921930146440910.000009org.spie
993193010689590.000031com.libsyn
9941930057247950.000007com.storeboard
9951930054832600.000011de.bmel
9961929944847490.000008net.onlinewebshop
9971929927438720.000009ru.1gb
998192986542790.000088com.automattic
9991929850238700.000009com.piie
10001929744053060.000007com.allthatsinteresting

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

May/June 2020 crawl archive now available

The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives.

Starting with this crawl, the WET files indicate the natural language(s) a text is written in. The language is detected using Compact Language Detector 2 (CLD2); since August 2018 it had been available only in WARC and WAT files and in the URL indexes. It is now also provided in WET files in the WARC header "WARC-Identified-Content-Language". Up to three languages are detected per document and given as a comma-separated list of ISO-639-3 codes. Here is an example WET record fragment:

...
WARC-Identified-Content-Language: isl,eng
Content-Type: text/plain
Content-Length: 10494

Bananabrauð með Nutella – Ljúfmeti og lekkerheit
...

Additional information about this improvement is given in the corresponding issue report.
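
As a minimal sketch of how the new header might be read, the following Python snippet extracts the language codes from the header lines of a WET record. The function name identified_languages and the sample string are illustrative assumptions, not part of any Common Crawl tool; a real pipeline would more likely rely on a WARC parsing library.

```python
def identified_languages(wet_header_block):
    """Return the language codes listed in the
    WARC-Identified-Content-Language header, or an empty list if absent."""
    for line in wet_header_block.splitlines():
        name, sep, value = line.partition(":")
        # Header names are compared case-insensitively to be tolerant.
        if sep and name.strip().lower() == "warc-identified-content-language":
            return [code.strip() for code in value.split(",") if code.strip()]
    return []


# Header lines copied from the example fragment above.
sample_headers = (
    "WARC-Identified-Content-Language: isl,eng\n"
    "Content-Type: text/plain\n"
    "Content-Length: 10494\n"
)

print(identified_languages(sample_headers))
```

Records without the header yield an empty list, so callers can treat "no language detected" and "header missing" the same way if that suits their use case.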

Archive Location and Download

The May/June crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-24/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2020-24/segment.paths.gz          100
WARC files               CC-MAIN-2020-24/warc.paths.gz             60000   53.16
WAT files                CC-MAIN-2020-24/wat.paths.gz              60000   19.02
WET files                CC-MAIN-2020-24/wet.paths.gz              60000   8.42
Robots.txt files         CC-MAIN-2020-24/robotstxt.paths.gz        60000   0.22
Non-200 responses files  CC-MAIN-2020-24/non200responses.paths.gz  60000   2.77
URL index files          CC-MAIN-2020-24/cc-index.paths.gz         302     0.22
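
The prefixing step described above can be sketched in a few lines of Python. This is only an illustration: the function name expand_paths is made up, and the sample listing is a stand-in built in memory with an invented file path rather than a real entry from one of the listings.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gzipped_listing, prefix):
    """Decompress a *.paths.gz listing (given as bytes) and
    prepend the chosen prefix to every non-empty line."""
    with gzip.open(io.BytesIO(gzipped_listing), "rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical one-line listing; real listings contain one path per file.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2020-24/segments/EXAMPLE/wet/example-00000.warc.wet.gz\n"
)

for url in expand_paths(sample_listing, HTTP_PREFIX):
    print(url)
```

Passing S3_PREFIX instead of HTTP_PREFIX produces the corresponding S3 paths from the same listing.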

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-24/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March/April 2020 crawl archive now available

The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Archive Location and Download

The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-16/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2020-16/segment.paths.gz          100
WARC files               CC-MAIN-2020-16/warc.paths.gz             56000   62.67
WAT files                CC-MAIN-2020-16/wat.paths.gz              56000   20.37
WET files                CC-MAIN-2020-16/wet.paths.gz              56000   8.97
Robots.txt files         CC-MAIN-2020-16/robotstxt.paths.gz        56000   0.19
Non-200 responses files  CC-MAIN-2020-16/non200responses.paths.gz  56000   1.39
URL index files          CC-MAIN-2020-16/cc-index.paths.gz         302     0.21

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-16/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2020 crawl archive now available

The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Improvements and Fixes

The HTTP headers in WARC response records have been fixed: the HTTP response status line now has a white space following the status code if the reason-phrase is empty. E.g., if a server sends an empty message (instead of “OK”), the status line will include a trailing space character: “HTTP/1.1 200 ”. Following RFC 7230 the white space between status code and message is mandatory. Please refer to the bug report NUTCH-2763 for further details.
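
To illustrate why the trailing space matters, here is a small Python sketch of a strict status-line parser that requires the space after the status code. The regular expression and the function name parse_status_line are assumptions made for this example and are not taken from Nutch or any WARC library.

```python
import re

# HTTP-version SP status-code SP reason-phrase; the reason-phrase may be empty,
# but the space before it is required.
STATUS_LINE = re.compile(r"^(HTTP/\d\.\d) (\d{3}) (.*)$")


def parse_status_line(line):
    """Split an HTTP status line into (version, status code, reason phrase)."""
    match = STATUS_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("malformed status line: %r" % line)
    version, code, reason = match.groups()
    return version, int(code), reason


print(parse_status_line("HTTP/1.1 200 OK\r\n"))
print(parse_status_line("HTTP/1.1 200 \r\n"))  # empty reason phrase, space kept
```

A status line written without the trailing space, such as "HTTP/1.1 200", is rejected by this strict pattern, which is the kind of failure the fix avoids for parsers that follow the grammar literally.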

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-10/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2020-10/segment.paths.gz          100
WARC files               CC-MAIN-2020-10/warc.paths.gz             56000   49.28
WAT files                CC-MAIN-2020-10/wat.paths.gz              56000   17.98
WET files                CC-MAIN-2020-10/wet.paths.gz              56000   7.97
Robots.txt files         CC-MAIN-2020-10/robotstxt.paths.gz        56000   0.22
Non-200 responses files  CC-MAIN-2020-10/non200responses.paths.gz  56000   2.21
URL index files          CC-MAIN-2020-10/cc-index.paths.gz         302     0.2

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-10/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases (e.g., Nov/Dec/Jan 2017-2018 Webgraphs). You may also visit the projects cc-webgraph and cc-pyspark which host all scripts and tools required to construct the graphs.

Host-level graph

The graph consists of 1.24 billion nodes and 4.54 billion edges and includes dangling nodes, i.e., hosts that have not been crawled but are linked to from a crawled page. There are 1.17 billion dangling nodes (95%), and the largest strongly connected component contains 45 million (3.6%) nodes.

You can download the graph and the ranks of all 1.24 billion hosts from AWS S3 at the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/host/ as a prefix to access the files from anywhere.

Download files of the Common Crawl Nov/Dec/Jan 2019-20 host-level webgraph

Size      File                                                Description
7.23 GB   cc-main-2019-20-nov-dec-jan-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 12 vertices files
20.16 GB  cc-main-2019-20-nov-dec-jan-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 24 edges files
8.42 GB   cc-main-2019-20-nov-dec-jan-host.graph              graph in BVGraph format
2 kB      cc-main-2019-20-nov-dec-jan-host.properties
10.80 GB  cc-main-2019-20-nov-dec-jan-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2019-20-nov-dec-jan-host-t.properties
1 kB      cc-main-2019-20-nov-dec-jan-host.stats              WebGraph statistics
16.32 GB  cc-main-2019-20-nov-dec-jan-host-ranks.txt.gz       harmonic centrality and pagerank

Note that the host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
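
The reversal can be expressed as a short Python function. This is a sketch of the convention as stated in the note above; the function names are chosen for this example, and any further normalization applied when the graph was built is not covered here.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host):
    """Turn a reversed name back into the usual order.
    A stripped 'www.' prefix cannot be recovered."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))
print(unreverse_host("com.example.subdomain"))
```

Reversed names sort by top-level domain first, so hosts of the same domain end up next to each other in a sorted vertices file.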

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org.
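
The aggregation idea can be sketched in Python as follows. Everything here is a simplified assumption for illustration: TOY_SUFFIXES is a tiny hand-picked stand-in for the public suffix list (the real list also has wildcard and exception rules), the function names are invented, and dropping links within the same domain is a choice made for this example rather than a documented property of the released graph.

```python
# Toy stand-in for the public suffix list; NOT the real list.
TOY_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host):
    """Return the public suffix plus one more label, or None if no rule applies."""
    labels = host.split(".")
    # Trying longer candidate suffixes first makes 'co.uk' win over a shorter match.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_edges(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for source, target in host_edges:
        src, dst = pay_level_domain(source), pay_level_domain(target)
        # Assumption for this sketch: links within one domain are dropped.
        if src and dst and src != dst:
            domain_edges.add((src, dst))
    return domain_edges


host_edges = [
    ("blog.example.com", "news.bbc.co.uk"),
    ("www.example.com", "bbc.co.uk"),
    ("a.example.com", "b.example.com"),
]
print(aggregate_edges(host_edges))
```

In this toy input, three host-level edges collapse into a single domain-level edge, which shows why the domain graph has far fewer nodes and edges than the host graph.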

The domain-level graph has 85.8 million nodes and 1.9 billion edges. 44 million nodes (51%) are dangling nodes, and the largest strongly connected component covers 34 million (39%) of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2019-20 domain-level webgraph

Size      File                                                Description
0.59 GB   cc-main-2019-20-nov-dec-jan-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
7.65 GB   cc-main-2019-20-nov-dec-jan-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
4.10 GB   cc-main-2019-20-nov-dec-jan-domain.graph            graph in BVGraph format
2 kB      cc-main-2019-20-nov-dec-jan-domain.properties
4.13 GB   cc-main-2019-20-nov-dec-jan-domain-t.graph          transpose of the graph
2 kB      cc-main-2019-20-nov-dec-jan-domain-t.properties
1 kB      cc-main-2019-20-nov-dec-jan-domain.stats            WebGraph statistics
1.86 GB   cc-main-2019-20-nov-dec-jan-domain-ranks.txt.gz     harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 86 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2019-2020)

harmonic centrality rank    hc value    page rank    page rank value    reversed hostname
13059839810.019072com.googleapis
22911313630.012214com.facebook
32747513820.013236com.google
42561048040.007452com.twitter
52494712650.007174org.w
62490471260.006611com.youtube
72328150490.004269com.instagram
82244629670.005561org.gmpg
92215475080.005033com.googletagmanager
1022107784130.003001com.linkedin
1121307220100.003433org.wordpress
1221290688200.001717com.gravatar
1321096944110.003266com.cloudflare
1421086168230.001516com.pinterest
1520869868160.002242com.gstatic
1620855286150.002366com.wordpress
1720713268250.001234org.wikipedia
1820641712170.002130com.apple
1920584368140.002460com.bootstrapcdn
2020371046330.001102com.vimeo
2120336732420.000833com.blogspot
2220237058180.001787com.jquery
2320220572500.000732be.youtu
2420218806320.001130com.microsoft
2520119820490.000737com.wp
2620059828190.001764com.adobe
2720029544520.000709com.amazon
2820013346440.000784gl.goo
2919972608350.001020com.amazonaws
3019941934670.000471com.tumblr
3119937390640.000518ly.bit
3219862906290.001164com.macromedia
3319817494300.001151com.baidu
3419816382380.000958com.google-analytics
3519808066310.001142com.googlesyndication
3619781460340.001092net.cloudfront
3719770754240.001254ru.yandex
3819749302530.000693com.flickr
3919700112220.001568com.github
4019698814800.000368com.yahoo
4119676534580.000644eu.europa
42196325061150.000287com.reddit
4319603190410.000918com.addthis
4419561006720.000403com.weebly
4519559908430.000823org.w3
4619550116630.000524me.wp
47195466141080.000313com.googleusercontent
4819531876450.000777io.github
49195232241820.000140org.wikimedia
5019521602700.000422com.medium
5119516098470.000743org.schema
5219496222460.000754net.jsdelivr
5319495662760.000374org.creativecommons
54194726101730.000153com.imgur
5519459848360.000989net.doubleclick
5619451512510.000711com.wix
57194160541860.000138uk.co.bbc
58194083121490.000181com.forbes
5919405426930.000332com.weibo
6019404990600.000601co.t
6119398622280.001192com.fontawesome
6219382484480.000741com.paypal
63193759142100.000114com.cnn
64193722921440.000190org.archive
6519368680610.000583org.mozilla
66193481821890.000137net.sourceforge
67193047743150.000079edu.mit
68193029361740.000152com.theguardian
69192937822780.000089edu.harvard
70192833721790.000144com.bing
71192713901170.000286com.jimdo
72192703681400.000202com.nytimes
7319260752270.001195com.qq
74192594062370.000103com.wsj
7519257754210.001581org.apache
7619256254560.000658com.googleadservices
77192411102150.000111com.washingtonpost
78192378162690.000092com.bloomberg
79192371642590.000094com.techcrunch
80192105625440.000049com.deviantart
81192059461620.000162org.ietf
82191983462470.000098com.oracle
83191946962790.000089com.android
8419193448690.000430com.list-manage
85191749683970.000065com.ted
86191678203210.000078com.reuters
87191619643120.000080com.wired
88191587681540.000173com.wixsite
89191548743410.000073com.ft
90191520683300.000076uk.co.telegraph
91191516664270.000060com.theverge
92191450781550.000172gov.nih
93191438862720.000091com.myspace
94191421723770.000068gov.nasa
95191360222910.000085com.bbc
96191261383390.000074com.example
97191199162390.000103org.python
9819117868820.000361com.whatsapp
99191115501220.000267com.unpkg
100191103461880.000138uk.co.google
101190895047070.000041com.economist
102190887662550.000095com.appspot
103190823003840.000067uk.co.dailymail
104190819042090.000115org.gnu
105190817082620.000093com.githubusercontent
106190801821310.000232com.ytimg
107190771563200.000078org.un
108190758061630.000162com.giphy
109190688603980.000065com.latimes
110190676061690.000157com.twimg
111190664044310.000060com.googleblog
112190561621760.000148com.blogger
113190543102320.000104com.dribbble
114190528242070.000115com.npmjs
115190505245640.000047org.arxiv
116190451946660.000042edu.upenn
117190427701710.000154com.eventbrite
118190366123790.000068com.springer
119190324222770.000090org.ampproject
120190313545570.000047com.gitlab
121190256165960.000045com.vice
122190252362060.000116com.disqus
1231902397810360.000031com.hatenablog
124190234068350.000039edu.columbia
125190186708180.000040io.readthedocs
126190082802050.000116me.t
127190058423900.000066com.w3schools
128190040309410.000034org.chromium
129190039004180.000062com.nature
130190014527160.000041com.slate
131190008621570.000171jp.co.yahoo
132189975663250.000077com.time
133189971644300.000060com.statista
134189903367440.000040com.ubuntu
135189854661580.000167com.yelp
136189825506320.000043org.worldbank
137189824281430.000191com.spotify
138189810783470.000072com.skype
139189786609350.000034com.playstation
140189765243060.000082com.fc2
1411897331213040.000024org.coursera
142189694381210.000281com.stripe
143189685628160.000040com.qz
144189681686170.000044com.git-scm
145189664484860.000053uk.co.independent
146189658101990.000126com.eepurl
147189644449610.000034com.500px
148189643424050.000063net.researchgate
149189626542410.000101com.bandcamp
15018959274550.000669net.facebook
151189564783890.000066com.outlook
152189554902290.000105com.unsplash
153189504606310.000043com.mysql
154189493304190.000062com.theatlantic
155189484561160.000286com.soundcloud
156189483641800.000143com.amazon-adsystem
157189476701230.000259org.networkadvertising
158189407685710.000046org.bitbucket
1591894029811630.000027com.jetbrains
160189364744080.000063com.mozilla
161189361044980.000052com.nationalgeographic
162189329063160.000079com.usatoday
163189308904390.000059com.criteo
164189273428370.000039uk.ac.ox
165189255604680.000054com.fortune
166189249564660.000055com.pixabay
1671892222412780.000024uk.co.thesun
168189214762300.000104net.behance
1691891670015470.000019com.amd
170189155748220.000039com.evernote
17118909918400.000932com.vk
172189093187990.000040com.about
173189079105050.000051uk.co.blogspot
1741890400011930.000026se.haxx
175189036862510.000097gle.forms
176189001327190.000041com.docker
177189000209670.000033uk.co.guardian
178188995763030.000082org.doi
179188983304970.000052me.about
180188968625250.000050gg.discord
1811889443213830.000022com.instructables
182188915821470.000189com.dropbox
1831888843210000.000032com.scientificamerican
184188852283320.000076jp.co.rakuten
185188815489030.000036google.blog
186188757041940.000130com.feedburner
1871887075810810.000030org.altervista
188188690686340.000043org.unesco
1891886896610820.000030org.eclipse
190188684882750.000090gov.ca
191188665528750.000037jp.livedoor
1921886516414900.000020org.phys
193188646742920.000085com.sciencedirect
194188606742120.000113jp.ameblo
195188557005720.000046gov.loc
196188548929930.000033org.cambridge
197188455124030.000064ca.google
198188425586630.000042edu.washington
19918836168710.000408net.slideshare
200188328085430.000049com.cisco
2011882983813640.000023edu.rutgers
202188288083040.000082com.nbcnews
203188272942430.000099ru.rambler
204188265948230.000039au.net.abc
205188260349980.000032uk.co.thetimes
2061882555414720.000021com.bankofamerica
20718821912620.000549com.fb
208188216708630.000037org.sciencemag
209188138448870.000036com.speakerdeck
210188109864120.000063jp.ne.sakura
211188067482200.000109org.iana
2121880211212580.000025com.wikidot
2131879970616500.000018pt.sapo
214187928809050.000036uk.co.mirror
215187906202180.000110edu.stanford
2161878773411430.000028org.kernel
217187865264500.000057com.elsevier
2181878442014520.000021edu.osu
2191878344616220.000019com.googlesource
2201878242613950.000022com.vogue
22118781902740.000401net.akamaihd
222187812588420.000038gov.fcc
2231878092612400.000025ms.1drv
2241877910613770.000022edu.asu
225187788282540.000095com.businessinsider
226187785348440.000038co.ibb
2271877838618430.000016com.wolfram
228187710907210.000041com.trello
229187691481060.000323com.paypalobjects
230187662983010.000083net.windows
2311876223812160.000026jp.geocities
232187621505870.000045com.box
233187618408760.000037com.sciencedaily
234187588022270.000106com.wpengine
235187529864930.000052com.herokuapp
2361875202610300.000032edu.princeton
237187488149150.000035edu.academia
238187482322840.000087com.googlecode
2391874614611770.000027com.asahi
2401874004413750.000022com.newscientist
2411873371814280.000021blog.home
242187324583140.000080com.tinyurl
243187320105160.000051com.udacity
2441873174021030.000014com.wizards
245187299063860.000067com.cnet
2461872964414210.000022com.ndtv
247187293703350.000075com.getpocket
2481872927613490.000023com.fandom
2491872641212410.000025net.seesaa
250187233761670.000158com.imdb
251187184403330.000076org.debian
252187155526990.000041site.business
253187136382610.000093com.live
2541871149810590.000031jp.ne.goo
2551871077815040.000020io.itch
2561870401611550.000027org.greenpeace
2571870225410470.000031com.netlify
2581870156818080.000017net.pixnet
259187005524020.000064com.squareup
2601869985211450.000028co.elastic
261186997563100.000081com.ibm
262186987842030.000118com.stackoverflow
263186981464940.000052com.indiatimes
2641869631234200.000010com.armorgames
265186931482650.000092com.aliyuncs
266186923782260.000106com.optimizely
2671868594417770.000017uk.co.timesonline
268186795709750.000033com.mixcloud
2691867753016180.000019com.itv
270186744601980.000128org.bbb
27118674340570.000648net.fbcdn
2721867414822360.000014com.opendns
2731867268233020.000010tw.com.gamer
274186726464060.000063com.go
275186724864370.000059com.msn
27618672432370.000966com.wixstatic
2771866760614850.000021org.archlinux
278186665464480.000058org.pewresearch
27918664590870.000345com.shopify
280186629148380.000039jp.shinobi
281186625509950.000032com.bmj
2821866051615790.000019com.diigo
283186596881600.000167com.opera
2841865933218220.000017com.youdao
2851865768612590.000025com.angelfire
286186573769060.000036jp.naver
2871865693410380.000031com.thelancet
2881865361015850.000019uk.bl
289186521868640.000037br.com.google
290186499144900.000053com.bigcartel
2911864879610070.000032com.sky
2921864759614080.000022net.daringfireball
2931864259217790.000017uk.ac.kcl
2941864116817520.000017org.maven
295186395445340.000049me.m
2961863929811070.000029com.reverbnation
2971863729016650.000018net.cnki
2981863410210890.000030com.theconversation
299186305044600.000056it.placehold
3001862906411170.000029com.podbean
301186286608810.000036org.fao
3021862695611440.000028co.g
3031862626413700.000023com.dw
304186236461190.000283com.mailchimp
3051862126016310.000019jp.ne.so-net
3061862026413460.000023com.livescience
3071862011020710.000015edu.kit
3081861190013080.000024ca.utoronto
3091861061813100.000024com.webnode
310186083889570.000034au.gov.nsw
3111860696414110.000022com.citrix
3121860370410310.000032jp.jugem
3131860237411150.000029gov.wa
314186000024580.000056com.quora
31518598840990.000326com.godaddy
3161859774013340.000023com.bloglovin
3171859683411940.000026com.serving-sys
318185959769000.000036gov.dhs
3191859512016300.000019org.edx
320185926962440.000099me.wa
3211859258218300.000017com.pearltrees
3221859202215760.000019com.twitpic
3231859174417760.000017cn.people
3241859035011710.000027com.britannica
3251858938018930.000016sg.edu.nus
3261858813017910.000017com.kinja
3271858740422960.000013com.authorstream
3281858592215660.000019ca.mcgill
329185851323800.000068com.kickstarter
3301858163612370.000025com.lulu
3311857694825910.000012com.colourlovers
3321857506419190.000016com.hm
333185742423370.000075com.rackcdn
3341856799422610.000013uk.ac.sussex
3351856395419450.000016org.vim
3361856294610630.000031com.healthline
3371856286019110.000016org.wikibooks
3381856265018670.000016io.soup
3391856005217040.000018nl.blogspot
340185557405090.000051com.mashable
341185537302640.000093com.typepad
3421855348610710.000031com.adjust
343185514563960.000065com.photobucket
3441854452817410.000017org.bitcoin
3451854295222270.000014tw.edu.ntu
3461854107211000.000030com.ecwid
3471854079818200.000017com.indianexpress
3481854028818890.000016co.ello
349185388705750.000046edu.berkeley
3501853737220260.000015com.upi
351185372641320.000232com.squarespace
352185356822800.000089uk.org.ico
3531853492411360.000028com.ssrn
3541853438221520.000014com.viki
3551853381812190.000025it.scoop
356185326062700.000092com.surveymonkey
3571853201616010.000019com.fastcodesign
3581853062017820.000017org.unep
3591852958810570.000031uk.parliament
3601852735619660.000016org.haskell
361185271402240.000107com.etsy
3621852706414420.000021com.shutterfly
3631852538815690.000019uk.org.tate
3641852453028620.000011co.electrek
3651852313426930.000011jp.doorblog
366185228381560.000171com.issuu
3671851914820180.000015com.dezeen
3681851791024300.000013sh.now
3691851753011570.000027com.tradedoubler
3701851502811730.000027gov.weather
3711851361611090.000029com.imageshack
3721851268216930.000018com.channel4
3731851213411160.000029gov.dot
3741851101827030.000011cn.edu.sdu
3751851055411640.000027com.wikia
376185092442820.000088com.huffingtonpost
377185091249530.000034uk.co.pinterest
378185086769240.000035com.arstechnica
379185071562710.000091com.rawgit
380185058124840.000053tv.twitch
3811850572219170.000016th.co.google
3821850313423900.000013uk.ac.nhm
3831850236017640.000017com.netvibes
3841850115618710.000016edu.emory
385185009649180.000035in.amazon
386185002529630.000034com.strikingly
3871849922417730.000017net.bplaced
3881849778633560.000010tw.edu.ntnu
3891849569218110.000017edu.iu
390184945428330.000039com.brightcove
391184916242250.000107com.hubspot
3921849136614700.000021com.wattpad
3931849091414760.000021gov.michigan
3941848922219160.000016nl.tudelft
3951848806414360.000021org.c-span
396184877083940.000065com.meetup
3971848341221100.000014com.kaggle
3981848166212990.000024edu.brookings
39918478490860.000345net.jsfiddle
4001847842026210.000012sh.surge
4011847553022480.000014com.rsa
4021847522017810.000017gov.ahrq
403184741288250.000039org.mediawiki
404184739223460.000072edu.yale
405184726648260.000039com.intel
4061847228815120.000020gov.faa
4071847192619750.000015io.material
4081847173210730.000031com.thenextweb
4091847170618470.000016net.earthlink
4101846961019080.000016jp.blog
411184690928310.000039com.pexels
4121846476010290.000032uk.gov.nationalarchives
4131846091619730.000016com.smashwords
414184590889390.000034org.ieee
4151845742021850.000014com.smore
416184567243450.000072com.livejournal
4171845636033660.000010hk.edu.hkbu
418184535424140.000063com.nypost
4191845346417850.000017com.business-standard
4201845344029070.000011com.yam
4211845124812330.000025org.aarp
4221845040819990.000015com.oprah
4231844994220530.000015org.jpn
4241844988015030.000020org.amnesty
4251844979016510.000018com.avvo
4261844964825400.000012com.cleantechnica
427184495045220.000050edu.cornell
4281844812425860.000012com.mysanantonio
429184475944730.000054io.shields
4301844754414440.000021org.hrw
4311844424024870.000012org.neocities
4321844281019430.000016com.care2
4331844070217140.000018com.snopes
4341844019411480.000027com.gizmodo
4351844005824740.000012com.googledrive
4361843927225940.000012com.iflscience
4371843752415940.000019org.pypi
438184372662680.000092net.php
4391843690219270.000016org.rsc
4401843628813890.000022com.pbworks
4411843597827350.000011com.itsnicethat
4421843536220920.000015ae.thenational
4431843532624610.000012com.hsbc
444184346423380.000074com.hp
4451843263613540.000023uk.co.standard
4461843176418070.000017com.instapaper
447184315964650.000055io.codepen
448184313905530.000047com.buzzfeed
4491843102019100.000016com.secondlife
4501843025824250.000013jp.go.ndl
4511842975621060.000014io.gitlab
452184284323730.000069int.who
4531842712823000.000013org.lds
4541842697624710.000012uk.mod
4551842695417110.000018google.ai
45618426290960.000330de.google
4571842389814270.000021com.thehindu
4581842372417490.000017com.curbed
4591842290216380.000019no.google
460184217383400.000074com.cnbc
4611842068610610.000031com.thedrum
462184197801650.000160com.ebay
463184187906270.000043com.zdnet
4641841845423300.000013pl.cba
4651841639224410.000013com.minds
466184134882010.000125com.salesforce
4671841325215740.000019com.moonfruit
4681841235811560.000027com.mixpanel
4691841167028180.000011tl.page
4701840919619820.000015com.name
4711840908022820.000013jp.hateblo
4721840783025070.000012org.tvtropes
4731840731825800.000012jp.hatenadiary
4741840634825100.000012de.dw
4751840563418540.000016com.googlegroups
4761840550818760.000016mx.com.google
4771840513421980.000014org.aiga
4781840340428830.000011uk.co.birminghammail
479184033403670.000069com.booking
4801840160223140.000013vn.com.google
4811840155017290.000018gov.pa
4821839997216660.000018org.hrc
483183996188820.000036gov.nist
4841839847227420.000011com.exxonmobil
4851839737618410.000016ar.com.google
486183963569890.000033net.clickbank
487183956609760.000033com.matterport
4881839240224290.000013ua.at
4891839052220110.000015uk.ac.leeds
490183874443090.000081gov.cdc
4911838652815580.000019int.unfccc
4921838640823420.000013com.eklablog
493183857004590.000056com.gmail
494183855984010.000064org.npr
4951838483216720.000018gov.maryland
496183843903570.000070com.office
4971838395022400.000014se.liu
4981838381020670.000015com.discovermagazine
4991838340022040.000014com.ipage
5001838162611100.000029com.stackexchange
5011838159424180.000013it.justpaste
502183809744490.000058fr.free
5031838068217180.000018sg.com.google
5041837967210600.000031com.engadget
5051837823824210.000013my.com.thestar
5061837728212730.000024dk.google
5071837713622100.000014org.biorxiv
5081837706218610.000016com.weheartit
5091837419415980.000019uk.gov.tfl
510183712745080.000051gov.whitehouse
5111836933017230.000018ly.snip
5121836900618090.000017com.yourstory
5131836635631540.000011com.bonanza
5141836565028330.000011com.scienceblogs
5151836548414310.000021com.ebayimg
5161836543617740.000017gov.ky
517183635008580.000038com.venturebeat
5181836292411600.000027se.google
5191836245413500.000023com.firebaseapp
520183620261780.000147com.zendesk
5211836015020040.000015uk.gov.metoffice
522183599909280.000035com.windowsphone
5231835975023360.000013com.rediff
524183583885180.000051com.alibaba
5251835525622250.000014com.blogfa
526183552324150.000063com.fastcompany
5271835321214260.000021com.surveygizmo
5281835235220210.000015au.com.telstra
5291835145411340.000028org.sphinx-doc
5301835050220480.000015ro.google
5311835012619040.000016org.tigris
5321834952428350.000011be.lesoir
5331834943026980.000011cz.centrum
5341834937220470.000015link.page
535183492604790.000054org.nodejs
5361834902819600.000016com.marketwire
5371834767222420.000014com.mystrikingly
5381834701822600.000013ch.unige
5391834685027530.000011cat.uab
5401834681828890.000011com.zynga
5411834516415100.000020us.mn.state
5421834162222750.000013com.articulate
543183400129910.000033edu.psu
5441833942221410.000014com.thecvf
5451833902021500.000014es.csic
5461833892228800.000011co.carrd
5471833738016110.000019gov.mo
5481833736022970.000013com.newatlas
5491833569039080.000009jp.rdy
5501833463019900.000015org.iea
5511833359825650.000012com.db
5521833271623100.000013com.webstarts
5531833258424880.000012jp.hatenablog
5541833197623310.000013ly.rebrand
555183313443700.000069com.mapbox
556183312284850.000053com.livechatinc
5571832535219980.000015org.mozillazine
5581832494422710.000013de.uni-freiburg
5591832447213720.000023com.tinypic
560183242528830.000036com.steampowered
5611832384220720.000015uk.ac.york
5621832218610970.000030com.thinkwithgoogle
5631832058225890.000012ru.msu
5641832015624580.000012org.kotlinlang
5651831954016290.000019gov.oregon
5661831891435070.000010com.ingress
5671831812018060.000017gov.wi
568183180565410.000049com.aol
5691831804019690.000016gr.google
5701831782427410.000011lv.draugiem
5711831672023050.000013org.iucnredlist
5721831582220350.000015com.broadwayworld
573183140761340.000221com.youtube-nocookie
5741831395415110.000020net.openid
575183137041680.000158com.tripadvisor
576183135164350.000059com.dailymotion
5771831339815480.000019net.leadpages
5781831336033890.000010com.brother
5791831318627550.000011com.webcindario
5801831311631610.000011es.usal
5811831296223380.000013bg.google
582183127289070.000036com.xiti
5831831233822730.000013us.oh.state
584183107607200.000041fm.last
5851831070016620.000018net.ucoz
586183069363530.000071org.acm
5871830407439310.000009com.worldlingo
5881830325433790.000010com.embarcadero
5891830311017890.000017com.eiseverywhere
5901830283822300.000014org.wri
591183027463950.000065com.pubmatic
592183025164700.000054com.goodreads
5931830082622120.000014com.thehindubusinessline
5941830019622890.000013com.mihanblog
5951829998820370.000015com.intensedebate
5961829818432300.000011com.hellomagazine
5971829729834740.000010net.hypermart
598182949902350.000103uk.co.amazon
5991829442827110.000011nf.co
60018294164730.000401me.fb
601182940885420.000049com.entrepreneur
6021829306225430.000012com.futurelearn
6031829264423890.000013com.iconarchive
6041829230612750.000024com.cognitoforms
6051829213817300.000018org.khanacademy
6061829107220410.000015com.financialpost
6071829100017440.000017us.pa.state
6081828869826530.000012com.fatcow
609182885643720.000069com.staticflickr
6101828827218510.000016io.bower
6111828787836570.000010nz.govt.tepapa
6121828500021990.000014org.prlog
6131828498027430.000011ca.shaw
6141828473621040.000014com.bravesites
6151828303026750.000012de.uni-erlangen
6161828256824460.000012org.lacity
6171828202415340.000020fi.google
6181828201822670.000013de.uni-koeln
6191828051024640.000012uk.co.spectator
620182793263340.000076com.typeform
6211827932427250.000011is.good
6221827929033990.000010com.114la
6231827884232770.000010net.freeforums
624182783849200.000035com.zoho
6251827389824650.000012uk.ac.jisc
6261827340424890.000012com.mnn
6271827335223930.000013ca.dal
628182727401140.000290com.statcounter
629182727224800.000054com.netflix
6301827231815670.000019com.flashtalking
6311827221219030.000016com.prweek
6321827080631150.000011site.negocio
6331827079420800.000015org.lung
6341827050627270.000011com.mouser
6351827034025690.000012uk.co.profilebusiness
6361826984036160.000010uk.gov.number10
6371826814638060.000009net.dead
6381826773433750.000010jp.ac.kobe-u
6391826762819250.000016uk.org.nice
64018267512880.000343com.oculus
6411826719831960.000011build.bazel
6421826654618780.000016org.gentoo
6431826616621810.000014ie.thejournal
644182661481090.000310com.sharethis
6451826597619060.000016org.gnupg
646182647281480.000186ru.mail
6471826337618600.000016com.doodlekit
6481826223819480.000016com.crashlytics
6491826215618310.000017org.alz
6501826195425490.000012us.ms.state
6511826116224590.000012com.instructure
652182605408200.000040com.cbsnews
6531825984428770.000011ee.ut
6541825982612110.000026com.msdn
655182596107770.000040com.samsung
6561825700413380.000023com.emailmeform
657182549345490.000048edu.cmu
6581825482224960.000012uk.co.osoo
65918254762830.000354com.livestream
6601825465622260.000014com.atavist
6611825287622080.000014fr.archives-ouvertes
6621825228227920.000011com.cnsnews
6631825201823480.000013io.pantheon
664182511488980.000036com.createjs
6651825102617550.000017us.fl.state
6661825073023210.000013com.rabbitmq
6671825062827120.000011uk.co.newmedianow
6681824857614220.000022com.123formbuilder
6691824703220860.000015gov.nh
6701824350422330.000014org.crossref
6711824231422290.000014us.nm.state
672182422542960.000084com.scribd
6731824136632540.000010ca.qc.montreal
6741824090832850.000010uk.co.lrb
675182408281350.000215com.youku
676182397505170.000051com.slack
6771823965826770.000012com.hatenadiary
6781823965622920.000013com.itsmyurls
6791823763626710.000012uk.org.oxonaa
680182369022460.000099com.constantcontact
6811823686233480.000010com.outlookindia
6821823585438930.000009in.ac.nptel
6831823554026810.000012uk.org.oxfam
6841823523623440.000013com.yext
685182338122560.000094com.getbootstrap
6861823332421070.000014org.jenkins-ci
6871823058420550.000015com.broadcastingcable
6881823047816860.000018uk.gov.direct
6891823041626630.000012com.wmtransfer
6901823037419770.000015gov.mt
6911823016428210.000011uk.ac.stir
6921822854010520.000031com.marketwatch
6931822774422660.000013com.tmcnet
6941822744031360.000011uk.co.hsbc
6951822708617980.000017org.nfpa
6961822679229390.000011com.batchgeo
6971822584432750.000010com.weightwatchers
698182256362340.000103to.amzn
6991822463235740.000010com.orgfree
7001822377813550.000023org.whatbrowser
7011822181428430.000011com.adn
7021822127611900.000026org.weforum
703182205064810.000054org.hbr
7041821988028200.000011au.edu.deakin
7051821973414550.000021org.js
7061821911824450.000013in.ernet
7071821796228540.000011hu.elte
7081821751630250.000011pl.edu.uw
7091821727423670.000013uk.org.rspb
7101821652822200.000014com.healthgrades
7111821626427790.000011org.carbonbrief
712182142143660.000069com.prnewswire
7131821395620880.000015com.tapatalk
7141821318024310.000013org.grist
7151821275034230.000010id.co.kaskus
716182106384560.000057com.oreilly
7171821010635870.000010com.skepticalscience
718182099505390.000049gov.sec
7191820992230810.000011com.deccanherald
7201820966819050.000016tl.we
7211820877023110.000013us.ma.state
7221820686011010.000030uk.ac.cam
7231820599436300.000010ua.meta
7241820573835260.000010app.web
7251820446223980.000013uk.co.zoopla
7261820196632100.000011org.oceanconservancy
7271819963034210.000010org.atsjournals
7281819896235320.000010ru.my1
7291819844431620.000011com.mozello
7301819560015620.000019com.pastebin
7311819458028670.000011de.freenet
7321819341411370.000028edu.ucla
7331819310030520.000011com.telegraphindia
7341819302628570.000011com.chagasi
735181927589370.000034br.com.uol
7361818899426300.000012com.atwebpages
7371818862630360.000011com.remind
7381818792211320.000028com.redhat
739181877486080.000044com.wikihow
7401818765833770.000010edu.utep
7411818726434550.000010ru.nnov
7421818683418810.000016uk.gov.defra
7431818656823590.000013net.portfoliobox
7441818562426100.000012com.blogsky
7451818543438560.000009uk.co.mailonsunday
7461818543227230.000011jp.xxxxxxxx
7471818412214250.000021edu.ucsd
7481818396214490.000021com.digitaltrends
749181837381960.000130jp.ne.hatena
7501818246425630.000012uk.co.inews
7511818172823130.000013gov.la
7521818165612660.000024ly.ow
7531818036034410.000010gr.sch
7541817980230550.000011com.sc
7551817862833730.000010com.cummins
7561817756623630.000013com.activerain
7571817602638010.000009com.kazeo
7581817600229010.000011net.onlinewebshop
7591817542236890.000010com.galvanize
7601817490234730.000010ru.pr-cy
761181748265030.000052com.dmca
7621817352833280.000010com.kaywa
763181733488210.000040com.psychologytoday
7641817211828530.000011uk.co.heatall
76518171416840.000350me.ogp
7661816812826010.000012gov.ks
7671816778215160.000020ca.blogspot
7681816755821700.000014com.cityam
7691816728436040.000010gov.cabq
7701816643618130.000017org.reactjs
7711816605232830.000010org.escardio
7721816573410640.000031com.foxnews
7731816568018970.000016com.fifa
774181648602040.000117com.naver
7751816440437610.000009com.carscoops
7761816268029280.000011com.ecowatch
7771816239015070.000020com.literatumonline
778181619985350.000049net.2mdn
779181618004760.000054com.force
780181605781590.000167gov.privacyshield
7811816027018960.000016com.pcworld
7821816019229860.000011com.theyworkforyou
78318159730810.000365com.messenger
7841815970039390.000009com.anghami
785181594264240.000061edu.nyu
7861815799012940.000024com.indiegogo
7871815782818690.000016kr.or.kisa
788181578163640.000070com.discordapp
7891815701431860.000011uk.org.38degrees
7901815685036280.000010com.insideevs
7911815549614880.000020com.placeholder
7921815507232500.000010google.design
7931815504437640.000009gle.goo
794181544624540.000057com.walmart
795181533604280.000060com.flipboard
7961815204429020.000011pl.lublin
797181519524220.000062com.wufoo
7981815119811230.000029com.shutterstock
7991815068425370.000012org.iihs
8001814944627880.000011in.businessworld
801181486369810.000033com.pinimg
8021814776024070.000013jp.e-shops
8031814773422500.000014com.codecademy
8041814634026420.000012com.zx2c4
805181463281290.000243info.aboutads
8061814594421380.000014ca.ubc
8071814553828740.000011com.bnef
8081814435432400.000011uk.ac.rcplondon
8091814425437180.000009com.wsoctv
8101814390239500.000009com.monbiot
8111814334234630.000010com.droppages
8121814314823660.000013gov.arts
8131814245426440.000012us.wi.state
8141814204634770.000010org.usatf
8151814087816240.000019com.nvidia
8161813886636360.000010com.elmercurio
8171813883815380.000020com.businessweek
8181813846221760.000014com.tutsplus
819181383825540.000047com.atlassian
8201813735611840.000026com.searchengineland
8211813727835940.000010com.glu
8221813712436450.000010es.consumer
823181359742400.000102cn.com.sina
8241813559639480.000009com.allmyfaves
8251813534234460.000010com.businessgreen
826181336423500.000072com.163
8271813326832920.000010org.jython
828181332304710.000054com.smugmug
8291813281638640.000009org.thechicagocouncil
8301813212635760.000010gov.azdot
8311813047011760.000027com.ycombinator
8321812983833390.000010org.transportenvironment
8331812853829930.000011gov.ferc
834181279109360.000034com.aliexpress
835181261543560.000070com.wiley
836181257906960.000042com.moz
8371812499627560.000011uk.gov.environment-agency
8381812488630120.000011org.zsl
8391812413637040.000009org.ssireview
8401812352023780.000013uk.gov.scotland
8411812297815950.000019tv.ustream
8421812252230960.000011org.dailystrength
843181220385980.000045com.caniuse
8441812099624850.000012net.privacypolicytemplate
845181208667680.000040gov.noaa
8461812081815730.000019jp.makeshop
8471812051830400.000011org.rspo
8481811994623030.000013com.seetickets
8491811945421830.000014com.ign
850181188964040.000064mp.mailchi
851181180003110.000081com.digg
8521811800028550.000011gov.txdot
8531811736634120.000010uk.ac.ceh
8541811716414790.000021com.crunchbase
8551811707411270.000029com.highcharts
8561811587026450.000012com.9to5mac
8571811464810900.000030com.withgoogle
858181143148890.000036com.webs
8591811407224810.000012uk.co.streetmap
8601811250838650.000009com.pushwoosh
8611811170832040.000011ca.uwaterloo
862181111308170.000040com.shinystat
863181110783050.000082fr.google
8641811105034670.000010com.baomoi
8651811097439160.000009uk.ac.tyndall
8661811039617660.000017com.webmasterplan
8671811018036860.000010dk.bloggersdelight
8681810990834010.000010uk.gov.hm-treasury
8691810926217930.000017uk.org.cqc
8701810894812480.000025com.smashingmagazine
871181081383310.000076com.automattic
8721810757215300.000020com.ning
8731810698428290.000011com.linkwithin
8741810652230020.000011uk.org.greenpeace
875181037689560.000034com.libsyn
8761810353812390.000025com.sap
8771810295620910.000015edu.uci
878181025646280.000043com.patreon
8791810202035030.000010com.climatechangenews
880181017584090.000063com.xinhuanet
8811810133634640.000010com.kapook
882181006188850.000036com.newyorker
8831810047436400.000010com.spruz
884181001964780.000054com.inc
8851810006226760.000012jp.aikotoba
886180992689140.000035org.eff
8871809879436620.000010com.platts
8881809855635350.000010org.c2es
8891809855027470.000011com.mykaratestore
8901809678617700.000017com.ikea
8911809639414230.000022com.billboard
8921809509210700.000031com.hootsuite
8931809494835250.000010com.jkp
8941809349628240.000011org.mcsuk
8951809262212540.000025es.agpd
8961809243833490.000010net.edie
897180923585330.000050com.ea
898180921123760.000068org.opensource
8991809156829030.000011ru.drom
9001809016226390.000012com.yelloyello
9011808996825440.000012uk.co.intersol
902180897401390.000202com.alicdn
9031808942240510.000009com.mforos
9041808699014730.000021com.fiverr
905180863529340.000035com.foursquare
9061808591817370.000017org.freecsstemplates
9071808481041420.000009uk.org.indymedia
9081808467420490.000015uk.gov.education
9091808369438430.000009com.thinkbroadband
910180821642310.000104jp.co.amazon
9111808011438680.000009org.sciencenewsforstudents
912180800342210.000108org.drupal
9131807972610960.000030com.variety
914180786662900.000086com.stumbleupon
9151807803832690.000010net.scienceontheweb
9161807758217560.000017com.nba
9171807745225610.000012org.webring
9181807650210330.000031com.visualstudio
9191807595840050.000009io.raindrop
9201807454427440.000011jp.zouri
9211807376639040.000009org.corporateeurope
9221807247014020.000022com.storify
923180714363750.000069gov.ftc
9241807137216030.000019net.with2
9251807092614480.000021com.nike
9261807022240480.000009io.dataquest
9271807006612550.000025org.unicef
9281806967235670.000010bnpparibas.group
9291806917236850.000010com.thestatesman
9301806886634270.000010uk.org.rya
931180675083830.000068com.airbnb
9321806720416350.000019de.zeit
9331806719025550.000012com.hackernoon
9341806627434510.000010ca.pe.gov
9351806526640310.000009com.raamdev
9361806438824380.000013io.postach
9371806412614870.000020edu.purdue
938180635084070.000063com.tripod
9391806322812280.000025gov.fbi
9401806315413690.000023com.lifehacker
9411806313010690.000031com.uk
9421806187834320.000010in.gov.mhrd
9431806113035270.000010org.gmplib
9441806010038790.000009com.gitimmersion
9451805957828070.000011jp.at-ninja
9461805900430100.000011com.shichihuku
9471805882636290.000010com.h2database
9481805773634820.000010uk.org.rcn
9491805764037370.000009com.writetothem
9501805659213660.000023com.parsiblog
951180565869840.000033com.dropboxusercontent
9521805595013060.000024com.prweb
9531805562836950.000009com.websiteseguro
9541805510411180.000029com.vox
9551805427213970.000022us.imageshack
9561805396420320.000015com.howstuffworks
9571805292015310.000020com.yoast
9581805228012980.000024com.pcmag
9591805139830080.000011uk.org.woodlandtrust
9601805093635230.000010gle.posts
9611805080038380.000009org.priceofoil
9621804958016140.000019com.ccbill
9631804906637500.000009com.fourfour
964180472149450.000034gov.census
9651804648613280.000023edu.wisc
966180458761510.000179jp.co.google
9671804571012200.000025com.blackberry
9681804541411030.000030edu.umich
9691804539019520.000016com.w3layouts
970180438941460.000190me.line
9711804381615930.000019edu.usc
9721804235628420.000011com.zatunen
973180422405000.000052com.nasdaq
974180421305670.000046net.daum
9751804157031180.000011vn.tuoitre
9761804055625730.000012com.hisupplier
9771803944420230.000015com.nfl
978180393709270.000035com.ggpht
9791803932415490.000019com.vmware
9801803902038270.000009com.realtytimes
9811803836232610.000010net.batcave
9821803811633410.000010org.mygamesonline
983180378667340.000040com.mckinsey
9841803767439830.000009org.eia-international
985180376042580.000094com.sohu
9861803759437000.000009io.dropwizard
9871803739410260.000032gov.nps
9881803724421310.000014au.com.news
9891803660836520.000010de.epubli
9901803419813810.000022com.unity3d
9911803407229920.000011net.nend
9921803304840980.000009com.easyhits4u
9931803189011620.000027com.steamcommunity
9941803162214510.000021edu.uchicago
9951803157010860.000030com.uber
9961803147053060.000007com.plurk
997180304905970.000045com.adweek
9981803018236350.000010com.jal
9991802967017860.000017com.techradar
10001802924412710.000024com.ifttt
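
The domain names in the table above are written with their labels in reverse order (for example `uk.ac.kcl`). A minimal sketch for turning them back into the usual notation, assuming plain dot-separated names; the function name `unreverse_domain` is hypothetical and not part of any Common Crawl tooling:

```python
def unreverse_domain(host_rev: str) -> str:
    """Turn a reversed name such as 'uk.ac.kcl' into 'kcl.ac.uk'."""
    # Split on dots, reverse the order of the labels, and join them again.
    return ".".join(reversed(host_rev.split(".")))


# Three names taken from the table rows above.
for name in ["net.daringfireball", "uk.ac.kcl", "jp.ne.so-net"]:
    print(name, "->", unreverse_domain(name))
```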

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

January 2020 crawl archive now available

The crawl archive for January 2020 is now available! It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th. It includes page captures of 960 million URLs not contained in any previous crawl archive.

Improvements and Fixes

  • Date-time values in the column "fetch_time" of the columnar index are now stored using the "int64" data type. For details and compatibility issues, please see cc-index-table#7.
  • WARC request records now show the HTTP protocol version sent with the HTTP request, which can differ from the version received in the HTTP response message, cf. NUTCH-2760.

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-05/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2020-05/segment.paths.gz          100
WARC files               CC-MAIN-2020-05/warc.paths.gz             56000   59.94
WAT files                CC-MAIN-2020-05/wat.paths.gz              56000   22.3
WET files                CC-MAIN-2020-05/wet.paths.gz              56000   10
Robots.txt files         CC-MAIN-2020-05/robotstxt.paths.gz        56000   0.25
Non-200 responses files  CC-MAIN-2020-05/non200responses.paths.gz  56000   2.28
URL index files          CC-MAIN-2020-05/cc-index.paths.gz         302     0.23
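
The path construction described above can be sketched in a few lines of Python. This is only an illustration: the helper name `to_full_paths` is hypothetical, and the sample line is a placeholder, not a real entry from the listings.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_full_paths(lines, prefix):
    """Prepend `prefix` to every non-empty line of a decompressed paths listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# Placeholder line (assumption): real listings contain actual segment and file names.
sample = ["crawl-data/CC-MAIN-2020-05/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n"]

print(to_full_paths(sample, S3_PREFIX)[0])
print(to_full_paths(sample, HTTP_PREFIX)[0])
```

For a downloaded listing, the lines could be read with `gzip.open("warc.paths.gz", "rt")` from the Python standard library and passed to the same helper.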

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2020-05/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

December 2019 crawl archive now available

The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th. It includes page captures of 850 million URLs not contained in any previous crawl archive.

Archive Location and Download

The December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-51/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-51/segment.paths.gz          100
WARC files               CC-MAIN-2019-51/warc.paths.gz             56000   47.47
WAT files                CC-MAIN-2019-51/wat.paths.gz              56000   17.6
WET files                CC-MAIN-2019-51/wet.paths.gz              56000   8.06
Robots.txt files         CC-MAIN-2019-51/robotstxt.paths.gz        56000   0.26
Non-200 responses files  CC-MAIN-2019-51/non200responses.paths.gz  56000   3.5
URL index files          CC-MAIN-2019-51/cc-index.paths.gz         302     0.19
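
When planning downloads, it can help to turn these totals into an average compressed size per file. A small sketch using the December 2019 figures for WARC, WAT and WET files; the helper name `avg_size_gib` is hypothetical, and individual files will of course deviate from the average:

```python
# (number of files, total compressed size in TiB), copied from the December 2019 table.
listings = {
    "WARC": (56000, 47.47),
    "WAT": (56000, 17.6),
    "WET": (56000, 8.06),
}


def avg_size_gib(n_files, total_tib):
    """Average size per file in GiB (1 TiB = 1024 GiB)."""
    return total_tib * 1024 / n_files


for kind, (n_files, total_tib) in listings.items():
    print(f"{kind}: {avg_size_gib(n_files, total_tib):.2f} GiB per file on average")
```

For the WARC files this works out to roughly 0.87 GiB per file.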

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-51/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November 2019 crawl archive now available

The crawl archive for November 2019 is now available! It contains 2.55 billion web pages or 250 TiB of uncompressed content, crawled between November 11th and 23rd, with a short operational break on November 16th. It includes page captures of 1.1 billion URLs not contained in any previous crawl archive.

What’s new?

We’ve added two new fields to the URL indexes (CDX and columnar):

  • The redirect target location is stored in the CDX JSON field "redirect" and in the column "fetch_redirect", respectively. The value is extracted from the HTTP header field "Location" if the HTTP status code indicates an HTTP redirect. A relative URL path is converted to an absolute URL using the page URL as the base URL. The key is absent (or the column value is null) if the "Location" value is missing, is not a valid URL, or is not a valid relative URL path.
  • Truncation of the WARC record payload is indicated by the key "truncated" and the column "content_truncated", respectively. The value is set only for truncated records and gives the reason for the truncation, following the WARC header field "WARC-Truncated".

Additional details and examples can be found in the corresponding PR #15.
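
The handling of the redirect target described above can be approximated as follows. This is a sketch under stated assumptions, not the crawler's actual code: the function name `redirect_target` is hypothetical, every 3xx status is treated as a redirect, and only http(s) targets with a host are considered valid.

```python
from urllib.parse import urljoin, urlsplit


def redirect_target(page_url, status, location):
    """Return the absolute redirect target, or None if it cannot be determined."""
    # Assumption: any 3xx status counts as a redirect; a missing Location gives None.
    if not 300 <= status < 400 or not location:
        return None
    try:
        # Resolve a relative path against the page URL as the base URL.
        target = urljoin(page_url, location.strip())
        parts = urlsplit(target)
    except ValueError:
        return None
    # Assumption: a valid target has an http(s) scheme and a host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return target


print(redirect_target("https://example.com/a/b.html", 301, "../c.html"))
print(redirect_target("https://example.com/a/b.html", 200, "/c.html"))
```

Returning `None` mirrors the absent key in the CDX JSON and the null value in the columnar index.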

We’ve fixed a bug affecting the capture time (WARC-Date) in the robots.txt subset: the value was extracted from the "Date" field of the HTTP header and was occasionally wrong. Please see issue #14 for further details.

Archive Location and Download

The November crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-47/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2019-47/segment.paths.gz          100
WARC files               CC-MAIN-2019-47/warc.paths.gz             56000   53.95
WAT files                CC-MAIN-2019-47/wat.paths.gz              56000   18.50
WET files                CC-MAIN-2019-47/wet.paths.gz              56000   8.34
Robots.txt files         CC-MAIN-2019-47/robotstxt.paths.gz        56000   0.24
Non-200 responses files  CC-MAIN-2019-47/non200responses.paths.gz  56000   3.05
URL index files          CC-MAIN-2019-47/cc-index.paths.gz         302     0.20

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-47/. The columnar index has also been updated to include this crawl.
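
Because the "redirect" and "truncated" keys are only present when they apply, code reading CDX JSON records should treat them as optional. A minimal sketch, assuming one JSON object per capture; the function name `describe_capture` and both sample records are hypothetical, made up for illustration:

```python
import json


def describe_capture(cdx_json_line):
    """Extract the URL plus the optional "redirect" and "truncated" keys."""
    record = json.loads(cdx_json_line)
    return {
        "url": record.get("url"),
        # Both keys may be absent; .get() then yields None.
        "redirect": record.get("redirect"),
        "truncated": record.get("truncated"),
    }


# Hypothetical records, not taken from the actual index.
redirected = '{"url": "https://example.com/old", "status": "301", "redirect": "https://example.com/new"}'
truncated = '{"url": "https://example.com/big", "status": "200", "truncated": "length"}'

print(describe_capture(redirected))
print(describe_capture(truncated))
```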

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.