Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the September/October, November/December 2022 and January/February 2023 crawls. For more information about the data formats and the processing pipeline, please see the announcements of previous webgraph releases. You may also visit the cc-webgraph and cc-pyspark projects, which contain all the scripts and tools needed to construct the graphs. Instructions for exploring the graphs in the webgraph format can be found in our collection of webgraph notebooks.

Host-level graph

The graph has 325 million nodes and 2.63 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to build the graph. All types of links are included, even purely “technical” links pointing to images, JavaScript libraries, web fonts, etc. However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as host component are not taken into account when building the host-level graph.
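The hostname filter described above can be sketched roughly as follows. This is only an illustration, not the code used in the actual pipeline: the function names are made up, and `EXAMPLE_TLDS` is a tiny hypothetical stand-in for the full IANA TLD list.

```python
import ipaddress

# Tiny illustrative subset (assumption); the real IANA TLD list is far longer.
EXAMPLE_TLDS = {"com", "org", "net", "de", "uk"}


def is_ip_address(host: str) -> bool:
    """Return True if the host component is an IPv4 or IPv6 address."""
    try:
        # IPv6 addresses appear in URLs wrapped in square brackets.
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def is_graph_host(host: str, valid_tlds=EXAMPLE_TLDS) -> bool:
    """Keep only hostnames whose last label is a known TLD; drop IP addresses."""
    host = host.lower().rstrip(".")
    if not host or is_ip_address(host):
        return False
    return host.rsplit(".", 1)[-1] in valid_tlds


for candidate in ["example.com", "192.168.0.1", "[::1]", "example.invalidtld"]:
    print(candidate, is_graph_host(candidate))
```

Only the first candidate passes: the two IP addresses are rejected outright, and the last name ends in a label that is not in the example TLD set.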

There are 268 million dangling nodes (82.7%), and the largest strongly connected component contains 43.1 million nodes (13.3%). Dangling nodes come from

  • hosts that are not crawled, but are referenced by a link on a crawled page
  • hosts with no links pointing to another hostname
  • or hosts that only returned an error page (e.g. HTTP 404).
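As a small illustration, dangling nodes can be counted from an edge list like this. The sketch assumes the usual definition (a node with no outgoing edge) and uses made-up toy data; `dangling_nodes` is a hypothetical helper, not part of the released tools.

```python
from typing import Iterable, Set, Tuple


def dangling_nodes(node_ids: Iterable[int],
                   edges: Iterable[Tuple[int, int]]) -> Set[int]:
    """Return the ids of all nodes that never occur as the source of an edge."""
    has_outlink = {src for src, _ in edges}
    return {n for n in node_ids if n not in has_outlink}


# Toy graph (made up): node 2 is only linked to, node 3 is isolated.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2)]

dangling = dangling_nodes(nodes, edges)
print(sorted(dangling))                       # [2, 3]
print(f"{len(dangling) / len(nodes):.1%}")    # 50.0%
```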

Hostnames in the graph are in reverse domain name notation with the leading www. removed: www.subdomain.example.com becomes com.example.subdomain.
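A minimal sketch of this conversion, assuming hostnames are lowercased first (the text does not say so) and using the hypothetical helper names `reverse_host` and `unreverse_host`:

```python
def reverse_host(hostname: str) -> str:
    """Convert a hostname to reverse domain name notation, dropping a leading 'www.'."""
    host = hostname.strip().lower().rstrip(".")  # lowercasing is an assumption
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Turn a reversed name back into the usual order (the 'www.' is not restored)."""
    return ".".join(reversed(rev_host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
print(unreverse_host("com.example.subdomain"))    # subdomain.example.com
```

Because the leading www. is dropped, the conversion is not fully reversible: www.example.com and example.com both map to com.example.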

You can download the graph and the ranks of all 325 million hosts from AWS S3 at s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ (this requires an account on AWS). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ as a prefix to access the files from anywhere.

Note that the text representation of the host-level graph is delivered as multiple gzip-compressed files listed in two path listings – one for the nodes (vertices), and one for the edges (arcs). First, download the path listing and decompress it with “gzip -d” or “gunzip”. Adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing will give you the list of URLs to download the entire graph.
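The prefixing step could look roughly like the sketch below. It works on an in-memory listing so that it runs without network access; the two sample paths are invented for illustration and are not real file names, and `urls_from_path_listing` is a hypothetical helper.

```python
import gzip
import io
from typing import List

HTTPS_PREFIX = "https://data.commoncrawl.org/"
S3_PREFIX = "s3://commoncrawl/"


def urls_from_path_listing(listing_gz: bytes, prefix: str = HTTPS_PREFIX) -> List[str]:
    """Decompress a gzipped path listing and prepend the prefix to every line."""
    with gzip.open(io.BytesIO(listing_gz), "rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]


# Made-up listing content; a real listing would be read from a downloaded *.paths.gz file.
sample_listing = gzip.compress(
    b"projects/hyperlinkgraph/example/part-00000.txt.gz\n"
    b"projects/hyperlinkgraph/example/part-00001.txt.gz\n"
)

for url in urls_from_path_listing(sample_listing):
    print(url)
```

Passing `S3_PREFIX` instead of the default yields s3:// URLs for use with AWS tooling.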

Download files of the Common Crawl Sep/Nov/Jan 2022-2023 host-level webgraph

Size | File | Description
2.34 GB | cc-main-2022-23-sep-nov-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 32 vertices files
11.40 GB | cc-main-2022-23-sep-nov-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 64 edges files
5.51 GB | cc-main-2022-23-sep-nov-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2022-23-sep-nov-jan-host.properties |
5.88 GB | cc-main-2022-23-sep-nov-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2022-23-sep-nov-jan-host-t.properties |
1 kB | cc-main-2022-23-sep-nov-jan-host.stats | WebGraph statistics
5.56 GB | cc-main-2022-23-sep-nov-jan-host-ranks.txt.gz | harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org. Version (commit) 0bbf864 of the public suffix list was used (commit date 2023-03-08).
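To illustrate the aggregation step, here is a simplified sketch. It is not the actual pipeline: `EXAMPLE_SUFFIXES` is a tiny hypothetical stand-in for the public suffix list, wildcard and exception rules are ignored, and dropping links that stay within one domain is an assumption.

```python
from typing import Iterable, Set, Tuple

# Hypothetical, tiny stand-in for the public suffix list, in reversed notation.
EXAMPLE_SUFFIXES = {"com", "org", "uk.co"}


def pay_level_domain(rev_host: str, suffixes: Set[str] = EXAMPLE_SUFFIXES) -> str:
    """Map a reversed hostname to its reversed PLD: longest known suffix plus one label."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in suffixes:
            return ".".join(labels[: n + 1])
    return rev_host  # no known suffix: keep the name unchanged (assumption)


def aggregate_edges(host_edges: Iterable[Tuple[str, str]],
                    suffixes: Set[str] = EXAMPLE_SUFFIXES) -> Set[Tuple[str, str]]:
    """Collapse host-level edges into a set of distinct domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src, suffixes), pay_level_domain(dst, suffixes)
        if s != d:  # links inside one domain are dropped (assumption)
            domain_edges.add((s, d))
    return domain_edges


host_edges = [
    ("com.example.blog", "org.example"),
    ("com.example.shop", "org.example.www2"),
    ("com.example.blog", "com.example.shop"),
    ("uk.co.bbc.news", "com.example"),
]
print(sorted(aggregate_edges(host_edges)))
```

The four host-level edges collapse into two domain-level edges, because the first two map to the same pair of domains and the third stays inside com.example.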

The domain-level graph has 88 million nodes and 1.68 billion edges. 46 million nodes (52%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All domain graph files are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/.

Download files of the Common Crawl Sep/Nov/Jan 2022-2023 domain-level webgraph

Size | File | Description
0.61 GB | cc-main-2022-23-sep-nov-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.89 GB | cc-main-2022-23-sep-nov-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.90 GB | cc-main-2022-23-sep-nov-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2022-23-sep-nov-jan-domain.properties |
3.81 GB | cc-main-2022-23-sep-nov-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2022-23-sep-nov-jan-domain-t.properties |
1 kB | cc-main-2022-23-sep-nov-jan-domain.stats | WebGraph statistics
1.90 GB | cc-main-2022-23-sep-nov-jan-domain-ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 88 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Sep/Nov/Jan 2022-2023)

harmonic centrality rank | hc value | page rank | page rank value | reversed domain name
1 | 30208264 | 1 | 0.016763 | com.googleapis
2 | 29783728 | 3 | 0.010997 | com.facebook
3 | 29384286 | 2 | 0.015692 | com.google
4 | 26032564 | 6 | 0.005934 | com.youtube
5 | 25805292 | 5 | 0.006482 | com.twitter
6 | 25487994 | 8 | 0.005484 | com.instagram
7 | 25225590 | 7 | 0.005863 | org.w
8 | 25002416 | 4 | 0.007149 | com.googletagmanager
9 | 23952870 | 9 | 0.004622 | org.gmpg
10 | 23337268 | 12 | 0.003349 | com.linkedin
11 | 23278606 | 10 | 0.004162 | com.gstatic
12 | 22405388 | 15 | 0.002066 | com.gravatar
13 | 22178872 | 11 | 0.003793 | com.cloudflare
14 | 21844608 | 13 | 0.002400 | org.wordpress
15 | 21681482 | 25 | 0.001450 | com.pinterest
16 | 21559576 | 32 | 0.001217 | org.wikipedia
17 | 21429174 | 17 | 0.001813 | com.apple
18 | 21227448 | 31 | 0.001226 | com.wordpress
19 | 21216668 | 34 | 0.001088 | com.vimeo
20 | 21195402 | 39 | 0.000900 | be.youtu
21 | 20932166 | 14 | 0.002311 | net.cloudfront
22 | 20898430 | 18 | 0.001711 | com.bootstrapcdn
23 | 20823002 | 35 | 0.001080 | com.microsoft
24 | 20748850 | 48 | 0.000709 | com.amazon
25 | 20741894 | 55 | 0.000581 | com.blogspot
26 | 20704246 | 26 | 0.001442 | com.jquery
27 | 20677290 | 47 | 0.000780 | gl.goo
28 | 20670994 | 43 | 0.000852 | com.amazonaws
29 | 20665338 | 16 | 0.001840 | io.polyfill
30 | 20631324 | 44 | 0.000844 | eu.europa
31 | 20618144 | 50 | 0.000696 | com.wp
32 | 20564226 | 29 | 0.001298 | net.jsdelivr
33 | 20528750 | 45 | 0.000812 | org.mozilla
34 | 20516286 | 28 | 0.001369 | com.wixstatic
35 | 20513258 | 65 | 0.000505 | ly.bit
36 | 20484072 | 23 | 0.001470 | com.adobe
37 | 20443722 | 21 | 0.001544 | com.fontawesome
38 | 20411612 | 40 | 0.000896 | com.google-analytics
39 | 20400328 | 22 | 0.001485 | com.github
40 | 20349558 | 54 | 0.000635 | com.paypal
41 | 20346992 | 19 | 0.001674 | com.googleusercontent
42 | 20328172 | 30 | 0.001297 | com.whatsapp
43 | 20292300 | 103 | 0.000334 | com.tumblr
44 | 20192382 | 33 | 0.001113 | ru.yandex
45 | 20186350 | 85 | 0.000378 | com.medium
46 | 20183944 | 123 | 0.000299 | com.reddit
47 | 20180914 | 108 | 0.000331 | com.yahoo
48 | 20178722 | 62 | 0.000529 | com.shopify
49 | 20170012 | 58 | 0.000564 | com.flickr
50 | 20167766 | 69 | 0.000483 | io.github
51 | 20163916 | 63 | 0.000522 | co.t
52 | 20152240 | 128 | 0.000260 | com.nytimes
53 | 20124068 | 125 | 0.000293 | com.spotify
54 | 20104136 | 36 | 0.001066 | com.baidu
55 | 20098412 | 59 | 0.000553 | org.w3
56 | 20094912 | 37 | 0.001000 | com.qq
57 | 20016100 | 57 | 0.000569 | com.vk
58 | 20005808 | 41 | 0.000881 | com.googlesyndication
59 | 20003706 | 114 | 0.000313 | com.weebly
60 | 19975298 | 158 | 0.000186 | com.forbes
61 | 19967874 | 122 | 0.000302 | org.creativecommons
62 | 19967726 | 130 | 0.000251 | com.soundcloud
63 | 19962190 | 142 | 0.000211 | org.archive
64 | 19942482 | 150 | 0.000198 | gov.nih
65 | 19918748 | 126 | 0.000264 | com.tiktok
66 | 19915782 | 61 | 0.000538 | org.schema
67 | 19912766 | 60 | 0.000544 | com.unpkg
68 | 19911234 | 181 | 0.000153 | com.bing
69 | 19911002 | 249 | 0.000107 | com.imdb
70 | 19905552 | 264 | 0.000100 | edu.harvard
71 | 19902624 | 80 | 0.000387 | me.t
72 | 19901168 | 196 | 0.000137 | org.wikimedia
73 | 19873990 | 163 | 0.000174 | com.dropbox
74 | 19872264 | 288 | 0.000091 | net.slideshare
75 | 19869992 | 192 | 0.000142 | int.who
76 | 19862262 | 203 | 0.000131 | com.cnn
77 | 19862142 | 174 | 0.000157 | gov.cdc
78 | 19858886 | 214 | 0.000123 | com.theguardian
79 | 19847642 | 184 | 0.000151 | com.unsplash
80 | 19845106 | 121 | 0.000307 | com.list-manage
81 | 19831690 | 272 | 0.000097 | net.researchgate
82 | 19819188 | 253 | 0.000105 | com.wsj
83 | 19816030 | 293 | 0.000090 | com.bbc
84 | 19812290 | 64 | 0.000507 | com.macromedia
85 | 19810452 | 297 | 0.000089 | uk.co.bbc
86 | 19805166 | 304 | 0.000087 | com.reuters
87 | 19788972 | 246 | 0.000107 | com.washingtonpost
88 | 19786314 | 279 | 0.000094 | com.statista
89 | 19775652 | 318 | 0.000083 | edu.stanford
90 | 19775346 | 432 | 0.000060 | gov.nasa
91 | 19774590 | 67 | 0.000497 | com.addthis
92 | 19770972 | 46 | 0.000807 | net.doubleclick
93 | 19769890 | 151 | 0.000195 | com.wixsite
94 | 19765714 | 227 | 0.000117 | com.businessinsider
95 | 19763270 | 215 | 0.000123 | com.imgur
96 | 19754014 | 367 | 0.000072 | com.go
97 | 19750586 | 204 | 0.000129 | com.live
98 | 19746004 | 107 | 0.000332 | com.wix
99 | 19745530 | 372 | 0.000070 | com.wired
100 | 19738366 | 152 | 0.000195 | us.zoom
101 | 19723422 | 135 | 0.000225 | gle.forms
102 | 19718148 | 225 | 0.000118 | com.etsy
103 | 19717824 | 277 | 0.000095 | com.ibm
104 | 19700092 | 447 | 0.000058 | com.theverge
105 | 19699788 | 356 | 0.000074 | com.nature
106 | 19696656 | 120 | 0.000309 | me.wp
107 | 19693524 | 176 | 0.000156 | org.ietf
108 | 19687268 | 49 | 0.000700 | com.fb
109 | 19686956 | 157 | 0.000187 | com.ytimg
110 | 19685800 | 492 | 0.000053 | com.msn
111 | 19685436 | 315 | 0.000084 | com.android
112 | 19684204 | 316 | 0.000084 | com.cnbc
113 | 19684024 | 140 | 0.000214 | com.mailchimp
114 | 19682468 | 42 | 0.000858 | net.fbcdn
115 | 19682198 | 255 | 0.000104 | com.stackoverflow
116 | 19681202 | 511 | 0.000051 | edu.berkeley
117 | 19679968 | 275 | 0.000096 | org.un
118 | 19679582 | 265 | 0.000100 | com.bloomberg
119 | 19677792 | 220 | 0.000121 | com.outlook
120 | 19673502 | 178 | 0.000155 | org.apache
121 | 19666706 | 269 | 0.000098 | com.oracle
122 | 19658542 | 382 | 0.000069 | com.example
123 | 19650660 | 343 | 0.000077 | org.npr
124 | 19649560 | 350 | 0.000075 | com.quora
125 | 19649030 | 129 | 0.000254 | com.youtube-nocookie
126 | 19646800 | 601 | 0.000045 | com.zdnet
127 | 19646398 | 99 | 0.000336 | com.giphy
128 | 19644128 | 195 | 0.000140 | com.hubspot
129 | 19643752 | 219 | 0.000122 | org.doi
130 | 19632490 | 542 | 0.000048 | com.myspace
131 | 19630886 | 222 | 0.000120 | gov.ca
132 | 19618726 | 395 | 0.000067 | com.time
133 | 19612954 | 359 | 0.000074 | com.slack
134 | 19610836 | 70 | 0.000455 | com.ft
135 | 19610350 | 462 | 0.000056 | com.appspot
136 | 19606234 | 139 | 0.000217 | com.opera
137 | 19605258 | 261 | 0.000102 | com.sciencedirect
138 | 19604780 | 476 | 0.000054 | com.ted
139 | 19604180 | 342 | 0.000077 | com.springer
140 | 19602362 | 401 | 0.000065 | org.arxiv
141 | 19598276 | 393 | 0.000067 | com.usatoday
142 | 19594554 | 194 | 0.000140 | com.issuu
143 | 19588288 | 332 | 0.000080 | org.acm
144 | 19585042 | 205 | 0.000129 | com.npmjs
145 | 19581664 | 331 | 0.000080 | uk.co.amazon
146 | 19577244 | 307 | 0.000086 | com.githubusercontent
147 | 19576784 | 242 | 0.000108 | com.blogger
148 | 19576054 | 311 | 0.000086 | com.wiley
149 | 19574732 | 355 | 0.000074 | com.pexels
150 | 19570782 | 468 | 0.000055 | edu.cornell
151 | 19569412 | 482 | 0.000054 | com.theatlantic
152 | 19567006 | 440 | 0.000059 | org.python
153 | 19564494 | 593 | 0.000045 | org.worldbank
154 | 19563390 | 565 | 0.000047 | uk.co.telegraph
155 | 19563008 | 732 | 0.000037 | edu.psu
156 | 19553204 | 428 | 0.000060 | com.cnet
157 | 19539618 | 149 | 0.000199 | org.ampproject
158 | 19537568 | 623 | 0.000043 | org.weforum
159 | 19537332 | 296 | 0.000089 | uk.gov
160 | 19533658 | 444 | 0.000059 | com.huffingtonpost
161 | 19532878 | 490 | 0.000053 | com.latimes
162 | 19528578 | 645 | 0.000042 | org.unesco
163 | 19527222 | 655 | 0.000041 | com.livejournal
164 | 19526534 | 449 | 0.000058 | com.pixabay
165 | 19508444 | 411 | 0.000063 | com.sagepub
166 | 19506694 | 480 | 0.000054 | com.goodreads
167 | 19506188 | 499 | 0.000052 | uk.co.google
168 | 19503402 | 258 | 0.000103 | net.behance
169 | 19503138 | 270 | 0.000097 | com.bandcamp
170 | 19503070 | 727 | 0.000037 | org.chromium
171 | 19495988 | 525 | 0.000050 | com.cbsnews
172 | 19495180 | 326 | 0.000081 | ee.linktr
173 | 19493452 | 527 | 0.000049 | edu.yale
174 | 19489730 | 286 | 0.000092 | com.w3schools
175 | 19483130 | 175 | 0.000157 | com.yelp
176 | 19482412 | 299 | 0.000089 | edu.mit
177 | 19481882 | 68 | 0.000496 | com.googleadservices
178 | 19480090 | 78 | 0.000393 | me.wa
179 | 19477430 | 528 | 0.000049 | uk.co.independent
180 | 19474080 | 134 | 0.000228 | com.statcounter
181 | 19472938 | 300 | 0.000089 | com.tinyurl
182 | 19472136 | 509 | 0.000051 | com.fortune
183 | 19471318 | 714 | 0.000038 | edu.columbia
184 | 19471218 | 690 | 0.000039 | com.vox
185 | 19461296 | 422 | 0.000061 | gov.whitehouse
186 | 19460546 | 416 | 0.000062 | org.nodejs
187 | 19457810 | 512 | 0.000051 | uk.co.dailymail
188 | 19455462 | 537 | 0.000049 | com.indiatimes
189 | 19452626 | 349 | 0.000075 | com.businesswire
190 | 19451618 | 290 | 0.000090 | org.pewresearch
191 | 19450276 | 559 | 0.000047 | edu.cmu
192 | 19442676 | 643 | 0.000042 | com.marketwatch
193 | 19436888 | 494 | 0.000052 | com.tandfonline
194 | 19431628 | 598 | 0.000045 | org.pbs
195 | 19429220 | 657 | 0.000041 | com.usnews
196 | 19428984 | 777 | 0.000035 | edu.upenn
197 | 19428100 | 278 | 0.000095 | com.twimg
198 | 19421114 | 683 | 0.000039 | com.buzzfeed
199 | 19420594 | 590 | 0.000045 | gov.loc
200 | 19420022 | 504 | 0.000052 | com.fc2
201 | 19419582 | 674 | 0.000040 | com.git-scm
202 | 19418554 | 901 | 0.000033 | com.qz
203 | 19416166 | 885 | 0.000034 | edu.washington
204 | 19415650 | 778 | 0.000035 | com.trello
205 | 19415444 | 675 | 0.000040 | com.apnews
206 | 19411912 | 1121 | 0.000026 | com.techradar
207 | 19411736 | 517 | 0.000050 | com.investopedia
208 | 19410610 | 652 | 0.000041 | com.mysql
209 | 19408094 | 138 | 0.000218 | info.aboutads
210 | 19407400 | 880 | 0.000034 | me.about
211 | 19403682 | 276 | 0.000096 | org.gnu
212 | 19403122 | 736 | 0.000037 | com.economist
213 | 19402862 | 677 | 0.000040 | com.box
214 | 19397500 | 713 | 0.000038 | com.scribd
215 | 19396454 | 1276 | 0.000023 | com.techrepublic
216 | 19395758 | 370 | 0.000071 | com.gitlab
217 | 19395036 | 489 | 0.000053 | com.walmart
218 | 19394568 | 348 | 0.000075 | com.techcrunch
219 | 19388888 | 406 | 0.000064 | co.ibb
220 | 19388054 | 487 | 0.000053 | com.nationalgeographic
221 | 19387258 | 734 | 0.000037 | com.venturebeat
222 | 19386248 | 541 | 0.000048 | com.inc
223 | 19385564 | 165 | 0.000171 | com.staticflickr
224 | 19385242 | 141 | 0.000211 | me.line
225 | 19382874 | 534 | 0.000049 | com.theconversation
226 | 19380928 | 510 | 0.000051 | com.nbcnews
227 | 19380802 | 582 | 0.000046 | com.digg
228 | 19379752 | 1117 | 0.000026 | edu.northwestern
229 | 19378952 | 787 | 0.000035 | org.semver
230 | 19377466 | 1002 | 0.000029 | edu.jhu
231 | 19377050 | 737 | 0.000036 | ca.cbc
232 | 19376942 | 539 | 0.000049 | com.googleblog
233 | 19375096 | 1397 | 0.000021 | edu.rutgers
234 | 19371812 | 647 | 0.000042 | com.photobucket
235 | 19370806 | 1099 | 0.000027 | edu.usc
236 | 19369504 | 600 | 0.000045 | gov.senate
237 | 19367834 | 198 | 0.000134 | com.calendly
238 | 19367222 | 231 | 0.000116 | net.windows
239 | 19365020 | 1262 | 0.000023 | org.kernel
240 | 19359670 | 1287 | 0.000023 | co.elastic
241 | 19358788 | 808 | 0.000035 | com.shutterstock
242 | 19358402 | 691 | 0.000039 | org.cambridge
243 | 19356378 | 1091 | 0.000027 | fm.last
244 | 19355798 | 321 | 0.000082 | tv.twitch
245 | 19355788 | 187 | 0.000147 | page.g
246 | 19355748 | 719 | 0.000037 | com.newyorker
247 | 19354116 | 748 | 0.000036 | org.bitbucket
248 | 19345608 | 566 | 0.000047 | com.oup
249 | 19344142 | 930 | 0.000031 | org.sciencemag
250 | 19343430 | 250 | 0.000107 | com.jotform
251 | 19340588 | 191 | 0.000143 | com.cloudinary
252 | 19340216 | 538 | 0.000049 | org.unicef
253 | 19339906 | 1020 | 0.000029 | edu.princeton
254 | 19339090 | 588 | 0.000045 | io.codepen
255 | 19338824 | 982 | 0.000030 | gov.usgs
256 | 19338794 | 879 | 0.000034 | uk.ac.ox
257 | 19336670 | 562 | 0.000047 | com.xinhuanet
258 | 19336118 | 1184 | 0.000025 | com.alexa
259 | 19333656 | 515 | 0.000051 | org.js
260 | 19331480 | 864 | 0.000034 | edu.asu
261 | 19329334 | 884 | 0.000034 | com.nvidia
262 | 19328550 | 1187 | 0.000025 | com.mediafire
263 | 19325282 | 221 | 0.000121 | net.sourceforge
264 | 19322430 | 1433 | 0.000021 | com.euronews
265 | 19321422 | 335 | 0.000079 | com.prnewswire
266 | 19320932 | 937 | 0.000031 | com.foxnews
267 | 19320900 | 1083 | 0.000027 | gov.fbi
268 | 19314904 | 1174 | 0.000025 | org.coursera
269 | 19313470 | 627 | 0.000043 | com.biomedcentral
270 | 19312726 | 1367 | 0.000022 | com.500px
271 | 19310966 | 88 | 0.000374 | com.stripe
272 | 19308392 | 237 | 0.000112 | com.tripadvisor
273 | 19306794 | 186 | 0.000148 | com.xing
274 | 19306438 | 248 | 0.000107 | com.wpengine
275 | 19306070 | 154 | 0.000191 | com.sharethis
276 | 19304476 | 929 | 0.000031 | com.nypost
277 | 19301278 | 740 | 0.000036 | com.politico
278 | 19297772 | 325 | 0.000081 | com.automattic
279 | 19297516 | 1450 | 0.000020 | com.digitaltrends
280 | 19297208 | 1072 | 0.000027 | co.g
281 | 19297054 | 1058 | 0.000028 | org.pnas
282 | 19295368 | 1142 | 0.000026 | com.axios
283 | 19295228 | 410 | 0.000063 | org.openstreetmap
284 | 19293638 | 1191 | 0.000025 | uk.co.guardian
285 | 19293548 | 1111 | 0.000026 | com.scmp
286 | 19290228 | 595 | 0.000045 | ca.canada
287 | 19289660 | 1431 | 0.000021 | org.eclipse
288 | 19288644 | 1261 | 0.000023 | uk.co.blogspot
289 | 19288300 | 678 | 0.000040 | com.huffpost
290 | 19286476 | 1322 | 0.000022 | org.semanticscholar
291 | 19284068 | 637 | 0.000042 | gov.census
292 | 19283194 | 469 | 0.000055 | gov.usda
293 | 19282746 | 66 | 0.000504 | com.trustpilot
294 | 19281282 | 765 | 0.000036 | com.hp
295 | 19281216 | 556 | 0.000048 | io.readthedocs
296 | 19276912 | 1240 | 0.000023 | com.nymag
297 | 19276304 | 256 | 0.000104 | org.iana
298 | 19276166 | 870 | 0.000034 | com.ssrn
299 | 19276092 | 962 | 0.000030 | edu.umn
300 | 19274986 | 993 | 0.000029 | au.net.abc
301 | 19273958 | 1448 | 0.000020 | com.sky
302 | 19273816 | 1211 | 0.000024 | edu.purdue
303 | 19273324 | 625 | 0.000043 | org.apa
304 | 19272362 | 217 | 0.000122 | com.eventbrite
305 | 19271512 | 167 | 0.000167 | gov.privacyshield
306 | 19271404 | 1064 | 0.000028 | com.about
307 | 19271058 | 340 | 0.000078 | com.dribbble
308 | 19270226 | 583 | 0.000046 | gov.house
309 | 19269684 | 907 | 0.000033 | com.sciencedaily
310 | 19269194 | 571 | 0.000047 | gov.noaa
311 | 19269034 | 445 | 0.000059 | com.arcgis
312 | 19267554 | 363 | 0.000073 | com.feedburner
313 | 19266714 | 591 | 0.000045 | fm.anchor
314 | 19266698 | 1533 | 0.000020 | ru.spb
315 | 19264280 | 567 | 0.000047 | site.business
316 | 19264260 | 1308 | 0.000022 | com.nikkei
317 | 19261166 | 1030 | 0.000029 | com.ggpht
318 | 19259980 | 868 | 0.000034 | org.change
319 | 19257436 | 1053 | 0.000028 | com.evernote
320 | 19256818 | 1360 | 0.000022 | edu.illinois
321 | 19255628 | 162 | 0.000174 | com.office
322 | 19255582 | 1343 | 0.000022 | org.postgresql
323 | 19252970 | 882 | 0.000034 | org.pypi
324 | 19252962 | 581 | 0.000046 | com.163
325 | 19251350 | 574 | 0.000047 | com.dailymotion
326 | 19248956 | 1617 | 0.000019 | org.aclu
327 | 19247890 | 1051 | 0.000028 | edu.uchicago
328 | 19245190 | 953 | 0.000030 | com.mdpi
329 | 19245010 | 889 | 0.000034 | de.spiegel
330 | 19243662 | 496 | 0.000052 | gov.hhs
331 | 19242990 | 400 | 0.000065 | com.indeed
332 | 19241006 | 696 | 0.000039 | gov.justice
333 | 19239998 | 621 | 0.000043 | gov.state
334 | 19237584 | 1665 | 0.000019 | org.greenpeace
335 | 19236460 | 106 | 0.000333 | net.jsfiddle
336 | 19233688 | 789 | 0.000035 | gov.congress
337 | 19233144 | 689 | 0.000039 | com.bigcartel
338 | 19231252 | 2010 | 0.000017 | edu.gatech
339 | 19230230 | 484 | 0.000054 | gov.epa
340 | 19229306 | 1923 | 0.000017 | com.openai
341 | 19228750 | 619 | 0.000043 | org.ohchr
342 | 19227578 | 893 | 0.000033 | org.fao
343 | 19227162 | 390 | 0.000067 | com.atlassian
344 | 19227150 | 1812 | 0.000018 | org.science
345 | 19224842 | 768 | 0.000035 | com.jetbrains
346 | 19223786 | 1223 | 0.000024 | com.foursquare
347 | 19223692 | 454 | 0.000057 | com.squareup
348 | 19220130 | 124 | 0.000297 | com.alicdn
349 | 19219200 | 1996 | 0.000017 | org.phys
350 | 19218212 | 1090 | 0.000027 | cn.com.chinadaily
351 | 19217284 | 339 | 0.000078 | com.ebay
352 | 19216816 | 347 | 0.000075 | com.surveymonkey
353 | 19216726 | 883 | 0.000034 | com.chrome
354 | 19216716 | 1289 | 0.000023 | uk.co.thetimes
355 | 19216348 | 260 | 0.000102 | com.webflow
356 | 19216090 | 2023 | 0.000017 | com.foxbusiness
357 | 19213972 | 913 | 0.000032 | app.netlify
358 | 19213822 | 385 | 0.000068 | com.disqus
359 | 19212532 | 1229 | 0.000024 | com.hollywoodreporter
360 | 19211550 | 766 | 0.000036 | gov.archives
361 | 19210746 | 423 | 0.000061 | com.getpocket
362 | 19210526 | 396 | 0.000066 | com.samsung
363 | 19207318 | 371 | 0.000071 | com.proofpoint
364 | 19206792 | 1119 | 0.000026 | edu.utexas
365 | 19206132 | 132 | 0.000248 | com.zendesk
366 | 19204834 | 397 | 0.000066 | com.substack
367 | 19202152 | 702 | 0.000038 | com.mashable
368 | 19201060 | 1081 | 0.000027 | org.jstor
369 | 19200068 | 649 | 0.000042 | net.azurewebsites
370 | 19198950 | 200 | 0.000133 | org.allaboutcookies
371 | 19198558 | 421 | 0.000062 | com.freepik
372 | 19197812 | 529 | 0.000049 | com.netdna-ssl
373 | 19196780 | 547 | 0.000048 | com.snapchat
374 | 19194372 | 679 | 0.000040 | com.gumroad
375 | 19194238 | 144 | 0.000206 | com.paypalobjects
376 | 19193934 | 76 | 0.000397 | me.fb
377 | 19192814 | 526 | 0.000050 | ch.admin
378 | 19191374 | 863 | 0.000034 | com.pinimg
379 | 19191250 | 743 | 0.000036 | com.britannica
380 | 19191082 | 1489 | 0.000020 | au.com.smh
381 | 19190906 | 704 | 0.000038 | com.vice
382 | 19189032 | 554 | 0.000048 | gov.copyright
383 | 19183662 | 1157 | 0.000025 | com.dw
384 | 19183626 | 345 | 0.000076 | net.themeforest
385 | 19182584 | 452 | 0.000058 | com.patreon
386 | 19182178 | 1235 | 0.000024 | uk.co.mirror
387 | 19176944 | 1363 | 0.000022 | de.sueddeutsche
388 | 19173466 | 1014 | 0.000029 | uk.ac.cam
389 | 19172198 | 333 | 0.000080 | fr.cnil
390 | 19169612 | 3327 | 0.000010 | google.research
391 | 19167346 | 784 | 0.000035 | cc.postimg
392 | 19165170 | 609 | 0.000044 | gov.nist
393 | 19163734 | 1493 | 0.000020 | ca.sfu
394 | 19162676 | 507 | 0.000051 | com.gmail
395 | 19162016 | 2342 | 0.000015 | com.martinfowler
396 | 19160966 | 1297 | 0.000023 | org.imf
397 | 19160094 | 983 | 0.000030 | edu.si
398 | 19158666 | 709 | 0.000038 | org.oecd
399 | 19157194 | 402 | 0.000064 | ru.gov
400 | 19156178 | 1059 | 0.000028 | com.chicagotribune
401 | 19154640 | 1257 | 0.000023 | com.crunchbase
402 | 19154290 | 404 | 0.000064 | com.optimizely
403 | 19153126 | 75 | 0.000404 | net.akamaihd
404 | 19152672 | 1005 | 0.000029 | com.intuit
405 | 19151200 | 156 | 0.000188 | org.networkadvertising
406 | 19150938 | 2139 | 0.000016 | app.web
407 | 19150100 | 947 | 0.000031 | com.history
408 | 19147258 | 2498 | 0.000014 | com.ibtimes
409 | 19146882 | 161 | 0.000174 | com.rawgit
410 | 19146734 | 284 | 0.000093 | net.azureedge
411 | 19146432 | 478 | 0.000054 | nl.google
412 | 19144122 | 475 | 0.000055 | com.meetup
413 | 19143658 | 2728 | 0.000013 | com.cbs
414 | 19143614 | 1549 | 0.000019 | org.unhcr
415 | 19143010 | 119 | 0.000312 | de.google
416 | 19140470 | 977 | 0.000030 | com.sap
417 | 19140308 | 577 | 0.000047 | com.kickstarter
418 | 19139232 | 431 | 0.000060 | com.media-amazon
419 | 19138468 | 1340 | 0.000022 | com.aljazeera
420 | 19138260 | 346 | 0.000076 | net.php
421 | 19136610 | 1894 | 0.000018 | com.straitstimes
422 | 19135416 | 52 | 0.000645 | com.godaddy
423 | 19134718 | 1103 | 0.000027 | com.insider
424 | 19134162 | 1017 | 0.000029 | gov.treasury
425 | 19133204 | 1601 | 0.000019 | us.imageshack
426 | 19132232 | 1269 | 0.000023 | org.sphinx-doc
427 | 19131826 | 1073 | 0.000027 | link.page
428 | 19131230 | 508 | 0.000051 | cn.com.people
429 | 19130856 | 1457 | 0.000020 | de.mpg
430 | 19129168 | 514 | 0.000051 | org.debian
431 | 19124870 | 2597 | 0.000014 | au.com.news
432 | 19123686 | 267 | 0.000098 | jp.co.yahoo
433 | 19122658 | 336 | 0.000078 | com.typepad
434 | 19121732 | 73 | 0.000429 | com.wsimg
435 | 19121444 | 941 | 0.000031 | com.podbean
436 | 19121320 | 1066 | 0.000028 | uk.gov.service
437 | 19121300 | 235 | 0.000113 | gg.discord
438 | 19120430 | 1490 | 0.000020 | com.over-blog
439 | 19119364 | 306 | 0.000086 | com.eepurl
440 | 19119274 | 636 | 0.000042 | gov.usa
441 | 19118634 | 633 | 0.000043 | com.stumbleupon
442 | 19116958 | 354 | 0.000074 | org.hbr
443 | 19114264 | 1789 | 0.000019 | ms.1drv
444 | 19114220 | 673 | 0.000040 | google.blog
445 | 19112010 | 1796 | 0.000019 | com.buzzfeednews
446 | 19110954 | 1093 | 0.000027 | org.ilo
447 | 19110296 | 1962 | 0.000017 | com.mystrikingly
448 | 19108200 | 90 | 0.000367 | net.facebook
449 | 19107938 | 1446 | 0.000021 | de.zeit
450 | 19107838 | 471 | 0.000055 | com.tripod
451 | 19106728 | 1065 | 0.000028 | int.coe
452 | 19106358 | 1389 | 0.000021 | com.teachable
453 | 19106154 | 1136 | 0.000026 | com.thehill
454 | 19106002 | 56 | 0.000570 | net.typekit
455 | 19105858 | 2051 | 0.000016 | uk.co.standard
456 | 19103522 | 2071 | 0.000016 | com.newscientist
457 | 19102398 | 2527 | 0.000014 | com.channel4
458 | 19102058 | 2631 | 0.000013 | com.storify
459 | 19098660 | 1421 | 0.000021 | edu.duke
460 | 19094500 | 596 | 0.000045 | com.healthline
461 | 19093638 | 1302 | 0.000023 | au.gov.nsw
462 | 19091574 | 2525 | 0.000014 | org.maven
463 | 19090438 | 622 | 0.000043 | org.worldwildlife
464 | 19086786 | 1022 | 0.000029 | com.brightcove
465 | 19086756 | 1541 | 0.000020 | int.unfccc
466 | 19086616 | 896 | 0.000033 | com.withgoogle
467 | 19085682 | 292 | 0.000090 | com.squarespace
468 | 19085616 | 2578 | 0.000014 | com.instructables
469 | 19084862 | 2082 | 0.000016 | com.rt
470 | 19084252 | 2430 | 0.000015 | org.tensorflow
471 | 19083306 | 263 | 0.000101 | me.telegram
472 | 19081082 | 638 | 0.000042 | com.cisco
473 | 19080766 | 809 | 0.000035 | watch.fb
474 | 19079900 | 465 | 0.000056 | com.steampowered
475 | 19079012 | 760 | 0.000036 | com.deviantart
476 | 19074242 | 603 | 0.000044 | com.googlecode
477 | 19072502 | 1813 | 0.000018 | uk.parliament
478 | 19070624 | 572 | 0.000047 | com.airbnb
479 | 19068360 | 543 | 0.000048 | com.matterport
480 | 19067152 | 2020 | 0.000017 | org.edx
481 | 19063090 | 2909 | 0.000012 | com.dreamstime
482 | 19063038 | 1875 | 0.000018 | com.googlesource
483 | 19062408 | 1144 | 0.000026 | com.dell
484 | 19061850 | 2934 | 0.000012 | me.ogp
485 | 19059726 | 1131 | 0.000026 | org.hrw
486 | 19059152 | 2058 | 0.000016 | edu.cuny
487 | 19056892 | 436 | 0.000059 | com.elsevier
488 | 19056600 | 961 | 0.000030 | gov.dhs
489 | 19055372 | 1086 | 0.000027 | com.bostonglobe
490 | 19053316 | 283 | 0.000093 | com.salesforce
491 | 19051650 | 2876 | 0.000012 | org.icrc
492 | 19050700 | 1479 | 0.000020 | gov.defense
493 | 19049286 | 169 | 0.000163 | com.discord
494 | 19048846 | 2672 | 0.000013 | cc.arduino
495 | 19048696 | 172 | 0.000160 | com.addtoany
496 | 19048338 | 2122 | 0.000016 | com.padlet
497 | 19047634 | 2187 | 0.000016 | uk.co.thesun
498 | 19047204 | 1933 | 0.000017 | edu.georgetown
499 | 19046984 | 563 | 0.000047 | com.deloitte
500 | 19044280 | 2026 | 0.000017 | ca.blogspot
501 | 19044124 | 2180 | 0.000016 | edu.ucsb
502 | 19040766 | 2965 | 0.000012 | org.wikibooks
503 | 19040244 | 2116 | 0.000016 | edu.tufts
504 | 19039882 | 251 | 0.000107 | org.bbb
505 | 19037044 | 3519 | 0.000010 | com.deepmind
506 | 19036140 | 362 | 0.000073 | net.secureservercdn
507 | 19035692 | 3370 | 0.000010 | google.ai
508 | 19032558 | 1436 | 0.000021 | eu.politico
509 | 19032004 | 1965 | 0.000017 | edu.wustl
510 | 19031454 | 1040 | 0.000028 | com.istockphoto
511 | 19029618 | 761 | 0.000036 | com.thinkwithgoogle
512 | 19029288 | 3052 | 0.000011 | com.diigo
513 | 19029204 | 2945 | 0.000012 | com.snap
514 | 19027680 | 1181 | 0.000025 | us.icio
515 | 19027416 | 1447 | 0.000020 | ch.ipcc
516 | 19026814 | 148 | 0.000199 | com.jimdo
517 | 19024796 | 2119 | 0.000016 | com.france24
518 | 19024010 | 3548 | 0.000010 | com.ulule
519 | 19021632 | 1138 | 0.000026 | com.arstechnica
520 | 19020080 | 2471 | 0.000014 | com.instructure
521 | 19019232 | 960 | 0.000030 | edu.brookings
522 | 19019042 | 2409 | 0.000015 | edu.caltech
523 | 19018294 | 252 | 0.000105 | com.aliyuncs
524 | 19018192 | 24 | 0.001461 | cn.gov.miit
525 | 19018120 | 667 | 0.000041 | ca.amazon
526 | 19016070 | 1403 | 0.000021 | org.rfc-editor
527 | 19013594 | 1151 | 0.000026 | com.verizon
528 | 19009160 | 273 | 0.000096 | me.m
529 | 19007720 | 289 | 0.000091 | ru.ok
530 | 19006594 | 1212 | 0.000024 | uk.nhs
531 | 19002870 | 708 | 0.000038 | com.intel
532 | 19001948 | 1983 | 0.000017 | gov.lbl
533 | 19001690 | 2411 | 0.000015 | ru.kremlin
534 | 18999232 | 2443 | 0.000015 | edu.oregonstate
535 | 18998558 | 426 | 0.000060 | com.fastcompany
536 | 18998428 | 505 | 0.000052 | com.ssl-images-amazon
537 | 18997736 | 2873 | 0.000012 | fr.archives-ouvertes
538 | 18995582 | 2089 | 0.000016 | org.archlinux
539 | 18995234 | 576 | 0.000047 | com.wufoo
540 | 18994966 | 1422 | 0.000021 | com.people
541 | 18994790 | 2484 | 0.000014 | gov.cia
542 | 18994448 | 2564 | 0.000014 | tl.we
543 | 18994368 | 2249 | 0.000015 | org.unwomen
544 | 18993384 | 2942 | 0.000012 | com.kaggle
545 | 18991740 | 2808 | 0.000012 | com.aboutamazon
546 | 18991620 | 4094 | 0.000008 | com.sho
547 | 18991158 | 313 | 0.000085 | gov.ftc
548 | 18991106 | 659 | 0.000041 | com.docker
549 | 18990118 | 728 | 0.000037 | com.zoominfo
550 | 18987498 | 3417 | 0.000010 | com.pearltrees
551 | 18984800 | 2096 | 0.000016 | io.gitlab
552 | 18984398 | 3046 | 0.000011 | org.scala-lang
553 | 18983188 | 317 | 0.000083 | com.typeform
554 | 18981496 | 1956 | 0.000017 | com.asahi
555 | 18979704 | 368 | 0.000071 | net.imgix
556 | 18978984 | 168 | 0.000164 | com.youronlinechoices
557 | 18977728 | 738 | 0.000036 | com.symantec
558 | 18977396 | 2904 | 0.000012 | jp.co.japantimes
559 | 18975970 | 1361 | 0.000022 | com.buymeacoffee
560 | 18975394 | 1589 | 0.000019 | com.justia
561 | 18974980 | 2786 | 0.000012 | uk.co.huffingtonpost
562 | 18974464 | 552 | 0.000048 | com.gartner
563 | 18974226 | 2928 | 0.000012 | jp.ac.u-tokyo
564 | 18973684 | 398 | 0.000065 | com.force
565 | 18971782 | 2893 | 0.000012 | no.nrk
566 | 18969492 | 3113 | 0.000011 | cc.taplink
567 | 18967596 | 1330 | 0.000022 | org.amnesty
568 | 18967562 | 2090 | 0.000016 | com.thestar
569 | 18966190 | 2353 | 0.000015 | tv.ustream
570 | 18965606 | 3480 | 0.000010 | tv.blip
571 | 18964426 | 2212 | 0.000016 | com.peatix
572 | 18961114 | 755 | 0.000036 | com.redhat
573 | 18960518 | 1877 | 0.000018 | com.firebaseapp
574 | 18960296 | 2075 | 0.000016 | com.flipboard
575 | 18959466 | 989 | 0.000029 | com.stackexchange
576 | 18959354 | 446 | 0.000058 | com.herokuapp
577 | 18958890 | 495 | 0.000052 | com.campaign-archive
578 | 18958728 | 1986 | 0.000017 | org.nber
579 | 18958182 | 1172 | 0.000025 | com.ecwid
580 | 18957588 | 2719 | 0.000013 | hk.com.google
581 | 18957146 | 2787 | 0.000012 | blog.home
582 | 18957046 | 2917 | 0.000012 | com.rakuten
583 | 18956066 | 2639 | 0.000013 | org.biorxiv
584 | 18955174 | 1123 | 0.000026 | gov.wa
585 | 18953020 | 558 | 0.000048 | com.netflix
586 | 18952656 | 2681 | 0.000013 | com.gamespot
587 | 18950358 | 555 | 0.000048 | com.canva
588 | 18947616 | 3041 | 0.000011 | org.rsf
589 | 18947462 | 435 | 0.000059 | com.mckinsey
590 | 18945824 | 1952 | 0.000017 | com.reverbnation
591 | 18944198 | 1271 | 0.000023 | net.clickbank
592 | 18944066 | 319 | 0.000082 | jp.co.amazon
593 | 18941406 | 2109 | 0.000016 | com.jimdosite
594 | 18939532 | 3017 | 0.000011 | com.self
595 | 18938332 | 201 | 0.000132 | ru.mail
596 | 18937396 | 1942 | 0.000017 | gov.eia
597 | 18937228 | 3073 | 0.000011 | org.oas
598 | 18936846 | 656 | 0.000041 | com.iheart
599 | 18935892 | 2163 | 0.000016 | com.haaretz
600 | 18935478 | 3045 | 0.000011 | edu.syr
601 | 18935194 | 978 | 0.000030 | com.icons8
602 | 18934738 | 243 | 0.000108 | to.amzn
603 | 18934162 | 2562 | 0.000014 | org.computer
604 | 18933548 | 1460 | 0.000020 | site.notion
605 | 18929554 | 651 | 0.000042 | org.iso
606 | 18929436 | 1837 | 0.000018 | com.livescience
607 | 18928992 | 2618 | 0.000013 | com.infogram
608 | 18928636 | 2155 | 0.000016 | gov.usembassy
609 | 18928430 | 1348 | 0.000022 | com.mapquest
610 | 18928262 | 3111 | 0.000011 | com.tutorialspoint
611 | 18928028 | 665 | 0.000041 | com.qualtrics
612 | 18927130 | 2022 | 0.000017 | cn.gov.fmprc
613 | 18926910 | 697 | 0.000039 | org.ieee
614 | 18926652 | 1078 | 0.000027 | com.pcmag
615 | 18926588 | 2435 | 0.000015 | com.popsugar
616 | 18925646 | 2157 | 0.000016 | com.iconfinder
617 | 18925600 | 688 | 0.000039 | com.entrepreneur
618 | 18925436 | 344 | 0.000077 | com.visualstudio
619 | 18925272 | 1032 | 0.000028 | com.dropboxusercontent
620 | 18924720 | 2863 | 0.000012 | it.scoop
621 | 18922978 | 2767 | 0.000013 | com.pbworks
622 | 18921774 | 2043 | 0.000016 | ph.telegra
623 | 18920704 | 957 | 0.000030 | me.onelink
624 | 18919802 | 3326 | 0.000010 | org.grist
625 | 18917230 | 2833 | 0.000012 | com.fineartamerica
626 | 18917112 | 2939 | 0.000012 | au.edu.unimelb
627 | 18916394 | 1800 | 0.000019 | mil.army
628 | 18914670 | 4477 | 0.000007 | com.mail
629 | 18914274 | 3514 | 0.000010 | com.afp
630 | 18914002 | 1345 | 0.000022 | org.consumerreports
631 | 18911486 | 3760 | 0.000009 | net.docdroid
632 | 18911096 | 862 | 0.000034 | com.oreilly
633 | 18910022 | 2929 | 0.000012 | com.novell
634 | 18908850 | 981 | 0.000030 | org.mediawiki
635 | 18908110 | 1887 | 0.000018 | com.bol
636 | 18908096 | 2755 | 0.000013 | com.gq
637 | 18903732 | 1352 | 0.000022 | com.maxmind
638 | 18903710 | 1241 | 0.000023 | com.licdn
639 | 18903220 | 1791 | 0.000019 | gov.cancer
640 | 18902968 | 2951 | 0.000012 | com.eonline
641 | 18902644 | 3186 | 0.000011 | com.theonion
642 | 18902286 | 2697 | 0.000013 | net.openid
643 | 18901832 | 2060 | 0.000016 | com.dictionary
644 | 18901570 | 2523 | 0.000014 | com.foreignpolicy
645 | 18900276 | 2555 | 0.000014 | org.c-span
646 | 18898920 | 506 | 0.000052 | net.fastly
647 | 18898588 | 1941 | 0.000017 | edu.tamu
648 | 18897290 | 699 | 0.000039 | int.wipo
649 | 18896958 | 1101 | 0.000027 | com.merriam-webster
650 | 18896876 | 3366 | 0.000010 | org.freedomhouse
651 | 18896364 | 111 | 0.000328 | com.livestream
652 | 18896362 | 1832 | 0.000018 | com.verywellmind
653 | 18895668 | 379 | 0.000069 | jp.ameblo
654 | 18894814 | 979 | 0.000030 | com.forrester
655 | 18893636 | 1550 | 0.000019 | com.wikia
656 | 18893130 | 2061 | 0.000016 | org.unep
657 | 18892602 | 2340 | 0.000015 | com.patch
658 | 18892586 | 159 | 0.000184 | com.weibo
659 | 18890690 | 548 | 0.000048 | com.sxsw
660 | 18889804 | 2722 | 0.000013 | com.motherjones
661 | 18889768 | 1254 | 0.000023 | com.jekyllrb
662 | 18889752 | 874 | 0.000034 | gov.federalregister
663 | 18889268 | 3930 | 0.000009 | com.instapaper
664 | 18888650 | 2844 | 0.000012 | com.thecut
665 | 18887192 | 1107 | 0.000027 | net.authorize
666 | 18886308 | 2493 | 0.000014 | edu.gwu
667 | 18885348 | 3225 | 0.000010 | org.csis
668 | 18884650 | 2514 | 0.000014 | gov.ky
669 | 18884342 | 2709 | 0.000013 | com.theintercept
670 | 18883802 | 3213 | 0.000011 | ua.com.google
671 | 18883552 | 1530 | 0.000020 | com.snopes
672 | 18883178 | 3655 | 0.000009 | au.com.businessinsider
673 | 18883096 | 5172 | 0.000007 | com.ixbt
674 | 18882438 | 2757 | 0.000013 | org.fas
675 | 18882418 | 1070 | 0.000027 | com.tableau
676 | 18882310 | 1462 | 0.000020 | gov.uscourts
677 | 18881746 | 3551 | 0.000010 | com.teacherspayteachers
678 | 18880066 | 664 | 0.000041 | gov.sec
679 | 18877954 | 550 | 0.000048 | com.scorecardresearch
680 | 18877136 | 2126 | 0.000016 | org.ncsl
681 | 18877074 | 3975 | 0.000009 | org.cpj
682 | 18876192 | 230 | 0.000117 | com.naver
683 | 18875692 | 2680 | 0.000013 | uk.ac.imperial
684 | 18875654 | 1205 | 0.000024 | com.findlaw
685 | 18874462 | 2829 | 0.000012 | si.gov
686 | 18874298 | 1113 | 0.000026 | edu.ucla
687 | 18873866 | 2380 | 0.000015 | com.voanews
688 | 18871112 | 3856 | 0.000009 | org.edublogs
689 | 18868180 | 2869 | 0.000012 | org.marketplace
690 | 18867852 | 3725 | 0.000009 | net.aljazeera
691 | 18867276 | 2376 | 0.000015 | com.channelnewsasia
692 | 18867194 | 723 | 0.000037 | org.plos
693 | 18866674 | 1041 | 0.000028 | net.atlassian
694 | 18866468 | 4110 | 0.000008 | edu.ua
695 | 18866372 | 1210 | 0.000024 | gov.uspto
696 | 18864712 | 2418 | 0.000015 | com.goodhousekeeping
697 | 18863308 | 1328 | 0.000022 | org.altervista
698 | 18862148 | 1867 | 0.000018 | com.billboard
699 | 18861664 | 1686 | 0.000019 | gov.govinfo
700 | 18860214 | 2238 | 0.000015 | ru.ria
701 | 18860032 | 2679 | 0.000013 | com.nationalpost
702 | 18859750 | 4921 | 0.000007 | com.viki
703 | 18859350 | 3294 | 0.000010 | com.hm
704 | 18858510 | 2867 | 0.000012 | com.treehugger
705 | 18857788 | 932 | 0.000031 | com.termsfeed
706 | 18857590 | 3457 | 0.000010 | ru.interfax
707 | 18857080 | 1251 | 0.000023 | com.squarespace-cdn
708 | 18856272 | 2966 | 0.000012 | com.sandiegouniontribune
709 | 18856050 | 1506 | 0.000020 | io.termly
710 | 18854390 | 4288 | 0.000008 | com.dailycaller
711 | 18853340 | 2758 | 0.000013 | com.html5rocks
712 | 18850964 | 3377 | 0.000010 | is.archive
713 | 18850298 | 3057 | 0.000011 | com.nextdoor
714 | 18849796 | 3786 | 0.000009 | me.site123
715 | 18848064 | 1338 | 0.000022 | org.mitre
716 | 18847026 | 5985 | 0.000006 | com.fanpop
717 | 18846742 | 2599 | 0.000014 | org.pewtrusts
718 | 18846728 | 3574 | 0.000009 | org.britishcouncil
719 | 18846392 | 197 | 0.000136 | com.caniuse
720 | 18845760 | 2369 | 0.000015 | va.vatican
721 | 18845434 | 259 | 0.000102 | com.getbootstrap
722 | 18843938 | 3714 | 0.000009 | com.worldpopulationreview
723 | 18843390 | 530 | 0.000049 | com.adweek
724 | 18842378 | 2234 | 0.000016 | gov.oregon
725 | 18842176 | 906 | 0.000033 | com.digitaloceanspaces
726 | 18841406 | 3644 | 0.000009 | org.transparency
727 | 18841194 | 1324 | 0.000022 | com.windows
728 | 18841082 | 3247 | 0.000010 | com.tomsguide
729 | 18840980 | 535 | 0.000049 | com.gofundme
730 | 18839198 | 2875 | 0.000012 | org.unfpa
731 | 18838976 | 971 | 0.000030 | com.imageshack
732 | 18838954 | 3992 | 0.000008 | com.metacritic
733 | 18838084 | 4023 | 0.000008 | org.carnegieendowment
734 | 18837572 | 498 | 0.000052 | com.bigcommerce
735 | 18836694 | 916 | 0.000032 | com.libsyn
736 | 18836082 | 1924 | 0.000017 | com.kaltura
737 | 18835258 | 2926 | 0.000012 | org.wikisource
738 | 18835144 | 3021 | 0.000011 | org.gnupg
739 | 18835004 | 3164 | 0.000011 | org.signal
740 | 18834884 | 493 | 0.000052 | com.aol
741 | 18834658 | 3055 | 0.000011 | no.uio
742 | 18834006 | 5000 | 0.000007 | ua.nv
743 | 18833898 | 3110 | 0.000011 | ru.vedomosti
744 | 18833794 | 2169 | 0.000016 | com.wakelet
745 | 18833066 | 867 | 0.000034 | com.zoho
746 | 18832896 | 640 | 0.000042 | jp.ne.sakura
747 | 18832118 | 3963 | 0.000009 | com.theweek
748 | 18831308 | 2113 | 0.000016 | com.proquest
749 | 18830330 | 1118 | 0.000026 | com.slate
750 | 18829038 | 1868 | 0.000018 | com.speakerdeck
751 | 18828890 | 2825 | 0.000012 | jp.nicovideo
752 | 18828214 | 241 | 0.000110 | jp.co.google
753 | 18826472 | 3223 | 0.000010 | com.tradingeconomics
754 | 18824538 | 2526 | 0.000014 | com.radio
755 | 18824466 | 5366 | 0.000006 | org.bakerlab
756 | 18824444 | 1168 | 0.000025 | org.webaim
757 | 18823292 | 166 | 0.000169 | org.whatwg
758 | 18823246 | 3038 | 0.000011 | com.bloglovin
759 | 18822092 | 3620 | 0.000009 | edu.temple
760 | 18821762 | 1071 | 0.000027 | com.engadget
761 | 18821604 | 1193 | 0.000025 | io.powr
762 | 18821134 | 671 | 0.000040 | org.eff
763 | 18821128 | 4228 | 0.000008 | com.virgin
764 | 18820708 | 330 | 0.000080 | com.wistia
765 | 18820470 | 4223 | 0.000008 | com.scotsman
766 | 18820390 | 3533 | 0.000010 | ly.plot
767 | 18819496 | 3657 | 0.000009 | de.diplo
768 | 18818614 | 1855 | 0.000018 | com.ticketmaster
769 | 18817694 | 2700 | 0.000013 | com.me
770 | 18816976 | 71 | 0.000441 | com.oculus
771 | 18816804 | 2918 | 0.000012 | com.digitaljournal
772 | 18816466 | 1580 | 0.000019 | com.cbssports
773 | 18815294 | 3761 | 0.000009 | io.fabric
774 | 18814250 | 2485 | 0.000014 | com.surveygizmo
775 | 18811680 | 4234 | 0.000008 | io.meduza
776 | 18811246 | 594 | 0.000045 | fr.free
777 | 18810210 | 3782 | 0.000009 | org.neocities
778 | 18810150 | 2685 | 0.000013 | com.jpost
779 | 18809838 | 2701 | 0.000013 | com.washingtontimes
780 | 18809742 | 3610 | 0.000009 | org.annualreviews
781 | 18809160 | 3333 | 0.000010 | int.nato
782 | 18808792 | 1994 | 0.000017 | com.trustwave
783 | 18808532 | 3214 | 0.000011 | org.heritage
784 | 18808074 | 3486 | 0.000010 | org.repec
785 | 18807740 | 1809 | 0.000019 | co.carrd
786 | 18807316 | 4100 | 0.000008 | uk.co.timesonline
787 | 18806184 | 3358 | 0.000010 | re.appsto
788 | 18805892 | 81 | 0.000387 | org.nginx
789 | 18805568 | 1314 | 0.000022 | com.playstation
790 | 18805506 | 3261 | 0.000010 | uk.ac.leeds
791 | 18805074 | 521 | 0.000050 | org.drupal
792 | 18804212 | 3656 | 0.000009 | com.citylab
793 | 18803430 | 1185 | 0.000025 | com.gizmodo
794 | 18803162 | 3903 | 0.000009 | com.nationalreview
795 | 18802824 | 3095 | 0.000011 | org.nrdc
796 | 18802742 | 3972 | 0.000009 | net.openreview
797 | 18802624 | 2783 | 0.000012 | com.wpcomstaging
798 | 18802584 | 2589 | 0.000014 | org.sleepfoundation
799 | 18801620 | 4771 | 0.000007 | com.bizcommunity
800 | 18801430 | 1304 | 0.000023 | com.udemy
801 | 18801120 | 2874 | 0.000012 | com.towardsdatascience
802 | 18800914 | 3649 | 0.000009 | com.glitch
803 | 18800894 | 2162 | 0.000016 | com.unity
804 | 18800888 | 549 | 0.000048 | com.globenewswire
805 | 18800242 | 3836 | 0.000009 | com.bepress
806 | 18800046 | 2567 | 0.000014 | com.thespruce
807 | 18799488 | 1921 | 0.000017 | ru.rbc
808 | 18798872 | 3774 | 0.000009 | com.pbase
809 | 18798732 | 1486 | 0.000020 | br.com.uol
810 | 18798690 | 3885 | 0.000009 | ru.mid
811 | 18797228 | 3569 | 0.000009 | org.wilsoncenter
812 | 18796512 | 4905 | 0.000007 | it.justpaste
813 | 18795920 | 2460 | 0.000014 | ru.rutube
814 | 18795064 | 866 | 0.000034 | com.newsweek
815 | 18794898 | 3272 | 0.000010 | au.edu.sydney
816 | 18793792 | 1975 | 0.000017 | fr.blogspot
817 | 18793384 | 648 | 0.000042 | com.mimecast
818 | 18792904 | 3771 | 0.000009 | it.eventbrite
819 | 18792676 | 3090 | 0.000011 | com.financialpost
820 | 18792614 | 1742 | 0.000019 | com.technologyreview
821 | 18792574 | 4677 | 0.000007 | edu.csun
822 | 18792430 | 4810 | 0.000007 | org.scala-sbt
823 | 18792342 | 705 | 0.000038 | net.b-cdn
824 | 18792186 | 3982 | 0.000008 | com.indystar
825 | 18791040 | 2580 | 0.000014 | ru.tass
826 | 18790612 | 1863 | 0.000018 | ch.ethz
827 | 18790492 | 2764 | 0.000013 | com.newrepublic
828 | 18790304 | 3622 | 0.000009 | ca.uvic
829 | 18790238 | 614 | 0.000044 | com.fandom
830 | 18789298 | 4500 | 0.000007 | com.kinja
831 | 18788432 | 2923 | 0.000012 | int.wmo
832 | 18786950 | 1036 | 0.000028 | com.akamai
833 | 18785250 | 3444 | 0.000010 | ru.lenta
834 | 18783154 | 3886 | 0.000009 | com.slidesharecdn
835 | 18783124 | 4278 | 0.000008 | org.elifesciences
836 | 18782898 | 2720 | 0.000013 | com.fivethirtyeight
837 | 18782704 | 2770 | 0.000013 | com.verywellhealth
838 | 18782398 | 729 | 0.000037 | org.reactjs
839 | 18782236 | 1055 | 0.000028 | org.unicode
840 | 18780268 | 2032 | 0.000016 | org.americanbar
841 | 18779610 | 4850 | 0.000007 | co.aeon
842 | 18779582 | 754 | 0.000036 | com.moz
843 | 18779480 | 4695 | 0.000007 | com.jigsy
844 | 18779480 | 202 | 0.000131 | com.jimcdn
845 | 18779168 | 1990 | 0.000017 | com.kxcdn
846 | 18779096 | 1802 | 0.000019 | com.images-amazon
847 | 18778590 | 4028 | 0.000008 | com.thediplomat
848 | 18778408 | 4189 | 0.000008 | com.allafrica
8491877807418540.000018gov.medlineplus
8501877763010270.000029com.emarketer
8511877694831030.000011com.blogtalkradio
8521877666831480.000011com.biography
8531877565013550.000022com.xkcd
8541877471813870.000021com.thenextweb
8551877418210350.000028com.css-tricks
8561877374627650.000013io.redis
8571877363218300.000018io.kubernetes
8581877151234300.000010fr.rfi
8591877080643080.000008au.edu.adelaide
8601877002831760.000011org.nationalgeographic
8611876989610600.000028com.yandex
8621876783032640.000010org.panda
863187672922710.000097de.amazon
8641876693444710.000008fi.hs
8651876649829920.000011com.euractiv
8661876635051250.000007edu.umt
8671876552637200.000009net.ipsnews
868187650482240.000118org.icann
8691876464239710.000009gov.ornl
8701876373640420.000008org.thinkprogress
8711876353235360.000010vn.com.google
8721876349614740.000020edu.umd
873187634464410.000059org.opensource
8741876276628390.000012fi.yle
8751876126811290.000026com.glassdoor
8761876097438090.000009com.crashlytics
877187605803830.000069it.google
8781876053433920.000010cn.globaltimes
8791875947038410.000009com.sputniknews
8801875931632960.000010gov.doi
8811875928410480.000028ly.cutt
8821875915437690.000009com.clarin
8831875912438330.000009uk.gov.metoffice
8841875889457580.000006org.cgsociety
8851875874614520.000020com.rollingstone
8861875855014580.000020com.smashingmagazine
8871875801432090.000011org.cfr
8881875792639330.000009gov.fec
8891875783435300.000010ru.rosminzdrav
8901875697812430.000023org.golang
8911875689453820.000006edu.chapman
8921875682445440.000007uk.ac.nhm
8931875649651100.000007au.edu.uts
8941875628213730.000021edu.ucsd
8951875627234270.000010edu.unh
8961875543034980.000010jp.ne.docomo
8971875512423000.000015com.w3techs
898187548249500.000031com.ubuntu
8991875465418890.000018com.indiegogo
9001875456640750.000008org.tigris
9011875413618790.000018int.itu
9021875338641430.000008com.coca-cola
9031875251838250.000009ru.gazeta
9041875249843000.000008ch.swissinfo
9051875068225070.000014se.haxx
9061875043451430.000007com.chinatimes
9071874926047020.000007edu.upf
9081874792418620.000018sh.brew
9091874789647950.000007kr.co.koreatimes
9101874719238050.000009mt.gov
9111874609642170.000008com.motor1
9121874580857560.000006com.tv
9131874551221040.000016net.vnexpress
9141874408425650.000014gd.is
9151874401227930.000012ru.hh
9161874368412700.000023org.wiktionary
9171874362842990.000008uk.ac.exeter
9181874313234710.000010com.bhg
9191874251619040.000017org.linuxfoundation
9201874209039080.000009build.bazel
9211874162414010.000021com.freeprivacypolicy
9221874141033340.000010cn.org.china
9231874087625730.000014com.pcworld
9241874007044030.000008com.bravesites
9251874004432180.000010com.nyt
9261873996840270.000008com.usmagazine
9271873933012030.000024com.webs
9281873929855300.000006com.gust
9291873922856830.000006tv.eurovision
9301873899660030.000006ke.co.google
9311873876854980.000006tw.org.rti
932187382168990.000033com.elpais
9331873799625740.000014ru.rg
9341873731051390.000007com.defensenews
9351873678647090.000007com.alignable
9361873677222410.000015ru.kommersant
937187363366500.000042com.accenture
9381873612438430.000009tr.com.aa
9391873609410190.000029com.buzzsprout
9401873576425540.000014ru.mos
9411873566231920.000011com.post-gazette
9421873367248670.000007com.revolut
9431873341651070.000007org.siggraph
9441873288221000.000016com.hackerone
9451873267640960.000008uk.ac.core
9461873209271160.000005com.orgfree
9471873143442150.000008org.jenkins-ci
948187313443730.000070mp.mailchi
9491873121047640.000007fr.huffingtonpost
9501872895256030.000006net.zshare
9511872894641910.000008com.encyclopedia
9521872882230200.000011com.devpost
9531872820847920.000007com.iconarchive
9541872816833560.000010com.washingtonexaminer
9551872711055070.000006uk.org.rspb
9561872688422500.000015org.donorbox
957187268769540.000030edu.wisc
9581872655033730.000010org.rferl
9591872637837890.000009nl.wur
9601872590056370.000006jp.riken
9611872570618290.000018com.homeadvisor
9621872555814660.000020org.owasp
9631872521811590.000025com.imrworldwide
96418724578930.000344com.messenger
9651872330028970.000012ru.kp
9661872309639310.000009gov.ustr
967187224248780.000034edu.umich
9681872217633630.000010int.iom
9691872179614080.000021com.sfgate
9701872155427760.000013com.cloudwaysapps
971187214166600.000041com.psychologytoday
9721872127249960.000007org.geogebra
9731872084812630.000023edu.hbs
9741872078643360.000008com.podomatic
9751872050037580.000009ru.avito
9761872017411940.000025com.searchengineland
977187186149270.000032com.wikihow
9781871788053970.000006com.nippon
9791871781046620.000007org.democracynow
980187165045190.000050gov.fda
9811871644467160.000005uk.ac.aber
9821871630442040.000008com.vancouversun
9831871609856460.000006re.cli
9841871562238470.000009edu.sc
9851871454211530.000025to.dev
9861871376610120.000029org.frontiersin
987187133024050.000064com.constantcontact
9881871307642210.000008org.sonatype
9891871153425100.000014com.etonline
990187114185970.000045com.figma
9911871088412070.000024edu.nyu
9921870976636790.000009org.ets
9931870964862960.000006org.sfpl
9941870894624030.000015com.alibabagroup
9951870887447580.000007net.thedailystar
9961870872634770.000010com.bp
9971870866226470.000013ca.citizenlab
9981870834026820.000013com.discogs
9991870735055170.000006com.maxpreps
1000187055183640.000073com.heroku
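
The names in the list above are written in reverse domain name notation, so com.libsyn stands for libsyn.com. A minimal Python sketch for turning such names back into their usual form is shown below; the helper name unreverse_host is made up for this illustration and is not part of the cc-webgraph tools.

```python
def unreverse_host(reversed_name: str) -> str:
    """Turn a reversed name such as 'uk.ac.leeds' back into 'leeds.ac.uk'."""
    # Split on dots, reverse the order of the labels, and join them again.
    return ".".join(reversed(reversed_name.split(".")))


# A few names taken from the list above.
for name in ["com.libsyn", "uk.ac.leeds", "org.scala-sbt"]:
    print(name, "->", unreverse_host(name))
```

Note that the leading www. was stripped when the graph was built, so it cannot be restored this way.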

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope you find the data useful for any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

January/February 2023 crawl archive now available

The crawl archive for January/February 2023 is now available! The data was crawled between January 26 and February 9 and contains 3.15 billion web pages, or 400 TiB of uncompressed content. Page captures come from 40 million hosts or 33 million registered domains and include 1.3 billion new URLs not visited in any of our prior crawls.

Archive Location and Download

The January/February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2023-06/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ gives you the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File Type                 File List                                  #Files  Total Size Compressed (TiB)
Segments                  CC-MAIN-2023-06/segment.paths.gz           100
WARC files                CC-MAIN-2023-06/warc.paths.gz              88000   88.02
WAT files                 CC-MAIN-2023-06/wat.paths.gz               88000   21.72
WET files                 CC-MAIN-2023-06/wet.paths.gz               88000   9.05
Robots.txt files          CC-MAIN-2023-06/robotstxt.paths.gz         88000   0.13
Non-200 responses files   CC-MAIN-2023-06/non200responses.paths.gz   88000   2.04
URL index files           CC-MAIN-2023-06/cc-index.paths.gz          302     0.23
Columnar URL index files  CC-MAIN-2023-06/cc-index-table.paths.gz    900     0.26
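
As a rough illustration of the prefixing step described above, the following Python sketch decompresses a path listing and prepends one of the two prefixes to every line. The function name expand_paths and the sample paths are invented for the example; a real listing would first be downloaded from one of the locations in the table.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def expand_paths(gz_bytes: bytes, prefix: str = HTTP_PREFIX) -> list:
    """Decompress a *.paths.gz listing and prefix every non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as listing:
        return [prefix + line.strip() for line in listing if line.strip()]


# Stand-in for a downloaded listing; these two paths are hypothetical.
sample_listing = gzip.compress(
    b"crawl-data/EXAMPLE/file-00000.warc.gz\n"
    b"crawl-data/EXAMPLE/file-00001.warc.gz\n"
)

for url in expand_paths(sample_listing):
    print(url)
```

Passing S3_PREFIX as the second argument yields S3 paths instead of HTTP URLs.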

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-06/. The columnar index has also been updated to include this crawl.
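
For readers who want to query the URL index programmatically, here is a small Python sketch that only builds a query URL and does not contact the server. It assumes the index server exposes a CDX-style endpoint named after the crawl with an -index suffix and accepts url and output parameters; this post does not spell that out, so please check it against the index server documentation before relying on it. The function name index_query_url is hypothetical.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org/"


def index_query_url(crawl_id: str, url_pattern: str, output: str = "json") -> str:
    """Build a query URL for one crawl (the endpoint layout is an assumption)."""
    # urlencode percent-escapes the pattern so it is safe inside a query string.
    query = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"


print(index_query_url("CC-MAIN-2023-06", "commoncrawl.org"))
```

The resulting URL can then be fetched with any HTTP client.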

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.