March/April 2023 crawl archive now available

The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content. Page captures are from 43 million hosts or 34 million registered domains and include 1.2 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2023-14/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ gives you the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.
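The prefixing step can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name `expand_paths` and the sample path are hypothetical, and the listing is assumed to already be in memory as gzip-compressed bytes.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def expand_paths(listing_bytes, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and prepend the prefix to every line."""
    with gzip.open(io.BytesIO(listing_bytes), mode="rt", encoding="utf-8") as fh:
        # Skip empty lines so a trailing newline does not yield a bare prefix.
        return [prefix + line.strip() for line in fh if line.strip()]


# Hypothetical one-line listing, used only to demonstrate the function.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2023-14/segments/EXAMPLE/warc/example-00000.warc.gz\n"
)
http_urls = expand_paths(sample_listing)
s3_urls = expand_paths(sample_listing, prefix=S3_PREFIX)
```

In practice the bytes would come from downloading one of the listings in the table below, for example warc.paths.gz.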

File type                 File List                                   #Files  Total Size Compressed (TiB)
Segments                  CC-MAIN-2023-14/segment.paths.gz            100
WARC files                CC-MAIN-2023-14/warc.paths.gz               80000   87.95
WAT files                 CC-MAIN-2023-14/wat.paths.gz                80000   21.1
WET files                 CC-MAIN-2023-14/wet.paths.gz                80000   8.74
Robots.txt files          CC-MAIN-2023-14/robotstxt.paths.gz          80000   0.13
Non-200 responses files   CC-MAIN-2023-14/non200responses.paths.gz    80000   2.09
URL index files           CC-MAIN-2023-14/cc-index.paths.gz           302     0.23
Columnar URL index files  CC-MAIN-2023-14/cc-index-table.paths.gz     900     0.25

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-14/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the September/October, November/December 2022 and January/February 2023 crawls. For more information about the data formats and the processing pipeline, please see the announcements of previous webgraph releases. You may also visit the cc-webgraph and cc-pyspark projects which contain all the scripts and tools needed to construct the graphs. Instructions for exploring the graphs in the webgraph format can be found in our collection of webgraph notebooks.

Host-level graph

The graph consists of 325 million nodes and 2.63 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to build the graph. All types of links are included, including purely “technical” links pointing to images, JavaScript libraries, web fonts, etc. However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as host component are not taken into account for building the host-level graph.

There are 268 million dangling nodes (82.7%) and the largest strongly connected component contains 43.1 million (13.3%) nodes. Dangling nodes come from

  • hosts that are not crawled, but are referenced by a link on a crawled page
  • hosts with no links pointing to another hostname
  • or hosts that only returned an error page (e.g. HTTP 404).
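All three cases above leave a host without any outgoing edge. A minimal sketch of counting such nodes, assuming the usual graph definition of a dangling node (out-degree zero) and an edge list of ⟨from_id, to_id⟩ pairs; the function name and the toy data are illustrative only:

```python
def dangling_nodes(num_nodes, edges):
    """Return the ids of nodes that have no outgoing edge."""
    has_outlink = [False] * num_nodes
    for from_id, _to_id in edges:
        has_outlink[from_id] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]


# Toy graph with 4 nodes: node 2 is only linked to, node 3 is isolated.
toy_edges = [(0, 1), (1, 0), (0, 2)]
dangling = dangling_nodes(4, toy_edges)
share = len(dangling) / 4
```

For the toy graph, nodes 2 and 3 are dangling, which is half of all nodes.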

Hostnames in the graph are in reverse domain name notation with the leading www. removed: www.subdomain.example.com becomes com.example.subdomain.
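The notation can be reproduced with a few lines of Python. This is a sketch of the transformation described above, not the code used to build the graph; the function names are hypothetical.

```python
def reverse_host(hostname):
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    if hostname.startswith("www."):
        hostname = hostname[len("www."):]
    return ".".join(reversed(hostname.split(".")))


def unreverse_host(reversed_name):
    """Turn a reversed name back into the usual order (a removed 'www.' is not restored)."""
    return ".".join(reversed(reversed_name.split(".")))


example = reverse_host("www.subdomain.example.com")
restored = unreverse_host(example)
```

Here `example` is `com.example.subdomain`, and `restored` is `subdomain.example.com`.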

You can download the graph and the ranks of all 325 million hosts from AWS S3 at s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ (this requires an account on AWS). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ as a prefix to access the files from anywhere.

Note that the text representation of the host-level graph is delivered in 10 gzip-compressed files listed in two path listings – one for the nodes (vertices), and one for the edges (arcs). First, download the path listing and decompress it with “gzip -d” or “gunzip”. Adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing will give you the list of URLs to download the entire graph.

Download files of the Common Crawl Sep/Nov/Jan 2022-2023 host-level webgraph

Size      File                                                Description
2.34 GB   cc-main-2022-23-sep-nov-jan-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 4 vertices files
11.40 GB  cc-main-2022-23-sep-nov-jan-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 6 edges files
5.51 GB   cc-main-2022-23-sep-nov-jan-host.graph              graph in BVGraph format
2 kB      cc-main-2022-23-sep-nov-jan-host.properties
5.88 GB   cc-main-2022-23-sep-nov-jan-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2022-23-sep-nov-jan-host-t.properties
1 kB      cc-main-2022-23-sep-nov-jan-host.stats              WebGraph statistics
5.56 GB   cc-main-2022-23-sep-nov-jan-host-ranks.txt.gz       harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org. Version (commit) 0bbf864 of the public suffix list was used (commit date 2023-03-08).
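The aggregation idea can be illustrated with a toy example. This is a simplified sketch, not the actual pipeline: `TOY_SUFFIXES` is a hypothetical two-entry stand-in for the public suffix list (the real list also has wildcard and exception rules, which are ignored here), hostnames are written in normal order rather than reversed, and dropping links within the same domain is an assumption made for the example.

```python
from collections import Counter

# Hypothetical stand-in for the public suffix list (assumption for this sketch).
TOY_SUFFIXES = {"com", "co.uk"}


def pay_level_domain(host, suffixes=TOY_SUFFIXES):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):  # i = 0 tests the longest candidate suffix first
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix


def aggregate_edges(host_edges):
    """Map host-level edges to a set of distinct domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_pld, dst_pld = pay_level_domain(src), pay_level_domain(dst)
        # Assumption: links within one domain are not kept as edges.
        if src_pld and dst_pld and src_pld != dst_pld:
            domain_edges.add((src_pld, dst_pld))
    return domain_edges


hosts = ["blog.example.co.uk", "shop.example.co.uk", "cdn.example.com"]
host_edges = [
    ("blog.example.co.uk", "cdn.example.com"),
    ("shop.example.co.uk", "cdn.example.com"),
    ("blog.example.co.uk", "shop.example.co.uk"),
]
hosts_per_domain = Counter(pay_level_domain(h) for h in hosts)
domain_edges = aggregate_edges(host_edges)
```

The three host-level edges collapse into a single domain-level edge from example.co.uk to example.com, and `hosts_per_domain` corresponds to the "num hosts" column of the domain vertices file.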

The domain-level graph has 88 million nodes and 1.68 billion edges. 46 million nodes (52%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).

All domain graph files are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/.

Download files of the Common Crawl Sep/Nov/Jan 2022-2023 domain-level webgraph

Size     File                                               Description
0.61 GB  cc-main-2022-23-sep-nov-jan-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
6.89 GB  cc-main-2022-23-sep-nov-jan-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
3.90 GB  cc-main-2022-23-sep-nov-jan-domain.graph            graph in BVGraph format
2 kB     cc-main-2022-23-sep-nov-jan-domain.properties
3.81 GB  cc-main-2022-23-sep-nov-jan-domain-t.graph          transpose of the graph
2 kB     cc-main-2022-23-sep-nov-jan-domain-t.properties
1 kB     cc-main-2022-23-sep-nov-jan-domain.stats            WebGraph statistics
1.90 GB  cc-main-2022-23-sep-nov-jan-domain-ranks.txt.gz     harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 88 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Sep/Nov/Jan 2022-2023)

HC rank  HC value  PR rank  PR value  Reversed domain name
1        30208264  1        0.016763  com.googleapis
2        29783728  3        0.010997  com.facebook
3        29384286  2        0.015692  com.google
4        26032564  6        0.005934  com.youtube
5        25805292  5        0.006482  com.twitter
6        25487994  8        0.005484  com.instagram
7        25225590  7        0.005863  org.w
8        25002416  4        0.007149  com.googletagmanager
9        23952870  9        0.004622  org.gmpg
10       23337268  12       0.003349  com.linkedin
11       23278606  10       0.004162  com.gstatic
12       22405388  15       0.002066  com.gravatar
13       22178872  11       0.003793  com.cloudflare
14       21844608  13       0.002400  org.wordpress
15       21681482  25       0.001450  com.pinterest
16       21559576  32       0.001217  org.wikipedia
17       21429174  17       0.001813  com.apple
18       21227448  31       0.001226  com.wordpress
19       21216668  34       0.001088  com.vimeo
20       21195402  39       0.000900  be.youtu
21       20932166  14       0.002311  net.cloudfront
22       20898430  18       0.001711  com.bootstrapcdn
23       20823002  35       0.001080  com.microsoft
24       20748850  48       0.000709  com.amazon
25       20741894  55       0.000581  com.blogspot
26       20704246  26       0.001442  com.jquery
27       20677290  47       0.000780  gl.goo
28       20670994  43       0.000852  com.amazonaws
29       20665338  16       0.001840  io.polyfill
30       20631324  44       0.000844  eu.europa
31       20618144  50       0.000696  com.wp
32       20564226  29       0.001298  net.jsdelivr
33       20528750  45       0.000812  org.mozilla
34       20516286  28       0.001369  com.wixstatic
35       20513258  65       0.000505  ly.bit
36       20484072  23       0.001470  com.adobe
37       20443722  21       0.001544  com.fontawesome
38       20411612  40       0.000896  com.google-analytics
39       20400328  22       0.001485  com.github
40       20349558  54       0.000635  com.paypal
41       20346992  19       0.001674  com.googleusercontent
42       20328172  30       0.001297  com.whatsapp
43       20292300  103      0.000334  com.tumblr
44       20192382  33       0.001113  ru.yandex
45       20186350  85       0.000378  com.medium
46       20183944  123      0.000299  com.reddit
47       20180914  108      0.000331  com.yahoo
48       20178722  62       0.000529  com.shopify
49       20170012  58       0.000564  com.flickr
50       20167766  69       0.000483  io.github
51       20163916  63       0.000522  co.t
52       20152240  128      0.000260  com.nytimes
53       20124068  125      0.000293  com.spotify
54       20104136  36       0.001066  com.baidu
55       20098412  59       0.000553  org.w3
56       20094912  37       0.001000  com.qq
57       20016100  57       0.000569  com.vk
58       20005808  41       0.000881  com.googlesyndication
59       20003706  114      0.000313  com.weebly
60       19975298  158      0.000186  com.forbes
61       19967874  122      0.000302  org.creativecommons
62       19967726  130      0.000251  com.soundcloud
63       19962190  142      0.000211  org.archive
64       19942482  150      0.000198  gov.nih
65       19918748  126      0.000264  com.tiktok
66       19915782  61       0.000538  org.schema
67       19912766  60       0.000544  com.unpkg
68       19911234  181      0.000153  com.bing
69       19911002  249      0.000107  com.imdb
70       19905552  264      0.000100  edu.harvard
71       19902624  80       0.000387  me.t
72       19901168  196      0.000137  org.wikimedia
73       19873990  163      0.000174  com.dropbox
74       19872264  288      0.000091  net.slideshare
75       19869992  192      0.000142  int.who
76       19862262  203      0.000131  com.cnn
77       19862142  174      0.000157  gov.cdc
78       19858886  214      0.000123  com.theguardian
79       19847642  184      0.000151  com.unsplash
80       19845106  121      0.000307  com.list-manage
81       19831690  272      0.000097  net.researchgate
82       19819188  253      0.000105  com.wsj
83       19816030  293      0.000090  com.bbc
84       19812290  64       0.000507  com.macromedia
85       19810452  297      0.000089  uk.co.bbc
86       19805166  304      0.000087  com.reuters
87       19788972  246      0.000107  com.washingtonpost
88       19786314  279      0.000094  com.statista
89       19775652  318      0.000083  edu.stanford
90       19775346  432      0.000060  gov.nasa
91       19774590  67       0.000497  com.addthis
92       19770972  46       0.000807  net.doubleclick
93       19769890  151      0.000195  com.wixsite
94       19765714  227      0.000117  com.businessinsider
95       19763270  215      0.000123  com.imgur
96       19754014  367      0.000072  com.go
97       19750586  204      0.000129  com.live
98       19746004  107      0.000332  com.wix
99       19745530  372      0.000070  com.wired
100      19738366  152      0.000195  us.zoom
101      19723422  135      0.000225  gle.forms
102      19718148  225      0.000118  com.etsy
103      19717824  277      0.000095  com.ibm
104      19700092  447      0.000058  com.theverge
105      19699788  356      0.000074  com.nature
106      19696656  120      0.000309  me.wp
107      19693524  176      0.000156  org.ietf
108      19687268  49       0.000700  com.fb
109      19686956  157      0.000187  com.ytimg
110      19685800  492      0.000053  com.msn
111      19685436  315      0.000084  com.android
112      19684204  316      0.000084  com.cnbc
113      19684024  140      0.000214  com.mailchimp
114      19682468  42       0.000858  net.fbcdn
115      19682198  255      0.000104  com.stackoverflow
116      19681202  511      0.000051  edu.berkeley
117      19679968  275      0.000096  org.un
118      19679582  265      0.000100  com.bloomberg
119      19677792  220      0.000121  com.outlook
120      19673502  178      0.000155  org.apache
121      19666706  269      0.000098  com.oracle
122      19658542  382      0.000069  com.example
123      19650660  343      0.000077  org.npr
124      19649560  350      0.000075  com.quora
125      19649030  129      0.000254  com.youtube-nocookie
126      19646800  601      0.000045  com.zdnet
127      19646398  99       0.000336  com.giphy
128      19644128  195      0.000140  com.hubspot
129      19643752  219      0.000122  org.doi
130      19632490  542      0.000048  com.myspace
131      19630886  222      0.000120  gov.ca
132      19618726  395      0.000067  com.time
133      19612954  359      0.000074  com.slack
134      19610836  70       0.000455  com.ft
135      19610350  462      0.000056  com.appspot
136      19606234  139      0.000217  com.opera
137      19605258  261      0.000102  com.sciencedirect
138      19604780  476      0.000054  com.ted
139      19604180  342      0.000077  com.springer
140      19602362  401      0.000065  org.arxiv
141      19598276  393      0.000067  com.usatoday
142      19594554  194      0.000140  com.issuu
143      19588288  332      0.000080  org.acm
144      19585042  205      0.000129  com.npmjs
145      19581664  331      0.000080  uk.co.amazon
146      19577244  307      0.000086  com.githubusercontent
147      19576784  242      0.000108  com.blogger
148      19576054  311      0.000086  com.wiley
149      19574732  355      0.000074  com.pexels
150      19570782  468      0.000055  edu.cornell
151      19569412  482      0.000054  com.theatlantic
152      19567006  440      0.000059  org.python
153      19564494  593      0.000045  org.worldbank
154      19563390  565      0.000047  uk.co.telegraph
155      19563008  732      0.000037  edu.psu
156      19553204  428      0.000060  com.cnet
157      19539618  149      0.000199  org.ampproject
158      19537568  623      0.000043  org.weforum
159      19537332  296      0.000089  uk.gov
160      19533658  444      0.000059  com.huffingtonpost
161      19532878  490      0.000053  com.latimes
162      19528578  645      0.000042  org.unesco
163      19527222  655      0.000041  com.livejournal
164      19526534  449      0.000058  com.pixabay
165      19508444  411      0.000063  com.sagepub
166      19506694  480      0.000054  com.goodreads
167      19506188  499      0.000052  uk.co.google
168      19503402  258      0.000103  net.behance
169      19503138  270      0.000097  com.bandcamp
170      19503070  727      0.000037  org.chromium
171      19495988  525      0.000050  com.cbsnews
172      19495180  326      0.000081  ee.linktr
173      19493452  527      0.000049  edu.yale
174      19489730  286      0.000092  com.w3schools
175      19483130  175      0.000157  com.yelp
176      19482412  299      0.000089  edu.mit
177      19481882  68       0.000496  com.googleadservices
178      19480090  78       0.000393  me.wa
179      19477430  528      0.000049  uk.co.independent
180      19474080  134      0.000228  com.statcounter
181      19472938  300      0.000089  com.tinyurl
182      19472136  509      0.000051  com.fortune
183      19471318  714      0.000038  edu.columbia
184      19471218  690      0.000039  com.vox
185      19461296  422      0.000061  gov.whitehouse
186      19460546  416      0.000062  org.nodejs
187      19457810  512      0.000051  uk.co.dailymail
188      19455462  537      0.000049  com.indiatimes
189      19452626  349      0.000075  com.businesswire
190      19451618  290      0.000090  org.pewresearch
191      19450276  559      0.000047  edu.cmu
192      19442676  643      0.000042  com.marketwatch
193      19436888  494      0.000052  com.tandfonline
194      19431628  598      0.000045  org.pbs
195      19429220  657      0.000041  com.usnews
196      19428984  777      0.000035  edu.upenn
197      19428100  278      0.000095  com.twimg
198      19421114  683      0.000039  com.buzzfeed
199      19420594  590      0.000045  gov.loc
200      19420022  504      0.000052  com.fc2
201      19419582  674      0.000040  com.git-scm
202      19418554  901      0.000033  com.qz
203      19416166  885      0.000034  edu.washington
204      19415650  778      0.000035  com.trello
205      19415444  675      0.000040  com.apnews
206      19411912  1121     0.000026  com.techradar
207      19411736  517      0.000050  com.investopedia
208      19410610  652      0.000041  com.mysql
209      19408094  138      0.000218  info.aboutads
210      19407400  880      0.000034  me.about
211      19403682  276      0.000096  org.gnu
212      19403122  736      0.000037  com.economist
213      19402862  677      0.000040  com.box
214      19397500  713      0.000038  com.scribd
215      19396454  1276     0.000023  com.techrepublic
216      19395758  370      0.000071  com.gitlab
217      19395036  489      0.000053  com.walmart
218      19394568  348      0.000075  com.techcrunch
219      19388888  406      0.000064  co.ibb
220      19388054  487      0.000053  com.nationalgeographic
221      19387258  734      0.000037  com.venturebeat
222      19386248  541      0.000048  com.inc
223      19385564  165      0.000171  com.staticflickr
224      19385242  141      0.000211  me.line
225      19382874  534      0.000049  com.theconversation
226      19380928  510      0.000051  com.nbcnews
227      19380802  582      0.000046  com.digg
228      19379752  1117     0.000026  edu.northwestern
229      19378952  787      0.000035  org.semver
230      19377466  1002     0.000029  edu.jhu
231      19377050  737      0.000036  ca.cbc
232      19376942  539      0.000049  com.googleblog
233      19375096  1397     0.000021  edu.rutgers
234      19371812  647      0.000042  com.photobucket
235      19370806  1099     0.000027  edu.usc
236      19369504  600      0.000045  gov.senate
237      19367834  198      0.000134  com.calendly
238      19367222  231      0.000116  net.windows
239      19365020  1262     0.000023  org.kernel
240      19359670  1287     0.000023  co.elastic
241      19358788  808      0.000035  com.shutterstock
242      19358402  691      0.000039  org.cambridge
243      19356378  1091     0.000027  fm.last
244      19355798  321      0.000082  tv.twitch
245      19355788  187      0.000147  page.g
246      19355748  719      0.000037  com.newyorker
247      19354116  748      0.000036  org.bitbucket
248      19345608  566      0.000047  com.oup
249      19344142  930      0.000031  org.sciencemag
250      19343430  250      0.000107  com.jotform
251      19340588  191      0.000143  com.cloudinary
252      19340216  538      0.000049  org.unicef
253      19339906  1020     0.000029  edu.princeton
254      19339090  588      0.000045  io.codepen
255      19338824  982      0.000030  gov.usgs
256      19338794  879      0.000034  uk.ac.ox
257      19336670  562      0.000047  com.xinhuanet
258      19336118  1184     0.000025  com.alexa
259      19333656  515      0.000051  org.js
260      19331480  864      0.000034  edu.asu
261      19329334  884      0.000034  com.nvidia
262      19328550  1187     0.000025  com.mediafire
263      19325282  221      0.000121  net.sourceforge
264      19322430  1433     0.000021  com.euronews
265      19321422  335      0.000079  com.prnewswire
266      19320932  937      0.000031  com.foxnews
267      19320900  1083     0.000027  gov.fbi
268      19314904  1174     0.000025  org.coursera
269      19313470  627      0.000043  com.biomedcentral
270      19312726  1367     0.000022  com.500px
271      19310966  88       0.000374  com.stripe
272      19308392  237      0.000112  com.tripadvisor
273      19306794  186      0.000148  com.xing
274      19306438  248      0.000107  com.wpengine
275      19306070  154      0.000191  com.sharethis
276      19304476  929      0.000031  com.nypost
277      19301278  740      0.000036  com.politico
278      19297772  325      0.000081  com.automattic
279      19297516  1450     0.000020  com.digitaltrends
280      19297208  1072     0.000027  co.g
281      19297054  1058     0.000028  org.pnas
282      19295368  1142     0.000026  com.axios
283      19295228  410      0.000063  org.openstreetmap
284      19293638  1191     0.000025  uk.co.guardian
285      19293548  1111     0.000026  com.scmp
286      19290228  595      0.000045  ca.canada
287      19289660  1431     0.000021  org.eclipse
288      19288644  1261     0.000023  uk.co.blogspot
289      19288300  678      0.000040  com.huffpost
290      19286476  1322     0.000022  org.semanticscholar
291      19284068  637      0.000042  gov.census
292      19283194  469      0.000055  gov.usda
293      19282746  66       0.000504  com.trustpilot
294      19281282  765      0.000036  com.hp
295      19281216  556      0.000048  io.readthedocs
296      19276912  1240     0.000023  com.nymag
297      19276304  256      0.000104  org.iana
298      19276166  870      0.000034  com.ssrn
299      19276092  962      0.000030  edu.umn
300      19274986  993      0.000029  au.net.abc
301      19273958  1448     0.000020  com.sky
302      19273816  1211     0.000024  edu.purdue
303      19273324  625      0.000043  org.apa
304      19272362  217      0.000122  com.eventbrite
305      19271512  167      0.000167  gov.privacyshield
306      19271404  1064     0.000028  com.about
307      19271058  340      0.000078  com.dribbble
308      19270226  583      0.000046  gov.house
309      19269684  907      0.000033  com.sciencedaily
310      19269194  571      0.000047  gov.noaa
311      19269034  445      0.000059  com.arcgis
312      19267554  363      0.000073  com.feedburner
313      19266714  591      0.000045  fm.anchor
314      19266698  1533     0.000020  ru.spb
315      19264280  567      0.000047  site.business
316      19264260  1308     0.000022  com.nikkei
317      19261166  1030     0.000029  com.ggpht
318      19259980  868      0.000034  org.change
319      19257436  1053     0.000028  com.evernote
320      19256818  1360     0.000022  edu.illinois
321      19255628  162      0.000174  com.office
322      19255582  1343     0.000022  org.postgresql
323      19252970  882      0.000034  org.pypi
324      19252962  581      0.000046  com.163
325      19251350  574      0.000047  com.dailymotion
326      19248956  1617     0.000019  org.aclu
327      19247890  1051     0.000028  edu.uchicago
328      19245190  953      0.000030  com.mdpi
329      19245010  889      0.000034  de.spiegel
330      19243662  496      0.000052  gov.hhs
331      19242990  400      0.000065  com.indeed
332      19241006  696      0.000039  gov.justice
333      19239998  621      0.000043  gov.state
334      19237584  1665     0.000019  org.greenpeace
335      19236460  106      0.000333  net.jsfiddle
336      19233688  789      0.000035  gov.congress
337      19233144  689      0.000039  com.bigcartel
338      19231252  2010     0.000017  edu.gatech
339      19230230  484      0.000054  gov.epa
340      19229306  1923     0.000017  com.openai
341      19228750  619      0.000043  org.ohchr
342      19227578  893      0.000033  org.fao
343      19227162  390      0.000067  com.atlassian
344      19227150  1812     0.000018  org.science
345      19224842  768      0.000035  com.jetbrains
346      19223786  1223     0.000024  com.foursquare
347      19223692  454      0.000057  com.squareup
348      19220130  124      0.000297  com.alicdn
349      19219200  1996     0.000017  org.phys
350      19218212  1090     0.000027  cn.com.chinadaily
351      19217284  339      0.000078  com.ebay
352      19216816  347      0.000075  com.surveymonkey
353      19216726  883      0.000034  com.chrome
354      19216716  1289     0.000023  uk.co.thetimes
355      19216348  260      0.000102  com.webflow
356      19216090  2023     0.000017  com.foxbusiness
357      19213972  913      0.000032  app.netlify
358      19213822  385      0.000068  com.disqus
359      19212532  1229     0.000024  com.hollywoodreporter
360      19211550  766      0.000036  gov.archives
361      19210746  423      0.000061  com.getpocket
362      19210526  396      0.000066  com.samsung
363      19207318  371      0.000071  com.proofpoint
364      19206792  1119     0.000026  edu.utexas
365      19206132  132      0.000248  com.zendesk
366      19204834  397      0.000066  com.substack
367      19202152  702      0.000038  com.mashable
368      19201060  1081     0.000027  org.jstor
369      19200068  649      0.000042  net.azurewebsites
370      19198950  200      0.000133  org.allaboutcookies
371      19198558  421      0.000062  com.freepik
372      19197812  529      0.000049  com.netdna-ssl
373      19196780  547      0.000048  com.snapchat
374      19194372  679      0.000040  com.gumroad
375      19194238  144      0.000206  com.paypalobjects
376      19193934  76       0.000397  me.fb
377      19192814  526      0.000050  ch.admin
378      19191374  863      0.000034  com.pinimg
379      19191250  743      0.000036  com.britannica
380      19191082  1489     0.000020  au.com.smh
381      19190906  704      0.000038  com.vice
382      19189032  554      0.000048  gov.copyright
383      19183662  1157     0.000025  com.dw
384      19183626  345      0.000076  net.themeforest
385      19182584  452      0.000058  com.patreon
386      19182178  1235     0.000024  uk.co.mirror
387      19176944  1363     0.000022  de.sueddeutsche
388      19173466  1014     0.000029  uk.ac.cam
389      19172198  333      0.000080  fr.cnil
390      19169612  3327     0.000010  google.research
391      19167346  784      0.000035  cc.postimg
392      19165170  609      0.000044  gov.nist
393      19163734  1493     0.000020  ca.sfu
394      19162676  507      0.000051  com.gmail
395      19162016  2342     0.000015  com.martinfowler
396      19160966  1297     0.000023  org.imf
397      19160094  983      0.000030  edu.si
398      19158666  709      0.000038  org.oecd
399      19157194  402      0.000064  ru.gov
400      19156178  1059     0.000028  com.chicagotribune
401      19154640  1257     0.000023  com.crunchbase
402      19154290  404      0.000064  com.optimizely
403      19153126  75       0.000404  net.akamaihd
404      19152672  1005     0.000029  com.intuit
405      19151200  156      0.000188  org.networkadvertising
406      19150938  2139     0.000016  app.web
407      19150100  947      0.000031  com.history
408      19147258  2498     0.000014  com.ibtimes
409      19146882  161      0.000174  com.rawgit
410      19146734  284      0.000093  net.azureedge
411      19146432  478      0.000054  nl.google
412      19144122  475      0.000055  com.meetup
413      19143658  2728     0.000013  com.cbs
414      19143614  1549     0.000019  org.unhcr
415      19143010  119      0.000312  de.google
416      19140470  977      0.000030  com.sap
417      19140308  577      0.000047  com.kickstarter
418      19139232  431      0.000060  com.media-amazon
419      19138468  1340     0.000022  com.aljazeera
420      19138260  346      0.000076  net.php
421      19136610  1894     0.000018  com.straitstimes
422      19135416  52       0.000645  com.godaddy
423      19134718  1103     0.000027  com.insider
424      19134162  1017     0.000029  gov.treasury
425      19133204  1601     0.000019  us.imageshack
426      19132232  1269     0.000023  org.sphinx-doc
427      19131826  1073     0.000027  link.page
428      19131230  508      0.000051  cn.com.people
429      19130856  1457     0.000020  de.mpg
430      19129168  514      0.000051  org.debian
431      19124870  2597     0.000014  au.com.news
432      19123686  267      0.000098  jp.co.yahoo
433      19122658  336      0.000078  com.typepad
434      19121732  73       0.000429  com.wsimg
435      19121444  941      0.000031  com.podbean
436      19121320  1066     0.000028  uk.gov.service
437      19121300  235      0.000113  gg.discord
438      19120430  1490     0.000020  com.over-blog
439      19119364  306      0.000086  com.eepurl
440      19119274  636      0.000042  gov.usa
441      19118634  633      0.000043  com.stumbleupon
442      19116958  354      0.000074  org.hbr
443      19114264  1789     0.000019  ms.1drv
444      19114220  673      0.000040  google.blog
445      19112010  1796     0.000019  com.buzzfeednews
446      19110954  1093     0.000027  org.ilo
447      19110296  1962     0.000017  com.mystrikingly
448      19108200  90       0.000367  net.facebook
449      19107938  1446     0.000021  de.zeit
450      19107838  471      0.000055  com.tripod
451      19106728  1065     0.000028  int.coe
452      19106358  1389     0.000021  com.teachable
453      19106154  1136     0.000026  com.thehill
454      19106002  56       0.000570  net.typekit
455      19105858  2051     0.000016  uk.co.standard
456      19103522  2071     0.000016  com.newscientist
457      19102398  2527     0.000014  com.channel4
458      19102058  2631     0.000013  com.storify
459      19098660  1421     0.000021  edu.duke
460      19094500  596      0.000045  com.healthline
461      19093638  1302     0.000023  au.gov.nsw
462      19091574  2525     0.000014  org.maven
463      19090438  622      0.000043  org.worldwildlife
464      19086786  1022     0.000029  com.brightcove
465      19086756  1541     0.000020  int.unfccc
466      19086616  896      0.000033  com.withgoogle
467      19085682  292      0.000090  com.squarespace
468      19085616  2578     0.000014  com.instructables
469      19084862  2082     0.000016  com.rt
470      19084252  2430     0.000015  org.tensorflow
471      19083306  263      0.000101  me.telegram
472      19081082  638      0.000042  com.cisco
473      19080766  809      0.000035  watch.fb
474      19079900  465      0.000056  com.steampowered
475      19079012  760      0.000036  com.deviantart
476      19074242  603      0.000044  com.googlecode
477      19072502  1813     0.000018  uk.parliament
478      19070624  572      0.000047  com.airbnb
479      19068360  543      0.000048  com.matterport
480      19067152  2020     0.000017  org.edx
481      19063090  2909     0.000012  com.dreamstime
482      19063038  1875     0.000018  com.googlesource
483      19062408  1144     0.000026  com.dell
484      19061850  2934     0.000012  me.ogp
485      19059726  1131     0.000026  org.hrw
486      19059152  2058     0.000016  edu.cuny
487      19056892  436      0.000059  com.elsevier
488      19056600  961      0.000030  gov.dhs
489      19055372  1086     0.000027  com.bostonglobe
490      19053316  283      0.000093  com.salesforce
491      19051650  2876     0.000012  org.icrc
492      19050700  1479     0.000020  gov.defense
493      19049286  169      0.000163  com.discord
494      19048846  2672     0.000013  cc.arduino
495      19048696  172      0.000160  com.addtoany
496      19048338  2122     0.000016  com.padlet
497      19047634  2187     0.000016  uk.co.thesun
498      19047204  1933     0.000017  edu.georgetown
499      19046984  563      0.000047  com.deloitte
500      19044280  2026     0.000017  ca.blogspot
501      19044124  2180     0.000016  edu.ucsb
502      19040766  2965     0.000012  org.wikibooks
503      19040244  2116     0.000016  edu.tufts
504      19039882  251      0.000107  org.bbb
505      19037044  3519     0.000010  com.deepmind
506      19036140  362      0.000073  net.secureservercdn
507      19035692  3370     0.000010  google.ai
508      19032558  1436     0.000021  eu.politico
509      19032004  1965     0.000017  edu.wustl
510      19031454  1040     0.000028  com.istockphoto
511      19029618  761      0.000036  com.thinkwithgoogle
512      19029288  3052     0.000011  com.diigo
513      19029204  2945     0.000012  com.snap
514      19027680  1181     0.000025  us.icio
515      19027416  1447     0.000020  ch.ipcc
516      19026814  148      0.000199  com.jimdo
517      19024796  2119     0.000016  com.france24
518      19024010  3548     0.000010  com.ulule
519      19021632  1138     0.000026  com.arstechnica
520      19020080  2471     0.000014  com.instructure
521      19019232  960      0.000030  edu.brookings
522      19019042  2409     0.000015  edu.caltech
523      19018294  252      0.000105  com.aliyuncs
524      19018192  24       0.001461  cn.gov.miit
525      19018120  667      0.000041  ca.amazon
526      19016070  1403     0.000021  org.rfc-editor
527      19013594  1151     0.000026  com.verizon
528      19009160  273      0.000096  me.m
529      19007720  289      0.000091  ru.ok
530      19006594  1212     0.000024  uk.nhs
531      19002870  708      0.000038  com.intel
532      19001948  1983     0.000017  gov.lbl
533      19001690  2411     0.000015  ru.kremlin
534      18999232  2443     0.000015  edu.oregonstate
535      18998558  426      0.000060  com.fastcompany
536      18998428  505      0.000052  com.ssl-images-amazon
537      18997736  2873     0.000012  fr.archives-ouvertes
538      18995582  2089     0.000016  org.archlinux
539      18995234  576      0.000047  com.wufoo
540      18994966  1422     0.000021  com.people
541      18994790  2484     0.000014  gov.cia
542      18994448  2564     0.000014  tl.we
543      18994368  2249     0.000015  org.unwomen
544      18993384  2942     0.000012  com.kaggle
545      18991740  2808     0.000012  com.aboutamazon
546      18991620  4094     0.000008  com.sho
547      18991158  313      0.000085  gov.ftc
548      18991106  659      0.000041  com.docker
549      18990118  728      0.000037  com.zoominfo
550      18987498  3417     0.000010  com.pearltrees
551      18984800  2096     0.000016  io.gitlab
552      18984398  3046     0.000011  org.scala-lang
553      18983188  317      0.000083  com.typeform
554      18981496  1956     0.000017  com.asahi
555      18979704  368      0.000071  net.imgix
556      18978984  168      0.000164  com.youronlinechoices
557      18977728  738      0.000036  com.symantec
558      18977396  2904     0.000012  jp.co.japantimes
559      18975970  1361     0.000022  com.buymeacoffee
560      18975394  1589     0.000019  com.justia
561      18974980  2786     0.000012  uk.co.huffingtonpost
562      18974464  552      0.000048  com.gartner
563      18974226  2928     0.000012  jp.ac.u-tokyo
564      18973684  398      0.000065  com.force
565      18971782  2893     0.000012  no.nrk
566      18969492  3113     0.000011  cc.taplink
567      18967596  1330     0.000022  org.amnesty
568      18967562  2090     0.000016  com.thestar
569      18966190  2353     0.000015  tv.ustream
570      18965606  3480     0.000010  tv.blip
571      18964426  2212     0.000016  com.peatix
572      18961114  755      0.000036  com.redhat
573      18960518  1877     0.000018  com.firebaseapp
574      18960296  2075     0.000016  com.flipboard
575      18959466  989      0.000029  com.stackexchange
576      18959354  446      0.000058  com.herokuapp
577      18958890  495      0.000052  com.campaign-archive
578      18958728  1986     0.000017  org.nber
579      18958182  1172     0.000025  com.ecwid
580      18957588  2719     0.000013  hk.com.google
581      18957146  2787     0.000012  blog.home
582      18957046  2917     0.000012  com.rakuten
583      18956066  2639     0.000013  org.biorxiv
584      18955174  1123     0.000026  gov.wa
585      18953020  558      0.000048  com.netflix
586      18952656  2681     0.000013  com.gamespot
587      18950358  555      0.000048  com.canva
588      18947616  3041     0.000011  org.rsf
589      18947462  435      0.000059  com.mckinsey
590      18945824  1952     0.000017  com.reverbnation
591      18944198  1271     0.000023  net.clickbank
592      18944066  319      0.000082  jp.co.amazon
593      18941406  2109     0.000016  com.jimdosite
594      18939532  3017     0.000011  com.self
595      18938332  201      0.000132  ru.mail
596      18937396  1942     0.000017  gov.eia
597      18937228  3073     0.000011  org.oas
598      18936846  656      0.000041  com.iheart
599      18935892  2163     0.000016  com.haaretz
600      18935478  3045     0.000011  edu.syr
601      18935194  978      0.000030  com.icons8
602      18934738  243      0.000108  to.amzn
603      18934162  2562     0.000014  org.computer
604      18933548  1460     0.000020  site.notion
605      18929554  651      0.000042  org.iso
606      18929436  1837     0.000018  com.livescience
607      18928992  2618     0.000013  com.infogram
608      18928636  2155     0.000016  gov.usembassy
609      18928430  1348     0.000022  com.mapquest
610      18928262  3111     0.000011  com.tutorialspoint
611      18928028  665      0.000041  com.qualtrics
612      18927130  2022     0.000017  cn.gov.fmprc
613      18926910  697      0.000039  org.ieee
614      18926652  1078     0.000027  com.pcmag
615      18926588  2435     0.000015  com.popsugar
616      18925646  2157     0.000016  com.iconfinder
617      18925600  688      0.000039  com.entrepreneur
618      18925436  344      0.000077  com.visualstudio
619      18925272  1032     0.000028  com.dropboxusercontent
620      18924720  2863     0.000012  it.scoop
621      18922978  2767     0.000013  com.pbworks
622      18921774  2043     0.000016  ph.telegra
623      18920704  957      0.000030  me.onelink
624      18919802  3326     0.000010  org.grist
625      18917230  2833     0.000012  com.fineartamerica
626      18917112  2939     0.000012  au.edu.unimelb
627      18916394  1800     0.000019  mil.army
628      18914670  4477     0.000007  com.mail
629      18914274  3514     0.000010  com.afp
630      18914002  1345     0.000022  org.consumerreports
631      18911486  3760     0.000009  net.docdroid
632      18911096  862      0.000034  com.oreilly
633      18910022  2929     0.000012  com.novell
634      18908850  981      0.000030  org.mediawiki
635      18908110  1887     0.000018  com.bol
636      18908096  2755     0.000013  com.gq
637      18903732  1352     0.000022  com.maxmind
638      18903710  1241     0.000023  com.licdn
639      18903220  1791     0.000019  gov.cancer
640      18902968  2951     0.000012  com.eonline
641      18902644  3186     0.000011  com.theonion
642      18902286  2697     0.000013  net.openid
643      18901832  2060     0.000016  com.dictionary
644      18901570  2523     0.000014  com.foreignpolicy
645      18900276  2555     0.000014  org.c-span
646      18898920  506      0.000052  net.fastly
647      18898588  1941     0.000017  edu.tamu
648      18897290  699      0.000039  int.wipo
649      18896958  1101     0.000027  com.merriam-webster
650      18896876  3366     0.000010  org.freedomhouse
651      18896364  111      0.000328  com.livestream
652      18896362  1832     0.000018  com.verywellmind
653      18895668  379      0.000069  jp.ameblo
654      18894814  979      0.000030  com.forrester
655      18893636  1550     0.000019  com.wikia
656      18893130  2061     0.000016  org.unep
657      18892602  2340     0.000015  com.patch
658      18892586  159      0.000184  com.weibo
659      18890690  548      0.000048  com.sxsw
660      18889804  2722     0.000013  com.motherjones
661      18889768  1254     0.000023  com.jekyllrb
662      18889752  874      0.000034  gov.federalregister
663      18889268  3930     0.000009  com.instapaper
664      18888650  2844     0.000012  com.thecut
665      18887192  1107     0.000027  net.authorize
666      18886308  2493     0.000014  edu.gwu
667      18885348  3225     0.000010  org.csis
668      18884650  2514     0.000014  gov.ky
669      18884342  2709     0.000013  com.theintercept
670      18883802  3213     0.000011  ua.com.google
671      18883552  1530     0.000020  com.snopes
672      18883178  3655     0.000009  au.com.businessinsider
673      18883096  5172     0.000007  com.ixbt
674      18882438  2757     0.000013  org.fas
675      18882418  1070     0.000027  com.tableau
676      18882310  1462     0.000020  gov.uscourts
677      18881746  3551     0.000010  com.teacherspayteachers
678      18880066  664      0.000041  gov.sec
679      18877954  550      0.000048  com.scorecardresearch
680      18877136  2126     0.000016  org.ncsl
681      18877074  3975     0.000009  org.cpj
682      18876192  230      0.000117  com.naver
683      18875692  2680     0.000013  uk.ac.imperial
684      18875654  1205     0.000024  com.findlaw
685      18874462  2829     0.000012  si.gov
686      18874298  1113     0.000026  edu.ucla
687      18873866  2380     0.000015  com.voanews
688      18871112  3856     0.000009  org.edublogs
689      18868180  2869     0.000012  org.marketplace
690      18867852  3725     0.000009  net.aljazeera
691      18867276  2376     0.000015  com.channelnewsasia
692      18867194  723      0.000037  org.plos
693      18866674  1041     0.000028  net.atlassian
694      18866468  4110     0.000008  edu.ua
695      18866372  1210     0.000024  gov.uspto
696      18864712  2418     0.000015  com.goodhousekeeping
697      18863308  1328     0.000022  org.altervista
698      18862148  1867     0.000018  com.billboard
699      18861664  1686     0.000019  gov.govinfo
700      18860214  2238     0.000015  ru.ria
701      18860032  2679     0.000013  com.nationalpost
702      18859750  4921     0.000007  com.viki
703      18859350  3294     0.000010  com.hm
704      18858510  2867     0.000012  com.treehugger
705      18857788  932      0.000031  com.termsfeed
706      18857590  3457     0.000010  ru.interfax
707      18857080  1251     0.000023  com.squarespace-cdn
708      18856272  2966     0.000012  com.sandiegouniontribune
709      18856050  1506     0.000020  io.termly
710      18854390  4288     0.000008  com.dailycaller
711      18853340  2758     0.000013  com.html5rocks
712      18850964  3377     0.000010  is.archive
713      18850298  3057     0.000011  com.nextdoor
714      18849796  3786     0.000009  me.site123
715      18848064  1338     0.000022  org.mitre
716      18847026  5985     0.000006  com.fanpop
717      18846742  2599     0.000014  org.pewtrusts
718      18846728  3574     0.000009  org.britishcouncil
719      18846392  197      0.000136  com.caniuse
720      18845760  2369     0.000015  va.vatican
721      18845434  259      0.000102  com.getbootstrap
722      18843938  3714     0.000009  com.worldpopulationreview
723      18843390  530      0.000049  com.adweek
724      18842378  2234     0.000016  gov.oregon
725      18842176  906      0.000033  com.digitaloceanspaces
726      18841406  3644     0.000009  org.transparency
727      18841194  1324     0.000022  com.windows
728      18841082  3247     0.000010  com.tomsguide
729      18840980  535      0.000049  com.gofundme
730      18839198  2875     0.000012  org.unfpa
731      18838976  971      0.000030  com.imageshack
732      18838954  3992     0.000008  com.metacritic
733      18838084  4023     0.000008  org.carnegieendowment
734      18837572  498      0.000052  com.bigcommerce
735      18836694  916      0.000032  com.libsyn
736      18836082  1924     0.000017  com.kaltura
737      18835258  2926     0.000012  org.wikisource
738      18835144  3021     0.000011  org.gnupg
739      18835004  3164     0.000011  org.signal
740      18834884  493      0.000052  com.aol
741      18834658  3055     0.000011  no.uio
742      18834006  5000     0.000007  ua.nv
743      18833898  3110     0.000011  ru.vedomosti
744      18833794  2169     0.000016  com.wakelet
745      18833066  867      0.000034  com.zoho
746      18832896  640      0.000042  jp.ne.sakura
747      18832118  3963     0.000009  com.theweek
748      18831308  2113     0.000016  com.proquest
749      18830330  1118     0.000026  com.slate
750      18829038  1868     0.000018  com.speakerdeck
751      18828890  2825     0.000012  jp.nicovideo
752      18828214  241      0.000110  jp.co.google
753      18826472  3223     0.000010  com.tradingeconomics
754      18824538  2526     0.000014  com.radio
755      18824466  5366     0.000006  org.bakerlab
756      18824444  1168     0.000025  org.webaim
757      18823292  166      0.000169  org.whatwg
758      18823246  3038     0.000011  com.bloglovin
759      18822092  3620     0.000009  edu.temple
760      18821762  1071     0.000027  com.engadget
761      18821604  1193     0.000025  io.powr
762      18821134  671      0.000040  org.eff
763      18821128  4228     0.000008  com.virgin
764      18820708  330      0.000080  com.wistia
765      18820470  4223     0.000008  com.scotsman
766      18820390  3533     0.000010  ly.plot
767      18819496  3657     0.000009  de.diplo
768      18818614  1855     0.000018  com.ticketmaster
769      18817694  2700     0.000013  com.me
770      18816976  71       0.000441  com.oculus
771      18816804  2918     0.000012  com.digitaljournal
772      18816466  1580     0.000019  com.cbssports
773      18815294  3761     0.000009  io.fabric
774      18814250  2485     0.000014  com.surveygizmo
775      18811680  4234     0.000008  io.meduza
776      18811246  594      0.000045  fr.free
777      18810210  3782     0.000009  org.neocities
778      18810150  2685     0.000013  com.jpost
779      18809838  2701     0.000013  com.washingtontimes
780      18809742  3610     0.000009  org.annualreviews
781      18809160  3333     0.000010  int.nato
782      18808792  1994     0.000017  com.trustwave
783      18808532  3214     0.000011  org.heritage
784      18808074  3486     0.000010  org.repec
785      18807740  1809     0.000019  co.carrd
786      18807316  4100     0.000008  uk.co.timesonline
787      18806184  3358     0.000010  re.appsto
788      18805892  81       0.000387  org.nginx
789      18805568  1314     0.000022  com.playstation
790      18805506  3261     0.000010  uk.ac.leeds
791      18805074  521      0.000050  org.drupal
792      18804212  3656     0.000009  com.citylab
793      18803430  1185     0.000025  com.gizmodo
794      18803162  3903     0.000009  com.nationalreview
795      18802824  3095     0.000011  org.nrdc
796      18802742  3972     0.000009  net.openreview
797      18802624  2783     0.000012  com.wpcomstaging
798      18802584  2589     0.000014  org.sleepfoundation
799      18801620  4771     0.000007  com.bizcommunity
800      18801430  1304     0.000023  com.udemy
8011880112028740.000012com.towardsdatascience
8021880091436490.000009com.glitch
8031880089421620.000016com.unity
804188008885490.000048com.globenewswire
8051880024238360.000009com.bepress
8061880004625670.000014com.thespruce
8071879948819210.000017ru.rbc
8081879887237740.000009com.pbase
8091879873214860.000020br.com.uol
8101879869038850.000009ru.mid
8111879722835690.000009org.wilsoncenter
8121879651249050.000007it.justpaste
8131879592024600.000014ru.rutube
814187950648660.000034com.newsweek
8151879489832720.000010au.edu.sydney
8161879379219750.000017fr.blogspot
817187933846480.000042com.mimecast
8181879290437710.000009it.eventbrite
8191879267630900.000011com.financialpost
8201879261417420.000019com.technologyreview
8211879257446770.000007edu.csun
8221879243048100.000007org.scala-sbt
823187923427050.000038net.b-cdn
8241879218639820.000008com.indystar
8251879104025800.000014ru.tass
8261879061218630.000018ch.ethz
8271879049227640.000013com.newrepublic
8281879030436220.000009ca.uvic
829187902386140.000044com.fandom
8301878929845000.000007com.kinja
8311878843229230.000012int.wmo
8321878695010360.000028com.akamai
8331878525034440.000010ru.lenta
8341878315438860.000009com.slidesharecdn
8351878312442780.000008org.elifesciences
8361878289827200.000013com.fivethirtyeight
8371878270427700.000013com.verywellhealth
838187823987290.000037org.reactjs
8391878223610550.000028org.unicode
8401878026820320.000016org.americanbar
8411877961048500.000007co.aeon
842187795827540.000036com.moz
8431877948046950.000007com.jigsy
844187794802020.000131com.jimcdn
8451877916819900.000017com.kxcdn
8461877909618020.000019com.images-amazon
8471877859040280.000008com.thediplomat
8481877840841890.000008com.allafrica
8491877807418540.000018gov.medlineplus
8501877763010270.000029com.emarketer
8511877694831030.000011com.blogtalkradio
8521877666831480.000011com.biography
8531877565013550.000022com.xkcd
8541877471813870.000021com.thenextweb
8551877418210350.000028com.css-tricks
8561877374627650.000013io.redis
8571877363218300.000018io.kubernetes
8581877151234300.000010fr.rfi
8591877080643080.000008au.edu.adelaide
8601877002831760.000011org.nationalgeographic
8611876989610600.000028com.yandex
8621876783032640.000010org.panda
863187672922710.000097de.amazon
8641876693444710.000008fi.hs
8651876649829920.000011com.euractiv
8661876635051250.000007edu.umt
8671876552637200.000009net.ipsnews
868187650482240.000118org.icann
8691876464239710.000009gov.ornl
8701876373640420.000008org.thinkprogress
8711876353235360.000010vn.com.google
8721876349614740.000020edu.umd
873187634464410.000059org.opensource
8741876276628390.000012fi.yle
8751876126811290.000026com.glassdoor
8761876097438090.000009com.crashlytics
877187605803830.000069it.google
8781876053433920.000010cn.globaltimes
8791875947038410.000009com.sputniknews
8801875931632960.000010gov.doi
8811875928410480.000028ly.cutt
8821875915437690.000009com.clarin
8831875912438330.000009uk.gov.metoffice
8841875889457580.000006org.cgsociety
8851875874614520.000020com.rollingstone
8861875855014580.000020com.smashingmagazine
8871875801432090.000011org.cfr
8881875792639330.000009gov.fec
8891875783435300.000010ru.rosminzdrav
8901875697812430.000023org.golang
8911875689453820.000006edu.chapman
8921875682445440.000007uk.ac.nhm
8931875649651100.000007au.edu.uts
8941875628213730.000021edu.ucsd
8951875627234270.000010edu.unh
8961875543034980.000010jp.ne.docomo
8971875512423000.000015com.w3techs
898187548249500.000031com.ubuntu
8991875465418890.000018com.indiegogo
9001875456640750.000008org.tigris
9011875413618790.000018int.itu
9021875338641430.000008com.coca-cola
9031875251838250.000009ru.gazeta
9041875249843000.000008ch.swissinfo
9051875068225070.000014se.haxx
9061875043451430.000007com.chinatimes
9071874926047020.000007edu.upf
9081874792418620.000018sh.brew
9091874789647950.000007kr.co.koreatimes
9101874719238050.000009mt.gov
9111874609642170.000008com.motor1
9121874580857560.000006com.tv
9131874551221040.000016net.vnexpress
9141874408425650.000014gd.is
9151874401227930.000012ru.hh
9161874368412700.000023org.wiktionary
9171874362842990.000008uk.ac.exeter
9181874313234710.000010com.bhg
9191874251619040.000017org.linuxfoundation
9201874209039080.000009build.bazel
9211874162414010.000021com.freeprivacypolicy
9221874141033340.000010cn.org.china
9231874087625730.000014com.pcworld
9241874007044030.000008com.bravesites
9251874004432180.000010com.nyt
9261873996840270.000008com.usmagazine
9271873933012030.000024com.webs
9281873929855300.000006com.gust
9291873922856830.000006tv.eurovision
9301873899660030.000006ke.co.google
9311873876854980.000006tw.org.rti
932187382168990.000033com.elpais
9331873799625740.000014ru.rg
9341873731051390.000007com.defensenews
9351873678647090.000007com.alignable
9361873677222410.000015ru.kommersant
937187363366500.000042com.accenture
9381873612438430.000009tr.com.aa
9391873609410190.000029com.buzzsprout
9401873576425540.000014ru.mos
9411873566231920.000011com.post-gazette
9421873367248670.000007com.revolut
9431873341651070.000007org.siggraph
9441873288221000.000016com.hackerone
9451873267640960.000008uk.ac.core
9461873209271160.000005com.orgfree
9471873143442150.000008org.jenkins-ci
948187313443730.000070mp.mailchi
9491873121047640.000007fr.huffingtonpost
9501872895256030.000006net.zshare
9511872894641910.000008com.encyclopedia
9521872882230200.000011com.devpost
9531872820847920.000007com.iconarchive
9541872816833560.000010com.washingtonexaminer
9551872711055070.000006uk.org.rspb
9561872688422500.000015org.donorbox
957187268769540.000030edu.wisc
9581872655033730.000010org.rferl
9591872637837890.000009nl.wur
9601872590056370.000006jp.riken
9611872570618290.000018com.homeadvisor
9621872555814660.000020org.owasp
9631872521811590.000025com.imrworldwide
96418724578930.000344com.messenger
9651872330028970.000012ru.kp
9661872309639310.000009gov.ustr
967187224248780.000034edu.umich
9681872217633630.000010int.iom
9691872179614080.000021com.sfgate
9701872155427760.000013com.cloudwaysapps
971187214166600.000041com.psychologytoday
9721872127249960.000007org.geogebra
9731872084812630.000023edu.hbs
9741872078643360.000008com.podomatic
9751872050037580.000009ru.avito
9761872017411940.000025com.searchengineland
977187186149270.000032com.wikihow
9781871788053970.000006com.nippon
9791871781046620.000007org.democracynow
980187165045190.000050gov.fda
9811871644467160.000005uk.ac.aber
9821871630442040.000008com.vancouversun
9831871609856460.000006re.cli
9841871562238470.000009edu.sc
9851871454211530.000025to.dev
9861871376610120.000029org.frontiersin
987187133024050.000064com.constantcontact
9881871307642210.000008org.sonatype
9891871153425100.000014com.etonline
990187114185970.000045com.figma
9911871088412070.000024edu.nyu
9921870976636790.000009org.ets
9931870964862960.000006org.sfpl
9941870894624030.000015com.alibabagroup
9951870887447580.000007net.thedailystar
9961870872634770.000010com.bp
9971870866226470.000013ca.citizenlab
9981870834026820.000013com.discogs
9991870735055170.000006com.maxpreps
1000187055183640.000073com.heroku

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope you find the data useful for any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

January/February 2023 crawl archive now available

The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content. Page captures are from 40 million hosts or 33 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The January/February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2023-06/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ yields the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2023-06/segment.paths.gz | 100 |
WARC files | CC-MAIN-2023-06/warc.paths.gz | 88000 | 88.02
WAT files | CC-MAIN-2023-06/wat.paths.gz | 88000 | 21.72
WET files | CC-MAIN-2023-06/wet.paths.gz | 88000 | 9.05
Robots.txt files | CC-MAIN-2023-06/robotstxt.paths.gz | 88000 | 0.13
Non-200 responses files | CC-MAIN-2023-06/non200responses.paths.gz | 88000 | 2.04
URL index files | CC-MAIN-2023-06/cc-index.paths.gz | 302 | 0.23
Columnar URL index files | CC-MAIN-2023-06/cc-index-table.paths.gz | 900 | 0.26
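
The prefixing step described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_urls` and the sample path are made up, and the input is assumed to be the lines of an already decompressed path listing.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def to_urls(path_lines, prefix=HTTP_PREFIX):
    """Prepend the prefix to every non-empty line of a decompressed path listing."""
    return [prefix + line.strip() for line in path_lines if line.strip()]


# Hypothetical listing content; real listings contain one relative path per line.
sample_lines = ["crawl-data/CC-MAIN-2023-06/example/file-00000.warc.gz\n"]

print(to_urls(sample_lines))             # HTTP URLs
print(to_urls(sample_lines, S3_PREFIX))  # S3 URLs
```

The same list of relative paths serves both access methods; only the prefix changes.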

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2023-06/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November/December 2022 crawl archive now available

The crawl archive for November/December 2022 is now available! The data was crawled November 26 – December 10 and contains 3.35 billion web pages or 420 TiB of uncompressed content. Page captures are from 44 million hosts or 34 million registered domains and include 1.2 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The November/December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2022-49/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ yields the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2022-49/segment.paths.gz | 100 |
WARC files | CC-MAIN-2022-49/warc.paths.gz | 88000 | 92.59
WAT files | CC-MAIN-2022-49/wat.paths.gz | 88000 | 22.89
WET files | CC-MAIN-2022-49/wet.paths.gz | 88000 | 9.58
Robots.txt files | CC-MAIN-2022-49/robotstxt.paths.gz | 88000 | 0.15
Non-200 responses files | CC-MAIN-2022-49/non200responses.paths.gz | 88000 | 2.43
URL index files | CC-MAIN-2022-49/cc-index.paths.gz | 302 | 0.25
Columnar URL index files | CC-MAIN-2022-49/cc-index-table.paths.gz | 900 | 0.28

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-49/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

September/October 2022 crawl archive now available

The crawl archive for September/October 2022 is now available! The data was crawled September 24 – October 8 and contains 3.15 billion web pages or 380 TiB of uncompressed content. Page captures are from 44 million hosts or 34 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls. This crawl includes improvements made in extracting clean text in WET files and WAT anchor texts.

Archive Location and Download

The September/October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2022-40/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://data.commoncrawl.org/ yields the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2022-40/segment.paths.gz | 100 |
WARC files | CC-MAIN-2022-40/warc.paths.gz | 80000 | 82.71
WAT files | CC-MAIN-2022-40/wat.paths.gz | 80000 | 21.49
WET files | CC-MAIN-2022-40/wet.paths.gz | 80000 | 9.11
Robots.txt files | CC-MAIN-2022-40/robotstxt.paths.gz | 80000 | 0.13
Non-200 responses files | CC-MAIN-2022-40/non200responses.paths.gz | 80000 | 1.96
URL index files | CC-MAIN-2022-40/cc-index.paths.gz | 302 | 0.23
Columnar URL index files | CC-MAIN-2022-40/cc-index-table.paths.gz | 900 | 0.26

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-40/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs May, June/July and August 2022

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June/July and August 2022. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. You may also visit the projects cc-webgraph and cc-pyspark which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks. See below for a summary of changes and improvements implemented for the current web graph release.

Changes, improvements and bug fixes

  • Unicode internationalized domain names are always converted into their ASCII equivalents (IDNA). This is now ensured for node labels in the host-level webgraph (see cc-pyspark#35) and consequently also for the domain-level webgraph, where non-ASCII characters were previously replaced by question marks (see cc-webgraph#6).
  • The nodes of the domain graph are now strictly sorted lexicographically by node label (the reverse domain name). This should allow for more efficient compression of the list of domain nodes.
  • The strict sorting was implemented to address a bug (cc-webgraph#3) which could cause duplicated nodes (two or more nodes with the same label) in the domain graph.
  • The domain graph now includes domain names equal to multi-part public suffixes. Previously, the assumption was that names of registered domains are exactly one level below any ICANN suffix in the public suffix list, and host names equal to multi-part suffixes (those including at least one dot) were excluded. Such host names are now included, e.g. gov.uk, freight.aero or altoadige.it. No further validation (e.g. a DNS lookup) is performed, so invalid domain names may also be included. In general, a domain name is only required to be a valid domain name string with a valid TLD or public suffix; nothing else is validated. For more details, see cc-webgraph#1.
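
To make the first three items more concrete, here is a minimal Python sketch of IDNA conversion followed by strict lexicographic sorting of reversed names. It is an illustration only, not the code used in cc-pyspark or cc-webgraph: it relies on Python's built-in `idna` codec (IDNA 2003), the helper names are made up, and de-duplicating through a set is a simplification chosen for this example.

```python
def to_ascii(domain: str) -> str:
    """Convert a possibly internationalized domain name to its ASCII (IDNA) form."""
    return domain.encode("idna").decode("ascii").lower()


def reverse_name(domain: str) -> str:
    """Turn example.com into com.example (reverse domain name notation)."""
    return ".".join(reversed(domain.split(".")))


def sorted_unique_labels(domains):
    """Node labels in strict lexicographic order, with duplicates collapsed."""
    return sorted({reverse_name(to_ascii(d)) for d in domains})


print(sorted_unique_labels(["example.com", "bücher.example", "example.com", "a.org"]))
```

Because every label is pure ASCII after the conversion, a plain string sort gives a well-defined strict order, and two entries that compare equal are by construction the same node.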

Host-level graph

The graph consists of 449 million nodes and 2.69 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to span the graph. All types of links are included, even purely “technical” ones pointing to images, JavaScript libraries, web fonts, etc. However, only host names with a valid IANA TLD are used. Consequently, URLs with an IP address as host component are not taken into account for building the host-level graph.

There are 389 million dangling nodes (86.6%) and the largest strongly connected component contains 46.4 million (10.3%) nodes. Dangling nodes stem from

  • hosts that have not been crawled, yet are pointed to from a link on a crawled page
  • hosts without any links pointing to a different host name
  • or hosts which returned only an error page (e.g. HTTP 404)
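
All three cases above leave a node without any outgoing edge. A minimal Python sketch of that check on a toy edge list follows; the node ids and edges are made up, and ignoring self-links is an assumption made for this example rather than a documented property of the released graph.

```python
def dangling_nodes(num_nodes, edges):
    """Return the ids of nodes without an edge to a different node."""
    has_outlink = [False] * num_nodes
    for src, dst in edges:
        if src != dst:  # assumption: a link to the same host does not count
            has_outlink[src] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]


# Toy graph: node 2 only links to itself, node 3 was never crawled.
toy_edges = [(0, 1), (1, 2), (2, 2)]
dangling = dangling_nodes(4, toy_edges)
print(dangling, f"{len(dangling) / 4:.1%}")
```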

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
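
As a small illustration of this notation, the following Python function strips a leading www. and reverses the remaining labels. The function name is made up, and the lowercasing step is an assumption added for the example.

```python
def reverse_host(host: str) -> str:
    """Strip a leading 'www.' and reverse the order of the dot-separated labels."""
    host = host.lower()  # assumption: labels are compared case-insensitively
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```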

You can download the graph and the ranks of all 449 million hosts from AWS S3 at the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/host/ (this requires an AWS account). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 96 gzip-compressed files listed in two path listings – one for the nodes (vertices), one for the edges (arcs). First, download the paths listing and decompress it using “gzip -d” or “gunzip”. By adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing you get the list of URLs to download the entire graph.
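
For readers who prefer to do the decompression step in Python rather than with gunzip, a minimal sketch using the standard gzip module is shown below. The listing here is built in memory and its two paths are invented stand-ins; a real listing would be downloaded first.

```python
import gzip

HTTP_PREFIX = "https://data.commoncrawl.org/"


def urls_from_listing(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and return one download URL per line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line for line in text.splitlines() if line]


# Stand-in for a downloaded listing; the two paths below are made up.
fake_listing = gzip.compress(
    b"projects/example/vertices/part-00000.txt.gz\n"
    b"projects/example/vertices/part-00001.txt.gz\n"
)
for url in urls_from_listing(fake_listing):
    print(url)
```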

Download files of the Common Crawl May/Jun/Aug 2022 host-level webgraph

Size | File | Description
3.09 GB | cc-main-2022-may-jun-aug-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 4 vertices files
11.91 GB | cc-main-2022-may-jun-aug-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 9 edges files
5.76 GB | cc-main-2022-may-jun-aug-host.graph | graph in BVGraph format
2 kB | cc-main-2022-may-jun-aug-host.properties |
6.20 GB | cc-main-2022-may-jun-aug-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2022-may-jun-aug-host-t.properties |
1 kB | cc-main-2022-may-jun-aug-host.stats | WebGraph statistics
7.46 GB | cc-main-2022-may-jun-aug-host-ranks.txt.gz | harmonic centrality and pagerank
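
The text representation pairs a vertices file ⟨id, rev host⟩ with an edges file ⟨from_id, to_id⟩. The Python sketch below joins the two so that edges can be read by host name. It assumes tab-separated fields, which is not stated in the table above, and uses invented sample lines; for the full graph an in-memory dictionary of this kind would need a machine with a lot of RAM.

```python
def load_vertices(lines):
    """Map node id to reversed host name, assuming tab-separated '<id>\\t<rev host>' lines."""
    id_to_host = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        id_to_host[int(node_id)] = rev_host
    return id_to_host


def labeled_edges(edge_lines, id_to_host):
    """Yield (from_host, to_host) pairs for '<from_id>\\t<to_id>' lines."""
    for line in edge_lines:
        from_id, to_id = line.split()
        yield id_to_host[int(from_id)], id_to_host[int(to_id)]


# Invented sample data in the assumed format.
vertices = ["0\tcom.example\n", "1\torg.example\n"]
edges = ["0\t1\n"]
print(list(labeled_edges(edges, load_vertices(vertices))))
```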

Domain-level graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org. Version (commit) e5ff0c7 of the public suffix list was used (commit date 2022-09-15).
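
The aggregation idea can be sketched in Python as follows. This is a simplified illustration, not the cc-webgraph implementation: the suffix set is a tiny hypothetical stand-in for the public suffix list, wildcard and exception rules are ignored, and dropping links inside the same domain is an assumption of this example.

```python
# Hypothetical mini suffix list; the real list is maintained on publicsuffix.org.
SUFFIXES = {"com", "uk", "co.uk", "gov.uk"}


def pay_level_domain(host, suffixes=SUFFIXES):
    """Return the suffix plus one label, or the host itself if it equals a multi-part suffix."""
    labels = host.lower().split(".")
    for i in range(len(labels)):  # longest candidate suffix first
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i > 0:
                return ".".join(labels[i - 1:])
            return candidate if "." in candidate else None
    return None


def aggregate_edges(host_edges, suffixes=SUFFIXES):
    """Collapse host-level edges into unique, sorted domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src, suffixes), pay_level_domain(dst, suffixes)
        if s and d and s != d:  # assumption: intra-domain links are dropped
            domain_edges.add((s, d))
    return sorted(domain_edges)


host_edges = [
    ("www.example.co.uk", "blog.example.com"),
    ("shop.example.co.uk", "example.com"),
    ("a.example.com", "b.example.com"),
]
print(aggregate_edges(host_edges))
```

Note that a host name equal to a multi-part suffix, such as gov.uk, is kept as its own domain node, while a bare TLD is not.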

The domain-level graph has 91 million nodes and 1.57 billion edges. 45 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 37 million nodes (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-may-jun-aug/domain/.

Download files of the Common Crawl May/Jun/Aug 2022 domain-level webgraph

Size | File | Description
0.63 GB | cc-main-2022-may-jun-aug-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.52 GB | cc-main-2022-may-jun-aug-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.77 GB | cc-main-2022-may-jun-aug-domain.graph | graph in BVGraph format
2 kB | cc-main-2022-may-jun-aug-domain.properties |
3.59 GB | cc-main-2022-may-jun-aug-domain-t.graph | transpose of the graph
2 kB | cc-main-2022-may-jun-aug-domain-t.properties |
1 kB | cc-main-2022-may-jun-aug-domain.stats | WebGraph statistics
1.96 GB | cc-main-2022-may-jun-aug-domain-ranks.txt.gz | harmonic centrality and pagerank
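
A ranks file can be processed line by line. The Python sketch below assumes that each line holds the five tab-separated columns shown in the top-1000 table of this post (harmonic centrality rank, hc value, page rank, page rank value, reversed name) and that header lines start with "#"; both are assumptions, so check the actual file before relying on them. The two sample lines reuse values from that table.

```python
from typing import NamedTuple


class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_name: str


def parse_ranks(lines):
    """Parse rank lines, skipping blank lines and lines starting with '#'."""
    ranks = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        ranks.append(Rank(int(f[0]), float(f[1]), int(f[2]), float(f[3]), f[4]))
    return ranks


sample = [
    "1\t32914686\t1\t0.018077\tcom.googleapis\n",
    "2\t32131562\t3\t0.012273\tcom.facebook\n",
]
by_pagerank = sorted(parse_ranks(sample), key=lambda r: r.pr_rank)
print([r.rev_name for r in by_pagerank])
```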

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domain ranks is available for download.
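
As a reminder of what the first ranking measures: the harmonic centrality of a node x is the sum of 1/d(y, x) over all other nodes y, where d(y, x) is the length of the shortest path from y to x and unreachable nodes contribute nothing. The Python sketch below computes it exactly on a toy graph with one breadth-first search per node over incoming links. It only illustrates the definition; it is not how the ranks in this release were computed, and it would not scale to graphs of this size.

```python
from collections import deque


def harmonic_centrality(num_nodes, edges):
    """Exact harmonic centrality for a small directed graph given as (src, dst) pairs."""
    preds = [[] for _ in range(num_nodes)]  # incoming links, i.e. the transposed graph
    for src, dst in edges:
        preds[dst].append(src)
    scores = []
    for target in range(num_nodes):
        dist = {target: 0}
        queue = deque([target])
        score = 0.0
        while queue:  # breadth-first search backwards from the target
            node = queue.popleft()
            for p in preds[node]:
                if p not in dist:
                    dist[p] = dist[node] + 1
                    score += 1.0 / dist[p]
                    queue.append(p)
        scores.append(score)
    return scores


# Chain 0 -> 1 -> 2: node 2 is reached from node 1 (distance 1) and node 0 (distance 2).
print(harmonic_centrality(3, [(0, 1), (1, 2)]))
```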

Top 1000 domains ranked by harmonic centrality (May/Jun/Aug 2022)

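harmonic centrality rank | hc value | page rank | page rank value | reversed domain name
1 | 32914686 | 1 | 0.018077 | com.googleapis
2 | 32131562 | 3 | 0.012273 | com.facebook
3 | 31770190 | 2 | 0.015371 | com.google
4 | 27971714 | 5 | 0.007018 | com.twitter
5 | 27839074 | 7 | 0.006164 | com.youtube
6 | 27662708 | 6 | 0.006892 | org.w
7 | 27254468 | 8 | 0.005701 | com.instagram
8 | 26739486 | 4 | 0.007602 | com.googletagmanager
9 | 25688664 | 10 | 0.004673 | org.gmpg
10 | 25578590 | 9 | 0.004792 | com.gstatic
11 | 25033650 | 12 | 0.003435 | com.linkedin
12 | 24026588 | 11 | 0.004116 | com.cloudflare
13 | 23679430 | 17 | 0.002013 | com.gravatar
14 | 23526052 | 13 | 0.002488 | org.wordpress
15 | 23497052 | 24 | 0.001546 | com.pinterest
16 | 23111634 | 28 | 0.001244 | org.wikipedia
17 | 23073066 | 14 | 0.002254 | com.apple
18 | 22828954 | 25 | 0.001434 | com.wordpress
19 | 22794652 | 31 | 0.001150 | com.vimeo
20 | 22725476 | 39 | 0.000940 | be.youtu
21 | 22500412 | 18 | 0.001913 | com.bootstrapcdn
22 | 22420202 | 32 | 0.001128 | com.microsoft
23 | 22394764 | 15 | 0.002193 | net.cloudfront
24 | 22370010 | 22 | 0.001568 | com.jquery
25 | 22285798 | 23 | 0.001553 | io.polyfill
26 | 22278972 | 51 | 0.000652 | com.blogspot
27 | 22275856 | 44 | 0.000799 | gl.goo
28 | 22242208 | 35 | 0.001012 | com.amazonaws
29 | 22199000 | 47 | 0.000701 | com.amazon
30 | 22170846 | 27 | 0.001252 | net.jsdelivr
31 | 22147346 | 46 | 0.000764 | eu.europa
32 | 22143092 | 41 | 0.000874 | ly.bit
33 | 22058786 | 42 | 0.000835 | org.mozilla
34 | 22050970 | 38 | 0.000958 | com.google-analytics
35 | 22028542 | 21 | 0.001626 | com.fontawesome
36 | 21967818 | 36 | 0.001001 | com.adobe
37 | 21947388 | 20 | 0.001865 | com.github
38 | 21939440 | 94 | 0.000371 | com.tumblr
39 | 21919148 | 19 | 0.001882 | com.googleusercontent
40 | 21916910 | 49 | 0.000687 | com.wp
41 | 21896858 | 52 | 0.000647 | com.paypal
42 | 21790948 | 61 | 0.000550 | co.t
43 | 21769982 | 48 | 0.000695 | com.whatsapp
44 | 21761882 | 54 | 0.000605 | com.flickr
45 | 21753952 | 99 | 0.000356 | com.yahoo
46 | 21729404 | 69 | 0.000515 | io.github
47 | 21713518 | 133 | 0.000248 | com.nytimes
48 | 21675054 | 34 | 0.001031 | ru.yandex
49 | 21669788 | 91 | 0.000382 | com.medium
50 | 21638440 | 30 | 0.001195 | com.wixstatic
51 | 21614440 | 67 | 0.000526 | com.shopify
52 | 21601372 | 119 | 0.000315 | com.reddit
53 | 21594356 | 158 | 0.000193 | com.forbes
54 | 21576744 | 40 | 0.000925 | com.googlesyndication
55 | 21576166 | 63 | 0.000546 | org.w3
56 | 21562668 | 131 | 0.000257 | com.soundcloud
57 | 21539526 | 108 | 0.000328 | com.weebly
58 | 21506586 | 59 | 0.000571 | org.schema
59 | 21485114 | 122 | 0.000306 | org.creativecommons
60 | 21473180 | 153 | 0.000207 | gov.nih
61 | 21464862 | 179 | 0.000156 | int.who
62 | 21460442 | 65 | 0.000529 | com.vk
63 | 21421598 | 177 | 0.000158 | com.theguardian
64 | 21419292 | 203 | 0.000129 | com.cnn
65 | 21414034 | 147 | 0.000213 | org.archive
66 | 21410902 | 218 | 0.000122 | uk.co.bbc
67 | 21408640 | 50 | 0.000660 | net.doubleclick
68 | 21393638 | 62 | 0.000550 | com.unpkg
69 | 21390194 | 217 | 0.000122 | com.businessinsider
70 | 21384262 | 149 | 0.000212 | com.tiktok
71 | 21379450 | 198 | 0.000134 | com.imgur
72 | 21369906 | 106 | 0.000332 | me.wp
73 | 21362530 | 78 | 0.000407 | com.android
74 | 21361648 | 148 | 0.000213 | com.wixsite
75 | 21359636 | 56 | 0.000603 | com.addthis
76 | 21350540 | 281 | 0.000098 | com.bloomberg
77 | 21338508 | 60 | 0.000564 | com.fb
78 | 21332614 | 337 | 0.000083 | edu.stanford
79 | 21330884 | 355 | 0.000078 | com.theverge
80 | 21308772 | 57 | 0.000588 | com.macromedia
81 | 21306812 | 240 | 0.000109 | com.imdb
82 | 21305630 | 117 | 0.000324 | me.t
83 | 21301714 | 181 | 0.000154 | com.bing
84 | 21299082 | 92 | 0.000379 | com.giphy
85 | 21277366 | 300 | 0.000093 | com.bbc
86 | 21270958 | 100 | 0.000353 | com.list-manage
87 | 21266506 | 43 | 0.000827 | net.fbcdn
88 | 21265192 | 143 | 0.000218 | gle.forms
89 | 21263378 | 252 | 0.000106 | com.wsj
90 | 21237826 | 368 | 0.000075 | com.go
91 | 21236654 | 321 | 0.000087 | com.reuters
92 | 21236614 | 220 | 0.000120 | org.ietf
93 | 21233526 | 132 | 0.000253 | com.statcounter
94 | 21223988 | 93 | 0.000375 | com.stripe
95 | 21222332 | 194 | 0.000137 | uk.gov
96 | 21221406 | 302 | 0.000093 | edu.mit
97 | 21219964 | 254 | 0.000105 | org.un
98 | 21218302 | 295 | 0.000096 | edu.harvard
99 | 21218140 | 184 | 0.000151 | com.issuu
100 | 21215566 | 175 | 0.000159 | gov.cdc
101 | 21213292 | 120 | 0.000314 | de.google
102 | 21212876 | 285 | 0.000097 | com.oracle
103 | 21208848 | 150 | 0.000209 | com.ytimg
104 | 21206504 | 396 | 0.000068 | com.cnet
105 | 21204790 | 338 | 0.000083 | com.techcrunch
106 | 21203072 | 365 | 0.000075 | gov.nasa
107 | 21198090 | 157 | 0.000198 | com.dropbox
108 | 21197416 | 476 | 0.000055 | com.msn
109 | 21196722 | 249 | 0.000107 | com.twimg
110 | 21191914 | 357 | 0.000077 | com.quora
111 | 21190924 | 367 | 0.000075 | com.wired
112 | 21184172 | 289 | 0.000097 | net.slideshare
113 | 21184136 | 190 | 0.000142 | com.unsplash
114 | 21183394 | 73 | 0.000469 | com.wix
115 | 21181708 | 174 | 0.000160 | org.apache
116 | 21171102 | 455 | 0.000058 | com.googleblog
117 | 21169330 | 135 | 0.000237 | com.mailchimp
118 | 21168736 | 182 | 0.000153 | com.etsy
119 | 21167548 | 364 | 0.000075 | org.hbr
120 | 21162976 | 125 | 0.000284 | com.spotify
121 | 21159700 | 248 | 0.000107 | com.stackoverflow
122 | 21145710 | 206 | 0.000127 | com.blogger
123 | 21144796 | 397 | 0.000067 | org.arxiv
124 | 21140336 | 292 | 0.000096 | com.slack
125 | 21139536 | 270 | 0.000101 | net.researchgate
126 | 21138540 | 263 | 0.000104 | uk.co.amazon
127 | 21136788 | 345 | 0.000080 | org.npr
128 | 21134310 | 374 | 0.000073 | com.example
129 | 21128534 | 156 | 0.000200 | us.zoom
130 | 21128018 | 236 | 0.000110 | com.washingtonpost
131 | 21124494 | 363 | 0.000076 | com.appspot
132 | 21117290 | 124 | 0.000298 | com.ft
133 | 21115674 | 325 | 0.000086 | com.cnbc
134 | 21115136 | 309 | 0.000091 | com.wiley
135 | 21112604 | 356 | 0.000078 | com.nature
136 | 21104960 | 510 | 0.000052 | edu.berkeley
137 | 21100090 | 483 | 0.000055 | com.myspace
138 | 21095354 | 216 | 0.000122 | com.outlook
139 | 21093680 | 298 | 0.000095 | org.acm
140 | 21091496 | 155 | 0.000203 | com.weibo
141 | 21088936 | 152 | 0.000208 | org.networkadvertising
142 | 21080382 | 572 | 0.000047 | com.cbsnews
143 | 21079152 | 279 | 0.000099 | org.gnu
144 | 21074760 | 437 | 0.000061 | uk.co.telegraph
145 | 21073516 | 68 | 0.000520 | com.godaddy
146 | 21068068 | 291 | 0.000096 | uk.co.google
147 | 21065050 | 130 | 0.000266 | com.youtube-nocookie
148 | 21065030 | 195 | 0.000136 | org.wikimedia
149 | 21062066 | 393 | 0.000069 | com.usatoday
150 | 21061852 | 645 | 0.000041 | com.intel
151 | 21052752 | 478 | 0.000055 | com.goodreads
152 | 21048112 | 378 | 0.000072 | com.time
153 | 21041138 | 481 | 0.000055 | com.theatlantic
154 | 21040374 | 633 | 0.000042 | com.box
155 | 21038502 | 274 | 0.000100 | com.squarespace
156 | 21031308 | 204 | 0.000129 | com.eventbrite
157 | 21030242 | 37 | 0.000977 | com.qq
158 | 21030040 | 185 | 0.000150 | com.yelp
159 | 21025664 | 137 | 0.000230 | com.opera
160 | 21022330 | 360 | 0.000076 | ee.linktr
161 | 21022182 | 1141 | 0.000026 | com.wikia
162 | 21020670 | 349 | 0.000079 | com.springer
163 | 21017524 | 465 | 0.000056 | com.latimes
164 | 21017202 | 170 | 0.000165 | com.zendesk
165 | 21015596 | 424 | 0.000062 | com.huffingtonpost
166 | 21014202 | 162 | 0.000185 | org.ampproject
167 | 21011874 | 574 | 0.000046 | com.indiatimes
168 | 21009830 | 145 | 0.000217 | info.aboutads
169 | 21006182 | 886 | 0.000035 | com.qz
170 | 21005110 | 704 | 0.000039 | org.chromium
171 | 21003538 | 682 | 0.000040 | com.buzzfeed
172 | 21000898 | 221 | 0.000120 | org.doi
173 | 20999442 | 585 | 0.000045 | com.vice
174 | 20989458 | 1116 | 0.000027 | com.thenextweb
175 | 20987914 | 304 | 0.000092 | com.typeform
176 | 20983612 | 261 | 0.000104 | com.sciencedirect
177 | 20982702 | 506 | 0.000053 | edu.cornell
178 | 20982294 | 544 | 0.000049 | com.mashable
179 | 20977172 | 626 | 0.000043 | com.scribd
180 | 20973696 | 523 | 0.000051 | edu.yale
181 | 20971220 | 501 | 0.000053 | uk.co.independent
182 | 20970866 | 258 | 0.000105 | net.behance
183 | 20970776 | 679 | 0.000040 | com.economist
184 | 20968290 | 747 | 0.000037 | edu.upenn
185 | 20964282 | 278 | 0.000099 | org.pewresearch
186 | 20960828 | 545 | 0.000049 | com.cisco
187 | 20960582 | 451 | 0.000058 | com.bigcommerce
188 | 20956062 | 564 | 0.000047 | com.psychologytoday
189 | 20942726 | 513 | 0.000052 | com.fortune
190 | 20942618 | 193 | 0.000139 | page.g
191 | 20940302 | 382 | 0.000071 | com.gitlab
192 | 20939136 | 462 | 0.000057 | uk.co.dailymail
193 | 20936146 | 432 | 0.000061 | com.pixabay
194 | 20933922 | 306 | 0.000091 | com.tinyurl
195 | 20932536 | 497 | 0.000053 | com.deloitte
196 | 20932040 | 956 | 0.000031 | com.evernote
197 | 20925436 | 542 | 0.000049 | io.codepen
198 | 20924450 | 212 | 0.000125 | com.calendly
199 | 20923226 | 694 | 0.000039 | com.vox
200 | 20919494 | 731 | 0.000038 | com.git-scm
201 | 20918582 | 610 | 0.000044 | org.unesco
202 | 20917448 | 1008 | 0.000030 | com.about
203 | 20916974 | 71 | 0.000469 | net.facebook
204 | 20915832 | 571 | 0.000047 | org.weforum
205 | 20915084 | 419 | 0.000062 | com.w3schools
206 | 20914978 | 326 | 0.000086 | com.typepad
207 | 20911268 | 515 | 0.000052 | com.squareup
208 | 20907742 | 904 | 0.000034 | com.arstechnica
209 | 20900952 | 473 | 0.000055 | com.nbcnews
210 | 20899924 | 373 | 0.000074 | co.ibb
211 | 20899532 | 632 | 0.000042 | com.withgoogle
212 | 20898928 | 809 | 0.000036 | edu.washington
213 | 20896616 | 521 | 0.000051 | com.inc
214 | 20891820 | 898 | 0.000034 | uk.ac.cam
215 | 20886310 | 405 | 0.000066 | com.sagepub
216 | 20878134 | 543 | 0.000049 | fm.anchor
217 | 20876826 | 683 | 0.000040 | com.apnews
218 | 20875670 | 967 | 0.000031 | com.slate
219 | 20875650 | 442 | 0.000059 | gov.whitehouse
220 | 20872664 | 689 | 0.000040 | com.venturebeat
221 | 20871102 | 530 | 0.000050 | com.pexels
222 | 20866632 | 242 | 0.000109 | org.iana
223 | 20865428 | 260 | 0.000105 | de.amazon
224 | 20862054 | 549 | 0.000048 | gov.noaa
225 | 20860884 | 755 | 0.000037 | me.about
226 | 20858432 | 33 | 0.001073 | com.baidu
227 | 20856406 | 1312 | 0.000023 | org.eclipse
228 | 20854214 | 609 | 0.000044 | com.mysql
229 | 20847014 | 244 | 0.000108 | com.live
230 | 20846064 | 654 | 0.000041 | com.nationalgeographic
231 | 20844358 | 876 | 0.000035 | edu.asu
232 | 20842882 | 299 | 0.000094 | com.ibm
233 | 20839080 | 196 | 0.000136 | jp.co.google
234 | 20835838 | 351 | 0.000078 | com.dribbble
235 | 20835472 | 716 | 0.000038 | ca.cbc
236 | 20828032 | 558 | 0.000048 | org.worldbank
237 | 20827500 | 1278 | 0.000023 | com.nike
238 | 20814914 | 459 | 0.000057 | gov.fda
239 | 20813098 | 603 | 0.000044 | org.pbs
240 | 20811434 | 586 | 0.000045 | gov.loc
241 | 20810490 | 467 | 0.000056 | gov.usda
242 | 20810284 | 485 | 0.000054 | com.gofundme
243 | 20807808 | 316 | 0.000088 | com.feedburner
244 | 20807006 | 329 | 0.000084 | net.windows
245 | 20805276 | 1132 | 0.000027 | com.hollywoodreporter
246 | 20804930 | 161 | 0.000187 | com.staticflickr
247 | 20804690 | 1003 | 0.000030 | org.greenpeace
248 | 20802310 | 492 | 0.000054 | com.tandfonline
249 | 20802308 | 339 | 0.000081 | eu.youronlinechoices
250 | 20801724 | 992 | 0.000031 | app.netlify
251 | 20801376 | 1281 | 0.000023 | com.billboard
252 | 20799338 | 642 | 0.000042 | com.newyorker
253 | 20798194 | 875 | 0.000035 | edu.wisc
254 | 20796936 | 702 | 0.000039 | au.net.abc
255 | 20796272 | 916 | 0.000033 | org.pypi
256 | 20795900 | 176 | 0.000159 | com.office
257 | 20795318 | 1266 | 0.000024 | com.technologyreview
258 | 20784974 | 477 | 0.000055 | com.theconversation
259 | 20782898 | 887 | 0.000035 | org.sciencemag
260 | 20782640 | 253 | 0.000105 | com.jotform
261 | 20779480 | 984 | 0.000031 | com.gizmodo
262 | 20778708 | 673 | 0.000040 | org.cambridge
263 | 20777714 | 1294 | 0.000023 | com.500px
264 | 20777238 | 730 | 0.000038 | com.walmart
265 | 20775900 | 525 | 0.000051 | com.oup
266 | 20773216 | 608 | 0.000044 | com.xinhuanet
267 | 20772144 | 429 | 0.000061 | com.getpocket
268 | 20770590 | 1181 | 0.000025 | edu.umd
269 | 20768772 | 500 | 0.000053 | gov.epa
270 | 20767862 | 709 | 0.000039 | org.bitbucket
271 | 20767390 | 1144 | 0.000026 | edu.purdue
272 | 20763440 | 1383 | 0.000022 | ms.1drv
273 | 20762974 | 1084 | 0.000028 | co.elastic
274 | 20760120 | 891 | 0.000034 | org.semver
275 | 20755544 | 430 | 0.000061 | org.debian
276 | 20753630 | 1308 | 0.000023 | org.kernel
277 | 20749768 | 757 | 0.000037 | com.britannica
278 | 20749716 | 963 | 0.000031 | com.nypost
279 | 20747130 | 638 | 0.000042 | com.elpais
280 | 20744652 | 929 | 0.000032 | com.foxnews
281 | 20738360 | 502 | 0.000053 | com.dailymotion
282 | 20736612 | 1154 | 0.000026 | com.sky
283 | 20735678 | 1000 | 0.000030 | com.uk
284 | 20729688 | 246 | 0.000108 | com.wpengine
285 | 20728890 | 1623 | 0.000019 | com.googlesource
286 | 20726846 | 1007 | 0.000030 | edu.princeton
287 | 20725440 | 548 | 0.000048 | gov.house
288 | 20722436 | 592 | 0.000045 | com.mozilla
289 | 20721772 | 86 | 0.000393 | com.wsimg
290 | 20721658 | 1404 | 0.000021 | com.over-blog
291 | 20718906 | 488 | 0.000054 | com.ted
292 | 20716838 | 1660 | 0.000018 | com.lego
293 | 20715928 | 754 | 0.000037 | gov.justice
294 | 20714832 | 1005 | 0.000030 | uk.co.guardian
295 | 20713856 | 425 | 0.000062 | com.arcgis
296 | 20713486 | 1319 | 0.000023 | com.digitaltrends
297 | 20711750 | 795 | 0.000036 | edu.umich
298 | 20710650 | 428 | 0.000061 | org.openstreetmap
299 | 20709586 | 241 | 0.000109 | net.sourceforge
300 | 20708624 | 947 | 0.000032 | com.ssrn
301 | 20703088 | 1697 | 0.000018 | org.usenix
302 | 20700354 | 386 | 0.000070 | com.netdna-ssl
303 | 20698318 | 935 | 0.000032 | com.ggpht
304 | 20697518 | 232 | 0.000113 | com.amazon-adsystem
305 | 20696838 | 314 | 0.000090 | tv.twitch
306 | 20696330 | 950 | 0.000032 | uk.co.blogspot
307 | 20696136 | 1416 | 0.000021 | com.hatenablog
308 | 20692884 | 1149 | 0.000026 | co.g
309 | 20691964 | 230 | 0.000114 | gov.ca
310 | 20689588 | 801 | 0.000036 | com.politico
311 | 20689246 | 1315 | 0.000023 | com.socialmediatoday
312 | 20686440 | 728 | 0.000038 | org.change
313 | 20685528 | 239 | 0.000110 | uk.org.ico
314 | 20685498 | 223 | 0.000119 | jp.co.yahoo
315 | 20685212 | 588 | 0.000045 | uk.gov.service
316 | 20684354 | 171 | 0.000162 | com.rawgit
317 | 20684232 | 280 | 0.000098 | net.azureedge
318 | 20681826 | 1210 | 0.000025 | io.itch
319 | 20680346 | 1318 | 0.000023 | de.mpg
320 | 20678714 | 1552 | 0.000019 | com.euronews
321 | 20676490 | 964 | 0.000031 | edu.jhu
322 | 20676186 | 940 | 0.000032 | edu.umn
323 | 20675058 | 531 | 0.000050 | site.business
324 | 20672708 | 169 | 0.000166 | com.addtoany
325 | 20671774 | 474 | 0.000055 | gov.hhs
326 | 20670114 | 412 | 0.000064 | com.ebay
327 | 20668938 | 1550 | 0.000019 | com.urbandictionary
328 | 20664866 | 1182 | 0.000025 | com.axios
329 | 20664008 | 1242 | 0.000024 | org.semanticscholar
330 | 20663034 | 1103 | 0.000027 | com.udemy
331 | 20662500 | 1395 | 0.000021 | com.reverbnation
332 | 20659858 | 1505 | 0.000020 | edu.indiana
333 | 20656824 | 1481 | 0.000020 | au.com.news
334 | 20654924 | 1079 | 0.000028 | edu.uchicago
335 | 20654274 | 752 | 0.000037 | org.fao
336 | 20653112 | 622 | 0.000043 | gov.census
337 | 20652698 | 1178 | 0.000025 | net.speedtest
338 | 20650808 | 1771 | 0.000017 | org.phys
339 | 20650016 | 74 | 0.000424 | net.akamaihd
340 | 20647938 | 229 | 0.000115 | com.hubspot
341 | 20642594 | 995 | 0.000030 | com.scientificamerican
342 | 20641466 | 1328 | 0.000023 | com.nymag
343 | 20638870 | 1788 | 0.000017 | com.martinfowler
344 | 20638382 | 1663 | 0.000018 | edu.gatech
345 | 20637680 | 555 | 0.000048 | com.kickstarter
346 | 20635558 | 187 | 0.000146 | com.xing
347 | 20635506 | 1129 | 0.000027 | org.wiktionary
348 | 20634592 | 1042 | 0.000029 | edu.utexas
349 | 20633912 | 2314 | 0.000015 | com.flipboard
350 | 20633634 | 567 | 0.000047 | com.snapchat
351 | 20633246 | 3204 | 0.000011 | com.openai
352 | 20630316 | 1423 | 0.000021 | ch.ethz
353 | 20629522 | 1420 | 0.000021 | com.businessweek
354 | 20629100 | 873 | 0.000035 | watch.fb
355 | 20626018 | 154 | 0.000206 | com.sharethis
356 | 20625672 | 948 | 0.000032 | com.timeanddate
357 | 20625036 | 720 | 0.000038 | org.d3js
358 | 20624578 | 1744 | 0.000017 | com.itv
359 | 20620690 | 1267 | 0.000024 | uk.ac.ucl
360 | 20618834 | 1455 | 0.000020 | uk.co.metro
361 | 20617718 | 320 | 0.000087 | com.statista
362 | 20617602 | 529 | 0.000050 | com.googlecode
363 | 20617378 | 1147 | 0.000026 | com.jetbrains
364 | 20617292 | 614 | 0.000044 | org.ohchr
365 | 20617266 | 915 | 0.000033 | de.spiegel
366 | 20616654 | 472 | 0.000055 | com.meetup
367 | 20616580 | 322 | 0.000086 | com.disqus
368 | 20615966 | 399 | 0.000067 | com.optimizely
369 | 20615414 | 2802 | 0.000013 | com.diigo
370 | 20615048 | 287 | 0.000097 | jp.ne.hatena
371 | 20614360 | 1285 | 0.000023 | com.smithsonianmag
372 | 20614098 | 1152 | 0.000026 | com.scmp
373 | 20614010 | 1110 | 0.000027 | com.foursquare
374 | 20610490 | 2636 | 0.000014 | blog.home
375 | 20610202 | 2020 | 0.000016 | com.knowyourmeme
376 | 20607856 | 353 | 0.000078 | net.themeforest
377 | 20607506 | 733 | 0.000038 | au.gov.nsw
378 | 20606218 | 1078 | 0.000028 | com.chicagotribune
379 | 20603548 | 1164 | 0.000026 | au.com.smh
380 | 20603248 | 1589 | 0.000019 | uk.co.express
381 | 20599986 | 1121 | 0.000027 | edu.nyu
382 | 20599508 | 268 | 0.000102 | com.npmjs
383 | 20598566 | 641 | 0.000042 | gov.senate
384 | 20594974 | 639 | 0.000042 | com.zdnet
385 | 20594264 | 1128 | 0.000027 | link.page
386 | 20591568 | 968 | 0.000031 | com.usps
387 | 20588732 | 890 | 0.000035 | gov.congress
388 | 20586888 | 293 | 0.000096 | com.eepurl
389 | 20585314 | 1002 | 0.000030 | com.history
390 | 20584024 | 677 | 0.000040 | com.pinimg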
149210620663930.000069com.usatoday
150210618526450.000041com.intel
151210527524780.000055com.goodreads
152210481123780.000072com.time
153210411384810.000055com.theatlantic
154210403746330.000042com.box
155210385022740.000100com.squarespace
156210313082040.000129com.eventbrite
15721030242370.000977com.qq
158210300401850.000150com.yelp
159210256641370.000230com.opera
160210223303600.000076ee.linktr
1612102218211410.000026com.wikia
162210206703490.000079com.springer
163210175244650.000056com.latimes
164210172021700.000165com.zendesk
165210155964240.000062com.huffingtonpost
166210142021620.000185org.ampproject
167210118745740.000046com.indiatimes
168210098301450.000217info.aboutads
169210061828860.000035com.qz
170210051107040.000039org.chromium
171210035386820.000040com.buzzfeed
172210008982210.000120org.doi
173209994425850.000045com.vice
1742098945811160.000027com.thenextweb
175209879143040.000092com.typeform
176209836122610.000104com.sciencedirect
177209827025060.000053edu.cornell
178209822945440.000049com.mashable
179209771726260.000043com.scribd
180209736965230.000051edu.yale
181209712205010.000053uk.co.independent
182209708662580.000105net.behance
183209707766790.000040com.economist
184209682907470.000037edu.upenn
185209642822780.000099org.pewresearch
186209608285450.000049com.cisco
187209605824510.000058com.bigcommerce
188209560625640.000047com.psychologytoday
189209427265130.000052com.fortune
190209426181930.000139page.g
191209403023820.000071com.gitlab
192209391364620.000057uk.co.dailymail
193209361464320.000061com.pixabay
194209339223060.000091com.tinyurl
195209325364970.000053com.deloitte
196209320409560.000031com.evernote
197209254365420.000049io.codepen
198209244502120.000125com.calendly
199209232266940.000039com.vox
200209194947310.000038com.git-scm
201209185826100.000044org.unesco
2022091744810080.000030com.about
20320916974710.000469net.facebook
204209158325710.000047org.weforum
205209150844190.000062com.w3schools
206209149783260.000086com.typepad
207209112685150.000052com.squareup
208209077429040.000034com.arstechnica
209209009524730.000055com.nbcnews
210208999243730.000074co.ibb
211208995326320.000042com.withgoogle
212208989288090.000036edu.washington
213208966165210.000051com.inc
214208918208980.000034uk.ac.cam
215208863104050.000066com.sagepub
216208781345430.000049fm.anchor
217208768266830.000040com.apnews
218208756709670.000031com.slate
219208756504420.000059gov.whitehouse
220208726646890.000040com.venturebeat
221208711025300.000050com.pexels
222208666322420.000109org.iana
223208654282600.000105de.amazon
224208620545490.000048gov.noaa
225208608847550.000037me.about
22620858432330.001073com.baidu
2272085640613120.000023org.eclipse
228208542146090.000044com.mysql
229208470142440.000108com.live
230208460646540.000041com.nationalgeographic
231208443588760.000035edu.asu
232208428822990.000094com.ibm
233208390801960.000136jp.co.google
234208358383510.000078com.dribbble
235208354727160.000038ca.cbc
236208280325580.000048org.worldbank
2372082750012780.000023com.nike
238208149144590.000057gov.fda
239208130986030.000044org.pbs
240208114345860.000045gov.loc
241208104904670.000056gov.usda
242208102844850.000054com.gofundme
243208078083160.000088com.feedburner
244208070063290.000084net.windows
2452080527611320.000027com.hollywoodreporter
246208049301610.000187com.staticflickr
2472080469010030.000030org.greenpeace
248208023104920.000054com.tandfonline
249208023083390.000081eu.youronlinechoices
250208017249920.000031app.netlify
2512080137612810.000023com.billboard
252207993386420.000042com.newyorker
253207981948750.000035edu.wisc
254207969367020.000039au.net.abc
255207962729160.000033org.pypi
256207959001760.000159com.office
2572079531812660.000024com.technologyreview
258207849744770.000055com.theconversation
259207828988870.000035org.sciencemag
260207826402530.000105com.jotform
261207794809840.000031com.gizmodo
262207787086730.000040org.cambridge
2632077771412940.000023com.500px
264207772387300.000038com.walmart
265207759005250.000051com.oup
266207732166080.000044com.xinhuanet
267207721444290.000061com.getpocket
2682077059011810.000025edu.umd
269207687725000.000053gov.epa
270207678627090.000039org.bitbucket
2712076739011440.000026edu.purdue
2722076344013830.000022ms.1drv
2732076297410840.000028co.elastic
274207601208910.000034org.semver
275207555444300.000061org.debian
2762075363013080.000023org.kernel
277207497687570.000037com.britannica
278207497169630.000031com.nypost
279207471306380.000042com.elpais
280207446529290.000032com.foxnews
281207383605020.000053com.dailymotion
2822073661211540.000026com.sky
2832073567810000.000030com.uk
284207296882460.000108com.wpengine
2852072889016230.000019com.googlesource
2862072684610070.000030edu.princeton
287207254405480.000048gov.house
288207224365920.000045com.mozilla
28920721772860.000393com.wsimg
2902072165814040.000021com.over-blog
291207189064880.000054com.ted
2922071683816600.000018com.lego
293207159287540.000037gov.justice
2942071483210050.000030uk.co.guardian
295207138564250.000062com.arcgis
2962071348613190.000023com.digitaltrends
297207117507950.000036edu.umich
298207106504280.000061org.openstreetmap
299207095862410.000109net.sourceforge
300207086249470.000032com.ssrn
3012070308816970.000018org.usenix
302207003543860.000070com.netdna-ssl
303206983189350.000032com.ggpht
304206975182320.000113com.amazon-adsystem
305206968383140.000090tv.twitch
306206963309500.000032uk.co.blogspot
3072069613614160.000021com.hatenablog
3082069288411490.000026co.g
309206919642300.000114gov.ca
310206895888010.000036com.politico
3112068924613150.000023com.socialmediatoday
312206864407280.000038org.change
313206855282390.000110uk.org.ico
314206854982230.000119jp.co.yahoo
315206852125880.000045uk.gov.service
316206843541710.000162com.rawgit
317206842322800.000098net.azureedge
3182068182612100.000025io.itch
3192068034613180.000023de.mpg
3202067871415520.000019com.euronews
321206764909640.000031edu.jhu
322206761869400.000032edu.umn
323206750585310.000050site.business
324206727081690.000166com.addtoany
325206717744740.000055gov.hhs
326206701144120.000064com.ebay
3272066893815500.000019com.urbandictionary
3282066486611820.000025com.axios
3292066400812420.000024org.semanticscholar
3302066303411030.000027com.udemy
3312066250013950.000021com.reverbnation
3322065985815050.000020edu.indiana
3332065682414810.000020au.com.news
3342065492410790.000028edu.uchicago
335206542747520.000037org.fao
336206531126220.000043gov.census
3372065269811780.000025net.speedtest
3382065080817710.000017org.phys
33920650016740.000424net.akamaihd
340206479382290.000115com.hubspot
341206425949950.000030com.scientificamerican
3422064146613280.000023com.nymag
3432063887017880.000017com.martinfowler
3442063838216630.000018edu.gatech
345206376805550.000048com.kickstarter
346206355581870.000146com.xing
3472063550611290.000027org.wiktionary
3482063459210420.000029edu.utexas
3492063391223140.000015com.flipboard
350206336345670.000047com.snapchat
3512063324632040.000011com.openai
3522063031614230.000021ch.ethz
3532062952214200.000021com.businessweek
354206291008730.000035watch.fb
355206260181540.000206com.sharethis
356206256729480.000032com.timeanddate
357206250367200.000038org.d3js
3582062457817440.000017com.itv
3592062069012670.000024uk.ac.ucl
3602061883414550.000020uk.co.metro
361206177183200.000087com.statista
362206176025290.000050com.googlecode
3632061737811470.000026com.jetbrains
364206172926140.000044org.ohchr
365206172669150.000033de.spiegel
366206166544720.000055com.meetup
367206165803220.000086com.disqus
368206159663990.000067com.optimizely
3692061541428020.000013com.diigo
370206150482870.000097jp.ne.hatena
3712061436012850.000023com.smithsonianmag
3722061409811520.000026com.scmp
3732061401011100.000027com.foursquare
3742061049026360.000014blog.home
3752061020220200.000016com.knowyourmeme
376206078563530.000078net.themeforest
377206075067330.000038au.gov.nsw
3782060621810780.000028com.chicagotribune
3792060354811640.000026au.com.smh
3802060324815890.000019uk.co.express
3812059998611210.000027edu.nyu
382205995082680.000102com.npmjs
383205985666410.000042gov.senate
384205949746390.000042com.zdnet
3852059426411280.000027link.page
386205915689680.000031com.usps
387205887328900.000035gov.congress
388205868882930.000096com.eepurl
3892058531410020.000030com.history
390205840246770.000040com.pinimg
391205822661410.000221com.paypalobjects
39220581216660.000528com.googleadservices
393205803444500.000058es.google
3942057905227360.000014edu.byu
395205777488990.000034au.com.google
3962057758814500.000021uk.co.standard
397205766327110.000039com.istockphoto
39820572810970.000357net.jsfiddle
399205722022830.000097me.telegram
4002056853613330.000022cn.com.chinadaily
401205683225520.000048ca.google
4022056793611740.000025de.bild
4032056670413940.000022com.producthunt
404205660743920.000069com.proofpoint
405205647889550.000031edu.si
406205624166350.000042org.oecd
4072055958414790.000020ca.ubc
4082055917414670.000020com.wattpad
4092055814221320.000015app.web
410205579568880.000035google.blog
4112055788010950.000028com.dw
412205543187190.000038gov.archives
4132055317214910.000020com.buzzfeednews
414205529965030.000053nl.google
4152055192619210.000016com.mystrikingly
416205513844580.000057com.criteo
4172055086610350.000029uk.co.thetimes
418205496563520.000078com.prnewswire
4192054898214630.000020uk.ac.lse
420205487089740.000031in.co.google
421205482383800.000071com.sohu
4222054495614480.000021uk.co.wired
423205446863890.000069com.atlassian
424205443263590.000077net.php
425205420345270.000050com.matterport
4262054066616380.000018de.ebay
427205365067770.000036com.livejournal
428205354543280.000085ru.ok
4292053513010590.000029gov.treasury
4302053425011940.000025com.sun
4312053369817870.000017com.channel4
432205329003810.000071net.imgix
4332053263819320.000016gov.cia
4342053237010540.000029org.telegram
4352053150010530.000029uk.parliament
4362053109627590.000013ph.telegra
4372053055615090.000020uk.co.thesun
438205299285930.000045edu.cmu
4392052969410700.000028int.coe
440205280584940.000053com.media-amazon
4412052805218140.000017com.hindustantimes
442205277169190.000033com.iconfinder
4432052666410040.000030org.jstor
4442052529015900.000019com.straitstimes
4452052468017670.000017edu.tufts
446205233524180.000062com.elsevier
447205208684900.000054ru.gov
4482052020810830.000028gov.fbi
4492051631013710.000022edu.duke
450205149684080.000065com.adroll
4512051422613440.000022int.itu
4522051302613820.000022de.zeit
4532051242216540.000018com.newscientist
454205115743720.000074com.githubusercontent
4552051153214540.000021com.unity3d
4562050901417120.000018org.maven
457205086049880.000031de.focus
4582050828225250.000015com.storify
4592050653414750.000020com.irishtimes
460205064746270.000043gov.state
461205052687050.000039uk.nhs
4622050518817110.000018com.mercurynews
4632050514611960.000025edu.unc
464205044003110.000090com.mapbox
465205034206000.000044net.ctfassets
4662050308414060.000021jp.ne.goo
4672050137216900.000018org.propublica
468204999169000.000034gov.sba
4692049960027650.000013me.ogp
4702049896015410.000020com.mcafee
4712049829215640.000019com.nydailynews
4722049701213220.000023org.unhcr
4732049250619800.000016com.csmonitor
4742049141616450.000018ca.mcgill
475204909724960.000053org.python
476204884742590.000105gg.discord
4772048772835690.000010net.docdroid
4782048581818810.000016app.vercel
4792048497625570.000015com.instructure
4802048382812630.000024ch.ipcc
4812048115019430.000016io.gitlab
482204811442670.000102com.aliyuncs
4832048039819630.000016com.thoughtco
4842047852210250.000030gov.dhs
4852047793416350.000019com.lenovo
4862047744011240.000027gov.usgs
4872047531410370.000029org.ilo
4882047287812460.000024org.hrw
48920472770950.000363me.wa
490204721524530.000058com.samsung
491204704201420.000219com.salesforce
4922046719628180.000013com.oxforddictionaries
4932046655025390.000015au.com.sbs
494204656524360.000061com.filesusr
4952046410420510.000016com.brave
4962046213211070.000027com.thehill
4972046200812490.000024com.aljazeera
4982046132813270.000023com.brightcove
499204605327800.000036com.thinkwithgoogle
500204602985760.000046org.worldwildlife
5012045783028340.000013sg.edu.nus
502204558924350.000061com.visualstudio
5032045445438170.000009com.minds
5042045402410290.000029edu.brookings
5052045279810880.000028sg.com.google
506204520182960.000095gov.ftc
5072045125020810.000016com.rt
5082045057413350.000022de.welt
509204503268890.000035com.fandom
5102044911013070.000023de.sueddeutsche
511204488504980.000053com.fastcompany
512204482847680.000037com.oreilly
5132044815231820.000011cc.uxdesign
514204474089050.000034com.deviantart
515204426844490.000058com.ssl-images-amazon
5162044257228910.000013org.accessnow
5172044250038090.000009org.edublogs
518204410721590.000192com.jimdo
5192043955622450.000015tl.we
5202043864631430.000012com.instapaper
521204378422080.000125ru.mail
522204363964570.000057com.patreon
5232043619828410.000013com.bloglovin
5242043498415390.000020com.firebaseapp
5252043234235870.000010com.pearltrees
5262042997425650.000015edu.oregonstate
527204281203690.000074com.surveymonkey
528204262224030.000066com.businesswire
5292042590029070.000013org.wikibooks
5302042366211080.000027de.stern
5312042331016530.000018com.warnerbros
5322041899414070.000021be.google
5332041838011480.000026ly.rebrand
5342041610619130.000016edu.ucsb
535204157745980.000044com.airbnb
53620414102980.000356com.messenger
5372041310015660.000019org.rfc-editor
538204130103030.000093net.secureservercdn
5392041283419110.000016co.carrd
5402041263825550.000015it.scoop
541204119207870.000036com.zoho
542204117225390.000050com.gmail
543204112849230.000033com.thelancet
5442041047820230.000016com.dictionary
5452040866246620.000008com.folkd
546204084169530.000032edu.psu
5472040824819750.000016org.documentcloud
5482040767812030.000025org.undp
549204064646970.000039io.readthedocs
5502040627214030.000021net.codecanyon
5512040567631420.000012com.hubpages
552204039586400.000042com.entrepreneur
5532040231018550.000017com.france24
554204005242370.000110to.amzn
5552039957625500.000015gov.lbl
5562039799632960.000011google.ai
5572039739228120.000013com.aboutamazon
5582039573812840.000023com.snopes
5592039474814150.000021int.unfccc
560203943629540.000032com.ubuntu
561203942642130.000125com.aspnetcdn
562203930365610.000047com.steampowered
5632039234430410.000012com.dreamstime
5642039166615270.000020gov.defense
5652039024618290.000017org.iea
5662038772229910.000012com.oregonlive
5672038397233950.000011org.neocities
5682038375216520.000018io.ghost
5692037956826250.000014org.nature
5702037893411800.000025com.prweb
571203782425650.000047com.netflix
5722037787814590.000020mil.army
573203773884990.000053org.nodejs
5742037714422290.000015uk.bl
5752037700820490.000016org.archlinux
5762037639211190.000027com.dell
5772037607430490.000012org.paho
5782037576221030.000016com.thefreedictionary
579203730647010.000039com.docker
5802037249627210.000014org.computer
5812037243626500.000014com.googlegroups
5822037072022330.000015org.ap
5832036947031160.000012com.webbyawards
584203693941380.000229me.line
585203693746060.000044com.investopedia
5862036915431260.000012org.scala-lang
5872036834827380.000014com.msnbc
5882036599215470.000019ca.sfu
5892036351217640.000017com.patch
5902036291412390.000024net.clickbank
5912036208832230.000011de.chip
5922035964032070.000011org.vim
593203579489360.000032org.js
5942035751813900.000022io.shields
5952035737229650.000012org.rsf
5962035294217980.000017gov.usembassy
5972035115413850.000022com.mixpanel
598203505401020.000349com.uservoice
5992035011840860.000009com.bravesites
6002035005227390.000014edu.iastate
6012034994032020.000011com.slides
602203498608930.000034com.office365
6032034911834430.000010org.aclweb
6042034910433750.000011org.google
6052034713834110.000011uk.co.yougov
606203451885790.000046org.unicef
6072034509231950.000011com.dummies
6082034507243950.000008it.justpaste
6092034479633940.000011org.globalcitizen
6102034459018740.000016ca.globalnews
611203438845070.000053com.fc2
612203420345240.000051com.adweek
6132034136228670.000013jp.co.japantimes
6142034050412700.000023com.loom
615203393769370.000032com.digitaloceanspaces
61620339258720.000469com.oculus
617203391267440.000038uk.co.pinterest
6182033877010910.000028com.webs
6192033848634160.000011com.thecvf
6202033242826570.000014ca.ualberta
6212033187422520.000015com.channelnewsasia
6222033156029680.000012in.businessinsider
623203306089820.000031org.mediawiki
6242033043815250.000020com.bol
6252032881023250.000015com.foreignpolicy
626203280385570.000048com.digg
627203261242880.000097com.bandcamp
628203258469590.000031com.variety
6292032510012930.000023org.imf
6302032496211300.000027ly.cutt
6312032387232210.000011org.freedomhouse
6322032338417460.000017us.mn.state
6332032335849700.000007com.sendspace
6342032248043370.000008org.marxists
63520322396640.000540com.trustpilot
636203218343320.000084me.fb
6372032076616990.000018com.ipsos
6382032020216980.000018gov.uscis
639203196802110.000125org.whatwg
6402031912018110.000017eu.politico
6412031874455190.000006com.edocr
6422031826633510.000011de.diplo
6432031652420770.000016com.spreaker
6442031641629610.000012com.space
6452031513418660.000017com.voanews
6462031489627500.000014org.wikidata
6472031389620470.000016dk.google
6482030999810330.000029me.onelink
6492030916036950.000010com.prweek
6502030864435780.000010com.virgin
6512030788232060.000011com.slidesharecdn
652203062306050.000044com.canva
6532030582616150.000019com.indianexpress
6542030536233790.000011com.reason
655203034869080.000034com.imageshack
6562030321238150.000009org.cpj
6572030292211570.000026com.att
658203021067590.000037uk.co.eventbrite
6592030074832330.000011com.hm
660203000127610.000037com.gumroad
6612029980431510.000012de.taz
6622029767836710.000010uk.ac.nhm
663202974729450.000032com.fiverr
6642029732427290.000014com.verywellhealth
665202971105730.000046com.globenewswire
666202968328830.000035com.wikihow
6672029399822510.000015org.ocks
6682029348832110.000011org.iucnredlist
6692029343430080.000012edu.uoregon
6702029320426580.000014com.gfycat
6712029309434090.000011org.oxfam
672202930668050.000036int.wipo
6732029283428510.000013com.fineartamerica
6742029272015010.000020pl.gov
6752029212445360.000008com.backblazeb2
6762029170818180.000017com.jimdosite
6772029095817740.000017com.thestar
6782029048031390.000012org.eji
6792029043242600.000008com.theodysseyonline
6802028928615330.000020com.routledge
6812028671235500.000010uk.co.timesonline
6822028543229000.000013org.gnupg
6832028491825700.000015com.infogram
6842028474619610.000016uk.org.greenend
6852028469422690.000015org.rand
6862028438019010.000016com.surveygizmo
687202835706210.000043br.com.uol
688202832185110.000052org.drupal
6892028283434230.000011org.democracynow
6902028141810570.000029org.unicode
6912027968042480.000008com.roche
6922027910249780.000007re.cli
6932027847229590.000012com.kaggle
6942027834412080.000025cn.news
6952027808221820.000015cc.tiny
6962027756435130.000010org.bitcointalk
6972027696831800.000011com.gawker
6982027552434590.000010com.bigthink
6992027539613620.000022com.jekyllrb
7002027425017350.000017com.justia
7012027330010770.000028com.css-tricks
7022027237828210.000013com.motherjones
7032027215628500.000013edu.nd
7042027207616910.000018org.ourworldindata
7052027109818850.000016ca.on.gov
7062027016628030.000013com.timesofisrael
7072027014036460.000010org.project-syndicate
708202699985090.000052com.mckinsey
709202695961920.000140com.discord
7102026951425530.000015net.openid
7112026946614050.000021org.amnesty
7122026945228420.000013net.vnexpress
7132026829641610.000009com.crayola
7142026829214460.000021gov.uscourts
7152026768220290.000016gov.faa
716202673444840.000055com.onesignal
7172026667023800.000015com.lexisnexis
7182026573032970.000011com.nme
7192026500612310.000024ms.aka
7202026494820430.000016gov.usaid
7212026363210660.000028com.pcmag
7222026348029760.000012com.mathworks
7232026342227960.000013uk.ac.kcl
7242026300227460.000014fr.gouv.diplomatie
7252026213618540.000017org.worldcat
726202605225530.000048ca.youradchoices
7272025773630500.000012org.csis
7282025721633300.000011org.repec
7292025719220030.000016de.ndr
7302025691011930.000025com.playstation
7312025630830830.000012ru.kp
7322025487033760.000011no.uib
733202547126600.000041gov.nist
7342025367431290.000012org.ewg
7352025357025880.000014de.web
736202531329010.000034com.mobirise
7372025267830050.000012au.com.businessinsider
7382025202235860.000010org.polymer-project
739202518465410.000049com.sxsw
740202499586880.000040com.usnews
741202484482090.000125com.myshopify
742202475203420.000081mp.mailchi
743202474948120.000036net.b-cdn
7442024643042360.000008com.mail
7452024638825710.000015com.sina
7462024565815400.000020com.pastebin
7472024472446330.000008com.mysanantonio
7482024472026560.000014org.unctad
7492024392434920.000010com.thejakartapost
7502024340012890.000023org.coursera
7512024300212960.000023com.smashingmagazine
7522024239637140.000010io.fabric
753202423441640.000176de.bund
7542024167036060.000010com.shell
7552024142430640.000012com.biography
7562024111637510.000010com.nwsource
7572024046842890.000008build.bazel
7582024014425290.000015org.medrxiv
7592023691429140.000013com.coca-colacompany
760202365029460.000032com.shutterstock
761202362429490.000032uk.gov.legislation
762202358765180.000052com.herokuapp
763202346546290.000042it.placehold
7642023438057700.000006com.filedropper
7652023422843310.000008org.globalnetworkinitiative
7662023356615980.000019org.altervista
7672023356232980.000011com.sacbee
7682023342625380.000015org.biorxiv
7692023254032130.000011fr.rfi
7702023253429740.000012com.ericsson
7712023253040730.000009com.kinja
772202303829910.000031com.trello
7732022823031080.000012org.oas
7742022765011830.000025com.ycombinator
7752022670018100.000017org.donorbox
7762022631027170.000014com.e-monsite
7772022571210220.000030gov.fcc
7782022527420990.000016org.unodc
7792022406811590.000026com.tableau
78020223780750.000419net.cpanel
7812022230635800.000010org.tigris
7822022207813660.000022com.alexa
7832022199012480.000024gov.uspto
7842022120638280.000009com.wasabisys
7852022108818090.000017com.speakerdeck
7862021979424160.000015com.miamiherald
7872021913837710.000010com.bangkokpost
7882021836811250.000027gov.cms
7892021675011430.000026org.reactjs
790202160265620.000047com.gartner
7912021567211110.000027com.jwplayer
7922021500828440.000013edu.usf
7932021425829030.000013com.thenation
7942021360627300.000014com.washingtontimes
7952021305833550.000011com.wikidot
796202129729600.000031com.hp
797202109046510.000041gov.sec
7982021023220820.000016com.squarespace-cdn
7992020945026470.000014jp.nicovideo
8002020843044130.000008de.otto
8012020760826790.000014ru.kremlin
802202070882510.000106com.cloudinary
803202064125800.000046fr.free
8042020638410160.000030com.podbean
8052020623664750.000006com.uberant
806202061967140.000039org.apa
8072020488026260.000014se.haxx
8082020477040900.000009com.bloombergquint
8092020379219960.000016org.khanacademy
8102020330810810.000028com.engadget
8112020323037050.000010com.allafrica
8122020319032860.000011vn.com.google
8132020274651230.000007to.gplus
8142020174034680.000010my.com.thestar
8152020148436170.000010uk.org.asa
8162020094828960.000013com.simonandschuster
8172020083029040.000013com.lowes
8182020080422230.000015org.wto
819201999822070.000126com.caniuse
820201999822240.000118com.getbootstrap
8212019996626680.000014tv.ustream
8222019981441180.000009uk.co.spectator
823201988702270.000117org.icann
824201983946530.000041org.eff
8252019752234710.000010com.sputniknews
8262019634039820.000009com.manta
8272019599435090.000010uk.ac.qmul
8282019593033780.000011com.eiu
8292019540631850.000011com.financialpost
8302019539829540.000012uk.gov.metoffice
831201939822690.000102com.naver
8322019390022400.000015gov.gao
8332019313211140.000027edu.ucla
8342019242017130.000018fr.blogspot
8352019234234150.000011org.heritage
8362019205244450.000008org.scala-sbt
8372019181436520.000010com.thenationalnews
8382019140841790.000009com.rappler
8392019136042640.000008com.wusa9
8402019039632420.000011org.rferl
8412018971828070.000013ru.kommersant
8422018962238140.000009org.grist
8432018925414850.000020us.imageshack
8442018882213910.000022com.freeprivacypolicy
8452018878026280.000014org.wbur
8462018823650020.000007com.picsart
8472018803048270.000007org.frontlinedefenders
8482018770038440.000009com.newatlas
849201857805360.000050com.wufoo
8502018499614010.000021edu.northwestern
8512018341226730.000014com.fivethirtyeight
852201832527030.000039com.moz
8532018254420500.000016to.dev
8542018186037990.000009de.wwf
8552018174442660.000008com.iconarchive
8562018136637750.000009org.pri
857201794789430.000032com.redhat
85820178530550.000603com.dan
8592017848444900.000008tw.blogspot
8602017761028760.000013com.infoworld
861201757366640.000041com.aliexpress
862201756566850.000040com.photobucket
8632017454637100.000010int.au
8642017238636130.000010org.jenkins-ci
8652017196038070.000009com.obsproject
8662017018427270.000014com.discogs
8672017014845320.000008com.koreaherald
8682016968438380.000009ru.forbes
869201693809750.000031com.stackexchange
8702016757230070.000012com.yougov
8712016722835420.000010ly.plot
8722016687630860.000012org.panda
8732016668034000.000011com.law360
8742016575410640.000028com.emarketer
8752016381045490.000008org.article19
8762016377011090.000027com.merriam-webster
877201632063980.000067com.bitly
8782016195842270.000008com.prevention
8792016171263140.000006org.arkive
8802016165019360.000016com.hackerone
8812016140434500.000010com.news24
8822016138832770.000011com.foreignaffairs
8832016123244510.000008fr.huffingtonpost
884201605264430.000059com.skype
8852015839066530.000006com.booklikes
886201582829130.000033com.marketwatch
8872015800611990.000025org.webkit
8882015772640760.000009au.com.heraldsun
8892015756044760.000008org.siggraph
8902015695016480.000018com.newrelic
8912015638438840.000009gov.fec
8922015590238500.000009org.brainpickings
8932015515036310.000010de.uni-frankfurt
8942015484823390.000015com.w3techs
8952015442832670.000011edu.unh
8962015430243870.000008br.unicamp
89720153190580.000586com.afternic
8982015287657420.000006cc.kknews
8992015261010460.000029com.pwc
9002015235841820.000008com.wallethub
9012015149239870.000009com.collinsdictionary
902201502823120.000090com.webflow
9032015019445690.000008org.firstmonday
9042015016611620.000026com.appnexus
9052014971243710.000008uk.ac.westminster
9062014858447700.000007com.selfridges
9072014852233410.000011com.scotsman
9082014839419420.000016com.ssllabs
9092014767241690.000009com.datacenterknowledge
9102014658229530.000012com.washingtonexaminer
911201463964170.000063com.force
9122014571647390.000007br.ufrgs
9132014559826120.000014ru.ria
9142014507060180.000006com.armorgames
9152014441444780.000008net.middleeasteye
9162014300437430.000010com.thediplomat
9172014193043400.000008com.the-scientist
9182014185236760.000010gov.ornl
9192014133226180.000014gov.energystar
9202014031829270.000013org.wri
9212013986612730.000023org.owasp
9222013750635720.000010org.wilsoncenter
9232013722641940.000008uk.co.manchestereveningnews
924201371984520.000058gov.consumerfinance
9252013718013540.000022com.symantec
926201369368770.000035com.libsyn
9272013693014720.000020com.twilio
9282013678011770.000025com.semrush
9292013675657630.000006net.postheaven
9302013674241070.000009com.crashlytics
9312013634016080.000019com.techrepublic
9322013627814690.000020com.createjs
933201361749110.000033edu.columbia
934201355269580.000031com.buzzsprout
935201350185910.000045net.azurewebsites
9362013485827880.000013org.iucn
9372013445444840.000008com.googledrive
9382013417635560.000010org.sonatype
9392013411818040.000017ly.ow
9402013410445220.000008io.meduza
9412013399415590.000019net.msecnd
9422013397214170.000021com.weather
9432013279613750.000022com.rollingstone
9442013255041670.000009ru.aif
9452013248816490.000018com.upwork
9462013207818320.000017com.chrome
947201314184450.000059com.dmca
9482013093840970.000009org.avaaz
9492012964253340.000007cn.edu.sdu
9502012859025740.000015ru.rbc
951201285266610.000041com.figma
9522012748433910.000011nl.rug
9532012654052100.000007org.sourcewatch
9542012586651480.000007com.wsoctv
9552012554651490.000007com.linodeobjects
9562012546227240.000014int.reliefweb
9572012486031300.000012org.cfr
9582012454228590.000013com.springeropen
959201239703350.000083com.wistia
960201221989900.000031org.json
9612012198258400.000006com.grabcad
9622012111835510.000010ru.vedomosti
9632012088455050.000006org.sfpl
9642012018846370.000008ch.qos
9652011971636930.000010org.escholarship
9662011914839770.000009uk.ac.sussex
967201189823440.000081com.automattic
9682011877841310.000009com.gannett-cdn
9692011734640520.000009edu.scu
9702011727440680.000009org.nationalinterest
9712011708435100.000010com.tradingeconomics
9722011705237020.000010org.thinkprogress
9732011696442400.000008com.dawn
9742011628441660.000009cc.taplink
9752011517831060.000012ca.citizenlab
9762011482826520.000014com.bankrate
9772011468220300.000016com.tutsplus
9782011446214120.000021org.golang
9792011383257060.000006com.london2012
9802011363620330.000016org.linuxfoundation
9812011328018400.000017edu.rutgers
9822011304831790.000011org.undocs
9832011267248640.000007za.co.dailymaverick
9842011262831980.000011com.springernature
9852011255237400.000010au.edu.adelaide
9862011182246650.000008com.mnn
9872011028043120.000008ae.google
9882011027430890.000012org.crossref
9892011026235450.000010com.vox-cdn
9902011025641080.000009com.dailykos
9912010954848820.000007uk.ac.lancs
992201094728630.000036org.ieee
993201080507290.000038ca.canada
9942010684834440.000010org.cato
9952010659437720.000009gov.ustr
996201065624690.000056com.indeed
9972010655434790.000010com.cityam
9982010617241930.000008de.ebay-kleinanzeigen
9992010580410190.000030com.techtarget
1000201044946690.000040gov.copyright

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

August 2022 crawl archive now available

The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content. Page captures are from 46 million hosts or 37 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2022-33/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.
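As a minimal sketch of the prefixing step described above, the following Python snippet decompresses a path listing held in memory and prepends one of the two prefixes to each line. The function name expand_paths and the sample path are illustrative assumptions; a real listing such as warc.paths.gz would first have to be downloaded.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def expand_paths(listing_gz, prefix=HTTP_PREFIX):
    """Decompress a gzipped path listing and prepend `prefix` to each line."""
    with gzip.open(io.BytesIO(listing_gz), mode="rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical one-line listing standing in for a downloaded *.paths.gz file.
sample_listing = gzip.compress(b"crawl-data/CC-MAIN-2022-33/example/file.warc.gz\n")

print(expand_paths(sample_listing))             # HTTP URLs
print(expand_paths(sample_listing, S3_PREFIX))  # S3 URIs
```

The same helper applies to any of the listings in the table, since they share the one-path-per-line layout described above.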

File type                 File List                                 #Files  Total Size Compressed (TiB)
Segments                  CC-MAIN-2022-33/segment.paths.gz          100
WARC files                CC-MAIN-2022-33/warc.paths.gz             80000   68.95
WAT files                 CC-MAIN-2022-33/wat.paths.gz              80000   17.06
WET files                 CC-MAIN-2022-33/wet.paths.gz              80000   7.24
Robots.txt files          CC-MAIN-2022-33/robotstxt.paths.gz        80000   0.15
Non-200 responses files   CC-MAIN-2022-33/non200responses.paths.gz  80000   2.92
URL index files           CC-MAIN-2022-33/cc-index.paths.gz         302     0.2
Columnar URL index files  CC-MAIN-2022-33/cc-index-table.paths.gz   900     0.23

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-33/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

June/July 2022 crawl archive now available

The crawl archive for June/July 2022 is now available! The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The June/July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2022-27/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File type                 File List                                 #Files  Total Size Compressed (TiB)
Segments                  CC-MAIN-2022-27/segment.paths.gz          100
WARC files                CC-MAIN-2022-27/warc.paths.gz             80000   84.08
WAT files                 CC-MAIN-2022-27/wat.paths.gz              80000   20.99
WET files                 CC-MAIN-2022-27/wet.paths.gz              80000   8.97
Robots.txt files          CC-MAIN-2022-27/robotstxt.paths.gz        80000   0.15
Non-200 responses files   CC-MAIN-2022-27/non200responses.paths.gz  80000   1.92
URL index files           CC-MAIN-2022-27/cc-index.paths.gz         302     0.23
Columnar URL index files  CC-MAIN-2022-27/cc-index-table.paths.gz   900     0.26

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-27/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

May 2022 crawl archive now available

The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content. Page captures are from 45 million hosts or 36 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The May crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2022-21/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line gives you the S3 or HTTP path, respectively. Please see accessing the data for detailed instructions.

File type                 File List                                 #Files  Total Size Compressed (TiB)
Segments                  CC-MAIN-2022-21/segment.paths.gz          100
WARC files                CC-MAIN-2022-21/warc.paths.gz             80000   92.81
WAT files                 CC-MAIN-2022-21/wat.paths.gz              80000   23.30
WET files                 CC-MAIN-2022-21/wet.paths.gz              80000   9.80
Robots.txt files          CC-MAIN-2022-21/robotstxt.paths.gz        80000   0.15
Non-200 responses files   CC-MAIN-2022-21/non200responses.paths.gz  80000   2.26
URL index files           CC-MAIN-2022-21/cc-index.paths.gz         302     0.25
Columnar URL index files  CC-MAIN-2022-21/cc-index-table.paths.gz   900     0.28

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2022-21/. Also the columnar index has been updated to contain this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. You may also visit the projects cc-webgraph and cc-pyspark which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.

Host-level graph

The graph consists of 384 million nodes and 2.47 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to span the graph. All types of links are included, even purely “technical” ones pointing to images, JavaScript libraries, web fonts, etc. However, only host names with a valid IANA TLD are used. Consequently, URLs with an IP address as host component are not taken into account when building the host-level graph.

There are 326 million dangling nodes (84.6%) and the largest strongly connected component contains 45.2 million (11.7%) nodes. Dangling nodes stem from

  • hosts that have not been crawled, yet are pointed to from a link on a crawled page
  • hosts without any links pointing to a different host name
  • or hosts which only returned an error page (e.g. HTTP 404)
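
To make the term concrete, here is a small sketch that counts dangling nodes in a toy graph. It takes "dangling" to mean a node with no outgoing edge, an assumption that is consistent with the three cases listed above. The host names and the function name dangling_nodes are made up for illustration.

```python
def dangling_nodes(nodes, edges):
    """Return the nodes that never appear as the source of an edge."""
    sources = {src for src, _dst in edges}
    return sorted(node for node in nodes if node not in sources)


# Toy host graph: only com.example.a links out to other hosts.
nodes = ["com.example.a", "com.example.b", "org.example.c"]
edges = [
    ("com.example.a", "com.example.b"),
    ("com.example.a", "org.example.c"),
]

dangling = dangling_nodes(nodes, edges)
share = len(dangling) / len(nodes)
print(dangling, f"{share:.1%}")
```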

Host names in the graph are in reverse domain name notation and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
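
The conversion can be sketched in a few lines of Python. The function name reverse_host is hypothetical, and the snippet covers only the two rules stated above (strip a leading www., reverse the labels); any further normalization done by the actual pipeline is not modelled.

```python
def reverse_host(host):
    """Drop a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```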

You can download the graph and the ranks of all 384 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/host/ (this requires an account on AWS). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 96 gzip-compressed files listed in two path listings: one for the nodes (vertices), one for the edges (arcs). First, download the path listing and decompress it using “gzip -d” or “gunzip”. By adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing, you get the list of URLs needed to download the entire graph.

Download files of the Common Crawl Oct/Nov/Jan 2021-2022 host-level webgraph

Size      File                                                Description
2.66 GB   cc-main-2021-22-oct-nov-jan-host-vertices.paths.gz  nodes ⟨id, rev host⟩, paths of 32 vertices files
11.76 GB  cc-main-2021-22-oct-nov-jan-host-edges.paths.gz     edges ⟨from_id, to_id⟩, paths of 64 edges files
5.32 GB   cc-main-2021-22-oct-nov-jan-host.graph              graph in BVGraph format
2 kB      cc-main-2021-22-oct-nov-jan-host.properties
5.78 GB   cc-main-2021-22-oct-nov-jan-host-t.graph            transpose of the graph (outlinks inverted to inlinks)
2 kB      cc-main-2021-22-oct-nov-jan-host-t.properties
1 kB      cc-main-2021-22-oct-nov-jan-host.stats              WebGraph statistics
6.38 GB   cc-main-2021-22-oct-nov-jan-host-ranks.txt.gz       harmonic centrality and pagerank

Domain-level graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org. Version (commit) 68b67d3 of the public suffix list was used (commit date 2022-03-04).
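To make the aggregation step concrete, here is a simplified Python sketch. It is not the actual pipeline: SAMPLE_SUFFIXES is a tiny stand-in for the public suffix list (whose wildcard and exception rules are ignored), the function names are hypothetical, and dropping links that stay within one domain is an assumption of this sketch.

```python
from collections import Counter

# Toy stand-in for the public suffix list, in reversed notation (assumption).
SAMPLE_SUFFIXES = {"com", "org", "uk.co"}


def reversed_domain(rev_host: str, suffixes=SAMPLE_SUFFIXES) -> str:
    """Map a reversed host name to its reversed pay-level domain:
    the longest matching public suffix plus one more label."""
    labels = rev_host.split(".")
    for n in range(len(labels) - 1, 0, -1):  # try the longest suffix first
        if ".".join(labels[:n]) in suffixes:
            return ".".join(labels[: n + 1])
    return rev_host  # no known suffix: leave unchanged


def aggregate_edges(host_edges):
    """Collapse host-level edges into distinct domain-level edges.
    Links within one domain are dropped (an assumption of this sketch)."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = reversed_domain(src), reversed_domain(dst)
        if s != d:
            domain_edges.add((s, d))
    return domain_edges


def hosts_per_domain(rev_hosts):
    """Count the distinct hosts that fall under each domain."""
    return Counter(reversed_domain(h) for h in set(rev_hosts))


hosts = ["com.example", "com.example.blog", "uk.co.bbc.news", "org.example"]
host_edges = [
    ("com.example.blog", "com.example"),
    ("com.example.blog", "uk.co.bbc.news"),
    ("com.example", "uk.co.bbc.news"),
    ("org.example", "com.example"),
]
domain_edges = aggregate_edges(host_edges)
num_hosts = hosts_per_domain(hosts)
```

The four host-level edges collapse into two domain-level edges, and com.example is counted with two hosts.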

The domain-level graph has 90 million nodes and 1.55 billion edges. 45 million nodes (50%) are dangling nodes, and the largest strongly connected component covers 36 million nodes (40%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/domain/.

Download files of the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph

Size | File | Description
0.62 GB | cc-main-2021-22-oct-nov-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
6.36 GB | cc-main-2021-22-oct-nov-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
3.65 GB | cc-main-2021-22-oct-nov-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2021-22-oct-nov-jan-domain.properties |
3.53 GB | cc-main-2021-22-oct-nov-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2021-22-oct-nov-jan-domain-t.properties |
1 kB | cc-main-2021-22-oct-nov-jan-domain.stats | WebGraph statistics
1.93 GB | cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz | harmonic centrality and pagerank
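The ranks file can be processed line by line once decompressed. The Python sketch below rests on assumptions about the file layout that should be checked against the actual file: tab-separated fields in the same order as the top-1000 table in this post (harmonic centrality rank and value, PageRank rank and value, reversed domain name) and optional header lines starting with “#”. The sample values are taken from the first three rows of that table.

```python
from typing import NamedTuple


class Rank(NamedTuple):
    hc_rank: int
    hc_value: float
    pr_rank: int
    pr_value: float
    rev_domain: str


def parse_rank_line(line: str) -> Rank:
    """Parse one tab-separated line (column order is an assumption)."""
    f = line.rstrip("\n").split("\t")
    return Rank(int(f[0]), float(f[1]), int(f[2]), float(f[3]), f[4])


def top_by_pagerank(lines, n=10):
    """Return the n best entries ordered by PageRank rank,
    skipping blank lines and '#' header lines."""
    ranks = [
        parse_rank_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
    return sorted(ranks, key=lambda r: r.pr_rank)[:n]


SAMPLE = [
    "1\t31448418\t1\t0.017921\tcom.googleapis",
    "2\t30105086\t3\t0.013006\tcom.facebook",
    "3\t28991726\t2\t0.014028\tcom.google",
]
top2 = top_by_pagerank(SAMPLE, n=2)
```

For the full file, lines can be streamed with gzip.open(path, "rt") rather than loaded into a list, keeping only the best n entries.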

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 90 million domain ranks is available for download.

Top 1000 domains ranked by harmonic centrality (Oct/Nov/Jan 2021-2022)

harmonic centrality rank | hc value | pagerank rank | pagerank value | reversed domain name
1 | 31448418 | 1 | 0.017921 | com.googleapis
2 | 30105086 | 3 | 0.013006 | com.facebook
3 | 28991726 | 2 | 0.014028 | com.google
4 | 26233886 | 6 | 0.007154 | org.w
5 | 26230152 | 4 | 0.008081 | com.twitter
6 | 25864994 | 8 | 0.006261 | com.youtube
7 | 25325250 | 7 | 0.006558 | com.instagram
8 | 24665160 | 5 | 0.007716 | com.googletagmanager
9 | 24097224 | 9 | 0.004716 | org.gmpg
10 | 23219788 | 11 | 0.003948 | com.gstatic
11 | 23145196 | 13 | 0.003418 | com.linkedin
12 | 22256010 | 10 | 0.004364 | com.cloudflare
13 | 21953630 | 17 | 0.001942 | com.gravatar
14 | 21841462 | 21 | 0.001594 | com.pinterest
15 | 21786278 | 14 | 0.003223 | org.wordpress
16 | 21441268 | 25 | 0.001417 | org.wikipedia
17 | 21311496 | 16 | 0.002057 | com.apple
18 | 21176702 | 33 | 0.001077 | com.wordpress
19 | 21045484 | 32 | 0.001141 | com.vimeo
20 | 21032900 | 15 | 0.002070 | com.bootstrapcdn
21 | 20995456 | 41 | 0.000913 | be.youtu
22 | 20796928 | 18 | 0.001658 | com.jquery
23 | 20726342 | 28 | 0.001191 | com.microsoft
24 | 20673758 | 45 | 0.000789 | com.blogspot
25 | 20634464 | 23 | 0.001451 | io.polyfill
26 | 20617114 | 47 | 0.000775 | gl.goo
27 | 20589400 | 49 | 0.000736 | com.amazon
28 | 20486426 | 64 | 0.000550 | ly.bit
29 | 20476754 | 29 | 0.001178 | com.wixstatic
30 | 20455806 | 50 | 0.000729 | com.wp
31 | 20452992 | 22 | 0.001551 | net.cloudfront
32 | 20445298 | 40 | 0.000962 | com.amazonaws
33 | 20410144 | 43 | 0.000863 | org.mozilla
34 | 20403210 | 31 | 0.001162 | net.jsdelivr
35 | 20395958 | 51 | 0.000721 | eu.europa
36 | 20389728 | 37 | 0.000994 | com.google-analytics
37 | 20343648 | 20 | 0.001595 | com.fontawesome
38 | 20336692 | 91 | 0.000384 | com.tumblr
39 | 20333242 | 19 | 0.001648 | com.adobe
40 | 20324398 | 24 | 0.001421 | com.github
41 | 20181488 | 75 | 0.000480 | com.googleusercontent
42 | 20150384 | 55 | 0.000684 | com.flickr
43 | 20122758 | 104 | 0.000326 | com.yahoo
44 | 20112952 | 57 | 0.000670 | com.paypal
45 | 20106820 | 48 | 0.000752 | io.github
46 | 20102750 | 106 | 0.000314 | com.reddit
47 | 20060046 | 117 | 0.000268 | com.soundcloud
48 | 20049174 | 38 | 0.000966 | com.googlesyndication
49 | 20043166 | 81 | 0.000425 | com.medium
50 | 20007290 | 53 | 0.000703 | org.w3
51 | 19994710 | 127 | 0.000231 | com.nytimes
52 | 19974404 | 62 | 0.000611 | co.t
53 | 19956144 | 102 | 0.000338 | com.weebly
54 | 19955606 | 114 | 0.000277 | com.spotify
55 | 19925440 | 58 | 0.000656 | com.whatsapp
56 | 19906786 | 34 | 0.001038 | ru.yandex
57 | 19901846 | 85 | 0.000401 | org.creativecommons
58 | 19894214 | 136 | 0.000208 | org.archive
59 | 19870846 | 183 | 0.000139 | com.cnn
60 | 19863762 | 63 | 0.000608 | org.schema
61 | 19855684 | 60 | 0.000645 | com.addthis
62 | 19851344 | 146 | 0.000194 | com.forbes
63 | 19836582 | 203 | 0.000127 | uk.co.bbc
64 | 19834544 | 70 | 0.000513 | com.shopify
65 | 19811758 | 234 | 0.000113 | com.washingtonpost
66 | 19808664 | 69 | 0.000523 | com.vk
67 | 19804862 | 185 | 0.000138 | com.bing
68 | 19803622 | 147 | 0.000193 | gov.cdc
69 | 19802914 | 157 | 0.000172 | int.who
70 | 19798680 | 92 | 0.000383 | me.wp
71 | 19780956 | 44 | 0.000809 | net.doubleclick
72 | 19760738 | 141 | 0.000201 | gov.nih
73 | 19749646 | 59 | 0.000649 | com.macromedia
74 | 19748198 | 71 | 0.000506 | com.unpkg
75 | 19747664 | 260 | 0.000103 | net.researchgate
76 | 19734110 | 252 | 0.000106 | com.wsj
77 | 19733152 | 293 | 0.000093 | edu.stanford
78 | 19731674 | 232 | 0.000115 | com.imdb
79 | 19724760 | 188 | 0.000134 | org.wikimedia
80 | 19712934 | 39 | 0.000965 | net.fbcdn
81 | 19700958 | 206 | 0.000124 | com.businessinsider
82 | 19690114 | 143 | 0.000200 | com.dropbox
83 | 19687820 | 284 | 0.000095 | edu.mit
84 | 19683060 | 100 | 0.000363 | com.list-manage
85 | 19682338 | 308 | 0.000089 | com.tinyurl
86 | 19665976 | 27 | 0.001204 | org.apache
87 | 19657446 | 163 | 0.000157 | com.theguardian
88 | 19651458 | 299 | 0.000092 | com.android
89 | 19647768 | 445 | 0.000067 | com.quora
90 | 19643018 | 165 | 0.000156 | org.doi
91 | 19636594 | 328 | 0.000085 | com.go
92 | 19629826 | 240 | 0.000109 | com.bloomberg
93 | 19629762 | 274 | 0.000098 | edu.harvard
94 | 19627444 | 501 | 0.000060 | com.msn
95 | 19625870 | 164 | 0.000157 | com.issuu
96 | 19625646 | 254 | 0.000106 | com.oracle
97 | 19624678 | 344 | 0.000083 | com.springer
98 | 19621946 | 149 | 0.000190 | com.wixsite
99 | 19615182 | 140 | 0.000203 | us.zoom
100 | 19612484 | 131 | 0.000220 | com.npmjs
101 | 19608812 | 111 | 0.000294 | me.t
102 | 19608314 | 334 | 0.000084 | com.slack
103 | 19602354 | 124 | 0.000241 | com.mailchimp
104 | 19596192 | 244 | 0.000108 | com.stackoverflow
105 | 19592312 | 303 | 0.000091 | com.reuters
106 | 19587800 | 321 | 0.000087 | com.techcrunch
107 | 19586192 | 505 | 0.000059 | com.myspace
108 | 19586126 | 241 | 0.000109 | com.twimg
109 | 19584238 | 166 | 0.000155 | com.giphy
110 | 19583908 | 292 | 0.000093 | com.example
111 | 19577384 | 52 | 0.000709 | com.fb
112 | 19576726 | 167 | 0.000153 | com.yelp
113 | 19576020 | 169 | 0.000151 | com.office
114 | 19572502 | 341 | 0.000083 | com.prnewswire
115 | 19567966 | 190 | 0.000133 | com.unsplash
116 | 19564492 | 108 | 0.000309 | de.google
117 | 19561016 | 300 | 0.000091 | com.wiley
118 | 19555072 | 46 | 0.000785 | net.facebook
119 | 19552510 | 268 | 0.000101 | org.un
120 | 19549098 | 256 | 0.000105 | com.sciencedirect
121 | 19548162 | 489 | 0.000061 | com.latimes
122 | 19547508 | 594 | 0.000050 | com.livejournal
123 | 19547452 | 145 | 0.000196 | gle.forms
124 | 19544300 | 466 | 0.000063 | uk.co.telegraph
125 | 19543690 | 402 | 0.000078 | com.nature
126 | 19542384 | 340 | 0.000083 | org.npr
127 | 19541420 | 484 | 0.000062 | com.ted
128 | 19535332 | 514 | 0.000057 | edu.berkeley
129 | 19533106 | 647 | 0.000046 | com.vice
130 | 19531328 | 237 | 0.000110 | org.gnu
131 | 19529798 | 198 | 0.000130 | org.ietf
132 | 19527852 | 714 | 0.000042 | uk.ac.cam
133 | 19524938 | 434 | 0.000071 | com.time
134 | 19522382 | 289 | 0.000094 | com.bbc
135 | 19517460 | 497 | 0.000060 | com.goodreads
136 | 19513616 | 412 | 0.000076 | org.arxiv
137 | 19510034 | 306 | 0.000090 | com.cnbc
138 | 19508184 | 144 | 0.000197 | com.ytimg
139 | 19492206 | 700 | 0.000043 | edu.columbia
140 | 19483642 | 433 | 0.000071 | com.sagepub
141 | 19475908 | 153 | 0.000186 | com.ft
142 | 19474592 | 173 | 0.000149 | org.acm
143 | 19473316 | 323 | 0.000086 | com.githubusercontent
144 | 19473132 | 438 | 0.000069 | com.cnet
145 | 19468042 | 120 | 0.000254 | com.youtube-nocookie
146 | 19458540 | 409 | 0.000077 | com.wired
147 | 19455634 | 205 | 0.000125 | com.imgur
148 | 19454210 | 518 | 0.000057 | uk.co.dailymail
149 | 19449628 | 202 | 0.000127 | com.blogger
150 | 19448116 | 79 | 0.000459 | com.godaddy
151 | 19443158 | 26 | 0.001330 | cn.gov.miit
152 | 19437758 | 343 | 0.000083 | com.theverge
153 | 19435568 | 509 | 0.000058 | edu.yale
154 | 19434236 | 171 | 0.000150 | org.ampproject
155 | 19426446 | 472 | 0.000063 | com.nationalgeographic
156 | 19423252 | 281 | 0.000096 | com.squarespace
157 | 19419880 | 804 | 0.000037 | org.chromium
158 | 19416530 | 696 | 0.000043 | uk.ac.ox
159 | 19413138 | 488 | 0.000061 | com.googleblog
160 | 19407848 | 458 | 0.000064 | gov.whitehouse
161 | 19407614 | 304 | 0.000090 | com.usatoday
162 | 19407228 | 499 | 0.000060 | com.staticflickr
163 | 19407066 | 829 | 0.000036 | com.evernote
164 | 19404734 | 192 | 0.000131 | com.hubspot
165 | 19404414 | 476 | 0.000062 | org.ieee
166 | 19402398 | 587 | 0.000051 | org.worldbank
167 | 19402004 | 407 | 0.000077 | com.dribbble
168 | 19400214 | 132 | 0.000219 | com.statcounter
169 | 19399796 | 512 | 0.000058 | ee.linktr
170 | 19399358 | 527 | 0.000056 | edu.cornell
171 | 19395500 | 123 | 0.000243 | com.sharethis
172 | 19388246 | 519 | 0.000057 | com.theatlantic
173 | 19387454 | 408 | 0.000077 | com.docker
174 | 19385182 | 673 | 0.000044 | com.git-scm
175 | 19383304 | 214 | 0.000122 | com.wpengine
176 | 19383030 | 797 | 0.000037 | org.sciencemag
177 | 19381750 | 811 | 0.000037 | com.arstechnica
178 | 19381474 | 275 | 0.000098 | gg.discord
179 | 19381222 | 159 | 0.000167 | com.zendesk
180 | 19379042 | 264 | 0.000101 | uk.co.google
181 | 19376664 | 138 | 0.000206 | me.line
182 | 19373526 | 246 | 0.000107 | uk.co.amazon
183 | 19373448 | 602 | 0.000050 | com.zdnet
184 | 19372154 | 239 | 0.000109 | net.slideshare
185 | 19369046 | 325 | 0.000086 | com.appspot
186 | 19368866 | 702 | 0.000042 | com.economist
187 | 19368620 | 737 | 0.000041 | org.cambridge
188 | 19368112 | 571 | 0.000052 | com.cisco
189 | 19367144 | 792 | 0.000038 | edu.washington
190 | 19366344 | 648 | 0.000046 | org.weforum
191 | 19361760 | 668 | 0.000044 | com.box
192 | 19361692 | 624 | 0.000047 | org.pbs
193 | 19356220 | 479 | 0.000062 | org.python
194 | 19355604 | 436 | 0.000070 | com.huffingtonpost
195 | 19354284 | 226 | 0.000117 | com.outlook
196 | 19353112 | 542 | 0.000055 | com.typepad
197 | 19350044 | 288 | 0.000095 | org.pewresearch
198 | 19345610 | 558 | 0.000054 | com.cbsnews
199 | 19342774 | 267 | 0.000101 | net.windows
200 | 19334638 | 556 | 0.000054 | com.deloitte
201 | 19333688 | 1058 | 0.000028 | com.rollingstone
202 | 19333260 | 496 | 0.000060 | com.pixabay
203 | 19333260 | 554 | 0.000054 | gov.usda
204 | 19332390 | 720 | 0.000042 | google.blog
205 | 19331286 | 572 | 0.000052 | site.business
206 | 19330802 | 544 | 0.000055 | uk.co.independent
207 | 19325162 | 1223 | 0.000025 | ly.cutt
208 | 19323700 | 42 | 0.000906 | com.qq
209 | 19323606 | 760 | 0.000039 | com.apnews
210 | 19323302 | 745 | 0.000040 | ca.cbc
211 | 19322838 | 639 | 0.000046 | org.unesco
212 | 19322434 | 493 | 0.000061 | com.gitlab
213 | 19307948 | 704 | 0.000042 | com.mysql
214 | 19307400 | 611 | 0.000049 | com.pexels
215 | 19302186 | 586 | 0.000051 | gov.loc
216 | 19297622 | 728 | 0.000041 | edu.upenn
217 | 19296262 | 713 | 0.000042 | edu.wisc
218 | 19295886 | 485 | 0.000062 | com.getpocket
219 | 19295020 | 553 | 0.000054 | com.nbcnews
220 | 19294010 | 451 | 0.000065 | com.fastcompany
221 | 19292592 | 1381 | 0.000022 | com.ikea
222 | 19291294 | 178 | 0.000143 | com.tripadvisor
223 | 19286758 | 1171 | 0.000026 | org.eclipse
224 | 19285934 | 676 | 0.000043 | com.scribd
225 | 19285000 | 768 | 0.000039 | com.shutterstock
226 | 19284178 | 634 | 0.000047 | com.mozilla
227 | 19282738 | 1225 | 0.000025 | org.kernel
228 | 19277868 | 798 | 0.000037 | uk.co.blogspot
229 | 19277480 | 784 | 0.000038 | com.qz
230 | 19272246 | 859 | 0.000034 | com.ggpht
231 | 19271988 | 318 | 0.000087 | com.live
232 | 19271080 | 979 | 0.000030 | uk.co.guardian
233 | 19268934 | 301 | 0.000091 | com.w3schools
234 | 19265398 | 1924 | 0.000018 | com.lego
235 | 19264382 | 498 | 0.000060 | gov.irs
236 | 19262148 | 885 | 0.000034 | edu.jhu
237 | 19260006 | 672 | 0.000044 | com.buzzfeed
238 | 19259708 | 674 | 0.000044 | uk.co.eventbrite
239 | 19259366 | 796 | 0.000038 | com.trello
240 | 19258458 | 1181 | 0.000025 | com.technologyreview
241 | 19256192 | 984 | 0.000030 | com.playstation
242 | 19255014 | 1057 | 0.000028 | fr.lemonde
243 | 19251458 | 552 | 0.000054 | com.squareup
244 | 19251266 | 550 | 0.000054 | com.fortune
245 | 19249814 | 507 | 0.000059 | gov.nasa
246 | 19249644 | 751 | 0.000040 | me.about
247 | 19244186 | 596 | 0.000050 | com.oup
248 | 19243256 | 283 | 0.000095 | net.behance
249 | 19241750 | 973 | 0.000031 | com.foursquare
250 | 19240364 | 2212 | 0.000017 | com.hbo
251 | 19238806 | 635 | 0.000047 | fm.anchor
252 | 19237212 | 327 | 0.000085 | com.disqus
253 | 19235880 | 915 | 0.000033 | com.slate
254 | 19234328 | 54 | 0.000697 | co.g
255 | 19232270 | 36 | 0.000996 | com.baidu
256 | 19231094 | 456 | 0.000065 | com.bigcommerce
257 | 19229430 | 175 | 0.000146 | jp.co.google
258 | 19229270 | 249 | 0.000107 | com.calendly
259 | 19229228 | 756 | 0.000040 | com.vox
260 | 19227564 | 557 | 0.000054 | com.dailymotion
261 | 19226480 | 654 | 0.000045 | com.investopedia
262 | 19225524 | 860 | 0.000034 | com.ubuntu
263 | 19225146 | 272 | 0.000099 | com.bandcamp
264 | 19224258 | 1564 | 0.000021 | com.hatenablog
265 | 19222446 | 1087 | 0.000027 | co.elastic
266 | 19221444 | 724 | 0.000042 | com.newyorker
267 | 19221360 | 854 | 0.000035 | com.about
268 | 19220762 | 460 | 0.000064 | com.arcgis
269 | 19219440 | 814 | 0.000037 | com.variety
270 | 19219024 | 781 | 0.000038 | au.net.abc
271 | 19217228 | 551 | 0.000054 | com.elpais
272 | 19215612 | 822 | 0.000036 | edu.ucla
273 | 19214942 | 802 | 0.000037 | gov.congress
274 | 19214422 | 677 | 0.000043 | org.apa
275 | 19212402 | 740 | 0.000041 | com.freepik
276 | 19211346 | 1126 | 0.000026 | com.steamcommunity
277 | 19210898 | 250 | 0.000107 | gov.ca
278 | 19209850 | 773 | 0.000039 | org.pypi
279 | 19209526 | 719 | 0.000042 | com.libsyn
280 | 19208308 | 865 | 0.000034 | edu.princeton
281 | 19207924 | 142 | 0.000200 | com.opera
282 | 19207284 | 940 | 0.000032 | com.nypost
283 | 19205296 | 813 | 0.000037 | edu.umich
284 | 19204192 | 1339 | 0.000023 | com.billboard
285 | 19204056 | 312 | 0.000088 | com.typeform
286 | 19201904 | 280 | 0.000097 | com.feedburner
287 | 19200550 | 871 | 0.000034 | com.ssrn
288 | 19199714 | 546 | 0.000055 | com.tandfonline
289 | 19197908 | 905 | 0.000033 | com.podbean
290 | 19197754 | 224 | 0.000117 | page.g
291 | 19197604 | 827 | 0.000036 | org.fao
292 | 19197538 | 868 | 0.000034 | com.foxnews
293 | 19197126 | 964 | 0.000031 | com.merriam-webster
294 | 19196242 | 1093 | 0.000027 | edu.purdue
295 | 19190330 | 1579 | 0.000021 | ca.ubc
296 | 19186772 | 710 | 0.000042 | org.bitbucket
297 | 19186418 | 101 | 0.000349 | com.wix
298 | 19183972 | 1075 | 0.000028 | org.owasp
299 | 19182372 | 273 | 0.000099 | com.ibm
300 | 19180986 | 955 | 0.000031 | com.newsweek
301 | 19178290 | 688 | 0.000043 | org.semver
302 | 19177950 | 251 | 0.000106 | org.bbb
303 | 19170374 | 1403 | 0.000022 | ca.sfu
304 | 19169910 | 2155 | 0.000017 | com.discovery
305 | 19169706 | 1454 | 0.000022 | uk.co.metro
306 | 19168948 | 403 | 0.000078 | org.openstreetmap
307 | 19168128 | 928 | 0.000032 | com.webs
308 | 19164224 | 277 | 0.000098 | com.eepurl
309 | 19163312 | 399 | 0.000079 | com.netdna-ssl
310 | 19162670 | 310 | 0.000089 | com.wistia
311 | 19160768 | 1301 | 0.000023 | app.netlify
312 | 19160242 | 910 | 0.000033 | com.nasdaq
313 | 19158884 | 637 | 0.000047 | gov.senate
314 | 19158734 | 135 | 0.000212 | com.filesusr
315 | 19156792 | 528 | 0.000056 | com.snapchat
316 | 19156732 | 317 | 0.000088 | tv.twitch
317 | 19155472 | 1352 | 0.000023 | uk.co.standard
318 | 19153634 | 977 | 0.000030 | com.uk
319 | 19150454 | 669 | 0.000044 | org.eff
320 | 19148706 | 1433 | 0.000022 | io.gitlab
321 | 19146762 | 1609 | 0.000021 | com.warnerbros
322 | 19145448 | 1025 | 0.000029 | com.techradar
323 | 19144618 | 1173 | 0.000026 | com.500px
324 | 19143630 | 1158 | 0.000026 | com.pastebin
325 | 19141644 | 583 | 0.000051 | gov.epa
326 | 19140574 | 820 | 0.000036 | com.theconversation
327 | 19139294 | 1147 | 0.000026 | org.semanticscholar
328 | 19139130 | 125 | 0.000238 | com.rawgit
329 | 19138842 | 1228 | 0.000025 | com.sky
330 | 19136534 | 1747 | 0.000019 | com.flipboard
331 | 19134802 | 454 | 0.000065 | com.ebay
332 | 19133816 | 204 | 0.000125 | com.amazon-adsystem
333 | 19133028 | 599 | 0.000050 | edu.cmu
334 | 19132926 | 1282 | 0.000024 | edu.illinois
335 | 19132710 | 1327 | 0.000023 | org.greenpeace
336 | 19131792 | 429 | 0.000073 | com.optimizely
337 | 19130562 | 1668 | 0.000020 | com.urbandictionary
338 | 19130312 | 201 | 0.000127 | org.iana
339 | 19129546 | 609 | 0.000049 | gov.house
340 | 19129270 | 98 | 0.000373 | com.stripe
341 | 19127004 | 463 | 0.000064 | org.opensource
342 | 19125378 | 247 | 0.000107 | com.cloudinary
343 | 19122456 | 903 | 0.000033 | edu.academia
344 | 19120070 | 1182 | 0.000025 | org.mitre
345 | 19119992 | 1008 | 0.000030 | gov.usgs
346 | 19119044 | 215 | 0.000122 | net.sourceforge
347 | 19118924 | 2117 | 0.000018 | com.channel4
348 | 19118136 | 1489 | 0.000022 | uk.co.thesun
349 | 19117516 | 1558 | 0.000021 | com.deadline
350 | 19114610 | 996 | 0.000030 | com.thehill
351 | 19113576 | 897 | 0.000033 | edu.umn
352 | 19113226 | 758 | 0.000040 | gov.justice
353 | 19110186 | 1643 | 0.000020 | org.maven
354 | 19110016 | 156 | 0.000173 | com.addtoany
355 | 19108184 | 404 | 0.000077 | com.criteo
356 | 19103220 | 2356 | 0.000016 | com.freep
357 | 19101368 | 128 | 0.000230 | com.paypalobjects
358 | 19098770 | 1133 | 0.000026 | com.nikkei
359 | 19098758 | 457 | 0.000065 | es.google
360 | 19096448 | 678 | 0.000043 | org.oecd
361 | 19093328 | 1291 | 0.000024 | org.postgresql
362 | 19092214 | 1941 | 0.000018 | com.euronews
363 | 19091722 | 772 | 0.000039 | gov.archives
364 | 19091262 | 1584 | 0.000021 | com.reverbnation
365 | 19090376 | 1138 | 0.000026 | uk.co.mirror
366 | 19088776 | 564 | 0.000053 | com.kickstarter
367 | 19087460 | 2766 | 0.000013 | edu.byu
368 | 19087070 | 1338 | 0.000023 | edu.hbs
369 | 19085414 | 1382 | 0.000022 | com.googlesource
370 | 19084726 | 2116 | 0.000018 | edu.wustl
371 | 19084458 | 1010 | 0.000030 | com.politico
372 | 19083222 | 2226 | 0.000017 | org.nobelprize
373 | 19082584 | 1111 | 0.000027 | com.dw
374 | 19082284 | 1022 | 0.000029 | com.pingdom
375 | 19082194 | 548 | 0.000054 | com.walmart
376 | 19078996 | 84 | 0.000405 | net.jsfiddle
377 | 19078568 | 1284 | 0.000024 | ch.ethz
378 | 19073766 | 2299 | 0.000016 | gov.cia
379 | 19073478 | 983 | 0.000030 | com.salon
380 | 19073042 | 715 | 0.000042 | org.change
381 | 19072810 | 1184 | 0.000025 | com.theglobeandmail
382 | 19071472 | 462 | 0.000064 | com.elsevier
383 | 19071222 | 1837 | 0.000019 | com.storify
384 | 19066438 | 137 | 0.000207 | de.bund
385 | 19066384 | 121 | 0.000250 | com.jimdo
386 | 19066344 | 1429 | 0.000022 | edu.gatech
387 | 19064084 | 67 | 0.000527 | net.typekit
388 | 19062738 | 1310 | 0.000023 | com.digitaltrends
389 | 19062478 | 1286 | 0.000024 | int.unfccc
390 | 19061312 | 810 | 0.000037 | au.com.google
391 | 19059746 | 1026 | 0.000029 | gov.treasury
392 | 19059514 | 2283 | 0.000016 | com.mystrikingly
393 | 19059110 | 906 | 0.000033 | com.britannica
394 | 19058170 | 1272 | 0.000024 | edu.ucdavis
395 | 19057558 | 904 | 0.000033 | uk.parliament
396 | 19056260 | 76 | 0.000468 | me.fb
397 | 19054320 | 1027 | 0.000029 | com.mdpi
398 | 19052746 | 1244 | 0.000024 | com.aljazeera
399 | 19052144 | 207 | 0.000124 | com.etsy
400 | 19052062 | 269 | 0.000100 | net.azureedge
401 | 19051812 | 1063 | 0.000028 | gov.fbi
402 | 19051720 | 1608 | 0.000021 | ms.1drv
403 | 19048836 | 824 | 0.000036 | com.bmj
404 | 19048654 | 1325 | 0.000023 | de.mpg
405 | 19047424 | 2538 | 0.000014 | com.virustotal
406 | 19047122 | 1106 | 0.000027 | org.nejm
407 | 19046336 | 211 | 0.000123 | com.tiktok
408 | 19045754 | 194 | 0.000131 | org.nodejs
409 | 19044892 | 2566 | 0.000014 | com.diigo
410 | 19043614 | 1183 | 0.000025 | com.scmp
411 | 19042800 | 1095 | 0.000027 | au.com.smh
412 | 19042258 | 718 | 0.000042 | org.d3js
413 | 19041804 | 1311 | 0.000023 | com.history
414 | 19041506 | 1178 | 0.000026 | org.hrw
415 | 19039754 | 1235 | 0.000025 | uk.ac.ucl
416 | 19038356 | 1213 | 0.000025 | com.socialmediatoday
417 | 19035776 | 1038 | 0.000029 | edu.uchicago
418 | 19033774 | 2710 | 0.000013 | com.thecvf
419 | 19030356 | 968 | 0.000031 | org.readthedocs
420 | 19030118 | 61 | 0.000612 | com.googleadservices
421 | 19029814 | 1018 | 0.000029 | org.jstor
422 | 19029070 | 833 | 0.000035 | com.pinimg
423 | 19028696 | 2486 | 0.000015 | com.oxforddictionaries
424 | 19028410 | 2145 | 0.000017 | com.discogs
425 | 19027476 | 2568 | 0.000014 | edu.buffalo
426 | 19023352 | 1810 | 0.000019 | com.buzzfeednews
427 | 19023196 | 957 | 0.000031 | watch.fb
428 | 19022492 | 1136 | 0.000026 | org.sphinx-doc
429 | 19022490 | 1851 | 0.000018 | com.spreaker
430 | 19022078 | 1548 | 0.000021 | com.irishtimes
431 | 19021850 | 608 | 0.000049 | com.biomedcentral
432 | 19019438 | 1452 | 0.000022 | uk.ac.lse
433 | 19018308 | 469 | 0.000063 | org.hbr
434 | 19018106 | 396 | 0.000079 | com.statista
435 | 19017386 | 1045 | 0.000029 | com.substack
436 | 19014182 | 285 | 0.000095 | ru.ok
437 | 19013374 | 3509 | 0.000010 | com.quizlet
438 | 19012470 | 807 | 0.000037 | com.deviantart
439 | 19010602 | 1115 | 0.000027 | org.undp
440 | 19005532 | 1959 | 0.000018 | com.rt
441 | 19004652 | 1004 | 0.000030 | org.ilo
442 | 19004416 | 3352 | 0.000011 | cc.uxdesign
443 | 19003306 | 2234 | 0.000017 | org.wto
444 | 19002690 | 1673 | 0.000020 | org.rfc-editor
445 | 19002034 | 1302 | 0.000023 | com.penguinrandomhouse
446 | 19001536 | 920 | 0.000032 | de.spiegel
447 | 18999940 | 1318 | 0.000023 | com.producthunt
448 | 18997582 | 705 | 0.000042 | gov.sec
449 | 18997128 | 520 | 0.000057 | com.meetup
450 | 18994718 | 2162 | 0.000017 | com.ibtimes
451 | 18993898 | 1061 | 0.000028 | com.sun
452 | 18992788 | 2120 | 0.000018 | gov.federalreserve
453 | 18991624 | 1760 | 0.000019 | edu.arizona
454 | 18990732 | 600 | 0.000050 | edu.utah
455 | 18990642 | 1861 | 0.000018 | com.newscientist
456 | 18989730 | 537 | 0.000055 | com.gmail
457 | 18989410 | 1123 | 0.000026 | net.java
458 | 18986644 | 1885 | 0.000018 | com.itv
459 | 18986450 | 504 | 0.000059 | com.ssl-images-amazon
460 | 18986010 | 243 | 0.000108 | uk.org.ico
461 | 18985542 | 1781 | 0.000019 | ca.blogspot
462 | 18985454 | 73 | 0.000503 | net.akamaihd
463 | 18985010 | 840 | 0.000035 | in.co.google
464 | 18984802 | 1391 | 0.000022 | de.zeit
465 | 18984480 | 1005 | 0.000030 | uk.co.thetimes
466 | 18984458 | 1042 | 0.000029 | com.prweb
467 | 18983876 | 2661 | 0.000013 | com.twitpic
468 | 18983790 | 1195 | 0.000025 | io.pypa
469 | 18982648 | 2956 | 0.000012 | com.openai
470 | 18980494 | 447 | 0.000067 | net.imgix
471 | 18979850 | 1901 | 0.000018 | com.martinfowler
472 | 18979062 | 242 | 0.000108 | org.purl
473 | 18978596 | 655 | 0.000045 | de.gesetze-im-internet
474 | 18978450 | 397 | 0.000079 | net.themeforest
475 | 18978408 | 216 | 0.000121 | jp.co.yahoo
476 | 18977992 | 1409 | 0.000022 | edu.ufl
477 | 18977350 | 425 | 0.000073 | com.atlassian
478 | 18975784 | 1373 | 0.000023 | edu.duke
479 | 18974486 | 236 | 0.000111 | to.amzn
480 | 18974130 | 2285 | 0.000016 | edu.gmu
481 | 18974004 | 1056 | 0.000028 | edu.nyu
482 | 18973414 | 575 | 0.000052 | org.debian
483 | 18973148 | 1249 | 0.000024 | com.jetbrains
484 | 18973108 | 235 | 0.000111 | com.mapbox
485 | 18973012 | 296 | 0.000092 | me.telegram
486 | 18972556 | 1287 | 0.000024 | com.wikia
487 | 18972192 | 82 | 0.000409 | com.oversightboard
488 | 18971130 | 342 | 0.000083 | com.proofpoint
489 | 18970830 | 994 | 0.000030 | com.jimdofree
490 | 18970782 | 2638 | 0.000014 | org.nypl
491 | 18968834 | 806 | 0.000037 | edu.brookings
492 | 18968470 | 2213 | 0.000017 | org.wfp
493 | 18968366 | 2417 | 0.000015 | mp.j
494 | 18967966 | 2624 | 0.000014 | app.web
495 | 18967448 | 2363 | 0.000015 | com.instructables
496 | 18966950 | 1267 | 0.000024 | org.imf
497 | 18965952 | 1134 | 0.000026 | org.unhcr
498 | 18965650 | 1610 | 0.000021 | edu.virginia
499 | 18965586 | 3000 | 0.000012 | ph.telegra
500 | 18963358 | 2122 | 0.000018 | org.propublica
501 | 18963068 | 1655 | 0.000020 | edu.brown
502 | 18961020 | 1430 | 0.000022 | com.seattletimes
503 | 18960454 | 313 | 0.000088 | io.shields
504 | 18960174 | 2348 | 0.000016 | org.archlinux
505 | 18959730 | 355 | 0.000081 | com.surveymonkey
506 | 18958892 | 747 | 0.000040 | gov.state
507 | 18956864 | 1019 | 0.000029 | com.yarnpkg
508 | 18956706 | 2142 | 0.000017 | org.phys
509 | 18956672 | 1688 | 0.000020 | org.unwomen
510 | 18955200 | 809 | 0.000037 | com.fiverr
511 | 18954884 | 2930 | 0.000012 | org.vim
512 | 18954634 | 3447 | 0.000010 | com.instapaper
513 | 18953814 | 221 | 0.000119 | com.eventbrite
514 | 18953136 | 878 | 0.000034 | edu.psu
515 | 18948514 | 1599 | 0.000021 | com.asahi
516 | 18948206 | 2338 | 0.000016 | ca.ualberta
517 | 18947706 | 3130 | 0.000011 | com.rd
518 | 18947260 | 748 | 0.000040 | com.intel
519 | 18947088 | 2414 | 0.000015 | com.gfycat
520 | 18946976 | 2649 | 0.000013 | org.icrc
521 | 18944700 | 2467 | 0.000015 | org.biorxiv
522 | 18941214 | 1658 | 0.000020 | org.r-project
523 | 18939874 | 257 | 0.000105 | com.aliyuncs
524 | 18936846 | 139 | 0.000205 | com.weibo
525 | 18936474 | 1852 | 0.000018 | com.gettyimages
526 | 18931798 | 525 | 0.000057 | com.googlecode
527 | 18930224 | 3699 | 0.000010 | com.plurk
528 | 18929320 | 2136 | 0.000017 | org.unep
529 | 18929114 | 1737 | 0.000019 | com.howstuffworks
530 | 18928922 | 2617 | 0.000014 | com.udacity
531 | 18927534 | 1830 | 0.000019 | edu.georgetown
532 | 18926738 | 2327 | 0.000016 | com.esri
533 | 18925612 | 685 | 0.000043 | uk.gov.service
534 | 18924274 | 2742 | 0.000013 | jp.co.japantimes
535 | 18924242 | 2406 | 0.000015 | com.kobo
536 | 18921598 | 555 | 0.000054 | com.samsung
537 | 18921494 | 1447 | 0.000022 | fr.gouvernement
538 | 18920518 | 2821 | 0.000013 | org.wikibooks
539 | 18919550 | 2454 | 0.000015 | it.scoop
540 | 18917542 | 3441 | 0.000010 | net.openreview
541 | 18916950 | 2137 | 0.000017 | es.abc
542 | 18913684 | 2422 | 0.000015 | jp.geocities
543 | 18913494 | 2748 | 0.000013 | edu.uoregon
544 | 18912900 | 2119 | 0.000018 | google.ai
545 | 18909392 | 2655 | 0.000013 | co.carrd
546 | 18907616 | 2115 | 0.000018 | uk.co.huffingtonpost
547 | 18904404 | 690 | 0.000043 | com.mashable
548 | 18903570 | 604 | 0.000049 | com.steampowered
549 | 18902828 | 1730 | 0.000020 | org.torproject
550 | 18902570 | 449 | 0.000066 | com.netflix
551 | 18901684 | 2215 | 0.000017 | google.research
552 | 18900478 | 2699 | 0.000013 | at.ac.univie
553 | 18899428 | 2143 | 0.000017 | edu.tufts
554 | 18897774 | 913 | 0.000033 | com.thelancet
555 | 18894916 | 3791 | 0.000009 | goog.translate
556 | 18893828 | 982 | 0.000030 | org.ohchr
557 | 18892464 | 3880 | 0.000009 | com.bravesites
558 | 18890432 | 2887 | 0.000012 | org.rsf
559 | 18890342 | 1792 | 0.000019 | gov.usembassy
560 | 18886662 | 3016 | 0.000012 | com.architecturaldigest
561 | 18886098 | 1103 | 0.000027 | cn.news
562 | 18885566 | 2367 | 0.000015 | uk.bl
563 | 18883186 | 3407 | 0.000010 | uk.co.walesonline
564 | 18882344 | 2461 | 0.000015 | org.accessnow
565 | 18880856 | 2291 | 0.000016 | com.france24
566 | 18879976 | 3309 | 0.000011 | com.pearltrees
567 | 18878902 | 2815 | 0.000013 | org.freedomhouse
568 | 18875528 | 219 | 0.000120 | com.salesforce
569 | 18874868 | 2575 | 0.000014 | org.scala-lang
570 | 18874266 | 1142 | 0.000026 | be.google
571 | 18874220 | 2684 | 0.000013 | re.appsto
572 | 18873246 | 2400 | 0.000015 | org.ap
573 | 18873008 | 3183 | 0.000011 | do.bit
574 | 18872314 | 2833 | 0.000012 | com.sputniknews
575 | 18871926 | 2165 | 0.000017 | org.americanprogress
576 | 18871186 | 1568 | 0.000021 | com.chron
577 | 18871072 | 2792 | 0.000013 | org.unaids
578 | 18869902 | 2689 | 0.000013 | com.ajc
579 | 18868558 | 2251 | 0.000016 | app.vercel
580 | 18868076 | 495 | 0.000060 | com.visualstudio
581 | 18866746 | 1333 | 0.000023 | net.daringfireball
582 | 18865464 | 2654 | 0.000013 | org.csis
583 | 18863760 | 2350 | 0.000016 | com.ew
584 | 18862092 | 1293 | 0.000024 | link.page
585 | 18861286 | 2320 | 0.000016 | fr.gouv.diplomatie
586 | 18860790 | 189 | 0.000133 | ru.mail
587 | 18860636 | 852 | 0.000035 | org.mediawiki
588 | 18859930 | 622 | 0.000048 | com.thinkwithgoogle
589 | 18858612 | 2865 | 0.000012 | com.duolingo
590 | 18855586 | 2112 | 0.000018 | com.domaintools
591 | 18855394 | 291 | 0.000094 | net.secureservercdn
592 | 18855236 | 2860 | 0.000012 | com.biography
593 | 18853964 | 1648 | 0.000020 | jp.ne.goo
594 | 18853434 | 1827 | 0.000019 | com.lifewire
595 | 18853160 | 2479 | 0.000015 | ie.independent
596 | 18850136 | 2903 | 0.000012 | uk.ac.leeds
597 | 18849484 | 2951 | 0.000012 | com.allure
598 | 18849452 | 2241 | 0.000016 | com.timeout
599 | 18848206 | 3135 | 0.000011 | org.cpj
600 | 18847016 | 4137 | 0.000009 | com.bonanza
601 | 18844324 | 1640 | 0.000020 | ca.globalnews
602 | 18843380 | 1774 | 0.000019 | gov.in
603 | 18843170 | 1731 | 0.000020 | com.images-amazon
604 | 18843116 | 2782 | 0.000013 | com.depositphotos
605 | 18843074 | 1843 | 0.000018 | com.thebalance
606 | 18842016 | 86 | 0.000401 | com.livestream
607 | 18841682 | 287 | 0.000095 | com.naver
608 | 18841558 | 477 | 0.000062 | com.force
609 | 18840756 | 1422 | 0.000022 | net.codecanyon
610 | 18838890 | 3346 | 0.000011 | io.ghost
611 | 18838884 | 2419 | 0.000015 | com.teenvogue
612 | 18838836 | 2366 | 0.000015 | nz.co.stuff
613 | 18837438 | 2732 | 0.000013 | com.123rf
614 | 18836408 | 2508 | 0.000014 | com.motherjones
615 | 18836354 | 679 | 0.000043 | int.wipo
616 | 18836332 | 2615 | 0.000014 | edu.kit
617 | 18834684 | 1706 | 0.000020 | com.routledge
618 | 18834336 | 695 | 0.000043 | io.readthedocs
619 | 18834038 | 3692 | 0.000010 | com.laweekly
620 | 18832208 | 441 | 0.000069 | com.businesswire
621 | 18831266 | 2914 | 0.000012 | org.oxfam
622 | 18829852 | 486 | 0.000062 | com.adweek
623 | 18829616 | 2498 | 0.000014 | edu.hawaii
624 | 18829216 | 3543 | 0.000010 | com.udn
625 | 18828270 | 726 | 0.000042 | com.canva
626 | 18828034 | 2985 | 0.000012 | com.slides
627 | 18827908 | 627 | 0.000047 | io.codepen
628 | 18827552 | 2664 | 0.000013 | com.googlegroups
629 | 18826348 | 3901 | 0.000009 | cn.org.china
630 | 18825700 | 2698 | 0.000013 | com.coca-colacompany
631 | 18825602 | 888 | 0.000033 | uk.co.pinterest
632 | 18823058 | 2551 | 0.000014 | org.fas
633 | 18822830 | 1110 | 0.000027 | net.clickbank
634 | 18822568 | 3099 | 0.000011 | uk.co.timesonline
635 | 18822432 | 363 | 0.000081 | net.php
636 | 18821410 | 2612 | 0.000014 | edu.iastate
637 | 18821392 | 2144 | 0.000017 | com.refinery29
638 | 18820314 | 988 | 0.000030 | gov.dhs
639 | 18819896 | 4216 | 0.000008 | com.alamy
640 | 18819802 | 991 | 0.000030 | de.t-online
641 | 18817700 | 279 | 0.000097 | com.iubenda
642 | 18817434 | 1612 | 0.000021 | com.haaretz
643 | 18816616 | 1565 | 0.000021 | mil.army
644 | 18816484 | 3053 | 0.000011 | com.hm
645 | 18816156 | 1605 | 0.000021 | uk.gov.ons
646 | 18813740 | 359 | 0.000081 | mp.mailchi
647 | 18811428 | 3469 | 0.000010 | org.heritage
648 | 18808928 | 1222 | 0.000025 | org.eugdpr
649 | 18808748 | 2341 | 0.000016 | za.co.google
650 | 18808348 | 563 | 0.000053 | org.unicef
651 | 18808206 | 2938 | 0.000012 | com.theonion
652 | 18807932 | 345 | 0.000083 | com.akismet
653 | 18807216 | 150 | 0.000190 | org.networkadvertising
654 | 18805384 | 834 | 0.000035 | com.venturebeat
655 | 18803670 | 2671 | 0.000013 | com.timesofisrael
656 | 18802968 | 3182 | 0.000011 | com.ogilvy
657 | 18800978 | 134 | 0.000217 | info.aboutads
658 | 18800344 | 1265 | 0.000024 | tw.com.google
659 | 18799472 | 549 | 0.000054 | com.fc2
660 | 18798170 | 2715 | 0.000013 | com.theintercept
661 | 18798140 | 2377 | 0.000015 | com.foreignpolicy
662 | 18797944 | 4009 | 0.000009 | com.zara
663 | 18796758 | 3042 | 0.000012 | org.project-syndicate
664 | 18796410 | 2790 | 0.000013 | cn.gov.fmprc
665 | 18795586 | 541 | 0.000055 | com.patreon
666 | 18794332 | 2912 | 0.000012 | org.ballotpedia
667 | 18793786 | 3675 | 0.000010 | uk.co.guim
668 | 18793668 | 1203 | 0.000025 | com.thenextweb
669 | 18793250 | 2460 | 0.000015 | nz.co.nzherald
670 | 18791956 | 2329 | 0.000016 | gov.faa
671 | 18791606 | 664 | 0.000045 | com.entrepreneur
672 | 18790298 | 1699 | 0.000020 | com.nike
673 | 18788220 | 2257 | 0.000016 | com.voanews
674 | 18785682 | 3646 | 0.000010 | com.podomatic
675 | 18783652 | 394 | 0.000080 | jp.ameblo
676 | 18781732 | 3665 | 0.000010 | nz.co.scoop
677 | 18779826 | 2668 | 0.000013 | com.jpost
678 | 18779770 | 755 | 0.000040 | org.js
679 | 18778874 | 2589 | 0.000014 | de.tagesspiegel
680 | 18778436 | 591 | 0.000051 | com.gofundme
681 | 18778236 | 651 | 0.000046 | it.placehold
682 | 18778130 | 692 | 0.000043 | gov.nist
683 | 18777528 | 3090 | 0.000011 | no.uib
684 | 18777264 | 3759 | 0.000010 | com.clustrmaps
685 | 18776238 | 2533 | 0.000014 | com.channelnewsasia
686 | 18775626 | 2332 | 0.000016 | com.carto
687 | 18774970 | 2544 | 0.000014 | edu.usf
688 | 18774910 | 4401 | 0.000008 | uk.ac.essex
689 | 18774474 | 2203 | 0.000017 | de.br
690 | 18773102 | 4256 | 0.000008 | org.marxists
691 | 18769558 | 2826 | 0.000013 | br.com.blogspot
692 | 18769428 | 680 | 0.000043 | com.photobucket
693 | 18769216 | 3395 | 0.000010 | com.parade
694 | 18768412 | 3731 | 0.000010 | com.mongabay
695 | 18768244 | 742 | 0.000041 | com.moz
696 | 18768070 | 3519 | 0.000010 | ar.com.lanacion
697 | 18767480 | 1162 | 0.000026 | com.digitaloceanspaces
698 | 18767252 | 3833 | 0.000009 | com.scribblelive
699 | 18766284 | 3667 | 0.000010 | ru.msk
700 | 18765148 | 1707 | 0.000020 | org.oxfordjournals
701 | 18765018 | 1846 | 0.000018 | com.speakerdeck
702 | 18764342 | 1177 | 0.000026 | com.jekyllrb
703 | 18764272 | 1320 | 0.000023 | com.imageshack
704 | 18763628 | 698 | 0.000043 | com.withgoogle
705 | 18763518 | 2775 | 0.000013 | com.fineartamerica
706 | 18763306 | 1627 | 0.000020 | org.amnesty
707 | 18762512 | 2572 | 0.000014 | org.unctad
708 | 18761940 | 3049 | 0.000012 | int.au
709 | 18761162 | 109 | 0.000306 | me.wa
710 | 18760936 | 2184 | 0.000017 | org.ncsl
711 | 18759340 | 2857 | 0.000012 | uk.org.nationaltrust
712 | 18758396 | 4435 | 0.000008 | com.mysanantonio
713 | 18758278 | 3154 | 0.000011 | fr.rfi
714 | 18757674 | 1300 | 0.000023 | gov.federalregister
715 | 18756920 | 5968 | 0.000006 | org.arkive
716 | 18756858 | 3385 | 0.000011 | com.nationalreview
717 | 18756398 | 2232 | 0.000017 | org.worldcat
718 | 18756388 | 4160 | 0.000009 | com.turkishairlines
719 | 18756100 | 3077 | 0.000011 | uk.ac.york
720 | 18755876 | 3029 | 0.000012 | org.nationalgeographic
721 | 18755390 | 3360 | 0.000011 | org.tigris
722 | 18755228 | 362 | 0.000081 | com.adnxs
723 | 18753668 | 1765 | 0.000019 | com.indianexpress
724 | 18753482 | 3293 | 0.000011 | org.neocities
725 | 18751702 | 3058 | 0.000011 | ly.genial
726 | 18750392 | 3274 | 0.000011 | uk.co.penguin
727 | 18750242 | 919 | 0.000032 | com.hootsuite
728 | 18750002 | 3494 | 0.000010 | com.nme
729 | 18747334 | 2735 | 0.000013 | com.kaggle
730 | 18746894 | 193 | 0.000131 | com.discord
731 | 18746600 | 3357 | 0.000011 | de.taz
732 | 18746298 | 3147 | 0.000011 | edu.bc
733 | 18746190 | 3041 | 0.000012 | tr.com.aa
734 | 18745544 | 3156 | 0.000011 | com.cgtn
735 | 18745132 | 2431 | 0.000015 | org.unodc
736 | 18744550 | 282 | 0.000096 | gov.ftc
737 | 18743682 | 2344 | 0.000016 | eu.politico
738 | 18743052 | 1364 | 0.000023 | com.symantec
739 | 18740834 | 2520 | 0.000014 | net.openid
740 | 18740752 | 3637 | 0.000010 | il.ac.tau
741 | 18739398 | 2453 | 0.000015 | ru.ria
742 | 18738948 | 3031 | 0.000012 | com.allafrica
743 | 18738772 | 1172 | 0.000026 | jp.ac.keio
744 | 18738200 | 2869 | 0.000012 | edu.educause
745 | 18738156 | 3402 | 0.000010 | org.firstmonday
746 | 18738122 | 2642 | 0.000014 | org.wikidata
747 | 18737788 | 491 | 0.000061 | com.placeholder
748 | 18734980 | 2900 | 0.000012 | com.simonandschuster
749 | 18734860 | 3796 | 0.000009 | org.amnestyusa
750 | 18734696 | 2168 | 0.000017 | com.justia
751 | 18733308 | 2197 | 0.000017 | ca.on.gov
752 | 18733088 | 3707 | 0.000010 | uk.gov.scotland
753 | 18733026 | 4563 | 0.000008 | com.flightradar24
754 | 18732944 | 4326 | 0.000008 | com.interviewmagazine
755 | 18732742 | 3744 | 0.000010 | com.afp
756 | 18732154 | 4112 | 0.000009 | org.scala-sbt
757 | 18730992 | 2780 | 0.000013 | ae.google
758 | 18730824 | 1303 | 0.000023 | org.webkit
759 | 18729882 | 2947 | 0.000012 | com.superuser
760 | 18729678 | 1014 | 0.000029 | com.highcharts
761 | 18729020 | 3802 | 0.000009 | com.wusa9
762 | 18728900 | 2550 | 0.000014 | jp.nicovideo
763 | 18728094 | 2141 | 0.000017 | gov.pa
764 | 18727990 | 3937 | 0.000009 | org.one
765 | 18726672 | 2908 | 0.000012 | edu.uky
766 | 18725034 | 3052 | 0.000011 | in.businessinsider
767 | 18724474 | 3382 | 0.000011 | org.hypotheses
768 | 18723378 | 2681 | 0.000013 | org.wbur
769 | 18723236 | 569 | 0.000052 | com.inc
770 | 18721066 | 1434 | 0.000022 | com.upwork
771 | 18720908 | 4232 | 0.000008 | org.sourcewatch
772 | 18720676 | 3366 | 0.000011 | com.sciencealert
773 | 18720222 | 1316 | 0.000023 | de.rki
774 | 18719114 | 3466 | 0.000010 | org.royalsociety
775 | 18718524 | 2354 | 0.000016 | ru.rbc
776 | 18718198 | 1034 | 0.000029 | com.videojs
777 | 18717530 | 3383 | 0.000011 | org.polymer-project
778 | 18717228 | 487 | 0.000062 | ee.lin
779 | 18717008 | 3202 | 0.000011 | org.texastribune
780 | 18716598 | 1054 | 0.000028 | fm.last
781 | 18716564 | 3658 | 0.000010 | se.gu
782 | 18715916 | 2181 | 0.000017 | it.redd
783 | 18715552 | 1219 | 0.000025 | com.smashingmagazine
784 | 18715536 | 2743 | 0.000013 | org.undocs
785 | 18713460 | 2713 | 0.000013 | org.iucn
786 | 18713092 | 3662 | 0.000010 | com.hashicorp
787 | 18712944 | 1712 | 0.000020 | scot.gov
788 | 18712564 | 976 | 0.000031 | com.jwplayer
789 | 18712330 | 3843 | 0.000009 | edu.wayne
790 | 18711122 | 576 | 0.000052 | com.booking
791 | 18710906 | 803 | 0.000037 | com.fandom
792 | 18710858 | 3850 | 0.000009 | com.triplepundit
793 | 18709690 | 329 | 0.000085 | com.hackerone
794 | 18709314 | 4235 | 0.000008 | com.letterboxd
795 | 18707654 | 1151 | 0.000026 | com.alexa
796 | 18707534 | 2258 | 0.000016 | com.knightlab
797 | 18706964 | 770 | 0.000039 | com.sedo
798 | 18706750 | 3394 | 0.000010 | org.iucnredlist
799 | 18706380 | 1944 | 0.000018 | com.firebaseapp
800 | 18705808 | 4036 | 0.000009 | com.manta
801 | 18702546 | 3072 | 0.000011 | au.com.theage
802 | 18701450 | 3475 | 0.000010 | org.sierraclub
803 | 18700424 | 400 | 0.000078 | com.onesignal
804 | 18700348 | 2652 | 0.000013 | ru.kommersant
805 | 18700250 | 3379 | 0.000011 | com.hasbro
806 | 18699356 | 3821 | 0.000009 | edu.unu
807 | 18699146 | 3656 | 0.000010 | com.crashlytics
808 | 18698894 | 817 | 0.000037 | com.marketwatch
809 | 18698494 | 4071 | 0.000009 | ru.aif
810 | 18698224 | 4419 | 0.000008 | com.folkd
811 | 18698152 | 1285 | 0.000024 | gov.uspto
812 | 18697734 | 3358 | 0.000011 | net.ipsnews
813 | 18697484 | 2632 | 0.000014 | org.unfpa
814 | 18697336 | 987 | 0.000030 | com.stackexchange
815 | 18696510 | 3207 | 0.000011 | ly.plot
816 | 18696184 | 532 | 0.000056 | com.indeed
817 | 18695334 | 1667 | 0.000020 | fr.blogspot
818 | 18695246 | 972 | 0.000031 | com.css-tricks
819 | 18694382 | 933 | 0.000032 | org.reactjs
820 | 18693292 | 4083 | 0.000009 | com.marinetraffic
821 | 18692940 | 2706 | 0.000013 | ru.rg
822 | 18692900 | 4365 | 0.000008 | com.balenciaga
823 | 18692468 | 2561 | 0.000014 | com.kinstacdn
824 | 18691710 | 2586 | 0.000014 | build.bazel
825 | 18690746 | 545 | 0.000055 | com.digg
826 | 18690190 | 3997 | 0.000009 | jp.co.tepco
827 | 18690182 | 1457 | 0.000022 | io.webflow
828 | 18690148 | 4685 | 0.000008 | com.gmanetwork
829 | 18689994 | 3779 | 0.000009 | org.rferl
830 | 18689870 | 4416 | 0.000008 | kr.co.koreatimes
831 | 18689452 | 821 | 0.000036 | com.oreilly
832 | 18689426 | 953 | 0.000031 | gov.fcc
833 | 18689024 | 2957 | 0.000012 | com.articulate
834 | 18688634 | 3054 | 0.000011 | site.notion
835 | 18688256 | 2441 | 0.000015 | int.reliefweb
836 | 18688218 | 2722 | 0.000013 | com.insidehighered
837 | 18687872 | 1086 | 0.000027 | so.notion
838 | 18687810 | 4466 | 0.000008 | org.sfpl
839 | 18687672 | 3580 | 0.000010 | uk.co.spectator
840 | 18687622 | 2840 | 0.000012 | com.suntimes
841 | 18687452 | 917 | 0.000032 | com.verisign
842 | 18686880 | 2967 | 0.000012 | org.cfr
843 | 18686622 | 2819 | 0.000013 | org.panda
844 | 18686298 | 1016 | 0.000029 | com.mixcloud
845 | 18686002 | 83 | 0.000405 | com.messenger
846 | 18685934 | 533 | 0.000056 | jp.co.rakuten
847 | 18685764 | 4343 | 0.000008 | com.upworthy
848 | 18685494 | 2644 | 0.000014 | ru.kremlin
849 | 18684806 | 278 | 0.000097 | com.sxsw
850 | 18684212 | 3542 | 0.000010 | com.flippa
851 | 18684006 | 568 | 0.000052 | com.mckinsey
852 | 18683956 | 2534 | 0.000014 | net.convio
853 | 18683510 | 1231 | 0.000025 | com.buffer
854 | 18683006 | 3100 | 0.000011 | com.yougov
855 | 18682928 | 5110 | 0.000007 | com.viki
856 | 18682436 | 4674 | 0.000008 | org.birdlife
857 | 18681966 | 4413 | 0.000008 | com.itsnicethat
858 | 18681374 | 650 | 0.000046 | com.gartner
859 | 18681372 | 2927 | 0.000012 | uk.gov.metoffice
860 | 18681084 | 494 | 0.000061 | com.dmca
861 | 18680840 | 3597 | 0.000010 | org.jenkins-ci
862 | 18680358 | 3089 | 0.000011 | int.iom
863 | 18678670 | 4082 | 0.000009 | com.iconarchive
864 | 18677384 | 4480 | 0.000008 | com.oriflame
865 | 18676518 | 4538 | 0.000008 | net.middleeasteye
866 | 18675854 | 5041 | 0.000007 | com.waitbutwhy
867 | 18675534 | 4316 | 0.000008 | org.pen
868 | 18675274 | 2830 | 0.000013 | fm.omny
869 | 18674282 | 3902 | 0.000009 | org.icij
870 | 18674044 | 4188 | 0.000008 | org.constitutioncenter
871 | 18673972 | 3857 | 0.000009 | ch.qos
872 | 18673474 | 4070 | 0.000009 | com.9to5google
873 | 18673432 | 3526 | 0.000010 | uk.gov.companieshouse
874 | 18673406 | 3994 | 0.000009 | uk.ac.sussex
875 | 18673258 | 3266 | 0.000011 | com.foreignaffairs
876 | 18673246 | 2835 | 0.000012 | com.news24
877 | 18673204 | 4132 | 0.000009 | re.cli
878 | 18672690 | 4227 | 0.000008 | jp.ac.kobe-u
879 | 18671714 | 902 | 0.000033 | br.com.uol
880 | 18671552 | 3718 | 0.000010 | com.nybooks
881 | 18671446 | 1818 | 0.000019 | com.over-blog
882 | 18671362 | 5578 | 0.000006 | com.symbaloo
883 | 18669612 | 3170 | 0.000011 | uk.co.bbci
884 | 18669200 | 934 | 0.000032 | com.pubmatic
885 | 18669010 | 2385 | 0.000015 | com.scene7
886 | 18668810 | 3467 | 0.000010 | org.wikileaks
887 | 18667242 | 4637 | 0.000008 | org.foodandwaterwatch
888 | 18666438 | 3305 | 0.000011 | at.derstandard
889 | 18666074 | 876 | 0.000034 | com.zoho
890 | 18665410 | 3167 | 0.000011 | org.adb
891 | 18664362 | 3518 | 0.000010 | com.benzinga
892 | 18663988 | 730 | 0.000041 | com.usnews
893 | 18663450 | 5592 | 0.000006 | io.postach
894 | 18663030 | 3547 | 0.000010 | com.palgrave
895 | 18662464 | 1109 | 0.000027 | net.media
896 | 18662120 | 4091 | 0.000009 | net.datasociety
897 | 18661426 | 540 | 0.000055 | com.googleoptimize
898 | 18661044 | 4116 | 0.000009 | au.com.heraldsun
899 | 18659068 | 3440 | 0.000010 | ru.kp
900 | 18657988 | 2675 | 0.000013 | com.thenation
901 | 18657690 | 663 | 0.000045 | me.zalo
902 | 18657052 | 3001 | 0.000012 | com.unity
903 | 18655796 | 1635 | 0.000020 | org.altervista
904 | 18654580 | 4826 | 0.000007 | it.polito
905 | 18654490 | 4482 | 0.000008 | edu.odu
906 | 18654200 | 3171 | 0.000011 | org.sonatype
907 | 18653790 | 2853 | 0.000012 | net.vnexpress
908 | 18653284 | 735 | 0.000041 | com.alibaba
909 | 18652510 | 4454 | 0.000008 | com.muckrack
910 | 18652400 | 2937 | 0.000012 | com.lexology
911 | 18652270 | 4962 | 0.000007 | kr.co.hani
912 | 18650818 | 3723 | 0.000010 | com.tradingeconomics
913 | 18650606 | 3882 | 0.000009 | com.study
914 | 18650594 | 595 | 0.000050 | com.airbnb
915 | 18649776 | 3663 | 0.000010 | gov.ustr
916 | 18649674 | 4900 | 0.000007 | com.theodysseyonline
917 | 18649520 | 3724 | 0.000010 | uk.gov.homeoffice
918 | 18648278 | 1031 | 0.000029 | com.pcmag
919 | 18647134 | 633 | 0.000047 | org.joomla
920 | 18645770 | 3367 | 0.000011 | br.scielo
921 | 18645448 | 74 | 0.000486 | com.trustpilot
922 | 18644890 | 5552 | 0.000006 | au.edu.vu
923 | 18644710 | 3127 | 0.000011 | tw.com.pchome
924 | 18644260 | 725 | 0.000042 | com.splashthat
925 | 18643560 | 2745 | 0.000013 | ca.citizenlab
926 | 18642628 | 4599 | 0.000008 | com.condenast
927 | 18642354 | 1676 | 0.000020 | com.techrepublic
928 | 18641430 | 2125 | 0.000018 | io.pantheonsite
929 | 18641322 | 3281 | 0.000011 | ru.cbr
930 | 18641240 | 2866 | 0.000012 | ca.uwaterloo
931 | 18640912 | 4384 | 0.000008 | uk.co.belfasttelegraph
932 | 18640688 | 597 | 0.000050 | com.wufoo
933 | 18639194 | 3231 | 0.000011 | org.ellenmacarthurfoundation
9341863904648210.000007com.zimbio
9351863882433210.000011com.rabbitmq
936186384225470.000054com.herokuapp
9371863685256570.000006org.cgsociety
9381863491257040.000006in.teletype
939186349005310.000056com.aol
9401863402637520.000010edu.ucpress
9411863397633750.000011com.scotsman
9421863349432770.000011com.kroger
943186322724050.000077com.constantcontact
944186319308700.000034com.emarketer
9451863087456430.000006com.dbs
9461863083844070.000008au.edu.deakin
9471863063033150.000011org.osce
9481862945829710.000012com.euractiv
9491862864247880.000007com.latercera
9501862610238810.000009com.bloombergquint
951186255809520.000031com.digitalocean
9521862510236380.000010org.ushmm
9531862488837510.000010com.lawfareblog
9541862466448770.000007ke.co.google
9551862440038350.000009com.thenationalnews
9561862437847160.000007com.kongregate
9571862424051320.000007com.apsense
9581862402013420.000023com.nvidia
959186238386170.000048gov.copyright
9601862350444370.000008com.jacobinmag
9611862339629340.000012net.dwcdn
962186226806430.000046com.accenture
9631862232045290.000008uk.ac.soas
9641862116634450.000010de.test
9651862056816610.000020com.createjs
9661862014032180.000011com.obsproject
9671861997628920.000012org.gnupg
9681861987043180.000008com.washingtonian
9691861939249080.000007uk.co.birminghammail
9701861915445480.000008io.meduza
9711861905840340.000009ru.mid
9721861888212070.000025org.golang
9731861853439300.000009org.cgiar
9741861711624110.000015co.pcdn
9751861630426130.000014com.olark
9761861556210070.000030com.gumroad
9771861365227550.000013ru.tass
9781861351048250.000007com.selfridges
9791861281437700.000009fr.capital
9801861221443910.000008za.co.mg
981186121689110.000033net.atlassian
982186120448440.000035com.redhat
9831861151817490.000019com.indiegogo
9841861143850090.000007edu.utep
9851861085617270.000020org.linuxfoundation
986186104469560.000031com.att
9871860918628900.000012org.transparency
9881860858839180.000009com.encyclopedia
98918606828720.000505com.oculus
990186067726990.000043com.psychologytoday
9911860669830910.000011com.sharefile
992186065041510.000189org.whatwg
993186063547540.000040org.poynter
9941860626833860.000011com.alchemer
995186046487070.000042co.ibb
996186044322860.000095com.caniuse
9971860440227380.000013com.springeropen
9981860438624900.000014studio.flourish
9991860413843730.000008com.googledrive
10001860401444900.000008tw.com.books
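
The domain names in the table are written in reversed order (for example uk.co.bbci rather than bbci.co.uk). The following is a minimal Python sketch of how one row might be read and its domain name put back into the usual order. It assumes the row has already been split into five whitespace-separated fields; the field names and the helper functions (`RankRow`, `unreverse_domain`, `parse_rank_row`) are illustrative and not part of any Common Crawl tool.

```python
from typing import NamedTuple


class RankRow(NamedTuple):
    # Field names are illustrative assumptions, not taken from the rank files.
    harmonic_rank: int
    harmonic_value: float
    pagerank_rank: int
    pagerank_value: float
    domain: str


def unreverse_domain(reversed_name: str) -> str:
    """Reverse the dot-separated labels, e.g. 'uk.co.bbci' -> 'bbci.co.uk'."""
    return ".".join(reversed(reversed_name.split(".")))


def parse_rank_row(line: str) -> RankRow:
    """Parse one row, assuming five whitespace-separated fields in table order."""
    h_rank, h_val, pr_rank, pr_val, rev_name = line.split()
    return RankRow(
        int(h_rank),
        float(h_val),
        int(pr_rank),
        float(pr_val),
        unreverse_domain(rev_name),
    )


row = parse_rank_row("883 18669612 3170 0.000011 uk.co.bbci")
print(row.domain)  # prints "bbci.co.uk"
```

Keeping the two ranks as separate integer fields makes it easy to compare them, for instance to spot domains that rank much higher by PageRank than by harmonic centrality.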

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for your research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!