We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018. These graphs, along with ranked lists of hosts and domains, follow the prior web graph releases (Feb/Mar/Apr 2017, May/Jun/Jul 2017 and Aug/Sep/Oct 2017). Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the preceding announcements.

Please note that the first version of this release (published 2018-02-08, withdrawn 2018-02-21) contained only links from the January 2018 crawl; see the notice on the Common Crawl user group. A corrected version was published on 2018-02-28, with graphs and rankings that contain all links, hosts and domains from all three crawls. We also provide the erroneously released graphs and rankings from the January 2018 crawl.

What’s new?

Here is a summary of notable aspects and changes of this web graph release:

  • a bug has been fixed that caused relative links pointing to a different host (e.g. //www.example.com/index.html) to be omitted from the edges of the host- and domain-level web graphs
  • the domain graph now contains the number of hosts per domain as an additional column in the vertices and rankings files
  • the naming scheme has changed – the release name is now part of the file name
  • webgraph offset files are no longer released; they can be created by running

    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-host
    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-domain
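The first item in the list above concerns links such as //www.example.com/index.html, which omit the scheme but still name a different host. As a rough illustration only (this is not the pipeline's code, and the function name `link_target_host` and the page URL are hypothetical), resolving each link against the URL of the page it was found on yields the target host for all kinds of relative links:

```python
from urllib.parse import urljoin, urlsplit

def link_target_host(page_url, href):
    """Resolve href against the page URL and return the host it points to."""
    absolute = urljoin(page_url, href)
    return urlsplit(absolute).hostname

# Hypothetical page on which the links were found.
page = "https://blog.example.org/post/1.html"

for href in ["/about.html", "img/logo.png", "//www.example.com/index.html"]:
    # The first two links stay on blog.example.org; the last one names
    # another host and would therefore form an edge in the host-level graph.
    print(href, "->", link_target_host(page, href))
```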

Host-level graph

The graph consists of 2.75 billion nodes and 8.6 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled but are the target of a link on a crawled page. There are 2.67 billion dangling nodes (97%), and the largest strongly connected component contains only 65 million nodes (2.3%). Host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
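The host name normalization described above can be sketched in a few lines. This is a minimal illustration, not the code used to build the graph; the function name `reverse_host` is made up for the example.

```python
def reverse_host(host):
    """Strip a leading "www." and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

print(reverse_host("www.subdomain.example.com"))
```

One practical effect of the reversed form is that hosts of the same domain share a common prefix, so they end up next to each other when the names are sorted.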

You can download the graph and the ranks of all 2.75 billion hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/. Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/ as a prefix to access the files from anywhere.
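The two locations differ only in their prefix, so one can be derived from the other. The sketch below assumes that the files sit directly under the directory given above; the helper name `s3_to_https` is hypothetical.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://data.commoncrawl.org/"

def s3_to_https(s3_path):
    """Map an s3://commoncrawl/... path to the matching HTTPS URL."""
    if not s3_path.startswith(S3_PREFIX):
        raise ValueError("not a path in the commoncrawl bucket: " + s3_path)
    return HTTPS_PREFIX + s3_path[len(S3_PREFIX):]

release_dir = "s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/"
# Assumption: the file is stored directly under the release directory.
print(s3_to_https(release_dir + "cc-main-2017-18-nov-dec-jan-host.stats"))
```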

The following files and formats are provided:

Download files of the Common Crawl Nov/Dec/Jan 2017-18 host-level webgraph

| Size | File | Description |
| --- | --- | --- |
| 15.9 GB | cc-main-2017-18-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files |
| 40.0 GB | cc-main-2017-18-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 28 edges files |
| 16.4 GB | cc-main-2017-18-nov-dec-jan-host.graph | graph in BVGraph format |
| 2 kB | cc-main-2017-18-nov-dec-jan-host.properties | |
| 24.2 GB | cc-main-2017-18-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks) |
| 2 kB | cc-main-2017-18-nov-dec-jan-host-t.properties | |
| 1 kB | cc-main-2017-18-nov-dec-jan-host.stats | WebGraph statistics |
| 38.1 GB | cc-main-2017-18-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank |
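To show how the vertices ⟨id, rev host⟩ and edges ⟨from_id, to_id⟩ files relate to each other, here is a small sketch that joins the two on a handful of made-up lines. It assumes tab-separated columns in the order listed above, which should be checked against the data format notes in the preceding announcements; the function names and sample records are hypothetical.

```python
def parse_vertices(lines):
    """Map node id -> reversed host name (assumes "id<TAB>rev_host" lines)."""
    id_to_host = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        id_to_host[int(node_id)] = rev_host
    return id_to_host

def parse_edges(lines):
    """Yield (from_id, to_id) pairs (assumes "from_id<TAB>to_id" lines)."""
    for line in lines:
        from_id, to_id = line.rstrip("\n").split("\t")
        yield int(from_id), int(to_id)

# Made-up sample records, not taken from the released files.
vertices = ["0\tcom.example\n", "1\tcom.example.subdomain\n", "2\torg.example\n"]
edges = ["0\t1\n", "0\t2\n", "1\t2\n"]

names = parse_vertices(vertices)
named_edges = [(names[a], names[b]) for a, b in parse_edges(edges)]
for source, target in named_edges:
    print(source, "->", target)
```

An in-memory dictionary works for a sample like this; with 2.75 billion hosts, the full vertices list calls for a more compact or disk-based lookup.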

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not accepted (cf. section “divisions” in the documentation on publicsuffix.org). For example, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated into the domains blogspot.com and amazonaws.com, respectively, and stored in the reversed form as com.blogspot and com.amazonaws.
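The aggregation rule can be illustrated with a deliberately simplified sketch. It uses a tiny hand-picked set of ICANN suffixes in place of the real public suffix list and ignores the list's wildcard and exception rules, so it is only a toy model of the idea; the names `ICANN_SUFFIXES`, `pay_level_domain` and `reversed_pld` are hypothetical.

```python
# Hand-picked excerpt for illustration; the real list on publicsuffix.org is far longer.
# Entries from the "private" division (e.g. blogspot.com) are left out on purpose.
ICANN_SUFFIXES = {"com", "org", "uk", "co.uk"}

def pay_level_domain(host):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):  # i = 0 tries the longest candidate suffix first
        if ".".join(labels[i:]) in ICANN_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched

def reversed_pld(host):
    """Pay-level domain in the reversed form used by the graph files."""
    pld = pay_level_domain(host)
    return None if pld is None else ".".join(reversed(pld.split(".")))

for host in ["foo.blogspot.com", "www.bbc.co.uk"]:
    print(host, "->", pay_level_domain(host), "->", reversed_pld(host))
```

Because blogspot.com is not in the accepted suffix set, foo.blogspot.com falls back to the suffix com and is aggregated into blogspot.com, which matches the behavior described above.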

The domain-level graph has 94 million nodes and 1.44 billion edges. Of these nodes, 56 million (59%) are dangling nodes, and the largest strongly connected component covers 33 million nodes (35%).
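A share of dangling nodes like the one quoted above can be estimated from an edge list. The sketch below treats every node without an outgoing edge as dangling; this is an assumption made for the example, since a crawled host or domain whose pages contain no outgoing links would be counted as well. The function name and the toy graph are made up.

```python
def dangling_nodes(num_nodes, edges):
    """Return the ids of all nodes that never appear as the source of an edge."""
    has_outlink = [False] * num_nodes
    for from_id, _to_id in edges:
        has_outlink[from_id] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]

# Toy graph with 4 nodes; nodes 2 and 3 are only ever link targets.
edges = [(0, 1), (0, 2), (1, 0), (1, 3)]
dangling = dangling_nodes(4, edges)
share = len(dangling) / 4
print(dangling, "{:.0%}".format(share))
```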

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/ or via https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2017-18 domain-level webgraph

| Size | File | Description |
| --- | --- | --- |
| 0.67 GB | cc-main-2017-18-nov-dec-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩ |
| 5.7 GB | cc-main-2017-18-nov-dec-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩ |
| 3.1 GB | cc-main-2017-18-nov-dec-jan-domain.graph | graph in BVGraph format |
| 2 kB | cc-main-2017-18-nov-dec-jan-domain.properties | |
| 3.3 GB | cc-main-2017-18-nov-dec-jan-domain-t.graph | transpose of the graph |
| 2 kB | cc-main-2017-18-nov-dec-jan-domain-t.properties | |
| 1 kB | cc-main-2017-18-nov-dec-jan-domain.stats | WebGraph statistics |
| 2.0 GB | cc-main-2017-18-nov-dec-jan-domain-ranks.txt.gz | harmonic centrality and pagerank |

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 94 million domains is available for download.
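For readers unfamiliar with the first ranking criterion: in its standard definition, the harmonic centrality of a node x is the sum of 1/d(y, x) over all other nodes y that can reach x, where d(y, x) is the length of the shortest link path from y to x. The sketch below computes this exactly on a toy graph with one breadth-first search per node. It only illustrates the definition and is not meant to describe how the published values were computed; the function name and the graph are made up.

```python
from collections import deque

def harmonic_centrality(num_nodes, edges):
    """Exact harmonic centrality for each node of a small directed graph."""
    # Store incoming links (the transposed graph) so that a search starting
    # at a node walks backwards to every node that can reach it.
    incoming = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        incoming[dst].append(src)

    scores = []
    for target in range(num_nodes):
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for prev in incoming[node]:
                if prev not in dist:
                    dist[prev] = dist[node] + 1
                    total += 1.0 / dist[prev]
                    queue.append(prev)
        scores.append(total)
    return scores

# Toy graph: 0 -> 1 -> 2 and 3 -> 2.
edges = [(0, 1), (1, 2), (3, 2)]
scores = harmonic_centrality(4, edges)
print(scores)
```

Node 2 scores highest here because two nodes link to it directly and a third reaches it in two steps. A search per node is only practical for small graphs.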

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2017-2018)

| harmonic centrality rank | hc value | page rank | page rank value | reversed hostname |
| --- | --- | --- | --- | --- |
| 1 | 26073210 | 2 | 0.013220 | com.facebook |
| 2 | 25501832 | 1 | 0.016444 | com.googleapis |
| 3 | 23718256 | 3 | 0.009278 | com.google |
| 4 | 23371534 | 4 | 0.008406 | com.twitter |
| 5 | 22832192 | 5 | 0.007823 | com.youtube |
| 6 | 21653376 | 6 | 0.006112 | org.w |
| 7 | 20324636 | 7 | 0.004710 | org.gmpg |
| 8 | 20045928 | 8 | 0.003501 | com.instagram |
| 9 | 19837996 | 10 | 0.002871 | com.linkedin |
| 10 | 19439618 | 12 | 0.002753 | org.wordpress |
| 11 | 19334234 | 14 | 0.002070 | com.wordpress |
| 12 | 19214522 | 17 | 0.001665 | com.pinterest |
| 13 | 19145770 | 27 | 0.001242 | org.wikipedia |
| 14 | 19121822 | 23 | 0.001462 | com.gravatar |
| 15 | 18842296 | 33 | 0.000966 | com.blogspot |
| 16 | 18810990 | 11 | 0.002837 | com.bootstrapcdn |
| 17 | 18718320 | 19 | 0.001594 | com.apple |
| 18 | 18626224 | 26 | 0.001255 | com.vimeo |
| 19 | 18434062 | 15 | 0.001863 | com.adobe |
| 20 | 18419880 | 44 | 0.000691 | be.youtu |
| 21 | 18397832 | 34 | 0.000964 | com.amazon |
| 22 | 18350614 | 13 | 0.002084 | com.macromedia |
| 23 | 18323552 | 29 | 0.001015 | com.microsoft |
| 24 | 18321908 | 41 | 0.000757 | gl.goo |
| 25 | 18302296 | 31 | 0.001009 | com.flickr |
| 26 | 18270630 | 46 | 0.000657 | com.tumblr |
| 27 | 18183288 | 59 | 0.000540 | com.yahoo |
| 28 | 18136014 | 20 | 0.001531 | net.doubleclick |
| 29 | 18074436 | 70 | 0.000464 | ly.bit |
| 30 | 18072284 | 32 | 0.000988 | com.amazonaws |
| 31 | 18039506 | 18 | 0.001618 | com.googletagmanager |
| 32 | 17994916 | 35 | 0.000913 | com.paypal |
| 33 | 17957448 | 78 | 0.000417 | eu.europa |
| 34 | 17950818 | 25 | 0.001280 | com.cloudflare |
| 35 | 17880136 | 87 | 0.000397 | com.weebly |
| 36 | 17863816 | 30 | 0.001012 | com.github |
| 37 | 17859140 | 81 | 0.000412 | org.mozilla |
| 38 | 17838500 | 40 | 0.000769 | net.cloudfront |
| 39 | 17830430 | 95 | 0.000348 | co.t |
| 40 | 17794416 | 80 | 0.000414 | org.creativecommons |
| 41 | 17773226 | 102 | 0.000289 | com.googleusercontent |
| 42 | 17757566 | 57 | 0.000562 | org.w3 |
| 43 | 17751372 | 39 | 0.000782 | io.github |
| 44 | 17703562 | 97 | 0.000340 | com.soundcloud |
| 45 | 17674626 | 118 | 0.000226 | com.blogger |
| 46 | 17673486 | 138 | 0.000182 | net.slideshare |
| 47 | 17666384 | 108 | 0.000265 | com.reddit |
| 48 | 17650506 | 51 | 0.000617 | com.bing |
| 49 | 17622678 | 147 | 0.000171 | com.myspace |
| 50 | 17614686 | 65 | 0.000474 | com.medium |
| 51 | 17600302 | 117 | 0.000233 | org.archive |
| 52 | 17597652 | 136 | 0.000187 | com.imgur |
| 53 | 17581558 | 66 | 0.000474 | com.list-manage |
| 54 | 17545184 | 37 | 0.000804 | org.apache |
| 55 | 17499074 | 155 | 0.000154 | com.imdb |
| 56 | 17493316 | 240 | 0.000097 | com.about |
| 57 | 17491778 | 28 | 0.001104 | com.gstatic |
| 58 | 17471556 | 169 | 0.000144 | com.wsj |
| 59 | 17464136 | 126 | 0.000218 | com.jimdo |
| 60 | 17462240 | 234 | 0.000101 | com.livejournal |
| 61 | 17450286 | 47 | 0.000649 | com.wp |
| 62 | 17447836 | 129 | 0.000206 | com.issuu |
| 63 | 17445244 | 130 | 0.000204 | com.android |
| 64 | 17443518 | 122 | 0.000222 | com.yelp |
| 65 | 17419300 | 43 | 0.000721 | com.statcounter |
| 66 | 17406774 | 50 | 0.000626 | me.wp |
| 67 | 17392892 | 179 | 0.000138 | com.oracle |
| 68 | 17372570 | 162 | 0.000148 | com.digg |
| 69 | 17368632 | 231 | 0.000102 | me.about |
| 70 | 17367318 | 299 | 0.000078 | com.scribd |
| 71 | 17361946 | 255 | 0.000091 | org.python |
| 72 | 17359688 | 127 | 0.000210 | uk.co.google |
| 73 | 17357006 | 61 | 0.000525 | com.cnn |
| 74 | 17342014 | 124 | 0.000220 | com.nytimes |
| 75 | 17339968 | 319 | 0.000073 | com.quora |
| 76 | 17329620 | 249 | 0.000092 | com.ted |
| 77 | 17321450 | 153 | 0.000161 | com.spotify |
| 78 | 17301998 | 148 | 0.000168 | com.wixsite |
| 79 | 17300500 | 233 | 0.000101 | com.dailymotion |
| 80 | 17297908 | 208 | 0.000118 | com.staticflickr |
| 81 | 17288954 | 390 | 0.000062 | org.chromium |
| 82 | 17276404 | 106 | 0.000273 | com.ytimg |
| 83 | 17269890 | 259 | 0.000089 | com.webs |
| 84 | 17265306 | 145 | 0.000177 | org.ietf |
| 85 | 17255342 | 222 | 0.000109 | com.mozilla |
| 86 | 17243666 | 187 | 0.000133 | net.behance |
| 87 | 17243048 | 191 | 0.000130 | com.disqus |
| 88 | 17242476 | 273 | 0.000085 | com.mysql |
| 89 | 17240042 | 158 | 0.000152 | com.stumbleupon |
| 90 | 17236410 | 268 | 0.000085 | com.foursquare |
| 91 | 17231272 | 314 | 0.000075 | gov.loc |
| 92 | 17213010 | 151 | 0.000164 | org.gnu |
| 93 | 17210118 | 146 | 0.000171 | com.tripadvisor |
| 94 | 17203374 | 361 | 0.000066 | org.nodejs |
| 95 | 17201882 | 378 | 0.000064 | com.storify |
| 96 | 17178790 | 156 | 0.000153 | com.forbes |
| 97 | 17177956 | 60 | 0.000527 | com.huffingtonpost |
| 98 | 17168464 | 133 | 0.000196 | com.dropbox |
| 99 | 17164012 | 199 | 0.000125 | com.typepad |
| 100 | 17156522 | 241 | 0.000097 | com.example |
| 101 | 17150188 | 166 | 0.000146 | uk.co.bbc |
| 102 | 17148528 | 479 | 0.000051 | edu.virginia |
| 103 | 17142618 | 89 | 0.000384 | com.paypalobjects |
| 104 | 17140226 | 48 | 0.000645 | net.fbcdn |
| 105 | 17130568 | 403 | 0.000060 | com.pixabay |
| 106 | 17126316 | 383 | 0.000063 | ca.blogspot |
| 107 | 17118492 | 200 | 0.000124 | org.wikimedia |
| 108 | 17116358 | 297 | 0.000079 | com.githubusercontent |
| 109 | 17115676 | 363 | 0.000066 | com.sun |
| 110 | 17111592 | 36 | 0.000863 | com.squarespace |
| 111 | 17106572 | 292 | 0.000079 | com.goodreads |
| 112 | 17105500 | 56 | 0.000574 | com.fb |
| 113 | 17103768 | 459 | 0.000053 | kr.flic |
| 114 | 17094226 | 431 | 0.000057 | org.ampproject |
| 115 | 17086396 | 530 | 0.000048 | edu.gatech |
| 116 | 17086356 | 180 | 0.000137 | com.theguardian |
| 117 | 17085768 | 96 | 0.000344 | com.wix |
| 118 | 17083032 | 518 | 0.000049 | it.scoop |
| 119 | 17081382 | 427 | 0.000057 | org.sciencemag |
| 120 | 17072106 | 139 | 0.000182 | net.sourceforge |
| 121 | 17062258 | 515 | 0.000049 | com.nike |
| 122 | 17056708 | 400 | 0.000060 | org.eclipse |
| 123 | 17054770 | 432 | 0.000056 | co.g |
| 124 | 17052438 | 269 | 0.000085 | com.tinyurl |
| 125 | 17052256 | 62 | 0.000509 | net.akamaihd |
| 126 | 17047904 | 437 | 0.000055 | org.kernel |
| 127 | 17045616 | 68 | 0.000467 | com.mashable |
| 128 | 17045460 | 489 | 0.000051 | au.com.blogspot |
| 129 | 17042294 | 64 | 0.000480 | org.schema |
| 130 | 17041062 | 620 | 0.000043 | com.discogs |
| 131 | 17038234 | 141 | 0.000181 | com.youtube-nocookie |
| 132 | 17037262 | 370 | 0.000065 | com.npmjs |
| 133 | 17034618 | 298 | 0.000079 | com.symantec |
| 134 | 17023588 | 196 | 0.000126 | com.live |
| 135 | 17018612 | 328 | 0.000072 | com.googlecode |
| 136 | 17016620 | 396 | 0.000061 | com.git-scm |
| 137 | 17012130 | 394 | 0.000061 | com.500px |
| 138 | 17011310 | 198 | 0.000126 | edu.stanford |
| 139 | 17010782 | 416 | 0.000058 | com.unity3d |
| 140 | 17010532 | 686 | 0.000042 | com.wikidot |
| 141 | 16992494 | 334 | 0.000071 | com.alexa |
| 142 | 16983610 | 447 | 0.000054 | com.sap |
| 143 | 16978146 | 250 | 0.000092 | com.businessinsider |
| 144 | 16976726 | 272 | 0.000085 | com.cnet |
| 145 | 16976366 | 372 | 0.000064 | com.getpocket |
| 146 | 16971698 | 252 | 0.000092 | com.go |
| 147 | 16964456 | 232 | 0.000101 | com.washingtonpost |
| 148 | 16957034 | 567 | 0.000046 | com.chrome |
| 149 | 16955976 | 9 | 0.003080 | com.godaddy |
| 150 | 16954152 | 140 | 0.000182 | com.sharethis |
| 151 | 16953542 | 211 | 0.000115 | com.ebay |
| 152 | 16949556 | 506 | 0.000050 | edu.berkeley |
| 153 | 16948812 | 377 | 0.000064 | au.gov.nsw |
| 154 | 16942594 | 289 | 0.000080 | com.msn |
| 155 | 16938334 | 333 | 0.000072 | com.time |
| 156 | 16937752 | 213 | 0.000114 | com.nbcnews |
| 157 | 16934364 | 75 | 0.000429 | edu.utexas |
| 158 | 16930038 | 532 | 0.000048 | com.jetbrains |
| 159 | 16927712 | 317 | 0.000074 | edu.harvard |
| 160 | 16924770 | 545 | 0.000047 | ms.1drv |
| 161 | 16917850 | 189 | 0.000130 | com.etsy |
| 162 | 16914838 | 176 | 0.000140 | gov.nih |
| 163 | 16911752 | 664 | 0.000043 | com.klout |
| 164 | 16905058 | 327 | 0.000072 | edu.mit |
| 165 | 16903928 | 316 | 0.000074 | com.reuters |
| 166 | 16898946 | 235 | 0.000098 | com.mapquest |
| 167 | 16898876 | 318 | 0.000074 | com.wired |
| 168 | 16893364 | 570 | 0.000046 | com.crunchbase |
| 169 | 16893270 | 401 | 0.000060 | gov.nasa |
| 170 | 16890130 | 722 | 0.000040 | com.4shared |
| 171 | 16885770 | 281 | 0.000082 | io.codepen |
| 172 | 16882882 | 295 | 0.000079 | com.photobucket |
| 173 | 16875932 | 257 | 0.000090 | com.udacity |
| 174 | 16865692 | 309 | 0.000076 | com.aol |
| 175 | 16858168 | 408 | 0.000059 | com.cnbc |
| 176 | 16853816 | 293 | 0.000079 | com.tripod |
| 177 | 16848676 | 517 | 0.000049 | org.aarp |
| 178 | 16847720 | 563 | 0.000046 | edu.utah |
| 179 | 16846928 | 342 | 0.000070 | org.npr |
| 180 | 16844128 | 746 | 0.000039 | com.diigo |
| 181 | 16842074 | 303 | 0.000077 | com.meetup |
| 182 | 16840924 | 120 | 0.000223 | com.mailchimp |
| 183 | 16840096 | 367 | 0.000065 | com.gmail |
| 184 | 16835606 | 24 | 0.001310 | ru.yandex |
| 185 | 16834612 | 425 | 0.000057 | com.appspot |
| 186 | 16833556 | 287 | 0.000080 | com.ibm |
| 187 | 16827030 | 338 | 0.000071 | gov.ca |
| 188 | 16826202 | 242 | 0.000095 | com.surveymonkey |
| 189 | 16825532 | 276 | 0.000083 | com.usatoday |
| 190 | 16824988 | 778 | 0.000038 | com.googledrive |
| 191 | 16822846 | 749 | 0.000039 | com.naturalnews |
| 192 | 16819990 | 764 | 0.000038 | io.soup |
| 193 | 16815880 | 340 | 0.000070 | uk.co.telegraph |
| 194 | 16814236 | 163 | 0.000148 | com.eventbrite |
| 195 | 16813884 | 206 | 0.000119 | com.opera |
| 196 | 16813306 | 676 | 0.000043 | com.zappos |
| 197 | 16811868 | 88 | 0.000394 | com.jquery |
| 198 | 16811796 | 692 | 0.000042 | com.wholefoodsmarket |
| 199 | 16809508 | 535 | 0.000048 | com.createspace |
| 200 | 16809072 | 322 | 0.000073 | com.images-amazon |
| 201 | 16807592 | 304 | 0.000077 | com.bloomberg |
| 202 | 16796502 | 193 | 0.000128 | com.twimg |
| 203 | 16793364 | 414 | 0.000058 | com.kickstarter |
| 204 | 16792730 | 103 | 0.000285 | com.addthis |
| 205 | 16791844 | 251 | 0.000092 | com.techcrunch |
| 206 | 16791074 | 804 | 0.000037 | edu.washington |
| 207 | 16790852 | 689 | 0.000042 | com.abebooks |
| 208 | 16790694 | 294 | 0.000079 | com.googlesyndication |
| 209 | 16790246 | 511 | 0.000049 | edu.cornell |
| 210 | 16785260 | 529 | 0.000048 | com.buzzfeed |
| 211 | 16783130 | 412 | 0.000059 | org.un |
| 212 | 16781132 | 263 | 0.000087 | com.stackoverflow |
| 213 | 16780958 | 149 | 0.000166 | com.feedburner |
| 214 | 16779250 | 608 | 0.000044 | com.theverge |
| 215 | 16775130 | 796 | 0.000037 | com.pearltrees |
| 216 | 16774700 | 67 | 0.000473 | com.vk |
| 217 | 16774586 | 375 | 0.000064 | com.latimes |
| 218 | 16765579 | 699 | 0.000042 | com.sublimetext |
| 219 | 16760696 | 498 | 0.000050 | org.rubyonrails |
| 220 | 16755911 | 170 | 0.000142 | com.zendesk |
| 221 | 16754800 | 880 | 0.000035 | com.fotolog |
| 222 | 16754091 | 69 | 0.000466 | me.fb |
| 223 | 16751300 | 577 | 0.000045 | com.audible |
| 224 | 16750615 | 549 | 0.000047 | org.pbs |
| 225 | 16749314 | 536 | 0.000048 | com.deviantart |
| 226 | 16747765 | 410 | 0.000059 | com.wiley |
| 227 | 16746660 | 307 | 0.000077 | org.acm |
| 228 | 16745326 | 862 | 0.000036 | tl.page |
| 229 | 16744572 | 212 | 0.000114 | com.ssl-images-amazon |
| 230 | 16743890 | 824 | 0.000037 | com.instapaper |
| 231 | 16742662 | 741 | 0.000039 | com.kinja |
| 232 | 16742008 | 110 | 0.000253 | com.shopify |
| 233 | 16740811 | 767 | 0.000038 | com.newyorker |
| 234 | 16740369 | 503 | 0.000050 | com.yellowpages |
| 235 | 16736172 | 203 | 0.000122 | org.drupal |
| 236 | 16734978 | 758 | 0.000039 | com.xda-developers |
| 237 | 16732311 | 921 | 0.000035 | com.adsoftheworld |
| 238 | 16731895 | 221 | 0.000110 | org.mediawiki |
| 239 | 16731137 | 279 | 0.000083 | fr.free |
| 240 | 16730080 | 805 | 0.000037 | co.ello |
| 241 | 16729515 | 444 | 0.000054 | com.theatlantic |
| 242 | 16725251 | 409 | 0.000059 | uk.co.dailymail |
| 243 | 16721365 | 1189 | 0.000031 | edu.columbia |
| 244 | 16720295 | 388 | 0.000062 | com.bbc |
| 245 | 16720112 | 45 | 0.000661 | com.yimg |
| 246 | 16718905 | 451 | 0.000054 | com.wikihow |
| 247 | 16718697 | 236 | 0.000098 | net.php |
| 248 | 16714787 | 589 | 0.000044 | com.citysearch |
| 249 | 16701081 | 811 | 0.000037 | com.jigsy |
| 250 | 16699551 | 684 | 0.000043 | com.vice |
| 251 | 16693416 | 992 | 0.000034 | ly.ow |
| 252 | 16692056 | 534 | 0.000048 | com.exacttarget |
| 253 | 16685527 | 261 | 0.000089 | com.salesforce |
| 254 | 16682819 | 539 | 0.000047 | com.cbsnews |
| 255 | 16678018 | 502 | 0.000050 | com.zdnet |
| 256 | 16676526 | 397 | 0.000061 | gov.whitehouse |
| 257 | 16675419 | 582 | 0.000045 | com.ft |
| 258 | 16669694 | 105 | 0.000280 | de.google |
| 259 | 16667239 | 1190 | 0.000031 | edu.yale |
| 260 | 16661478 | 1213 | 0.000031 | edu.ucla |
| 261 | 16657707 | 606 | 0.000044 | uk.co.guardian |
| 262 | 16655324 | 685 | 0.000043 | com.googleblog |
| 263 | 16654076 | 734 | 0.000040 | com.nationalgeographic |
| 264 | 16651951 | 92 | 0.000369 | com.qq |
| 265 | 16649666 | 1166 | 0.000032 | edu.psu |
| 266 | 16649394 | 399 | 0.000060 | uk.co.blogspot |
| 267 | 16648730 | 766 | 0.000038 | com.foxnews |
| 268 | 16648321 | 644 | 0.000043 | org.virtualbox |
| 269 | 16647850 | 523 | 0.000048 | org.maven |
| 270 | 16647058 | 77 | 0.000418 | com.people |
| 271 | 16646731 | 216 | 0.000113 | uk.co.amazon |
| 272 | 16645964 | 258 | 0.000089 | com.hp |
| 273 | 16642738 | 550 | 0.000047 | com.cisco |
| 274 | 16640084 | 777 | 0.000038 | com.economist |
| 275 | 16639283 | 321 | 0.000073 | gov.cdc |
| 276 | 16634710 | 590 | 0.000044 | com.bandsintown |
| 277 | 16634368 | 1326 | 0.000027 | com.indiegogo |
| 278 | 16630754 | 1187 | 0.000031 | com.gizmodo |
| 279 | 16630079 | 218 | 0.000112 | com.windowsphone |
| 280 | 16628634 | 584 | 0.000045 | org.hbr |
| 281 | 16628196 | 919 | 0.000035 | com.authorstream |
| 282 | 16627672 | 439 | 0.000055 | edu.cmu |
| 283 | 16624396 | 851 | 0.000036 | com.timeanddate |
| 284 | 16621468 | 1186 | 0.000031 | com.evernote |
| 285 | 16620476 | 578 | 0.000045 | com.dropboxusercontent |
| 286 | 16619394 | 1160 | 0.000033 | com.sciencedaily |
| 287 | 16616793 | 687 | 0.000042 | com.wikia |
| 288 | 16615286 | 224 | 0.000108 | com.bandcamp |
| 289 | 16613293 | 395 | 0.000061 | org.whatbrowser |
| 290 | 16612143 | 256 | 0.000090 | io.atom |
| 291 | 16612009 | 1259 | 0.000029 | in.blogspot |
| 292 | 16610174 | 714 | 0.000040 | com.dpreview |
| 293 | 16610098 | 280 | 0.000083 | com.smugmug |
| 294 | 16609456 | 171 | 0.000142 | com.weibo |
| 295 | 16605754 | 528 | 0.000048 | com.theknot |
| 296 | 16604115 | 751 | 0.000039 | com.merchantcircle |
| 297 | 16599614 | 871 | 0.000035 | us.imageshack |
| 298 | 16598567 | 882 | 0.000035 | com.slate |
| 299 | 16598491 | 197 | 0.000126 | com.blogblog |
| 300 | 16596575 | 715 | 0.000040 | org.imagemagick |
| 301 | 16594197 | 1197 | 0.000031 | org.arxiv |
| 302 | 16591680 | 476 | 0.000051 | com.squareup |
| 303 | 16591661 | 369 | 0.000065 | com.skype |
| 304 | 16588342 | 1428 | 0.000023 | edu.ucsd |
| 305 | 16586521 | 1297 | 0.000028 | com.ning |
| 306 | 16582959 | 575 | 0.000046 | com.tinypic |
| 307 | 16582704 | 493 | 0.000050 | com.giphy |
| 308 | 16582423 | 696 | 0.000042 | com.box |
| 309 | 16582058 | 311 | 0.000076 | com.nypost |
| 310 | 16576626 | 1454 | 0.000023 | com.posterous |
| 311 | 16576158 | 688 | 0.000042 | com.bookdepository |
| 312 | 16576073 | 885 | 0.000035 | com.brandyourself |
| 313 | 16575175 | 1249 | 0.000029 | edu.upenn |
| 314 | 16573309 | 1155 | 0.000033 | org.eff |
| 315 | 16572621 | 478 | 0.000051 | org.postgresql |
| 316 | 16571814 | 677 | 0.000043 | de.blogspot |
| 317 | 16568213 | 407 | 0.000059 | com.angieslist |
| 318 | 16564953 | 787 | 0.000038 | com.samsung |
| 319 | 16563339 | 843 | 0.000036 | com.comixology |
| 320 | 16561663 | 1408 | 0.000024 | edu.wisc |
| 321 | 16560984 | 1161 | 0.000032 | gov.census |
| 322 | 16559941 | 747 | 0.000039 | com.shutterstock |
| 323 | 16559463 | 1323 | 0.000027 | uk.ac.cam |
| 324 | 16558927 | 1171 | 0.000032 | gov.nist |
| 325 | 16558858 | 543 | 0.000047 | com.geocities |
| 326 | 16558841 | 168 | 0.000144 | com.xing |
| 327 | 16558455 | 422 | 0.000057 | com.oreilly |
| 328 | 16558027 | 1459 | 0.000023 | edu.purdue |
| 329 | 16556658 | 716 | 0.000040 | com.nature |
| 330 | 16556180 | 1397 | 0.000024 | com.hotmail |
| 331 | 16555028 | 802 | 0.000037 | com.uk |
| 332 | 16554351 | 996 | 0.000034 | com.livestream |
| 333 | 16553202 | 920 | 0.000035 | com.arstechnica |
| 334 | 16552006 | 337 | 0.000071 | com.prnewswire |
| 335 | 16548858 | 284 | 0.000081 | ca.google |
| 336 | 16546727 | 705 | 0.000041 | org.vim |
| 337 | 16545866 | 220 | 0.000111 | com.getclicky |
| 338 | 16543548 | 415 | 0.000058 | int.who |
| 339 | 16541425 | 1436 | 0.000023 | edu.princeton |
| 340 | 16538268 | 569 | 0.000046 | com.entrepreneur |
| 341 | 16538222 | 382 | 0.000063 | com.sxsw |
| 342 | 16538110 | 1499 | 0.000022 | com.angelfire |
| 343 | 16537923 | 1229 | 0.000030 | edu.umich |
| 344 | 16537889 | 426 | 0.000057 | com.springer |
| 345 | 16533600 | 779 | 0.000038 | com.bravesites |
| 346 | 16533096 | 1038 | 0.000033 | org.unesco |
| 347 | 16531594 | 1351 | 0.000026 | uk.ac.ox |
| 348 | 16531456 | 484 | 0.000051 | com.office |
| 349 | 16529055 | 1260 | 0.000029 | org.iso |
| 350 | 16528766 | 1330 | 0.000027 | com.pcworld |
| 351 | 16527778 | 860 | 0.000036 | com.unsplash |
| 352 | 16527375 | 755 | 0.000039 | com.blackberry |
| 353 | 16526680 | 210 | 0.000117 | de.amazon |
| 354 | 16525790 | 781 | 0.000038 | gov.state |
| 355 | 16523581 | 449 | 0.000054 | com.fortune |
| 356 | 16522217 | 703 | 0.000041 | org.aclweb |
| 357 | 16522178 | 786 | 0.000038 | net.vnexpress |
| 358 | 16522049 | 354 | 0.000068 | com.booking |
| 359 | 16521727 | 879 | 0.000035 | com.dynamics |
| 360 | 16521106 | 1020 | 0.000034 | com.weather |
| 361 | 16520245 | 1003 | 0.000034 | com.communitywalk |
| 362 | 16519672 | 763 | 0.000039 | com.vagrantup |
| 363 | 16516137 | 159 | 0.000152 | com.constantcontact |
| 364 | 16514887 | 708 | 0.000041 | jobs.amazon |
| 365 | 16514721 | 1039 | 0.000033 | com.indiatimes |
| 366 | 16512625 | 775 | 0.000038 | com.cbslocal |
| 367 | 16512276 | 1200 | 0.000031 | com.lifehacker |
| 368 | 16511972 | 1432 | 0.000023 | com.vox |
| 369 | 16509793 | 270 | 0.000085 | it.placehold |
| 370 | 16508711 | 565 | 0.000046 | com.newsweek |
| 371 | 16508203 | 1609 | 0.000020 | net.comcast |
| 372 | 16505501 | 209 | 0.000118 | org.joomla |
| 373 | 16505442 | 448 | 0.000054 | com.force |
| 374 | 16505148 | 1299 | 0.000028 | com.politico |
| 375 | 16502701 | 1310 | 0.000028 | org.altervista |
| 376 | 16500671 | 588 | 0.000044 | com.venturebeat |
| 377 | 16498765 | 278 | 0.000083 | gov.ftc |
| 378 | 16497198 | 756 | 0.000039 | com.java |
| 379 | 16497000 | 1264 | 0.000029 | co.vine |
| 380 | 16493364 | 1068 | 0.000033 | com.ubuntu |
| 381 | 16493303 | 1463 | 0.000023 | com.thinkwithgoogle |
| 382 | 16488408 | 446 | 0.000054 | com.businesswire |
| 383 | 16488348 | 253 | 0.000091 | to.amzn |
| 384 | 16488175 | 1343 | 0.000026 | fm.last |
| 385 | 16487061 | 868 | 0.000035 | hu.elte |
| 386 | 16486012 | 1203 | 0.000031 | com.gofundme |
| 387 | 16485698 | 1168 | 0.000032 | ca.cbc |
| 388 | 16484940 | 1071 | 0.000033 | gov.senate |
| 389 | 16482707 | 1590 | 0.000020 | edu.uchicago |
| 390 | 16482550 | 679 | 0.000043 | com.googlesource |
| 391 | 16481201 | 713 | 0.000040 | org.sqlite |
| 392 | 16473573 | 1335 | 0.000026 | com.airbnb |
| 393 | 16471045 | 680 | 0.000043 | gov.noaa |
| 394 | 16470456 | 719 | 0.000040 | com.manta |
| 395 | 16470297 | 142 | 0.000180 | org.bbb |
| 396 | 16466733 | 1237 | 0.000029 | com.searchengineland |
| 397 | 16465690 | 2103 | 0.000014 | com.twitpic |
| 398 | 16465265 | 1406 | 0.000024 | edu.umn |
| 399 | 16464820 | 884 | 0.000035 | com.googlelabs |
| 400 | 16464294 | 1169 | 0.000032 | com.engadget |
| 401 | 16464089 | 1399 | 0.000024 | uk.co.theregister |
| 402 | 16463266 | 519 | 0.000049 | com.inc |
| 403 | 16463082 | 79 | 0.000414 | com.bleacherreport |
| 404 | 16461353 | 264 | 0.000086 | es.google |
| 405 | 16461168 | 1324 | 0.000027 | com.dell |
| 406 | 16459804 | 1650 | 0.000019 | com.blogs |
| 407 | 16459344 | 1236 | 0.000030 | com.stackexchange |
| 408 | 16458810 | 1676 | 0.000019 | edu.usc |
| 409 | 16458357 | 1482 | 0.000022 | com.mtv |
| 410 | 16456299 | 527 | 0.000048 | org.sonatype |
| 411 | 16456115 | 1720 | 0.000018 | mp.j |
| 412 | 16456086 | 1316 | 0.000027 | com.variety |
| 413 | 16455553 | 740 | 0.000039 | org.gnupg |
| 414 | 16454441 | 2510 | 0.000011 | edu.unl |
| 415 | 16453361 | 1332 | 0.000027 | org.ieee |
| 416 | 16452222 | 1554 | 0.000021 | edu.northwestern |
| 417 | 16451197 | 1184 | 0.000031 | com.americanexpress |
| 418 | 16450122 | 456 | 0.000053 | com.snapchat |
| 419 | 16450065 | 219 | 0.000111 | fr.google |
| 420 | 16448305 | 1307 | 0.000028 | com.discovery |
| 421 | 16447926 | 1257 | 0.000029 | com.businessweek |
| 422 | 16447711 | 1219 | 0.000030 | com.netflix |
| 423 | 16445870 | 1599 | 0.000020 | edu.jhu |
| 424 | 16445859 | 769 | 0.000038 | com.jsbin |
| 425 | 16445370 | 128 | 0.000209 | com.googleadservices |
| 426 | 16445100 | 735 | 0.000040 | com.intel |
| 427 | 16444823 | 566 | 0.000046 | com.delicious |
| 428 | 16444541 | 1152 | 0.000033 | com.pinimg |
| 429 | 16443282 | 474 | 0.000052 | com.nwsource |
| 430 | 16442373 | 1156 | 0.000033 | tv.ustream |
| 431 | 16439601 | 165 | 0.000147 | it.google |
| 432 | 16439513 | 723 | 0.000040 | br.com.uol |
| 433 | 16439408 | 521 | 0.000048 | com.herokuapp |
| 434 | 16438062 | 312 | 0.000075 | com.bitly |
| 435 | 16432114 | 184 | 0.000134 | com.eepurl |
| 436 | 16431325 | 1620 | 0.000019 | com.examiner |
| 437 | 16431244 | 358 | 0.000067 | com.bizjournals |
| 438 | 16430338 | 855 | 0.000036 | com.souq |
| 439 | 16429560 | 1174 | 0.000032 | au.net.abc |
| 440 | 16429209 | 1192 | 0.000031 | fr.blogspot |
| 441 | 16428874 | 1806 | 0.000016 | edu.rutgers |
| 442 | 16428650 | 858 | 0.000036 | ca.pinterest |
| 443 | 16428386 | 1630 | 0.000019 | com.udemy |
| 444 | 16426324 | 1680 | 0.000018 | uk.co.thesun |
| 445 | 16425975 | 1429 | 0.000023 | com.prezi |
| 446 | 16422871 | 1693 | 0.000018 | com.speakerdeck |
| 447 | 16421467 | 1290 | 0.000028 | com.mlb |
| 448 | 16421465 | 782 | 0.000038 | com.mysanantonio |
| 449 | 16421194 | 1211 | 0.000031 | com.chicagotribune |
| 450 | 16420605 | 720 | 0.000040 | com.shopbop |
| 451 | 16418476 | 1575 | 0.000020 | it.blogspot |
| 452 | 16418348 | 290 | 0.000080 | com.hubspot |
| 453 | 16416163 | 1899 | 0.000015 | edu.msu |
| 454 | 16415995 | 282 | 0.000082 | com.fc2 |
| 455 | 16415711 | 697 | 0.000042 | com.moz |
| 456 | 16414248 | 784 | 0.000038 | com.boxofficemojo |
| 457 | 16413782 | 727 | 0.000040 | io.getmdl |
| 458 | 16410305 | 76 | 0.000421 | me.m |
| 459 | 16409061 | 1274 | 0.000028 | gov.fbi |
| 460 | 16406432 | 1966 | 0.000015 | ch.ethz |
| 461 | 16406244 | 262 | 0.000088 | com.dribbble |
| 462 | 16404420 | 194 | 0.000126 | jp.co.yahoo |
| 463 | 16402794 | 1491 | 0.000022 | com.trello |
| 464 | 16401561 | 1026 | 0.000034 | com.slack |
| 465 | 16401275 | 1325 | 0.000027 | net.researchgate |
| 466 | 16400689 | 332 | 0.000072 | edu.nyu |
| 467 | 16398770 | 137 | 0.000185 | com.google-analytics |
| 468 | 16398291 | 555 | 0.000047 | com.wunderground |
| 469 | 16398105 | 429 | 0.000057 | com.naver |
| 470 | 16397960 | 1824 | 0.000016 | com.tutsplus |
| 471 | 16396587 | 2171 | 0.000013 | com.googlepages |
| 472 | 16396128 | 1594 | 0.000020 | edu.academia |
| 473 | 16395104 | 413 | 0.000059 | com.bigcartel |
| 474 | 16394348 | 810 | 0.000037 | it.binged |
| 475 | 16393983 | 1380 | 0.000025 | org.khanacademy |
| 476 | 16393285 | 1194 | 0.000031 | com.reverbnation |
| 477 | 16393035 | 1587 | 0.000020 | com.mac |
| 478 | 16392781 | 1472 | 0.000022 | com.target |
| 479 | 16392485 | 2085 | 0.000014 | edu.asu |
| 480 | 16391796 | 275 | 0.000084 | com.wufoo |
| 481 | 16391219 | 2036 | 0.000014 | edu.arizona |
| 482 | 16390459 | 700 | 0.000041 | uk.co.independent |
| 483 | 16389602 | 1519 | 0.000022 | com.pexels |
| 484 | 16389509 | 1412 | 0.000024 | com.over-blog |
| 485 | 16388260 | 466 | 0.000052 | com.adweek |
| 486 | 16387362 | 260 | 0.000089 | com.myshopify |
| 487 | 16387344 | 1395 | 0.000024 | com.bostonglobe |
| 488 | 16387145 | 1572 | 0.000020 | com.zazzle |
| 489 | 16387134 | 1361 | 0.000025 | com.libsyn |
| 490 | 16386010 | 418 | 0.000058 | com.fastcompany |
| 491 | 16385497 | 580 | 0.000045 | gov.ed |
| 492 | 16385161 | 119 | 0.000223 | com.baidu |
| 493 | 16385107 | 612 | 0.000044 | cn.com.sina |
| 494 | 16384455 | 591 | 0.000044 | gov.fda |
| 495 | 16384415 | 728 | 0.000040 | es.com.blogspot |
| 496 | 16384099 | 1040 | 0.000033 | gov.nps |
| 497 | 16383586 | 1646 | 0.000019 | com.vanityfair |
| 498 | 16382978 | 887 | 0.000035 | ws.snack |
| 499 | 16382297 | 1046 | 0.000033 | com.marketwatch |
| 500 | 16381800 | 1888 | 0.000016 | com.yolasite |
| 501 | 16381324 | 1558 | 0.000021 | com.nba |
| 502 | 16379962 | 109 | 0.000261 | org.networkadvertising |
| 503 | 16379082 | 1153 | 0.000033 | gov.house |
| 504 | 16376893 | 898 | 0.000035 | com.sfgate |
| 505 | 16373505 | 2289 | 0.000012 | edu.caltech |
| 506 | 16372783 | 465 | 0.000053 | com.w3schools |
| 507 | 16372566 | 123 | 0.000221 | jp.co.google |
| 508 | 16372041 | 2035 | 0.000014 | com.instructables |
| 509 | 16369468 | 1821 | 0.000016 | com.msnbc |
| 510 | 16368539 | 1470 | 0.000022 | com.scientificamerican |
| 511 | 16368503 | 1543 | 0.000021 | com.ehow |
| 512 | 16366625 | 1984 | 0.000015 | uk.ac.ucl |
| 513 | 16366040 | 690 | 0.000042 | org.bitbucket |
| 514 | 16365725 | 2162 | 0.000013 | ca.ualberta |
| 515 | 16364806 | 464 | 0.000053 | net.openid |
| 516 | 16364617 | 768 | 0.000038 | org.gradle |
| 517 | 16364203 | 1772 | 0.000017 | org.aclu |
| 518 | 16363540 | 1474 | 0.000022 | com.elpais |
| 519 | 16363484 | 724 | 0.000040 | com.yarnpkg |
| 520 | 16363077 | 2742 | 0.000010 | com.hubpages |
| 521 | 16362675 | 633 | 0.000043 | com.cargocollective |
| 522 | 16361964 | 1640 | 0.000019 | com.mercurynews |
| 523 | 16359187 | 1177 | 0.000032 | com.steampowered |
| 524 | 16358974 | 1845 | 0.000016 | edu.ufl |
| 525 | 16358831 | 1235 | 0.000030 | org.change |
| 526 | 16358719 | 583 | 0.000045 | gov.usda |
| 527 | 16358083 | 828 | 0.000036 | com.warriorplus |
| 528 | 16357302 | 1244 | 0.000029 | com.thenextweb |
| 529 | 16357275 | 1386 | 0.000024 | de.spiegel |
| 530 | 16357064 | 1158 | 0.000033 | com.proofpoint |
| 531 | 16356965 | 809 | 0.000037 | com.whitepages |
| 532 | 16353352 | 1188 | 0.000031 | gov.fcc |
| 533 | 16352815 | 1795 | 0.000017 | com.nfl |
| 534 | 16352068 | 1345 | 0.000026 | com.globo |
| 535 | 16351967 | 3065 | 0.000009 | com.answers |
| 536 | 16351579 | 706 | 0.000041 | org.jenkins-ci |
| 537 | 16351057 | 1557 | 0.000021 | com.billboard |
| 538 | 16350457 | 776 | 0.000038 | ly.snip |
| 539 | 16349376 | 1172 | 0.000032 | com.ggpht |
| 540 | 16349324 | 1780 | 0.000017 | org.ap |
| 541 | 16349172 | 1901 | 0.000015 | edu.indiana |
| 542 | 16349066 | 1445 | 0.000023 | com.nokia |
| 543 | 16348888 | 1943 | 0.000015 | com.ign |
| 544 | 16348185 | 1926 | 0.000015 | com.ikea |
| 545 | 16348047 | 1706 | 0.000018 | edu.umd |
| 546 | 16348024 | 52 | 0.000596 | com.messenger |
| 547 | 16347814 | 757 | 0.000039 | com.msdn |
| 548 | 16345860 | 1801 | 0.000017 | org.weforum |
| 549 | 16345787 | 526 | 0.000048 | org.doi |
| 550 | 16345233 | 202 | 0.000122 | jp.ameblo |
| 551 | 16344577 | 891 | 0.000035 | com.woot |
| 552 | 16344447 | 1283 | 0.000028 | com.patreon |
| 553 | 16343957 | 1538 | 0.000021 | br.com.blogspot |
| 554 | 16343414 | 132 | 0.000199 | ru.mail |
| 555 | 16343276 | 2104 | 0.000014 | com.oxforddictionaries |
| 556 | 16342590 | 744 | 0.000039 | com.photoshelter |
| 557 | 16341919 | 1322 | 0.000027 | gov.uspto |
| 558 | 16341309 | 1652 | 0.000019 | fr.lemonde |
| 559 | 16340939 | 1591 | 0.000020 | com.rollingstone |
| 560 | 16340630 | 1763 | 0.000017 | uk.co.metro |
| 561 | 16340043 | 602 | 0.000044 | com.sciencedirect |
| 562 | 16339730 | 3779 | 0.000007 | mx.unam |
| 563 | 16339436 | 944 | 0.000035 | com.hotfrog |
| 564 | 16338592 | 2127 | 0.000014 | com.fiverr |
| 565 | 16337795 | 173 | 0.000141 | jp.ne.hatena |
| 566 | 16337443 | 1840 | 0.000016 | com.aliexpress |
| 567 | 16336373 | 3072 | 0.000009 | com.123rf |
| 568 | 16336076 | 386 | 0.000063 | au.com.google |
| 569 | 16334801 | 1238 | 0.000029 | com.prweb |
| 570 | 16334152 | 1835 | 0.000016 | br.com.abril |
| 571 | 16332875 | 1487 | 0.000022 | com.pcmag |
| 572 | 16332192 | 873 | 0.000035 | ly.plot |
| 573 | 16331709 | 3250 | 0.000008 | com.blog |
| 574 | 16331549 | 391 | 0.000061 | us.icio |
| 575 | 16331199 | 837 | 0.000036 | com.folkd |
| 576 | 16331176 | 2316 | 0.000012 | org.kiva |
| 577 | 16330752 | 2396 | 0.000012 | edu.brown |
| 578 | 16330624 | 1478 | 0.000022 | com.qz |
| 579 | 16330490 | 1180 | 0.000032 | com.psychologytoday |
| 580 | 16329689 | 2088 | 0.000014 | com.newscientist |
| 581 | 16329114 | 1577 | 0.000020 | com.playstation |
| 582 | 16326425 | 1401 | 0.000024 | edu.si |
| 583 | 16324979 | 846 | 0.000036 | io.material |
| 584 | 16324000 | 1072 | 0.000033 | gov.usa |
| 585 | 16322896 | 1608 | 0.000020 | com.hulu |
| 586 | 16321672 | 1341 | 0.000026 | com.cafepress |
| 587 | 16321195 | 1986 | 0.000015 | ca.utoronto |
| 588 | 16321003 | 1597 | 0.000020 | com.econsultancy |
| 589 | 16320939 | 815 | 0.000037 | gov.copyright |
| 590 | 16320038 | 440 | 0.000055 | gov.irs |
| 591 | 16318690 | 3008 | 0.000009 | cc.co |
| 592 | 16318681 | 1834 | 0.000016 | com.canva |
| 593 | 16317432 | 2792 | 0.000010 | pt.sapo |
| 594 | 16315297 | 1705 | 0.000018 | com.colourlovers |
| 595 | 16314881 | 994 | 0.000034 | com.hotukdeals |
| 596 | 16314271 | 830 | 0.000036 | com.getskeleton |
| 597 | 16312515 | 1495 | 0.000022 | com.nymag |
| 598 | 16312187 | 406 | 0.000059 | com.barnesandnoble |
| 599 | 16311552 | 1258 | 0.000029 | org.worldbank |
| 600 | 16310245 | 2006 | 0.000014 | com.bestbuy |
| 601 | 16310108 | 1783 | 0.000017 | com.nhl |
| 602 | 16308879 | 2013 | 0.000014 | edu.uci |
| 603 | 16308831 | 1398 | 0.000024 | com.boston |
| 604 | 16308814 | 878 | 0.000035 | com.insiderpages |
| 605 | 16307539 | 2856 | 0.000010 | edu.tufts |
| 606 | 16307217 | 365 | 0.000066 | nl.google |
| 607 | 16306423 | 826 | 0.000037 | gov.hhs |
| 608 | 16306066 | 2229 | 0.000013 | edu.osu |
| 609 | 16306024 | 1716 | 0.000018 | edu.duke |
| 610 | 16304996 | 1226 | 0.000030 | com.hootsuite |
| 611 | 16304703 | 247 | 0.000093 | jp.co.amazon |
| 612 | 16302217 | 1534 | 0.000021 | gov.nyc |
| 613 | 16301587 | 1855 | 0.000016 | com.fifa |
| 614 | 16301259 | 1642 | 0.000019 | com.withgoogle |
| 615 | 16300727 | 424 | 0.000057 | com.clicky |
| 616 | 16298229 | 524 | 0.000048 | com.whatsapp |
| 617 | 16297862 | 704 | 0.000041 | com.redbubble |
| 618 | 16297676 | 2934 | 0.000009 | com.friendfeed |
| 619 | 16297637 | 1954 | 0.000015 | com.gawker |
| 620 | 16296981 | 1333 | 0.000027 | org.oecd |
| 621 | 16296568 | 2082 | 0.000014 | nl.xs4all |
| 622 | 16296383 | 1857 | 0.000016 | com.pastebin |
| 623 | 16295427 | 938 | 0.000035 | com.tiki-toki |
| 624 | 16294836 | 2809 | 0.000010 | edu.uic |
| 625 | 16294750 | 1281 | 0.000028 | com.istockphoto |
| 626 | 16294356 | 1605 | 0.000020 | com.hyatt |
| 627 | 16294322 | 2059 | 0.000014 | edu.tamu |
| 628 | 16293017 | 2164 | 0.000013 | edu.ncsu |
| 629 | 16292389 | 1385 | 0.000024 | com.com |
| 630 | 16292103 | 946 | 0.000034 | jp.ac.kobe-u |
| 631 | 16291741 | 906 | 0.000035 | com.quantcast |
| 632 | 16291635 | 1931 | 0.000015 | nl.blogspot |
| 633 | 16291618 | 803 | 0.000037 | com.webmd |
| 634 | 16291353 | 2823 | 0.000010 | com.wolfram |
| 635 | 16291336 | 729 | 0.000040 | ca.amazon |
| 636 | 16290941 | 455 | 0.000053 | net.launchpad |
| 637 | 16290095 | 2170 | 0.000013 | com.wikispaces |
| 638 | 16289600 | 1304 | 0.000028 | com.walmart |
| 639 | 16288978 | 3081 | 0.000009 | edu.colostate |
| 640 | 16288094 | 520 | 0.000048 | in.co.google |
| 641 | 16286417 | 1207 | 0.000031 | com.redhat |
| 642 | 16286409 | 1574 | 0.000020 | com.merriam-webster |
| 643 | 16286142 | 1730 | 0.000018 | int.wipo |
| 644 | 16284744 | 1196 | 0.000031 | com.adage |
| 645 | 16284151 | 1224 | 0.000030 | com.ups |
| 646 | 16283988 | 844 | 0.000036 | com.newsbank |
| 647 | 16283941 | 3078 | 0.000009 | com.squidoo |
| 648 | 16283791 | 1337 | 0.000026 | gov.dot |
| 649 | 16283705 | 1677 | 0.000018 | com.me |
| 650 | 16283271 | 1444 | 0.000023 | com.mediafire |
| 651 | 16283190 | 2122 | 0.000014 | ca.ubc |
| 652 | 16282261 | 2632 | 0.000011 | ca.uwaterloo |
| 653 | 16281888 | 1600 | 0.000020 | edu.unc |
| 654 | 16281405 | 2002 | 0.000015 | org.kde |
| 655 | 16280999 | 2109 | 0.000014 | org.gimp |
| 656 | 16280611 | 477 | 0.000051 | com.pingdom |
| 657 | 16279262 | 2517 | 0.000011 | gd.is |
| 658 | 16279233 | 2713 | 0.000010 | edu.hawaii |
| 659 | 16278204 | 2076 | 0.000014 | com.aljazeera |
| 660 | 16277880 | 1525 | 0.000021 | com.xbox |
| 661 | 16276593 | 1611 | 0.000020 | com.freewebs |
| 662 | 16276064 | 2253 | 0.000013 | com.britannica |
| 663 | 16275159 | 1686 | 0.000018 | uk.co.mirror |
| 664 | 16274962 | 2261 | 0.000013 | uk.co.timesonline |
| 665 | 16274102 | 2065 | 0.000014 | au.com.news |
| 666 | 16273788 | 1532 | 0.000021 | com.xkcd |
| 667 | 16273554 | 1198 | 0.000031 | com.feedly |
| 668 | 16273045 | 2931 | 0.000009 | com.laughingsquid |
| 669 | 16272291 | 1507 | 0.000022 | gov.wa |
| 670 | 16272286 | 1842 | 0.000016 | tv.periscope |
| 671 | 16272122 | 1460 | 0.000023 | com.mixcloud |
| 672 | 16270257 | 2919 | 0.000010 | com.codecademy |
| 673 | 16269871 | 2003 | 0.000015 | edu.illinois |
| 674 | 16269399 | 1679 | 0.000018 | uk.co.huffingtonpost |
| 675 | 16269107 | 387 | 0.000062 | net.themeforest |
| 676 | 16269031 | 1654 | 0.000019 | uk.co.ebay |
| 677 | 16268894 | 379 | 0.000063 | com.ea |
| 678 | 16268536 | 998 | 0.000034 | com.att |
| 679 | 16268423 | 1666 | 0.000019 | net.daum |
| 680 | 16267963 | 2613 | 0.000011 | ca.mcgill |
| 681 | 16265041 | 659 | 0.000043 | com.houzz |
| 682 | 16264648 | 1514 | 0.000022 | com.intuit |
| 683 | 16264335 | 573 | 0.000046 | fr.amazon |
| 684 | 16262384 | 2087 | 0.000014 | com.softpedia |
| 685 | 16261968 | 1872 | 0.000016 | com.autodesk |
| 686 | 16261899 | 207 | 0.000119 | org.icann |
| 687 | 16261856 | 1812 | 0.000016 | com.deadline |
| 688 | 16261306 | 2708 | 0.000010 | edu.vanderbilt |
| 689 | 16261208 | 1643 | 0.000019 | com.foxbusiness |
| 690 | 16260540 | 1598 | 0.000020 | gov.uscourts |
| 691 | 16259038 | 380 | 0.000063 | com.heroku |
| 692 | 16258429 | 1471 | 0.000022 | com.gumroad |
| 693 | 16257667 | 2215 | 0.000013 | com.flipboard |
| 694 | 16256425 | 1578 | 0.000020 | com.us |
| 695 | 16256212 | 1634 | 0.000019 | de.welt |
| 696 | 16255761 | 1193 | 0.000031 | com.deloitte |
| 697 | 16254736 | 2276 | 0.000012 | com.yfrog |
| 698 | 16254687 | 1596 | 0.000020 | org.owasp |
| 699 | 16254424 | 2729 | 0.000010 | com.lynda |
| 700 | 16254139 | 2046 | 0.000014 | org.coursera |
| 701 | 16253423 | 942 | 0.000035 | com.cdbaby |
| 702 | 16252259 | 1303 | 0.000028 | com.sagepub |
| 703 | 16252243 | 1585 | 0.000020 | com.vmware |
| 704 | 16252225 | 2045 | 0.000014 | net.earthlink |
| 705 | 16251417 | 711 | 0.000041 | com.usnews |
| 706 | 16251315 | 1383 | 0.000025 | org.unicef |
| 707 | 16251165 | 3714 | 0.000007 | com.space |
| 708 | 16250745 | 2121 | 0.000014 | com.vogue |
| 709 | 16249713 | 423 | 0.000057 | com.cracked |
| 710 | 16249436 | 1844 | 0.000016 | com.domain |
| 711 | 16249262 | 505 | 0.000050 | net.yahoo |
| 712 | 16248921 | 248 | 0.000092 | com.nielsen |
| 713 | 16247818 | 1066 | 0.000033 | site.tenerifeforum |
| 714 | 16247744 | 2129 | 0.000014 | com.theonion |
| 715 | 16247457 | 732 | 0.000040 | com.atlassian |
| 716 | 16246881 | 726 | 0.000040 | com.sharefile |
| 717 | 16245304 | 821 | 0.000037 | org.osgeo |
| 718 | 16244753 | 2329 | 0.000012 | com.searchenginejournal |
| 719 | 16244591 | 1541 | 0.000021 | com.searchenginewatch |
| 720 | 16243862 | 1669 | 0.000019 | com.windows |
| 721 | 16243809 | 2503 | 0.000011 | org.greenpeace |
| 722 | 16242894 | 1061 | 0.000033 | org.bravenewvoices |
| 723 | 16242695 | 2314 | 0.000012 | edu.wustl |
| 724 | 16242478 | 2015 | 0.000014 | uk.ac.lse |
| 725 | 16242148 | 958 | 0.000034 | com.2findlocal |
| 726 | 16241958 | 1929 | 0.000015 | edu.ucdavis |
| 727 | 16238994 | 2808 | 0.000010 | edu.uoregon |
| 728 | 16238622 | 772 | 0.000038 | org.openweathermap |
| 729 | 16238445 | 1479 | 0.000022 | com.kissmetrics |
| 730 | 16237769 | 2095 | 0.000014 | net.jsfiddle |
| 731 | 16237405 | 1629 | 0.000019 | com.chron |
| 732 | 16237225 | 1900 | 0.000015 | gov.usaid |
| 733 | 16237011 | 1404 | 0.000024 | com.steamcommunity |
| 734 | 16236194 | 941 | 0.000035 | com.ripple |
| 735 | 16234398 | 1773 | 0.000017 | org.craigslist |
| 736 | 16234380 | 1768 | 0.000017 | com.howstuffworks |
| 737 | 16233822 | 788 | 0.000038 | com.hilton |
| 738 | 16233735 | 1382 | 0.000025 | com.alibaba |
| 739 | 16233361 | 2582 | 0.000011 | edu.uga |
| 740 | 16232973 | 2725 | 0.000010 | edu.pitt |
| 741 | 16232849 | 1663 | 0.000019 | com.yoast |
| 742 | 16232226 | 2847 | 0.000010 | com.rottentomatoes |
| 743 | 16232203 | 239 | 0.000097 | org.purl |
| 744 | 16230013 | 1416 | 0.000024 | org.plos |
| 745 | 16229506 | 1849 | 0.000016 | com.espn |
| 746 | 16228657 | 2078 | 0.000014 | com.gamespot |
| 747 | 16226657 | 4149 | 0.000007 | ca.yorku |
| 748 | 16226145 | 2437 | 0.000012 | gov.cia |
| 749 | 16224968 | 385 | 0.000063 | com.youku |
| 750 | 16224769 | 1683 | 0.000018 | com.csmonitor |
| 751 | 16224511 | 890 | 0.000035 | tv.twitch |
| 752 | 16224300 | 3502 | 0.000008 | com.secondlife |
| 753 | 16223843 | 1481 | 0.000022 | com.hollywoodreporter |
| 754 | 16222670 | 1668 | 0.000019 | net.battle |
| 755 | 16221357 | 1870 | 0.000016 | com.irishtimes |
| 756 | 16221082 | 927 | 0.000035 | com.bizcommunity |
| 757 | 16220820 | 2328 | 0.000012 | edu.vt |
| 758 | 16220679 | 2375 | 0.000012 | com.technet |
| 759 | 16220669 | 806 | 0.000037 | uk.co.currys |
| 760 | 16219206 | 3157 | 0.000009 | com.avast |
| 761 | 16217463 | 1485 | 0.000022 | org.fao |
| 762 | 16217257 | 2055 | 0.000014 | com.twilio |
| 763 | 16215764 | 717 | 0.000040 | com.netdna-cdn |
| 764 | 16215730 | 2844 | 0.000010 | com.popsci |
| 765 | 16215299 | 2205 | 0.000013 | com.podbean |
| 766 | 16214805 | 1256 | 0.000029 | org.redcross |
| 767 | 16213945 | 2813 | 0.000010 | org.kqed |
| 768 | 16213937 | 1453 | 0.000023 | us.tx.state |
| 769 | 16213886 | 420 | 0.000058 | br.com.google |
| 770 | 16212530 | 1785 | 0.000017 | mil.navy |
| 771 | 16211785 | 2010 | 0.000014 | com.netvibes |
| 772 | 16211715 | 3255 | 0.000008 | edu.iastate |
| 773 | 16209341 | 1631 | 0.000019 | com.animoto |
| 774 | 16209268 | 2916 | 0.000010 | int.esa |
| 775 | 16209261 | 2214 | 0.000013 | com.makezine |
| 776 | 16208037 | 2024 | 0.000014 | edu.ucsf |
| 777 | 16208029 | 3576 | 0.000008 | uk.ac.manchester |
| 778 | 16206623 | 1904 | 0.000015 | com.foxsports |
| 779 | 16206024 | 1792 | 0.000017 | com.blogtalkradio |
| 780 | 16205145 | 1336 | 0.000026 | com.docker |
| 781 | 16204975 | 1632 | 0.000019 | mil.army |
| 782 | 16204618 | 2335 | 0.000012 | com.lonelyplanet |
| 783 | 16204345 | 1440 | 0.000023 | jp.blogspot |
| 784 | 16203706 | 3517 | 0.000008 | edu.wsu |
| 785 | 16203462 | 1754 | 0.000017 | co.angel |
| 786 | 16202931 | 702 | 0.000041 | com.technorati |
| 787 | 16202713 | 1621 | 0.000019 | com.today |
| 788 | 16202473 | 228 | 0.000104 | com.elegantthemes |
| 789 | 16201395 | 1553 | 0.000021 | com.fedex |
| 790 | 16201336 | 1827 | 0.000016 | com.macworld |
| 791 | 16200855 | 1689 | 0.000018 | ru.spb |
| 792 | 16200440 | 3491 | 0.000008 | org.eu |
| 793 | 16199798 | 3940 | 0.000007 | edu.byu |
| 794 | 16199604 | 1980 | 0.000015 | com.topsy |
| 795 | 16199518 | 1308 | 0.000028 | gov.energy |
| 796 | 16199425 | 1932 | 0.000015 | edu.umass |
| 797 | 16198010 | 1782 | 0.000017 | org.cancer |
| 798 | 16197684 | 827 | 0.000037 | com.themonitor |
| 799 | 16197204 | 1539 | 0.000021 | gov.congress |
| 800 | 16197173 | 453 | 0.000054 | com.zenfolio |
| 801 | 16195131 | 326 | 0.000073 | com.newrelic |
| 802 | 16193637 | 981 | 0.000034 | com.scribblemaps |
| 803 | 16193445 | 1327 | 0.000027 | com.webnode |
| 804 | 16193351 | 1309 | 0.000028 | com.zoho |
| 805 | 16192916 | 1403 | 0.000024 | com.techrepublic |
| 806 | 16192600 | 469 | 0.000052 | jp.ne.sakura |
| 807 | 16191941 | 1475 | 0.000022 | com.html5rocks |
| 808 | 16191512 | 752 | 0.000039 | gov.sec |
| 809 | 16191011 | 246 | 0.000093 | me.line |
| 810 | 16190396 | 560 | 0.000046 | gov.export |
| 811 | 16190321 | 2421 | 0.000012 | com.redbull |
| 812 | 16190245 | 1289 | 0.000028 | de.bund |
| 813 | 16190148 | 1273 | 0.000028 | com.formstack |
| 814 | 16189401 | 1727 | 0.000018 | org.pewresearch |
| 815 | 16187055 | 2504 | 0.000011 | org.documentcloud |
| 816 | 16186153 | 2206 | 0.000013 | com.denverpost |
| 817 | 16185105 | 1791 | 0.000017 | com.freepik |
| 818 | 16184501 | 1164 | 0.000032 | gov.justice |
| 819 | 16184479 | 83 | 0.000405 | com.shareaholic |
| 820 | 16184211 | 857 | 0.000036 | org.bouncycastle |
| 821 | 16184134 | 181 | 0.000137 | info.aboutads |
| 822 | 16183073 | 1006 | 0.000034 | com.weddingbee |
| 823 | 16182519 | 22 | 0.001469 | com.wixstatic |
| 824 | 16181153 | 1822 | 0.000016 | com.sky |
| 825 | 16180898 | 3587 | 0.000008 | edu.syr |
| 826 | 16180704 | 442 | 0.000055 | com.teamviewer |
| 827 | 16180502 | 1741 | 0.000017 | edu.cuny |
| 828 | 16179437 | 1492 | 0.000022 | de.heise |
| 829 | 16179398 | 2290 | 0.000012 | com.refinery29 |
| 830 | 16179105 | 1373 | 0.000025 | com.gigaom |
| 831 | 16179051 | 7653 | 0.000004 | nr.co |
| 832 | 16178943 | 3066 | 0.000009 | com.seekingalpha |
| 833 | 16178795 | 509 | 0.000049 | com.informit |
| 834 | 16178394 | 2169 | 0.000013 | com.pbworks |
| 835 | 16176779 | 2857 | 0.000010 | com.threadless |
| 836 | 16176523 | 1009 | 0.000034 | com.spoke |
| 837 | 16176244 | 1690 | 0.000018 | com.salon |
| 838 | 16175314 | 864 | 0.000036 | com.tractorsupply |
| 839 | 16175036 | 436 | 0.000055 | ru.vkontakte |
| 840 | 16173419 | 7092 | 0.000004 | com.xanga |
| 841 | 16173061 | 834 | 0.000036 | com.withoutabox |
| 842 | 16172167 | 3431 | 0.000008 | edu.rochester |
| 843 | 16172004 | 1928 | 0.000015 | google.blog |
| 844 | 16171996 | 3039 | 0.000009 | cc.tiny |
| 845 | 16171862 | 2338 | 0.000012 | com.sony |
846161717454950.000050com.mapbox
8471617157922160.000013edu.uiuc
8481617143513690.000025com.justgiving
849161710189730.000034com.quandl
8501617090730440.000009edu.oregonstate
8511617077930880.000009edu.rice
852161702579890.000034com.citysquares
8531616930315220.000021com.accenture
8541616904517170.000018gov.weather
8551616826425780.000011ch.cern
8561616778723650.000012com.nbcsports
8571616766534580.000008tt.db
8581616745713110.000027gov.ny
8591616690437640.000007com.panoramio
860161664713980.000061com.list-manage1
8611616571634990.000008edu.fsu
8621616560216720.000019com.indeed
8631616482416700.000019org.gnome
8641616426723060.000012com.motherjones
8651616410932860.000008com.techsmith
8661616402320210.000014de.bild
867161637869870.000034com.zwire
868161637689360.000035org.gwtproject
8691616344115930.000020uk.co.thetimes
8701616174511830.000031com.hostgator
8711616168322470.000013com.shutterfly
8721616122475570.000004com.weheartit
8731616107810370.000033com.lacartes
8741615988521200.000014me.flavors
8751615949418690.000016com.digitaltrends
8761615887725180.000011com.lego
8771615886746850.000006com.skyrock
8781615824814550.000023com.ssrn
879161565917740.000038ru.google
8801615638016610.000019ru.narod
8811615531927110.000010au.edu.anu
8821615513529070.000010net.nocookie
8831615439516820.000018com.infoworld
8841615373617770.000017com.starbucks
8851615281710180.000034com.live5news
8861615193239840.000007to.gplus
8871615147040440.000007org.nypl
8881615145421060.000014com.trendmicro
8891615093516160.000019com.codeplex
8901615078616490.000019com.gettyimages
891161501625160.000049com.typeform
8921614938018140.000016com.amzn
8931614921217330.000018com.upwork
8941614895923740.000012com.hatenablog
8951614853712750.000028uk.co.eventbrite
8961614831129520.000009ly.cl
897161482679790.000034au.com.yelp
8981614766912210.000030com.linksynergy
8991614762324500.000012tv.blip
900161475939660.000034com.strawberryperl
9011614669225160.000011com.ezinearticles
9021614625057460.000005com.minus
9031614612312800.000028gov.archives
904161460029910.000034net.brownbook
9051614589520410.000014org.c-span
9061614583343990.000006com.treehugger
9071614569614960.000022se.google
9081614559014130.000024com.smashingmagazine
9091614520531200.000009com.askmen
9101614459923590.000012com.rt
9111614436213400.000026gov.sba
9121614336821910.000013com.madmimi
9131614328932010.000009com.voanews
9141614219310310.000034edu.alamo
9151614127812930.000028be.google
91616141249980.000323org.nginx
9171614025527900.000010com.asus
9181613977816990.000018com.techradar
9191613970220090.000014com.allthingsd
9201613907421500.000013com.mentalfloss
9211613895540090.000007net.minecraft
9221613770244170.000006com.pbase
9231613622316590.000019com.bloglovin
9241613601415230.000021com.forrester
925161359249290.000035com.sacurrent
9261613555611820.000032com.strikingly
9271613537717810.000017org.openoffice
9281613481710540.000033com.garmin
9291613475411570.000033org.postimg
9301613456524750.000011com.eonline
9311613418015950.000020com.lulu
9321613193618090.000016com.ibtimes
933161317789240.000035com.fabric
9341613171316550.000019com.zillow
935161316239900.000034com.shareasale
9361613149121610.000013com.history
9371613133215420.000021com.mcafee
9381613103154420.000005com.archdaily
939161307913240.000073com.cloudinary
9401613060437000.000007com.thingiverse
9411613041636330.000008com.starwars
9421613003931490.000009com.pitchfork
9431613000735280.000008com.gyazo
9441612970818610.000016ca.huffingtonpost
945161290393550.000068com.monster
9461612894740340.000007com.tistory
9471612878340790.000007edu.utk
9481612854938580.000007com.lmgtfy
9491612849610640.000033mp.mailchi
9501612786017240.000018com.ssllabs
9511612748112470.000029org.moodle
9521612630610170.000034org.simile-widgets
9531612614222310.000013com.invisionapp
9541612601521050.000014com.real
9551612528936400.000007edu.buffalo
9561612497333420.000008com.indiewire
957161249592830.000082org.debian
9581612481120300.000014com.ew
9591612481115310.000021com.uber
9601612474750510.000006edu.gsu
961161244578360.000036com.list-manage2
9621612438013640.000025net.java
9631612393311670.000032com.tandfonline
964161239114860.000051com.taobao
9651612360316600.000019com.bmj
9661612324034200.000008org.lifehack
9671612280823020.000012com.canalblog
9681612259721410.000013edu.ucsc
969161223689800.000034org.tpr
9701612235827810.000010nl.utwente
9711612160819410.000015com.getresponse
9721612157726310.000011com.dallasnews
9731612099822370.000013edu.colorado
9741611856016380.000019com.ecwid
9751611847612870.000028es.amazon
9761611847110220.000034com.ibegin
9771611813516370.000019com.deezer
9781611798913940.000024jp.ne.goo
9791611777219710.000015jp.ne.biglobe
9801611775621300.000014edu.bu
981161177092140.000114com.homestead
982161174779310.000035com.chamberofcommerce
9831611613058920.000005ie.tcd
9841611577240850.000007edu.uconn
9851611473135900.000008edu.usf
9861611470215260.000021com.warnerbros
9871611434847770.000006ca.ucalgary
9881611385720140.000014hk.com.google
989161137861780.000139com.parallels
9901611346718410.000016com.getfirebug
9911611321915300.000021com.waze
9921611314133720.000008ru.org
9931611294931830.000009com.polyvore
9941611262424730.000011com.campaignmonitor
9951611255516840.000018com.thehill
996161124079850.000034com.showmelocal
9971611235313210.000027gov.usgs
9981611193719080.000015jp.or.nhk
9991611175758510.000005com.rapidshare
10001611164730400.000009com.expedia
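
The last field of each row above is a name in reversed notation (for example uk.co.currys for currys.co.uk). A minimal Python sketch for turning such a name back into the usual order is shown here; the function name `unreverse_name` is hypothetical and not part of any released tooling. For host names, a stripped leading www. cannot be recovered this way.

```python
def unreverse_name(rev_name: str) -> str:
    """Reverse the order of the dot-separated labels of a host or domain name."""
    return ".".join(reversed(rev_name.split(".")))


# Names taken from the ranking rows above.
for rev in ("uk.co.currys", "com.expedia", "br.com.google"):
    print(rev, "->", unreverse_name(rev))
```

Applying the function twice returns the original reversed name, so the same helper can be used to build lookup keys from ordinary domain names.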

Graphs of January 2018 Crawl

We erroneously released web graphs and rankings of a single monthly crawl (January 2018) instead of a quarterly release covering three crawls. To ensure reproducibility, we have preserved the erroneous release.

The host-level graph consists of 775 million nodes and 2.7 billion edges. The graph includes dangling nodes, i.e. hosts that have not been crawled yet are pointed to from a link on a crawled page. There are 719 million dangling nodes (93%).

Download files of the Common Crawl Jan 2018 host-level webgraph

Size | File | Description
4.84 GB | cc-main-2018-jan-host-vertices.txt.gz | nodes ⟨id, rev host⟩
10.21 GB | cc-main-2018-jan-host-edges.txt.gz | edges ⟨from_id, to_id⟩
4.90 GB | cc-main-2018-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-host.properties |
5.94 GB | cc-main-2018-jan-host-t.graph | transpose of the graph (outlinks mapped to inlinks)
2 kB | cc-main-2018-jan-host-t.properties |
1 kB | cc-main-2018-jan-host.stats | WebGraph statistics
10.79 GB | cc-main-2018-jan-host-ranks.txt.gz | harmonic centrality and pagerank
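
As a rough illustration of how the edges file relates to the dangling-node figure quoted above, the following Python sketch counts nodes that never appear as `from_id`. It rests on several assumptions that the announcement does not state: that fields are separated by whitespace, that node ids are contiguous integers starting at 0, and that "no outgoing edge" is an acceptable approximation of "not crawled". The function names are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def read_edges(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (from_id, to_id) pairs from lines of an edges file."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) == 2:
            yield int(fields[0]), int(fields[1])


def count_dangling(num_nodes: int, edges: Iterable[Tuple[int, int]]) -> int:
    """Count nodes that never occur as the source of an edge."""
    has_outlink = bytearray(num_nodes)  # one flag byte per node id
    for from_id, _to_id in edges:
        has_outlink[from_id] = 1
    return num_nodes - sum(has_outlink)


def count_dangling_in_file(path: str, num_nodes: int) -> int:
    """Same count, streaming a gzip-compressed edges file from disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return count_dangling(num_nodes, read_edges(f))


# Tiny made-up example: nodes 0..3, only nodes 0 and 1 have outgoing edges.
sample = ["0\t1", "0\t2", "1\t3"]
print(count_dangling(4, read_edges(sample)))  # prints 2
```

For the real file, `num_nodes` would be the number of lines in the vertices file. The flag array needs one byte per node, so roughly 775 MB for the January 2018 host-level graph.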

The domain-level graph has 70 million nodes and 835 million edges. 42 million nodes (60%) are dangling nodes, and the largest strongly connected component covers 22 million nodes (31%).

Download files of the Common Crawl Jan 2018 domain-level webgraph

Size | File | Description
0.49 GB | cc-main-2018-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
3.30 GB | cc-main-2018-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
1.80 GB | cc-main-2018-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-domain.properties |
1.89 GB | cc-main-2018-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-jan-domain-t.properties |
1 kB | cc-main-2018-jan-domain.stats | WebGraph statistics
1.46 GB | cc-main-2018-jan-domain-ranks.txt.gz | harmonic centrality and pagerank
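
The domain vertices file carries the number of hosts per domain as a third field ⟨id, rev domain, num hosts⟩. The Python sketch below parses such lines and lists the domains with the most hosts. It assumes whitespace-separated fields; the type and function names are hypothetical, and the sample rows are made up for illustration rather than taken from the released data.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple


class DomainVertex(NamedTuple):
    node_id: int
    rev_domain: str
    num_hosts: int


def parse_domain_vertices(lines: Iterable[str]) -> Iterator[DomainVertex]:
    """Parse lines of the form: id, reversed domain name, number of hosts."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) == 3:
            yield DomainVertex(int(fields[0]), fields[1], int(fields[2]))


def top_by_hosts(vertices: Iterable[DomainVertex], k: int) -> List[DomainVertex]:
    """Return the k domains with the largest host count, largest first."""
    return heapq.nlargest(k, vertices, key=lambda v: v.num_hosts)


# Made-up sample rows, not real data from the release.
sample = ["0\tcom.example\t3", "1\torg.example\t12", "2\tnet.example\t1"]
for vertex in top_by_hosts(parse_domain_vertices(sample), 2):
    print(vertex.rev_domain, vertex.num_hosts)
```

To run this on the released file, the lines could be streamed with `gzip.open(path, "rt")` instead of the in-memory list.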

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!