Common Crawl’s First In-House Web Graph

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

The development of this graph yielded the following:

  • a ranked list of hosts to expand the crawl frontier;
  • hosts ranked by Harmonic Centrality, which is less influenced by spam, among other desirable properties (for comparison we also include PageRank);
  • the template/process for Common Crawl to produce graphs and page rankings at regular intervals.

We produced this graph, and intend to produce similar graphs going forward, because the Common Crawl community has expressed a strong interest in using Common Crawl data for graph processing.

*Please note: the graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. Seventeen percent (65 million) of the hosts represented have been crawled in one of the three monthly crawls; the remaining 320 million hosts in the graph are known only from links. (Host names are not fully verified: obviously invalid host names are skipped, but the others are not resolved in DNS.)


Extraction of links and construction of the graph

Links are taken from the WAT extracts; we also include redirects from the WARC files of the redirect and 404 dataset. All types of links are included, even purely “technical” ones pointing to JavaScript libraries, web fonts, etc.

The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain. Node IDs are assigned sequentially to the node list sorted by reversed host name. This keeps links between hosts of the same domain or the same country-code top-level domain close together and allows for efficient delta compression of the edges.
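
As a rough illustration (an assumption for this post, not the exact cc-pyspark code), the normalization can be sketched in a few lines of Python:

    # Illustrative sketch of the host-name normalization described above;
    # the exact rules in cc-pyspark may differ in detail.
    def reverse_host(host: str) -> str:
        parts = host.lower().split(".")
        if parts and parts[0] == "www":
            parts = parts[1:]                    # strip a leading "www."
        return ".".join(reversed(parts))

    print(reverse_host("www.subdomain.example.com"))  # -> com.example.subdomain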

The extraction is done in three steps:

  • links are extracted, reduced to host-level links and stored as pairs 〈reversed host from, rev. host to〉
  • host names are mapped to numeric IDs and edges are represented as 〈from id, to id〉 pairs
  • ranks are computed.

The first two steps are done with Spark and Python; the code is part of the cc-pyspark project. To compute the rankings, the web graph is loaded into the WebGraph framework.
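
To give a feel for the first two steps, here is a minimal PySpark sketch; the input file page_links.tsv (tab-separated ⟨from URL, to URL⟩ pairs), the output path and the reverse_host helper are assumptions for illustration, not the actual cc-pyspark code:

    # Minimal PySpark sketch of steps 1 and 2; input/output paths are placeholders.
    from urllib.parse import urlparse
    from pyspark.sql import SparkSession

    def reverse_host(url):
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        if parts and parts[0] == "www":
            parts = parts[1:]
        return ".".join(reversed(parts))

    sc = SparkSession.builder.appName("host-graph-sketch").getOrCreate().sparkContext

    # Step 1: reduce page-level links to distinct host-level edges
    page_links = sc.textFile("page_links.tsv").map(lambda line: line.split("\t"))
    host_edges = (page_links
                  .map(lambda p: (reverse_host(p[0]), reverse_host(p[1])))
                  .filter(lambda e: e[0] and e[1] and e[0] != e[1])  # drop intra-host links
                  .distinct())

    # Step 2: assign sequential IDs to the node list sorted by reversed host name
    host_ids = (host_edges.flatMap(lambda e: e).distinct()
                .sortBy(lambda h: h).zipWithIndex())           # (rev_host, id)

    id_edges = (host_edges.join(host_ids)                      # (from, (to, from_id))
                .map(lambda kv: (kv[1][0], kv[1][1]))          # (to, from_id)
                .join(host_ids)                                # (to, (from_id, to_id))
                .map(lambda kv: (kv[1][0], kv[1][1])))         # (from_id, to_id)

    id_edges.map(lambda e: f"{e[0]}\t{e[1]}").saveAsTextFile("host_edges_by_id")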


Hosts ranked by Harmonic Centrality and PageRank

We provide a list of nodes (host names) ranked by Harmonic Centrality and by PageRank.

You can download the ranks of all 385 million hosts. Below are the top 1000 hosts ranked by Harmonic Centrality.

Top 1000 hosts ranked by harmonic centrality

harmonic centrality rank | harmonic centrality value | PageRank rank | PageRank value | reversed host name
1380394081.00775205com.facebook
2349580843.00428814com.twitter
3335404402.00540973com.googleapis.fonts
4328874864.00286281com.youtube
5311288867.00167971com.google.plus
6303758126.00183995com.google
72853544212.00101167com.linkedin
82830483210.00118472com.instagram
9280868348.00158847com.blogger
102758349422.00055368com.pinterest
112721818840.00029768org.wikipedia.en
122707850415.00080352org.wordpress
132695958629.00043823com.apple.itunes
142666668226.00053618com.blogspot.bp.2
152664547225.00053637com.blogspot.bp.4
162662702668.00020569be.youtu
172659304624.00053727com.blogspot.bp.3
182657299620.00057575com.blogspot.bp.1
192652271058.00022707com.amazon
202649159633.00035486com.google.play
212646905619.00059326com.google.maps
222646742450.00024854com.vimeo
232644344845.00027360com.flickr
24264240445.00219270org.gmpg
252632463637.00031984com.google.mail
262631806043.00028895gl.goo
272628683644.00028723com.github
282628018865.00021446com.microsoft
292612469079.00018142com.google.support
302610349232.00043580com.adobe
3126085138100.00016162ly.bit
322604600298.00016696com.google.docs
3326023404108.00013636org.w3
3426014732152.00010471com.facebook.developers
352594901230.00043749me.wp
3625905092222.00007208com.nytimes
372584512472.00019664com.google.sites
3825833394186.00008012com.facebook.m
3925790176138.00011458com.weibo
4025732096147.00010760com.apple
412571563439.00029787com.paypal
422569229038.00030937co.t
4325689954407.00004073com.huffingtonpost
4425679134115.00013061org.creativecommons
4525657428188.00007975com.facebook.apps
462560992611.00112239com.blogblog.resources
4725583454414.00004019com.forbes
4825546588131.00012114com.imgur
4925535844416.00004002net.slideshare
5025508504669.00002026com.mashable
5125503482454.00003445com.tinyurl
522549839652.00024415com.etsy
5325478394659.00002054com.businessinsider
5425475376201.00007711com.google.drive
552547533427.00050381com.wordpress
5625458606481.00003153com.washingtonpost
5725411174189.00007959com.myspace
5825387732146.00010823com.medium
5925383148417.00003969uk.co.bbc
6025381774415.00004003com.imdb
6125381672603.00002224com.wired
6225361154572.00002431com.techcrunch
6325349038428.00003804com.bing
6425347080183.00008131com.google.groups
6525339734807.00001619com.time
6625325246473.00003265com.theguardian
6725313804447.00003533uk.co.amazon
6825313562307.00006529com.reddit
6925309096162.00009843com.feedburner.feeds
7025281640306.00006533com.tumblr
7125275352190.00007951com.soundcloud
7225266250694.00001917uk.co.dailymail
7325255004570.00002446com.surveymonkey
7425252200181.00008142org.archive.web
752525054076.00018708com.eventbrite
7625236658171.00008740com.google.developers
772523528659.00022308com.vimeo.player
7825226754613.00002197org.mozilla.addons
7925220086777.00001695com.wsj.online
8025203382316.00006277com.yahoo
81251928201356.00001042com.wsj.blogs
8225177614336.00005412com.issuu
8325174772828.00001586com.latimes
842517348266.00021367com.vk
8525164782628.00002133com.cnn
8625163448396.00004196com.dropbox
8725154430588.00002310me.about
8825149472698.00001900org.npr
8925121032504.00002963com.meetup
9025105294112.00013267com.gravatar
91251034881094.00001268com.theatlantic
9225100710652.00002072com.gmail
9325099516528.00002776org.wikimedia.upload
9425093888119.00012789com.imgur.i
9525090464215.00007402me.fb
96250899081156.00001216com.venturebeat
9725084918391.00004336com.google.picasaweb
9825083712526.00002804com.dropboxusercontent.dl
99250817581083.00001277com.ted
10025068848740.00001773com.reuters
101250673921439.00000979com.economist
10225060080971.00001483com.adweek
1032504074814.00084868com.macromedia.download
104250361481619.00000895com.thenextweb
10525029640681.00001990com.goodreads
106250259541168.00001199com.cnn.money
10725024186476.00003199com.google.translate
10825022650530.00002745com.dailymotion
10925022502170.00008762com.facebook.fr-fr
11025012510713.00001855gov.whitehouse
11124989682501.00002977gov.nih.nlm.ncbi
1122498434480.00018111com.twitter.mobile
113249653741457.00000967ca.cbc
11424959346718.00001832org.archive
11524943750216.00007377com.facebook.es-la
116249367721053.00001316org.pbs
11724936464219.00007270com.microsoft.windows
11824931290804.00001628com.scribd
119249308461067.00001298com.cnbc
120249307222132.00000684com.youtube.m
121249223601299.00001085com.buzzfeed
12224915482426.00003817org.wikimedia.commons
123249143681522.00000924com.newyorker
12424912590600.00002235com.microsoft.msdn
12524906018671.00002023com.geocities
126249006242899.00000490net.boingboing
127248888381318.00001070com.gizmodo
12824878576355.00005095com.netvibes
12924869944826.00001589com.prnewswire
130248648301824.00000811com.examiner
131248540061340.00001059com.engadget
1322485297854.00024174net.sourceforge
133248479881951.00000758com.storify
13424846402598.00002245com.stackoverflow
13524845912955.00001522uk.co.guardian
136248439041164.00001206uk.co.independent
13724836756168.00009144com.google.code
138248356421080.00001280com.nature
13924830762782.00001688com.bizjournals
1402482868242.00029373com.twimg.pbs
14124825094141.00011273com.google.feedburner
14224823614185.00008036com.macromedia
14324821860151.00010505org.mozilla
14424821036212.00007504com.facebook.web
145248200421785.00000829com.pcworld
14624815138589.00002307com.ebay
14724808886196.00007819com.qq.t
14824802454159.00010021com.amazonaws.s3
149247873041360.00001039com.foxnews
150247831681482.00000940com.slate
15124781108432.00003772net.behance
15224779900648.00002090com.ibm
153247748722022.00000727com.arstechnica
154247716082715.00000522com.indiatimes.timesofindia
155247693802004.00000732au.net.abc
156247687021660.00000885com.marketwatch
15724768442424.00003843de.amazon
15824768200472.00003274com.google.feedproxy
15924756046221.00007213com.facebook.en-gb
1602475310297.00017011com.facebook.l
16124753098373.00004743com.digg
162247514801407.00001004com.ft
16324747080356.00005021com.microsoft.support
164247469201621.00000895com.quora
165247455001929.00000769com.gigaom
166247418681728.00000861com.sfgate
167247401501121.00001241tv.ustream
168247273861454.00000968com.chicagotribune
169247260661432.00000981com.wikihow
17024719856174.00008585com.messenger
17124713146136.00011598com.istockphoto
17224711160344.00005293com.stumbleupon
173247066921055.00001315uk.co.bbc.news
174247056982007.00000732com.boston
175247022882028.00000723com.searchengineland
176246998041193.00001154com.cafepress
17724699314869.00001553fm.last
1782469125817.00069783com.google.accounts
17924686178609.00002203com.usatoday
180246823021538.00000919com.indiegogo
18124659484997.00001423com.google.books
18224658388963.00001506com.yahoo.finance
18324655636637.00002121com.yahoo.groups
18424652858452.00003483com.google.news
18524651408349.00005207com.googleusercontent.lh4
18624648530339.00005363jp.co.amazon
1872464673646.00026535com.wix
18824642202522.00002841to.amzn
189246386843434.00000408com.nationalgeographic
190246368461070.00001295org.mozilla.developer
191246292121838.00000805com.businessweek
192246261241186.00001167com.dropbox.dl
193246260581200.00001149com.fortune
194246249781812.00000816com.mtv
195246240802074.00000707com.go.espn
19624619310142.00011187com.facebook.de-de
197246172941492.00000936com.gofundme
19824615060517.00002905uk.gov
199246100441036.00001338com.cargocollective
200246093181059.00001310com.zazzle
20124609048462.00003391com.nbcnews
202246077041398.00001010ly.ow
203246011542048.00000714com.politico
204245974522010.00000731com.cnet.news
205245936142482.00000576au.com.smh
20624592606743.00001767com.kickstarter
20724589922105.00014729org.w3.validator
20824583110758.00001739ca.google
20924571000687.00001965com.delicious
210245685761376.00001028com.yahoo.news
211245648441444.00000975com.prweb
212245631081486.00000937com.technologyreview
213245622162925.00000486com.csmonitor
214245567061047.00001324com.go.abcnews
215245562603095.00000455com.merriam-webster
21624555618645.00002095com.spotify.open
217245554561153.00001218com.zdnet
218245536261052.00001319com.wiley.onlinelibrary
219245535522208.00000661com.yahoo.sports
220245483502209.00000661com.nymag
221245480341865.00000789net.researchgate
222245437841760.00000841com.cnn.edition
223245402161862.00000791com.angelfire
224245386845476.00000249com.thenation
2252453803487.00017752com.wp.i1
22624537674658.00002055uk.co.telegraph
22724536988366.00004837uk.co.google
228245346541151.00001221com.entrepreneur
2292453158034.00033395com.twitter.blog
23024531188372.00004753com.tripadvisor
231245287342999.00000473com.thedailybeast
232245273481099.00001264fr.amazon
233245259181286.00001099gov.nps
234245229101215.00001137tv.twitch
23524522716213.00007442com.facebook.pt-br
236245116202409.00000597uk.co.theregister
237245098821930.00000769com.prezi
238245095721333.00001065org.change
23924508852229.00007108com.google.chrome
24024507186368.00004829com.apple.support
2412450549867.00021293com.addthis
24224504130567.00002465com.google.video
24324498714345.00005293de.google
244244926704843.00000285au.com.theage
245244898282894.00000491com.salon
246244848261873.00000784org.arxiv
24724482800691.00001931org.wikipedia.fr
248244823101346.00001054com.microsoft.office
24924482184198.00007767jp.ameblo
250244765103078.00000459com.xkcd
251244747901694.00000872com.pcmag
252244745561459.00000965gov.nasa
253244739941969.00000747com.mixcloud
254244709586235.00000222com.reuters.blogs
255244693541649.00000886com.feedburner.feeds2
256244682461232.00001129com.cnet
25724463680204.00007689eu.europa.ec
258244552285903.00000232com.laughingsquid
259244543121324.00001067com.fastcompany
260244515445791.00000235com.forbes.blogs
261244509503224.00000442com.vox
262244494881088.00001270com.reverbnation
263244448141518.00000926ca.amazon
26424442382324.00005879com.weebly
265244421501747.00000847com.blogspot.googleblog
266244406343692.00000376com.google.images
267244398823248.00000437com.billboard
26824438356347.00005214com.googleusercontent.lh5
26924437560331.00005727com.yelp
270244341442445.00000585com.google.productforums
27124430074218.00007273com.facebook.business
27224429478489.00003099com.windowsphone
27324428674226.00007160me.m
274244282941443.00000975com.newsweek
27524426648206.00007652com.facebook.es-es
276244261884379.00000317com.theonion
277244256941499.00000935it.scoop
278244251103838.00000358com.pandora
27924423972546.00002567org.wikipedia.es
28024423302608.00002209com.bloomberg
2812442022257.00023139com.twitter.support
282244191781797.00000822com.adage
28324418438104.00014890com.adobe.get
284244174642056.00000711com.walmart
285244124422704.00000524com.rollingstone
28624409082223.00007206com.facebook.id-id
28724408340332.00005712com.deviantart
28824407960749.00001751jp.ne.hatena.d
289244078102058.00000711com.variety
290244048581635.00000891com.webmd
291244041342590.00000551com.thehill
292244035001784.00000829com.adobe.blogs
293244017702377.00000600com.usnews
294243965862450.00000583me.fb.on
29524396528764.00001718com.wsj
296243923602414.00000595com.bleacherreport
29724390596459.00003408com.technorati
298243888582251.00000646com.shutterstock
299243855302357.00000608com.qz
30024385054193.00007869com.facebook.it-it
301243843102420.00000594org.sciencemag
302243833585654.00000242com.esquire
303243831801198.00001150au.com.google
30424380792985.00001445com.foursquare
305243799162339.00000612edu.stanford
30624379554539.00002638jp.livedoor.blog
307243778481309.00001078com.theverge
3082437442213402.00000109com.hackaday
309243713321367.00001029co.vine
310243681501452.00000969com.msn.msnbc
311243679126479.00000213com.ted.blog
312243659963719.00000372gd.is
313243653184181.00000331com.vice
314243651062983.00000476com.nbc
31524363772679.00001994gov.cdc
31624363062191.00007939com.xing
317243622542891.00000492com.scientificamerican
318243617361069.00001297com.cbsnews
31924361436413.00004029us.icio.del
320243594706886.00000201com.scienceblogs
321243587443043.00000466com.microsoft.research
322243564203502.00000398com.bestbuy
32324356010988.00001441com.bbc
324243548865002.00000275com.gawker
325243528783988.00000346com.startribune
326243490142078.00000706fr.lemonde
327243480744191.00000330com.allrecipes
328243462745087.00000270com.space
329243447301878.00000783com.smashingmagazine
330243440185668.00000241com.treehugger
33124342878747.00001758es.google
332243408442655.00000535uk.co.huffingtonpost
333243359808358.00000164com.techdirt
334243340661068.00001297org.wikipedia
335243339344872.00000284com.nytimes.blogs.well
336243334361313.00001075br.com.google
33724332660542.00002617com.timeout
3382433253818.00063404com.googleusercontent.lh3
339243317382088.00000699com.redbubble
340243308043878.00000353com.miamiherald
341243299102057.00000711com.msdn.blogs
342243296883651.00000381com.refinery29
343243286282118.00000688com.ign
34424328504310.00006387com.livejournal
345243279564641.00000298com.panoramio
346243247182281.00000634edu.mit.web
347243244024910.00000281com.answers
34824323460980.00001451com.apple.developer
349243211422632.00000542com.apple.phobos
35024320832974.00001466com.example
35124318370155.00010226com.polyvore
352243148741003.00001410com.marriott
35324314410593.00002278com.dribbble
354243142923608.00000386org.greenpeace
35524314094667.00002032com.sxsw
356243109742928.00000485com.newscientist
35724310954205.00007659com.facebook.nl-nl
358243101765470.00000250com.dreamstime
359243096664235.00000326com.chronicle
36024308702480.00003159net.php
361243074645030.00000273org.moma
362243035367255.00000191org.grist
36324303532207.00007629com.facebook.pl-pl
364243034542426.00000592com.ehow
3652430249856.00023600com.wp.i2
3662430118828.00045486com.urbandictionary
367242998681190.00001160com.fb
36824297602246.00006938org.cwa-union
36924297130360.00004944com.disqus
37024290242441.00003574com.alexa
371242883441991.00000736com.lifehacker
372242873422687.00000528gov.fws
373242873142416.00000594uk.co.mirror
374242855944702.00000294com.rottentomatoes
37524283856722.00001816com.bitly
376242838562850.00000499gov.archives
377242808364042.00000343com.vogue
378242805804998.00000276com.patheos
379242769365141.00000268com.snopes
380242753383214.00000443com.zimbio
381242736449247.00000147com.infowars
382242710983246.00000438com.technet.blogs
383242679182140.00000683com.hubspot.blog
384242656363639.00000382com.marthastewart
38524265482235.00007000com.facebook.pt-pt
38624264716721.00001818com.salesforce
38724261210736.00001784com.nwsource.seattletimes
388242596345424.00000252com.gq
389242593345525.00000247uk.org.tate
39024255346614.00002189com.orkut
391242548383037.00000467com.gallup
392242534885260.00000261com.oregonlive
39324253108271.00006760com.facebook.zh-tw
394242516761023.00001376com.wunderground
395242514524356.00000318com.mlb.mlb
396242506824493.00000307com.motherjones
397242500281084.00001272com.inc
398242452121766.00000838com.target
399242450443265.00000435com.google.profiles
400242431066758.00000204com.sheknows
401242428964403.00000315li.paper
402242421001931.00000768com.lulu
403242408986136.00000225com.petapixel
40424240804154.00010339net.fbcdn.xx.scontent
405242404043693.00000376au.com.news
406242386466525.00000211com.ndtv
407242377061983.00000740gov.nih.nlm
4082423582610370.00000131com.cnn.blogs.politicalticker
40924234434352.00005179org.gnu
4102423339853.00024219us.peeep
411242324924365.00000317org.thinkprogress
412242324022681.00000529com.nba
41324230752559.00002517com.android.market
414242275129722.00000140com.kodak
415242244484797.00000289edu.brookings
416242237923583.00000389com.css-tricks
417242195204692.00000295com.latimes.latimesblogs
418242181627020.00000197com.dailykos
419242179222643.00000540com.popsugar
420242162024721.00000293com.rt
421242145641249.00001102in.co.google
42224210576263.00006836com.facebook.sv-se
423242055202501.00000571com.nfl
42424205488863.00001556org.doi.dx
425242035627641.00000181com.care2
426241990824847.00000285com.plurk
4272419900214486.00000101com.makezine.blog
42824194726519.00002888com.mozilla
42924192646642.00002103com.barnesandnoble
430241921844227.00000326org.raspberrypi
431241876025989.00000230com.jezebel
432241870161145.00001231org.python
433241844082410.00000596com.psychologytoday
434241832805080.00000271com.mediabistro
435241824283657.00000380com.instructables
436241818064487.00000308com.baltimoresun
437241796861354.00001044com.google.scholar
43824178630261.00006882net.akamaihd.fbcdn-sphotos-a-a
4392417861223.00054639com.bootstrapcdn.maxcdn
440241784721234.00001124com.linkedin.ca
441241779286438.00000214com.dezeen
442241768542474.00000578com.people
443241751481002.00001411com.mediafire
444241743522470.00000578com.indeed
445241719204428.00000312net.comcast.home
446241714282657.00000535com.readwriteweb
447241684783905.00000349com.macworld
448241676301481.00000940com.box
449241652563575.00000390es.elmundo
45024165178746.00001760com.microsoft.technet
451241637381844.00000801com.500px
452241622768600.00000159com.consumerist
453241611108423.00000163com.uproxx
4542416003613745.00000107com.dawn
455241574682049.00000714com.sciencedaily
456241571227100.00000195org.alternet
4572415672412559.00000117com.chicagonow
45824156126959.00001516com.photobucket
459241546886392.00000216com.designboom
460241537803864.00000355com.blurb
461241518304664.00000296com.weheartit
46224148122502.00002977com.opera
463241443021063.00001300gov.epa
4642414313811507.00000127com.wbir
465241406464335.00000319com.foreignpolicy
46624139430458.00003416org.ietf
467241362524034.00000343com.nikkei
4682413211031.00043664com.statcounter
46924131460237.00006978com.facebook.th-th
470241308566366.00000217com.geekwire
471241262661625.00000893com.linkedin.in
472241241588222.00000167com.appleinsider
473241237307141.00000194com.avclub
47424121082252.00006926net.akamaihd.fbcdn-profile-a
4752412008215904.00000093com.wonkette
476241186943871.00000354com.chron
47724114758977.00001461com.houzz
47824114164269.00006781com.facebook.tr-tr
47924112388623.00002160gov.ftc
480241114508902.00000153com.reason
481241097103459.00000403tv.blip
482241068461484.00000938com.google.photos
48324106650543.00002611com.oracle
484241053707602.00000182com.pastemagazine
485241033281483.00000938gov.copyright
486241007863055.00000464org.aclu
487240947583076.00000460com.philly
488240932521685.00000878com.squareup
489240893561086.00001272com.samsung
490240879783118.00000451com.me.web
491240876601231.00001129com.cdbaby
492240875646720.00000205com.deseretnews
493240831766367.00000217com.io9
49424081856402.00004145org.wikipedia.de
4952408112411338.00000128org.peta
4962408090812982.00000113com.hongkiat
497240806804684.00000295com.tmz
49824077924818.00001605com.amazon.aws
499240772568766.00000156org.pri
500240741862270.00000638com.oreilly
501240741102291.00000630com.freewebs
502240740881235.00001123org.wikipedia.it
503240733903848.00000356com.azcentral
504240731926245.00000221com.mentalfloss
50524069232523.00002830fr.google
5062406884817311.00000085com.tor
507240671322904.00000490org.worldbank
508240670541762.00000841de.heise
509240664648159.00000169com.liveleak
510240660388772.00000155com.gothamist
511240647423396.00000415com.latimes.articles
512240646269244.00000147com.extremetech
513240616403705.00000374com.yahoo.answers
5142406143032734.00000046com.wreg
515240611668511.00000161com.nybooks
516240607805379.00000254com.pbase
517240598345049.00000272edu.nap
518240585546572.00000210com.cnn.sportsillustrated
519240576566644.00000208com.grantland
520240565721639.00000887gov.loc
521240558164418.00000313org.nobelprize
522240544944260.00000324com.eonline
523240533807372.00000189com.haaretz
524240531225199.00000264com.bhphotovideo
525240529644827.00000287com.esri
526240521529995.00000136org.commondreams
527240521485161.00000266com.glamour
528240514561460.00000960com.fineartamerica
529240499806715.00000205edu.uchicago.press
530240484329103.00000149gov.nasa.science
5312404689435700.00000042com.bossip
5322404663219728.00000075com.neatorama
53324044690720.00001819org.acm
534240444624081.00000340org.weforum
535240443081897.00000777it.amazon
536240440061959.00000752me.flavors
537240433949459.00000143com.howstuffworks
538240418726466.00000213com.9to5mac
539240401424517.00000306com.uber
54024039284680.00001992com.bloglovin
5412403788216205.00000091com.highsnobiety
542240373164560.00000303com.audible
543240364907371.00000189com.complex
5442403617615916.00000093com.time.swampland
545240346544025.00000344com.lonelyplanet
5462403393411839.00000123com.dilbert
547240334363125.00000450com.deezer
548240332524168.00000333com.lynda
549240332229648.00000141com.discovermagazine.blogs
550240308404421.00000313com.cbs
551240306163761.00000367net.daringfireball
552240298601933.00000766com.patreon
553240271647181.00000194com.deadspin
554240236928065.00000171com.bostonherald
555240234346210.00000223com.cosmopolitan
55624022760829.00001585jp.ne.goo.blog
5572402138418750.00000079com.hotair
558240212386632.00000208com.librarything
559240210146000.00000230cc.arduino
560240191026652.00000207com.logitech
561240173823257.00000437com.asahi
562240154144050.00000342com.nationalgeographic.news
5632401532220178.00000073com.matadornetwork
564240152704638.00000298com.observer
565240122145943.00000231com.copyblogger
566240114445514.00000247com.seekingalpha
56724010644227.00007141mp.j
56824010056236.00006995com.xiami
569240097863429.00000408com.elpais
570240080402638.00000541com.ew
571240077307976.00000173com.bonappetit
572240069444514.00000306org.lds
573240067983385.00000416com.cbssports
574240062427137.00000194com.cbslocal.newyork
5752400479411215.00000129com.modelmayhem
57624004760982.00001449eu.europa
57724003960731.00001790com.google.hangouts
578240035506718.00000205com.vancouversun
579240023207253.00000191com.talkingpointsmemo
580240013521975.00000743com.google.spreadsheets
581240009842570.00000556cn.com.sina.blog
582240004023222.00000443com.ravelry
583239994401795.00000824com.amazon.astore
584239979021755.00000843org.eff
58523997856573.00002431com.adobe.helpx
586239974887093.00000195uk.ac.vam
587239972204633.00000299com.vice.motherboard
5882399603616595.00000089com.thesmokinggun
5892399400011829.00000124com.imore
59023993422101.00016094com.tinypic
59123993226545.00002574com.msn
592239931624385.00000316ca.globalnews
593239931188243.00000167com.discovery.dsc
594239927025617.00000244com.pitchfork
595239914245858.00000233com.blogspot.youtube-global
596239902888750.00000156com.realclearpolitics
59723987906678.00001996it.google
598239878907105.00000195com.scmp
599239861882777.00000508jp.or.nhk
60023982278686.00001969com.hubpages
601239818722779.00000508gov.uspto
602239816981374.00001028com.timeanddate
603239814388335.00000165com.christianitytoday
604239801684066.00000341net.faz
605239788987230.00000192com.theweek
6062397790028318.00000054com.gottabemobile
607239773827199.00000193org.plos.blogs
608239772044627.00000299com.howtogeek
609239754942009.00000731com.getpocket
610239731825047.00000272com.kotaku
611239727043324.00000425cc.tiny
6122397104612039.00000122com.perezhilton
6132396942613727.00000107com.mcclatchydc
61423968350651.00002075com.aol
615239681127323.00000190com.lmgtfy
61623966254822.00001598com.businesswire
617239660924345.00000319org.ibiblio
618239654783362.00000420org.unicef
619239654422417.00000594com.hollywoodreporter
620239605501041.00001333int.who
621239595801026.00001373com.android.developer
622239572945198.00000264edu.cmu
623239569306817.00000203com.sbnation
624239565487619.00000182com.marvel
625239564966517.00000211edu.harvard.law.blogs
626239520023640.00000382com.fiverr
627239505542718.00000520gov.dhs
628239499642653.00000535com.smashwords
62923948856209.00007583com.facebook.ja-jp
630239487605633.00000243com.stagram.web
6312394868014134.00000104com.nytimes.blogs.thelede
632239486766768.00000204com.nme
633239466664902.00000281com.hbo
6342394627213062.00000112org.counterpunch
6352394603811613.00000126com.cultofmac
636239439941543.00000913com.evernote
63723943942270.00006766com.360doc
638239422289626.00000141com.cracked
639239408202486.00000575com.blogtalkradio
64023939540357.00004971com.gravatar.en
64123937964389.00004423org.icann
64223937816684.00001977com.ggpht.lh3
643239377868491.00000161com.teenvogue
6442393734216658.00000088com.flickriver
645239364804027.00000344com.smithsonianmag
646239350705480.00000249com.codeproject
64723934200260.00006883net.fbcdn.ak.static
648239341021816.00000815gov.census
64923933432657.00002057com.linkedin.uk
65023932600577.00002400com.w3schools
651239317404463.00000310com.mac.homepage
6522392912210155.00000134com.rawstory
653239276623404.00000414com.squidoo
654239247462044.00000715com.dell
65523922980488.00003104com.4shared
6562392281414131.00000104org.mediamatters
657239208028760.00000156com.parents
658239205267287.00000191com.opera.my
659239204828124.00000169org.ieee.spectrum
660239201501038.00001337jp.geocities
661239152446512.00000212com.townhall
66223913292399.00004167org.mozilla.support
663239127922006.00000732org.oecd
664239118041048.00001324org.eclipse
6652391158420701.00000072com.hellogiggles
666239102988868.00000154com.clarin
66723909680827.00001589com.symantec
668239090407558.00000184org.aaas
669239087522311.00000621com.justgiving
670239084444671.00000296org.coursera
671239064821938.00000763com.nydailynews
67223905082343.00005302com.googleusercontent.lh6
67323904092387.00004468com.soundcloud.w
674239022501027.00001368gov.irs
6752390223215133.00000097com.craveonline
676239021364011.00000345com.channel4
6772389991620101.00000074com.nature.blogs
678238972189313.00000146com.myspace.blog
679238955269205.00000147com.klout
680238942081412.00000996com.steampowered.store
6812389400210218.00000133com.boredpanda
68223893512707.00001860com.friendster
6832389350436.00032755com.godaddy
684238930522950.00000481com.amzn
6852389242413276.00000110ca.globalresearch
6862389129817990.00000082org.calacademy
687238907566671.00000207net.box
688238896364900.00000282com.fanpop
689238890945845.00000234com.datacenterknowledge
6902388719627479.00000056com.americanrhetoric
691238865565185.00000265com.threadless
692238849664246.00000325ms.1drv
6932388353010188.00000134com.barackobama
6942388309812688.00000116com.spin
6952388309213664.00000107com.yahoo.pipes
696238829568684.00000157com.comedycentral
697238828961066.00001299com.googleartproject
698238827882652.00000535com.computerworld
6992388186216912.00000087com.giantbomb
70023881530276.00006705com.weibo.vdisk
7012388152023165.00000064com.wattsupwiththat
702238791463942.00000347com.screencast
7032387763210235.00000133org.tvtropes
704238774704460.00000310com.megaupload
7052387676416116.00000092com.catholicnewsagency
706238767001476.00000947org.hbr
7072387655224017.00000062com.cnn.blogs.religion
70823874712161.00009861com.mailchimp
709238743682437.00000589com.alibaba
710238740602992.00000474com.ezinearticles
711238739443058.00000463uk.co.ebay
712238737421146.00001228org.un
713238730621710.00000869org.iso
71423867822999.00001415com.snapchat
715238677766315.00000219com.victoriassecret
7162386613613917.00000105com.washingtonian
7172386596025817.00000059com.humanevents
718238659381722.00000864com.newgrounds
719238648401677.00000882com.biblegateway
720238601262516.00000567com.friendfeed
7212385830418020.00000082com.moddb
7222385770827085.00000057com.singularityhub
723238545124265.00000324com.pixlr
724238536949675.00000140com.marieclaire
72523851816242.00006954com.facebook.ar-ar
726238511448505.00000161org.ams
727238511322494.00000572com.createspace
728238497981615.00000899com.ebay.stores
729238497901169.00001199com.sciencedirect
730238492385291.00000259com.tampabay
731238481644113.00000337com.ibtimes
732238476928114.00000170com.oreilly.radar
7332384672216808.00000088com.escapistmagazine
734238466901233.00001125org.seomoz
73523845474140.00011415com.ytimg.i
736238454381618.00000895com.net-a-porter
737238453121829.00000810com.cnet.download
7382384500813611.00000108org.brooklynmuseum
739238448268231.00000167net.fanfiction
7402384427015039.00000097com.flavorwire
741238434543472.00000402com.modcloth
7422384174030133.00000050org.jihadwatch
743238410661160.00001210com.weather
744238410004818.00000287to.gplus
745238409461537.00000919com.viadeo
746238406907902.00000174org.edutopia
747238405143327.00000425org.apa
748238394465982.00000230de.tagesschau
749238378202994.00000474me.paypal
7502383753018909.00000078edu.hawaii
75123837158576.00002411com.images-amazon.ecx
752238368262544.00000561gov.fbi
753238367405216.00000263com.manta
754238363066318.00000219uk.org.nationaltrust
755238349925593.00000244com.googleusercontent.webcache
756238349168158.00000169org.truth-out
757238348584989.00000276com.typepad.sethgodin
7582383343221393.00000069org.spectator
759238324567393.00000188com.mendeley
760238308083077.00000460tr.com.google
761238305882961.00000480org.cancer
762238300483361.00000421com.networkworld
7632382996814271.00000103com.topix
764238294705491.00000249com.starwars
765238291282735.00000517com.hulu
766238268527672.00000181com.discovery.news
767238264502600.00000550org.dmoz
768238256626630.00000208com.villagevoice
769238241605362.00000255com.dpreview
770238239265207.00000263edu.cmu.cs
7712382317615187.00000097com.dazeddigital
772238225083709.00000374org.mozilla.wiki
773238218301167.00001201gov.fda
774238208921433.00000981gov.justice
775238208263603.00000387gov.cia
77623820332439.00003639com.posterous
7772382032812812.00000114au.com.sbs
778238203247682.00000180com.gamasutra
779238198706665.00000207com.epicurious
780238192663771.00000365com.socialmediaexaminer
781238192109190.00000148org.sierraclub
782238192003366.00000420net.earthlink.home
783238181362005.00000732com.gartner
784238172462354.00000608com.theglobeandmail
785238171361336.00001063org.wikipedia.pt
786238165126963.00000198com.suntimes
787238163923390.00000416nl.xs4all
7882381469225030.00000060com.elephantjournal
789238122646672.00000207com.cntraveler
790238120861004.00001409com.linkedin.fr
791238119864753.00000291com.nationalreview
792238116461999.00000735com.thefreedictionary
79323811380280.00006674com.facebook.zh-cn
794238092644474.00000309uk.ac.ucl
795238064064613.00000299com.denverpost
79623806360231.00007094kr.flic
797238063227693.00000180com.instyle
7982380559223068.00000064edu.usra.lpi
7992380412210350.00000131com.scobleizer
800238040224182.00000331uk.co.metro
80123803192233.00007056jp.co.google
802238030184971.00000277com.nvidia
803238028703887.00000352com.irishtimes
804238020701287.00001099co.g
805238017726674.00000207edu.washington.depts
806238014187466.00000187com.tennessean
807238006582439.00000589com.hp
808238003223644.00000382com.aljazeera
8092380013816549.00000089com.wmagazine
81023798820579.00002383uk.co.eventbrite
81123798338421.00003858com.googleadservices
8122379780610117.00000134com.eater
813237973908292.00000166com.yahoo.movies
814237956406997.00000197com.bhg
8152379449826069.00000058org.fair
816237936162379.00000600com.bostonglobe
817237931366353.00000217com.autoblog
8182379282810025.00000136com.linuxjournal
8192379266621056.00000070com.thisiscolossal
8202379140215130.00000097com.radaronline
8212379117014444.00000101com.fourhourworkweek
822237904003416.00000412com.thestar
823237896308992.00000151gov.nasa.apod
824237891083558.00000391com.twitpic
8252378903885.00017941com.twitter.status
826237877507183.00000194com.financialexpress
827237871787036.00000197mx.com.eluniversal
828237871704196.00000330edu.princeton
8292378695412725.00000115edu.uvm
830237864421244.00001108com.skype
831237856201350.00001049com.flickr.static.farm3
8322378405019151.00000077com.slashfilm
833237831028462.00000162org.nypl
8342378131413888.00000106com.associatedcontent
835237806903787.00000363org.gutenberg
83623779958217.00007285org.bbb
837237798367685.00000180com.macrumors
8382377809413271.00000110com.theroot
839237765105072.00000271com.akamai
840237758803305.00000428au.com.theaustralian
8412377571413415.00000109com.factmag
8422377436818373.00000080com.marykay
843237740287646.00000181com.viddler
844237713081937.00000763com.android
845237702027311.00000191gov.loc.memory
846237696961765.00000840com.yahoo.search
847237696281149.00001223com.feedburner
848237695161142.00001233com.google.adwords
849237694165410.00000253com.diigo
8502376825610007.00000136com.mattcutts
8512376742615826.00000093org.davidsuzuki
8522376643811933.00000123com.break
85323766120475.00003242org.drupal
8542376576633614.00000045com.animalnewyork
8552376575419037.00000078com.crooksandliars
856237657261920.00000772com.steamcommunity
8572376570213001.00000113com.weeklystandard
8582376545210549.00000130com.tuaw
8592376414021194.00000070com.inthesetimes
860237619765530.00000247org.hrc
861237619561860.00000793com.networkedblogs
862237605343114.00000452com.theknot
8632376036833353.00000045com.littlegreenfootballs
864237602603635.00000382com.barnesandnoble.search
865237599843656.00000380com.globo.g1
866237587748577.00000159com.smittenkitchen
867237587262079.00000705es.amazon
8682375872214721.00000100com.ktla
8692375859219539.00000076com.rediff
8702375823821443.00000069com.artofmanliness
8712375822017645.00000083org.whitney
872237576649843.00000138com.menshealth
873237572822211.00000660com.nypost
874237563902958.00000481com.gstatic.t0
8752375636213499.00000109com.ffffound
876237558049503.00000143com.inhabitat
877237553144301.00000322edu.columbia
8782375439010407.00000131com.hypebeast
879237535344665.00000296com.thinkgeek
880237531367586.00000183com.foodandwine
881237523609750.00000139org.wikibooks.en
8822375050811492.00000127com.gocomics
88323750416153.00010465ru.yandex
8842374960621336.00000069edu.rochester
8852374939043018.00000035com.2dopeboyz
8862374787619010.00000078org.nycgovparks
8872374772617755.00000083com.justjared
8882374630015964.00000092com.blogspot.googlemobile
889237455886710.00000206com.wwd
89023745272333.00005701org.fedoraproject
8912374247613660.00000107com.hollywoodlife
892237423109994.00000136mil.navy
89323741808967.00001490com.staticflickr.farm8
8942374136620244.00000073com.coolhunting
8952374052217885.00000082com.magcloud
896237395442770.00000509com.gettyimages
8972373755815787.00000093com.washingtonpost.articles
89823737374272.00006752net.akamaihd.fbstatic-a
899237373343662.00000380uk.co.thesun
900237354705484.00000249edu.yale
9012373454419149.00000077org.labnol
90223734358449.00003503nl.google
9032373371625945.00000058org.globalvoicesonline
904237324982489.00000574dk.google
905237324921345.00001055com.staticflickr.farm9
90623732404228.00007126com.facebook.da-dk
90723732036350.00005193us.imageshack
908237318083550.00000392com.mercurynews
9092373158824915.00000060com.celebuzz
910237315245769.00000236com.yahoo.groups.tech
911237308445249.00000261int.esa
912237304686716.00000205com.linkedin.blog
9132373043015982.00000092com.eurasiareview
91423730204163.00009658com.blogblog.img1
9152372960416528.00000089com.redstate
9162372868013070.00000112com.torrentfreak
9172372820623033.00000065com.movieweb
9182372774413113.00000112com.seroundtable
919237262141733.00000856edu.cornell.law
920237260803120.00000451com.nifty.homepage3
9212372547819732.00000075com.craphound
92223724518700.00001890com.ggpht.lh6
923237231621737.00000851com.hotmail
9242372312831743.00000048org.thesocietypages
925237228982015.00000729com.spotify.play
926237222149125.00000149com.scientificamerican.blogs
927237220223445.00000405com.digitaltrends
928237214463659.00000380com.jamendo
929237211562859.00000498com.netflix
9302372058215182.00000097com.stereogum
9312371990684.00017961com.twitter.business
93223719456180.00008153com.blogblog.img2
933237190808438.00000163com.dailydot
934237177944118.00000337edu.bu
935237170004558.00000303com.zara
936237153548040.00000171net.asp.weblogs
937237145462453.00000583com.ebay.rover
938237131828138.00000169com.marketingprofs
9392371289011959.00000122com.takepart
940237111626443.00000214org.propublica
941237089124866.00000284com.makeuseof
9422370630413298.00000110com.models
943237058088228.00000167com.sportingnews
944237050426501.00000212com.digitaljournal
945237049585382.00000254com.active
946237044108643.00000158ar.com.lanacion
9472370422411282.00000129com.ssrn
9482370417614358.00000102com.gazette
949237041563007.00000471org.pewinternet
9502370378812315.00000119org.caringbridge
951237022686669.00000207fm.ask
952237018067727.00000179com.politifact
9532370145212920.00000113com.theoatmeal
9542370069277.00018590com.twitter.api
9552370006811585.00000126org.brainpickings
956236995706414.00000215com.harpercollins
9572369925019689.00000075net.360cities
9582369916655038.00000028nz.co.sciblogs
959236974584465.00000310com.starbucks
960236973825868.00000233com.elle
9612369681228485.00000054com.listverse
96223695400564.00002484com.booking
963236936682996.00000473com.dallasnews
964236931603621.00000384com.pastebin
965236922266156.00000225com.purevolume
966236919541310.00001077com.amazon.smile
9672369136232716.00000046com.truthdig
968236912865706.00000240com.knowyourmeme
9692368758225990.00000058com.babble.blogs
970236860803355.00000421com.vanityfair
9712368569832746.00000046net.fubiz
97223685224562.00002501com.giphy
973236844822113.00000689com.intel
974236838543139.00000447com.livescience
9752368306013332.00000110uk.org.iwm
976236827605190.00000265com.randomhouse
977236815201191.00001159es.google.maps
9782368117637675.00000040com.tucsonweekly
979236801429980.00000136com.gilt
980236784642799.00000503com.gstatic.t2
981236784088966.00000152org.thisamericanlife
9822367816422636.00000066uk.co.creativereview
983236763485792.00000235com.microsofttranslator
984236761901633.00000891gov.sec
9852367450017492.00000084com.penny-arcade
986236739961210.00001140com.springer.link
987236736262412.00000596com.redhat
9882367312619650.00000075org.newsbusters
98923672884238.00006976com.facebook.el-gr
990236728188628.00000158com.heavy
991236721449097.00000150com.globalpost
9922367149614074.00000104com.wisegeek
993236709107992.00000172com.animoto
99423670052778.00001694com.naver.blog
995236693228386.00000164com.time.techland
996236691908652.00000158com.jalopnik
997236686843562.00000391com.indiatimes.economictimes
99823667394474.00003251ru.vkontakte
999236665521487.00000937com.msnbc
10002366609416350.00000090com.rockpapershotgun


Data and download instructions

The host-level graph as well as the rankings are available on AWS S3 under the prefix

s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

Alternatively, you can use

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

as a prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph

Size     File                   Description
2.72 GB  vertices.txt.gz        nodes ⟨id, rev host⟩
9.42 GB  edges.txt.gz           edges ⟨from_id, to_id⟩
4.51 GB  bvgraph.graph          graph in BVGraph format
0.22 GB  bvgraph.offsets
1 kB     bvgraph.properties
5.06 GB  bvgraph-t.graph        transpose of the graph (outlinks mapped to inlinks)
0.47 GB  bvgraph-t.offsets
1 kB     bvgraph-t.properties
1 kB     bvgraph.stats          WebGraph statistics
6.26 GB  ranks.txt.gz           harmonic centrality and PageRank
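
As a quick start, the gzipped text files can be streamed directly over HTTPS without downloading them in full; the sketch below assumes vertices.txt.gz holds tab-separated ⟨id, reversed host⟩ lines (the exact delimiter is an assumption):

    # Peek at the first ten nodes of the vertices file without a full download.
    # Assumes tab-separated "<id>\t<reversed host>" lines.
    import gzip
    import urllib.request

    BASE = ("https://data.commoncrawl.org/projects/hyperlinkgraph/"
            "cc-main-2017-feb-mar-apr-hostgraph/")

    with urllib.request.urlopen(BASE + "vertices.txt.gz") as resp:
        with gzip.GzipFile(fileobj=resp) as gz:
            for i, line in enumerate(gz):
                node_id, rev_host = line.decode("utf-8").rstrip("\n").split("\t", 1)
                print(node_id, rev_host)
                if i >= 9:
                    break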

We hope the data will be useful for any kind of research on ranking, graph analysis, link spam detection, and more. Let us know about your results via Common Crawl’s Google Group!


Credits

Thanks to

  • Web Data Commons, for their web graph data set and everything related.
  • Common Search; we first used their web graph to expand the crawler frontier, and Common Search’s cosr-back project was an important source of inspiration for how to process our data using PySpark.
  • the authors of the WebGraph framework, whose software simplifies the computation of rankings.

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, and knowledge (and enthusiasm!) that complements his role and the organization.

Sebastian has a PhD in Computational Linguistics and several years of experience as a programmer working in search and data. In addition to hands-on experience maintaining and improving a Nutch-based crawler like that of Common Crawl, Sebastian is a core committer to and current chair of the open-source Apache Nutch project. Sebastian’s knowledge of machine learning techniques and natural language processing components of web crawling will help Common Crawl continually improve on and optimize the crawl process and its results.

With Sebastian on board, we have both the competence and momentum to take Common Crawl to the next level.

Web image size prediction for efficient focused image crawling

This is a guest blog post by Katerina Andreadou.
Katerina is a research assistant at CERTH, where she specializes in multimedia analysis and web crawling.


In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck is the fetching of image content, since for each web page a large number of HTTP requests needs to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that the HTML img tag often carries no dimension information, an image crawler that wants to filter out small images would still need to issue a GET request and download each file before deciding whether to index it.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. To do so, we needed a large number of images accompanied by their HTML metadata for training and testing our image size prediction system. To this end we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since, for technical and financial reasons, it was impractical and unnecessary to download the whole dataset, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.
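
To give a flavor of that parsing step (illustrative only; the extraction itself ran as a MapReduce job on EMR, not as this script), image links can be pulled from a single WAT metadata file with the warcio library; the local file name is a placeholder:

    # Illustrative: print image URLs found in one Common Crawl WAT file.
    import json
    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.wat.gz", "rb") as stream:       # placeholder file name
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            payload = json.loads(record.content_stream().read())
            links = (payload.get("Envelope", {})
                            .get("Payload-Metadata", {})
                            .get("HTTP-Response-Metadata", {})
                            .get("HTML-Metadata", {})
                            .get("Links", []))
            for link in links:
                if link.get("path") == "IMG@/src":           # <img src=...> elements
                    print(link.get("url"))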

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

  • 3.6 billion unique images
  • 78.5 million unique domains
  • ≈8% of the images are big (width and height bigger than 400 pixels)
  • ≈40% of the images are small (width and height smaller than 200 pixels)
  • ≈20% of the images have no dimension information

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

Textual Features approach

An n-gram in our case is a contiguous sequence of n characters from the given image URL. The main hypothesis we make is that URLs corresponding to small and big images differ substantially in terms of wording. For instance, URLs of small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. URLs of big images, on the other hand, tend to lack these words and typically contain others. If the assumption is correct, a supervised machine learning method should be able to separate items from the two distinct classes.

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, the correlation of the n-grams with the two classes, BIG and SMALL, is not considered. For instance, if an n-gram is very frequent in both classes, it makes sense to discard it and not use it as a feature. On the other hand, if an n-gram is not very frequent but is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of each n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency.

In a variation of the aforementioned approach, we replaced the n-grams with the tokens produced by splitting the image URLs on all non-alphanumeric characters. The regular expression employed (in Java) is \W+, and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We refer to this method as TOKENS-trf.
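
A small Python sketch of the two textual feature extractors and the relative-frequency selection follows; the scoring function and the parameter values are illustrative assumptions, not the exact settings used in the paper:

    import re
    from collections import Counter

    def char_ngrams(url, n=4):
        """Contiguous character n-grams of an image URL (NG variant)."""
        return [url[i:i + n] for i in range(len(url) - n + 1)]

    def url_tokens(url):
        """TOKENS variant: split the URL on runs of non-alphanumeric characters."""
        return [t for t in re.split(r"\W+", url) if t]

    def select_features(big_urls, small_urls, extract, k=1000):
        """Keep the k features whose relative frequency differs most between classes."""
        big = Counter(f for u in big_urls for f in extract(u))
        small = Counter(f for u in small_urls for f in extract(u))
        n_big, n_small = sum(big.values()) or 1, sum(small.values()) or 1
        score = {f: abs(big[f] / n_big - small[f] / n_small)
                 for f in set(big) | set(small)}
        return sorted(score, key=score.get, reverse=True)[:k]

    # e.g. vocabulary = select_features(big_image_urls, small_image_urls, url_tokens)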

Non-textual features approach

Our alternative, non-textual approach does not rely on the image URL text, but rather on metadata that can be extracted from the image's HTML element. These features were chosen to reveal cues about the image dimensions. For instance, the first five features correspond to different image suffixes; they were selected because most real-world photos are in JPG or PNG format, whereas BMP and GIF formats usually point to icons and graphics. Additionally, a real-world photo is more likely to have an alternate (alt) text or surrounding parent text than a background graphic or a banner.
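
As a toy sketch of a few such metadata features (an illustrative subset with assumed names, not the paper's actual 23 features), extracted from an img element with BeautifulSoup:

    # Toy subset of non-textual features for a single <img> element.
    from bs4 import BeautifulSoup

    SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".bmp")

    def img_features(img_html):
        img = BeautifulSoup(img_html, "html.parser").find("img")
        src = (img.get("src") or "").lower()
        feats = {"suffix_" + s.strip("."): int(src.endswith(s)) for s in SUFFIXES}
        feats["has_alt_text"] = int(bool(img.get("alt")))       # real photos often carry alt text
        feats["has_width_attr"] = int(img.has_attr("width"))    # explicit dimensions are rare
        return feats

    print(img_features('<img src="/photos/holiday.jpg" alt="Beach at sunset">'))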

Hybrid approach

The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when aggregating their results as they rely on different kinds of features: the n-gram classifier might be best at classifying a certain kind of images with specific image URL wording, while the non-textual features classifier might be best at classifying a different kind of images with more informative HTML metadata.

Evaluation

For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented with the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to strike the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Typical values are 10, 30 and 100, and very few problems demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they also considerably increase the training time.
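
The experiments in the paper were run with Weka's Random Forest; the sketch below shows the same idea in Python with scikit-learn, including the hybrid combination by averaging class probabilities. The feature matrices X_text and X_meta and the label vector y are assumed to be prepared elsewhere:

    # Non-authoritative sketch: one Random Forest per feature set, plus the hybrid.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def train_rf(X, y, n_trees):
        return RandomForestClassifier(n_estimators=n_trees, n_jobs=-1).fit(X, y)

    def hybrid_predict(rf_text, rf_meta, X_text, X_meta):
        """Average the class probabilities of the two classifiers (0 = SMALL, 1 = BIG)."""
        proba = (rf_text.predict_proba(X_text) + rf_meta.predict_proba(X_meta)) / 2
        return np.argmax(proba, axis=1)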

The comparative results for different numbers of trees in the Random Forest classifier are displayed in Table 1. The first column contains the method name, the second the number of trees used in the RF classifier, the third the number of features used, and the remaining columns contain the F-measures for the SMALL class, the BIG class and their average. The reported results lead to several interesting conclusions.

  • Doubling the number of n-gram features improves the performance in all cases.
  • So does adding more trees to the Random Forest classifier.
  • The hybrid method outperforms all standalone methods, its best F-score being 4% higher than the best textual features score.

Table 1: Comparative results (F-measure)

Method       RF trees  Features  F1small  F1big  F1avg
TOKENS-trf   10        1000      0.876    0.867  0.871
TOKENS-trf   30        1000      0.887    0.883  0.885
TOKENS-trf   100       1000      0.894    0.891  0.893
TOKENS-trf   10        2000      0.875    0.864  0.870
TOKENS-trf   30        2000      0.888    0.828  0.885
TOKENS-trf   100       2000      0.897    0.892  0.895
NG-tsrf-idf  10        1000      0.876    0.872  0.874
NG-tsrf-idf  30        1000      0.883    0.881  0.882
NG-tsrf-idf  100       1000      0.886    0.884  0.885
NG-tsrf-idf  10        2000      0.883    0.878  0.881
NG-tsrf-idf  30        2000      0.891    0.888  0.890
NG-tsrf-idf  100       2000      0.894    0.891  0.892
features     10        23        0.848    0.846  0.847
features     30        23        0.852    0.852  0.852
features     100       23        0.853    0.853  0.853
hybrid                           0.935    0.935  0.935

Acknowledgement

This work was carried out at the Multimedia Knowledge and Social Media Analytics Lab in collaboration with Symeon Papadopoulos in the context of the REVEAL FP7 project.

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks.

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project wikireverse.org and why he built it.



What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The results produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links) followed by Spanish, Indonesian and German. In total there are results for 283 languages.

I first heard about Common Crawl in a blog post by Steve Salevan, MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency savings of a large web-scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which is the spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, that value is difficult to see without an interface to the data. It can be hard to visualize the possibilities and what can be done with the data. For this reason, my project runs an analysis over an entire crawl, with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite its relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and rank highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects and hopefully more analyses of these types will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on Github. I’ve written a more detailed post on my blog on the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.


[1] https://wikireverse.org/
[2] https://commoncrawl.org/2011/12/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html

The Promise of Open Government Data & Where We Go Next

One of the biggest boons for the Open Data movement in recent years has been the enthusiastic support from all levels of government for releasing more, and higher quality, datasets to the public. In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information–which anyone is free to contribute to–that help government agencies release data that is “available, discoverable, and usable.”

Since 2013, many enterprising government leaders across the United States at the federal, state, and local levels have responded to the President’s call to see just how far Open Data can take us in the 21st century. Following the White House’s groundbreaking appointment in 2009 of Aneesh Chopra as the country’s first Chief Technology Officer, many local and state governments across the United States have created similar positions. San Francisco last year named its first Chief Data Officer, Joy Bonaguro, and released a strategic plan to institutionalize Open Data in the city’s government. Los Angeles’ new Chief Data Officer, Abhi Nemani, was formerly at Code for America and hopes to make LA a model city for open government. His office recently launched an Open Data portal along with other programs aimed at fostering a vibrant data community in Los Angeles.1

Open government data is powerful because of its potential to reveal information about major trends and to inform questions pertaining to the economic, demographic, and social makeup of the United States. A second, no less important, reason why open government data is powerful is its potential to help shift the culture of government toward one of greater collaboration, innovation, and transparency.

These gains are encouraging, but there is still room for growth. One pressing issue is for more government leaders to establish Open Data policies that specify the type, format, frequency, and availability of the data that their offices release. Open Data policy ensures that government entities not only release data to the public, but release it in useful and accessible formats.

Only nine states currently have formal Open Data policies, although at least two dozen have some form of informal policy and/or an Open Data portal. Agencies and state and local governments should not wait too long to standardize their policies for releasing Open Data; waiting will severely limit Open Data’s potential. There is not much that a data analyst can do with a PDF.

One area of great potential is for data whizzes to pair open government data with web crawl data. Government data makes for a natural complement to other big datasets, like Common Crawl’s corpus of web crawl data, that together allow for rich educational and research opportunities. Educators and researchers should find Common Crawl data a valuable complement to government datasets when teaching data science and analysis skills. There is also vast potential to pair web crawl data with government data to create innovative social, business, or civic ventures.

Innovative government leaders across the United States (and the world!) and enterprising organizations like Code for America have laid an impressive foundation that others can continue to build upon as more and more government data is released to the public in increasingly usable formats. Common Crawl is encouraged by the rapid growth of a relatively new movement and we are excited to see the collaborations to come as Open Government and Open Data grow together.

 

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. She is currently pursuing a master’s degree in public policy from the Goldman School of Public Policy at the University of California, Berkeley.

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel.
Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from the Common Crawl data.


The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs that is currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations as well as relational HTML tables. Why do we do this? Because we share the opinion that data should be available to everybody, and because we want to make it easier to exploit the wealth of information that is available on the Web.

For performing the extractions, we need to go through all the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried staff, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We therefore developed a data extraction tool which allows us to process the corpora in a distributed fashion using Amazon Web Services (AWS).

The basic architectural idea of the extraction tool is to have a queue taking care of the proper handling of all files which should be processed. Each worker receives a new file from the queue whenever it is ready and informs the queue about the status (success or failure) of the processing. Successfully processed files are removed from the queue; failed files are handed to another worker, or dropped once a fixed number of workers have been unable to process them.
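
To make the pattern concrete, here is a minimal sketch of such a worker loop in Python against an Amazon SQS queue. This is only an illustration of the architecture described above, not code from the framework itself (which is written in Java); the queue URL and the process_file() function are placeholders.

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cc-files"  # placeholder

def process_file(file_path):
    """Placeholder for the per-file extraction logic."""
    raise NotImplementedError

def worker_loop():
    sqs = boto3.client("sqs")
    while True:
        # Ask the queue for the next file to process (long polling).
        response = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=1,
                                       WaitTimeSeconds=20)
        messages = response.get("Messages", [])
        if not messages:
            break  # queue is drained
        message = messages[0]
        try:
            process_file(message["Body"])
            # Success: remove the file from the queue.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=message["ReceiptHandle"])
        except Exception:
            # Failure: leave the message alone; it becomes visible again after
            # the visibility timeout and is retried by another worker. A redrive
            # policy can drop it after a fixed number of failed attempts.
            pass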

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph had a size of less than 100GB zipped.

With each new extraction, we improved the tool and gradually turned it into a flexible framework into which we simply plug the processor needed for a single file, and which takes care of everything else.

The framework has now been officially released under the terms of the Apache license. It takes care of everything related to file handling, distribution, and scalability, and leaves the user only the task of writing the code needed to extract the desired information from a single one of the CC files.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extraction tasks can be found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continuously improve it and are happy to hear from everybody who shares feedback about their experience with the framework.

Navigating the WARC file format

Wait, what’s WAT, WET and WARC?

Common Crawl has recently switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header, X-hacker, has been added purely for the purpose of advertising to a very specific audience of programmers who might look at the HTTP headers!

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
WARC-Record-ID: 
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close


...HTML Content...
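
If you want to iterate over records like the one above programmatically, a minimal sketch in Python using the open-source warcio library might look like the following; the file name is a placeholder, and the official examples mentioned below use Java and Hadoop instead.

from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        # The HTTP headers are parsed separately from the payload.
        server = record.http_headers.get_header("Server")
        body = record.content_stream().read()  # the raw HTML response body
        print(url, server, len(body))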

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

  • Envelope
    • WARC-Header-Metadata
    • Payload-Metadata
      • HTTP-Response-Metadata
        • Headers
          • HTML-Metadata
            • Head
              • Title
              • Scripts
              • Metas
              • Links
            • Links
    • Container
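
As a rough illustration, the sketch below walks that skeleton in Python with warcio to pull out the HTTP headers and the links for each record; the key names follow the outline above, and the file name is a placeholder.

import json
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue  # WAT payloads are stored as metadata records
        envelope = json.loads(record.content_stream().read())
        payload = envelope.get("Envelope", {}).get("Payload-Metadata", {})
        response = payload.get("HTTP-Response-Metadata", {})
        headers = response.get("Headers", {})                  # e.g. headers.get("Server")
        links = response.get("HTML-Metadata", {}).get("Links", [])
        print(headers.get("Server"), len(links))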

WET Response Format

As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
WARC-Date: 2013-12-04T15:30:35Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
Content-Type: text/plain
Content-Length: 5765


...Text Content...

Processing the file format

We’ve provided three introductory examples in Java for the Hadoop framework. The code also contains wrapper tools that make it easier to work with the Web Archive Commons library in Hadoop.

These introductory examples include:

  • Counting the number of times various tags are used across HTML on the internet using the WARC files
  • Counting the number of different server types found in the HTTP headers using the WAT files
  • Counting words over the extracted plaintext found in the WET files (see the sketch after this list)
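
As a rough, non-Hadoop illustration of the third example, the sketch below counts words in the WET plaintext using Python and warcio; the file name is a placeholder, and the real examples linked here are Java MapReduce jobs.

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue  # WET plaintext is stored in conversion records
        text = record.content_stream().read().decode("utf-8", errors="replace")
        counts.update(text.lower().split())

for word, count in counts.most_common(20):
    print(word, count)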

If you’re using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain.

If in doubt, the tools provided as part of the IIPC’s Web Archive Commons library are the preferred implementation.

This is a guest blog post by Stephen Merity.
Stephen Merity is a Computational Science and Engineering master’s candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.

Common Crawl’s Move to Nutch

Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Our old crawler was highly tuned to our data center environment where every machine was identical with large amounts of memory, hard drives and fast networking.

We needed something that would allow us to do web-scale crawls of billions of webpages and that would work in a cloud environment where we might run on heterogeneous machines with differing amounts of memory, CPU and disk space depending on price, where VMs might come and go, and where networking performance could vary.

About Nutch

Apache Nutch has an interesting past. In 2002 Mike Cafarella and Doug Cutting started the Nutch project in order to build a web crawler for the Lucene search engine. While they were looking for ways to scale Nutch to crawl the whole web, Google released a paper on GFS. Less than a year later, the Nutch Distributed File System was born, and in 2005 Nutch had a working implementation of MapReduce. This implementation would later become the foundation for Hadoop.

Benefits of Nutch

Nutch runs completely as a small number of Hadoop MapReduce jobs that delegate most of the core work of fetching pages, filtering and normalizing URLs, and parsing responses to plug-ins.

The plug-in architecture of Nutch allowed us to isolate most of the customizations we needed for our own particular processes into plug-ins without making changes to the Nutch code itself. This makes life a lot easier when it comes to merging in changes from the larger Nutch community, which in turn simplifies maintenance.

The performance of Nutch is comparable to our old crawler. For our Spring 2013 crawl for instance, we’d regularly crawl at aggregate speeds of 40,000 pages per second. Our performance is limited largely by the politeness policy we set to minimize our impact on web servers and the number of simultaneous machines we run on.

Drawbacks

There are some drawbacks to Nutch. The URLs that Nutch fetches are determined ahead of time. This means that while you’re fetching documents, it won’t discover new URLs and immediately fetch them within the same job. Instead, after the fetch job is complete, you run a parse job, extract the URLs, add them to the crawl database and then generate a new batch of URLs to crawl.

Unfortunately when you’re dealing with billions of URLs, reading and writing this crawl database quickly becomes a large job. The Nutch 2.x branch is supposed to help with this, but it isn’t quite there yet.

Conclusion

Overall the transition to Nutch has been a fantastically positive experience for Common Crawl. We look forward to a long happy future with Nutch.

Notes

If you want to take a look at some of the changes we’ve made to Nutch, the code is available on GitHub at https://github.com/Aloisius/nutch in the cc branch. The official Nutch project is hosted at Apache at http://nutch.apache.org/.

Lexalytics Text Analysis Work with Common Crawl Data

This is a guest blog post by Oskar Singer.

Oskar Singer is a Software Developer and Computer Science student at the University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics. The post below describes the work and how Common Crawl data was used, and includes a link to the code.

At Lexalytics, I have been working with our head of software engineering, Paul Barba, on improving our accuracy on Twitter data for POS-tagging, entity extraction, parsing and ultimately sentiment analysis, by building an interesting model-based approach to handling misspelled words.

Our approach involves a spell checker that automatically corrects the input text internally for the benefit of the engine and outputs the original text for the benefit of the engine user, so this must be a different kind of automated spell-correction.

The First Attempt:

Our first attempt was to take the top scoring word from the list of unranked correction suggestions provided by Hunspell, an open-source spell checking library. We calculated each suggestion’s score as word frequency from Common Crawl data divided by string edit distance with consideration for keyboard distance.
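
A rough sketch of that scoring rule is shown below; the frequency table and the Hunspell suggestion list are assumed inputs, and a plain Levenshtein distance stands in for the keyboard-aware distance we actually used.

def edit_distance(a, b):
    # Plain Levenshtein distance (no keyboard-aware weighting).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def best_suggestion(word, suggestions, frequencies):
    # frequencies: dict mapping a word to its occurrence count in Common Crawl data
    return max(suggestions,
               key=lambda s: frequencies.get(s.lower(), 0) / max(edit_distance(word, s), 1))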

The resulting corrections were scored against hand-corrected tweets by counting the number of tokens that differed. Hunspell scored worse than the original tweets. It corrected usernames and hashtags and gave totally unreasonable suggestions. My favorite Hunspell correction was the mapping from “ur” (as in the short-form for “your” or “you’re”) to “Ur” (as in the ancient Mesopotamian city-state).

Hunspell also missed mistakes like misused homophones, which do not count as misspellings when considered in isolation. This seemed to be the primary problem with our data, so we needed a method able to consider context.

The Second (and final) Attempt:

We titled the next attempt “the Switchabalizer”, and it can be summarized as a multinomial, sliding-window, Naive Bayes word classifier. At a high level, we classify each of the target words in a piece of text, based on the preceding and succeeding words, as either itself or one of its homophones.

The training process starts with a list of bigrams from the Common Crawl data paired with their occurrence counts. We use this data to calculate P(w_{i-1} | w_i) = #(w_{i-1} w_i) / #(w_{i-1}) and P(w_{i+1} | w_i) = #(w_i w_{i+1}) / #(w_{i+1}), where w_i is the current word, w_{i-1} is the preceding word and w_{i+1} is the succeeding word. These probabilities are serialized and archived so they can be deserialized into C++ data structures instead of recalculated for each instantiation of the spell check object. In other words, we’re building a set of probabilities that each switchable “generated” the words preceding and succeeding w_i.

The inference process starts with a set S of sets and an inverted index. Each s ∈ S represents a group of commonly confused homophones (e.g. two, too, 2, to), and no word is a member of more than one s ∈ S. The inverted index maps each word w in the union of all s ∈ S to the s in which w holds membership. Each word w_i in the ordered sequence of words in a document is checked for an entry in the inverted index. If an entry V is found, the algorithm replaces w_i with argmax_{v ∈ V} score(v), where score(v) = P(w_{i-1} | v) + P(w_{i+1} | v).
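
A condensed sketch of the training and inference steps is given below. It follows the formulas and the argmax described above, but the bigram/unigram counts and the homophone sets are assumed inputs, and the real implementation serializes its probability tables for use from C++.

from collections import defaultdict

def train(bigram_counts, unigram_counts):
    # Estimate the two conditional tables from Common Crawl counts,
    # following the formulas above (counts play the role of #(.)).
    p_prev = defaultdict(float)   # P(w_{i-1} | w_i), keyed by (previous word, current word)
    p_next = defaultdict(float)   # P(w_{i+1} | w_i), keyed by (next word, current word)
    for (w1, w2), n in bigram_counts.items():   # w1 immediately precedes w2
        p_prev[(w1, w2)] = n / unigram_counts[w1]
        p_next[(w2, w1)] = n / unigram_counts[w2]
    return p_prev, p_next

def correct(tokens, homophone_sets, p_prev, p_next):
    # homophone_sets: e.g. [{"two", "too", "to", "2"}, {"their", "there"}]
    inverted = {w: s for s in homophone_sets for w in s}
    output = list(tokens)
    for i, w in enumerate(tokens):
        candidates = inverted.get(w.lower())
        if not candidates:
            continue
        previous_w = tokens[i - 1].lower() if i > 0 else None
        next_w = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        # Pick the homophone that best "generates" its neighbours, summing
        # the two conditional probabilities as described above.
        output[i] = max(candidates,
                        key=lambda v: p_prev.get((previous_w, v), 0.0)
                                      + p_next.get((next_w, v), 0.0))
    return output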

Testing:

As a matter of efficiency, we assumed that Wikipedia articles have perfect use of the target homophones. I wrote a Python script that took in text, randomly replaced target homophones with members of their switchable set, then output the result.
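
A hypothetical sketch of that generator is shown below: it randomly swaps target homophones so that the original Wikipedia text can serve as the gold standard (the replacement rate and the homophone sets are assumptions).

import random

def corrupt(tokens, homophone_sets, rate=0.5):
    # Randomly replace each target homophone with another member of its set.
    inverted = {w: s for s in homophone_sets for w in s}
    output = []
    for w in tokens:
        candidates = inverted.get(w.lower(), set())
        alternatives = sorted(candidates - {w.lower()})
        if alternatives and random.random() < rate:
            output.append(random.choice(alternatives))
        else:
            output.append(w)
    return output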

We ran the Switchabalizer on this data and compared to the original Wikipedia data. Comparing the corrections to the words changed by our test generator, Hunspell, even when forced to ignore usernames, had a 216% error rate (i.e. it made false corrections), and the Switchabalizer had a 20% error rate. Although the test data does not match the target data, the massive and varied data set provided by Common Crawl should ensure good results from the Switchabalizer on many types of data, hopefully even the near-nonsense from the bowels of Twitter.

Conclusion:

The Switchabalizer approach is clearly superior to a traditional spell checker for our targeted issues, but still requires significant testing, tuning and improvement. The following section provides some possibilities for improvement and expansion. We hope this approach can be of use to other people with the same problem, and we would like to thank Common Crawl for the fantastic resource that they provide!

Future Work:

Possible future experiments include further testing on different types of data, integration of higher-order n-gram features, implementation of a discriminative model, implementation for other languages, and corrections of common misspellings like “ur”, which cannot be included in sets of switchables without risking the model mapping words to non-words.

The commented Python scripts that generate the testing data and perform feature extraction/training/feature selection can be found on my GitHub account at https://github.com/oskarsinger/PythonScriptsFromLexalytics/tree/master/AutomatedSpellCheck/

Hyperlink Graph from Web Data Commons

The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.

Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.

They have published the resulting graph today, together with some results from their analysis of the graph.

http://webdatacommons.org/hyperlinkgraph/
http://webdatacommons.org/hyperlinkgraph/topology.html

To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public!