August 2017 Crawl Archive Now Available

The crawl archive for August 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-34/. It contains 3.28 billion+ web pages and over 280 TiB of uncompressed content.

To improve coverage and freshness we used the top 50 million ranked hosts from the May/June/July 2017 webgraph data set and added over 800 million new URLs (not contained in any crawl archive before), of which

  • 300 million URLs were found by a side crawl that followed at most 3 links (“hops”) from the home pages of the top 40 million hosts;
  • 525 million URLs are a random sample extracted from sitemaps (if provided by any of the top 50 million hosts).

The following improvements affect the WAT and WET extraction:

  • improved spacing / word segmentation in WET extracts, see issue #13
  • extract URLs from JavaScript code in onClick attributes (issue #8)

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/, you get the S3 and HTTP paths respectively.
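As a minimal sketch of the prefixing step, the following Python turns lines from a path listing into full URLs. The function name `to_urls` and the sample path are hypothetical; only the two prefixes are taken from the text.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def to_urls(path_lines, prefix=HTTP_PREFIX):
    """Prepend a bucket prefix to each non-empty line of a *.paths listing."""
    return [prefix + line.strip() for line in path_lines if line.strip()]

# Hypothetical listing lines; real ones would come from a gunzipped
# file such as warc.paths.gz.
sample = [
    "crawl-data/CC-MAIN-2017-34/segments/EXAMPLE/warc/EXAMPLE.warc.gz\n",
    "\n",
]
http_urls = to_urls(sample)
s3_urls = to_urls(sample, S3_PREFIX)
```

Blank lines are skipped so that a trailing newline in the listing does not produce a bare prefix.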

File type | File list | #Files | Total size, compressed (TiB)
Segments | CC-MAIN-2017-34/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-34/warc.paths.gz | 72000 | 65.14
WAT files | CC-MAIN-2017-34/wat.paths.gz | 72000 | 22.18
WET files | CC-MAIN-2017-34/wet.paths.gz | 72000 | 9.81
Robots.txt files | CC-MAIN-2017-34/robotstxt.paths.gz | 72000 | 0.12
Non-200 responses files | CC-MAIN-2017-34/non200responses.paths.gz | 72000 | 1.49
URL index files | CC-MAIN-2017-34/cc-index.paths.gz | 302 | 0.24

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-34/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
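For illustration, a query against the URL index might be assembled as sketched below. The endpoint layout (`<crawl id>-index` with `url` and `output` parameters) is an assumption based on the CDX-style Index Server API and should be checked against its documentation; the helper name `index_query_url` is hypothetical. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

INDEX_SERVER = "http://index.commoncrawl.org/"

def index_query_url(crawl_id, url_pattern, output="json"):
    """Build a lookup URL for one crawl's index (assumed endpoint layout)."""
    # urlencode percent-encodes characters such as "/" and "*" in the pattern.
    query = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_SERVER}{crawl_id}-index?{query}"

query_url = index_query_url("CC-MAIN-2017-34", "commoncrawl.org/*")
```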

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Now Available: Host- and Domain-Level Web Graphs

We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017. These graphs, along with ranked lists of hosts and domains, follow up on our first host-level web graph (February, March, April 2017). Detailed information about the data formats, the processing pipeline, our objectives, and credits can be found in the prior announcement.

Host-level graph

The graph consists of 1.3 billion nodes and 5.25 billion edges. The graph includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to by a link on a crawled page. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
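The host name normalization can be sketched in a few lines of Python. The function names are hypothetical, and the sketch covers only what is stated here: stripping one leading www. and reversing the dot-separated labels.

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

def restore_host(rev_host):
    """Turn a reversed host name back into the usual order (without 'www.')."""
    return ".".join(reversed(rev_host.split(".")))

rev = reverse_host("www.subdomain.example.com")
```

Reversed names sort by top-level domain first, so hosts of the same domain end up next to each other in a sorted vertex list.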

You can download the graph and the ranks of all 1.3 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/hostgraph/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/hostgraph/ as prefix to access the files from anywhere.

The following files and formats are provided:

Size | File | Description
7.8 GB | vertices.txt.gz | nodes ⟨id, rev host⟩
20.4 GB | edges.txt.gz | edges ⟨from_id, to_id⟩
10.0 GB | bvgraph.graph | graph in BVGraph format
0.59 GB | bvgraph.offsets |
2 kB | bvgraph.properties |
14.5 GB | bvgraph-t.graph | transpose of the graph (outlinks mapped to inlinks)
1.6 GB | bvgraph-t.offsets |
2 kB | bvgraph-t.properties |
1 kB | bvgraph.stats | WebGraph statistics
18.9 GB | ranks.txt.gz | harmonic centrality and pagerank
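To give an idea of how the vertex and edge lists fit together, here is a small Python sketch that finds nodes without outgoing edges, a rough stand-in for dangling nodes. It assumes tab-separated fields, which is not stated here and should be verified against the format description in the prior announcement. The helper names and sample data are made up.

```python
import io

def read_pairs(stream):
    """Yield the two tab-separated fields of each line (assumed layout)."""
    for line in stream:
        left, right = line.rstrip("\n").split("\t")
        yield left, right

def zero_outdegree_hosts(vertices, edges):
    """Return reversed host names of nodes that never occur as an edge source."""
    sources = {from_id for from_id, _ in edges}
    return [rev_host for node_id, rev_host in vertices if node_id not in sources]

# Tiny in-memory stand-ins for vertices.txt.gz and edges.txt.gz (made-up data).
vertices_txt = io.StringIO("0\tcom.example\n1\torg.example\n2\tnet.example\n")
edges_txt = io.StringIO("0\t1\n0\t2\n1\t2\n")
dangling = zero_outdegree_hosts(read_pairs(vertices_txt), read_pairs(edges_txt))
```

For the real files, `gzip.open(path, "rt", encoding="utf-8")` would supply the lines. Holding all source ids in a Python set is unlikely to scale to 1.3 billion nodes, so this is only meant for samples.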

Note that differences in the rankings and the structure of the web graph are due to our objective of making monthly crawls more diverse and reducing the overlap between consecutive crawls. During both February/March/April and May/June/July we crawled about 9 billion pages. As an indicator of less overlap, the number of unique URLs increased from 5.0 to 6.2 billion and the number of unique hosts went up from 70 to 90 million (or from 65 to 82 million with the leading “www.” removed). The largest strongly connected component now contains 59 million nodes/hosts (45 million in the February/March/April graph). However, the May/June/July host-level graph has doubled in size in terms of edges and more than tripled in terms of nodes. This growth is caused by a significant increase in the number of dangling nodes. The 1.2 billion dangling nodes provide a solid foundation for extending the next crawls, but we need a closer look at the distribution of these hosts among domains and TLDs.

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not (cf. the section “divisions” in the documentation on publicsuffix.org). In short, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated under the domains blogspot.com and amazonaws.com, respectively.
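The aggregation rule can be illustrated with a simplified Python sketch. It uses a tiny hand-picked set of ICANN suffixes instead of the full public suffix list and ignores the list's wildcard and exception rules, so it shows the idea rather than the actual extraction code. All names in it are hypothetical.

```python
# Toy subset of ICANN suffixes. Private suffixes such as blogspot.com are
# deliberately absent, mirroring the "ICANN only" rule.
ICANN_SUFFIXES = {"com", "org", "uk", "co.uk"}

def pay_level_domain(host):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = host.lower().split(".")
    # Try suffixes from longest to shortest, keeping at least one label in front.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in ICANN_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None

plds = [pay_level_domain(h) for h in
        ("foo.blogspot.com", "commoncrawl.s3.amazonaws.com", "www.bbc.co.uk")]
```

Every host mapped to the same PLD collapses into one node of the domain graph, and host-level edges are merged accordingly.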

The domain-level graph has 91 million nodes and 1,071 million edges. 55% of the nodes are dangling nodes; the largest strongly connected component covers 30 million nodes, or 33% of all nodes.

All files related to the domain graph are placed on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/domaingraph/ or, via HTTP, https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/domaingraph/.

Download files of the Common Crawl May/June/July 2017 domain-level webgraph

Size | File | Description
0.63 GB | vertices.txt.gz | nodes ⟨id, rev host⟩
4.3 GB | edges.txt.gz | edges ⟨from_id, to_id⟩
2.4 GB | bvgraph.graph | graph in BVGraph format
0.09 GB | bvgraph.offsets |
2 kB | bvgraph.properties |
2.6 GB | bvgraph-t.graph | transpose of the graph (outlinks mapped to inlinks)
0.13 GB | bvgraph-t.offsets |
2 kB | bvgraph-t.properties |
1 kB | bvgraph.stats | WebGraph statistics
1.8 GB | ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (May/June/July 2017)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 24989952 | 1 | 0.0155264576161686 | com.facebook
2 | 22460880 | 3 | 0.00866038900847366 | com.twitter
3 | 22097514 | 2 | 0.0128827315785546 | com.googleapis
4 | 21301616 | 4 | 0.00741843063139714 | com.youtube
5 | 21234402 | 5 | 0.00706371663388533 | com.google
7 | 18990882 | 6 | 0.00545683515949591 | org.gmpg
8 | 18904882 | 10 | 0.00280210492427644 | com.linkedin
9 | 18843450 | 9 | 0.00307727244449465 | com.instagram
10 | 18259830 | 25 | 0.00126167740795298 | org.wikipedia
11 | 18053094 | 38 | 0.00101731276962237 | com.blogspot
12 | 17970588 | 34 | 0.00109405966554137 | com.wordpress
13 | 17919956 | 22 | 0.00141960022500196 | com.pinterest
14 | 17835014 | 16 | 0.0020636325734346 | com.apple
15 | 17697268 | 14 | 0.00222054582589118 | org.wordpress
17 | 17561416 | 12 | 0.00240805753831021 | com.macromedia
18 | 17508598 | 32 | 0.00111771803587573 | com.gravatar
19 | 17446398 | 55 | 0.000661340427978196 | be.youtu
20 | 17384778 | 52 | 0.000701194639286359 | com.amazon
21 | 17353492 | 17 | 0.00187094862083697 | com.adobe
22 | 17343286 | 56 | 0.000637906122981968 | com.flickr
23 | 17323658 | 47 | 0.000801430611613196 | com.vimeo
24 | 17279624 | 49 | 0.000729089638101731 | gl.goo
25 | 17250688 | 42 | 0.000933859058158277 | com.paypal
26 | 17229298 | 46 | 0.000826236952032279 | com.microsoft
28 | 17103198 | 70 | 0.000474840687818822 | ly.bit
29 | 17089526 | 66 | 0.000507632417477531 | com.tumblr
30 | 17033012 | 50 | 0.000719273960345051 | com.github
31 | 16996118 | 43 | 0.000881255954914393 | com.amazonaws
32 | 16946912 | 77 | 0.000406272523681585 | org.creativecommons
34 | 16863562 | 67 | 0.000501297307358693 | com.yahoo
35 | 16862378 | 93 | 0.000299154154651788 | org.mozilla
37 | 16812384 | 83 | 0.00037282011840036 | org.w3
38 | 16752483 | 125 | 0.000207624449137057 | com.weebly
39 | 16729464 | 115 | 0.000221622522886096 | com.googleusercontent
40 | 16717669 | 118 | 0.000215230097627401 | com.blogger
41 | 16697624 | 153 | 0.000171757229153436 | com.myspace
44 | 16646934 | 185 | 0.000141243073946399 | org.wikimedia
46 | 16644900 | 79 | 0.000396589141351638 | com.medium
47 | 16631223 | 75 | 0.000433568676537 | io.github
48 | 16612200 | 99 | 0.000268805299302926 | com.android
49 | 16590272 | 18 | 0.00162368845735663 | com.bootstrapcdn
50 | 16558434 | 166 | 0.000157623382042305 | org.apache
51 | 16525428 | 229 | 0.000115801379415941 | com.photobucket
52 | 16504426 | 195 | 0.000131254056913804 | com.businessinsider
55 | 16476553 | 304 | 8.18438859138264e-05 | com.ebay
56 | 16469721 | 137 | 0.000187664601325017 | com.issuu
57 | 16460853 | 37 | 0.00102123707091179 | com.bing
58 | 16452729 | 182 | 0.000142450905972814 | com.imdb
59 | 16427950 | 82 | 0.000374838485018944 | eu.europa
60 | 16411357 | 48 | 0.000746154314145809 | com.statcounter
61 | 16390338 | 245 | 0.000101622215034005 | com.appspot
62 | 16379952 | 209 | 0.000124338192701991 | com.about
63 | 16373742 | 140 | 0.00018454904883009 | com.nytimes
64 | 16363821 | 352 | 7.06416235739108e-05 | gov.nasa
65 | 16353372 | 59 | 0.000571194421273779 | com.cloudflare
66 | 16346221 | 13 | 0.00239759215021479 | net.fbcdn
67 | 16338312 | 44 | 0.000857327043073336 | me.wp
68 | 16326390 | 71 | 0.000465848580411827 | com.gstatic
69 | 16320041 | 347 | 7.16053406333361e-05 | com.theverge
70 | 16319414 | 122 | 0.000209662262430235 | com.yelp
72 | 16279524 | 286 | 8.69073948101929e-05 | org.npr
73 | 16272083 | 325 | 7.56332706724679e-05 | com.googleblog
74 | 16256304 | 290 | 8.48972545255239e-05 | me.about
75 | 16248199 | 199 | 0.000128879026140158 | org.ietf
76 | 16241255 | 272 | 9.24645494623427e-05 | com.mozilla
77 | 16240479 | 156 | 0.00016832163499487 | org.gnu
78 | 16239722 | 92 | 0.000305401671067505 | co.t
79 | 16239387 | 216 | 0.000120371938835806 | com.oracle
80 | 16232526 | 228 | 0.000116114540948403 | com.typepad
81 | 16231084 | 453 | 5.44944627802185e-05 | edu.ucla
82 | 16227427 | 27 | 0.00121558207856905 | net.akamaihd
84 | 16214309 | 232 | 0.000114818535982162 | uk.co.bbc
85 | 16214235 | 136 | 0.000189983194885236 | com.jquery
86 | 16211814 | 221 | 0.000118993796467497 | com.imgur
87 | 16197420 | 472 | 5.2567183006594e-05 | edu.princeton
88 | 16196855 | 336 | 7.32728429381573e-05 | gov.noaa
89 | 16191891 | 74 | 0.000440473851712633 | org.schema
90 | 16190145 | 178 | 0.000144231760365021 | net.slideshare
91 | 16190073 | 62 | 0.000564081957805812 | net.cloudfront
92 | 16183730 | 284 | 8.92999121057943e-05 | com.wsj
93 | 16175963 | 188 | 0.000140208643596679 | com.forbes
94 | 16172502 | 76 | 0.00041917417746326 | com.paypalobjects
95 | 16170294 | 121 | 0.000212087858171349 | com.soundcloud
96 | 16165048 | 200 | 0.000128726042287979 | com.spotify
97 | 16161367 | 213 | 0.000122378879977989 | edu.stanford
98 | 16158114 | 202 | 0.000127454157734194 | com.disqus
99 | 16156608 | 481 | 5.14517415590491e-05 | com.mysql
100 | 16148691 | 167 | 0.000157054492258209 | com.dropbox
101 | 16146672 | 276 | 9.11987772125017e-05 | com.tinyurl
102 | 16141698 | 139 | 0.000184631238862252 | com.constantcontact
103 | 16141420 | 331 | 7.44922001669146e-05 | edu.mit
104 | 16139995 | 273 | 9.20851285715325e-05 | com.nbcnews
106 | 16134891 | 103 | 0.000249718966343057 | com.wix
107 | 16131390 | 440 | 5.63178297481015e-05 | com.googlecode
108 | 16126524 | 187 | 0.00014024908309338 | com.theguardian
109 | 16123314 | 95 | 0.000285005386439374 | com.huffingtonpost
111 | 16112941 | 353 | 7.01996718671122e-05 | gov.loc
112 | 16105005 | 155 | 0.000170093991134689 | net.sourceforge
113 | 16101600 | 401 | 6.14791022911939e-05 | org.python
114 | 16088860 | 379 | 6.60560437018202e-05 | com.alexa
115 | 16088645 | 410 | 5.97561569096326e-05 | co.g
116 | 16079028 | 252 | 9.88994061572069e-05 | com.go
118 | 16077453 | 364 | 6.82295063450322e-05 | com.foursquare
121 | 16066757 | 45 | 0.000829281063065997 | com.squarespace
122 | 16061946 | 393 | 6.23091312037812e-05 | com.sun
124 | 16058079 | 442 | 5.61406400406246e-05 | com.withgoogle
125 | 16057273 | 260 | 9.56090724221574e-05 | com.washingtonpost
127 | 16035956 | 63 | 0.000537473792465926 | com.vk
128 | 16035411 | 86 | 0.000352598947727924 | net.doubleclick
129 | 16035239 | 489 | 5.06581788427578e-05 | org.chromium
130 | 16028498 | 584 | 4.26316314674836e-05 | edu.gatech
132 | 16027022 | 416 | 5.92122336053588e-05 | au.gov.nsw
133 | 16025684 | 126 | 0.000207293393080606 | com.reddit
134 | 16025106 | 509 | 4.91710961371229e-05 | org.sciencemag
135 | 16024041 | 389 | 6.27321618308479e-05 | org.nodejs
136 | 16023948 | 605 | 4.19017591329544e-05 | edu.cuny
138 | 16012899 | 133 | 0.000193040982037554 | com.feedburner
140 | 16000132 | 538 | 4.57515027163085e-05 | edu.illinois
141 | 15999778 | 499 | 4.96494716668815e-05 | edu.berkeley
142 | 15998912 | 399 | 6.16482775444537e-05 | au.com.google
144 | 15987228 | 506 | 4.93463645755921e-05 | com.pixabay
145 | 15976013 | 258 | 9.60743680804667e-05 | uk.co.amazon
146 | 15972907 | 111 | 0.000230813335083292 | com.cnn
148 | 15967558 | 544 | 4.54457479041577e-05 | com.libsyn
149 | 15952405 | 371 | 6.7120499269601e-05 | com.wired
150 | 15949618 | 259 | 9.59425814050923e-05 | com.surveymonkey
151 | 15946748 | 383 | 6.53396206082701e-05 | uk.co.dailymail
152 | 15946483 | 452 | 5.45761550958247e-05 | com.variety
153 | 15930068 | 24 | 0.0013474030796477 | com.fb
154 | 15925842 | 215 | 0.000120638726611947 | com.etsy
155 | 15924889 | 391 | 6.24685427611247e-05 | uk.co.blogspot
156 | 15912395 | 311 | 7.90671790542201e-05 | com.dailymotion
157 | 15906905 | 211 | 0.000123239102782434 | com.digg
158 | 15901097 | 181 | 0.000142453252268271 | gov.nih
159 | 15894206 | 370 | 6.75180807177585e-05 | com.reuters
160 | 15886820 | 701 | 3.89256402003242e-05 | edu.columbia
162 | 15872957 | 787 | 3.751174278726e-05 | com.nike
163 | 15868055 | 227 | 0.000116596591295635 | com.live
165 | 15864752 | 354 | 7.01156504511129e-05 | com.livejournal
166 | 15864738 | 657 | 3.96129599599408e-05 | com.sap
167 | 15862665 | 664 | 3.95013568833938e-05 | com.discogs
168 | 15857975 | 282 | 8.9964923414358e-05 | com.aol
169 | 15855750 | 445 | 5.5748966776129e-05 | org.mediawiki
170 | 15845942 | 29 | 0.00114281853454816 | me.fb
171 | 15842947 | 351 | 7.07813033957103e-05 | com.usatoday
172 | 15842283 | 633 | 4.05686573540759e-05 | edu.utah
173 | 15837138 | 549 | 4.51699741643719e-05 | net.daringfireball
174 | 15836674 | 496 | 4.9999960509278e-05 | org.eclipse
175 | 15834873 | 368 | 6.76642625755714e-05 | com.npmjs
176 | 15833638 | 550 | 4.49681211898885e-05 | com.netflix
177 | 15832429 | 337 | 7.30215403346833e-05 | com.tripod
178 | 15831038 | 878 | 3.3656387101054e-05 | com.diigo
180 | 15817871 | 804 | 3.67639700948407e-05 | uk.org.tate
181 | 15816200 | 369 | 6.75600758631612e-05 | edu.harvard
182 | 15815422 | 469 | 5.3001847136341e-05 | com.time
183 | 15810344 | 823 | 3.59360502579382e-05 | id.co.blogspot
184 | 15809026 | 114 | 0.000221797473311586 | com.mashable
185 | 15807460 | 807 | 3.6447534726932e-05 | com.hbo
187 | 15802457 | 769 | 3.84109849525509e-05 | com.chron
188 | 15801557 | 778 | 3.79843092102341e-05 | edu.washington
189 | 15795011 | 426 | 5.81220523939956e-05 | com.gmail
190 | 15793121 | 113 | 0.000225135156397275 | com.jimdo
191 | 15792847 | 355 | 7.00169347047329e-05 | com.meetup
192 | 15792500 | 629 | 4.06926572016734e-05 | org.ampproject
193 | 15791399 | 596 | 4.21539576611305e-05 | com.jetbrains
194 | 15790948 | 322 | 7.68171654135211e-05 | com.bloomberg
195 | 15789362 | 152 | 0.000173965702149204 | de.google
196 | 15787832 | 288 | 8.59046288833472e-05 | to.amzn
197 | 15786152 | 637 | 4.04886202931351e-05 | org.virtualbox
198 | 15780286 | 624 | 4.10339488441819e-05 | uk.co.guardian
199 | 15777717 | 313 | 7.90555572975389e-05 | com.techcrunch
200 | 15776975 | 411 | 5.95014769939008e-05 | uk.co.telegraph
201 | 15776477 | 548 | 4.52609403123979e-05 | com.cnbc
202 | 15775169 | 250 | 9.94157457855262e-05 | uk.co.google
204 | 15771603 | 468 | 5.30605297385559e-05 | com.goodreads
205 | 15766175 | 397 | 6.18619648011229e-05 | com.msn
206 | 15761598 | 392 | 6.23915895238641e-05 | com.kickstarter
207 | 15761370 | 171 | 0.000149517430613129 | com.twimg
208 | 15757899 | 958 | 3.10644934362954e-05 | in.blogspot
209 | 15756903 | 301 | 8.3087741732879e-05 | com.images-amazon
210 | 15755871 | 21 | 0.00144087639474043 | com.wixstatic
211 | 15740120 | 360 | 6.91064104263346e-05 | io.codepen
212 | 15734160 | 225 | 0.000117332550512604 | com.eventbrite
213 | 15733130 | 413 | 5.93919728037878e-05 | com.bbc
214 | 15731621 | 889 | 3.33158516532838e-05 | com.newyorker
215 | 15730889 | 396 | 6.2001076998058e-05 | com.latimes
216 | 15729693 | 447 | 5.53136069293824e-05 | com.webs
219 | 15727119 | 174 | 0.000147912563107984 | org.archive
220 | 15723275 | 407 | 6.02624636349819e-05 | com.bizjournals
221 | 15722493 | 648 | 3.98018712364008e-05 | com.wikia
222 | 15722063 | 223 | 0.000118348799632694 | org.drupal
223 | 15716740 | 296 | 8.4089596881278e-05 | net.php
224 | 15716703 | 643 | 4.02154313123456e-05 | com.venturebeat
225 | 15716576 | 619 | 4.12625555463031e-05 | org.pbs
226 | 15712519 | 914 | 3.26392230464033e-05 | com.marketwatch
227 | 15712144 | 770 | 3.83803578491251e-05 | com.exacttarget
229 | 15706093 | 833 | 3.55961710127199e-05 | com.sublimetext
230 | 15703601 | 198 | 0.000129275694487554 | com.stumbleupon
232 | 15697321 | 271 | 9.26344464771574e-05 | fr.free
233 | 15694373 | 465 | 5.327297946833e-05 | gov.whitehouse
234 | 15693760 | 968 | 3.09182306870607e-05 | com.gizmodo
235 | 15692378 | 796 | 3.71950102005915e-05 | com.chrome
236 | 15686384 | 977 | 3.06408953569782e-05 | com.instapaper
237 | 15682234 | 308 | 8.07156735886469e-05 | gov.cdc
238 | 15682010 | 1034 | 2.90083132906254e-05 | com.ning
239 | 15679191 | 476 | 5.17732111584554e-05 | com.office
240 | 15677383 | 419 | 5.86742754402436e-05 | com.xrea
242 | 15666975 | 632 | 4.05994257035936e-05 | com.ted
243 | 15666829 | 1467 | 2.32876139970216e-05 | edu.usc
244 | 15666264 | 1085 | 2.82676357406144e-05 | edu.umn
245 | 15665846 | 1072 | 2.84598053412361e-05 | com.arstechnica
246 | 15664243 | 535 | 4.58793673858625e-05 | de.blogspot
247 | 15663366 | 883 | 3.35673118772062e-05 | com.chicagotribune
248 | 15663034 | 28 | 0.0011763022119838 | com.atdmt
249 | 15662537 | 261 | 9.51380930957369e-05 | com.baidu
250 | 15660953 | 150 | 0.000174749380463391 | com.list-manage
251 | 15660326 | 613 | 4.14680955973278e-05 | com.scribd
252 | 15659073 | 1430 | 2.40444714698486e-05 | edu.uchicago
253 | 15657783 | 575 | 4.32697253018267e-05 | com.inc
254 | 15655666 | 1005 | 2.98439224149668e-05 | com.evernote
255 | 15654762 | 69 | 0.000490371412441887 | com.wp
256 | 15654252 | 339 | 7.26444892171607e-05 | gov.ca
257 | 15651499 | 591 | 4.22531065519201e-05 | com.dropboxusercontent
258 | 15651280 | 409 | 5.98380198468418e-05 | com.cnet
259 | 15648867 | 285 | 8.75060881994027e-05 | gov.ftc
260 | 15645937 | 880 | 3.36512290531188e-05 | com.nationalgeographic
261 | 15641291 | 651 | 3.97372609578435e-05 | uk.co.independent
262 | 15641175 | 927 | 3.20977639678945e-05 | edu.yale
263 | 15640389 | 1152 | 2.76178324420893e-05 | edu.wisc
264 | 15640290 | 842 | 3.50248903988879e-05 | com.nintendo
265 | 15638043 | 659 | 3.96001518252408e-05 | us.imageshack
266 | 15636823 | 907 | 3.27312169013512e-05 | com.indiatimes
267 | 15635520 | 800 | 3.69358708077931e-05 | com.cbslocal
268 | 15635163 | 1700 | 1.8722596335099e-05 | edu.msu
269 | 15634089 | 944 | 3.15472671606861e-05 | com.naturalnews
270 | 15632793 | 398 | 6.16851659136072e-05 | com.ibm
271 | 15630971 | 1332 | 2.62606272746041e-05 | org.ieee
272 | 15629452 | 257 | 9.65661295436327e-05 | net.behance
274 | 15629172 | 875 | 3.37107150206004e-05 | com.slate
275 | 15626020 | 480 | 5.14543056738778e-05 | edu.cornell
276 | 15624467 | 976 | 3.07073787371343e-05 | org.altervista
277 | 15622049 | 30 | 0.00112494056494761 | me.m
278 | 15620427 | 23 | 0.00135395152637574 | com.messenger
280 | 15617113 | 926 | 3.21020959728217e-05 | com.mysanantonio
281 | 15614327 | 298 | 8.37226642358646e-05 | com.opera
283 | 15612189 | 1544 | 2.13674763862979e-05 | edu.virginia
285 | 15610650 | 235 | 0.000109224765834484 | de.amazon
286 | 15609900 | 597 | 4.20971145241691e-05 | com.bandsintown
287 | 15609689 | 268 | 9.38629534487172e-05 | com.fc2
289 | 15606613 | 307 | 8.08602374785154e-05 | com.staticflickr
290 | 15606205 | 503 | 4.94721115017751e-05 | com.zdnet
291 | 15603206 | 924 | 3.21430290644091e-05 | com.googledrive
292 | 15596851 | 586 | 4.24190156027991e-05 | com.theatlantic
293 | 15595738 | 35 | 0.00106767518499836 | ru.yandex
294 | 15591053 | 564 | 4.39081259808879e-05 | com.fortune
295 | 15588997 | 861 | 3.43835713600411e-05 | org.iso
296 | 15588791 | 947 | 3.15155421032288e-05 | com.redhat
298 | 15587828 | 487 | 5.06700299908967e-05 | com.w3schools
299 | 15587438 | 1345 | 2.59590471932613e-05 | it.blogspot
300 | 15586623 | 786 | 3.75300358233661e-05 | com.nature
301 | 15584911 | 431 | 5.74012367083115e-05 | com.git-scm
302 | 15578116 | 691 | 3.89761753170653e-05 | com.ggpht
303 | 15576627 | 193 | 0.000132159977180025 | com.ytimg
305 | 15575464 | 106 | 0.000239116241712999 | jp.co.google
306 | 15574856 | 891 | 3.33060387457206e-05 | org.caringbridge
307 | 15570844 | 666 | 3.94027427723104e-05 | com.buzzfeed
308 | 15567017 | 920 | 3.2305470209502e-05 | com.vagrantup
309 | 15566713 | 1423 | 2.41810395096844e-05 | com.bostonglobe
310 | 15566432 | 570 | 4.36715547457013e-05 | com.cisco
312 | 15561846 | 773 | 3.82831940182663e-05 | org.imagemagick
313 | 15561571 | 1318 | 2.67799890705726e-05 | edu.si
314 | 15560481 | 1000 | 3.00286031998172e-05 | com.economist
315 | 15559161 | 403 | 6.07417952206027e-05 | edu.nyu
317 | 15557711 | 816 | 3.60704732755116e-05 | com.engadget
318 | 15557648 | 207 | 0.000124999566076152 | com.youtube-nocookie
320 | 15555868 | 97 | 0.000277346301246524 | com.qq
321 | 15553042 | 921 | 3.22905898022667e-05 | com.deepmind
322 | 15548476 | 776 | 3.8040845245793e-05 | com.uk
323 | 15546635 | 532 | 4.65404973239769e-05 | com.cbsnews
324 | 15546418 | 912 | 3.26484283331076e-05 | com.sfgate
325 | 15546261 | 868 | 3.40855431726984e-05 | edu.umich
326 | 15545867 | 237 | 0.000108426094503134 | com.zendesk
327 | 15541664 | 1059 | 2.86845309523356e-05 | edu.upenn
328 | 15541417 | 784 | 3.76122537114407e-05 | com.tinypic
329 | 15541348 | 300 | 8.31191255523891e-05 | uk.gov
330 | 15539380 | 795 | 3.72790288125804e-05 | gov.nps
331 | 15538763 | 768 | 3.84226974381513e-05 | es.com.blogspot
333 | 15534785 | 1027 | 2.91722456460812e-05 | org.amnesty
334 | 15533640 | 630 | 4.06382514774634e-05 | org.un
335 | 15533524 | 1028 | 2.91342711800371e-05 | com.businessweek
336 | 15530590 | 841 | 3.51095688807082e-05 | com.prweb
337 | 15530339 | 372 | 6.6861234129418e-05 | com.hp
338 | 15529094 | 316 | 7.85334205313627e-05 | com.smugmug
339 | 15526892 | 365 | 6.80643039930844e-05 | io.atom
340 | 15526016 | 1864 | 1.66783616152529e-05 | com.wikispaces
341 | 15525436 | 561 | 4.43833209085508e-05 | com.deviantart
342 | 15524543 | 1348 | 2.59183160819967e-05 | com.quora
343 | 15523699 | 312 | 7.90560957094403e-05 | com.stackoverflow
344 | 15523357 | 11 | 0.00266026928788743 | com.godaddy
345 | 15523265 | 867 | 3.41028601949806e-05 | fr.blogspot
346 | 15521467 | 2177 | 1.43407953172429e-05 | edu.hawaii
348 | 15520800 | 230 | 0.000115300591313306 | com.googleadservices
349 | 15515626 | 100 | 0.000260210965158318 | com.addthis
350 | 15511731 | 214 | 0.00012109708512216 | com.weibo
351 | 15510271 | 189 | 0.000134991236934387 | com.tripadvisor
352 | 15510056 | 466 | 5.31867573470599e-05 | edu.cmu
353 | 15508137 | 1654 | 1.94505252040299e-05 | edu.academia
354 | 15507099 | 546 | 4.54186649903759e-05 | com.samsung
355 | 15506715 | 522 | 4.73185887798753e-05 | com.delicious
356 | 15506306 | 310 | 7.96162415466004e-05 | com.salesforce
357 | 15504623 | 1389 | 2.47638521688997e-05 | edu.psu
358 | 15501767 | 1062 | 2.8590454491023e-05 | edu.utexas
359 | 15500355 | 460 | 5.37255274210099e-05 | com.businesswire
360 | 15499671 | 328 | 7.50250431392317e-05 | fr.google
361 | 15499498 | 1054 | 2.87241689600589e-05 | com.thenextweb
362 | 15499343 | 567 | 4.37356148766312e-05 | com.skype
363 | 15499317 | 1018 | 2.94461794513264e-05 | com.thedenverchannel
364 | 15499031 | 418 | 5.87309989696444e-05 | com.booking
365 | 15498260 | 902 | 3.30560434754304e-05 | com.ft
366 | 15497563 | 1445 | 2.37713091163388e-05 | com.storify
367 | 15497320 | 138 | 0.0001870649963529 | org.bbb
368 | 15496677 | 162 | 0.000163189952970138 | com.bandcamp
369 | 15494308 | 794 | 3.73092071789939e-05 | com.giphy
370 | 15494284 | 172 | 0.000149178020840425 | com.eepurl
371 | 15492950 | 752 | 3.86710102667409e-05 | org.rubyonrails
372 | 15492593 | 980 | 3.05867053615036e-05 | ly.ow
373 | 15491779 | 350 | 7.09328555378064e-05 | com.technorati
375 | 15490311 | 1440 | 2.39101606822674e-05 | uk.ac.ox
376 | 15489287 | 1753 | 1.80336311984752e-05 | edu.ucdavis
377 | 15488936 | 306 | 8.09041634071766e-05 | mp.j
378 | 15488438 | 1669 | 1.92668234479678e-05 | edu.umd
380 | 15486978 | 196 | 0.000130600081415146 | org.joomla
381 | 15486803 | 521 | 4.74575675113086e-05 | ca.google
382 | 15486012 | 1300 | 2.71672578237322e-05 | com.foxnews
383 | 15484935 | 1138 | 2.77285208514453e-05 | com.ubuntu
384 | 15482294 | 1518 | 2.19662629504982e-05 | com.hootsuite
385 | 15481981 | 385 | 6.4295300561396e-05 | com.barnesandnoble
386 | 15481043 | 1939 | 1.59034639194696e-05 | edu.bu
387 | 15481011 | 849 | 3.47893789322901e-05 | br.com.uol
388 | 15480413 | 1052 | 2.87377660159276e-05 | org.change
389 | 15479814 | 1390 | 2.47562201385043e-05 | net.researchgate
390 | 15479732 | 1046 | 2.87858158612307e-05 | ca.cbc
391 | 15477862 | 278 | 9.10597679558766e-05 | com.myshopify
392 | 15477604 | 1189 | 2.73974389493186e-05 | gov.uspto
393 | 15473925 | 1662 | 1.93123862526904e-05 | au.com.blogspot
394 | 15471584 | 1558 | 2.11443170255096e-05 | com.ibtimes
395 | 15468867 | 127 | 0.000201494587730564 | com.windowsphone
396 | 15468553 | 1670 | 1.92511697630888e-05 | au.com.smh
397 | 15467171 | 863 | 3.43166520951593e-05 | tv.ustream
398 | 15464178 | 1764 | 1.78996714682684e-05 | edu.asu
399 | 15463055 | 1372 | 2.53582202622928e-05 | com.vice
400 | 15462257 | 935 | 3.19137383273714e-05 | ca.blogspot
401 | 15462107 | 764 | 3.85046143189563e-05 | com.msdn
402 | 15461554 | 1487 | 2.27938995549172e-05 | com.gigaom
403 | 15456424 | 1818 | 1.70999687690856e-05 | fr.lemonde
404 | 15455464 | 886 | 3.3487653580818e-05 | gov.arts
405 | 15452785 | 1419 | 2.42221672388526e-05 | au.net.abc
406 | 15452669 | 1008 | 2.97794424811164e-05 | com.pcworld
407 | 15452549 | 606 | 4.18331091885497e-05 | com.geocities
408 | 15452088 | 1103 | 2.8074916778745e-05 | com.over-blog
410 | 15451699 | 1790 | 1.74633853037583e-05 | edu.arizona
411 | 15450350 | 1380 | 2.49714498779882e-05 | uk.co.theregister
412 | 15450005 | 557 | 4.45739926153101e-05 | com.symantec
413 | 15447649 | 955 | 3.10938947847226e-05 | com.intel
414 | 15446867 | 279 | 9.09687835769951e-05 | com.wufoo
415 | 15446683 | 1725 | 1.83123804765686e-05 | it.scoop
416 | 15445503 | 41 | 0.00094307691355208 | com.fbsbx
417 | 15444645 | 1355 | 2.57197109861248e-05 | com.indiegogo
418 | 15443795 | 897 | 3.32227150997552e-05 | fm.last
419 | 15443680 | 1011 | 2.96862133554023e-05 | com.searchengineland
420 | 15443404 | 638 | 4.03884046551975e-05 | org.tigris
421 | 15441943 | 120 | 0.000213563106056296 | com.xing
422 | 15441722 | 906 | 3.28011039522532e-05 | com.steampowered
423 | 15440081 | 265 | 9.44501307131196e-05 | com.wixsite
424 | 15437622 | 893 | 3.32776293807694e-05 | in.co.google
426 | 15435634 | 696 | 3.89468179483598e-05 | org.kernel
427 | 15435243 | 1579 | 2.08114538791667e-05 | edu.rutgers
428 | 15434410 | 390 | 6.26848449332716e-05 | com.hubspot
429 | 15432056 | 945 | 3.1541523599974e-05 | ly.snip
431 | 15429997 | 1656 | 1.94377049562573e-05 | com.qz
432 | 15429003 | 437 | 5.67323209565391e-05 | com.example
433 | 15427966 | 1033 | 2.90523344374742e-05 | com.pinimg
434 | 15424166 | 2786 | 1.10264831868485e-05 | org.moma
435 | 15423777 | 1522 | 2.18888107535011e-05 | com.nymag
436 | 15423182 | 194 | 0.00013209008750952 | org.icann
437 | 15422064 | 1470 | 2.32323539012823e-05 | com.examiner
438 | 15421520 | 1066 | 2.85530482795166e-05 | com.communitywalk
439 | 15418061 | 644 | 4.01838728589206e-05 | com.googlegroups
440 | 15417796 | 505 | 4.94289151194946e-05 | com.wiley
441 | 15417576 | 362 | 6.8485542721816e-05 | us.icio
442 | 15416742 | 971 | 3.07993617237965e-05 | org.wish
443 | 15415984 | 827 | 3.57195346814776e-05 | com.brightcove
444 | 15415033 | 464 | 5.3283954538054e-05 | com.fastcompany
445 | 15414280 | 1599 | 2.03032488006835e-05 | br.com.blogspot
446 | 15413662 | 1315 | 2.68653930493553e-05 | jp.blogspot
447 | 15413140 | 1513 | 2.21687586861463e-05 | edu.duke
448 | 15412816 | 938 | 3.17991908431134e-05 | com.tqn
449 | 15411542 | 2073 | 1.50117732512574e-05 | com.denverpost
450 | 15411487 | 1890 | 1.64147347916477e-05 | edu.ufl
451 | 15409424 | 950 | 3.13211107727205e-05 | com.jsbin
452 | 15408909 | 959 | 3.10548447850379e-05 | com.bibliocommons
453 | 15408700 | 177 | 0.000144292920751828 | jp.ne.hatena
454 | 15407603 | 1567 | 2.10091681230242e-05 | edu.ucsd
455 | 15407141 | 1269 | 2.72490186529387e-05 | com.shutterstock
456 | 15406563 | 520 | 4.74694841820413e-05 | com.nwsource
457 | 15404198 | 916 | 3.25322651787121e-05 | org.lls
458 | 15402911 | 486 | 5.06747050151017e-05 | com.sxsw
459 | 15402581 | 1313 | 2.6876013908712e-05 | com.box
460 | 15401873 | 1798 | 1.73668495255569e-05 | edu.tamu
461 | 15400807 | 1363 | 2.55335729572364e-05 | com.lifehacker
462 | 15400234 | 677 | 3.91846367320029e-05 | org.bitbucket
463 | 15400214 | 890 | 3.33135211093082e-05 | com.reverbnation
464 | 15399047 | 1038 | 2.89493063164984e-05 | com.intuit
466 | 15397689 | 1426 | 2.41475123446898e-05 | com.theglobeandmail
467 | 15395311 | 1527 | 2.17995803147687e-05 | com.boston
468 | 15394935 | 2639 | 1.16533507648765e-05 | edu.wsu
469 | 15394864 | 2508 | 1.2433445453334e-05 | com.codecademy
470 | 15394856 | 303 | 8.18811211827657e-05 | com.mailchimp
471 | 15394689 | 3599 | 8.40481697313659e-06 | edu.case
473 | 15394158 | 854 | 3.4663109492416e-05 | org.vim
474 | 15393227 | 1361 | 2.55650930767044e-05 | gov.usgs
475 | 15392918 | 2938 | 1.03733739575562e-05 | edu.udel
476 | 15391640 | 1588 | 2.06360756674507e-05 | com.lulu
478 | 15388924 | 571 | 4.35997856907777e-05 | com.herokuapp
479 | 15388579 | 2833 | 1.07911954379239e-05 | edu.usf
480 | 15387385 | 1414 | 2.42826197261023e-05 | org.arxiv
481 | 15386491 | 1458 | 2.3550335270772e-05 | com.hollywoodreporter
482 | 15385502 | 829 | 3.56957822588539e-05 | com.squareup
483 | 15384701 | 560 | 4.45055236043777e-05 | com.rawgit
484 | 15384598 | 994 | 3.01587098357822e-05 | int.wipo
485 | 15384544 | 513 | 4.89234029936009e-05 | int.who
486 | 15384253 | 1479 | 2.30586570568822e-05 | com.mlb
487 | 15380758 | 423 | 5.84115059504074e-05 | nl.google
488 | 15379754 | 1514 | 2.21563978466527e-05 | uk.ac.cam
489 | 15379070 | 435 | 5.69413663780201e-05 | it.google
490 | 15377833 | 2457 | 1.27170779759503e-05 | edu.rochester
491 | 15377085 | 583 | 4.26581830659939e-05 | com.sciencedirect
492 | 15376952 | 540 | 4.5590233520292e-05 | es.google
493 | 15375930 | 492 | 5.04565655944046e-05 | com.prnewswire
496 | 15374063 | 512 | 4.89486762586997e-05 | com.netdna-cdn
497 | 15373640 | 895 | 3.32748831426672e-05 | gov.census
498 | 15372707 | 526 | 4.70109683913423e-05 | org.acm
499 | 15372236 | 997 | 3.00968502339631e-05 | uk.co.eventbrite
500 | 15371625 | 264 | 9.45864958803737e-05 | com.dribbble
502 | 15369552 | 438 | 5.66258761382857e-05 | com.bigcartel
503 | 15369206 | 777 | 3.80078034824024e-05 | com.blackberry
504 | 15368955 | 323 | 7.62773882788658e-05 | it.placehold
505 | 15368028 | 2523 | 1.23363276509568e-05 | com.freewebs
507 | 15365341 | 847 | 3.491720143606e-05 | gov.house
508 | 15365035 | 542 | 4.5495955558977e-05 | com.moz
509 | 15364679 | 1037 | 2.89595958340174e-05 | com.pcmag
510 | 15363450 | 1090 | 2.82004663551755e-05 | th.co.lazada
511 | 15363266 | 595 | 4.21623397499087e-05 | com.java
512 | 15363220 | 2446 | 1.27866233856021e-05 | edu.brown
513 | 15362506 | 1490 | 2.27546025150466e-05 | org.unesco
514 | 15357119 | 1039 | 2.89434233736272e-05 | com.timeanddate
516 | 15355214 | 2062 | 1.50466348954934e-05 | com.me
517 | 15354656 | 556 | 4.45769549760707e-05 | com.wunderground
518 | 15354608 | 478 | 5.15403521918125e-05 | com.naver
519 | 15354297 | 1020 | 2.93180988863568e-05 | com.googlelabs
520 | 15354200 | 2545 | 1.21756108703223e-05 | edu.dartmouth
521 | 15354116 | 1069 | 2.84725561625317e-05 | com.cafepress
522 | 15353246 | 1658 | 1.93811871032173e-05 | nl.blogspot
523 | 15352221 | 3361 | 9.06439332694793e-06 | com.blog
524 | 15349206 | 566 | 4.37902161051386e-05 | net.openid
525 | 15348907 | 1181 | 2.74418894971913e-05 | gov.state
526 | 15348783 | 587 | 4.2359471534077e-05 | com.campaign-archive2
527 | 15348394 | 502 | 4.95388543949701e-05 | com.snapchat
528 | 15347563 | 3530 | 8.63918778341324e-06 | com.answers
529 | 15347241 | 3126 | 9.74923014967996e-06 | com.panoramio
530 | 15346426 | 762 | 3.85484763743062e-05 | org.doi
532 | 15343527 | 519 | 4.77202432693482e-05 | gov.usda
533 | 15343427 | 995 | 3.01325901940305e-05 | com.nydailynews
534 | 15343011 | 2678 | 1.14107241890089e-05 | com.fiverr
535 | 15342904 | 404 | 6.05683558445266e-05 | gov.irs
536 | 15340458 | 1586 | 2.06527553264387e-05 | com.mediafire
537 | 15338766 | 2049 | 1.51263773021797e-05 | com.yolasite
538 | 15338739 | 1793 | 1.74163462409128e-05 | edu.northwestern
539 | 15338645 | 3048 | 9.99423133900803e-06 | au.edu.unimelb
540 | 15338264 | 2955 | 1.02945568869793e-05 | pt.sapo
541 | 15337836 | 2252 | 1.3890799881422e-05 | edu.uci
542 | 15337304 | 2156 | 1.45372776850819e-05 | com.vox
543 | 15337112 | 1343 | 2.60056262188797e-05 | de.spiegel
544 | 15336552 | 1831 | 1.6951961616819e-05 | edu.indiana
545 | 15336407 | 1499 | 2.24037156386934e-05 | com.mixcloud
546 | 15336092 | 402 | 6.13443915079197e-05 | com.gotowebinar
547 | 15335961 | 1486 | 2.27988200023977e-05 | com.politico
548 | 15335080 | 1928 | 1.60404649507357e-05 | uk.ac.ed
549 | 15334549 | 2453 | 1.27409920227876e-05 | org.coursera
550 | 15333737 | 2728 | 1.12118871484912e-05 | edu.iastate
551 | 15333326 | 1320 | 2.67149480639566e-05 | com.gofundme
552 | 15331375 | 1311 | 2.69569796358108e-05 | com.sciencedaily
553 | 15328641 | 2561 | 1.20420514958202e-05 | edu.wustl
554 | 15326663 | 1385 | 2.48050571742303e-05 | com.com
555 | 15324920 | 627 | 4.09333910649588e-05 | org.postgresql
556 | 15324038 | 2742 | 1.11586867111723e-05 | com.aljazeera
557 | 15322328 | 3388 | 8.99605195071346e-06 | cc.tiny
558 | 15322264 | 2022 | 1.52880655359817e-05 | edu.jhu
559 | 15321808 | 869 | 3.39056838936811e-05 | tv.twitch
560 | 15320933 | 4059 | 7.40539131056082e-06 | org.edublogs
561 | 15320472 | 1606 | 2.01774638677572e-05 | com.scientificamerican
562 | 15320227 | 2253 | 1.38879432187941e-05 | edu.georgetown
563 | 15320215 | 831 | 3.56813161908985e-05 | fr.amazon
564 | 15318380 | 1503 | 2.23612984764625e-05 | org.eff
565 | 15317401 | 3437 | 8.86702554218844e-06 | com.metacafe
566 | 15317100 | 191 | 0.000133568759135424 | jp.ameblo
567 | 15317067 | 1693 | 1.8888419168614e-05 | edu.purdue
568 | 15315866 | 1341 | 2.60218077289035e-05 | com.techtarget
569 | 15315501 | 1796 | 1.7382847608717e-05 | com.computerworld
570 | 15315302 | 358 | 6.92385615991618e-05 | com.list-manage1
571 | 15314942 | 2630 | 1.16835618607635e-05 | com.voanews
573 | 15313159 | 810 | 3.63928853611669e-05 | org.filezilla-project
575 | 15312739 | 1584 | 2.07088116839661e-05 | com.globo
576 | 15312531 | 1695 | 1.88078663306853e-05 | com.espn
577 | 15312290 | 61 | 0.000565420241560954 | com.googletagmanager
579 | 15310359 | 1953 | 1.57895689797867e-05 | com.econsultancy
580 | 15309983 | 2199 | 1.42058821135541e-05 | edu.caltech
581 | 15309668 | 4006 | 7.50369095396379e-06 | ca.yorku
584 | 15308817 | 1493 | 2.26547145145621e-05 | uk.co.mirror
585 | 15307236 | 1405 | 2.44901711292864e-05 | gov.wa
586 | 15306671 | 149 | 0.000177379963368227 | com.bleacherreport
587 | 15306314 | 112 | 0.000229383265424507 | ru.mail
588 | 15306144 | 555 | 4.46855914261337e-05 | com.wikihow
589 | 15305890 | 2552 | 1.20966873341379e-05 | com.posterous
590 | 15305715 | 3450 | 8.82359793937048e-06 | edu.getty
591 | 15305261 | 3153 | 9.66215408282903e-06 | edu.iu
592 | 15305198 | 1802 | 1.73123902085213e-05 | com.networkworld
593 | 15303934 | 2990 | 1.01379864079042e-05 | edu.rice
594 | 15303667 | 493 | 5.03591727892366e-05 | com.force
595 | 15302835 | 3165 | 9.62217258755941e-06 | com.popsci
596 | 15302205 | 2057 | 1.50768807284092e-05 | com.blogs
597 | 15301680 | 966 | 3.09598932353572e-05 | ca.amazon
598 | 15301018 | 1492 | 2.26668010052997e-05 | com.prezi
599 | 15300885 | 675 | 3.92213914102238e-05 | org.openstreetmap
600 | 15300438 | 1433 | 2.39976511418591e-05 | org.videolan
601 | 15300390 | 2692 | 1.13711151099246e-05 | edu.oregonstate
602 | 15300263 | 905 | 3.28686767582425e-05 | com.mckinsey
603 | 15299663 | 1374 | 2.53508968618347e-05 | co.vine
604 | 15297995 | 2171 | 1.44041517200603e-05 | com.udemy
605 | 15296726 | 1690 | 1.90166303335655e-05 | com.indeed
606 | 15296477 | 1303 | 2.71367812776429e-05 | com.500px
607 | 15295228 | 1770 | 1.77233586297493e-05 | com.airbnb
608 | 15294719 | 1756 | 1.79860216972368e-05 | com.us
609 | 15294676 | 2896 | 1.05399264773873e-05 | edu.buffalo
610 | 15293634 | 1684 | 1.90790330006778e-05 | com.codeplex
611 | 15292934 | 253 | 9.81881014385652e-05 | jp.co.amazon
612 | 15292288 | 1542 | 2.14433652955457e-05 | mil.navy
613 | 15291760 | 357 | 6.9641067198322e-05 | net.themeforest
614 | 15291748 | 1875 | 1.65847048348261e-05 | com.searchenginewatch
615 | 15291136 | 1001 | 2.9937094462355e-05 | com.weather
616 | 15290802 | 2498 | 1.24886423946348e-05 | com.instructables
617 | 15289988 | 1598 | 2.04147744275967e-05 | com.infoworld
618 | 15288865 | 1338 | 2.60730524379888e-05 | org.postimg
619 | 15288655 | 1014 | 2.96274588622905e-05 | gov.nist
620 | 15287654 | 1655 | 1.94498068426432e-05 | me.flavors
621 | 15287153 | 2189 | 1.42494598478046e-05 | hk.com.google
622 | 15285623 | 1569 | 2.09919187442072e-05 | org.worldbank
623 | 15285448 | 446 | 5.57049729509609e-05 | jp.ne.sakura
624 | 15285211 | 919 | 3.24107060440747e-05 | gov.senate
625 | 15284629 | 1330 | 2.62832178308331e-05 | com.dell
626 | 15283827 | 1137 | 2.77346639122896e-05 | com.alternion
627 | 15283594 | 1858 | 1.67084293512309e-05 | org.weforum
628 | 15282288 | 2578 | 1.19657096495962e-05 | edu.vanderbilt
629 | 15281800 | 1048 | 2.87643509027424e-05 | com.istockphoto
630 | 15281782 | 1319 | 2.67520978971456e-05 | org.unicef
631 | 15281371 | 1819 | 1.70846042948261e-05 | com.blogtalkradio
632 | 15281301 | 1368 | 2.54329071778898e-05 | com.psychologytoday
633 | 15280641 | 1604 | 2.02133971936181e-05 | com.digitaltrends
634 | 15279721 | 1107 | 2.80258341333577e-05 | com.bitballoon
635 | 15279522 | 201 | 0.000128219206443624 | org.purl
636 | 15279264 | 36 | 0.00104599466571899 | com.parallels
637 | 15279200 | 2880 | 1.0610438448301e-05 | au.edu.anu
638 | 15279147 | 1379 | 2.4993271977574e-05 | com.walmart
642 | 15278238 | 1917 | 1.61582921968463e-05 | com.howstuffworks
643 | 15278227 | 2410 | 1.30002357541342e-05 | au.com.theaustralian
644 | 15277966 | 2019 | 1.52926412575938e-05 | ca.utoronto
645 | 15277900 | 3108 | 9.82306713797465e-06 | com.seekingalpha
646 | 15277858 | 767 | 3.84259493608188e-05 | com.cargocollective
647 | 15277312 | 1036 | 2.89853194713678e-05 | com.yarnpkg
649 | 15275046 | 2546 | 1.21728936988854e-05 | com.patch
650 | 15273144 | 530 | 4.65890979908163e-05 | com.marriott
651 | 15272960 | 2564 | 1.20339218653302e-05 | nz.co.stuff
652 | 15272836 | 1461 | 2.34742927248467e-05 | com.bufferapp
653 | 15272534 | 233 | 0.000113347706063239 | com.shopify
654 | 15272287 | 2861 | 1.0673622554462e-05 | br.com.abril
655 | 15272172 | 579 | 4.31163726460518e-05 | gov.fda
656 | 15272111 | 2654 | 1.15546716330409e-05 | org.metmuseum
657 | 15270748 | 1771 | 1.7720478702479e-05 | com.mac
658 | 15270052 | 1921 | 1.61418125507527e-05 | edu.unc
659 | 15268572 | 885 | 3.35402797098504e-05 | com.webmd
661 | 15268283 | 2121 | 1.47329526454372e-05 | com.nokia
662 | 15268033 | 1483 | 2.2878848034952e-05 | org.plos
663 | 15267754 | 4046 | 7.42719003848336e-06 | edu.ua
664 | 15267001 | 2238 | 1.3967856594105e-05 | org.wiktionary
665 | 15264534 | 436 | 5.67508362709212e-05 | com.bitly
666 | 15264521 | 1739 | 1.8211369785099e-05 | com.xkcd
667 | 15264460 | 1930 | 1.59995476208952e-05 | com.amzn
668 | 15264453 | 2469 | 1.26617143551831e-05 | com.allthingsd
669 | 15262825 | 2856 | 1.06933839212956e-05 | edu.ucr
670 | 15262636 | 1511 | 2.21802765343256e-05 | gov.fbi
671 | 15262608 | 1101 | 2.81053814869449e-05 | com.oreilly
6721526063919751.56605882146644e-05net.comcast
6741525959123911.30968680707361e-05com.pastebin
6751525913924761.26118695308562e-05net.boingboing
6761525873317841.75043751947912e-05com.playstation
6771525866623981.30425871141071e-05com.canalblog
678152582566953.89580808757773e-05net.launchpad
679152579334775.1625634539935e-05com.youku
6801525699934128.92246050318757e-06cc.co
6811525682831409.69399959688503e-06com.twitpic
6831525670416132.00790892379684e-05com.today
6841525621311162.79029401808774e-05org.cmlibrary
6861525544612222.73108581082248e-05de.heise
6871525511822281.40200068711249e-05com.googlepages
6881525491816721.92155195280816e-05com.technet
689152540683028.22175508543349e-05com.getbootstrap
6901525381234338.87454964811042e-06edu.temple
6911525299925561.20687755115901e-05com.ign
6931525247431519.66350917742778e-06com.upi
6941525227620741.50071783868689e-05com.nba
695152520148483.48867990264086e-05cn.com.sina
6961525158618701.66343745503257e-05uk.co.thesun
6971525077821051.48179911042675e-05com.discovery
6981525016913042.71164863688636e-05com.getpocket
6991524953019321.5985973562373e-05com.thedailybeast
700152494568943.32758542409795e-05org.gnupg
701152493389863.03560667284895e-05net.azurewebsites
7021524929513592.56231659003632e-05gov.fcc
7031524923313252.6533125002868e-05ru.google
704152490918093.64185428454994e-05com.hilton
7051524809411212.78672834461891e-05gov.ny
7061524687341557.2064158065949e-06edu.syr
7071524634614482.37502348483709e-05br.com.google
7081524626515232.18628778111356e-05com.zazzle
710152452586044.19341270866239e-05com.entrepreneur
7111524512223481.34109723726176e-05ch.ethz
7121524463942447.05352140883181e-06com.space
7131524461830629.9552002600227e-06com.indiewire
7141524326836898.20683368467091e-06it.repubblica
7151524248711592.75783380257729e-05jp.exblog
7161524170716271.97344047444238e-05com.vmware
7171524115317721.77159273715622e-05gov.nyc
7181524059744696.70690376470189e-06edu.fiu
719152402655854.24602948328796e-05com.typeform
7201523977911192.7880890311812e-05edu.alamo
721152393575764.32670162572188e-05com.metafilter
7221523910018741.66039378221903e-05gov.utah
7231523856627561.10983574160883e-05edu.pitt
724152382138373.54670748199058e-05com.technologyreview
7251523745527431.11564239851983e-05ms.1drv
7261523697120011.54095292080742e-05com.mercurynews
7271523558725281.23026764563188e-05com.urbandictionary
7281523521848126.26441533285652e-06uk.ac.st-andrews
731152342405004.95978355797237e-05org.freecsstemplates
7321523411923101.36176106587469e-05edu.ncsu
733152332987743.81419726467888e-05com.newsweek
7341523311325151.23694068280388e-05edu.vt
7351523292822421.39511923347534e-05org.gimp
7361523253521251.47205731999793e-05com.ehow
7371523242227661.10740320522725e-05org.greenpeace
7391523178013862.4798751939996e-05com.steamcommunity
7401523175916811.90934768631241e-05com.mcafee
7411523005925791.19584044094577e-05com.irishtimes
7421522909515972.04331098882185e-05com.livestream
7431522882813642.55057605268654e-05com.stackexchange
7441522845213772.51087702775784e-05com.feedly
7451522808043056.93390584748876e-06edu.pdx
746152279492440.000102784590967636jp.co.yahoo
7471522789618791.65447602576108e-05com.zillow
7481522713523991.30401089444467e-05com.foxsports
7491522693712172.73204489809514e-05com.redbubble
7511522591515712.09669925267844e-05com.reference
7521522580817751.76300650079329e-05se.google
753152251665184.78475650600298e-05gov.copyright
7541522493520461.51404543018914e-05uk.co.wired
7551522470810532.87281963287335e-05com.templatemonster
7561522378626031.18487163717953e-05com.tutsplus
7571522314822751.38238195338617e-05edu.uiuc
7581522199551345.8713899338997e-06edu.brandeis
7591522172425631.20373170102433e-05com.observer
7601522065914442.37776101286514e-05com.justgiving
7611522063530171.00492793873175e-05gd.is
762152202381430.000182177431084415com.people
764152182724545.43767285380984e-05com.nypost
7651521779220171.52947227103117e-05org.openoffice
7661521766415212.19029938843643e-05net.yahoo
767152172395154.86242565462417e-05com.photoshelter
7681521619911052.80631360388941e-05org.stopidentityfraud
7691521595629411.03557602124572e-05edu.nd
7701521582710762.83930562753374e-05com.sagepub
771152156197923.73306683068506e-05com.cdbaby
7721521514630809.89422877978489e-06com.pbworks
7731521507831319.73273115724507e-06ch.epfl
7741521473010162.95913720075838e-05net.docdroid
7751521434435388.59892849115483e-06com.nationalpost
7761521430618361.6890525363002e-05nl.xs4all
7771521313025311.22807841365904e-05edu.osu
7781521239017541.80225680175084e-05com.target
7791521173329961.01271424887745e-05ca.ualberta
7801521025222641.38516035718976e-05com.britannica
7811520885535218.66915050464592e-06com.readwriteweb
7821520875623621.32853529783148e-05tv.blip
783152084496623.95683024324097e-05com.webnode
784152076435334.61857724018269e-05com.informit
785152075765884.23206675622496e-05com.houzz
786152066819173.25050735019443e-05com.atlassian
787152064177033.88738057972146e-05gov.epa
7881520530514362.39702540982123e-05com.patreon
7891520487519131.62263884951756e-05org.pnas
7901520452646026.54795414698615e-06com.plurk
7911520449525651.20309774714031e-05com.deadline
7921520411724611.2694191531719e-05com.csmonitor
7931520324525581.20590026406928e-05com.thefreedictionary
7941520285614822.29792000644675e-05com.techradar
795152016543666.79713039346072e-05gov.export
7961520154613022.71523901829089e-05jp.jugem
7971520117139147.66665907938012e-06com.patheos
798152009657973.70188361564265e-05com.springer
7991520063331619.63918862375184e-06com.fifa
8011519988229101.04940988571428e-05uk.co.metro
8021519982827961.09820826242647e-05com.podbean
8031519934728421.07590520070566e-05com.eu
8041519886523411.34651543708415e-05com.9to5mac
8051519876638937.72829568187162e-06edu.ucsc
8061519836613662.54906334920141e-05gov.sba
8071519763825291.22968407272032e-05org.khanacademy
8081519688333209.15893198278876e-06com.discovermagazine
8091519604616301.97102959583915e-05com.bloglovin
8101519582313122.68820751313068e-05gov.dot
8111519493924791.26030313566866e-05gov.cia
8121519483639137.66754130260334e-06edu.missouri
813151945217413.86928597648577e-05gov.sec
814151941064625.34455013327675e-05com.adage
8161519405917481.81076747100426e-05org.owasp
818151933618143.62833594255235e-05org.maven
820151931825904.22817895742876e-05com.clicky
8221519258719861.5564837127422e-05com.netvibes
8231519150317011.86858940694352e-05uk.org.greenend
8241519119230799.8973806614334e-06edu.uic
8251519049315562.11535012115229e-05net.seesaa
8271518958418711.66248944998599e-05com.livescience
8281518905823291.3520097935398e-05com.newscientist
8291518901124341.28219860937364e-05edu.umass
8301518828344816.69072560608157e-06edu.byu
8311518820523431.34543166872523e-05edu.gwu
8321518744025051.24388641430322e-05com.getsatisfaction
8331518675035638.49826407908942e-06edu.ucar
8341518604219671.57161284089379e-05com.bestbuy
8351518560126181.17677918287584e-05org.slashdot
8361518556033029.20889985322655e-06com.lynda
8371518502515522.12374312638794e-05com.itunes
838151839568983.31541720640316e-05com.linksynergy
839151820442200.000119236860219388com.elegantthemes
840151818954715.27435047435429e-05com.zenfolio
841151816142998.36816972799617e-05kr.flic
8421518148647876.29442653768555e-06com.treehugger
8431518140749066.14399747187076e-06com.4shared
8441518127727751.10589713260247e-05sg.com.google
845151812568703.38511869472674e-05com.uservoice
8461518071725531.20874453751689e-05au.com.news
8471518023029591.02691773243877e-05edu.uoregon
84915179697390.000963859187439238com.atlassolutions
8501517927614152.42376980846539e-05com.calameo
851151787656943.89619798369146e-05gov.ed
8531517801514412.38849164666484e-05net.edgesuite
8541517682218101.72249794008617e-05com.angelfire
8551517664128641.06660801729435e-05com.friendfeed
8561517645933469.09973283027648e-06com.squidoo
8571517623143626.86729655781333e-06edu.sjsu
858151754825344.59747379453658e-05com.list-manage2
859151752826394.03760068683785e-05org.w
8601517527619761.56554998174229e-05com.billboard
861151752708883.33229976472426e-05org.jenkins-ci
8621517478418151.71639406507681e-05org.craigslist
8631517309814132.43275295847814e-05ch.google
8641517291216241.98193414781252e-05com.norton
8651517275924991.24788748858098e-05org.gutenberg
8661517259835428.58529237980699e-06edu.gmu
8671517207943326.90179812828641e-06com.urbanoutfitters
869151702499633.0982348094362e-05com.warriorplus
8701517006528921.05536267729891e-05gov.lbl
8711516991524301.28272703930632e-05ca.ubc
8721516964736708.24518633348136e-06edu.emory
8731516963519601.57546904235563e-05com.madmimi
8741516959127771.10579126365468e-05com.flipboard
8751516879326491.158087404906e-05uk.co.express
8761516856130449.9973459042887e-06com.groupon
8771516852965964.61066717637004e-06fm.ask
8781516847217581.79602590370848e-05org.dyndns
879151681219403.16738878962008e-05org.osgeo
8801516797134028.95368354470453e-06edu.unh
8811516756837028.16829795347617e-06com.dreamstime
8821516744721741.43790430825757e-05de.mpg
8831516721723521.33549876043573e-05com.timeout
8841516707522911.37930561320996e-05com.trello
8851516688544346.75650883193862e-06net.inquirer
8871516378833978.97470907356599e-06tt.db
8881516340325681.2028676812975e-05com.smashwords
8891516304540097.5009851913861e-06org.lifehack
890151627906114.15506626046456e-05com.mapbox
8921516262740887.33799753523127e-06com.theta360
8941516250826501.15685409144789e-05com.elpais
8951516218939847.5452462328069e-06mx.unam
8971516207417151.84504055261042e-05com.hyatt
8981516158922991.37457361006392e-05ca.qc.gouv
8991516129329151.04695633262643e-05edu.uga
9001516091228901.05582698397468e-05tv.periscope
9011516084622391.39665182554598e-05com.ikea
9021516070032319.40684172196742e-06com.rt
9031516013115082.22312003369202e-05com.netscape
9041515991640847.3412350461331e-06com.dpreview
9051515955729691.02451694615442e-05com.waze
9061515947031749.58306422626019e-06edu.tufts
9081515824814652.32953650777979e-05com.usps
9091515820726331.16729003842566e-05org.wnyc
9101515688946206.52677225898427e-06com.pbase
9111515688023271.35354017520418e-05uk.co.huffingtonpost
9121515687622041.41810128439695e-05com.socialmediaexaminer
9131515648524521.27547105359095e-05com.miamiherald
9141515635116771.91454793535634e-05com.163
9151515526219291.60291831429757e-05com.nfl
9161515399115252.18455988672043e-05com.fedex
9171515395918341.69239070862306e-05com.suntimes
9181515390734358.87050857955755e-06net.uk
9191515376528461.07327970998889e-05net.earthlink
9201515338925751.19824685546719e-05org.eu
9211515183510432.88478957939967e-05jp.ac.kobe-u
9221515142940147.48637493537742e-06com.macrumors
923151508876224.12161691159013e-05com.friendster
9251515065849386.09241171910296e-06li.paper
9261515059310172.94689613731383e-05org.aclweb
9291514778842487.04318153363729e-06org.nypl
9301514625036618.25885549831049e-06com.seattlepi
931151462249323.19355693878378e-05edu.utep
9321514612940917.33275777840132e-06com.esquire
9331514561516511.94907985965323e-05com.alibaba
9341514525419461.58403729551741e-05com.gallup
935151451919753.07094414631654e-05com.gartner
9361514468730321.00243908229847e-05org.ibiblio
9371514460422061.41652445483039e-05ru.habrahabr
938151443555694.36758910514297e-05com.orkut
9391514401037658.05161795367949e-06com.psychcentral
941151427119993.00349367316673e-05com.tandfonline
9421514229640757.3622724322498e-06be.blogspot
9431514182616361.96036095572849e-05ru.spb
9441514094211912.73934146878194e-05jp.ne.goo
9451514070251995.81304350399957e-06io.soup
9461514043963684.80285781528238e-06com.autoblog
9471514009647696.31731333379226e-06gr.blogspot
9481514003017311.82779854197245e-05com.marketingland
9491513977233559.0739118458068e-06org.propublica
9501513973113822.49154931051116e-05com.unsplash
9511513922436728.24261668907916e-06uk.ac.vam
95215138529540.000677034331007159net.ovh
9531513839118531.67490179527703e-05org.ap
9541513838217471.81116122309211e-05de.bayern
9551513831726141.1780491106467e-05com.ezinearticles
9561513827535328.63374503998574e-06at.ac.univie
9571513807619021.62814600358615e-05com.lww
958151379048263.58116025428148e-05org.whatbrowser
9591513780219091.62391225244165e-05ru.narod
9601513749517991.73626417117781e-05de.welt
9611513712825271.23125015700162e-05com.vogue
9631513676826671.14903880876818e-05com.sacbee
9641513649522231.40422839026896e-05com.salon
9661513519333709.04444445804001e-06com.howtogeek
9671513505638777.75786443249133e-06com.scienceblogs
9681513483025761.19753053105799e-05com.speakerdeck
9691513463537408.0925036532909e-06edu.utk
9701513462221151.47665516324275e-05edu.colorado
9711513436226631.15263452785689e-05ca.ctvnews
9721513333332169.46266143255436e-06edu.rit
9731513314950215.99011002035218e-06com.spreaker
9741513287231729.58906484578881e-06com.sbnation
9751513282944746.70057653000476e-06com.ravelry
9761513246191193.36255538243955e-06com.purevolume
9771513221314292.40986597355896e-05org.redcross
9781513195110122.96536226245788e-05it.amazon
9791513053038307.87841169754985e-06jp.co.japantimes
9801513052339477.60521847167761e-06edu.neu
9811512996329841.01806930403748e-05org.notepad-plus-plus
9821512985412492.72621515187053e-05us.tx.state
9831512983858065.24730127406629e-06edu.stonybrook
984151291908793.36537032047091e-05com.githubusercontent
9851512875333569.07349412732171e-06nz.co.nzherald
9861512868244666.71188460059978e-06au.edu.uq
9871512820217431.81424145083871e-05com.hulu
9891512737037228.11380321181256e-06com.kotaku
9901512719728031.09596426379084e-05com.pixlr
9911512683225361.22403614813801e-05org.plosone
9921512681923191.35869429391614e-05be.skynet
9931512662852855.73383523576963e-06org.kiva
9941512624558895.19218457452368e-06com.dailykos
9951512556519471.58394285740344e-05com.windows
9971512464843806.83359980472686e-06com.appleinsider
9981512444572724.16137009681623e-06com.zimbio
9991512431234938.72400390257565e-06com.sony
10001512423115162.20203522611855e-05com.usnews
10011512411822131.41046155978645e-05org.jstor
10021512406335788.44147236834851e-06br.usp
10031512396216921.89005659232104e-05com.hotmail
10041512394535918.42266358295749e-06org.publicradio
10051512392342996.94773919350873e-06ca.queensu
10071512291930949.85256523885283e-06com.techsmith
10081512284613922.47451015821843e-05com.starwoodhotels
10091512266923321.35082248225311e-05int.itu
10101512244147216.38885267967483e-06ru.msk
10111512232119181.61560935000156e-05com.slack
10121512226735188.67216939620092e-06com.ocregister
10131512152815612.11146320845391e-05com.att
10141511959424191.29223390974961e-05ru.com
10161511876127261.12330945363111e-05fr.lefigaro
10171511827019361.59672085303341e-05com.merriam-webster
1018151179088153.62635751772531e-05org.sonatype
10191511790615762.08815868514927e-05com.domain
1020151177098193.60499120933233e-05jp.livedoor
10211511753320471.51327792586241e-05com.gettyimages
10221511744827801.10566824510243e-05org.pewresearch
10231511724821461.46293466334101e-05com.xbox
10241511720423361.34965990334014e-05com.philly
10251511713519981.54386617149377e-05com.macworld
10261511677029351.03852509028486e-05com.pandora
10271511600522351.39818040735583e-05com.screencast
10281511577829131.04753710838801e-05cn.com.chinadaily
10291511523817811.75423937882377e-05co.angel
10301511493836118.37665878835035e-06com.softpedia
10311511481834388.86482430240979e-06com.sun-sentinel
10321511456237618.0617698796768e-06com.gq
10331511397547606.33100426632562e-06uk.ac.lse
10341511345656225.41405793070476e-06ie.dcu
10351511306642427.05731974728055e-06cz.cvut
10361511222719051.62743170924513e-05us.mn.state
10371511199740007.52083022965871e-06com.gamespot
10381511143916801.90947838704124e-05net.battle
10391511108243266.90589418244931e-06org.kqed
10401511060025131.23787698420637e-05com.deezer
10411511059063084.84743641510709e-06com.weheartit
10421511040624361.28163659354189e-05us.fed.fs
10431511000560605.04386130379262e-06edu.boisestate
10441510971636588.26486260361324e-06com.itv
10451510961935758.44807075839288e-06com.chromeexperiments
10461510922224271.28446958304232e-05com.ecwid
10471510922218171.71096481543741e-05com.rollingstone
10481510909021621.45052881285571e-05com.lg
1049151090346493.97925746370024e-05gov.usa
10501510872913842.48508143917158e-05com.mediapost
10511510864218941.63687473890889e-05org.7-zip
10531510752834518.82152744851336e-06uk.ac.ucl
10561510603914182.42315485369215e-05org.oecd
10571510517145876.56637741524037e-06edu.uconn
10581510489219921.54943353581812e-05com.oup
10591510409721871.42705509195955e-05com.fastcodesign
10601510406917611.7949821530063e-05org.fao
10611510368919661.57209150085303e-05com.ssrn
1062151030236124.15348394643097e-05com.ea
10631510270111972.73652354802987e-05site.tenerifeforum
10641510223424561.27338596207715e-05org.gnome
10651510213228121.09082842009225e-05com.dallasnews
10661510189247036.41996854624355e-06edu.baylor
10671510095726251.17066663780132e-05com.upwork
1068151009385684.3688041709467e-05com.udacity
10691510080338107.92556546534424e-06com.marthastewart
1071150999031080.000233897006385183com.scorecardresearch
10721509984072274.18439407525323e-06com.smore
10731509976421851.42753327167622e-05gov.uscourts
10741509948723311.35150828730417e-05com.thestar
10751509938637967.95835809109849e-06com.mediabistro
10761509855624671.26718746433665e-05com.washingtontimes
10771509828721731.43821141052024e-05gov.fws
10781509824726981.13376477961827e-05de.fu-berlin
10791509693833948.97977597433521e-06org.thinkprogress
10801509641523781.31988303928714e-05uk.co.ebay
10811509603115342.16177620913569e-05org.cancer
10821509593021921.42362632550556e-05com.tunein
10841509538245286.63269521044456e-06net.fusion
10851509532334838.75212816606324e-06com.shutterfly
10861509490128451.07360892958791e-05edu.unl
10871509450216331.96647199416434e-05gov.archives
10881509448135558.53632335646553e-06ca.uwaterloo
10891509433041957.13316306266006e-06com.rebelmouse
10901509422970954.27197287772721e-06nr.co
10911509408410732.84596850945921e-05com.wsoctv
10921509323424401.28051282516683e-05ch.cern
10931509296713762.51450016917556e-05com.magentocommerce
10941509286230051.00907350023168e-05com.blurb
10951509269127301.12020560201465e-05com.medscape
10981509162748916.17105152283906e-06com.thingiverse
10991509151516122.00799886698044e-05com.biblegateway
11001509117431879.55435977880778e-06com.carbonmade
11011509015223281.35223797685066e-05gov.weather
11021508999522021.41915895238769e-05com.canva
11031508972416631.93009906939172e-05com.thehill
11041508904028831.06020285008986e-05com.twilio
11051508876832819.2762882038917e-06tv.arte
11061508864569714.35419930787964e-06sg.blogspot
11071508785240807.35568049360429e-06com.gawker
11081508739175074.03564477881975e-06fr.online
11091508717633968.97494140584996e-06edu.colostate
11101508694652595.75702090283342e-06edu.nau
11111508693731329.73185560248581e-06net.jsfiddle

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

July 2017 Crawl Archive Now Available

The crawl archive for July 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-30/. It contains 2.89 billion+ web pages and over 240 TiB of uncompressed content.

To improve coverage and freshness we used the top 50 million ranked hosts from the Feb/Mar/Apr 2017 webgraph data set and added over 550 million new URLs (not contained in any crawl archive before), of which:

  • 300 million URLs were found by a side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts;
  • 250 million URLs are a random sample extracted from sitemaps (if provided by any of these 50 million hosts).

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

 | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-30/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-30/warc.paths.gz | 72000 | 57.62
WAT files | CC-MAIN-2017-30/wat.paths.gz | 72000 | 18.58
WET files | CC-MAIN-2017-30/wet.paths.gz | 72000 | 8.19
Robots.txt files | CC-MAIN-2017-30/robotstxt.paths.gz | 72000 | 0.16
Non-200 responses files | CC-MAIN-2017-30/non200responses.paths.gz | 72000 | 5.03
URL index files | CC-MAIN-2017-30/cc-index.paths.gz | 302 | 0.25
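
As a minimal sketch of the prefixing step described above, the following Python reads a gzipped path listing and prepends one of the two prefixes to each line. The function names and the two-line sample listing are hypothetical and only stand in for a real file such as warc.paths.gz; the sample paths are placeholders, not actual archive paths.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def read_paths_gz(data):
    """Decompress the bytes of a *.paths.gz listing into a list of lines."""
    with gzip.open(io.BytesIO(data), mode="rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


def expand_paths(lines, prefix):
    """Prepend the given prefix to every non-empty line of a listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical sample standing in for a downloaded warc.paths.gz file.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2017-30/segments/EXAMPLE/warc/EXAMPLE-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2017-30/segments/EXAMPLE/warc/EXAMPLE-00001.warc.gz\n"
)

for url in expand_paths(read_paths_gz(sample), HTTP_PREFIX):
    print(url)
```

Passing `S3_PREFIX` instead of `HTTP_PREFIX` yields the S3 form of the same paths.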

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-30/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.
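
For readers who want to query the index programmatically, here is a hedged sketch that only builds a query URL and parses a line-delimited JSON response. The `-index` endpoint suffix and the `url` and `output` parameters are assumptions recalled from the Index Server API and should be checked against that documentation; the helper names and the sample response line are hypothetical.

```python
import json
from urllib.parse import urlencode

INDEX_SERVER = "http://index.commoncrawl.org/"


def build_index_query(crawl_id, url_pattern):
    """Build a query URL for one crawl (endpoint form is an assumption)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse a response that carries one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


print(build_index_query("CC-MAIN-2017-30", "commoncrawl.org"))

# Hypothetical response line; real records may carry different fields.
records = parse_index_lines('{"url": "http://commoncrawl.org/", "status": "200"}\n')
print(records[0]["url"])
```

The sketch deliberately performs no network request, so it can be adapted to whichever HTTP client is already in use.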

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

June 2017 Crawl Archive Now Available

The crawl archive for June 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-26/. It contains 3.16 billion+ web pages and over 260 TiB of uncompressed content.

To improve coverage and freshness we used the top 40 million ranked hosts from the Feb/Mar/Apr 2017 webgraph data set and added almost 800 million new URLs (not contained in any crawl archive before), of which:

  • 500 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 40 million hosts;
  • 300 million URLs are a random sample extracted from sitemaps (if provided by any of these 40 million hosts).

About 33% of the crawl archive’s 3.16 billion URLs overlap with the preceding May 2017 crawl.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

 | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-26/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-26/warc.paths.gz | 71839 | 63.33
WAT files | CC-MAIN-2017-26/wat.paths.gz | 71840 | 21.06
WET files | CC-MAIN-2017-26/wet.paths.gz | 71839 | 9.44
Robots.txt files | CC-MAIN-2017-26/robotstxt.paths.gz | 71840 | 0.13
Non-200 responses files | CC-MAIN-2017-26/non200responses.paths.gz | 71840 | 1.77
URL index files | CC-MAIN-2017-26/cc-index.paths.gz | 302 | 0.24

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-26/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

May 2017 Crawl Archive Now Available

The crawl archive for May 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-22/. It contains 2.96 billion+ web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness we used the top 25 million ranked hosts from the Feb/Mar/Apr 2017 webgraph data set and added about 500 million new URLs (not contained in any crawl archive before), of which:

  • 330 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 25 million hosts;
  • 160 million URLs are a random sample extracted from sitemaps (if provided by any of these 25 million hosts).

About 40% of the crawl archive’s 2.96 billion URLs overlap with the preceding April 2017 crawl.

The following changes have been made to WARC (also WAT and WET) files:

  • the timestamp in WARC filenames now indicates the capture time (fetch time) of the WARC content (see details)
  • WARC files and the URL index now contain the detected MIME type (based on the actual content) in addition to the “Content-Type” sent in the HTTP response (details).

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

 | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-22/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-22/warc.paths.gz | 56788 | 57.98
WAT files | CC-MAIN-2017-22/wat.paths.gz | 56788 | 19.87
WET files | CC-MAIN-2017-22/wet.paths.gz | 56788 | 8.95
Robots.txt files | CC-MAIN-2017-22/robotstxt.paths.gz | 56788 | 0.11
Non-200 responses | CC-MAIN-2017-22/non200responses.paths.gz | 56788 | 1.38

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-22/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Common Crawl’s First In-House Web Graph

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

The development of this graph produced the following results:

  • a ranked list of hosts to expand the crawl frontier;
  • pages ranked by Harmonic Centrality with less influence from spam, among other attributes (for comparison we include PageRank);
  • the template/process for Common Crawl to produce graphs and page rankings at regular intervals.

We produced this graph, and intend to produce similar graphs going forward, because the Common Crawl community has expressed a strong interest in using Common Crawl data for graph processing, particularly with respect to:

*Please note: the graph includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to from a link on a crawled page. Seventeen percent (65 million) of the hosts represented have been crawled in one of the three monthly crawls. Thus, 320 million of the hosts represented in the graph are known only from links. (Host names are not wholly verified: host names that are obviously invalid are skipped; the others are not resolved in DNS.)

 

Extraction of links and construction of the graph

Links are taken from the WAT extracts, and we also included redirects from the WARC files of the redirect and 404 dataset. All types of links are included, even purely “technical” ones pointing to JavaScript libraries, web fonts, etc.

The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain. Node IDs are assigned sequentially to the node list sorted by reversed host name. This keeps links between hosts of the same domain or in the same country-code top-level domain close together and allows for an efficient delta-compression of edges.
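
The naming and numbering scheme can be illustrated with a short Python sketch. The function names are hypothetical, and starting the IDs at 0 is an assumption; the source only states that IDs are assigned sequentially over the sorted list.

```python
def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def assign_node_ids(hosts):
    """Sort the reversed host names and number them sequentially."""
    reversed_hosts = sorted({reverse_host(h) for h in hosts})
    return {name: node_id for node_id, name in enumerate(reversed_hosts)}


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain

ids = assign_node_ids([
    "www.subdomain.example.com",
    "example.com",
    "en.wikipedia.org",
    "www.example.org",
])
for name, node_id in ids.items():
    print(node_id, name)
```

Because the sort is on the reversed name, com.example and com.example.subdomain receive neighbouring IDs, which is what makes the ID differences along an edge small.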

The extraction is done in three steps:

  • links are extracted, reduced to host-level links and stored as pairs 〈reversed host from, rev. host to〉
  • host names are assigned to IDs and edges are represented as 〈from id, to id〉 pairs
  • ranks are computed.
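
The first two steps above can be sketched in plain Python on a handful of page-level links. This is not the cc-pyspark code: the function names and sample URLs are hypothetical, and collapsing duplicate host pairs into a set as well as numbering from 0 are assumptions.

```python
from urllib.parse import urlsplit


def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def host_level_links(url_pairs):
    """Step 1: reduce page links to pairs of reversed host names."""
    pairs = set()
    for src, dst in url_pairs:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host:
            pairs.add((reverse_host(src_host), reverse_host(dst_host)))
    return pairs


def to_id_edges(pairs):
    """Step 2: map host names to IDs and express edges as ID pairs."""
    names = sorted({name for pair in pairs for name in pair})
    ids = {name: i for i, name in enumerate(names)}
    edges = sorted((ids[src], ids[dst]) for src, dst in pairs)
    return ids, edges


links = [
    ("http://www.example.com/a", "https://en.wikipedia.org/wiki/X"),
    ("http://www.example.com/b", "https://en.wikipedia.org/wiki/Y"),
    ("http://blog.example.com/", "http://www.example.com/"),
]
ids, edges = to_id_edges(host_level_links(links))
print(ids)
print(edges)
```

The two links from www.example.com to en.wikipedia.org collapse into a single host-level edge, so three page links become two edges in this sample.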

The first two steps are done by Spark and Python; the code is part of the project cc-pyspark. To compute the rankings the webgraph is loaded into the WebGraph framework.
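
To make the ranking step more concrete, the sketch below computes exact harmonic centrality on a tiny graph of ID pairs, taking the centrality of a node as the sum of 1/d over all other nodes that can reach it, where d is the shortest-path distance. It is an illustration only and not the WebGraph framework's implementation: one breadth-first search per node would be far too slow for a graph of this size. The function name and the four-node sample graph are hypothetical.

```python
from collections import deque


def harmonic_centrality(edges, num_nodes):
    """Exact harmonic centrality via one backward BFS per node."""
    # Store predecessors so that a search from a node follows links backwards.
    incoming = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        incoming[dst].append(src)

    scores = []
    for target in range(num_nodes):
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in incoming[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores.append(total)
    return scores


# Hypothetical graph: nodes 0 and 1 link to 2, and 2 links to 3.
edges = [(0, 2), (1, 2), (2, 3)]
for node_id, score in enumerate(harmonic_centrality(edges, 4)):
    print(node_id, score)
```

In this sample, node 3 scores as high as node 2 even though only one host links to it directly, because nodes 0 and 1 still reach it in two hops and contribute 1/2 each.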

 

Hosts ranked by Harmonic Centrality and PageRank

We provide a list of nodes (host names) ranked by Harmonic Centrality and by PageRank.

You can download the ranks of all 385 million hosts. Below are the top 1000 hosts ranked by Harmonic Centrality.

Top 1000 hosts ranked by harmonic centrality

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
1 | 38039408 | 1 | .00775205 | com.facebook
2 | 34958084 | 3 | .00428814 | com.twitter
3 | 33540440 | 2 | .00540973 | com.googleapis.fonts
4 | 32887486 | 4 | .00286281 | com.youtube
5 | 31128886 | 7 | .00167971 | com.google.plus
6 | 30375812 | 6 | .00183995 | com.google
7 | 28535442 | 12 | .00101167 | com.linkedin
8 | 28304832 | 10 | .00118472 | com.instagram
9 | 28086834 | 8 | .00158847 | com.blogger
10 | 27583494 | 22 | .00055368 | com.pinterest
11 | 27218188 | 40 | .00029768 | org.wikipedia.en
12 | 27078504 | 15 | .00080352 | org.wordpress
13 | 26959586 | 29 | .00043823 | com.apple.itunes
14 | 26666682 | 26 | .00053618 | com.blogspot.bp.2
15 | 26645472 | 25 | .00053637 | com.blogspot.bp.4
16 | 26627026 | 68 | .00020569 | be.youtu
17 | 26593046 | 24 | .00053727 | com.blogspot.bp.3
18 | 26572996 | 20 | .00057575 | com.blogspot.bp.1
19 | 26522710 | 58 | .00022707 | com.amazon
20 | 26491596 | 33 | .00035486 | com.google.play
21 | 26469056 | 19 | .00059326 | com.google.maps
22 | 26467424 | 50 | .00024854 | com.vimeo
23 | 26443448 | 45 | .00027360 | com.flickr
24 | 26424044 | 5 | .00219270 | org.gmpg
25 | 26324636 | 37 | .00031984 | com.google.mail
26 | 26318060 | 43 | .00028895 | gl.goo
27 | 26286836 | 44 | .00028723 | com.github
28 | 26280188 | 65 | .00021446 | com.microsoft
29 | 26124690 | 79 | .00018142 | com.google.support
30 | 26103492 | 32 | .00043580 | com.adobe
31 | 26085138 | 100 | .00016162 | ly.bit
32 | 26046002 | 98 | .00016696 | com.google.docs
33 | 26023404 | 108 | .00013636 | org.w3
34 | 26014732 | 152 | .00010471 | com.facebook.developers
35 | 25949012 | 30 | .00043749 | me.wp
36 | 25905092 | 222 | .00007208 | com.nytimes
37 | 25845124 | 72 | .00019664 | com.google.sites
38 | 25833394 | 186 | .00008012 | com.facebook.m
39 | 25790176 | 138 | .00011458 | com.weibo
40 | 25732096 | 147 | .00010760 | com.apple
41 | 25715634 | 39 | .00029787 | com.paypal
42 | 25692290 | 38 | .00030937 | co.t
43 | 25689954 | 407 | .00004073 | com.huffingtonpost
44 | 25679134 | 115 | .00013061 | org.creativecommons
45 | 25657428 | 188 | .00007975 | com.facebook.apps
46 | 25609926 | 11 | .00112239 | com.blogblog.resources
47 | 25583454 | 414 | .00004019 | com.forbes
48 | 25546588 | 131 | .00012114 | com.imgur
49 | 25535844 | 416 | .00004002 | net.slideshare
50 | 25508504 | 669 | .00002026 | com.mashable
51 | 25503482 | 454 | .00003445 | com.tinyurl
52 | 25498396 | 52 | .00024415 | com.etsy
53 | 25478394 | 659 | .00002054 | com.businessinsider
54 | 25475376 | 201 | .00007711 | com.google.drive
55 | 25475334 | 27 | .00050381 | com.wordpress
56 | 25458606 | 481 | .00003153 | com.washingtonpost
57 | 25411174 | 189 | .00007959 | com.myspace
58 | 25387732 | 146 | .00010823 | com.medium
59 | 25383148 | 417 | .00003969 | uk.co.bbc
60 | 25381774 | 415 | .00004003 | com.imdb
61 | 25381672 | 603 | .00002224 | com.wired
62 | 25361154 | 572 | .00002431 | com.techcrunch
63 | 25349038 | 428 | .00003804 | com.bing
64 | 25347080 | 183 | .00008131 | com.google.groups
65 | 25339734 | 807 | .00001619 | com.time
66 | 25325246 | 473 | .00003265 | com.theguardian
67 | 25313804 | 447 | .00003533 | uk.co.amazon
68 | 25313562 | 307 | .00006529 | com.reddit
69 | 25309096 | 162 | .00009843 | com.feedburner.feeds
70 | 25281640 | 306 | .00006533 | com.tumblr
71 | 25275352 | 190 | .00007951 | com.soundcloud
72 | 25266250 | 694 | .00001917 | uk.co.dailymail
73 | 25255004 | 570 | .00002446 | com.surveymonkey
74 | 25252200 | 181 | .00008142 | org.archive.web
75 | 25250540 | 76 | .00018708 | com.eventbrite
76 | 25236658 | 171 | .00008740 | com.google.developers
77 | 25235286 | 59 | .00022308 | com.vimeo.player
78 | 25226754 | 613 | .00002197 | org.mozilla.addons
79 | 25220086 | 777 | .00001695 | com.wsj.online
80 | 25203382 | 316 | .00006277 | com.yahoo
81 | 25192820 | 1356 | .00001042 | com.wsj.blogs
82 | 25177614 | 336 | .00005412 | com.issuu
83 | 25174772 | 828 | .00001586 | com.latimes
84 | 25173482 | 66 | .00021367 | com.vk
85 | 25164782 | 628 | .00002133 | com.cnn
86 | 25163448 | 396 | .00004196 | com.dropbox
87 | 25154430 | 588 | .00002310 | me.about
88 | 25149472 | 698 | .00001900 | org.npr
89 | 25121032 | 504 | .00002963 | com.meetup
90 | 25105294 | 112 | .00013267 | com.gravatar
91 | 25103488 | 1094 | .00001268 | com.theatlantic
92 | 25100710 | 652 | .00002072 | com.gmail
93 | 25099516 | 528 | .00002776 | org.wikimedia.upload
94 | 25093888 | 119 | .00012789 | com.imgur.i
95 | 25090464 | 215 | .00007402 | me.fb
96 | 25089908 | 1156 | .00001216 | com.venturebeat
97 | 25084918 | 391 | .00004336 | com.google.picasaweb
98 | 25083712 | 526 | .00002804 | com.dropboxusercontent.dl
99 | 25081758 | 1083 | .00001277 | com.ted
100 | 25068848 | 740 | .00001773 | com.reuters
101 | 25067392 | 1439 | .00000979 | com.economist
102 | 25060080 | 971 | .00001483 | com.adweek
103 | 25040748 | 14 | .00084868 | com.macromedia.download
104 | 25036148 | 1619 | .00000895 | com.thenextweb
105 | 25029640 | 681 | .00001990 | com.goodreads
106 | 25025954 | 1168 | .00001199 | com.cnn.money
107 | 25024186 | 476 | .00003199 | com.google.translate
108 | 25022650 | 530 | .00002745 | com.dailymotion
109 | 25022502 | 170 | .00008762 | com.facebook.fr-fr
110 | 25012510 | 713 | .00001855 | gov.whitehouse
11124989682501.00002977gov.nih.nlm.ncbi
1122498434480.00018111com.twitter.mobile
113249653741457.00000967ca.cbc
11424959346718.00001832org.archive
11524943750216.00007377com.facebook.es-la
116249367721053.00001316org.pbs
11724936464219.00007270com.microsoft.windows
11824931290804.00001628com.scribd
119249308461067.00001298com.cnbc
120249307222132.00000684com.youtube.m
121249223601299.00001085com.buzzfeed
12224915482426.00003817org.wikimedia.commons
123249143681522.00000924com.newyorker
12424912590600.00002235com.microsoft.msdn
12524906018671.00002023com.geocities
126249006242899.00000490net.boingboing
127248888381318.00001070com.gizmodo
12824878576355.00005095com.netvibes
12924869944826.00001589com.prnewswire
130248648301824.00000811com.examiner
131248540061340.00001059com.engadget
1322485297854.00024174net.sourceforge
133248479881951.00000758com.storify
13424846402598.00002245com.stackoverflow
13524845912955.00001522uk.co.guardian
136248439041164.00001206uk.co.independent
13724836756168.00009144com.google.code
138248356421080.00001280com.nature
13924830762782.00001688com.bizjournals
1402482868242.00029373com.twimg.pbs
14124825094141.00011273com.google.feedburner
14224823614185.00008036com.macromedia
14324821860151.00010505org.mozilla
14424821036212.00007504com.facebook.web
145248200421785.00000829com.pcworld
14624815138589.00002307com.ebay
14724808886196.00007819com.qq.t
14824802454159.00010021com.amazonaws.s3
149247873041360.00001039com.foxnews
150247831681482.00000940com.slate
15124781108432.00003772net.behance
15224779900648.00002090com.ibm
153247748722022.00000727com.arstechnica
154247716082715.00000522com.indiatimes.timesofindia
155247693802004.00000732au.net.abc
156247687021660.00000885com.marketwatch
15724768442424.00003843de.amazon
15824768200472.00003274com.google.feedproxy
15924756046221.00007213com.facebook.en-gb
1602475310297.00017011com.facebook.l
16124753098373.00004743com.digg
162247514801407.00001004com.ft
16324747080356.00005021com.microsoft.support
164247469201621.00000895com.quora
165247455001929.00000769com.gigaom
166247418681728.00000861com.sfgate
167247401501121.00001241tv.ustream
168247273861454.00000968com.chicagotribune
169247260661432.00000981com.wikihow
17024719856174.00008585com.messenger
17124713146136.00011598com.istockphoto
17224711160344.00005293com.stumbleupon
173247066921055.00001315uk.co.bbc.news
174247056982007.00000732com.boston
175247022882028.00000723com.searchengineland
176246998041193.00001154com.cafepress
17724699314869.00001553fm.last
1782469125817.00069783com.google.accounts
17924686178609.00002203com.usatoday
180246823021538.00000919com.indiegogo
18124659484997.00001423com.google.books
18224658388963.00001506com.yahoo.finance
18324655636637.00002121com.yahoo.groups
18424652858452.00003483com.google.news
18524651408349.00005207com.googleusercontent.lh4
18624648530339.00005363jp.co.amazon
1872464673646.00026535com.wix
18824642202522.00002841to.amzn
189246386843434.00000408com.nationalgeographic
190246368461070.00001295org.mozilla.developer
191246292121838.00000805com.businessweek
192246261241186.00001167com.dropbox.dl
193246260581200.00001149com.fortune
194246249781812.00000816com.mtv
195246240802074.00000707com.go.espn
19624619310142.00011187com.facebook.de-de
197246172941492.00000936com.gofundme
19824615060517.00002905uk.gov
199246100441036.00001338com.cargocollective
200246093181059.00001310com.zazzle
20124609048462.00003391com.nbcnews
202246077041398.00001010ly.ow
203246011542048.00000714com.politico
204245974522010.00000731com.cnet.news
205245936142482.00000576au.com.smh
20624592606743.00001767com.kickstarter
20724589922105.00014729org.w3.validator
20824583110758.00001739ca.google
20924571000687.00001965com.delicious
210245685761376.00001028com.yahoo.news
211245648441444.00000975com.prweb
212245631081486.00000937com.technologyreview
213245622162925.00000486com.csmonitor
214245567061047.00001324com.go.abcnews
215245562603095.00000455com.merriam-webster
21624555618645.00002095com.spotify.open
217245554561153.00001218com.zdnet
218245536261052.00001319com.wiley.onlinelibrary
219245535522208.00000661com.yahoo.sports
220245483502209.00000661com.nymag
221245480341865.00000789net.researchgate
222245437841760.00000841com.cnn.edition
223245402161862.00000791com.angelfire
224245386845476.00000249com.thenation
2252453803487.00017752com.wp.i1
22624537674658.00002055uk.co.telegraph
22724536988366.00004837uk.co.google
228245346541151.00001221com.entrepreneur
2292453158034.00033395com.twitter.blog
23024531188372.00004753com.tripadvisor
231245287342999.00000473com.thedailybeast
232245273481099.00001264fr.amazon
233245259181286.00001099gov.nps
234245229101215.00001137tv.twitch
23524522716213.00007442com.facebook.pt-br
236245116202409.00000597uk.co.theregister
237245098821930.00000769com.prezi
238245095721333.00001065org.change
23924508852229.00007108com.google.chrome
24024507186368.00004829com.apple.support
2412450549867.00021293com.addthis
24224504130567.00002465com.google.video
24324498714345.00005293de.google
244244926704843.00000285au.com.theage
245244898282894.00000491com.salon
246244848261873.00000784org.arxiv
24724482800691.00001931org.wikipedia.fr
248244823101346.00001054com.microsoft.office
24924482184198.00007767jp.ameblo
250244765103078.00000459com.xkcd
251244747901694.00000872com.pcmag
252244745561459.00000965gov.nasa
253244739941969.00000747com.mixcloud
254244709586235.00000222com.reuters.blogs
255244693541649.00000886com.feedburner.feeds2
256244682461232.00001129com.cnet
25724463680204.00007689eu.europa.ec
258244552285903.00000232com.laughingsquid
259244543121324.00001067com.fastcompany
260244515445791.00000235com.forbes.blogs
261244509503224.00000442com.vox
262244494881088.00001270com.reverbnation
263244448141518.00000926ca.amazon
26424442382324.00005879com.weebly
265244421501747.00000847com.blogspot.googleblog
266244406343692.00000376com.google.images
267244398823248.00000437com.billboard
26824438356347.00005214com.googleusercontent.lh5
26924437560331.00005727com.yelp
270244341442445.00000585com.google.productforums
27124430074218.00007273com.facebook.business
27224429478489.00003099com.windowsphone
27324428674226.00007160me.m
274244282941443.00000975com.newsweek
27524426648206.00007652com.facebook.es-es
276244261884379.00000317com.theonion
277244256941499.00000935it.scoop
278244251103838.00000358com.pandora
27924423972546.00002567org.wikipedia.es
28024423302608.00002209com.bloomberg
2812442022257.00023139com.twitter.support
282244191781797.00000822com.adage
28324418438104.00014890com.adobe.get
284244174642056.00000711com.walmart
285244124422704.00000524com.rollingstone
28624409082223.00007206com.facebook.id-id
28724408340332.00005712com.deviantart
28824407960749.00001751jp.ne.hatena.d
289244078102058.00000711com.variety
290244048581635.00000891com.webmd
291244041342590.00000551com.thehill
292244035001784.00000829com.adobe.blogs
293244017702377.00000600com.usnews
294243965862450.00000583me.fb.on
29524396528764.00001718com.wsj
296243923602414.00000595com.bleacherreport
29724390596459.00003408com.technorati
298243888582251.00000646com.shutterstock
299243855302357.00000608com.qz
30024385054193.00007869com.facebook.it-it
301243843102420.00000594org.sciencemag
302243833585654.00000242com.esquire
303243831801198.00001150au.com.google
30424380792985.00001445com.foursquare
305243799162339.00000612edu.stanford
30624379554539.00002638jp.livedoor.blog
307243778481309.00001078com.theverge
3082437442213402.00000109com.hackaday
309243713321367.00001029co.vine
310243681501452.00000969com.msn.msnbc
311243679126479.00000213com.ted.blog
312243659963719.00000372gd.is
313243653184181.00000331com.vice
314243651062983.00000476com.nbc
31524363772679.00001994gov.cdc
31624363062191.00007939com.xing
317243622542891.00000492com.scientificamerican
318243617361069.00001297com.cbsnews
31924361436413.00004029us.icio.del
320243594706886.00000201com.scienceblogs
321243587443043.00000466com.microsoft.research
322243564203502.00000398com.bestbuy
32324356010988.00001441com.bbc
324243548865002.00000275com.gawker
325243528783988.00000346com.startribune
326243490142078.00000706fr.lemonde
327243480744191.00000330com.allrecipes
328243462745087.00000270com.space
329243447301878.00000783com.smashingmagazine
330243440185668.00000241com.treehugger
33124342878747.00001758es.google
332243408442655.00000535uk.co.huffingtonpost
333243359808358.00000164com.techdirt
334243340661068.00001297org.wikipedia
335243339344872.00000284com.nytimes.blogs.well
336243334361313.00001075br.com.google
33724332660542.00002617com.timeout
3382433253818.00063404com.googleusercontent.lh3
339243317382088.00000699com.redbubble
340243308043878.00000353com.miamiherald
341243299102057.00000711com.msdn.blogs
342243296883651.00000381com.refinery29
343243286282118.00000688com.ign
34424328504310.00006387com.livejournal
345243279564641.00000298com.panoramio
346243247182281.00000634edu.mit.web
347243244024910.00000281com.answers
34824323460980.00001451com.apple.developer
349243211422632.00000542com.apple.phobos
35024320832974.00001466com.example
35124318370155.00010226com.polyvore
352243148741003.00001410com.marriott
35324314410593.00002278com.dribbble
354243142923608.00000386org.greenpeace
35524314094667.00002032com.sxsw
356243109742928.00000485com.newscientist
35724310954205.00007659com.facebook.nl-nl
358243101765470.00000250com.dreamstime
359243096664235.00000326com.chronicle
36024308702480.00003159net.php
361243074645030.00000273org.moma
362243035367255.00000191org.grist
36324303532207.00007629com.facebook.pl-pl
364243034542426.00000592com.ehow
3652430249856.00023600com.wp.i2
3662430118828.00045486com.urbandictionary
367242998681190.00001160com.fb
36824297602246.00006938org.cwa-union
36924297130360.00004944com.disqus
37024290242441.00003574com.alexa
371242883441991.00000736com.lifehacker
372242873422687.00000528gov.fws
373242873142416.00000594uk.co.mirror
374242855944702.00000294com.rottentomatoes
37524283856722.00001816com.bitly
376242838562850.00000499gov.archives
377242808364042.00000343com.vogue
378242805804998.00000276com.patheos
379242769365141.00000268com.snopes
380242753383214.00000443com.zimbio
381242736449247.00000147com.infowars
382242710983246.00000438com.technet.blogs
383242679182140.00000683com.hubspot.blog
384242656363639.00000382com.marthastewart
38524265482235.00007000com.facebook.pt-pt
38624264716721.00001818com.salesforce
38724261210736.00001784com.nwsource.seattletimes
388242596345424.00000252com.gq
389242593345525.00000247uk.org.tate
39024255346614.00002189com.orkut
391242548383037.00000467com.gallup
392242534885260.00000261com.oregonlive
39324253108271.00006760com.facebook.zh-tw
394242516761023.00001376com.wunderground
395242514524356.00000318com.mlb.mlb
396242506824493.00000307com.motherjones
397242500281084.00001272com.inc
398242452121766.00000838com.target
399242450443265.00000435com.google.profiles
400242431066758.00000204com.sheknows
401242428964403.00000315li.paper
402242421001931.00000768com.lulu
403242408986136.00000225com.petapixel
40424240804154.00010339net.fbcdn.xx.scontent
405242404043693.00000376au.com.news
406242386466525.00000211com.ndtv
407242377061983.00000740gov.nih.nlm
4082423582610370.00000131com.cnn.blogs.politicalticker
40924234434352.00005179org.gnu
4102423339853.00024219us.peeep
411242324924365.00000317org.thinkprogress
412242324022681.00000529com.nba
41324230752559.00002517com.android.market
414242275129722.00000140com.kodak
415242244484797.00000289edu.brookings
416242237923583.00000389com.css-tricks
417242195204692.00000295com.latimes.latimesblogs
418242181627020.00000197com.dailykos
419242179222643.00000540com.popsugar
420242162024721.00000293com.rt
421242145641249.00001102in.co.google
42224210576263.00006836com.facebook.sv-se
423242055202501.00000571com.nfl
42424205488863.00001556org.doi.dx
425242035627641.00000181com.care2
426241990824847.00000285com.plurk
4272419900214486.00000101com.makezine.blog
42824194726519.00002888com.mozilla
42924192646642.00002103com.barnesandnoble
430241921844227.00000326org.raspberrypi
431241876025989.00000230com.jezebel
432241870161145.00001231org.python
433241844082410.00000596com.psychologytoday
434241832805080.00000271com.mediabistro
435241824283657.00000380com.instructables
436241818064487.00000308com.baltimoresun
437241796861354.00001044com.google.scholar
43824178630261.00006882net.akamaihd.fbcdn-sphotos-a-a
4392417861223.00054639com.bootstrapcdn.maxcdn
440241784721234.00001124com.linkedin.ca
441241779286438.00000214com.dezeen
442241768542474.00000578com.people
443241751481002.00001411com.mediafire
444241743522470.00000578com.indeed
445241719204428.00000312net.comcast.home
446241714282657.00000535com.readwriteweb
447241684783905.00000349com.macworld
448241676301481.00000940com.box
449241652563575.00000390es.elmundo
45024165178746.00001760com.microsoft.technet
451241637381844.00000801com.500px
452241622768600.00000159com.consumerist
453241611108423.00000163com.uproxx
4542416003613745.00000107com.dawn
455241574682049.00000714com.sciencedaily
456241571227100.00000195org.alternet
4572415672412559.00000117com.chicagonow
45824156126959.00001516com.photobucket
459241546886392.00000216com.designboom
460241537803864.00000355com.blurb
461241518304664.00000296com.weheartit
46224148122502.00002977com.opera
463241443021063.00001300gov.epa
4642414313811507.00000127com.wbir
465241406464335.00000319com.foreignpolicy
46624139430458.00003416org.ietf
467241362524034.00000343com.nikkei
4682413211031.00043664com.statcounter
46924131460237.00006978com.facebook.th-th
470241308566366.00000217com.geekwire
471241262661625.00000893com.linkedin.in
472241241588222.00000167com.appleinsider
473241237307141.00000194com.avclub
47424121082252.00006926net.akamaihd.fbcdn-profile-a
4752412008215904.00000093com.wonkette
476241186943871.00000354com.chron
47724114758977.00001461com.houzz
47824114164269.00006781com.facebook.tr-tr
47924112388623.00002160gov.ftc
480241114508902.00000153com.reason
481241097103459.00000403tv.blip
482241068461484.00000938com.google.photos
48324106650543.00002611com.oracle
484241053707602.00000182com.pastemagazine
485241033281483.00000938gov.copyright
486241007863055.00000464org.aclu
487240947583076.00000460com.philly
488240932521685.00000878com.squareup
489240893561086.00001272com.samsung
490240879783118.00000451com.me.web
491240876601231.00001129com.cdbaby
492240875646720.00000205com.deseretnews
493240831766367.00000217com.io9
49424081856402.00004145org.wikipedia.de
4952408112411338.00000128org.peta
4962408090812982.00000113com.hongkiat
497240806804684.00000295com.tmz
49824077924818.00001605com.amazon.aws
499240772568766.00000156org.pri
500240741862270.00000638com.oreilly
501240741102291.00000630com.freewebs
502240740881235.00001123org.wikipedia.it
503240733903848.00000356com.azcentral
504240731926245.00000221com.mentalfloss
50524069232523.00002830fr.google
5062406884817311.00000085com.tor
507240671322904.00000490org.worldbank
508240670541762.00000841de.heise
509240664648159.00000169com.liveleak
510240660388772.00000155com.gothamist
511240647423396.00000415com.latimes.articles
512240646269244.00000147com.extremetech
513240616403705.00000374com.yahoo.answers
5142406143032734.00000046com.wreg
515240611668511.00000161com.nybooks
516240607805379.00000254com.pbase
517240598345049.00000272edu.nap
518240585546572.00000210com.cnn.sportsillustrated
519240576566644.00000208com.grantland
520240565721639.00000887gov.loc
521240558164418.00000313org.nobelprize
522240544944260.00000324com.eonline
523240533807372.00000189com.haaretz
524240531225199.00000264com.bhphotovideo
525240529644827.00000287com.esri
526240521529995.00000136org.commondreams
527240521485161.00000266com.glamour
528240514561460.00000960com.fineartamerica
529240499806715.00000205edu.uchicago.press
530240484329103.00000149gov.nasa.science
5312404689435700.00000042com.bossip
5322404663219728.00000075com.neatorama
53324044690720.00001819org.acm
534240444624081.00000340org.weforum
535240443081897.00000777it.amazon
536240440061959.00000752me.flavors
537240433949459.00000143com.howstuffworks
538240418726466.00000213com.9to5mac
539240401424517.00000306com.uber
54024039284680.00001992com.bloglovin
5412403788216205.00000091com.highsnobiety
542240373164560.00000303com.audible
543240364907371.00000189com.complex
5442403617615916.00000093com.time.swampland
545240346544025.00000344com.lonelyplanet
5462403393411839.00000123com.dilbert
547240334363125.00000450com.deezer
548240332524168.00000333com.lynda
549240332229648.00000141com.discovermagazine.blogs
550240308404421.00000313com.cbs
551240306163761.00000367net.daringfireball
552240298601933.00000766com.patreon
553240271647181.00000194com.deadspin
554240236928065.00000171com.bostonherald
555240234346210.00000223com.cosmopolitan
55624022760829.00001585jp.ne.goo.blog
5572402138418750.00000079com.hotair
558240212386632.00000208com.librarything
559240210146000.00000230cc.arduino
560240191026652.00000207com.logitech
561240173823257.00000437com.asahi
562240154144050.00000342com.nationalgeographic.news
5632401532220178.00000073com.matadornetwork
564240152704638.00000298com.observer
565240122145943.00000231com.copyblogger
566240114445514.00000247com.seekingalpha
56724010644227.00007141mp.j
56824010056236.00006995com.xiami
569240097863429.00000408com.elpais
570240080402638.00000541com.ew
571240077307976.00000173com.bonappetit
572240069444514.00000306org.lds
573240067983385.00000416com.cbssports
574240062427137.00000194com.cbslocal.newyork
5752400479411215.00000129com.modelmayhem
57624004760982.00001449eu.europa
57724003960731.00001790com.google.hangouts
578240035506718.00000205com.vancouversun
579240023207253.00000191com.talkingpointsmemo
580240013521975.00000743com.google.spreadsheets
581240009842570.00000556cn.com.sina.blog
582240004023222.00000443com.ravelry
583239994401795.00000824com.amazon.astore
584239979021755.00000843org.eff
58523997856573.00002431com.adobe.helpx
586239974887093.00000195uk.ac.vam
587239972204633.00000299com.vice.motherboard
5882399603616595.00000089com.thesmokinggun
5892399400011829.00000124com.imore
59023993422101.00016094com.tinypic
59123993226545.00002574com.msn
592239931624385.00000316ca.globalnews
593239931188243.00000167com.discovery.dsc
594239927025617.00000244com.pitchfork
595239914245858.00000233com.blogspot.youtube-global
596239902888750.00000156com.realclearpolitics
59723987906678.00001996it.google
598239878907105.00000195com.scmp
599239861882777.00000508jp.or.nhk
60023982278686.00001969com.hubpages
601239818722779.00000508gov.uspto
602239816981374.00001028com.timeanddate
603239814388335.00000165com.christianitytoday
604239801684066.00000341net.faz
605239788987230.00000192com.theweek
6062397790028318.00000054com.gottabemobile
607239773827199.00000193org.plos.blogs
608239772044627.00000299com.howtogeek
609239754942009.00000731com.getpocket
610239731825047.00000272com.kotaku
611239727043324.00000425cc.tiny
6122397104612039.00000122com.perezhilton
6132396942613727.00000107com.mcclatchydc
61423968350651.00002075com.aol
615239681127323.00000190com.lmgtfy
61623966254822.00001598com.businesswire
617239660924345.00000319org.ibiblio
618239654783362.00000420org.unicef
619239654422417.00000594com.hollywoodreporter
620239605501041.00001333int.who
621239595801026.00001373com.android.developer
622239572945198.00000264edu.cmu
623239569306817.00000203com.sbnation
624239565487619.00000182com.marvel
625239564966517.00000211edu.harvard.law.blogs
626239520023640.00000382com.fiverr
627239505542718.00000520gov.dhs
628239499642653.00000535com.smashwords
62923948856209.00007583com.facebook.ja-jp
630239487605633.00000243com.stagram.web
6312394868014134.00000104com.nytimes.blogs.thelede
632239486766768.00000204com.nme
633239466664902.00000281com.hbo
6342394627213062.00000112org.counterpunch
6352394603811613.00000126com.cultofmac
636239439941543.00000913com.evernote
63723943942270.00006766com.360doc
638239422289626.00000141com.cracked
639239408202486.00000575com.blogtalkradio
64023939540357.00004971com.gravatar.en
64123937964389.00004423org.icann
64223937816684.00001977com.ggpht.lh3
643239377868491.00000161com.teenvogue
6442393734216658.00000088com.flickriver
645239364804027.00000344com.smithsonianmag
646239350705480.00000249com.codeproject
64723934200260.00006883net.fbcdn.ak.static
648239341021816.00000815gov.census
64923933432657.00002057com.linkedin.uk
65023932600577.00002400com.w3schools
651239317404463.00000310com.mac.homepage
6522392912210155.00000134com.rawstory
653239276623404.00000414com.squidoo
654239247462044.00000715com.dell
65523922980488.00003104com.4shared
6562392281414131.00000104org.mediamatters
657239208028760.00000156com.parents
658239205267287.00000191com.opera.my
659239204828124.00000169org.ieee.spectrum
660239201501038.00001337jp.geocities
661239152446512.00000212com.townhall
66223913292399.00004167org.mozilla.support
663239127922006.00000732org.oecd
664239118041048.00001324org.eclipse
6652391158420701.00000072com.hellogiggles
666239102988868.00000154com.clarin
66723909680827.00001589com.symantec
668239090407558.00000184org.aaas
669239087522311.00000621com.justgiving
670239084444671.00000296org.coursera
671239064821938.00000763com.nydailynews
67223905082343.00005302com.googleusercontent.lh6
67323904092387.00004468com.soundcloud.w
674239022501027.00001368gov.irs
6752390223215133.00000097com.craveonline
676239021364011.00000345com.channel4
6772389991620101.00000074com.nature.blogs
678238972189313.00000146com.myspace.blog
679238955269205.00000147com.klout
680238942081412.00000996com.steampowered.store
6812389400210218.00000133com.boredpanda
68223893512707.00001860com.friendster
6832389350436.00032755com.godaddy
684238930522950.00000481com.amzn
6852389242413276.00000110ca.globalresearch
6862389129817990.00000082org.calacademy
687238907566671.00000207net.box
688238896364900.00000282com.fanpop
689238890945845.00000234com.datacenterknowledge
6902388719627479.00000056com.americanrhetoric
691238865565185.00000265com.threadless
692238849664246.00000325ms.1drv
6932388353010188.00000134com.barackobama
6942388309812688.00000116com.spin
6952388309213664.00000107com.yahoo.pipes
696238829568684.00000157com.comedycentral
697238828961066.00001299com.googleartproject
698238827882652.00000535com.computerworld
6992388186216912.00000087com.giantbomb
70023881530276.00006705com.weibo.vdisk
7012388152023165.00000064com.wattsupwiththat
702238791463942.00000347com.screencast
7032387763210235.00000133org.tvtropes
704238774704460.00000310com.megaupload
7052387676416116.00000092com.catholicnewsagency
706238767001476.00000947org.hbr
7072387655224017.00000062com.cnn.blogs.religion
70823874712161.00009861com.mailchimp
709238743682437.00000589com.alibaba
710238740602992.00000474com.ezinearticles
711238739443058.00000463uk.co.ebay
712238737421146.00001228org.un
713238730621710.00000869org.iso
71423867822999.00001415com.snapchat
715238677766315.00000219com.victoriassecret
7162386613613917.00000105com.washingtonian
7172386596025817.00000059com.humanevents
718238659381722.00000864com.newgrounds
719238648401677.00000882com.biblegateway
720238601262516.00000567com.friendfeed
7212385830418020.00000082com.moddb
7222385770827085.00000057com.singularityhub
723238545124265.00000324com.pixlr
724238536949675.00000140com.marieclaire
72523851816242.00006954com.facebook.ar-ar
726238511448505.00000161org.ams
727238511322494.00000572com.createspace
728238497981615.00000899com.ebay.stores
729238497901169.00001199com.sciencedirect
730238492385291.00000259com.tampabay
731238481644113.00000337com.ibtimes
732238476928114.00000170com.oreilly.radar
7332384672216808.00000088com.escapistmagazine
734238466901233.00001125org.seomoz
73523845474140.00011415com.ytimg.i
736238454381618.00000895com.net-a-porter
737238453121829.00000810com.cnet.download
7382384500813611.00000108org.brooklynmuseum
739238448268231.00000167net.fanfiction
7402384427015039.00000097com.flavorwire
741238434543472.00000402com.modcloth
7422384174030133.00000050org.jihadwatch
743238410661160.00001210com.weather
744238410004818.00000287to.gplus
745238409461537.00000919com.viadeo
746238406907902.00000174org.edutopia
747238405143327.00000425org.apa
748238394465982.00000230de.tagesschau
749238378202994.00000474me.paypal
7502383753018909.00000078edu.hawaii
75123837158576.00002411com.images-amazon.ecx
752238368262544.00000561gov.fbi
753238367405216.00000263com.manta
754238363066318.00000219uk.org.nationaltrust
755238349925593.00000244com.googleusercontent.webcache
756238349168158.00000169org.truth-out
757238348584989.00000276com.typepad.sethgodin
7582383343221393.00000069org.spectator
759238324567393.00000188com.mendeley
760238308083077.00000460tr.com.google
761238305882961.00000480org.cancer
762238300483361.00000421com.networkworld
7632382996814271.00000103com.topix
764238294705491.00000249com.starwars
765238291282735.00000517com.hulu
766238268527672.00000181com.discovery.news
767238264502600.00000550org.dmoz
768238256626630.00000208com.villagevoice
769238241605362.00000255com.dpreview
770238239265207.00000263edu.cmu.cs
7712382317615187.00000097com.dazeddigital
772238225083709.00000374org.mozilla.wiki
773238218301167.00001201gov.fda
774238208921433.00000981gov.justice
775238208263603.00000387gov.cia
77623820332439.00003639com.posterous
7772382032812812.00000114au.com.sbs
778238203247682.00000180com.gamasutra
779238198706665.00000207com.epicurious
780238192663771.00000365com.socialmediaexaminer
781238192109190.00000148org.sierraclub
782238192003366.00000420net.earthlink.home
783238181362005.00000732com.gartner
784238172462354.00000608com.theglobeandmail
785238171361336.00001063org.wikipedia.pt
786238165126963.00000198com.suntimes
787238163923390.00000416nl.xs4all
7882381469225030.00000060com.elephantjournal
789238122646672.00000207com.cntraveler
790238120861004.00001409com.linkedin.fr
791238119864753.00000291com.nationalreview
792238116461999.00000735com.thefreedictionary
79323811380280.00006674com.facebook.zh-cn
794238092644474.00000309uk.ac.ucl
795238064064613.00000299com.denverpost
79623806360231.00007094kr.flic
797238063227693.00000180com.instyle
7982380559223068.00000064edu.usra.lpi
7992380412210350.00000131com.scobleizer
800238040224182.00000331uk.co.metro
80123803192233.00007056jp.co.google
802238030184971.00000277com.nvidia
803238028703887.00000352com.irishtimes
804238020701287.00001099co.g
805238017726674.00000207edu.washington.depts
806238014187466.00000187com.tennessean
807238006582439.00000589com.hp
808238003223644.00000382com.aljazeera
8092380013816549.00000089com.wmagazine
81023798820579.00002383uk.co.eventbrite
81123798338421.00003858com.googleadservices
8122379780610117.00000134com.eater
813237973908292.00000166com.yahoo.movies
814237956406997.00000197com.bhg
8152379449826069.00000058org.fair
816237936162379.00000600com.bostonglobe
817237931366353.00000217com.autoblog
8182379282810025.00000136com.linuxjournal
8192379266621056.00000070com.thisiscolossal
8202379140215130.00000097com.radaronline
8212379117014444.00000101com.fourhourworkweek
822237904003416.00000412com.thestar
823237896308992.00000151gov.nasa.apod
824237891083558.00000391com.twitpic
8252378903885.00017941com.twitter.status
826237877507183.00000194com.financialexpress
827237871787036.00000197mx.com.eluniversal
828237871704196.00000330edu.princeton
8292378695412725.00000115edu.uvm
830237864421244.00001108com.skype
831237856201350.00001049com.flickr.static.farm3
8322378405019151.00000077com.slashfilm
833237831028462.00000162org.nypl
8342378131413888.00000106com.associatedcontent
835237806903787.00000363org.gutenberg
83623779958217.00007285org.bbb
837237798367685.00000180com.macrumors
8382377809413271.00000110com.theroot
839237765105072.00000271com.akamai
840237758803305.00000428au.com.theaustralian
8412377571413415.00000109com.factmag
8422377436818373.00000080com.marykay
843237740287646.00000181com.viddler
844237713081937.00000763com.android
845237702027311.00000191gov.loc.memory
846237696961765.00000840com.yahoo.search
847   23769628  1149   .00001223  com.feedburner
848   23769516  1142   .00001233  com.google.adwords
849   23769416  5410   .00000253  com.diigo
850   23768256  10007  .00000136  com.mattcutts
851   23767426  15826  .00000093  org.davidsuzuki
852   23766438  11933  .00000123  com.break
853   23766120  475    .00003242  org.drupal
854   23765766  33614  .00000045  com.animalnewyork
855   23765754  19037  .00000078  com.crooksandliars
856   23765726  1920   .00000772  com.steamcommunity
857   23765702  13001  .00000113  com.weeklystandard
858   23765452  10549  .00000130  com.tuaw
859   23764140  21194  .00000070  com.inthesetimes
860   23761976  5530   .00000247  org.hrc
861   23761956  1860   .00000793  com.networkedblogs
862   23760534  3114   .00000452  com.theknot
863   23760368  33353  .00000045  com.littlegreenfootballs
864   23760260  3635   .00000382  com.barnesandnoble.search
865   23759984  3656   .00000380  com.globo.g1
866   23758774  8577   .00000159  com.smittenkitchen
867   23758726  2079   .00000705  es.amazon
868   23758722  14721  .00000100  com.ktla
869   23758592  19539  .00000076  com.rediff
870   23758238  21443  .00000069  com.artofmanliness
871   23758220  17645  .00000083  org.whitney
872   23757664  9843   .00000138  com.menshealth
873   23757282  2211   .00000660  com.nypost
874   23756390  2958   .00000481  com.gstatic.t0
875   23756362  13499  .00000109  com.ffffound
876   23755804  9503   .00000143  com.inhabitat
877   23755314  4301   .00000322  edu.columbia
878   23754390  10407  .00000131  com.hypebeast
879   23753534  4665   .00000296  com.thinkgeek
880   23753136  7586   .00000183  com.foodandwine
881   23752360  9750   .00000139  org.wikibooks.en
882   23750508  11492  .00000127  com.gocomics
883   23750416  153    .00010465  ru.yandex
884   23749606  21336  .00000069  edu.rochester
885   23749390  43018  .00000035  com.2dopeboyz
886   23747876  19010  .00000078  org.nycgovparks
887   23747726  17755  .00000083  com.justjared
888   23746300  15964  .00000092  com.blogspot.googlemobile
889   23745588  6710   .00000206  com.wwd
890   23745272  333    .00005701  org.fedoraproject
891   23742476  13660  .00000107  com.hollywoodlife
892   23742310  9994   .00000136  mil.navy
893   23741808  967    .00001490  com.staticflickr.farm8
894   23741366  20244  .00000073  com.coolhunting
895   23740522  17885  .00000082  com.magcloud
896   23739544  2770   .00000509  com.gettyimages
897   23737558  15787  .00000093  com.washingtonpost.articles
898   23737374  272    .00006752  net.akamaihd.fbstatic-a
899   23737334  3662   .00000380  uk.co.thesun
900   23735470  5484   .00000249  edu.yale
901   23734544  19149  .00000077  org.labnol
902   23734358  449    .00003503  nl.google
903   23733716  25945  .00000058  org.globalvoicesonline
904   23732498  2489   .00000574  dk.google
905   23732492  1345   .00001055  com.staticflickr.farm9
906   23732404  228    .00007126  com.facebook.da-dk
907   23732036  350    .00005193  us.imageshack
908   23731808  3550   .00000392  com.mercurynews
909   23731588  24915  .00000060  com.celebuzz
910   23731524  5769   .00000236  com.yahoo.groups.tech
911   23730844  5249   .00000261  int.esa
912   23730468  6716   .00000205  com.linkedin.blog
913   23730430  15982  .00000092  com.eurasiareview
914   23730204  163    .00009658  com.blogblog.img1
915   23729604  16528  .00000089  com.redstate
916   23728680  13070  .00000112  com.torrentfreak
917   23728206  23033  .00000065  com.movieweb
918   23727744  13113  .00000112  com.seroundtable
919   23726214  1733   .00000856  edu.cornell.law
920   23726080  3120   .00000451  com.nifty.homepage3
921   23725478  19732  .00000075  com.craphound
922   23724518  700    .00001890  com.ggpht.lh6
923   23723162  1737   .00000851  com.hotmail
924   23723128  31743  .00000048  org.thesocietypages
925   23722898  2015   .00000729  com.spotify.play
926   23722214  9125   .00000149  com.scientificamerican.blogs
927   23722022  3445   .00000405  com.digitaltrends
928   23721446  3659   .00000380  com.jamendo
929   23721156  2859   .00000498  com.netflix
930   23720582  15182  .00000097  com.stereogum
931   23719906  84     .00017961  com.twitter.business
932   23719456  180    .00008153  com.blogblog.img2
933   23719080  8438   .00000163  com.dailydot
934   23717794  4118   .00000337  edu.bu
935   23717000  4558   .00000303  com.zara
936   23715354  8040   .00000171  net.asp.weblogs
937   23714546  2453   .00000583  com.ebay.rover
938   23713182  8138   .00000169  com.marketingprofs
939   23712890  11959  .00000122  com.takepart
940   23711162  6443   .00000214  org.propublica
941   23708912  4866   .00000284  com.makeuseof
942   23706304  13298  .00000110  com.models
943   23705808  8228   .00000167  com.sportingnews
944   23705042  6501   .00000212  com.digitaljournal
945   23704958  5382   .00000254  com.active
946   23704410  8643   .00000158  ar.com.lanacion
947   23704224  11282  .00000129  com.ssrn
948   23704176  14358  .00000102  com.gazette
949   23704156  3007   .00000471  org.pewinternet
950   23703788  12315  .00000119  org.caringbridge
951   23702268  6669   .00000207  fm.ask
952   23701806  7727   .00000179  com.politifact
953   23701452  12920  .00000113  com.theoatmeal
954   23700692  77     .00018590  com.twitter.api
955   23700068  11585  .00000126  org.brainpickings
956   23699570  6414   .00000215  com.harpercollins
957   23699250  19689  .00000075  net.360cities
958   23699166  55038  .00000028  nz.co.sciblogs
959   23697458  4465   .00000310  com.starbucks
960   23697382  5868   .00000233  com.elle
961   23696812  28485  .00000054  com.listverse
962   23695400  564    .00002484  com.booking
963   23693668  2996   .00000473  com.dallasnews
964   23693160  3621   .00000384  com.pastebin
965   23692226  6156   .00000225  com.purevolume
966   23691954  1310   .00001077  com.amazon.smile
967   23691362  32716  .00000046  com.truthdig
968   23691286  5706   .00000240  com.knowyourmeme
969   23687582  25990  .00000058  com.babble.blogs
970   23686080  3355   .00000421  com.vanityfair
971   23685698  32746  .00000046  net.fubiz
972   23685224  562    .00002501  com.giphy
973   23684482  2113   .00000689  com.intel
974   23683854  3139   .00000447  com.livescience
975   23683060  13332  .00000110  uk.org.iwm
976   23682760  5190   .00000265  com.randomhouse
977   23681520  1191   .00001159  es.google.maps
978   23681176  37675  .00000040  com.tucsonweekly
979   23680142  9980   .00000136  com.gilt
980   23678464  2799   .00000503  com.gstatic.t2
981   23678408  8966   .00000152  org.thisamericanlife
982   23678164  22636  .00000066  uk.co.creativereview
983   23676348  5792   .00000235  com.microsofttranslator
984   23676190  1633   .00000891  gov.sec
985   23674500  17492  .00000084  com.penny-arcade
986   23673996  1210   .00001140  com.springer.link
987   23673626  2412   .00000596  com.redhat
988   23673126  19650  .00000075  org.newsbusters
989   23672884  238    .00006976  com.facebook.el-gr
990   23672818  8628   .00000158  com.heavy
991   23672144  9097   .00000150  com.globalpost
992   23671496  14074  .00000104  com.wisegeek
993   23670910  7992   .00000172  com.animoto
994   23670052  778    .00001694  com.naver.blog
995   23669322  8386   .00000164  com.time.techland
996   23669190  8652   .00000158  com.jalopnik
997   23668684  3562   .00000391  com.indiatimes.economictimes
998   23667394  474    .00003251  ru.vkontakte
999   23666552  1487   .00000937  com.msnbc
1000  23666094  16350  .00000090  com.rockpapershotgun

 

Data and download instructions

The host-level graph and the rankings are available on AWS S3 under the path

s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

Alternatively, you can use

https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

as a prefix to access the files over HTTP from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph

Size     File                  Description
2.72 GB  vertices.txt.gz       nodes ⟨id, rev host⟩
9.42 GB  edges.txt.gz          edges ⟨from_id, to_id⟩
4.51 GB  bvgraph.graph         graph in BVGraph format
0.22 GB  bvgraph.offsets
1 kB     bvgraph.properties
5.06 GB  bvgraph-t.graph       transpose of the graph (outlinks mapped to inlinks)
0.47 GB  bvgraph-t.offsets
1 kB     bvgraph-t.properties
1 kB     bvgraph.stats         WebGraph statistics
6.26 GB  ranks.txt.gz          harmonic centrality and pagerank
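
As a rough illustration of how the text files listed above might be read, here is a minimal Python sketch. It assumes that vertices.txt.gz and edges.txt.gz hold one tab-separated record per line (⟨id, rev host⟩ and ⟨from_id, to_id⟩); the tab delimiter is an assumption, and the helper names are hypothetical. The sample data is made up and compressed in memory, so nothing is downloaded.

```python
import gzip
import io

# HTTP prefix quoted above; file names come from the table of provided files.
PREFIX = ("https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/"
          "cc-main-2017-feb-mar-apr-hostgraph/")


def file_url(name):
    """Return the HTTP URL for one of the listed files."""
    return PREFIX + name


def unreverse_host(rev_host):
    """Turn a reversed host name ('uk.co.thesun') into 'thesun.co.uk'."""
    return ".".join(reversed(rev_host.split(".")))


def read_vertices(stream):
    """Yield (id, host) pairs; assumes 'id<TAB>rev_host' per line."""
    for line in stream:
        node_id, rev_host = line.rstrip("\n").split("\t")
        yield int(node_id), unreverse_host(rev_host)


def read_edges(stream):
    """Yield (from_id, to_id) pairs; assumes 'from_id<TAB>to_id' per line."""
    for line in stream:
        from_id, to_id = line.rstrip("\n").split("\t")
        yield int(from_id), int(to_id)


# Made-up sample records standing in for the real (multi-GB) files.
sample_vertices = gzip.compress(b"0\tcom.feedburner\n1\torg.drupal\n")
sample_edges = gzip.compress(b"0\t1\n1\t0\n")

with gzip.open(io.BytesIO(sample_vertices), "rt", encoding="utf-8") as f:
    vertices = dict(read_vertices(f))
with gzip.open(io.BytesIO(sample_edges), "rt", encoding="utf-8") as f:
    edges = list(read_edges(f))

for from_id, to_id in edges:
    print(vertices[from_id], "->", vertices[to_id])
```

For the real files, the same readers could be pointed at a locally downloaded copy opened with `gzip.open(path, "rt")`.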

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!

 

Credits

Thanks to

  • Web Data Commons, for their web graph data set and everything related.
  • Common Search; we first used their web graph to expand the crawler frontier, and Common Search’s cosr-back project was an important source for detail on processing our data with PySpark.
  • the authors of the WebGraph framework, whose software simplifies the computation of ranks.

April 2017 Crawl Archive Now Available

The crawl archive for April 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-17/. It contains 2.94 billion+ web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness we ranked all hosts found in the February and March 2017 crawls by Harmonic Centrality, and

  • added 390 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 16 million hosts;
  • used sitemaps (if provided by any of these 16 million hosts) to take a random sample and add a further 160 million URLs.

About 56% of the 2.94 billion URLs overlap with the preceding March 2017 crawl, with 550 million URLs not contained in any crawl archive before.
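
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch based on the rounded numbers quoted in this post; reading the 550 million new URLs as the sum of the two additions is an interpretation, not something the post states outright.

```python
# Rounded figures quoted in this post.
total_urls = 2.94e9      # pages in the April 2017 crawl
overlap_share = 0.56     # share also present in the March 2017 crawl
hop_urls = 390e6         # URLs added within 2 hops of home pages
sitemap_urls = 160e6     # URLs sampled from sitemaps

new_urls = hop_urls + sitemap_urls          # matches the 550 million quoted
overlap_urls = total_urls * overlap_share   # roughly 1.65 billion
not_in_march = total_urls - overlap_urls    # roughly 1.29 billion

print(f"new URLs:        {new_urls / 1e6:.0f} million")
print(f"overlap w/ March: {overlap_urls / 1e9:.2f} billion")
print(f"not in March:     {not_in_march / 1e9:.2f} billion")
```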

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                   File List                                 #Files  Total Size Compressed (TiB)
Segments           CC-MAIN-2017-17/segment.paths.gz          100
WARC files         CC-MAIN-2017-17/warc.paths.gz             64700   55.03
WAT files          CC-MAIN-2017-17/wat.paths.gz              64700   19.77
WET files          CC-MAIN-2017-17/wet.paths.gz              64700   8.95
Robots.txt files   CC-MAIN-2017-17/robotstxt.paths.gz        64700   0.11
Non-200 responses  CC-MAIN-2017-17/non200responses.paths.gz  64700   0.84
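
The prefixing step described above could look roughly like the following Python sketch. The two prefixes are the ones named in this post; the helper `expand_paths` and the two sample lines (including the `EXAMPLE` segment name) are hypothetical stand-ins for the contents of a downloaded listing such as warc.paths.gz.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(stream, prefix):
    """Prepend prefix to every non-empty line of an opened *.paths.gz listing."""
    return [prefix + line.strip() for line in stream if line.strip()]


# Hypothetical excerpt of a path listing, compressed in memory for the demo.
listing = gzip.compress(
    b"crawl-data/CC-MAIN-2017-17/segments/EXAMPLE/warc/part-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2017-17/segments/EXAMPLE/warc/part-00001.warc.gz\n"
)

with gzip.open(io.BytesIO(listing), "rt", encoding="utf-8") as f:
    http_urls = expand_paths(f, HTTP_PREFIX)
with gzip.open(io.BytesIO(listing), "rt", encoding="utf-8") as f:
    s3_urls = expand_paths(f, S3_PREFIX)

for url in http_urls:
    print(url)
```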

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-17/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.
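
A query against the URL index might be put together as in the sketch below. It assumes the index server accepts `url`, `output=json` and `limit` query parameters at a `CC-MAIN-2017-17-index` endpoint and answers with one JSON object per line; please check the Index Server API documentation before relying on this. The sample response line is invented for the demo, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumed query endpoint for this crawl's index.
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-17-index"


def build_query(url_pattern, limit=None):
    """Build a query URL asking for JSON records matching url_pattern."""
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return INDEX_ENDPOINT + "?" + urlencode(params)


def parse_response(body):
    """Parse a newline-delimited JSON response into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


query = build_query("commoncrawl.org", limit=5)
print(query)

# Invented response line; real records carry more fields.
sample_body = (
    '{"url": "http://commoncrawl.org/", "timestamp": "20170423000000", '
    '"filename": "crawl-data/CC-MAIN-2017-17/EXAMPLE.warc.gz", '
    '"offset": "0", "length": "1000"}\n'
)
records = parse_response(sample_body)
print(records[0]["filename"], records[0]["offset"], records[0]["length"])
```

The `filename`, `offset` and `length` of a record are what one would use to fetch a single capture from the corresponding WARC file with an HTTP range request.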

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March 2017 Crawl Archive Now Available

The crawl archive for March 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-13/. It contains 3.07 billion+ web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness we ranked all hosts found in the February 2017 crawl by Harmonic Centrality, and

  • added 600 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 8 million hosts;
  • used sitemaps (if provided by any of these 8 million hosts) to take a random sample and add a further 100 million URLs.
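
To make the "hops" notion concrete, here is a toy breadth-first search that collects every URL at most 2 links away from a home page. It is only a sketch over a small in-memory link map; the function name and the example URLs are hypothetical, and it says nothing about how the actual crawler is implemented.

```python
from collections import deque


def urls_within_hops(links, home_page, max_hops=2):
    """Return {url: hops} for all URLs at most max_hops links from home_page.

    links maps a URL to the list of URLs it links to.
    """
    hops = {home_page: 0}
    queue = deque([home_page])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Made-up link structure for a single host.
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/1"],
    "http://example.com/a/1": ["http://example.com/a/1/x"],
}

found = urls_within_hops(links, "http://example.com/")
for url, distance in found.items():
    print(distance, url)
```

In this example `http://example.com/a/1/x` is three links away from the home page and is therefore left out.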

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                   File List                                 #Files  Total Size Compressed (TiB)
Segments           CC-MAIN-2017-13/segment.paths.gz          100
WARC files         CC-MAIN-2017-13/warc.paths.gz             66500   60.74
WAT files          CC-MAIN-2017-13/wat.paths.gz              66500   20.86
WET files          CC-MAIN-2017-13/wet.paths.gz              66500   9.30
Robots.txt files   CC-MAIN-2017-13/robotstxt.paths.gz        66500   0.10
Non-200 responses  CC-MAIN-2017-13/non200responses.paths.gz  66500   0.82

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-13/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2017 Crawl Archive Now Available

The crawl archive for February 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-09/. It contains 3.08 billion+ web pages and over 250 TiB of uncompressed content.

To extend the coverage of the crawl we

  • continued to use sitemaps to find fresh URLs for known hosts;
  • added 250 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 5 million hosts. We also ranked these hosts by Harmonic Centrality calculated on Common Search’s host-level webgraph using HyperBall;
  • again, used verified, DNS-resolvable domain names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl, .ru, .dk), thanks to the continued donation of seed data from webxtrakt;
  • included 3 million URLs from dmoz.org (formerly, the Open Directory Project).
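
For readers unfamiliar with the ranking measure: the harmonic centrality of a host is the sum of the reciprocal shortest-path distances from all other hosts that can reach it. The sketch below computes it exactly on a tiny made-up graph using one breadth-first search per node over the incoming links. This is not HyperBall, which approximates these values so that graphs with millions of hosts become tractable; the function name and the sample graph are hypothetical.

```python
from collections import deque


def harmonic_centrality(edges, nodes):
    """Exact harmonic centrality: sum of 1/d(y, x) over all y that reach x."""
    # incoming[v] lists the nodes linking to v (the transposed graph).
    incoming = {n: [] for n in nodes}
    for src, dst in edges:
        incoming[dst].append(src)

    scores = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in incoming[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores[target] = total
    return scores


# Made-up host graph: a -> b -> c and d -> c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("d", "c")]

scores = harmonic_centrality(edges, nodes)
for host in sorted(scores, key=scores.get, reverse=True):
    print(host, scores[host])
```

Host `c` ranks first here: it is reached from `b` and `d` at distance 1 and from `a` at distance 2, giving 1 + 1 + 1/2 = 2.5.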

The link extraction for WAT generation has been improved to include links to embedded content. See issues 7 and 9 on GitHub for further information.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                   File List                                 #Files  Total Size Compressed (TiB)
Segments           CC-MAIN-2017-09/segment.paths.gz          100
WARC files         CC-MAIN-2017-09/warc.paths.gz             65200   55.88
WAT files          CC-MAIN-2017-09/wat.paths.gz              65200   20.67
WET files          CC-MAIN-2017-09/wet.paths.gz              65200   9.14
Robots.txt files   CC-MAIN-2017-09/robotstxt.paths.gz        65200   0.13
Non-200 responses  CC-MAIN-2017-09/non200responses.paths.gz  65200   1.98

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-09/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

January 2017 Crawl Archive Now Available

The crawl archive for January 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-04/. It contains more than 3.14 billion web pages and about 250 TiB of uncompressed content.

To extend the coverage of the crawl we

  • continued to use sitemaps to find fresh URLs for already known hosts;
  • added all accessible URLs from the top-million domains from Alexa (within 2 “hops”);
  • again, used verified, DNS-resolvable domain names of European country-code TLDs (.eu, .fr, .be, .de, .ch, .nl, .pl, .ru, .dk), thanks to the continued donation of this data from webxtrakt.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths respectively.

                   File List                                 #Files  Total Size Compressed (TiB)
Segments           CC-MAIN-2017-04/segment.paths.gz          100
WARC files         CC-MAIN-2017-04/warc.paths.gz             57800   53.95
WAT files          CC-MAIN-2017-04/wat.paths.gz              57800   20.34
WET files          CC-MAIN-2017-04/wet.paths.gz              57800   8.97
Robots.txt files   CC-MAIN-2017-04/robotstxt.paths.gz        57800   0.10
Non-200 responses  CC-MAIN-2017-04/non200responses.paths.gz  57800   0.56

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-04/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.