Index to WARC Files and URLs in Columnar Format

We’re happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) allows the index to be queried or processed efficiently, which saves time and computing resources. Especially if only a few columns are accessed, recent big data tools run impressively fast. So far, we’ve tested two of them: Apache Spark and AWS Athena. The latter makes it possible to run SQL queries on the columnar data even without launching a server. Below you’ll find examples of how to query the data with Athena. Examples and instructions for SparkSQL are in preparation. But you are free to use any other tool: the columnar index is free to access or download for anybody. You’ll find all files on:
s3://commoncrawl/cc-index/table/cc-main/warc/

Running SQL Queries with Athena

AWS Athena is a serverless service to analyze data on S3 using SQL. With Presto under the hood you even get a long list of extra functions including lambda expressions. Usage of Athena is not free, but it has an attractive price model: you pay only for the scanned data (currently $5.0 per TiB). The index table of a single monthly crawl is about 300 GB in size. That defines the upper bound for a query on a single crawl, but most queries require only part of the data to be scanned.
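To get a feeling for the price model, here is a minimal Python sketch that turns a number of scanned bytes into an approximate cost. The function name `athena_cost_usd` is made up for this illustration, the price is the one quoted above (it may have changed since), and the sketch assumes that "300 GB" means 300 × 10^9 bytes and ignores any minimum charge or rounding AWS may apply.

```python
# Rough cost estimate for an Athena query, using the price quoted in the text.
PRICE_PER_TIB_USD = 5.0   # assumption: the price named above, check current AWS pricing
BYTES_PER_TIB = 2 ** 40


def athena_cost_usd(scanned_bytes: float) -> float:
    """Return the approximate cost in USD for scanning `scanned_bytes` bytes."""
    return scanned_bytes / BYTES_PER_TIB * PRICE_PER_TIB_USD


# Upper bound for one crawl: a full scan of the index table (about 300 GB).
full_scan_cost = athena_cost_usd(300 * 10 ** 9)
print(f"full scan of one monthly crawl: about ${full_scan_cost:.2f}")
```

Under these assumptions a full scan of one monthly crawl stays below two dollars, and queries that touch only a few columns or a single partition cost a small fraction of that.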

Let’s start and register the Common Crawl index as a database table in Athena:

1. open the Athena query editor. Make sure you’re in the us-east-1 region where all the Common Crawl data is located. You need an AWS account to access Athena; please follow the AWS Athena user guide to learn how to register and set up Athena.

2. to create a database (here called “ccindex”) enter the command

CREATE DATABASE ccindex


and press “Run query”

3. make sure that the database “ccindex” is selected and proceed with “New Query”

4. create the table by executing the following SQL statement:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  content_charset               STRING,
  content_languages             STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';


It will create a table “ccindex” with a schema that fits the data on S3. The two “PARTITIONED BY” columns are actually subdirectories: one for every monthly crawl and, below it, one for every WARC subset. Partitions allow us to update the table every month and also help to limit the costs to query the data. Please note that the table schema may evolve over time; the most recent schema version is available on GitHub.

5. to make Athena recognize the data partitions on S3, you have to execute the SQL statement:

MSCK REPAIR TABLE ccindex


Note that this command is also necessary to make newer crawls appear in the table. Every month we’ll add a new partition (a “directory”, e.g., crawl=CC-MAIN-2018-09/). The new partition is not visible and searchable unless it has been discovered by the repair table command. If you run the command you’ll see which partitions have been newly discovered, e.g.:

Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=crawldiagnostics
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=robotstxt
Repair: Added partition to metastore ccindex:crawl=CC-MAIN-2018-09/subset=warc
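The partition names in this output map directly to "directories" below the table location. As a small illustration, the following Python sketch builds the S3 prefix of a single partition. The helper name `partition_prefix` is hypothetical, and the `key=value/` layout is inferred from the repair output above rather than taken from a formal specification.

```python
# Location of the table, as used in the CREATE EXTERNAL TABLE statement.
TABLE_LOCATION = "s3://commoncrawl/cc-index/table/cc-main/warc/"


def partition_prefix(crawl: str, subset: str) -> str:
    """Build the S3 prefix holding the Parquet files of one partition.

    Assumes the key=value directory layout suggested by the repair output.
    """
    return f"{TABLE_LOCATION}crawl={crawl}/subset={subset}/"


print(partition_prefix("CC-MAIN-2018-09", "warc"))
```

Such a prefix is handy if you want to list or download the files of a single crawl and subset with a tool other than Athena.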

Now you’re ready to run the first query. We’ll count the number of pages per domain within a single top-level domain. As before, press “Run query” after you’ve entered the query into the query editor frame:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC


The result appears seconds later and only 2.12 MB of data have been scanned! Not bad: the query has cost less than one cent. We’ve filtered the data by a partition (a monthly crawl) and selected a small top-level domain (.no). It’s good practice to start developing more complex queries with such filters applied to keep the costs for trials low.

But let’s continue with a second example which demonstrates the power of Presto functions: we try to find domains which provide multi-lingual content. One possible way is to look for ISO-639-1 language codes in the URL, e.g., en in https://example.com/about/en/page.html. You can find the full SQL expression on GitHub. For demonstration purposes we restrict the search to a single and small TLD (.va for Vatican State). The magic is done by

UNNEST(regexp_extract_all(url_path, '(?<=/)(?:[a-z][a-z])(?=/)')) AS t (url_path_lang)


which first extracts all two-letter path elements (e.g., /en/) and then unrolls the elements into a new column “url_path_lang” (if two or more path elements are found, you get multiple rows). On top of that we count pages and unique languages and let Presto/Athena also create a histogram of language codes.
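To see what the extraction and unrolling do to individual URL paths, here is a small Python sketch. It is not the Athena query: it only mimics `regexp_extract_all` with `re.findall`, the unrolling with a nested loop, and the histogram with `collections.Counter`. The sample paths are made up for the illustration.

```python
import re
from collections import Counter

# Same pattern as in the SQL expression: two lowercase letters between slashes.
LANG_PATTERN = re.compile(r"(?<=/)(?:[a-z][a-z])(?=/)")


def path_languages(url_path: str) -> list:
    """Return all two-letter path elements found in `url_path`."""
    return LANG_PATTERN.findall(url_path)


# Made-up sample paths, standing in for the url_path column.
paths = ["/about/en/page.html", "/it/news/de/index.html", "/en/docs/", "/index.html"]

# "Unroll": one (path, language) row per extracted path element.
rows = [(path, lang) for path in paths for lang in path_languages(path)]

# Histogram of language codes over all rows.
histogram = Counter(lang for _, lang in rows)
print(rows)
print(histogram)
```

Note that a path without any two-letter element contributes no row at all, and a path with two elements contributes two rows, which is the behaviour described for the “url_path_lang” column.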

You can find more SQL examples and resources on the cc-index-table project page on GitHub. We’re also working to provide examples to process the table using SparkSQL. First experiments are promising: you get results within minutes even on a small Spark cluster. That’s not seconds as for Athena, but you’re more flexible, esp. regarding the output format, since Athena supports only CSV. Please also check the Athena release notes and the current list of limitations to find out which Presto version is used and which functions are supported.

We hope the new data format will help you to get value from the Common Crawl archives, in addition to the existing services.

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018. These graphs, along with ranked lists of hosts and domains, follow the prior web graph releases (Feb/Mar/Apr 2017, May/Jun/Jul 2017 and Aug/Sep/Oct 2017). Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the preceding announcements.

Please note that the first released version (released 2018-02-08, withdrawn 2018-02-21) contained only links from the January 2018 crawl; see the notice on the Common Crawl user group. On 2018-02-28 a fix was provided with graphs and rankings containing all links, hosts and domains over all 3 crawls. We also provide the erroneously released graphs and rankings from the January 2018 crawl.

What’s new?

Here is a summary of notable aspects and changes of this web graph release:

  • a bug has been fixed which caused relative links pointing to a different host (//www.example.com/index.html) not to be added as edges of the host/domain-level webgraphs
  • the domain graph now contains the number of hosts per domain as an additional column in the vertices and rankings files
  • the naming scheme has changed – the release name is now part of the file name
  • webgraph offset files are not released any more; they can be created by running

    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-host
    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-domain
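The first item in the list concerns links such as //www.example.com/index.html, which inherit the scheme of the linking page but point to a different host. The following Python sketch shows how such a link resolves to its target host using the standard library. It is only an illustration of the link type, not the code used to build the graphs, and the function name `link_target_host` and the page URL are made up.

```python
from urllib.parse import urljoin, urlsplit


def link_target_host(page_url: str, href: str):
    """Resolve `href` against `page_url` and return the host it points to."""
    return urlsplit(urljoin(page_url, href)).hostname


page = "https://example.org/dir/page.html"  # hypothetical linking page

# A link starting with "//" keeps the scheme but switches to another host ...
print(link_target_host(page, "//www.example.com/index.html"))

# ... while an ordinary relative link stays on the host of the page.
print(link_target_host(page, "/other.html"))
```

Only the first kind of link creates an edge between two different hosts, which is why missing them affected the host-level and domain-level graphs.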

Host-level graph

The graph consists of 2.75 billion nodes and 8.6 billion edges. The graph includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to from a link on a crawled page. There are 2.67 billion dangling nodes (97%), and the largest strongly connected component contains only 65 million (2.3%) nodes. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
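The host name notation can be reproduced with a few lines of Python. This is a minimal sketch of the rule as stated above (strip a leading www., then reverse the dot-separated parts); the function name `reverse_host` is made up, and any further normalization the actual pipeline may apply is not covered.

```python
def reverse_host(host: str) -> str:
    """Strip a leading "www." and reverse the dot-separated parts of a host name."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```

The same function is useful in the other direction when looking up a host in the vertices or ranks files: reverse the name you are interested in and search for the result.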

You can download the graph and the ranks of all 2.75 billion hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/ as prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Nov/Dec/Jan 2017-18 host-level webgraph

| Size | File | Description |
| --- | --- | --- |
| 15.9 GB | cc-main-2017-18-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files |
| 40.0 GB | cc-main-2017-18-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 28 edges files |
| 16.4 GB | cc-main-2017-18-nov-dec-jan-host.graph | graph in BVGraph format |
| 2 kB | cc-main-2017-18-nov-dec-jan-host.properties | |
| 24.2 GB | cc-main-2017-18-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks) |
| 2 kB | cc-main-2017-18-nov-dec-jan-host-t.properties | |
| 1 kB | cc-main-2017-18-nov-dec-jan-host.stats | WebGraph statistics |
| 38.1 GB | cc-main-2017-18-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank |
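Since the release name is part of both the path and the file name, download URLs can be assembled mechanically. The Python sketch below does this for the HTTPS prefix given above. The helper `graph_file_url` is hypothetical, and it assumes that each file sits directly below the host/ or domain/ prefix.

```python
BASE_URL = "https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/"
RELEASE = "cc-main-2017-18-nov-dec-jan"


def graph_file_url(level: str, suffix: str) -> str:
    """Build the download URL of a graph file.

    `level` is "host" or "domain"; `suffix` is the part of the file name after
    the level, e.g. "-ranks.txt.gz", ".graph" or "-t.properties".
    """
    return f"{BASE_URL}{RELEASE}/{level}/{RELEASE}-{level}{suffix}"


print(graph_file_url("host", "-ranks.txt.gz"))
print(graph_file_url("host", ".graph"))
```

Keep the file sizes in mind before downloading: the host-level ranks file alone is 38.1 GB.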

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not accepted (cf. section “divisions” in the documentation on publicsuffix.org). For example, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated into the domains blogspot.com and amazonaws.com, respectively, and stored in the reversed form (com.blogspot and com.amazonaws).
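The aggregation rule can be illustrated with a toy Python sketch. It does not read the real public suffix list: `ICANN_SUFFIXES` is a tiny hand-picked stand-in for the “ICANN” division, wildcard and exception rules of the real list are ignored, and the function names are made up for this example.

```python
# Toy stand-in for the "ICANN" division of the public suffix list (assumption).
# "Private" suffixes such as blogspot.com are deliberately not included.
ICANN_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host: str):
    """Return the pay-level domain of `host`, or None if no suffix matches."""
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in ICANN_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def reversed_domain(domain: str) -> str:
    """Reverse the dot-separated parts, as done in the graph files."""
    return ".".join(reversed(domain.split(".")))


for host in ["foo.blogspot.com", "commoncrawl.s3.amazonaws.com", "www.bbc.co.uk"]:
    pld = pay_level_domain(host)
    print(host, "->", pld, "->", reversed_domain(pld))
```

Because blogspot.com is only a “private” suffix, foo.blogspot.com is not kept as a domain of its own and ends up in the node com.blogspot.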

The domain-level graph has 94 million nodes and 1.44 billion edges. 56 million nodes (59%) are dangling nodes, and the largest strongly connected component covers 33 million (35%) of the nodes.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/ or, alternatively, https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2017-18 domain-level webgraph

| Size | File | Description |
| --- | --- | --- |
| 0.67 GB | cc-main-2017-18-nov-dec-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩ |
| 5.7 GB | cc-main-2017-18-nov-dec-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩ |
| 3.1 GB | cc-main-2017-18-nov-dec-jan-domain.graph | graph in BVGraph format |
| 2 kB | cc-main-2017-18-nov-dec-jan-domain.properties | |
| 3.3 GB | cc-main-2017-18-nov-dec-jan-domain-t.graph | transpose of the graph |
| 2 kB | cc-main-2017-18-nov-dec-jan-domain-t.properties | |
| 1 kB | cc-main-2017-18-nov-dec-jan-domain.stats | WebGraph statistics |
| 2.0 GB | cc-main-2017-18-nov-dec-jan-domain-ranks.txt.gz | harmonic centrality and pagerank |

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 94 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (Nov/Dec/Jan 2017-2018)

| harmonic centrality rank | hc value | page rank | page rank value | reversed hostname |
| --- | --- | --- | --- | --- |
| 1 | 26073210 | 2 | 0.013220 | com.facebook |
| 2 | 25501832 | 1 | 0.016444 | com.googleapis |
| 3 | 23718256 | 3 | 0.009278 | com.google |
| 4 | 23371534 | 4 | 0.008406 | com.twitter |
| 5 | 22832192 | 5 | 0.007823 | com.youtube |
| 6 | 21653376 | 6 | 0.006112 | org.w |
| 7 | 20324636 | 7 | 0.004710 | org.gmpg |
| 8 | 20045928 | 8 | 0.003501 | com.instagram |
| 9 | 19837996 | 10 | 0.002871 | com.linkedin |
| 10 | 19439618 | 12 | 0.002753 | org.wordpress |
| 11 | 19334234 | 14 | 0.002070 | com.wordpress |
| 12 | 19214522 | 17 | 0.001665 | com.pinterest |
| 13 | 19145770 | 27 | 0.001242 | org.wikipedia |
| 14 | 19121822 | 23 | 0.001462 | com.gravatar |
| 15 | 18842296 | 33 | 0.000966 | com.blogspot |
| 16 | 18810990 | 11 | 0.002837 | com.bootstrapcdn |
| 17 | 18718320 | 19 | 0.001594 | com.apple |
| 18 | 18626224 | 26 | 0.001255 | com.vimeo |
| 19 | 18434062 | 15 | 0.001863 | com.adobe |
| 20 | 18419880 | 44 | 0.000691 | be.youtu |
| 21 | 18397832 | 34 | 0.000964 | com.amazon |
| 22 | 18350614 | 13 | 0.002084 | com.macromedia |
| 23 | 18323552 | 29 | 0.001015 | com.microsoft |
| 24 | 18321908 | 41 | 0.000757 | gl.goo |
| 25 | 18302296 | 31 | 0.001009 | com.flickr |
| 26 | 18270630 | 46 | 0.000657 | com.tumblr |
| 27 | 18183288 | 59 | 0.000540 | com.yahoo |
| 28 | 18136014 | 20 | 0.001531 | net.doubleclick |
| 29 | 18074436 | 70 | 0.000464 | ly.bit |
| 30 | 18072284 | 32 | 0.000988 | com.amazonaws |
| 31 | 18039506 | 18 | 0.001618 | com.googletagmanager |
| 32 | 17994916 | 35 | 0.000913 | com.paypal |
| 33 | 17957448 | 78 | 0.000417 | eu.europa |
| 34 | 17950818 | 25 | 0.001280 | com.cloudflare |
| 35 | 17880136 | 87 | 0.000397 | com.weebly |
| 36 | 17863816 | 30 | 0.001012 | com.github |
| 37 | 17859140 | 81 | 0.000412 | org.mozilla |
| 38 | 17838500 | 40 | 0.000769 | net.cloudfront |
| 39 | 17830430 | 95 | 0.000348 | co.t |
| 40 | 17794416 | 80 | 0.000414 | org.creativecommons |
| 41 | 17773226 | 102 | 0.000289 | com.googleusercontent |
| 42 | 17757566 | 57 | 0.000562 | org.w3 |
| 43 | 17751372 | 39 | 0.000782 | io.github |
| 44 | 17703562 | 97 | 0.000340 | com.soundcloud |
| 45 | 17674626 | 118 | 0.000226 | com.blogger |
| 46 | 17673486 | 138 | 0.000182 | net.slideshare |
| 47 | 17666384 | 108 | 0.000265 | com.reddit |
| 48 | 17650506 | 51 | 0.000617 | com.bing |
| 49 | 17622678 | 147 | 0.000171 | com.myspace |
| 50 | 17614686 | 65 | 0.000474 | com.medium |
| 51 | 17600302 | 117 | 0.000233 | org.archive |
| 52 | 17597652 | 136 | 0.000187 | com.imgur |
| 53 | 17581558 | 66 | 0.000474 | com.list-manage |
| 54 | 17545184 | 37 | 0.000804 | org.apache |
| 55 | 17499074 | 155 | 0.000154 | com.imdb |
| 56 | 17493316 | 240 | 0.000097 | com.about |
| 57 | 17491778 | 28 | 0.001104 | com.gstatic |
| 58 | 17471556 | 169 | 0.000144 | com.wsj |
| 59 | 17464136 | 126 | 0.000218 | com.jimdo |
| 60 | 17462240 | 234 | 0.000101 | com.livejournal |
| 61 | 17450286 | 47 | 0.000649 | com.wp |
| 62 | 17447836 | 129 | 0.000206 | com.issuu |
| 63 | 17445244 | 130 | 0.000204 | com.android |
| 64 | 17443518 | 122 | 0.000222 | com.yelp |
| 65 | 17419300 | 43 | 0.000721 | com.statcounter |
| 66 | 17406774 | 50 | 0.000626 | me.wp |
| 67 | 17392892 | 179 | 0.000138 | com.oracle |
| 68 | 17372570 | 162 | 0.000148 | com.digg |
| 69 | 17368632 | 231 | 0.000102 | me.about |
| 70 | 17367318 | 299 | 0.000078 | com.scribd |
| 71 | 17361946 | 255 | 0.000091 | org.python |
| 72 | 17359688 | 127 | 0.000210 | uk.co.google |
| 73 | 17357006 | 61 | 0.000525 | com.cnn |
| 74 | 17342014 | 124 | 0.000220 | com.nytimes |
| 75 | 17339968 | 319 | 0.000073 | com.quora |
| 76 | 17329620 | 249 | 0.000092 | com.ted |
| 77 | 17321450 | 153 | 0.000161 | com.spotify |
| 78 | 17301998 | 148 | 0.000168 | com.wixsite |
| 79 | 17300500 | 233 | 0.000101 | com.dailymotion |
| 80 | 17297908 | 208 | 0.000118 | com.staticflickr |
| 81 | 17288954 | 390 | 0.000062 | org.chromium |
| 82 | 17276404 | 106 | 0.000273 | com.ytimg |
| 83 | 17269890 | 259 | 0.000089 | com.webs |
| 84 | 17265306 | 145 | 0.000177 | org.ietf |
| 85 | 17255342 | 222 | 0.000109 | com.mozilla |
| 86 | 17243666 | 187 | 0.000133 | net.behance |
| 87 | 17243048 | 191 | 0.000130 | com.disqus |
| 88 | 17242476 | 273 | 0.000085 | com.mysql |
| 89 | 17240042 | 158 | 0.000152 | com.stumbleupon |
| 90 | 17236410 | 268 | 0.000085 | com.foursquare |
| 91 | 17231272 | 314 | 0.000075 | gov.loc |
| 92 | 17213010 | 151 | 0.000164 | org.gnu |
| 93 | 17210118 | 146 | 0.000171 | com.tripadvisor |
| 94 | 17203374 | 361 | 0.000066 | org.nodejs |
| 95 | 17201882 | 378 | 0.000064 | com.storify |
| 96 | 17178790 | 156 | 0.000153 | com.forbes |
| 97 | 17177956 | 60 | 0.000527 | com.huffingtonpost |
| 98 | 17168464 | 133 | 0.000196 | com.dropbox |
| 99 | 17164012 | 199 | 0.000125 | com.typepad |
| 100 | 17156522 | 241 | 0.000097 | com.example |
| 101 | 17150188 | 166 | 0.000146 | uk.co.bbc |
| 102 | 17148528 | 479 | 0.000051 | edu.virginia |
| 103 | 17142618 | 89 | 0.000384 | com.paypalobjects |
| 104 | 17140226 | 48 | 0.000645 | net.fbcdn |
| 105 | 17130568 | 403 | 0.000060 | com.pixabay |
| 106 | 17126316 | 383 | 0.000063 | ca.blogspot |
| 107 | 17118492 | 200 | 0.000124 | org.wikimedia |
| 108 | 17116358 | 297 | 0.000079 | com.githubusercontent |
| 109 | 17115676 | 363 | 0.000066 | com.sun |
| 110 | 17111592 | 36 | 0.000863 | com.squarespace |
| 111 | 17106572 | 292 | 0.000079 | com.goodreads |
| 112 | 17105500 | 56 | 0.000574 | com.fb |
| 113 | 17103768 | 459 | 0.000053 | kr.flic |
| 114 | 17094226 | 431 | 0.000057 | org.ampproject |
| 115 | 17086396 | 530 | 0.000048 | edu.gatech |
| 116 | 17086356 | 180 | 0.000137 | com.theguardian |
| 117 | 17085768 | 96 | 0.000344 | com.wix |
| 118 | 17083032 | 518 | 0.000049 | it.scoop |
| 119 | 17081382 | 427 | 0.000057 | org.sciencemag |
| 120 | 17072106 | 139 | 0.000182 | net.sourceforge |
| 121 | 17062258 | 515 | 0.000049 | com.nike |
| 122 | 17056708 | 400 | 0.000060 | org.eclipse |
| 123 | 17054770 | 432 | 0.000056 | co.g |
| 124 | 17052438 | 269 | 0.000085 | com.tinyurl |
| 125 | 17052256 | 62 | 0.000509 | net.akamaihd |
| 126 | 17047904 | 437 | 0.000055 | org.kernel |
| 127 | 17045616 | 68 | 0.000467 | com.mashable |
| 128 | 17045460 | 489 | 0.000051 | au.com.blogspot |
| 129 | 17042294 | 64 | 0.000480 | org.schema |
| 130 | 17041062 | 620 | 0.000043 | com.discogs |
| 131 | 17038234 | 141 | 0.000181 | com.youtube-nocookie |
| 132 | 17037262 | 370 | 0.000065 | com.npmjs |
| 133 | 17034618 | 298 | 0.000079 | com.symantec |
| 134 | 17023588 | 196 | 0.000126 | com.live |
| 135 | 17018612 | 328 | 0.000072 | com.googlecode |
| 136 | 17016620 | 396 | 0.000061 | com.git-scm |
| 137 | 17012130 | 394 | 0.000061 | com.500px |
| 138 | 17011310 | 198 | 0.000126 | edu.stanford |
| 139 | 17010782 | 416 | 0.000058 | com.unity3d |
| 140 | 17010532 | 686 | 0.000042 | com.wikidot |
| 141 | 16992494 | 334 | 0.000071 | com.alexa |
| 142 | 16983610 | 447 | 0.000054 | com.sap |
| 143 | 16978146 | 250 | 0.000092 | com.businessinsider |
| 144 | 16976726 | 272 | 0.000085 | com.cnet |
| 145 | 16976366 | 372 | 0.000064 | com.getpocket |
| 146 | 16971698 | 252 | 0.000092 | com.go |
| 147 | 16964456 | 232 | 0.000101 | com.washingtonpost |
| 148 | 16957034 | 567 | 0.000046 | com.chrome |
| 149 | 16955976 | 9 | 0.003080 | com.godaddy |
| 150 | 16954152 | 140 | 0.000182 | com.sharethis |
| 151 | 16953542 | 211 | 0.000115 | com.ebay |
| 152 | 16949556 | 506 | 0.000050 | edu.berkeley |
| 153 | 16948812 | 377 | 0.000064 | au.gov.nsw |
| 154 | 16942594 | 289 | 0.000080 | com.msn |
| 155 | 16938334 | 333 | 0.000072 | com.time |
| 156 | 16937752 | 213 | 0.000114 | com.nbcnews |
| 157 | 16934364 | 75 | 0.000429 | edu.utexas |
| 158 | 16930038 | 532 | 0.000048 | com.jetbrains |
| 159 | 16927712 | 317 | 0.000074 | edu.harvard |
| 160 | 16924770 | 545 | 0.000047 | ms.1drv |
| 161 | 16917850 | 189 | 0.000130 | com.etsy |
| 162 | 16914838 | 176 | 0.000140 | gov.nih |
| 163 | 16911752 | 664 | 0.000043 | com.klout |
| 164 | 16905058 | 327 | 0.000072 | edu.mit |
| 165 | 16903928 | 316 | 0.000074 | com.reuters |
| 166 | 16898946 | 235 | 0.000098 | com.mapquest |
| 167 | 16898876 | 318 | 0.000074 | com.wired |
| 168 | 16893364 | 570 | 0.000046 | com.crunchbase |
| 169 | 16893270 | 401 | 0.000060 | gov.nasa |
| 170 | 16890130 | 722 | 0.000040 | com.4shared |
| 171 | 16885770 | 281 | 0.000082 | io.codepen |
| 172 | 16882882 | 295 | 0.000079 | com.photobucket |
| 173 | 16875932 | 257 | 0.000090 | com.udacity |
| 174 | 16865692 | 309 | 0.000076 | com.aol |
| 175 | 16858168 | 408 | 0.000059 | com.cnbc |
| 176 | 16853816 | 293 | 0.000079 | com.tripod |
| 177 | 16848676 | 517 | 0.000049 | org.aarp |
| 178 | 16847720 | 563 | 0.000046 | edu.utah |
| 179 | 16846928 | 342 | 0.000070 | org.npr |
| 180 | 16844128 | 746 | 0.000039 | com.diigo |
| 181 | 16842074 | 303 | 0.000077 | com.meetup |
| 182 | 16840924 | 120 | 0.000223 | com.mailchimp |
| 183 | 16840096 | 367 | 0.000065 | com.gmail |
| 184 | 16835606 | 24 | 0.001310 | ru.yandex |
| 185 | 16834612 | 425 | 0.000057 | com.appspot |
| 186 | 16833556 | 287 | 0.000080 | com.ibm |
| 187 | 16827030 | 338 | 0.000071 | gov.ca |
| 188 | 16826202 | 242 | 0.000095 | com.surveymonkey |
| 189 | 16825532 | 276 | 0.000083 | com.usatoday |
| 190 | 16824988 | 778 | 0.000038 | com.googledrive |
| 191 | 16822846 | 749 | 0.000039 | com.naturalnews |
| 192 | 16819990 | 764 | 0.000038 | io.soup |
| 193 | 16815880 | 340 | 0.000070 | uk.co.telegraph |
| 194 | 16814236 | 163 | 0.000148 | com.eventbrite |
| 195 | 16813884 | 206 | 0.000119 | com.opera |
| 196 | 16813306 | 676 | 0.000043 | com.zappos |
| 197 | 16811868 | 88 | 0.000394 | com.jquery |
| 198 | 16811796 | 692 | 0.000042 | com.wholefoodsmarket |
| 199 | 16809508 | 535 | 0.000048 | com.createspace |
| 200 | 16809072 | 322 | 0.000073 | com.images-amazon |
| 201 | 16807592 | 304 | 0.000077 | com.bloomberg |
| 202 | 16796502 | 193 | 0.000128 | com.twimg |
| 203 | 16793364 | 414 | 0.000058 | com.kickstarter |
| 204 | 16792730 | 103 | 0.000285 | com.addthis |
| 205 | 16791844 | 251 | 0.000092 | com.techcrunch |
| 206 | 16791074 | 804 | 0.000037 | edu.washington |
| 207 | 16790852 | 689 | 0.000042 | com.abebooks |
| 208 | 16790694 | 294 | 0.000079 | com.googlesyndication |
| 209 | 16790246 | 511 | 0.000049 | edu.cornell |
| 210 | 16785260 | 529 | 0.000048 | com.buzzfeed |
| 211 | 16783130 | 412 | 0.000059 | org.un |
| 212 | 16781132 | 263 | 0.000087 | com.stackoverflow |
| 213 | 16780958 | 149 | 0.000166 | com.feedburner |
| 214 | 16779250 | 608 | 0.000044 | com.theverge |
| 215 | 16775130 | 796 | 0.000037 | com.pearltrees |
| 216 | 16774700 | 67 | 0.000473 | com.vk |
| 217 | 16774586 | 375 | 0.000064 | com.latimes |
| 218 | 16765579 | 699 | 0.000042 | com.sublimetext |
| 219 | 16760696 | 498 | 0.000050 | org.rubyonrails |
| 220 | 16755911 | 170 | 0.000142 | com.zendesk |
| 221 | 16754800 | 880 | 0.000035 | com.fotolog |
| 222 | 16754091 | 69 | 0.000466 | me.fb |
| 223 | 16751300 | 577 | 0.000045 | com.audible |
| 224 | 16750615 | 549 | 0.000047 | org.pbs |
| 225 | 16749314 | 536 | 0.000048 | com.deviantart |
| 226 | 16747765 | 410 | 0.000059 | com.wiley |
| 227 | 16746660 | 307 | 0.000077 | org.acm |
| 228 | 16745326 | 862 | 0.000036 | tl.page |
| 229 | 16744572 | 212 | 0.000114 | com.ssl-images-amazon |
| 230 | 16743890 | 824 | 0.000037 | com.instapaper |
| 231 | 16742662 | 741 | 0.000039 | com.kinja |
| 232 | 16742008 | 110 | 0.000253 | com.shopify |
| 233 | 16740811 | 767 | 0.000038 | com.newyorker |
| 234 | 16740369 | 503 | 0.000050 | com.yellowpages |
| 235 | 16736172 | 203 | 0.000122 | org.drupal |
| 236 | 16734978 | 758 | 0.000039 | com.xda-developers |
| 237 | 16732311 | 921 | 0.000035 | com.adsoftheworld |
| 238 | 16731895 | 221 | 0.000110 | org.mediawiki |
| 239 | 16731137 | 279 | 0.000083 | fr.free |
| 240 | 16730080 | 805 | 0.000037 | co.ello |
| 241 | 16729515 | 444 | 0.000054 | com.theatlantic |
| 242 | 16725251 | 409 | 0.000059 | uk.co.dailymail |
| 243 | 16721365 | 1189 | 0.000031 | edu.columbia |
| 244 | 16720295 | 388 | 0.000062 | com.bbc |
| 245 | 16720112 | 45 | 0.000661 | com.yimg |
| 246 | 16718905 | 451 | 0.000054 | com.wikihow |
| 247 | 16718697 | 236 | 0.000098 | net.php |
| 248 | 16714787 | 589 | 0.000044 | com.citysearch |
| 249 | 16701081 | 811 | 0.000037 | com.jigsy |
| 250 | 16699551 | 684 | 0.000043 | com.vice |
| 251 | 16693416 | 992 | 0.000034 | ly.ow |
| 252 | 16692056 | 534 | 0.000048 | com.exacttarget |
| 253 | 16685527 | 261 | 0.000089 | com.salesforce |
| 254 | 16682819 | 539 | 0.000047 | com.cbsnews |
| 255 | 16678018 | 502 | 0.000050 | com.zdnet |
| 256 | 16676526 | 397 | 0.000061 | gov.whitehouse |
| 257 | 16675419 | 582 | 0.000045 | com.ft |
| 258 | 16669694 | 105 | 0.000280 | de.google |
| 259 | 16667239 | 1190 | 0.000031 | edu.yale |
| 260 | 16661478 | 1213 | 0.000031 | edu.ucla |
| 261 | 16657707 | 606 | 0.000044 | uk.co.guardian |
| 262 | 16655324 | 685 | 0.000043 | com.googleblog |
| 263 | 16654076 | 734 | 0.000040 | com.nationalgeographic |
| 264 | 16651951 | 92 | 0.000369 | com.qq |
| 265 | 16649666 | 1166 | 0.000032 | edu.psu |
| 266 | 16649394 | 399 | 0.000060 | uk.co.blogspot |
| 267 | 16648730 | 766 | 0.000038 | com.foxnews |
| 268 | 16648321 | 644 | 0.000043 | org.virtualbox |
| 269 | 16647850 | 523 | 0.000048 | org.maven |
| 270 | 16647058 | 77 | 0.000418 | com.people |
| 271 | 16646731 | 216 | 0.000113 | uk.co.amazon |
| 272 | 16645964 | 258 | 0.000089 | com.hp |
| 273 | 16642738 | 550 | 0.000047 | com.cisco |
| 274 | 16640084 | 777 | 0.000038 | com.economist |
| 275 | 16639283 | 321 | 0.000073 | gov.cdc |
| 276 | 16634710 | 590 | 0.000044 | com.bandsintown |
| 277 | 16634368 | 1326 | 0.000027 | com.indiegogo |
| 278 | 16630754 | 1187 | 0.000031 | com.gizmodo |
| 279 | 16630079 | 218 | 0.000112 | com.windowsphone |
| 280 | 16628634 | 584 | 0.000045 | org.hbr |
| 281 | 16628196 | 919 | 0.000035 | com.authorstream |
| 282 | 16627672 | 439 | 0.000055 | edu.cmu |
| 283 | 16624396 | 851 | 0.000036 | com.timeanddate |
| 284 | 16621468 | 1186 | 0.000031 | com.evernote |
| 285 | 16620476 | 578 | 0.000045 | com.dropboxusercontent |
| 286 | 16619394 | 1160 | 0.000033 | com.sciencedaily |
| 287 | 16616793 | 687 | 0.000042 | com.wikia |
| 288 | 16615286 | 224 | 0.000108 | com.bandcamp |
| 289 | 16613293 | 395 | 0.000061 | org.whatbrowser |
| 290 | 16612143 | 256 | 0.000090 | io.atom |
| 291 | 16612009 | 1259 | 0.000029 | in.blogspot |
| 292 | 16610174 | 714 | 0.000040 | com.dpreview |
| 293 | 16610098 | 280 | 0.000083 | com.smugmug |
| 294 | 16609456 | 171 | 0.000142 | com.weibo |
| 295 | 16605754 | 528 | 0.000048 | com.theknot |
| 296 | 16604115 | 751 | 0.000039 | com.merchantcircle |
| 297 | 16599614 | 871 | 0.000035 | us.imageshack |
| 298 | 16598567 | 882 | 0.000035 | com.slate |
| 299 | 16598491 | 197 | 0.000126 | com.blogblog |
| 300 | 16596575 | 715 | 0.000040 | org.imagemagick |
| 301 | 16594197 | 1197 | 0.000031 | org.arxiv |
| 302 | 16591680 | 476 | 0.000051 | com.squareup |
| 303 | 16591661 | 369 | 0.000065 | com.skype |
| 304 | 16588342 | 1428 | 0.000023 | edu.ucsd |
| 305 | 16586521 | 1297 | 0.000028 | com.ning |
| 306 | 16582959 | 575 | 0.000046 | com.tinypic |
| 307 | 16582704 | 493 | 0.000050 | com.giphy |
| 308 | 16582423 | 696 | 0.000042 | com.box |
| 309 | 16582058 | 311 | 0.000076 | com.nypost |
| 310 | 16576626 | 1454 | 0.000023 | com.posterous |
| 311 | 16576158 | 688 | 0.000042 | com.bookdepository |
| 312 | 16576073 | 885 | 0.000035 | com.brandyourself |
| 313 | 16575175 | 1249 | 0.000029 | edu.upenn |
| 314 | 16573309 | 1155 | 0.000033 | org.eff |
| 315 | 16572621 | 478 | 0.000051 | org.postgresql |
| 316 | 16571814 | 677 | 0.000043 | de.blogspot |
| 317 | 16568213 | 407 | 0.000059 | com.angieslist |
| 318 | 16564953 | 787 | 0.000038 | com.samsung |
| 319 | 16563339 | 843 | 0.000036 | com.comixology |
| 320 | 16561663 | 1408 | 0.000024 | edu.wisc |
| 321 | 16560984 | 1161 | 0.000032 | gov.census |
| 322 | 16559941 | 747 | 0.000039 | com.shutterstock |
| 323 | 16559463 | 1323 | 0.000027 | uk.ac.cam |
| 324 | 16558927 | 1171 | 0.000032 | gov.nist |
| 325 | 16558858 | 543 | 0.000047 | com.geocities |
| 326 | 16558841 | 168 | 0.000144 | com.xing |
| 327 | 16558455 | 422 | 0.000057 | com.oreilly |
| 328 | 16558027 | 1459 | 0.000023 | edu.purdue |
| 329 | 16556658 | 716 | 0.000040 | com.nature |
| 330 | 16556180 | 1397 | 0.000024 | com.hotmail |
| 331 | 16555028 | 802 | 0.000037 | com.uk |
| 332 | 16554351 | 996 | 0.000034 | com.livestream |
| 333 | 16553202 | 920 | 0.000035 | com.arstechnica |
| 334 | 16552006 | 337 | 0.000071 | com.prnewswire |
| 335 | 16548858 | 284 | 0.000081 | ca.google |
| 336 | 16546727 | 705 | 0.000041 | org.vim |
| 337 | 16545866 | 220 | 0.000111 | com.getclicky |
| 338 | 16543548 | 415 | 0.000058 | int.who |
| 339 | 16541425 | 1436 | 0.000023 | edu.princeton |
| 340 | 16538268 | 569 | 0.000046 | com.entrepreneur |
| 341 | 16538222 | 382 | 0.000063 | com.sxsw |
| 342 | 16538110 | 1499 | 0.000022 | com.angelfire |
| 343 | 16537923 | 1229 | 0.000030 | edu.umich |
| 344 | 16537889 | 426 | 0.000057 | com.springer |
| 345 | 16533600 | 779 | 0.000038 | com.bravesites |
| 346 | 16533096 | 1038 | 0.000033 | org.unesco |
| 347 | 16531594 | 1351 | 0.000026 | uk.ac.ox |
| 348 | 16531456 | 484 | 0.000051 | com.office |
| 349 | 16529055 | 1260 | 0.000029 | org.iso |
| 350 | 16528766 | 1330 | 0.000027 | com.pcworld |
| 351 | 16527778 | 860 | 0.000036 | com.unsplash |
| 352 | 16527375 | 755 | 0.000039 | com.blackberry |
| 353 | 16526680 | 210 | 0.000117 | de.amazon |
| 354 | 16525790 | 781 | 0.000038 | gov.state |
| 355 | 16523581 | 449 | 0.000054 | com.fortune |
| 356 | 16522217 | 703 | 0.000041 | org.aclweb |
| 357 | 16522178 | 786 | 0.000038 | net.vnexpress |
| 358 | 16522049 | 354 | 0.000068 | com.booking |
| 359 | 16521727 | 879 | 0.000035 | com.dynamics |
| 360 | 16521106 | 1020 | 0.000034 | com.weather |
| 361 | 16520245 | 1003 | 0.000034 | com.communitywalk |
| 362 | 16519672 | 763 | 0.000039 | com.vagrantup |
| 363 | 16516137 | 159 | 0.000152 | com.constantcontact |
| 364 | 16514887 | 708 | 0.000041 | jobs.amazon |
| 365 | 16514721 | 1039 | 0.000033 | com.indiatimes |
| 366 | 16512625 | 775 | 0.000038 | com.cbslocal |
| 367 | 16512276 | 1200 | 0.000031 | com.lifehacker |
| 368 | 16511972 | 1432 | 0.000023 | com.vox |
| 369 | 16509793 | 270 | 0.000085 | it.placehold |
| 370 | 16508711 | 565 | 0.000046 | com.newsweek |
| 371 | 16508203 | 1609 | 0.000020 | net.comcast |
| 372 | 16505501 | 209 | 0.000118 | org.joomla |
| 373 | 16505442 | 448 | 0.000054 | com.force |
| 374 | 16505148 | 1299 | 0.000028 | com.politico |
| 375 | 16502701 | 1310 | 0.000028 | org.altervista |
| 376 | 16500671 | 588 | 0.000044 | com.venturebeat |
| 377 | 16498765 | 278 | 0.000083 | gov.ftc |
| 378 | 16497198 | 756 | 0.000039 | com.java |
| 379 | 16497000 | 1264 | 0.000029 | co.vine |
| 380 | 16493364 | 1068 | 0.000033 | com.ubuntu |
| 381 | 16493303 | 1463 | 0.000023 | com.thinkwithgoogle |
| 382 | 16488408 | 446 | 0.000054 | com.businesswire |
| 383 | 16488348 | 253 | 0.000091 | to.amzn |
| 384 | 16488175 | 1343 | 0.000026 | fm.last |
| 385 | 16487061 | 868 | 0.000035 | hu.elte |
| 386 | 16486012 | 1203 | 0.000031 | com.gofundme |
| 387 | 16485698 | 1168 | 0.000032 | ca.cbc |
| 388 | 16484940 | 1071 | 0.000033 | gov.senate |
| 389 | 16482707 | 1590 | 0.000020 | edu.uchicago |
| 390 | 16482550 | 679 | 0.000043 | com.googlesource |
| 391 | 16481201 | 713 | 0.000040 | org.sqlite |
| 392 | 16473573 | 1335 | 0.000026 | com.airbnb |
| 393 | 16471045 | 680 | 0.000043 | gov.noaa |
| 394 | 16470456 | 719 | 0.000040 | com.manta |
| 395 | 16470297 | 142 | 0.000180 | org.bbb |
| 396 | 16466733 | 1237 | 0.000029 | com.searchengineland |
| 397 | 16465690 | 2103 | 0.000014 | com.twitpic |
| 398 | 16465265 | 1406 | 0.000024 | edu.umn |
| 399 | 16464820 | 884 | 0.000035 | com.googlelabs |
| 400 | 16464294 | 1169 | 0.000032 | com.engadget |
| 401 | 16464089 | 1399 | 0.000024 | uk.co.theregister |
| 402 | 16463266 | 519 | 0.000049 | com.inc |
| 403 | 16463082 | 79 | 0.000414 | com.bleacherreport |
| 404 | 16461353 | 264 | 0.000086 | es.google |
| 405 | 16461168 | 1324 | 0.000027 | com.dell |
| 406 | 16459804 | 1650 | 0.000019 | com.blogs |
| 407 | 16459344 | 1236 | 0.000030 | com.stackexchange |
| 408 | 16458810 | 1676 | 0.000019 | edu.usc |
| 409 | 16458357 | 1482 | 0.000022 | com.mtv |
| 410 | 16456299 | 527 | 0.000048 | org.sonatype |
| 411 | 16456115 | 1720 | 0.000018 | mp.j |
| 412 | 16456086 | 1316 | 0.000027 | com.variety |
| 413 | 16455553 | 740 | 0.000039 | org.gnupg |
| 414 | 16454441 | 2510 | 0.000011 | edu.unl |
| 415 | 16453361 | 1332 | 0.000027 | org.ieee |
| 416 | 16452222 | 1554 | 0.000021 | edu.northwestern |
| 417 | 16451197 | 1184 | 0.000031 | com.americanexpress |
| 418 | 16450122 | 456 | 0.000053 | com.snapchat |
| 419 | 16450065 | 219 | 0.000111 | fr.google |
| 420 | 16448305 | 1307 | 0.000028 | com.discovery |
| 421 | 16447926 | 1257 | 0.000029 | com.businessweek |
| 422 | 16447711 | 1219 | 0.000030 | com.netflix |
| 423 | 16445870 | 1599 | 0.000020 | edu.jhu |
| 424 | 16445859 | 769 | 0.000038 | com.jsbin |
| 425 | 16445370 | 128 | 0.000209 | com.googleadservices |
| 426 | 16445100 | 735 | 0.000040 | com.intel |
| 427 | 16444823 | 566 | 0.000046 | com.delicious |
| 428 | 16444541 | 1152 | 0.000033 | com.pinimg |
| 429 | 16443282 | 474 | 0.000052 | com.nwsource |
| 430 | 16442373 | 1156 | 0.000033 | tv.ustream |
| 431 | 16439601 | 165 | 0.000147 | it.google |
| 432 | 16439513 | 723 | 0.000040 | br.com.uol |
| 433 | 16439408 | 521 | 0.000048 | com.herokuapp |
| 434 | 16438062 | 312 | 0.000075 | com.bitly |
| 435 | 16432114 | 184 | 0.000134 | com.eepurl |
| 436 | 16431325 | 1620 | 0.000019 | com.examiner |
| 437 | 16431244 | 358 | 0.000067 | com.bizjournals |
| 438 | 16430338 | 855 | 0.000036 | com.souq |
| 439 | 16429560 | 1174 | 0.000032 | au.net.abc |
| 440 | 16429209 | 1192 | 0.000031 | fr.blogspot |
| 441 | 16428874 | 1806 | 0.000016 | edu.rutgers |
| 442 | 16428650 | 858 | 0.000036 | ca.pinterest |
| 443 | 16428386 | 1630 | 0.000019 | com.udemy |
| 444 | 16426324 | 1680 | 0.000018 | uk.co.thesun |
| 445 | 16425975 | 1429 | 0.000023 | com.prezi |
| 446 | 16422871 | 1693 | 0.000018 | com.speakerdeck |
| 447 | 16421467 | 1290 | 0.000028 | com.mlb |
| 448 | 16421465 | 782 | 0.000038 | com.mysanantonio |
| 449 | 16421194 | 1211 | 0.000031 | com.chicagotribune |
| 450 | 16420605 | 720 | 0.000040 | com.shopbop |
| 451 | 16418476 | 1575 | 0.000020 | it.blogspot |
| 452 | 16418348 | 290 | 0.000080 | com.hubspot |
| 453 | 16416163 | 1899 | 0.000015 | edu.msu |
| 454 | 16415995 | 282 | 0.000082 | com.fc2 |
| 455 | 16415711 | 697 | 0.000042 | com.moz |
| 456 | 16414248 | 784 | 0.000038 | com.boxofficemojo |
| 457 | 16413782 | 727 | 0.000040 | io.getmdl |
| 458 | 16410305 | 76 | 0.000421 | me.m |
| 459 | 16409061 | 1274 | 0.000028 | gov.fbi |
| 460 | 16406432 | 1966 | 0.000015 | ch.ethz |
| 461 | 16406244 | 262 | 0.000088 | com.dribbble |
| 462 | 16404420 | 194 | 0.000126 | jp.co.yahoo |
| 463 | 16402794 | 1491 | 0.000022 | com.trello |
| 464 | 16401561 | 1026 | 0.000034 | com.slack |
| 465 | 16401275 | 1325 | 0.000027 | net.researchgate |
| 466 | 16400689 | 332 | 0.000072 | edu.nyu |
| 467 | 16398770 | 137 | 0.000185 | com.google-analytics |
| 468 | 16398291 | 555 | 0.000047 | com.wunderground |
| 469 | 16398105 | 429 | 0.000057 | com.naver |
| 470 | 16397960 | 1824 | 0.000016 | com.tutsplus |
| 471 | 16396587 | 2171 | 0.000013 | com.googlepages |
| 472 | 16396128 | 1594 | 0.000020 | edu.academia |
| 473 | 16395104 | 413 | 0.000059 | com.bigcartel |
| 474 | 16394348 | 810 | 0.000037 | it.binged |
| 475 | 16393983 | 1380 | 0.000025 | org.khanacademy |
| 476 | 16393285 | 1194 | 0.000031 | com.reverbnation |
| 477 | 16393035 | 1587 | 0.000020 | com.mac |
| 478 | 16392781 | 1472 | 0.000022 | com.target |
| 479 | 16392485 | 2085 | 0.000014 | edu.asu |
| 480 | 16391796 | 275 | 0.000084 | com.wufoo |
| 481 | 16391219 | 2036 | 0.000014 | edu.arizona |
| 482 | 16390459 | 700 | 0.000041 | uk.co.independent |
| 483 | 16389602 | 1519 | 0.000022 | com.pexels |
| 484 | 16389509 | 1412 | 0.000024 | com.over-blog |
| 485 | 16388260 | 466 | 0.000052 | com.adweek |
| 486 | 16387362 | 260 | 0.000089 | com.myshopify |
| 487 | 16387344 | 1395 | 0.000024 | com.bostonglobe |
| 488 | 16387145 | 1572 | 0.000020 | com.zazzle |
| 489 | 16387134 | 1361 | 0.000025 | com.libsyn |
| 490 | 16386010 | 418 | 0.000058 | com.fastcompany |
| 491 | 16385497 | 580 | 0.000045 | gov.ed |
| 492 | 16385161 | 119 | 0.000223 | com.baidu |
| 493 | 16385107 | 612 | 0.000044 | cn.com.sina |
| 494 | 16384455 | 591 | 0.000044 | gov.fda |
| 495 | 16384415 | 728 | 0.000040 | es.com.blogspot |
| 496 | 16384099 | 1040 | 0.000033 | gov.nps |
| 497 | 16383586 | 1646 | 0.000019 | com.vanityfair |
| 498 | 16382978 | 887 | 0.000035 | ws.snack |
| 499 | 16382297 | 1046 | 0.000033 | com.marketwatch |
| 500 | 16381800 | 1888 | 0.000016 | com.yolasite |
| 501 | 16381324 | 1558 | 0.000021 | com.nba |
| 502 | 16379962 | 109 | 0.000261 | org.networkadvertising |
| 503 | 16379082 | 1153 | 0.000033 | gov.house |
| 504 | 16376893 | 898 | 0.000035 | com.sfgate |
| 505 | 16373505 | 2289 | 0.000012 | edu.caltech |
| 506 | 16372783 | 465 | 0.000053 | com.w3schools |
| 507 | 16372566 | 123 | 0.000221 | jp.co.google |
| 508 | 16372041 | 2035 | 0.000014 | com.instructables |
| 509 | 16369468 | 1821 | 0.000016 | com.msnbc |
| 510 | 16368539 | 1470 | 0.000022 | com.scientificamerican |
| 511 | 16368503 | 1543 | 0.000021 | com.ehow |
| 512 | 16366625 | 1984 | 0.000015 | uk.ac.ucl |
| 513 | 16366040 | 690 | 0.000042 | org.bitbucket |
| 514 | 16365725 | 2162 | 0.000013 | ca.ualberta |
| 515 | 16364806 | 464 | 0.000053 | net.openid |
| 516 | 16364617 | 768 | 0.000038 | org.gradle |
| 517 | 16364203 | 1772 | 0.000017 | org.aclu |
| 518 | 16363540 | 1474 | 0.000022 | com.elpais |
| 519 | 16363484 | 724 | 0.000040 | com.yarnpkg |
| 520 | 16363077 | 2742 | 0.000010 | com.hubpages |
| 521 | 16362675 | 633 | 0.000043 | com.cargocollective |
| 522 | 16361964 | 1640 | 0.000019 | com.mercurynews |
| 523 | 16359187 | 1177 | 0.000032 | com.steampowered |
| 524 | 16358974 | 1845 | 0.000016 | edu.ufl |
| 525 | 16358831 | 1235 | 0.000030 | org.change |
| 526 | 16358719 | 583 | 0.000045 | gov.usda |
| 527 | 16358083 | 828 | 0.000036 | com.warriorplus |
| 528 | 16357302 | 1244 | 0.000029 | com.thenextweb |
| 529 | 16357275 | 1386 | 0.000024 | de.spiegel |
| 530 | 16357064 | 1158 | 0.000033 | com.proofpoint |
| 531 | 16356965 | 809 | 0.000037 | com.whitepages |
| 532 | 16353352 | 1188 | 0.000031 | gov.fcc |
| 533 | 16352815 | 1795 | 0.000017 | com.nfl |
| 534 | 16352068 | 1345 | 0.000026 | com.globo |
| 535 | 16351967 | 3065 | 0.000009 | com.answers |
| 536 | 16351579 | 706 | 0.000041 | org.jenkins-ci |
| 537 | 16351057 | 1557 | 0.000021 | com.billboard |
| 538 | 16350457 | 776 | 0.000038 | ly.snip |
| 539 | 16349376 | 1172 | 0.000032 | com.ggpht |
| 540 | 16349324 | 1780 | 0.000017 | org.ap |
| 541 | 16349172 | 1901 | 0.000015 | edu.indiana |
| 542 | 16349066 | 1445 | 0.000023 | com.nokia |
| 543 | 16348888 | 1943 | 0.000015 | com.ign |
| 544 | 16348185 | 1926 | 0.000015 | com.ikea |
| 545 | 16348047 | 1706 | 0.000018 | edu.umd |
| 546 | 16348024 | 52 | 0.000596 | com.messenger |
| 547 | 16347814 | 757 | 0.000039 | com.msdn |
| 548 | 16345860 | 1801 | 0.000017 | org.weforum |
| 549 | 16345787 | 526 | 0.000048 | org.doi |
| 550 | 16345233 | 202 | 0.000122 | jp.ameblo |
| 551 | 16344577 | 891 | 0.000035 | com.woot |
| 552 | 16344447 | 1283 | 0.000028 | com.patreon |
| 553 | 16343957 | 1538 | 0.000021 | br.com.blogspot |
| 554 | 16343414 | 132 | 0.000199 | ru.mail |
| 555 | 16343276 | 2104 | 0.000014 | com.oxforddictionaries |
| 556 | 16342590 | 744 | 0.000039 | com.photoshelter |
| 557 | 16341919 | 1322 | 0.000027 | gov.uspto |
| 558 | 16341309 | 1652 | 0.000019 | fr.lemonde |
| 559 | 16340939 | 1591 | 0.000020 | com.rollingstone |
| 560 | 16340630 | 1763 | 0.000017 | uk.co.metro |
| 561 | 16340043 | 602 | 0.000044 | com.sciencedirect |
| 562 | 16339730 | 3779 | 0.000007 | mx.unam |
| 563 | 16339436 | 944 | 0.000035 | com.hotfrog |
| 564 | 16338592 | 2127 | 0.000014 | com.fiverr |
| 565 | 16337795 | 173 | 0.000141 | jp.ne.hatena |
| 566 | 16337443 | 1840 | 0.000016 | com.aliexpress |
| 567 | 16336373 | 3072 | 0.000009 | com.123rf |
| 568 | 16336076 | 386 | 0.000063 | au.com.google |
| 569 | 16334801 | 1238 | 0.000029 | com.prweb |
| 570 | 16334152 | 1835 | 0.000016 | br.com.abril |
| 571 | 16332875 | 1487 | 0.000022 | com.pcmag |
| 572 | 16332192 | 873 | 0.000035 | ly.plot |
| 573 | 16331709 | 3250 | 0.000008 | com.blog |
| 574 | 16331549 | 391 | 0.000061 | us.icio |
| 575 | 16331199 | 837 | 0.000036 | com.folkd |
| 576 | 16331176 | 2316 | 0.000012 | org.kiva |
| 577 | 16330752 | 2396 | 0.000012 | edu.brown |
| 578 | 16330624 | 1478 | 0.000022 | com.qz |
| 579 | 16330490 | 1180 | 0.000032 | com.psychologytoday |
| 580 | 16329689 | 2088 | 0.000014 | com.newscientist |
| 581 | 16329114 | 1577 | 0.000020 | com.playstation |
| 582 | 16326425 | 1401 | 0.000024 | edu.si |
| 583 | 16324979 | 846 | 0.000036 | io.material |
| 584 | 16324000 | 1072 | 0.000033 | gov.usa |
| 585 | 16322896 | 1608 | 0.000020 | com.hulu |
| 586 | 16321672 | 1341 | 0.000026 | com.cafepress |
| 587 | 16321195 | 1986 | 0.000015 | ca.utoronto |
| 588 | 16321003 | 1597 | 0.000020 | com.econsultancy |
| 589 | 16320939 | 815 | 0.000037 | gov.copyright |
| 590 | 16320038 | 440 | 0.000055 | gov.irs |
| 591 | 16318690 | 3008 | 0.000009 | cc.co |
| 592 | 16318681 | 1834 | 0.000016 | com.canva |
| 593 | 16317432 | 2792 | 0.000010 | pt.sapo |
| 594 | 16315297 | 1705 | 0.000018 | com.colourlovers |
| 595 | 16314881 | 994 | 0.000034 | com.hotukdeals |
| 596 | 16314271 | 830 | 0.000036 | com.getskeleton |
| 597 | 16312515 | 1495 | 0.000022 | com.nymag |
| 598 | 16312187 | 406 | 0.000059 | com.barnesandnoble |
| 599 | 16311552 | 1258 | 0.000029 | org.worldbank |
| 600 | 16310245 | 2006 | 0.000014 | com.bestbuy |
| 601 | 16310108 | 1783 | 0.000017 | com.nhl |
| 602 | 16308879 | 2013 | 0.000014 | edu.uci |
| 603 | 16308831 | 1398 | 0.000024 | com.boston |
| 604 | 16308814 | 878 | 0.000035 | com.insiderpages |
| 605 | 16307539 | 2856 | 0.000010 | edu.tufts |
| 606 | 16307217 | 365 | 0.000066 | nl.google |
| 607 | 16306423 | 826 | 0.000037 | gov.hhs |
| 608 | 16306066 | 2229 | 0.000013 | edu.osu |
| 609 | 16306024 | 1716 | 0.000018 | edu.duke |
| 610 | 16304996 | 1226 | 0.000030 | com.hootsuite |
| 611 | 16304703 | 247 | 0.000093 | jp.co.amazon |
| 612 | 16302217 | 1534 | 0.000021 | gov.nyc |
| 613 | 16301587 | 1855 | 0.000016 | com.fifa |
| 614 | 16301259 | 1642 | 0.000019 | com.withgoogle |
| 615 | 16300727 | 424 | 0.000057 | com.clicky |
| 616 | 16298229 | 524 | 0.000048 | com.whatsapp |
| 617 | 16297862 | 704 | 0.000041 | com.redbubble |
| 618 | 16297676 | 2934 | 0.000009 | com.friendfeed |
| 619 | 16297637 | 1954 | 0.000015 | com.gawker |
| 620 | 16296981 | 1333 | 0.000027 | org.oecd |
| 621 | 16296568 | 2082 | 0.000014 | nl.xs4all |
| 622 | 16296383 | 1857 | 0.000016 | com.pastebin |
| 623 | 16295427 | 938 | 0.000035 | com.tiki-toki |
| 624 | 16294836 | 2809 | 0.000010 | edu.uic |
| 625 | 16294750 | 1281 | 0.000028 | com.istockphoto |
| 626 | 16294356 | 1605 | 0.000020 | com.hyatt |
| 627 | 16294322 | 2059 | 0.000014 | edu.tamu |
| 628 | 16293017 | 2164 | 0.000013 | edu.ncsu |
| 629 | 16292389 | 1385 | 0.000024 | com.com |
| 630 | 16292103 | 946 | 0.000034 | jp.ac.kobe-u |
| 631 | 16291741 | 906 | 0.000035 | com.quantcast |
| 632 | 16291635 | 1931 | 0.000015 | nl.blogspot |
| 633 | 16291618 | 803 | 0.000037 | com.webmd |
| 634 | 16291353 | 2823 | 0.000010 | com.wolfram |
| 635 | 16291336 | 729 | 0.000040 | ca.amazon |
| 636 | 16290941 | 455 | 0.000053 | net.launchpad |
| 637 | 16290095 | 2170 | 0.000013 | com.wikispaces |
| 638 | 16289600 | 1304 | 0.000028 | com.walmart |
| 639 | 16288978 | 3081 | 0.000009 | edu.colostate |
| 640 | 16288094 | 520 | 0.000048 | in.co.google |
| 641 | 16286417 | 1207 | 0.000031 | com.redhat |
| 642 | 16286409 | 1574 | 0.000020 | com.merriam-webster |
| 643 | 16286142 | 1730 | 0.000018 | int.wipo |
| 644 | 16284744 | 1196 | 0.000031 | com.adage |
| 645 | 16284151 | 1224 | 0.000030 | com.ups |
| 646 | 16283988 | 844 | 0.000036 | com.newsbank |
| 647 | 16283941 | 3078 | 0.000009 | com.squidoo |
| 648 | 16283791 | 1337 | 0.000026 | gov.dot |
| 649 | 16283705 | 1677 | 0.000018 | com.me |
| 650 | 16283271 | 1444 | 0.000023 | com.mediafire |
| 651 | 16283190 | 2122 | 0.000014 | ca.ubc |
| 652 | 16282261 | 2632 | 0.000011 | ca.uwaterloo |
| 653 | 16281888 | 1600 | 0.000020 | edu.unc |
6541628140520020.000015org.kde
6551628099921090.000014org.gimp
656162806114770.000051com.pingdom
6571627926225170.000011gd.is
6581627923327130.000010edu.hawaii
6591627820420760.000014com.aljazeera
6601627788015250.000021com.xbox
6611627659316110.000020com.freewebs
6621627606422530.000013com.britannica
6631627515916860.000018uk.co.mirror
6641627496222610.000013uk.co.timesonline
6651627410220650.000014au.com.news
6661627378815320.000021com.xkcd
6671627355411980.000031com.feedly
6681627304529310.000009com.laughingsquid
6691627229115070.000022gov.wa
6701627228618420.000016tv.periscope
6711627212214600.000023com.mixcloud
6721627025729190.000010com.codecademy
6731626987120030.000015edu.illinois
6741626939916790.000018uk.co.huffingtonpost
675162691073870.000062net.themeforest
6761626903116540.000019uk.co.ebay
677162688943790.000063com.ea
678162685369980.000034com.att
6791626842316660.000019net.daum
6801626796326130.000011ca.mcgill
681162650416590.000043com.houzz
6821626464815140.000022com.intuit
683162643355730.000046fr.amazon
6841626238420870.000014com.softpedia
6851626196818720.000016com.autodesk
686162618992070.000119org.icann
6871626185618120.000016com.deadline
6881626130627080.000010edu.vanderbilt
6891626120816430.000019com.foxbusiness
6901626054015980.000020gov.uscourts
691162590383800.000063com.heroku
6921625842914710.000022com.gumroad
6931625766722150.000013com.flipboard
6941625642515780.000020com.us
6951625621216340.000019de.welt
6961625576111930.000031com.deloitte
6971625473622760.000012com.yfrog
6981625468715960.000020org.owasp
6991625442427290.000010com.lynda
7001625413920460.000014org.coursera
701162534239420.000035com.cdbaby
7021625225913030.000028com.sagepub
7031625224315850.000020com.vmware
7041625222520450.000014net.earthlink
705162514177110.000041com.usnews
7061625131513830.000025org.unicef
7071625116537140.000007com.space
7081625074521210.000014com.vogue
709162497134230.000057com.cracked
7101624943618440.000016com.domain
711162492625050.000050net.yahoo
712162489212480.000092com.nielsen
7131624781810660.000033site.tenerifeforum
7141624774421290.000014com.theonion
715162474577320.000040com.atlassian
716162468817260.000040com.sharefile
717162453048210.000037org.osgeo
7181624475323290.000012com.searchenginejournal
7191624459115410.000021com.searchenginewatch
7201624386216690.000019com.windows
7211624380925030.000011org.greenpeace
7221624289410610.000033org.bravenewvoices
7231624269523140.000012edu.wustl
7241624247820150.000014uk.ac.lse
725162421489580.000034com.2findlocal
7261624195819290.000015edu.ucdavis
7271623899428080.000010edu.uoregon
728162386227720.000038org.openweathermap
7291623844514790.000022com.kissmetrics
7301623776920950.000014net.jsfiddle
7311623740516290.000019com.chron
7321623722519000.000015gov.usaid
7331623701114040.000024com.steamcommunity
734162361949410.000035com.ripple
7351623439817730.000017org.craigslist
7361623438017680.000017com.howstuffworks
737162338227880.000038com.hilton
7381623373513820.000025com.alibaba
7391623336125820.000011edu.uga
7401623297327250.000010edu.pitt
7411623284916630.000019com.yoast
7421623222628470.000010com.rottentomatoes
743162322032390.000097org.purl
7441623001314160.000024org.plos
7451622950618490.000016com.espn
7461622865720780.000014com.gamespot
7471622665741490.000007ca.yorku
7481622614524370.000012gov.cia
749162249683850.000063com.youku
7501622476916830.000018com.csmonitor
751162245118900.000035tv.twitch
7521622430035020.000008com.secondlife
7531622384314810.000022com.hollywoodreporter
7541622267016680.000019net.battle
7551622135718700.000016com.irishtimes
756162210829270.000035com.bizcommunity
7571622082023280.000012edu.vt
7581622067923750.000012com.technet
759162206698060.000037uk.co.currys
7601621920631570.000009com.avast
7611621746314850.000022org.fao
7621621725720550.000014com.twilio
763162157647170.000040com.netdna-cdn
7641621573028440.000010com.popsci
7651621529922050.000013com.podbean
7661621480512560.000029org.redcross
7671621394528130.000010org.kqed
7681621393714530.000023us.tx.state
769162138864200.000058br.com.google
7701621253017850.000017mil.navy
7711621178520100.000014com.netvibes
7721621171532550.000008edu.iastate
7731620934116310.000019com.animoto
7741620926829160.000010int.esa
7751620926122140.000013com.makezine
7761620803720240.000014edu.ucsf
7771620802935760.000008uk.ac.manchester
7781620662319040.000015com.foxsports
7791620602417920.000017com.blogtalkradio
7801620514513360.000026com.docker
7811620497516320.000019mil.army
7821620461823350.000012com.lonelyplanet
7831620434514400.000023jp.blogspot
7841620370635170.000008edu.wsu
7851620346217540.000017co.angel
786162029317020.000041com.technorati
7871620271316210.000019com.today
788162024732280.000104com.elegantthemes
7891620139515530.000021com.fedex
7901620133618270.000016com.macworld
7911620085516890.000018ru.spb
7921620044034910.000008org.eu
7931619979839400.000007edu.byu
7941619960419800.000015com.topsy
7951619951813080.000028gov.energy
7961619942519320.000015edu.umass
7971619801017820.000017org.cancer
798161976848270.000037com.themonitor
7991619720415390.000021gov.congress
800161971734530.000054com.zenfolio
801161951313260.000073com.newrelic
802161936379810.000034com.scribblemaps
8031619344513270.000027com.webnode
8041619335113090.000028com.zoho
8051619291614030.000024com.techrepublic
806161926004690.000052jp.ne.sakura
8071619194114750.000022com.html5rocks
808161915127520.000039gov.sec
809161910112460.000093me.line
810161903965600.000046gov.export
8111619032124210.000012com.redbull
8121619024512890.000028de.bund
8131619014812730.000028com.formstack
8141618940117270.000018org.pewresearch
8151618705525040.000011org.documentcloud
8161618615322060.000013com.denverpost
8171618510517910.000017com.freepik
8181618450111640.000032gov.justice
81916184479830.000405com.shareaholic
820161842118570.000036org.bouncycastle
821161841341810.000137info.aboutads
8221618307310060.000034com.weddingbee
82316182519220.001469com.wixstatic
8241618115318220.000016com.sky
8251618089835870.000008edu.syr
826161807044420.000055com.teamviewer
8271618050217410.000017edu.cuny
8281617943714920.000022de.heise
8291617939822900.000012com.refinery29
8301617910513730.000025com.gigaom
8311617905176530.000004nr.co
8321617894330660.000009com.seekingalpha
833161787955090.000049com.informit
8341617839421690.000013com.pbworks
8351617677928570.000010com.threadless
8361617652310090.000034com.spoke
8371617624416900.000018com.salon
838161753148640.000036com.tractorsupply
839161750364360.000055ru.vkontakte
8401617341970920.000004com.xanga
841161730618340.000036com.withoutabox
8421617216734310.000008edu.rochester
8431617200419280.000015google.blog
8441617199630390.000009cc.tiny
8451617186223380.000012com.sony
846161717454950.000050com.mapbox
8471617157922160.000013edu.uiuc
8481617143513690.000025com.justgiving
849161710189730.000034com.quandl
8501617090730440.000009edu.oregonstate
8511617077930880.000009edu.rice
852161702579890.000034com.citysquares
8531616930315220.000021com.accenture
8541616904517170.000018gov.weather
8551616826425780.000011ch.cern
8561616778723650.000012com.nbcsports
8571616766534580.000008tt.db
8581616745713110.000027gov.ny
8591616690437640.000007com.panoramio
860161664713980.000061com.list-manage1
8611616571634990.000008edu.fsu
8621616560216720.000019com.indeed
8631616482416700.000019org.gnome
8641616426723060.000012com.motherjones
8651616410932860.000008com.techsmith
8661616402320210.000014de.bild
867161637869870.000034com.zwire
868161637689360.000035org.gwtproject
8691616344115930.000020uk.co.thetimes
8701616174511830.000031com.hostgator
8711616168322470.000013com.shutterfly
8721616122475570.000004com.weheartit
8731616107810370.000033com.lacartes
8741615988521200.000014me.flavors
8751615949418690.000016com.digitaltrends
8761615887725180.000011com.lego
8771615886746850.000006com.skyrock
8781615824814550.000023com.ssrn
879161565917740.000038ru.google
8801615638016610.000019ru.narod
8811615531927110.000010au.edu.anu
8821615513529070.000010net.nocookie
8831615439516820.000018com.infoworld
8841615373617770.000017com.starbucks
8851615281710180.000034com.live5news
8861615193239840.000007to.gplus
8871615147040440.000007org.nypl
8881615145421060.000014com.trendmicro
8891615093516160.000019com.codeplex
8901615078616490.000019com.gettyimages
891161501625160.000049com.typeform
8921614938018140.000016com.amzn
8931614921217330.000018com.upwork
8941614895923740.000012com.hatenablog
8951614853712750.000028uk.co.eventbrite
8961614831129520.000009ly.cl
897161482679790.000034au.com.yelp
8981614766912210.000030com.linksynergy
8991614762324500.000012tv.blip
900161475939660.000034com.strawberryperl
9011614669225160.000011com.ezinearticles
9021614625057460.000005com.minus
9031614612312800.000028gov.archives
904161460029910.000034net.brownbook
9051614589520410.000014org.c-span
9061614583343990.000006com.treehugger
9071614569614960.000022se.google
9081614559014130.000024com.smashingmagazine
9091614520531200.000009com.askmen
9101614459923590.000012com.rt
9111614436213400.000026gov.sba
9121614336821910.000013com.madmimi
9131614328932010.000009com.voanews
9141614219310310.000034edu.alamo
9151614127812930.000028be.google
91616141249980.000323org.nginx
9171614025527900.000010com.asus
9181613977816990.000018com.techradar
9191613970220090.000014com.allthingsd
9201613907421500.000013com.mentalfloss
9211613895540090.000007net.minecraft
9221613770244170.000006com.pbase
9231613622316590.000019com.bloglovin
9241613601415230.000021com.forrester
925161359249290.000035com.sacurrent
9261613555611820.000032com.strikingly
9271613537717810.000017org.openoffice
9281613481710540.000033com.garmin
9291613475411570.000033org.postimg
9301613456524750.000011com.eonline
9311613418015950.000020com.lulu
9321613193618090.000016com.ibtimes
933161317789240.000035com.fabric
9341613171316550.000019com.zillow
935161316239900.000034com.shareasale
9361613149121610.000013com.history
9371613133215420.000021com.mcafee
9381613103154420.000005com.archdaily
939161307913240.000073com.cloudinary
9401613060437000.000007com.thingiverse
9411613041636330.000008com.starwars
9421613003931490.000009com.pitchfork
9431613000735280.000008com.gyazo
9441612970818610.000016ca.huffingtonpost
945161290393550.000068com.monster
9461612894740340.000007com.tistory
9471612878340790.000007edu.utk
9481612854938580.000007com.lmgtfy
9491612849610640.000033mp.mailchi
9501612786017240.000018com.ssllabs
9511612748112470.000029org.moodle
9521612630610170.000034org.simile-widgets
9531612614222310.000013com.invisionapp
9541612601521050.000014com.real
9551612528936400.000007edu.buffalo
9561612497333420.000008com.indiewire
957161249592830.000082org.debian
9581612481120300.000014com.ew
9591612481115310.000021com.uber
9601612474750510.000006edu.gsu
961161244578360.000036com.list-manage2
9621612438013640.000025net.java
9631612393311670.000032com.tandfonline
964161239114860.000051com.taobao
9651612360316600.000019com.bmj
9661612324034200.000008org.lifehack
9671612280823020.000012com.canalblog
9681612259721410.000013edu.ucsc
969161223689800.000034org.tpr
9701612235827810.000010nl.utwente
9711612160819410.000015com.getresponse
9721612157726310.000011com.dallasnews
9731612099822370.000013edu.colorado
9741611856016380.000019com.ecwid
9751611847612870.000028es.amazon
9761611847110220.000034com.ibegin
9771611813516370.000019com.deezer
9781611798913940.000024jp.ne.goo
9791611777219710.000015jp.ne.biglobe
9801611775621300.000014edu.bu
981161177092140.000114com.homestead
982161174779310.000035com.chamberofcommerce
9831611613058920.000005ie.tcd
9841611577240850.000007edu.uconn
9851611473135900.000008edu.usf
9861611470215260.000021com.warnerbros
9871611434847770.000006ca.ucalgary
9881611385720140.000014hk.com.google
989161137861780.000139com.parallels
9901611346718410.000016com.getfirebug
9911611321915300.000021com.waze
9921611314133720.000008ru.org
9931611294931830.000009com.polyvore
9941611262424730.000011com.campaignmonitor
9951611255516840.000018com.thehill
996161124079850.000034com.showmelocal
9971611235313210.000027gov.usgs
9981611193719080.000015jp.or.nhk
9991611175758510.000005com.rapidshare
10001611164730400.000009com.expedia

Graphs of January 2018 Crawl

We erroneously released web graphs and rankings based on a single monthly crawl (January 2018) instead of a quarterly release covering three crawls. To ensure reproducibility, we’ve preserved the erroneous release.

The host-level graph consists of 775 million nodes and 2.7 billion edges. The graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. There are 719 million dangling nodes (93%).

Download files of the Common Crawl Jan 2018 host-level webgraph

Size      File                                   Description
4.84 GB   cc-main-2018-jan-host-vertices.txt.gz  nodes ⟨id, rev host⟩
10.21 GB  cc-main-2018-jan-host-edges.txt.gz     edges ⟨from_id, to_id⟩
4.90 GB   cc-main-2018-jan-host.graph            graph in BVGraph format
2 kB      cc-main-2018-jan-host.properties
5.94 GB   cc-main-2018-jan-host-t.graph          transpose of the graph (outlinks mapped to inlinks)
2 kB      cc-main-2018-jan-host-t.properties
1 kB      cc-main-2018-jan-host.stats            WebGraph statistics
10.79 GB  cc-main-2018-jan-host-ranks.txt.gz     harmonic centrality and pagerank
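
The vertices and edges text dumps can be combined to recompute simple statistics such as the share of dangling nodes. The following is a minimal sketch, not the pipeline used for the release. It assumes that both files are tab-separated, and it approximates dangling nodes as vertices that never appear as from_id (no outgoing links). The function name and the sample data are hypothetical.

```python
def count_dangling(vertex_lines, edge_lines):
    """Return (number of nodes, number of nodes without outgoing edges).

    Assumes tab-separated lines: "id<TAB>reversed host" for vertices and
    "from_id<TAB>to_id" for edges (an assumption about the text dumps).
    """
    node_ids = set()
    for line in vertex_lines:
        line = line.strip()
        if line:
            node_ids.add(int(line.split("\t")[0]))

    has_outlink = set()
    for line in edge_lines:
        line = line.strip()
        if line:
            from_id = line.split("\t")[0]
            has_outlink.add(int(from_id))

    dangling = node_ids - has_outlink
    return len(node_ids), len(dangling)


# Tiny made-up sample instead of the multi-gigabyte release files.
sample_vertices = ["0\tcom.example", "1\torg.example", "2\tnet.example.blog"]
sample_edges = ["0\t1", "0\t2", "1\t2"]
total, dangling = count_dangling(sample_vertices, sample_edges)
print(f"{dangling} of {total} nodes are dangling ({dangling / total:.0%})")
```

For the real files, the line iterables could come from gzip.open(path, "rt"). Keeping all node IDs in Python sets will need a lot of memory at this scale, so the sketch is meant for samples or for the smaller domain-level graph.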

The domain-level graph has 70 million nodes and 835 million edges. 60% of the nodes (42 million) are dangling nodes; the largest strongly connected component covers 22 million nodes, or 31%.

Download files of the Common Crawl Jan 2018 domain-level webgraph

Size     File                                     Description
0.49 GB  cc-main-2018-jan-domain-vertices.txt.gz  nodes ⟨id, rev domain, num hosts⟩
3.30 GB  cc-main-2018-jan-domain-edges.txt.gz     edges ⟨from_id, to_id⟩
1.80 GB  cc-main-2018-jan-domain.graph            graph in BVGraph format
2 kB     cc-main-2018-jan-domain.properties
1.89 GB  cc-main-2018-jan-domain-t.graph          transpose of the graph
2 kB     cc-main-2018-jan-domain-t.properties
1 kB     cc-main-2018-jan-domain.stats            WebGraph statistics
1.46 GB  cc-main-2018-jan-domain-ranks.txt.gz     harmonic centrality and pagerank

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

January 2018 Crawl Archive Now Available

The crawl archive for January 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-05/. It contains 3.4 billion web pages and 270 TiB of uncompressed content, crawled between January 16th and January 24th.

The January crawl contains 1.1 billion new URLs, not contained in any crawl archive before. New URLs are “mined” by

  • extracting and sampling URLs from sitemaps if provided by any of the highest-ranking 100 million hosts taken from the Aug/Sept/Oct 2017 webgraph data set
  • a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts or top 25 million domains of the webgraph dataset
  • a random sample taken from WAT files of the December crawl
  • and the continued and increased donation of URLs from mixnode.com
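
The hop-limited side crawl in the second item can be pictured as a breadth-first traversal that stops expanding once the hop limit is reached. The snippet below is only an illustrative sketch: the function name, the in-memory link map, and the example.com URLs are assumptions, and a real crawler would fetch and parse pages instead of looking up a dictionary.

```python
from collections import deque


def side_crawl(seeds, outlinks, max_hops=4):
    """Breadth-first traversal returning {url: hops from the nearest seed}.

    `outlinks` maps a URL to the URLs it links to; in a real crawler this
    lookup would be a fetch-and-parse step.
    """
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in outlinks.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link structure: a chain of pages below one home page.
links = {
    "http://example.com/": ["http://example.com/a"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": ["http://example.com/d"],
    "http://example.com/d": ["http://example.com/e"],
}
found = side_crawl(["http://example.com/"], links)
for url, distance in sorted(found.items(), key=lambda item: item[1]):
    print(distance, url)
```

In this sample the page /d is found at hop 4, while /e (five hops from the home page) is never visited.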

We were able to further shrink the overlap between successive crawls: the last two monthly archives (December and January) taken together contain content from 6 billion URLs; the last three archives (Nov/Dec/Jan) cover 8 billion unique URLs.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2018-05/segment.paths.gz             100
WARC files               CC-MAIN-2018-05/warc.paths.gz              80000  74.33
WAT files                CC-MAIN-2018-05/wat.paths.gz               80000  21.36
WET files                CC-MAIN-2018-05/wet.paths.gz               80000  9.29
Robots.txt files         CC-MAIN-2018-05/robotstxt.paths.gz         80000  0.18
Non-200 responses files  CC-MAIN-2018-05/non200responses.paths.gz   80000  3.02
URL index files          CC-MAIN-2018-05/cc-index.paths.gz            302  0.27
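
Turning a path listing into full S3 or HTTP paths takes only a few lines of code. The sketch below uses the two prefixes given in the text; the helper name and the sample listing (including its file names) are made up for illustration, and an in-memory gzip blob stands in for a downloaded warc.paths.gz.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(listing_bytes, prefix):
    """Decompress a *.paths.gz listing and prepend `prefix` to each line."""
    with gzip.open(io.BytesIO(listing_bytes), "rt", encoding="utf-8") as lines:
        return [prefix + line.strip() for line in lines if line.strip()]


# Stand-in for a downloaded warc.paths.gz; the file names are made up.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2018-05/segments/example-segment/warc/example-00000.warc.gz\n"
    b"crawl-data/CC-MAIN-2018-05/segments/example-segment/warc/example-00001.warc.gz\n"
)
for path in full_paths(sample, HTTP_PREFIX):
    print(path)
```

Passing S3_PREFIX instead of HTTP_PREFIX yields the corresponding s3:// paths.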

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2018-05/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

We are grateful to our friends at mixnode for donating a seed list of 400 Million URLs to enhance the Common Crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

December 2017 Crawl Archive Now Available

The crawl archive for December 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-51/. It contains 2.9 billion web pages and over 240 TiB of uncompressed content.

To improve coverage and freshness, we added 650 million new URLs (not contained in any crawl archive before):

  • sampled from sitemaps if provided by any of the top 80 million hosts taken from the Aug/Sept/Oct 2017 webgraph data set
  • found by a side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 25 million hosts and domains
  • a random sample taken from WAT files of the November crawl
  • and the continued donation of URLs from mixnode.com

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2017-51/segment.paths.gz             100
WARC files               CC-MAIN-2017-51/warc.paths.gz              80000  61.2
WAT files                CC-MAIN-2017-51/wat.paths.gz               80000  19.38
WET files                CC-MAIN-2017-51/wet.paths.gz               80000  8.41
Robots.txt files         CC-MAIN-2017-51/robotstxt.paths.gz         80000  0.13
Non-200 responses files  CC-MAIN-2017-51/non200responses.paths.gz   80000  1.47
URL index files          CC-MAIN-2017-51/cc-index.paths.gz            302  0.21

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-51/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

We are grateful to our friends at mixnode for donating a seed list of 300+ Million URLs to enhance the Common Crawl.


November 2017 Crawl Archive Now Available

The crawl archive for November 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-47/. It contains 3.2 billion web pages and 260 TiB of uncompressed content.

To improve coverage and freshness, we added 750 million new URLs (not contained in any crawl archive before):

  • sampled from sitemaps if provided by any of the top 80 million hosts taken from the Aug/Sept/Oct 2017 webgraph data set
  • found by a side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 10 million hosts and domains
  • a random sample taken from WAT files of the October crawl
  • and the continued donation of URLs from mixnode.com

For the first time, during the November crawl, we took measures to actively fight link spam. In the past, our policy was to direct the crawl to relevant content, a strategy that avoids spam but does not exclude it. Spam is a valid object of research, and thus spammy content is included in our crawl archives. Spam should not affect other use cases (e.g., mining data for natural language processing) as long as it represents a very low percentage of all documents. However, during this crawl, we faced significant technical challenges caused by link spam.

Penalizing spam domains is the easiest way for us to avoid further issues and also to ensure that these spam clusters do not start to dominate future crawls.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

                         File List                                 #Files  Total Size Compressed (TiB)
Segments                 CC-MAIN-2017-47/segment.paths.gz             100
WARC files               CC-MAIN-2017-47/warc.paths.gz              80000  66.17
WAT files                CC-MAIN-2017-47/wat.paths.gz               80000  20.71
WET files                CC-MAIN-2017-47/wet.paths.gz               80000  9.06
Robots.txt files         CC-MAIN-2017-47/robotstxt.paths.gz         80000  0.16
Non-200 responses files  CC-MAIN-2017-47/non200responses.paths.gz   80000  2.33
URL index files          CC-MAIN-2017-47/cc-index.paths.gz            302  0.24

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-47/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

We are grateful to our friends at mixnode for donating a seed list of 300+ Million URLs to enhance the Common Crawl.


Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017. These graphs, along with ranked lists of hosts and domains, follow the first (February, March, April 2017) and second (May, June, July 2017) web graph releases. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in a prior announcement.

What’s new?

Here is a summary of notable aspects of this web graph release:

  • Tools and scripts to produce the web graph and rank the graph vertices are released as part of the project “cc-webgraph” on GitHub.
  • As compared to prior web graphs, two changes are caused by the large size of this host-level graph (5.1 billion hosts):
    • the text dump of the graph is split into multiple files;
    • there is no page rank calculation at this time. At present, we provide ranking by harmonic centrality, and hope to add page rank values in the upcoming weeks.
    • Update Feb 7, 2018: the host-level ranks file now also contains the page ranks. Thanks to Sebastiano Vigna, one of the authors of the WebGraph framework, for the kind support!
  • For the domain-level graph, we provide ranking by both harmonic centrality and page rank.
  • The host-level graph contains a significant portion of hosts related to link spam clusters (possibly 50% or more of the hosts). This data set, therefore, is a useful tool for the study of link spam; from it, we have identified 300,000 spam domains. 2.25 billion hosts in the host-level webgraph belong to these domains. However, in the October crawl archive, these domains comprise less than 2% of the crawled HTML pages (56 million pages out of 3.6 billion) and less than 0.3% of the crawled domains (70,000 out of 26 million). We will start to penalize pages from these domains going forward.

Host-level graph

The graph consists of 5.1 billion nodes and 18.8 billion edges. The graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
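
The reversed host name notation is easy to reproduce when joining the graph with other data. A minimal sketch, assuming nothing beyond the rule stated above (strip a leading www., reverse the dot-separated labels); the function name is hypothetical, and any further normalization the release pipeline may apply is not covered.

```python
def reverse_host(host):
    """Strip a leading "www." and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
```

Applying the same label reversal to com.example.subdomain gives back subdomain.example.com; the stripped www. prefix cannot be recovered.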

You can download the graph and the ranks of all 5.1 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/hostgraph/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/hostgraph/ as a prefix to access the files from anywhere.

The following files and formats are provided:

Size     File                  Description
27.9 GB  vertices.paths.gz     nodes ⟨id, rev host⟩, paths of 48 vertices files
95.2 GB  edges.paths.gz        edges ⟨from_id, to_id⟩, paths of 72 edge files
37.9 GB  bvgraph.graph         graph in BVGraph format
2.0 GB   bvgraph.offsets
2 kB     bvgraph.properties
56.9 GB  bvgraph-t.graph       transpose of the graph (outlinks mapped to inlinks)
6.7 GB   bvgraph-t.offsets
2 kB     bvgraph-t.properties
1 kB     bvgraph.stats         WebGraph statistics
74 GB    ranks.txt.gz          hosts ranked by harmonic centrality and pagerank

To download the graph in text format, you need to download all files listed in the two path listings.

Domain-level graph

The domain graph was built by aggregating the host graph at the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not (cf. section “divisions” in the documentation on publicsuffix.org). For example, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated into the domains blogspot.com and amazonaws.com, respectively.
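
To illustrate the aggregation rule, here is a simplified sketch of a pay-level domain lookup. It is not the code used to build the graph: the suffix set is a tiny hand-picked stand-in for the ICANN section of the public suffix list, wildcard and exception rules of the real list are ignored, and the function name is hypothetical. Private entries such as blogspot.com are deliberately left out of the set, which is why hosts below them collapse into the registered domain.

```python
# Tiny stand-in for the ICANN section of the public suffix list.
ICANN_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host, suffixes=ICANN_SUFFIXES):
    """Return the longest matching public suffix plus one more label.

    Returns None if the host has no known suffix or is itself a suffix.
    """
    labels = host.lower().split(".")
    # Candidates shrink from the left, so the longest suffix matches first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is a bare public suffix
            return ".".join(labels[i - 1:])
    return None


for host in ("foo.blogspot.com", "commoncrawl.s3.amazonaws.com", "www.bbc.co.uk"):
    print(host, "->", pay_level_domain(host))
```

With this sample set, the two hosts from the text map to blogspot.com and amazonaws.com, and www.bbc.co.uk maps to bbc.co.uk because co.uk is the longest matching suffix.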

The domain-level graph has 93 million nodes and 1,258 million edges. 60% of the nodes (56 million) are dangling nodes; the largest strongly connected component covers 31 million nodes, or 33%.

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/domaingraph/ or via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/domaingraph/.

Download files of the Common Crawl Aug/Sept/Oct 2017 domain-level webgraph

Size     File                  Description
0.65 GB  vertices.txt.gz       nodes ⟨id, rev host⟩
5.0 GB   edges.txt.gz          edges ⟨from_id, to_id⟩
2.7 GB   bvgraph.graph         graph in BVGraph format
0.09 GB  bvgraph.offsets
2 kB     bvgraph.properties
3.0 GB   bvgraph-t.graph       transpose of the graph (outlinks mapped to inlinks)
0.14 GB  bvgraph-t.offsets
2 kB     bvgraph-t.properties
1 kB     bvgraph.stats         WebGraph statistics
1.9 GB   ranks.txt.gz          harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 93 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (Aug/Sept/Oct 2017)

harmonic centrality rank  hc value  page rank  page rank value  reversed domain name
12462439410.016162com.facebook
22248914230.009682com.google
32223066640.009431com.twitter
42192025020.011812com.googleapis
52144646250.008148com.youtube
61915827260.005169org.gmpg
71891859670.003893com.instagram
81878007490.003043com.linkedin
918322072100.002469org.wordpress
1018163650200.001295org.wikipedia
1117977714160.001665com.pinterest
1217934400230.001146com.wordpress
1317769666300.000940com.blogspot
1417754502150.001705com.apple
1517668520190.001310com.gravatar
1617559336210.001278com.vimeo
1717418572110.002239com.adobe
1817384530260.001022com.microsoft
1917371196360.000819com.amazon
2017338218140.001979com.macromedia
2117291274560.000617com.flickr
2217174674480.000707com.tumblr
2317140484460.000716be.youtu
2417134992390.000792gl.goo
2517053582270.001006com.paypal
2616948294720.000500com.yahoo
2716940548730.000498ly.bit
2816907622250.001066com.amazonaws
2916905114350.000836me.wp
3016860996970.000263com.nytimes
3116822716380.000794com.github
3216802262320.000884io.github
3316747451800.000376org.creativecommons
34167136171070.000223com.googleusercontent
3516708091860.000325com.weebly
36166913361100.000213com.blogger
3716675803620.000602net.cloudfront
3816652535420.000756com.huffingtonpost
3916626251780.000398eu.europa
4016602228770.000421org.mozilla
41166000541540.000144org.wikimedia
42165981991300.000165net.slideshare
43165699921290.000166com.myspace
44  16564424  22  0.001153  net.fbcdn
45  16558860  79  0.000378  com.medium
46  16542265  31  0.000902  com.cloudflare
47  16520668  66  0.000558  org.w3
48  16520419  44  0.000732  com.cnn
49  16509535  17  0.001657  com.bootstrapcdn
50  16488554  102  0.000239  com.android
51  16467545  192  0.000118  com.photobucket
52  16418457  85  0.000342  com.soundcloud
53  16413233  165  0.000135  com.ebay
54  16409927  214  0.000103  com.about
55  16407381  127  0.000169  org.apache
56  16404812  70  0.000505  com.wp
57  16404114  274  0.000081  gov.nasa
58  16362115  105  0.000230  com.yelp
59  16346871  88  0.000315  co.t
60  16344031  248  0.000090  com.livejournal
61  16332901  145  0.000151  uk.co.bbc
62  16319548  112  0.000198  com.issuu
63  16317299  285  0.000078  com.cnbc
64  16312613  132  0.000164  org.ietf
65  16312538  106  0.000224  com.dropbox
66  16303111  243  0.000091  uk.co.telegraph
67  16302093  300  0.000075  com.appspot
68  16302037  135  0.000158  com.forbes
69  16294126  133  0.000162  net.sourceforge
70  16285841  71  0.000502  com.gstatic
71  16256975  234  0.000093  org.npr
72  16240414  84  0.000342  com.wix
73  16232633  180  0.000126  com.live
74  16224880  143  0.000152  com.spotify
75  16220531  358  0.000061  com.alexa
76  16220429  40  0.000787  com.statcounter
77  16219025  29  0.000952  com.squarespace
78  16216103  52  0.000667  com.mashable
79  16206276  288  0.000077  edu.mit
80  16197041  185  0.000120  com.oracle
81  16196376  50  0.000691  com.bing
82  16184193  146  0.000150  com.imgur
83  16181303  331  0.000067  gov.loc
84  16166139  181  0.000126  com.disqus
85  16162488  47  0.000714  net.akamaihd
86  16160177  184  0.000124  com.typepad
87  16156797  226  0.000096  com.mozilla
88  16149031  266  0.000083  me.about
89  16147987  115  0.000196  com.baidu
90  16144928  104  0.000233  com.reddit
91  16139722  152  0.000145  com.theguardian
92  16139491  431  0.000050  edu.ucla
93  16137233  403  0.000054  com.evernote
94  16133815  117  0.000192  org.archive
95  16129264  153  0.000144  org.gnu
96  16123931  286  0.000078  com.foursquare
97  16123753  325  0.000069  com.buzzfeed
98  16119679  219  0.000099  com.techcrunch
99  16111561  176  0.000128  com.imdb
100  16095828  376  0.000058  com.slate
101  16084632  189  0.000119  edu.stanford
102  16083049  8  0.003658  com.godaddy
103  16081934  348  0.000063  com.mysql
104  16081543  186  0.000120  com.wsj
105  16081479  329  0.000068  com.w3schools
106  16077599  149  0.000148  com.etsy
107  16075216  471  0.000047  uk.ac.ox
108  16070390  235  0.000093  com.nbcnews
109  16067264  271  0.000082  com.wired
110  16058778  223  0.000097  com.tinyurl
111  16052972  241  0.000091  com.cnet
112  16047962  292  0.000077  com.reuters
113  16037078  37  0.000802  com.fb
114  16025773  554  0.000042  edu.princeton
115  16021981  76  0.000429  com.paypalobjects
116  16009160  291  0.000077  com.meetup
117  16007189  228  0.000096  com.businessinsider
118  16007189  169  0.000132  com.digg
119  16000273  290  0.000077  edu.harvard
120  15992746  415  0.000052  org.nodejs
121  15987529  224  0.000097  gov.ca
122  15981478  122  0.000186  com.feedburner
123  15981363  158  0.000142  com.opera
124  15978615  125  0.000171  com.twimg
125  15978398  412  0.000053  edu.berkeley
126  15975429  168  0.000132  com.dribbble
127  15967402  268  0.000082  com.bloomberg
128  15967162  310  0.000071  uk.co.dailymail
129  15962655  284  0.000078  com.msn
130  15962653  445  0.000049  org.worldbank
131  15959796  116  0.000196  com.jquery
132  15959089  81  0.000376  net.doubleclick
133  15952785  68  0.000520  org.schema
134  15946127  308  0.000072  com.bbc
135  15945993  252  0.000090  com.aol
136  15943161  212  0.000105  com.go
137  15941345  254  0.000089  com.usatoday
138  15933198  492  0.000046  com.theverge
139  15931703  277  0.000080  com.ibm
140  15926311  90  0.000284  com.addthis
141  15926283  128  0.000166  com.eventbrite
142  15924888  504  0.000045  co.g
143  15924857  231  0.000095  com.staticflickr
144  15921145  302  0.000074  com.time
145  15919014  210  0.000105  com.surveymonkey
146  15918521  208  0.000106  com.washingtonpost
147  15916491  414  0.000053  org.pbs
148  15916267  332  0.000067  uk.co.blogspot
149  15915021  164  0.000138  com.zendesk
150  15912031  350  0.000062  com.images-amazon
151  15911961  141  0.000153  gov.nih
152  15911136  295  0.000076  com.latimes
153  15907787  270  0.000082  au.com.google
154  15904282  460  0.000047  gov.wa
155  15904089  526  0.000044  org.eclipse
156  15900204  430  0.000050  uk.co.guardian
157  15897748  45  0.000720  me.fb
158  15897738  318  0.000070  com.gmail
159  15891803  505  0.000045  com.pixabay
160  15888886  375  0.000058  org.un
161  15884396  420  0.000051  com.variety
162  15882443  120  0.000187  com.list-manage
163  15881628  560  0.000041  edu.washington
164  15880410  113  0.000198  uk.co.google
165  15877859  642  0.000039  org.chromium
166  15866419  557  0.000042  org.sciencemag
167  15861689  591  0.000040  com.chron
168  15858537  429  0.000050  org.python
169  15856599  813  0.000036  it.scoop
170  15851608  298  0.000076  com.example
171  15844579  888  0.000033  edu.gatech
172  15833612  800  0.000037  com.arstechnica
173  15829201  1091  0.000026  com.panoramio
174  15826638  845  0.000035  edu.illinois
175  15822824  213  0.000104  com.rackcdn
176  15821250  225  0.000096  net.php
177  15819582  313  0.000071  org.acm
178  15816637  461  0.000047  com.scribd
179  15815448  468  0.000047  com.dropboxusercontent
180  15812590  368  0.000059  com.kickstarter
181  15811866  551  0.000042  com.quora
182  15811831  118  0.000188  com.jimdo
183  15810448  215  0.000102  gov.ftc
184  15807293  303  0.000074  com.stackoverflow
185  15801389  194  0.000117  org.drupal
186  15801000  101  0.000239  org.bbb
187  15799715  644  0.000039  com.withgoogle
188  15798347  69  0.000516  com.vk
189  15797645  919  0.000032  edu.utah
190  15795447  395  0.000055  com.theatlantic
191  15792110  406  0.000054  edu.cornell
192  15789477  886  0.000033  com.flipboard
193  15789159  433  0.000050  com.cisco
194  15782967  276  0.000080  fr.free
195  15780935  456  0.000048  com.getpocket
196  15774946  28  0.000982  ru.yandex
197  15774174  202  0.000110  uk.co.amazon
198  15773176  346  0.000063  gov.whitehouse
199  15769731  807  0.000036  edu.columbia
200  15769040  459  0.000047  com.venturebeat
201  15768429  363  0.000060  com.webs
202  15762156  848  0.000035  edu.yale
203  15760325  493  0.000046  com.zdnet
204  15759707  632  0.000039  org.kernel
205  15755584  895  0.000032  com.businessweek
206  15754511  656  0.000038  com.economist
207  15751466  916  0.000032  com.jetbrains
208  15744171  938  0.000031  uk.org.tate
209  15743842  606  0.000040  com.libsyn
210  15742882  142  0.000152  com.windowsphone
211  15742594  447  0.000049  au.gov.nsw
212  15734592  407  0.000053  com.inc
213  15733632  991  0.000029  com.googlecode
214  15732803  187  0.000120  com.mailchimp
215  15729191  55  0.000619  com.people
216  15728871  200  0.000112  net.behance
217  15727994  343  0.000064  com.wiley
218  15726723  359  0.000061  com.wikihow
219  15725173  49  0.000698  com.googletagmanager
220  15724381  95  0.000279  de.google
221  15722940  87  0.000324  com.qq
222  15720539  1011  0.000028  com.storify
223  15720127  612  0.000039  com.box
224  15714779  175  0.000129  jp.co.yahoo
225  15710657  835  0.000035  org.unicode
226  15710342  694  0.000038  com.vogue
227  15709514  532  0.000043  com.samsung
228  15708779  240  0.000092  com.salesforce
229  15704117  453  0.000049  com.deviantart
230  15703085  870  0.000034  com.ecwid
231  15701297  314  0.000071  com.barnesandnoble
232  15699018  372  0.000058  com.oreilly
233  15698383  900  0.000032  org.arxiv
234  15698294  1012  0.000028  edu.rice
235  15697335  111  0.000208  com.shopify
236  15694572  317  0.000070  com.tripod
237  15693750  506  0.000045  com.wikia
238  15692548  327  0.000068  com.dailymotion
239  15692268  1119  0.000025  com.diigo
240  15691299  590  0.000040  com.nationalgeographic
241  15690673  140  0.000153  com.tripadvisor
242  15690554  379  0.000057  com.office
243  15689753  147  0.000149  com.stumbleupon
244  15687628  57  0.000615  com.bleacherreport
245  15686291  263  0.000085  gov.cdc
246  15682833  978  0.000029  com.discogs
247  15679965  960  0.000030  ms.1drv
248  15679700  966  0.000030  com.hbo
249  15678267  868  0.000034  org.eff
250  15678215  470  0.000047  gov.dot
251  15674964  660  0.000038  gov.fcc
252  15674650  705  0.000038  com.tinypic
253  15673888  566  0.000041  com.vice
254  15673605  330  0.000068  com.skype
255  15669074  387  0.000056  com.cbsnews
256  15667839  518  0.000044  com.blackberry
257  15664466  926  0.000031  org.ieee
258  15663927  802  0.000037  com.googleblog
259  15662453  898  0.000032  gov.ky
260  15661561  342  0.000064  int.who
261  15658654  735  0.000037  com.unsplash
262  15658214  737  0.000037  com.indiatimes
263  15655416  482  0.000046  com.git-scm
264  15647888  366  0.000060  com.ted
265  15647883  227  0.000096  com.mapquest
266  15645919  1073  0.000026  com.sublimetext
267  15644112  864  0.000034  gov.mo
268  15643731  570  0.000041  com.foxnews
269  15643086  805  0.000036  com.livestream
270  15635833  360  0.000061  com.springer
271  15635534  811  0.000036  gov.michigan
272  15633180  474  0.000046  com.npmjs
273  15631559  967  0.000030  com.ning
274  15630137  578  0.000041  com.java
275  15630040  546  0.000043  de.blogspot
276  15629832  933  0.000031  gov.mt
277  15626210  972  0.000029  in.blogspot
278  15626170  669  0.000038  com.sfgate
279  15621493  1101  0.000026  com.trello
280  15621434  1076  0.000026  org.amnesty
281  15620756  963  0.000030  com.hatenablog
282  15618941  499  0.000045  com.ft
283  15617683  834  0.000035  com.marketwatch
284  15615909  150  0.000145  com.ytimg
285  15615850  832  0.000036  com.yellowpages
286  15614805  1042  0.000027  edu.psu
287  15614789  862  0.000034  gov.oregon
288  15613749  1049  0.000027  edu.ucsd
289  15612913  1273  0.000022  com.codeplex
290  15612778  922  0.000031  com.ubuntu
291  15611819  1161  0.000024  edu.purdue
292  15610229  405  0.000054  com.goodreads
293  15609712  1896  0.000017  com.fifa
294  15608683  218  0.000099  com.wufoo
295  15607880  275  0.000080  com.hubspot
296  15607085  894  0.000033  gov.nist
297  15605771  345  0.000063  com.sxsw
298  15605337  530  0.000043  gov.state
299  15603425  964  0.000030  edu.upenn
300  15602070  1084  0.000026  com.alibaba
301  15601468  1025  0.000028  com.boston
302  15599982  851  0.000035  com.timeanddate
303  15598270  675  0.000038  org.aarp
304  15597733  195  0.000117  de.amazon
305  15595192  344  0.000063  com.prnewswire
306  15594528  1143  0.000025  com.posterous
307  15590525  795  0.000037  com.atlassian
308  15588435  1298  0.000022  net.comcast
309  15587626  1029  0.000028  edu.wisc
310  15587161  1067  0.000027  com.qz
311  15586347  481  0.000046  com.intel
312  15585861  1188  0.000024  com.instapaper
313  15585841  935  0.000031  com.politico
314  15585103  1385  0.000020  com.ehow
315  15584724  962  0.000030  com.pcmag
316  15584332  1068  0.000026  uk.ac.cam
317  15584130  1210  0.000023  com.vox
318  15579785  399  0.000054  edu.cmu
319  15574628  386  0.000056  com.symantec
320  15573666  385  0.000056  com.snapchat
321  15573517  438  0.000049  com.entrepreneur
322  15573187  497  0.000045  com.nature
323  15571901  792  0.000037  com.weather
324  15569480  968  0.000030  com.gizmodo
325  15566198  974  0.000029  com.nintendo
326  15565922  901  0.000032  com.lifehacker
327  15564424  487  0.000046  com.xrea
328  15563896  159  0.000141  com.weibo
329  15562578  979  0.000029  edu.utexas
330  15562258  264  0.000084  com.getbootstrap
331  15560672  390  0.000056  com.businesswire
332  15557969  1097  0.000026  com.hotmail
333  15554890  821  0.000036  us.imageshack
334  15552184  294  0.000076  net.themeforest
335  15550866  338  0.000065  org.debian
336  15550791  131  0.000164  com.bandcamp
337  15549249  822  0.000036  net.daringfireball
338  15547805  1230  0.000023  edu.northwestern
339  15545647  1134  0.000025  com.discovery
340  15544046  2268  0.000014  com.wikidot
341  15540991  990  0.000029  com.indiegogo
342  15539810  997  0.000028  co.vine
343  15539129  880  0.000033  com.engadget
344  15538263  934  0.000031  com.stackexchange
345  15537892  256  0.000088  com.smugmug
346  15537882  548  0.000043  com.newyorker
347  15536239  1907  0.000017  edu.cuny
348  15535776  1028  0.000028  id.co.blogspot
349  15533870  197  0.000116  com.wixsite
350  15533376  846  0.000035  tv.ustream
351  15533354  882  0.000033  fr.blogspot
352  15532951  119  0.000188  com.constantcontact
353  15532138  629  0.000039  us.mn.state
354  15532035  2187  0.000014  com.twitpic
355  15531885  495  0.000046  com.moz
356  15530153  1513  0.000019  org.khanacademy
357  15529512  333  0.000067  mp.j
358  15529460  952  0.000030  gov.fbi
359  15528428  512  0.000044  com.giphy
360  15527665  1005  0.000028  au.net.abc
361  15526603  1297  0.000022  ie.thejournal
362  15525234  599  0.000040  com.uk
363  15524467  925  0.000031  gov.usgs
364  15524258  876  0.000033  edu.umich
365  15523358  906  0.000032  org.change
366  15522235  625  0.000039  org.redcross
367  15520950  258  0.000087  to.amzn
368  15518905  883  0.000033  gov.maryland
369  15518473  364  0.000060  com.fastcompany
370  15516021  13  0.002020  com.wixstatic
371  15515303  1378  0.000020  org.owasp
372  15514772  1206  0.000023  com.googledrive
373  15513946  986  0.000029  org.plos
374  15513913  1525  0.000019  org.cambridge
375  15513359  869  0.000034  ca.cbc
376  15510150  871  0.000033  com.mlb
377  15509159  1043  0.000027  com.dell
378  15509019  1153  0.000024  com.nba
379  15508147  980  0.000029  com.pingdom
380  15507985  872  0.000033  com.slack
381  15507614  661  0.000038  com.chicagotribune
382  15506774  1826  0.000018  org.gnome
383  15505375  171  0.000131  fr.google
384  15504823  229  0.000095  com.scorecardresearch
385  15504231  874  0.000033  com.searchengineland
386  15501664  1199  0.000023  com.mtv
387  15500750  2182  0.000014  ca.uwaterloo
388  15499743  347  0.000063  org.whatbrowser
389  15498496  1324  0.000021  com.nike
390  15497371  1719  0.000019  net.boingboing
391  15496711  1328  0.000021  edu.jhu
392  15496579  1322  0.000021  edu.academia
393  15494830  853  0.000035  com.sun
394  15493312  652  0.000038  br.com.uol
395  15492698  58  0.000615  me.m
396  15492117  289  0.000077  com.hp
397  15491322  1095  0.000026  com.manta
398  15490293  1831  0.000018  com.blogs
399  15489363  713  0.000037  com.sciencedaily
400  15487527  545  0.000043  com.geocities
401  15485237  816  0.000036  gov.census
402  15484388  33  0.000876  com.messenger
403  15480933  890  0.000033  ca.blogspot
404  15480347  1268  0.000022  com.jigsy
405  15480054  312  0.000071  us.icio
406  15480048  392  0.000056  com.force
407  15478407  829  0.000036  us.pa.state
408  15477943  1149  0.000024  com.target
409  15476319  555  0.000042  uk.co.independent
410  15474849  441  0.000049  com.squareup
411  15474804  1135  0.000025  com.prezi
412  15474620  440  0.000049  gov.noaa
413  15474277  178  0.000127  org.icann
414  15474275  1278  0.000022  com.elpais
415  15473471  995  0.000029  org.altervista
416  15473455  1141  0.000025  edu.si
417  15472680  1394  0.000020  com.gawker
418  15471696  335  0.000066  com.bitly
419  15470693  918  0.000032  fm.last
420  15470552  1181  0.000024  com.hulu
421  15470341  1501  0.000019  tv.periscope
422  15469316  1074  0.000026  edu.umn
423  15468872  1917  0.000017  com.sky
424  15468611  193  0.000117  it.google
425  15465583  1175  0.000024  com.pcworld
426  15462509  1168  0.000024  com.teespring
427  15460362  341  0.000065  com.booking
428  15458683  1272  0.000022  com.upwork
429  15458633  1096  0.000026  de.spiegel
430  15456464  457  0.000047  org.doi
431  15453451  987  0.000029  com.vimeopro
432  15451494  299  0.000075  ly.ow
433  15450701  830  0.000036  com.fastcodesign
434  15450422  137  0.000155  com.eepurl
435  15449841  1001  0.000028  net.researchgate
436  15449833  1302  0.000022  com.salon
437  15449395  961  0.000030  org.unesco
438  15448777  2017  0.000016  com.ikea
439  15447906  1015  0.000028  com.airbnb
440  15447624  2263  0.000014  com.wikispaces
441  15447260  1275  0.000022  edu.uchicago
442  15446865  1169  0.000024  gov.nyc
443  15446299  610  0.000039  es.com.blogspot
444  15445625  849  0.000035  com.bandsintown
445  15443204  689  0.000038  com.cbslocal
446  15442794  2080  0.000015  com.speakerdeck
447  15442432  1815  0.000018  edu.virginia
448  15442282  1813  0.000018  com.csmonitor
449  15441838  1448  0.000020  com.vanityfair
450  15440110  1239  0.000023  com.scientificamerican
451  15439458  936  0.000031  com.thenextweb
452  15439164  1955  0.000017  edu.msu
453  15438677  1326  0.000021  com.freewebs
454  15438547  850  0.000035  com.shutterstock
455  15435909  2395  0.000013  org.nypl
456  15434875  439  0.000049  com.delicious
457  15434850  454  0.000048  com.sciencedirect
458  15433820  913  0.000032  gov.wi
459  15433080  1292  0.000022  edu.unc
460  15432813  336  0.000066  com.technorati
461  15431314  237  0.000093  ca.google
462  15428443  1281  0.000022  com.lulu
463  15428051  1038  0.000028  com.over-blog
464  15427519  1395  0.000020  edu.usc
465  15427022  1189  0.000024  uk.co.theregister
466  15426856  1852  0.000018  uk.ac.ed
467  15426682  520  0.000044  com.githubusercontent
468  15425947  521  0.000044  org.hbr
469  15425391  1066  0.000027  com.istockphoto
470  15425349  653  0.000038  gov.copyright
471  15424494  859  0.000034  com.msdn
472  15424044  2140  0.000015  fr.lefigaro
473  15424014  1362  0.000020  au.com.smh
474  15421339  607  0.000040  org.bitbucket
475  15421040  53  0.000656  com.atdmt
476  15419617  860  0.000034  com.ooyala
477  15419493  1329  0.000021  com.searchenginewatch
478  15418920  136  0.000156  com.xing
479  15418361  843  0.000035  com.psychologytoday
480  15417558  1250  0.000023  com.thinkwithgoogle
481  15416891  2112  0.000015  au.com.theaustralian
482  15416295  2276  0.000014  ca.ualberta
483  15415618  2241  0.000014  com.softpedia
484  15415548  2180  0.000015  de.bild
485  15413194  2053  0.000016  org.moma
486  15412912  1184  0.000024  com.cbs
487  15412881  930  0.000031  com.ggpht
488  15410659  637  0.000039  gov.nps
489  15408840  2212  0.000014  edu.gmu
490  15407643  253  0.000090  com.fc2
491  15407570  2142  0.000015  edu.asu
492  15407046  1402  0.000020  edu.duke
493  15405694  2120  0.000015  com.gamespot
494  15405561  1527  0.000019  com.nfl
495  15403868  1008  0.000028  com.zoho
496  15402737  529  0.000043  cn.com.sina
497  15398565  1858  0.000018  edu.umd
498  15398497  2107  0.000015  com.yfrog
499  15398058  1124  0.000025  com.globo
500  15397979  367  0.000060  com.photoshelter
501  15396257  2252  0.000014  com.mentalfloss
502  15395215  565  0.000041  fr.amazon
503  15394332  2103  0.000015  com.yolasite
504  15393973  1490  0.000019  au.com.blogspot
505  15392266  1257  0.000022  com.billboard
506  15391272  937  0.000031  us.fl.state
507  15390893  465  0.000047  com.fortune
508  15389726  1167  0.000024  com.forrester
509  15388867  174  0.000129  com.youtube-nocookie
510  15387526  1151  0.000024  org.filezilla-project
511  15386771  2191  0.000014  com.pastebin
512  15386754  1265  0.000022  it.blogspot
513  15385485  41  0.000777  org.networkadvertising
514  15384939  1870  0.000018  ru.narod
515  15384906  743  0.000037  com.att
516  15384404  1930  0.000017  com.computerworld
517  15383297  803  0.000037  gov.justice
518  15383264  3100  0.000010  com.xanga
519  15380743  2580  0.000012  org.kiva
520  15379900  2001  0.000016  nl.blogspot
521  15379898  2284  0.000014  com.googlepages
522  15379595  3058  0.000010  cc.co
523  15379486  1479  0.000019  com.colourlovers
524  15379092  183  0.000125  com.nielsen
525  15378656  1427  0.000020  fr.lemonde
526  15378212  1036  0.000028  org.postimg
527  15378043  242  0.000091  es.google
528  15377701  923  0.000031  com.gofundme
529  15377000  2554  0.000012  edu.unl
530  15376832  297  0.000076  nl.google
531  15376563  921  0.000031  org.postgresql
532  15376372  216  0.000102  com.myshopify
533  15375315  602  0.000040  gov.senate
534  15375181  63  0.000602  com.shareaholic
535  15375078  1063  0.000027  com.gigaom
536  15374432  863  0.000034  com.steampowered
537  15373201  361  0.000061  edu.nyu
538  15372956  1841  0.000018  com.techradar
539  15372552  917  0.000032  com.sagepub
540  15371944  638  0.000039  com.quantcast
541  15371504  2453  0.000013  com.citysearch
542  15370774  1360  0.000021  com.semrush
543  15370511  1376  0.000020  com.deezer
544  15370278  485  0.000046  br.com.google
545  15370278  1017  0.000028  com.mckinsey
546  15368963  910  0.000032  com.redhat
547  15368455  958  0.000030  net.azurewebsites
548  15367444  854  0.000034  com.prweb
549  15366697  355  0.000062  io.atom
550  15366349  1125  0.000025  gov.uspto
551  15366186  1643  0.000019  com.ibtimes
552  15366092  2384  0.000013  ly.visual
553  15365287  1356  0.000021  de.zeit
554  15365040  1116  0.000025  com.ycombinator
555  15362746  162  0.000139  org.joomla
556  15362001  373  0.000058  com.naver
557  15361416  1462  0.000020  com.pexels
558  15359899  569  0.000041  com.webmd
559  15358171  1477  0.000019  com.nbc
560  15357605  887  0.000033  net.leadpages
561  15357271  126  0.000170  ru.mail
562  15357180  617  0.000039  org.openstreetmap
563  15356987  1887  0.000017  edu.ufl
564  15356792  1050  0.000027  org.tigris
565  15355908  1177  0.000024  com.smashingmagazine
566  15355445  1355  0.000021  com.ssllabs
567  15354953  1361  0.000021  se.haxx
568  15354144  1307  0.000021  com.econsultancy
569  15353069  98  0.000256  jp.co.google
570  15351292  458  0.000047  gov.fda
571  15349006  1289  0.000022  com.timeout
572  15348715  971  0.000029  gov.nh
573  15348496  2203  0.000014  com.bestbuy
574  15346338  2238  0.000014  com.codecademy
575  15346329  1256  0.000022  com.whitepages
576  15346261  1483  0.000019  com.philly
577  15346115  2130  0.000015  edu.caltech
578  15344481  1918  0.000017  com.deadline
579  15343922  977  0.000029  org.iana
580  15341679  2157  0.000015  com.gq
581  15341443  1201  0.000023  com.bostonglobe
582  15340856  1935  0.000017  com.starwars
583  15339524  2249  0.000014  com.instructables
584  15338407  1359  0.000021  com.xkcd
585  15338034  2459  0.000013  edu.bu
586  15336970  1995  0.000016  edu.indiana
587  15336662  1301  0.000022  com.animoto
588  15336583  1382  0.000020  com.vmware
589  15335607  1270  0.000022  com.zazzle
590  15334096  549  0.000042  gov.hhs
591  15333272  444  0.000049  com.marriott
592  15332387  3068  0.000010  cc.tiny
593  15331948  2716  0.000011  com.sophos
594  15331939  1833  0.000018  com.angelfire
595  15331003  3404  0.000009  com.blog
596  15330787  1152  0.000024  com.com
597  15329429  1051  0.000027  com.superpages
598  15327531  1285  0.000022  com.nokia
599  15326681  1884  0.000018  com.me
600  15325341  2195  0.000014  uk.ac.ucl
601  15324942  196  0.000116  com.googleadservices
602  15324850  239  0.000092  jp.co.amazon
603  15324725  293  0.000077  com.list-manage1
604  15324671  989  0.000029  com.hootsuite
605  15324436  2073  0.000015  edu.umass
606  15320812  2989  0.000010  org.icrc
607  15320731  897  0.000032  com.formstack
608  15320554  1062  0.000027  com.nydailynews
609  15320137  2458  0.000013  org.gimp
610  15320032  2341  0.000013  edu.uiuc
611  15319217  2702  0.000011  com.klout
612  15318501  1919  0.000017  org.aclu
613  15318473  170  0.000131  jp.ameblo
614  15318003  2375  0.000013  com.canalblog
615  15317657  1006  0.000028  org.unicef
616  15317145  388  0.000056  in.co.google
617  15316269  2654  0.000012  com.technet
618  15316214  1881  0.000018  edu.rutgers
619  15314472  1837  0.000018  com.gettyimages
620  15313957  1946  0.000017  com.thedailybeast
621  15313931  2498  0.000012  uk.co.metro
622  15313538  550  0.000042  com.cargocollective
623  15313474  2478  0.000012  gov.cia
624  15313434  875  0.000033  com.pinimg
625  15313153  1333  0.000021  com.brandyourself
626  15313042  488  0.000046  com.nwsource
627  15312657  2682  0.000011  com.nabble
628  15311990  2410  0.000013  com.fiverr
629  15311590  1366  0.000020  com.reference
630  15311208  2243  0.000014  edu.uci
631  15310681  2169  0.000015  com.denverpost
632  15310451  519  0.000044  gov.usa
633  15310254  946  0.000031  org.iso
634  15309261  354  0.000062  com.newrelic
635  15308400  573  0.000041  com.herokuapp
636  15308142  836  0.000035  tv.twitch
637  15308029  1552  0.000019  com.mac
638  15307787  374  0.000058  com.bizjournals
639  15307496  1924  0.000017  org.rubyonrails
640  15307485  727  0.000037  com.usnews
641  15306050  902  0.000032  com.fotolia
642  15305652  3165  0.000009  com.laughingsquid
643  15305451  349  0.000063  com.bigcartel
644  15304790  2010  0.000016  com.ign
645  15304773  3586  0.000008  edu.syr
646  15302826  1960  0.000017  edu.ucdavis
647  15301597  1271  0.000022  com.playstation
648  15301507  369  0.000059  gov.irs
649  15300774  3377  0.000009  com.answers
650  15300201  2642  0.000012  edu.hawaii
651  15300118  2064  0.000016  com.topsy
652  15300099  2316  0.000014  org.videolan
653  15299454  1311  0.000021  com.underconsideration
654  15298943  1941  0.000017  com.investopedia
655  15298571  2725  0.000011  com.hubpages
656  15298437  2699  0.000011  org.greenpeace
657  15298405  2299  0.000014  org.webkit
658  15298109  1304  0.000022  com.accenture
659  15297707  1882  0.000018  com.howstuffworks
660  15297485  992  0.000029  us.ma.state
661  15296995  1089  0.000026  com.netflix
662  15296641  3260  0.000009  com.bigthink
663  15296190  1232  0.000023  com.firefox
664  15296189  449  0.000049  com.angieslist
665  15295815  1218  0.000023  com.techrepublic
666  15295556  3520  0.000008  edu.ucsc
667  15295331  2044  0.000016  ca.utoronto
668  15294445  424  0.000051  com.houzz
669  15293381  4486  0.000007  com.skyrock
670  15293286  2247  0.000014  com.urbandictionary
671  15292886  2815  0.000011  org.donorschoose
672  15292847  1306  0.000021  com.wfaa
673  15292368  1018  0.000028  com.justgiving
674  15292016  1913  0.000017  com.getfirebug
675  15291955  1160  0.000024  com.king5
676  15291413  1877  0.000018  com.us
677  15291199  1225  0.000023  uk.co.mirror
678  15290829  852  0.000035  com.reverbnation
679  15290632  842  0.000035  gov.sba
680  15289434  2049  0.000016  com.ndtv
681  15288265  2970  0.000010  edu.brown
682  15288029  2379  0.000013  org.coursera
683  15287961  2190  0.000014  com.readwriteweb
684  15287447  1611  0.000019  ru.spb
685  15286885  1238  0.000023  ly.snip
686  15285464  1349  0.000021  br.com.blogspot
687  15285356  1947  0.000017  com.ey
688  15285088  416  0.000052  gov.usda
689  15284567  334  0.000066  kr.flic
690  15284507  1856  0.000018  uk.co.thesun
691  15282524  563  0.000041  gov.sec
692  15281588  1140  0.000025  com.merchantcircle
693  15281481  1873  0.000018  edu.tamu
694  15281105  2795  0.000011  com.techsmith
695  15279280  1224  0.000023  net.vnexpress
696  15279179  2269  0.000014  edu.osu
697  15278501  1166  0.000024  com.mixcloud
698  15278297  3369  0.000009  edu.rochester
699  15277931  156  0.000143  jp.ne.hatena
700  15277896  3307  0.000009  org.oxfam
701  15277882  1236  0.000023  com.examiner
702  15277496  1832  0.000018  int.wipo
703  15277085  2087  0.000015  edu.arizona
704  15276954  808  0.000036  com.hostgator
705  15276571  1139  0.000025  jp.blogspot
706  15276267  3446  0.000009  com.allbusiness
707  15275991  1895  0.000017  com.msnbc
708  15275573  469  0.000047  com.wunderground
709  15274401  1438  0.000020  org.craigslist
710  15273961  483  0.000046  gov.ny
711  15273916  1984  0.000016  com.udemy
712  15272770  1007  0.000028  com.cafepress
713  15271907  1325  0.000021  ca.kijiji
714  15270985  2469  0.000013  com.tutsplus
715  15269961  2591  0.000012  com.wolfram
716  15269647  1818  0.000018  de.welt
717  15268436  2468  0.000013  com.makezine
718  15267469  2907  0.000010  com.friendfeed
719  15267315  847  0.000035  net.openid
720  15267116  2491  0.000012  it.repubblica
721  15266908  1380  0.000020  it.justpaste
722  15264794  1039  0.000028  com.500px
723  15264333  1369  0.000020  com.bizcommunity
724  15264318  2172  0.000015  com.pbworks
725  15263255  1277  0.000022  com.gumroad
726  15262891  1196  0.000023  com.mysanantonio
727  15262769  1389  0.000020  com.foxbusiness
728  15262355  1524  0.000019  com.canva
729  15261885  2337  0.000013  com.glamour
730  15260928  3282  0.000009  com.avast
731  15260868  1129  0.000025  de.heise
732  15260080  3333  0.000009  com.news
733  15260065  3269  0.000009  gd.is
734  15258443  2043  0.000016  uk.ac.lse
735  15258381  2461  0.000013  com.tv
736  15258369  428  0.000050  com.typeform
737  15257788  2295  0.000014  com.theonion
738  15257354  1455  0.000020  io.codepen
739  15257242  370  0.000059  com.adweek
740  15257204  957  0.000030  org.mediawiki
741  15256000  2255  0.000014  org.acs
742  15255931  1902  0.000017  com.getsatisfaction
743  15254860  2733  0.000011  com.expedia
744  15254608  260  0.000086  com.windows
745  15254582  496  0.000046  gov.ed
746  15253723  1295  0.000022  com.sohu
747  15252995  1058  0.000027  org.oecd
748  15252244  1555  0.000019  org.ap
749  15252208  1241  0.000023  com.city-data
750  15251628  443  0.000049  gov.epa
751  15250980  3583  0.000008  edu.missouri
752  15249746  2189  0.000014  com.ask
753  15249300  1880  0.000018  com.norton
754  15249295  2164  0.000015  edu.ncsu
755  15248729  817  0.000036  net.launchpad
756  15248611  3534  0.000008  com.scobleizer
757  15248299  912  0.000032  com.redbubble
758  15245757  1547  0.000019  com.newsweek
759  15245385  1521  0.000019  com.blogtalkradio
760  15245282  1287  0.000022  com.garmin
761  15245212  1027  0.000028  com.hollywoodreporter
762  15245183  1637  0.000019  com.yandex
763  15245175  2278  0.000014  com.foodnetwork
764  15243922  2092  0.000015  com.si
765  15243100  2194  0.000014  com.mediabistro
766  15242393  1183  0.000024  com.mediafire
767  15242360  134  0.000161  com.parallels
768  15242173  2110  0.000015  com.delta
769  15241483  337  0.000066  org.microformats
770  15241345  1442  0.000020  th.co.lazada
771  15241235  2123  0.000015  com.ford
772  15240434  1865  0.000018  com.allthingsd
773  15240082  1299  0.000022  com.technologyreview
774  15239583  2193  0.000014  com.freep
775  15239251  356  0.000062  com.teamviewer
776  15239065  2652  0.000012  edu.pitt
777  15238897  207  0.000107  org.purl
778  15238532  1850  0.000018  org.openoffice
779  15237565  2457  0.000013  com.podbean
780  15237162  371  0.000059  com.nypost
781  15237103  2429  0.000013  uk.co.timesonline
782  15235925  2074  0.000015  com.real
783  15235282  2363  0.000013  com.oxforddictionaries
784  15234189  2331  0.000013  ch.ethz
785  15233592  3334  0.000009  org.notepad-plus-plus
786  15233550  2292  0.000014  com.nhl
787  15233542  2449  0.000013  net.jsfiddle
788  15232983  3327  0.000009  net.battle
789  15232781  2555  0.000012  com.chrome
790  15232780  1209  0.000023  org.iihs
791  15232749  1180  0.000024  org.fao
792  15232078  2280  0.000014  org.gutenberg
793  15231865  1939  0.000017  com.ssrn
794  15231276  3438  0.000009  net.wordle
795  15231085  1441  0.000020  net.brownbook
796  15230217  3648  0.000008  com.dreamstime
797  15229969  3435  0.000009  be.blogspot
798  15229857  953  0.000030  us.nm.state
799  15229697  1901  0.000017  org.weforum
800  15229356  2346  0.000013  org.lifehack
801  15229182  3444  0.000009  edu.umaryland
802  15228293  528  0.000043  com.list-manage2
803  15227693  382  0.000057  net.windows
804  15227013  2496  0.000012  edu.vt
805  15226905  3239  0.000009  com.voanews
806  15225763  2425  0.000013  com.shutterfly
807  15225225  2762  0.000011  edu.ucsf
808  15225176  1962  0.000017  com.starwoodhotels
809  15224438  435  0.000050  com.ea
810  15223961  911  0.000032  com.gartner
811  15223450  4115  0.000007  it.libero
812  15222361  826  0.000036  org.w
813  15222226  1879  0.000018  org.slashdot
814  15222185  1467  0.000019  org.cancer
815  15221037  396  0.000055  jp.ne.sakura
816  15220922  1358  0.000021  org.json
817  15220813  1093  0.000026  com.bufferapp
818  15219986  1357  0.000021  com.unity3d
819  15219634  2215  0.000014  net.earthlink
820  15219035  2059  0.000016  com.digitaltrends
821  15218571  2158  0.000015  uk.co.huffingtonpost
822  15217732  2950  0.000010  edu.tufts
823  15217533  1420  0.000020  com.wsoctv
824  15217060  2435  0.000013  com.thefreedictionary
825  15216593  353  0.000062  net.freenode
826  15216469  1337  0.000021  com.kgw
827  15215867  1871  0.000018  gov.uscourts
828  15215804  1108  0.000025  com.steamcommunity
829  15214737  2050  0.000016  com.kaspersky
830  15214400  1428  0.000020  com.ripple
831  15214306  1413  0.000020  mil.navy
832  15212636  3281  0.000009  edu.emory
833  15212571  858  0.000034  com.linksynergy
834  15212046  3884  0.000008  com.4shared
835  15211666  1392  0.000020  com.chamberofcommerce
836  15211429  2271  0.000014  com.chronicle
837  15211268  1071  0.000026  be.google
838  15211197  3057  0.000010  com.squidoo
839  15211139  3033  0.000010  com.esquire
840  15211059  2122  0.000015  com.azcentral
841  15210567  1418  0.000020  ly.list
842  15210307  2364  0.000013  com.sony
843  15210129  909  0.000032  com.deloitte
844  15209068  628  0.000039  com.webnode
845  15208892  1367  0.000020  net.yahoo
846  15208875  2952  0.000010  com.threadless
847  15208572  1411  0.000020  org.whatwg
848  15208528  4053  0.000007  edu.udel
849  15208357  381  0.000057  ru.vkontakte
850  15208257  3102  0.000010  ca.mcgill
851  15208136  1261  0.000022  com.zillow
852  15207337  2744  0.000011  edu.uga
853  15207332  985  0.000029  com.163
854  15206915  1150  0.000024  com.techtarget
855  15206493  2281  0.000014  org.wnyc
856  15206087  1106  0.000025  com.yoast
857  15205825  3036  0.000010  pt.sapo
858  15205659  1992  0.000016  org.jstor
859  15205644  1158  0.000024  com.fedex
860  15204700  1352  0.000021  org.oxfordjournals
861  15204575  1347  0.000021  com.thomsonreuters
862  15204520  1182  0.000024  org.gnupg
863  15204413  1965  0.000017  org.ampproject
864  15204277  2262  0.000014  com.css-tricks
865  15203728  389  0.000056  com.monster
866  15203287  633  0.000039  com.cdbaby
867  15202918  1194  0.000023  com.business2community
868  15202339  7097  0.000005  com.zimbio
869  15202259  3395  0.000009  com.macrumors
870  15201640  3010  0.000010  edu.dartmouth
871  15201627  4114  0.000007  gr.blogspot
872  15201276  589  0.000040  com.friendster
873  15201171  2536  0.000012  org.computer
874  15200999  1053  0.000027  gov.dhs
875  15200758  1365  0.000020  com.bmj
876  15200264  1228  0.000023  com.nymag
877  15200256  383  0.000057  com.youku
878  15200090  2089  0.000015  org.npmjs
879  15199967  2282  0.000014  ca.ubc
880  15199583  1305  0.000021  com.oup
881  15199125  3791  0.000008  org.laptop
882  15198697  3862  0.000008  org.wikiquote
883  15198650  2147  0.000015  com.gopro
884  15198375  2014  0.000016  me.flavors
885  15197141  1387  0.000020  com.hotfrog
886  15196701  2635  0.000012  com.aviary
887  15196573  1170  0.000024  com.googleapps
888  15196261  2217  0.000014  com.popsugar
889  15196007  2051  0.000016  com.patch
890  15195975  1451  0.000020  com.communitywalk
891  15195769  4424  0.000007  com.pbase
892  15195511  1048  0.000027  us.wi.state
893  15195187  3640  0.000008  tv.wat
894  15194595  2038  0.000016  com.autodesk
895  15194091  3050  0.000010  edu.oregonstate
896  15193999  160  0.000141  info.aboutads
897  15193540  1331  0.000021  gov.uscis
898  15193209  2352  0.000013  com.harpercollins
899  15193130  3237  0.000009  com.blackgirlscode
900  15192484  4626  0.000006  com.boredpanda
901  15192375  583  0.000040  com.hilton
902  15192287  6829  0.000005  com.weheartit
903  15191839  2162  0.000015  com.ifttt
904  15191738  5780  0.000005  com.rebelmouse
905  15191009  3197  0.000009  int.esa
906  15190294  2155  0.000015  org.raspberrypi
907  15190293  3941  0.000008  com.pixlr
908  15190078  799  0.000037  com.adage
909  15189887  2100  0.000015  com.netvibes
910  15189887  3173  0.000009  edu.iastate
911  15189680  480  0.000046  com.taobao
912  15189297  1336  0.000021  com.today
913  15188664  3488  0.000009  edu.usf
914  15188658  2009  0.000016  com.thestar
915  15188260  2205  0.000014  org.r-project
916  15188008  1406  0.000020  com.sap
917  15187715  3040  0.000010  org.charitywater
918  15187204  2483  0.000012  tv.blip
919  15187084  2117  0.000015  com.livestrong
920  15187046  2321  0.000013  com.britannica
921  15185988  3408  0.000009  au.com.theage
922  15185981  1561  0.000019  com.mercurynews
923  15185466  1897  0.000017  com.foxsports
924  15185012  1296  0.000022  org.apa
925  15183779  3348  0.000009  tv.arte
926  15183521  3302  0.000009  com.bhphotovideo
927  15183491  1266  0.000022  com.comodo
928  15183363  801  0.000037  com.brightcove
929  15182962  3483  0.000009  uk.org.bhf
930  15182523  4155  0.000007  mx.unam
931  15181553  1111  0.000025  com.convertkit
932  15181533  2207  0.000014  com.geekwire
933  15181414  2613  0.000012  nl.xs4all
934  15181217  2852  0.000011  com.newsvine
935  15180790  2472  0.000013  com.w3techs
936  15180377  2192  0.000014  com.newscientist
937  15179981  3151  0.000009  com.popsci
938  15179907  1835  0.000018  gov.in
939  15179435  1195  0.000023  com.gotomeeting
940  15179161  2424  0.000013  com.fox
941  15178849  1340  0.000021  com.rollingstone
942  15177895  1346  0.000021  co.angel
943  15177525  2544  0.000012  com.avg
944  15177088  109  0.000213  com.yimg
945  15173343  464  0.000047  com.zenfolio
946  15172342  2366  0.000013  org.kde
947  15172015  939  0.000031  uk.co.eventbrite
948  15171872  4082  0.000007  net.minecraft
949  15171718  6627  0.000005  fr.unblog
950  15171446  2616  0.000012  com.ezinearticles
951  15171116  1956  0.000017  com.macworld
952  15171092  1026  0.000028  com.uservoice
953  15170472  3519  0.000008  cc.arduino
954  15170318  2436  0.000013  net.oauth
955  15170061  446  0.000049  com.gotowebinar
956  15170030  2607  0.000012  us.zoom
957  15169703  1972  0.000016  com.wetransfer
958  15169221  597  0.000040  ru.google
959  15169001  450  0.000049  com.meyerweb
960  15168876  3420  0.000009  org.britishcouncil
961  15168688  3394  0.000009  net.deviantart
962  15168623  1172  0.000024  com.intuit
963  15168438  1088  0.000026  com.americanexpress
964  15168426  2008  0.000016  com.amzn
965  15167776  2060  0.000016  net.speedtest
966  15167505  3206  0.000009  se.blogspot
967  15167365  409  0.000053  com.mapbox
968  15167024  3694  0.000008  edu.uky
969  15166707  1022  0.000028  gov.va
970  15166663  4038  0.000007  com.secondlife
971  15166551  2061  0.000016  com.comcast
972  15165934  1496  0.000019  com.espn
973  15165734  1054  0.000027  com.walmart
974  15165621  4169  0.000007  org.edublogs
975  15165053  2128  0.000015  com.mydomain
976  15164578  982  0.000029  us.tx.state
977  15164372  4160  0.000007  ca.yorku
978  15164294  1056  0.000027  jp.ne.goo
979  15164103  812  0.000036  com.emarketer
980  15163086  1822  0.000018  uk.gov.legislation
981  15162903  3668  0.000008  tt.db
982  15162643  5940  0.000005  edu.ua
983  15162357  3391  0.000009  com.mlive
984  15162205  1348  0.000021  com.arabianbusiness
985  15162134  1942  0.000017  org.c-span
986  15161315  1916  0.000017  uk.co.wired
987  15161076  2482  0.000012  org.worldcat
988  15159608  2314  0.000014  net.daum
989  15159215  2631  0.000012  com.chow
990  15159182  3301  0.000009  org.ibiblio
991  15159041  1223  0.000023  gov.archives
992  15158591  2539  0.000012  com.univision
993  15158148  1938  0.000017  com.aliexpress
994  15157040  4377  0.000007  com.vodpod
995  15156983  1853  0.000018  com.merriam-webster
996  15156908  2427  0.000013  org.hrc
997  15156319  2516  0.000012  com.crunchbase
998  15156162  3351  0.000009  fr.cnrs
999  15155947  1431  0.000020  com.ktvb
1000  15155558  326  0.000068  pl.google
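
The domain names in the table above are written in reversed order (for example uk.co.bbc for bbc.co.uk). As a minimal sketch, the usual order can be restored by reversing the dot-separated labels; the function name below is a hypothetical helper, not part of any Common Crawl tool.

```python
def unreverse_domain(reversed_name):
    """Turn a reversed domain name such as 'uk.co.bbc' into 'bbc.co.uk'."""
    # Split into labels, flip their order, and join them again with dots.
    return ".".join(reversed(reversed_name.split(".")))


# Three names taken from the table above.
for name in ["net.fbcdn", "uk.co.bbc", "pl.google"]:
    print(name, "->", unreverse_domain(name))
```

Applying the function twice gives back the original string, so the same helper also converts ordinary domain names into the reversed form.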

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data will be useful for any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!

October 2017 Crawl Archive Now Available

The crawl archive for October 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-43/. It contains 3.65 billion web pages and over 300 TiB of uncompressed content.

To improve coverage and freshness we added over 900 million new URLs (not contained in any crawl archive before):

  • 350 million URLs are a random sample extracted from the sitemaps (where provided) of the top 80 million hosts taken from the May/June/July 2017 webgraph data set
  • 250 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 80 million hosts
  • 150 million URLs are randomly chosen from WAT files of the September crawl
  • 180 million URLs are links donated by mixnode.com

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-43/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-43/warc.paths.gz | 89100 | 84.9
WAT files | CC-MAIN-2017-43/wat.paths.gz | 89100 | 24.33
WET files | CC-MAIN-2017-43/wet.paths.gz | 89100 | 10.58
Robots.txt files | CC-MAIN-2017-43/robotstxt.paths.gz | 89100 | 0.17
Non-200 responses files | CC-MAIN-2017-43/non200responses.paths.gz | 89100 | 2.62
URL index files | CC-MAIN-2017-43/cc-index.paths.gz | 302 | 0.28
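
The prefixing step described above can be sketched in a few lines of Python. This is only an illustration: the function names `full_paths` and `read_paths_gz` are hypothetical, and it assumes the listing has already been downloaded and contains one relative path per line.

```python
import gzip
from typing import Iterable, Iterator, List

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def full_paths(lines: Iterable[str], prefix: str = HTTP_PREFIX) -> Iterator[str]:
    """Yield one full location per non-empty line of a paths listing."""
    for line in lines:
        path = line.strip()
        if path:  # skip blank lines
            yield prefix + path


def read_paths_gz(filename: str, prefix: str = HTTP_PREFIX) -> List[str]:
    """Read a locally stored, gzipped paths listing and prefix every line."""
    with gzip.open(filename, "rt", encoding="utf-8") as listing:
        return list(full_paths(listing, prefix))


if __name__ == "__main__":
    # Made-up relative paths, used only to show the shape of the result.
    sample = [
        "crawl-data/CC-MAIN-2017-43/example-a.warc.gz\n",
        "crawl-data/CC-MAIN-2017-43/example-b.warc.gz\n",
    ]
    for location in full_paths(sample, S3_PREFIX):
        print(location)
```

Passing `S3_PREFIX` gives paths for tools that talk to S3 directly, while the default `HTTP_PREFIX` gives URLs that any HTTP client can fetch.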

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-43/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

We are grateful to our friends at mixnode for donating a seed list of 200+ Million URLs to enhance the Common Crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

September 2017 Crawl Archive Now Available

The crawl archive for September 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-39/. It contains 3.01 billion web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness we added one billion new URLs (not contained in any crawl archive before):

  • 300 million URLs are a random sample extracted from sitemaps if provided by any of the top 60 million hosts taken from the May/June/July 2017 webgraph data set
  • 500 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 60 million hosts and from a list of university domains collected by a Common Crawl user
  • 200 million URLs are randomly chosen from WAT files of the August crawl

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-39/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-39/warc.paths.gz | 72000 | 59.13
WAT files | CC-MAIN-2017-39/wat.paths.gz | 72000 | 20.1
WET files | CC-MAIN-2017-39/wet.paths.gz | 72000 | 8.86
Robots.txt files | CC-MAIN-2017-39/robotstxt.paths.gz | 72000 | 0.16
Non-200 responses files | CC-MAIN-2017-39/non200responses.paths.gz | 72000 | 2.07
URL index files | CC-MAIN-2017-39/cc-index.paths.gz | 302 | 0.23

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-39/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

August 2017 Crawl Archive Now Available

The crawl archive for August 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-34/. It contains more than 3.28 billion web pages and over 280 TiB of uncompressed content.

To improve coverage and freshness we used the top 50 million ranked hosts from the May/June/July 2017 webgraph data set and added over 800 million new URLs (not contained in any crawl archive before), of which

  • 300 million URLs were found by a side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 40 million hosts;
  • 525 million URLs are a random sample extracted from sitemaps (if provided by any of the top 50 million hosts).

The following improvements affect the WAT and WET extraction:

  • improved spacing / word segmentation in WET extracts, see issue #13
  • extract URLs from JavaScript code in onClick attributes (issue #8)

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.

Type | File List | #Files | Total Size Compressed (TiB)
Segments | CC-MAIN-2017-34/segment.paths.gz | 100 |
WARC files | CC-MAIN-2017-34/warc.paths.gz | 72000 | 65.14
WAT files | CC-MAIN-2017-34/wat.paths.gz | 72000 | 22.18
WET files | CC-MAIN-2017-34/wet.paths.gz | 72000 | 9.81
Robots.txt files | CC-MAIN-2017-34/robotstxt.paths.gz | 72000 | 0.12
Non-200 responses files | CC-MAIN-2017-34/non200responses.paths.gz | 72000 | 1.49
URL index files | CC-MAIN-2017-34/cc-index.paths.gz | 302 | 0.24

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2017-34/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

Now Available: Host- and Domain-Level Web Graphs

We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017. These graphs, along with ranked lists of hosts and domains, follow up on our first host-level web graph (February, March, April 2017). Detailed information about the data formats, the processing pipeline, our objectives, and credits can be found in the prior announcement.

Host-level graph

The graph consists of 1.3 billion nodes and 5.25 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to by a link on a crawled page. Host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
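
The host name transformation can be illustrated with a short Python sketch. The function names are hypothetical, and the code covers only the two steps named above (stripping a leading www. and reversing the dot-separated labels). Any further normalization done by the actual pipeline is not reproduced here.

```python
def reverse_host(host: str) -> str:
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(rev_host: str) -> str:
    """Turn a reversed host name back into the usual order (without 'www.')."""
    return ".".join(reversed(rev_host.split(".")))


if __name__ == "__main__":
    print(reverse_host("www.subdomain.example.com"))  # com.example.subdomain
    print(unreverse_host("com.example.subdomain"))    # subdomain.example.com
```

Reversed names sort hosts of the same domain next to each other, which is convenient when scanning the vertices file for a given domain.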

You can download the graph and the ranks of all 1.3 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/hostgraph/. Alternatively, you can use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/hostgraph/ as a prefix to access the files from anywhere.

The following files and formats are provided:

Size | File | Description
7.8 GB | vertices.txt.gz | nodes ⟨id, rev host⟩
20.4 GB | edges.txt.gz | edges ⟨from_id, to_id⟩
10.0 GB | bvgraph.graph | graph in BVGraph format
0.59 GB | bvgraph.offsets |
2 kB | bvgraph.properties |
14.5 GB | bvgraph-t.graph | transpose of the graph (outlinks mapped to inlinks)
1.6 GB | bvgraph-t.offsets |
2 kB | bvgraph-t.properties |
1 kB | bvgraph.stats | WebGraph statistics
18.9 GB | ranks.txt.gz | harmonic centrality and pagerank
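
As a rough illustration of how the vertices and edges files relate, the following Python sketch finds nodes without outgoing edges in a tiny sample. It assumes that fields are tab-separated (the separator is not stated here) and that everything fits in memory, which holds for a small extract but not for the full 1.3 billion nodes. The function names are hypothetical. Uncrawled hosts show up as nodes without outgoing edges, but a crawled host that links to no other host would look the same, so this only approximates the dangling nodes described above.

```python
from typing import Dict, Iterable, Set


def read_vertices(lines: Iterable[str]) -> Dict[int, str]:
    """Map node id to reversed host name, one '<id>\\t<rev host>' per line."""
    vertices = {}
    for line in lines:
        node_id, rev_host = line.rstrip("\n").split("\t")
        vertices[int(node_id)] = rev_host
    return vertices


def nodes_without_outlinks(vertices: Dict[int, str],
                           edge_lines: Iterable[str]) -> Set[int]:
    """Return ids of nodes that never occur as the source of an edge."""
    has_outlink = set()
    for line in edge_lines:
        from_id, _to_id = line.rstrip("\n").split("\t")
        has_outlink.add(int(from_id))
    return set(vertices) - has_outlink


if __name__ == "__main__":
    # Made-up sample data in the assumed tab-separated layout.
    vertex_lines = ["0\tcom.example\n", "1\torg.example\n", "2\tnet.example\n"]
    edge_lines = ["0\t1\n", "0\t2\n", "1\t2\n"]
    vertices = read_vertices(vertex_lines)
    for node_id in sorted(nodes_without_outlinks(vertices, edge_lines)):
        print(node_id, vertices[node_id])
```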

Note that differences in the rankings and the structure of the web graph are due to our objective to make monthly crawls more diverse and to reduce the overlap between consecutive crawls. During both February/March/April and May/June/July we crawled about 9 billion pages. As an indicator of less overlap, the number of unique URLs increased from 5.0 to 6.2 billion and the number of unique hosts went up from 70 to 90 million (or from 65 to 82 million with leading “www.” removed). The largest strongly connected component now contains 59 million nodes/hosts (45 million in the February/March/April graph). However, the May/June/July host-level graph has doubled in size in terms of edges and more than tripled in terms of nodes. This growth is caused by a significant increase in the number of dangling nodes. The 1.2 billion dangling nodes provide a solid foundation to extend the next crawls, but we need a closer look at the distribution of these hosts among domains and TLDs.

Domain-level graph

The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only “ICANN” domains are accepted; “private” domains are not (cf. section “divisions” in the documentation on publicsuffix.org). In short, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated into the domains blogspot.com and amazonaws.com, respectively.
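
The aggregation idea can be sketched in Python with a toy suffix set. This is not the pipeline's actual code. `TOY_ICANN_SUFFIXES` is a hypothetical, hand-picked stand-in for the ICANN section of the public suffix list, wildcard and exception rules are ignored, and the function names are made up. Whether links within the same domain are kept is not stated above, so the sketch simply keeps them.

```python
from typing import Iterable, Optional, Set, Tuple

# Toy stand-in for the ICANN section of the public suffix list (assumption).
TOY_ICANN_SUFFIXES = frozenset({"com", "org", "co.uk"})


def pay_level_domain(host: str,
                     suffixes: frozenset = TOY_ICANN_SUFFIXES) -> Optional[str]:
    """Return the public suffix plus one more label, or None if nothing matches."""
    labels = host.lower().split(".")
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def aggregate_edges(host_edges: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Collapse host-to-host links into unique domain-to-domain links."""
    domain_edges = set()
    for source, target in host_edges:
        src, dst = pay_level_domain(source), pay_level_domain(target)
        if src is not None and dst is not None:
            domain_edges.add((src, dst))
    return domain_edges


if __name__ == "__main__":
    print(pay_level_domain("foo.blogspot.com"))              # blogspot.com
    print(pay_level_domain("commoncrawl.s3.amazonaws.com"))  # amazonaws.com
```

Because "blogspot.com" and "s3.amazonaws.com" are absent from the ICANN-only set, both hosts fall back to the domain registered directly under "com".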

The domain-level graph has 91 million nodes and 1,071 million edges. 55% of the nodes are dangling, and the largest strongly connected component covers 30 million, or 33%, of the nodes.

All files related to the domain graph are placed on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/domaingraph/ and are also accessible via https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2017-may-jun-jul/domaingraph/.

Download files of the Common Crawl May/June/July 2017 domain-level webgraph

Size | File | Description
0.63 GB | vertices.txt.gz | nodes ⟨id, rev host⟩
4.3 GB | edges.txt.gz | edges ⟨from_id, to_id⟩
2.4 GB | bvgraph.graph | graph in BVGraph format
0.09 GB | bvgraph.offsets |
2 kB | bvgraph.properties |
2.6 GB | bvgraph-t.graph | transpose of the graph (outlinks mapped to inlinks)
0.13 GB | bvgraph-t.offsets |
2 kB | bvgraph-t.properties |
1 kB | bvgraph.stats | WebGraph statistics
1.8 GB | ranks.txt.gz | harmonic centrality and pagerank

Below you’ll find the top 1000 domains ranked by Harmonic Centrality or PageRank. The full list of all 91 million domains is available for download.

Top 1000 domains ranked by harmonic centrality (May/June/July 2017)

harmonic centrality rank | hc value | page rank | page rank value | reversed hostname
12498995210.0155264576161686com.facebook
22246088030.00866038900847366com.twitter
32209751420.0128827315785546com.googleapis
42130161640.00741843063139714com.youtube
52123440250.00706371663388533com.google
71899088260.00545683515949591org.gmpg
818904882100.00280210492427644com.linkedin
91884345090.00307727244449465com.instagram
1018259830250.00126167740795298org.wikipedia
1118053094380.00101731276962237com.blogspot
1217970588340.00109405966554137com.wordpress
1317919956220.00141960022500196com.pinterest
1417835014160.0020636325734346com.apple
1517697268140.00222054582589118org.wordpress
1717561416120.00240805753831021com.macromedia
1817508598320.00111771803587573com.gravatar
1917446398550.000661340427978196be.youtu
2017384778520.000701194639286359com.amazon
2117353492170.00187094862083697com.adobe
2217343286560.000637906122981968com.flickr
2317323658470.000801430611613196com.vimeo
2417279624490.000729089638101731gl.goo
2517250688420.000933859058158277com.paypal
2617229298460.000826236952032279com.microsoft
2817103198700.000474840687818822ly.bit
2917089526660.000507632417477531com.tumblr
3017033012500.000719273960345051com.github
3116996118430.000881255954914393com.amazonaws
3216946912770.000406272523681585org.creativecommons
3416863562670.000501297307358693com.yahoo
3516862378930.000299154154651788org.mozilla
3716812384830.00037282011840036org.w3
38167524831250.000207624449137057com.weebly
39167294641150.000221622522886096com.googleusercontent
40167176691180.000215230097627401com.blogger
41166976241530.000171757229153436com.myspace
44166469341850.000141243073946399org.wikimedia
4616644900790.000396589141351638com.medium
4716631223750.000433568676537io.github
4816612200990.000268805299302926com.android
4916590272180.00162368845735663com.bootstrapcdn
50165584341660.000157623382042305org.apache
51165254282290.000115801379415941com.photobucket
52165044261950.000131254056913804com.businessinsider
55164765533048.18438859138264e-05com.ebay
56164697211370.000187664601325017com.issuu
5716460853370.00102123707091179com.bing
58164527291820.000142450905972814com.imdb
5916427950820.000374838485018944eu.europa
6016411357480.000746154314145809com.statcounter
61163903382450.000101622215034005com.appspot
62163799522090.000124338192701991com.about
63163737421400.00018454904883009com.nytimes
64163638213527.06416235739108e-05gov.nasa
6516353372590.000571194421273779com.cloudflare
6616346221130.00239759215021479net.fbcdn
6716338312440.000857327043073336me.wp
6816326390710.000465848580411827com.gstatic
69163200413477.16053406333361e-05com.theverge
70163194141220.000209662262430235com.yelp
72162795242868.69073948101929e-05org.npr
73162720833257.56332706724679e-05com.googleblog
74162563042908.48972545255239e-05me.about
75162481991990.000128879026140158org.ietf
76162412552729.24645494623427e-05com.mozilla
77162404791560.00016832163499487org.gnu
7816239722920.000305401671067505co.t
79162393872160.000120371938835806com.oracle
80162325262280.000116114540948403com.typepad
81162310844535.44944627802185e-05edu.ucla
8216227427270.00121558207856905net.akamaihd
84162143092320.000114818535982162uk.co.bbc
85162142351360.000189983194885236com.jquery
86162118142210.000118993796467497com.imgur
87161974204725.2567183006594e-05edu.princeton
88161968553367.32728429381573e-05gov.noaa
8916191891740.000440473851712633org.schema
90161901451780.000144231760365021net.slideshare
9116190073620.000564081957805812net.cloudfront
92161837302848.92999121057943e-05com.wsj
93161759631880.000140208643596679com.forbes
9416172502760.00041917417746326com.paypalobjects
95161702941210.000212087858171349com.soundcloud
96161650482000.000128726042287979com.spotify
97161613672130.000122378879977989edu.stanford
98161581142020.000127454157734194com.disqus
99161566084815.14517415590491e-05com.mysql
100161486911670.000157054492258209com.dropbox
101161466722769.11987772125017e-05com.tinyurl
102161416981390.000184631238862252com.constantcontact
103161414203317.44922001669146e-05edu.mit
104161399952739.20851285715325e-05com.nbcnews
106161348911030.000249718966343057com.wix
107161313904405.63178297481015e-05com.googlecode
108161265241870.00014024908309338com.theguardian
10916123314950.000285005386439374com.huffingtonpost
111161129413537.01996718671122e-05gov.loc
112161050051550.000170093991134689net.sourceforge
113161016004016.14791022911939e-05org.python
114160888603796.60560437018202e-05com.alexa
115160886454105.97561569096326e-05co.g
116160790282529.88994061572069e-05com.go
118160774533646.82295063450322e-05com.foursquare
12116066757450.000829281063065997com.squarespace
122160619463936.23091312037812e-05com.sun
124160580794425.61406400406246e-05com.withgoogle
125160572732609.56090724221574e-05com.washingtonpost
12716035956630.000537473792465926com.vk
12816035411860.000352598947727924net.doubleclick
129160352394895.06581788427578e-05org.chromium
130160284985844.26316314674836e-05edu.gatech
132160270224165.92122336053588e-05au.gov.nsw
133160256841260.000207293393080606com.reddit
134160251065094.91710961371229e-05org.sciencemag
135160240413896.27321618308479e-05org.nodejs
136160239486054.19017591329544e-05edu.cuny
138160128991330.000193040982037554com.feedburner
140160001325384.57515027163085e-05edu.illinois
141159997784994.96494716668815e-05edu.berkeley
142159989123996.16482775444537e-05au.com.google
144159872285064.93463645755921e-05com.pixabay
145159760132589.60743680804667e-05uk.co.amazon
146159729071110.000230813335083292com.cnn
148159675585444.54457479041577e-05com.libsyn
149159524053716.7120499269601e-05com.wired
150159496182599.59425814050923e-05com.surveymonkey
151159467483836.53396206082701e-05uk.co.dailymail
152159464834525.45761550958247e-05com.variety
15315930068240.0013474030796477com.fb
154159258422150.000120638726611947com.etsy
155159248893916.24685427611247e-05uk.co.blogspot
156159123953117.90671790542201e-05com.dailymotion
157159069052110.000123239102782434com.digg
158159010971810.000142453252268271gov.nih
159158942063706.75180807177585e-05com.reuters
160158868207013.89256402003242e-05edu.columbia
162158729577873.751174278726e-05com.nike
163158680552270.000116596591295635com.live
165158647523547.01156504511129e-05com.livejournal
166158647386573.96129599599408e-05com.sap
167158626656643.95013568833938e-05com.discogs
168158579752828.9964923414358e-05com.aol
169158557504455.5748966776129e-05org.mediawiki
17015845942290.00114281853454816me.fb
171158429473517.07813033957103e-05com.usatoday
172158422836334.05686573540759e-05edu.utah
173158371385494.51699741643719e-05net.daringfireball
174158366744964.9999960509278e-05org.eclipse
175158348733686.76642625755714e-05com.npmjs
176158336385504.49681211898885e-05com.netflix
177158324293377.30215403346833e-05com.tripod
178158310388783.3656387101054e-05com.diigo
180158178718043.67639700948407e-05uk.org.tate
181158162003696.75600758631612e-05edu.harvard
182158154224695.3001847136341e-05com.time
183158103448233.59360502579382e-05id.co.blogspot
184158090261140.000221797473311586com.mashable
185158074608073.6447534726932e-05com.hbo
187158024577693.84109849525509e-05com.chron
188158015577783.79843092102341e-05edu.washington
189157950114265.81220523939956e-05com.gmail
190157931211130.000225135156397275com.jimdo
191157928473557.00169347047329e-05com.meetup
192157925006294.06926572016734e-05org.ampproject
193157913995964.21539576611305e-05com.jetbrains
194157909483227.68171654135211e-05com.bloomberg
195157893621520.000173965702149204de.google
196157878322888.59046288833472e-05to.amzn
197157861526374.04886202931351e-05org.virtualbox
198157802866244.10339488441819e-05uk.co.guardian
199157777173137.90555572975389e-05com.techcrunch
200157769754115.95014769939008e-05uk.co.telegraph
201157764775484.52609403123979e-05com.cnbc
202157751692509.94157457855262e-05uk.co.google
204157716034685.30605297385559e-05com.goodreads
205157661753976.18619648011229e-05com.msn
206157615983926.23915895238641e-05com.kickstarter
207157613701710.000149517430613129com.twimg
208157578999583.10644934362954e-05in.blogspot
209157569033018.3087741732879e-05com.images-amazon
21015755871210.00144087639474043com.wixstatic
211157401203606.91064104263346e-05io.codepen
212157341602250.000117332550512604com.eventbrite
213157331304135.93919728037878e-05com.bbc
214157316218893.33158516532838e-05com.newyorker
215157308893966.2001076998058e-05com.latimes
216157296934475.53136069293824e-05com.webs
219157271191740.000147912563107984org.archive
220157232754076.02624636349819e-05com.bizjournals
221157224936483.98018712364008e-05com.wikia
222157220632230.000118348799632694org.drupal
223157167402968.4089596881278e-05net.php
224157167036434.02154313123456e-05com.venturebeat
225157165766194.12625555463031e-05org.pbs
226157125199143.26392230464033e-05com.marketwatch
227157121447703.83803578491251e-05com.exacttarget
229157060938333.55961710127199e-05com.sublimetext
230157036011980.000129275694487554com.stumbleupon
232156973212719.26344464771574e-05fr.free
233156943734655.327297946833e-05gov.whitehouse
234156937609683.09182306870607e-05com.gizmodo
235156923787963.71950102005915e-05com.chrome
236156863849773.06408953569782e-05com.instapaper
237156822343088.07156735886469e-05gov.cdc
2381568201010342.90083132906254e-05com.ning
239156791914765.17732111584554e-05com.office
240156773834195.86742754402436e-05com.xrea
242156669756324.05994257035936e-05com.ted
2431566682914672.32876139970216e-05edu.usc
2441566626410852.82676357406144e-05edu.umn
2451566584610722.84598053412361e-05com.arstechnica
246156642435354.58793673858625e-05de.blogspot
247156633668833.35673118772062e-05com.chicagotribune
24815663034280.0011763022119838com.atdmt
249156625372619.51380930957369e-05com.baidu
250156609531500.000174749380463391com.list-manage
251156603266134.14680955973278e-05com.scribd
2521565907314302.40444714698486e-05edu.uchicago
253156577835754.32697253018267e-05com.inc
2541565566610052.98439224149668e-05com.evernote
25515654762690.000490371412441887com.wp
256156542523397.26444892171607e-05gov.ca
257156514995914.22531065519201e-05com.dropboxusercontent
258156512804095.98380198468418e-05com.cnet
259156488672858.75060881994027e-05gov.ftc
260156459378803.36512290531188e-05com.nationalgeographic
261156412916513.97372609578435e-05uk.co.independent
262156411759273.20977639678945e-05edu.yale
2631564038911522.76178324420893e-05edu.wisc
264156402908423.50248903988879e-05com.nintendo
265156380436593.96001518252408e-05us.imageshack
266156368239073.27312169013512e-05com.indiatimes
267156355208003.69358708077931e-05com.cbslocal
2681563516317001.8722596335099e-05edu.msu
269156340899443.15472671606861e-05com.naturalnews
270156327933986.16851659136072e-05com.ibm
2711563097113322.62606272746041e-05org.ieee
272156294522579.65661295436327e-05net.behance
274156291728753.37107150206004e-05com.slate
275156260204805.14543056738778e-05edu.cornell
276156244679763.07073787371343e-05org.altervista
27715622049300.00112494056494761me.m
27815620427230.00135395152637574com.messenger
280156171139263.21020959728217e-05com.mysanantonio
281156143272988.37226642358646e-05com.opera
2831561218915442.13674763862979e-05edu.virginia
285156106502350.000109224765834484de.amazon
286156099005974.20971145241691e-05com.bandsintown
287156096892689.38629534487172e-05com.fc2
289156066133078.08602374785154e-05com.staticflickr
290156062055034.94721115017751e-05com.zdnet
291156032069243.21430290644091e-05com.googledrive
292155968515864.24190156027991e-05com.theatlantic
29315595738350.00106767518499836ru.yandex
294155910535644.39081259808879e-05com.fortune
295155889978613.43835713600411e-05org.iso
296155887919473.15155421032288e-05com.redhat
298155878284875.06700299908967e-05com.w3schools
2991558743813452.59590471932613e-05it.blogspot
300155866237863.75300358233661e-05com.nature
301155849114315.74012367083115e-05com.git-scm
302155781166913.89761753170653e-05com.ggpht
303155766271930.000132159977180025com.ytimg
305155754641060.000239116241712999jp.co.google
306155748568913.33060387457206e-05org.caringbridge
307155708446663.94027427723104e-05com.buzzfeed
308155670179203.2305470209502e-05com.vagrantup
3091556671314232.41810395096844e-05com.bostonglobe
310155664325704.36715547457013e-05com.cisco
312155618467733.82831940182663e-05org.imagemagick
3131556157113182.67799890705726e-05edu.si
3141556048110003.00286031998172e-05com.economist
315155591614036.07417952206027e-05edu.nyu
317155577118163.60704732755116e-05com.engadget
318155576482070.000124999566076152com.youtube-nocookie
32015555868970.000277346301246524com.qq
321155530429213.22905898022667e-05com.deepmind
322155484767763.8040845245793e-05com.uk
323155466355324.65404973239769e-05com.cbsnews
324155464189123.26484283331076e-05com.sfgate
325155462618683.40855431726984e-05edu.umich
326155458672370.000108426094503134com.zendesk
3271554166410592.86845309523356e-05edu.upenn
328155414177843.76122537114407e-05com.tinypic
329155413483008.31191255523891e-05uk.gov
330155393807953.72790288125804e-05gov.nps
331155387637683.84226974381513e-05es.com.blogspot
3331553478510272.91722456460812e-05org.amnesty
334155336406304.06382514774634e-05org.un
3351553352410282.91342711800371e-05com.businessweek
336155305908413.51095688807082e-05com.prweb
337155303393726.6861234129418e-05com.hp
338155290943167.85334205313627e-05com.smugmug
339155268923656.80643039930844e-05io.atom
3401552601618641.66783616152529e-05com.wikispaces
341155254365614.43833209085508e-05com.deviantart
3421552454313482.59183160819967e-05com.quora
343155236993127.90560957094403e-05com.stackoverflow
34415523357110.00266026928788743com.godaddy
345155232658673.41028601949806e-05fr.blogspot
3461552146721771.43407953172429e-05edu.hawaii
348155208002300.000115300591313306com.googleadservices
349155156261000.000260210965158318com.addthis
350155117312140.00012109708512216com.weibo
351155102711890.000134991236934387com.tripadvisor
352155100564665.31867573470599e-05edu.cmu
3531550813716541.94505252040299e-05edu.academia
354155070995464.54186649903759e-05com.samsung
355155067155224.73185887798753e-05com.delicious
356155063063107.96162415466004e-05com.salesforce
3571550462313892.47638521688997e-05edu.psu
3581550176710622.8590454491023e-05edu.utexas
359155003554605.37255274210099e-05com.businesswire
360154996713287.50250431392317e-05fr.google
3611549949810542.87241689600589e-05com.thenextweb
362154993435674.37356148766312e-05com.skype
3631549931710182.94461794513264e-05com.thedenverchannel
364154990314185.87309989696444e-05com.booking
365154982609023.30560434754304e-05com.ft
3661549756314452.37713091163388e-05com.storify
367154973201380.0001870649963529org.bbb
368154966771620.000163189952970138com.bandcamp
369154943087943.73092071789939e-05com.giphy
370154942841720.000149178020840425com.eepurl
371154929507523.86710102667409e-05org.rubyonrails
372154925939803.05867053615036e-05ly.ow
373154917793507.09328555378064e-05com.technorati
3751549031114402.39101606822674e-05uk.ac.ox
3761548928717531.80336311984752e-05edu.ucdavis
377154889363068.09041634071766e-05mp.j
3781548843816691.92668234479678e-05edu.umd
380154869781960.000130600081415146org.joomla
381154868035214.74575675113086e-05ca.google
3821548601213002.71672578237322e-05com.foxnews
3831548493511382.77285208514453e-05com.ubuntu
3841548229415182.19662629504982e-05com.hootsuite
385154819813856.4295300561396e-05com.barnesandnoble
3861548104319391.59034639194696e-05edu.bu
387154810118493.47893789322901e-05br.com.uol
3881548041310522.87377660159276e-05org.change
3891547981413902.47562201385043e-05net.researchgate
3901547973210462.87858158612307e-05ca.cbc
391154778622789.10597679558766e-05com.myshopify
3921547760411892.73974389493186e-05gov.uspto
3931547392516621.93123862526904e-05au.com.blogspot
3941547158415582.11443170255096e-05com.ibtimes
395154688671270.000201494587730564com.windowsphone
3961546855316701.92511697630888e-05au.com.smh
397154671718633.43166520951593e-05tv.ustream
3981546417817641.78996714682684e-05edu.asu
3991546305513722.53582202622928e-05com.vice
400154622579353.19137383273714e-05ca.blogspot
401154621077643.85046143189563e-05com.msdn
4021546155414872.27938995549172e-05com.gigaom
4031545642418181.70999687690856e-05fr.lemonde
404154554648863.3487653580818e-05gov.arts
4051545278514192.42221672388526e-05au.net.abc
4061545266910082.97794424811164e-05com.pcworld
407154525496064.18331091885497e-05com.geocities
4081545208811032.8074916778745e-05com.over-blog
4101545169917901.74633853037583e-05edu.arizona
4111545035013802.49714498779882e-05uk.co.theregister
412154500055574.45739926153101e-05com.symantec
413154476499553.10938947847226e-05com.intel
414154468672799.09687835769951e-05com.wufoo
4151544668317251.83123804765686e-05it.scoop
41615445503410.00094307691355208com.fbsbx
4171544464513552.57197109861248e-05com.indiegogo
418154437958973.32227150997552e-05fm.last
4191544368010112.96862133554023e-05com.searchengineland
420154434046384.03884046551975e-05org.tigris
421154419431200.000213563106056296com.xing
422154417229063.28011039522532e-05com.steampowered
423154400812659.44501307131196e-05com.wixsite
424154376228933.32776293807694e-05in.co.google
426154356346963.89468179483598e-05org.kernel
4271543524315792.08114538791667e-05edu.rutgers
428154344103906.26848449332716e-05com.hubspot
429154320569453.1541523599974e-05ly.snip
4311542999716561.94377049562573e-05com.qz
432154290034375.67323209565391e-05com.example
4331542796610332.90523344374742e-05com.pinimg
4341542416627861.10264831868485e-05org.moma
4351542377715222.18888107535011e-05com.nymag
436154231821940.00013209008750952org.icann
4371542206414702.32323539012823e-05com.examiner
4381542152010662.85530482795166e-05com.communitywalk
439154180616444.01838728589206e-05com.googlegroups
440154177965054.94289151194946e-05com.wiley
441154175763626.8485542721816e-05us.icio
442154167429713.07993617237965e-05org.wish
443154159848273.57195346814776e-05com.brightcove
444154150334645.3283954538054e-05com.fastcompany
4451541428015992.03032488006835e-05br.com.blogspot
4461541366213152.68653930493553e-05jp.blogspot
4471541314015132.21687586861463e-05edu.duke
448154128169383.17991908431134e-05com.tqn
4491541154220731.50117732512574e-05com.denverpost
4501541148718901.64147347916477e-05edu.ufl
451154094249503.13211107727205e-05com.jsbin
452154089099593.10548447850379e-05com.bibliocommons
453154087001770.000144292920751828jp.ne.hatena
4541540760315672.10091681230242e-05edu.ucsd
4551540714112692.72490186529387e-05com.shutterstock
456154065635204.74694841820413e-05com.nwsource
457154041989163.25322651787121e-05org.lls
458154029114865.06747050151017e-05com.sxsw
4591540258113132.6876013908712e-05com.box
4601540187317981.73668495255569e-05edu.tamu
4611540080713632.55335729572364e-05com.lifehacker
462154002346773.91846367320029e-05org.bitbucket
463154002148903.33135211093082e-05com.reverbnation
4641539904710382.89493063164984e-05com.intuit
4661539768914262.41475123446898e-05com.theglobeandmail
4671539531115272.17995803147687e-05com.boston
4681539493526391.16533507648765e-05edu.wsu
4691539486425081.2433445453334e-05com.codecademy
470153948563038.18811211827657e-05com.mailchimp
4711539468935998.40481697313659e-06edu.case
473153941588543.4663109492416e-05org.vim
4741539322713612.55650930767044e-05gov.usgs
4751539291829381.03733739575562e-05edu.udel
4761539164015882.06360756674507e-05com.lulu
478153889245714.35997856907777e-05com.herokuapp
4791538857928331.07911954379239e-05edu.usf
4801538738514142.42826197261023e-05org.arxiv
4811538649114582.3550335270772e-05com.hollywoodreporter
482153855028293.56957822588539e-05com.squareup
483153847015604.45055236043777e-05com.rawgit
484153845989943.01587098357822e-05int.wipo
485153845445134.89234029936009e-05int.who
4861538425314792.30586570568822e-05com.mlb
487153807584235.84115059504074e-05nl.google
4881537975415142.21563978466527e-05uk.ac.cam
489153790704355.69413663780201e-05it.google
4901537783324571.27170779759503e-05edu.rochester
491153770855834.26581830659939e-05com.sciencedirect
492153769525404.5590233520292e-05es.google
493153759304925.04565655944046e-05com.prnewswire
496153740635124.89486762586997e-05com.netdna-cdn
497 15373640 895 3.32748831426672e-05 gov.census
498 15372707 526 4.70109683913423e-05 org.acm
499 15372236 997 3.00968502339631e-05 uk.co.eventbrite
500 15371625 264 9.45864958803737e-05 com.dribbble
502 15369552 438 5.66258761382857e-05 com.bigcartel
503 15369206 777 3.80078034824024e-05 com.blackberry
504 15368955 323 7.62773882788658e-05 it.placehold
505 15368028 2523 1.23363276509568e-05 com.freewebs
507 15365341 847 3.491720143606e-05 gov.house
508 15365035 542 4.5495955558977e-05 com.moz
509 15364679 1037 2.89595958340174e-05 com.pcmag
510 15363450 1090 2.82004663551755e-05 th.co.lazada
511 15363266 595 4.21623397499087e-05 com.java
512 15363220 2446 1.27866233856021e-05 edu.brown
513 15362506 1490 2.27546025150466e-05 org.unesco
514 15357119 1039 2.89434233736272e-05 com.timeanddate
516 15355214 2062 1.50466348954934e-05 com.me
517 15354656 556 4.45769549760707e-05 com.wunderground
518 15354608 478 5.15403521918125e-05 com.naver
519 15354297 1020 2.93180988863568e-05 com.googlelabs
520 15354200 2545 1.21756108703223e-05 edu.dartmouth
521 15354116 1069 2.84725561625317e-05 com.cafepress
522 15353246 1658 1.93811871032173e-05 nl.blogspot
523 15352221 3361 9.06439332694793e-06 com.blog
524 15349206 566 4.37902161051386e-05 net.openid
525 15348907 1181 2.74418894971913e-05 gov.state
526 15348783 587 4.2359471534077e-05 com.campaign-archive2
527 15348394 502 4.95388543949701e-05 com.snapchat
528 15347563 3530 8.63918778341324e-06 com.answers
529 15347241 3126 9.74923014967996e-06 com.panoramio
530 15346426 762 3.85484763743062e-05 org.doi
532 15343527 519 4.77202432693482e-05 gov.usda
533 15343427 995 3.01325901940305e-05 com.nydailynews
534 15343011 2678 1.14107241890089e-05 com.fiverr
535 15342904 404 6.05683558445266e-05 gov.irs
536 15340458 1586 2.06527553264387e-05 com.mediafire
537 15338766 2049 1.51263773021797e-05 com.yolasite
538 15338739 1793 1.74163462409128e-05 edu.northwestern
539 15338645 3048 9.99423133900803e-06 au.edu.unimelb
540 15338264 2955 1.02945568869793e-05 pt.sapo
541 15337836 2252 1.3890799881422e-05 edu.uci
542 15337304 2156 1.45372776850819e-05 com.vox
543 15337112 1343 2.60056262188797e-05 de.spiegel
544 15336552 1831 1.6951961616819e-05 edu.indiana
545 15336407 1499 2.24037156386934e-05 com.mixcloud
546 15336092 402 6.13443915079197e-05 com.gotowebinar
547 15335961 1486 2.27988200023977e-05 com.politico
548 15335080 1928 1.60404649507357e-05 uk.ac.ed
549 15334549 2453 1.27409920227876e-05 org.coursera
550 15333737 2728 1.12118871484912e-05 edu.iastate
551 15333326 1320 2.67149480639566e-05 com.gofundme
552 15331375 1311 2.69569796358108e-05 com.sciencedaily
553 15328641 2561 1.20420514958202e-05 edu.wustl
554 15326663 1385 2.48050571742303e-05 com.com
555 15324920 627 4.09333910649588e-05 org.postgresql
556 15324038 2742 1.11586867111723e-05 com.aljazeera
557 15322328 3388 8.99605195071346e-06 cc.tiny
558 15322264 2022 1.52880655359817e-05 edu.jhu
559 15321808 869 3.39056838936811e-05 tv.twitch
560 15320933 4059 7.40539131056082e-06 org.edublogs
561 15320472 1606 2.01774638677572e-05 com.scientificamerican
562 15320227 2253 1.38879432187941e-05 edu.georgetown
563 15320215 831 3.56813161908985e-05 fr.amazon
564 15318380 1503 2.23612984764625e-05 org.eff
565 15317401 3437 8.86702554218844e-06 com.metacafe
566 15317100 191 0.000133568759135424 jp.ameblo
567 15317067 1693 1.8888419168614e-05 edu.purdue
568 15315866 1341 2.60218077289035e-05 com.techtarget
569 15315501 1796 1.7382847608717e-05 com.computerworld
570 15315302 358 6.92385615991618e-05 com.list-manage1
571 15314942 2630 1.16835618607635e-05 com.voanews
573 15313159 810 3.63928853611669e-05 org.filezilla-project
575 15312739 1584 2.07088116839661e-05 com.globo
576 15312531 1695 1.88078663306853e-05 com.espn
577 15312290 61 0.000565420241560954 com.googletagmanager
579 15310359 1953 1.57895689797867e-05 com.econsultancy
580 15309983 2199 1.42058821135541e-05 edu.caltech
581 15309668 4006 7.50369095396379e-06 ca.yorku
584 15308817 1493 2.26547145145621e-05 uk.co.mirror
585 15307236 1405 2.44901711292864e-05 gov.wa
586 15306671 149 0.000177379963368227 com.bleacherreport
587 15306314 112 0.000229383265424507 ru.mail
588 15306144 555 4.46855914261337e-05 com.wikihow
589 15305890 2552 1.20966873341379e-05 com.posterous
590 15305715 3450 8.82359793937048e-06 edu.getty
591 15305261 3153 9.66215408282903e-06 edu.iu
592 15305198 1802 1.73123902085213e-05 com.networkworld
593 15303934 2990 1.01379864079042e-05 edu.rice
594 15303667 493 5.03591727892366e-05 com.force
595 15302835 3165 9.62217258755941e-06 com.popsci
596 15302205 2057 1.50768807284092e-05 com.blogs
597 15301680 966 3.09598932353572e-05 ca.amazon
598 15301018 1492 2.26668010052997e-05 com.prezi
599 15300885 675 3.92213914102238e-05 org.openstreetmap
600 15300438 1433 2.39976511418591e-05 org.videolan
601 15300390 2692 1.13711151099246e-05 edu.oregonstate
602 15300263 905 3.28686767582425e-05 com.mckinsey
603 15299663 1374 2.53508968618347e-05 co.vine
604 15297995 2171 1.44041517200603e-05 com.udemy
605 15296726 1690 1.90166303335655e-05 com.indeed
606 15296477 1303 2.71367812776429e-05 com.500px
607 15295228 1770 1.77233586297493e-05 com.airbnb
608 15294719 1756 1.79860216972368e-05 com.us
609 15294676 2896 1.05399264773873e-05 edu.buffalo
610 15293634 1684 1.90790330006778e-05 com.codeplex
611 15292934 253 9.81881014385652e-05 jp.co.amazon
612 15292288 1542 2.14433652955457e-05 mil.navy
613 15291760 357 6.9641067198322e-05 net.themeforest
614 15291748 1875 1.65847048348261e-05 com.searchenginewatch
615 15291136 1001 2.9937094462355e-05 com.weather
616 15290802 2498 1.24886423946348e-05 com.instructables
617 15289988 1598 2.04147744275967e-05 com.infoworld
618 15288865 1338 2.60730524379888e-05 org.postimg
619 15288655 1014 2.96274588622905e-05 gov.nist
620 15287654 1655 1.94498068426432e-05 me.flavors
621 15287153 2189 1.42494598478046e-05 hk.com.google
622 15285623 1569 2.09919187442072e-05 org.worldbank
623 15285448 446 5.57049729509609e-05 jp.ne.sakura
624 15285211 919 3.24107060440747e-05 gov.senate
625 15284629 1330 2.62832178308331e-05 com.dell
626 15283827 1137 2.77346639122896e-05 com.alternion
627 15283594 1858 1.67084293512309e-05 org.weforum
628 15282288 2578 1.19657096495962e-05 edu.vanderbilt
629 15281800 1048 2.87643509027424e-05 com.istockphoto
630 15281782 1319 2.67520978971456e-05 org.unicef
631 15281371 1819 1.70846042948261e-05 com.blogtalkradio
632 15281301 1368 2.54329071778898e-05 com.psychologytoday
633 15280641 1604 2.02133971936181e-05 com.digitaltrends
634 15279721 1107 2.80258341333577e-05 com.bitballoon
635 15279522 201 0.000128219206443624 org.purl
636 15279264 36 0.00104599466571899 com.parallels
637 15279200 2880 1.0610438448301e-05 au.edu.anu
638 15279147 1379 2.4993271977574e-05 com.walmart
642 15278238 1917 1.61582921968463e-05 com.howstuffworks
643 15278227 2410 1.30002357541342e-05 au.com.theaustralian
644 15277966 2019 1.52926412575938e-05 ca.utoronto
645 15277900 3108 9.82306713797465e-06 com.seekingalpha
646 15277858 767 3.84259493608188e-05 com.cargocollective
647 15277312 1036 2.89853194713678e-05 com.yarnpkg
649 15275046 2546 1.21728936988854e-05 com.patch
650 15273144 530 4.65890979908163e-05 com.marriott
651 15272960 2564 1.20339218653302e-05 nz.co.stuff
652 15272836 1461 2.34742927248467e-05 com.bufferapp
653 15272534 233 0.000113347706063239 com.shopify
654 15272287 2861 1.0673622554462e-05 br.com.abril
655 15272172 579 4.31163726460518e-05 gov.fda
656 15272111 2654 1.15546716330409e-05 org.metmuseum
657 15270748 1771 1.7720478702479e-05 com.mac
658 15270052 1921 1.61418125507527e-05 edu.unc
659 15268572 885 3.35402797098504e-05 com.webmd
661 15268283 2121 1.47329526454372e-05 com.nokia
662 15268033 1483 2.2878848034952e-05 org.plos
663 15267754 4046 7.42719003848336e-06 edu.ua
664 15267001 2238 1.3967856594105e-05 org.wiktionary
665 15264534 436 5.67508362709212e-05 com.bitly
666 15264521 1739 1.8211369785099e-05 com.xkcd
667 15264460 1930 1.59995476208952e-05 com.amzn
668 15264453 2469 1.26617143551831e-05 com.allthingsd
669 15262825 2856 1.06933839212956e-05 edu.ucr
670 15262636 1511 2.21802765343256e-05 gov.fbi
671 15262608 1101 2.81053814869449e-05 com.oreilly
672 15260639 1975 1.56605882146644e-05 net.comcast
674 15259591 2391 1.30968680707361e-05 com.pastebin
675 15259139 2476 1.26118695308562e-05 net.boingboing
676 15258733 1784 1.75043751947912e-05 com.playstation
677 15258666 2398 1.30425871141071e-05 com.canalblog
678 15258256 695 3.89580808757773e-05 net.launchpad
679 15257933 477 5.1625634539935e-05 com.youku
680 15256999 3412 8.92246050318757e-06 cc.co
681 15256828 3140 9.69399959688503e-06 com.twitpic
683 15256704 1613 2.00790892379684e-05 com.today
684 15256213 1116 2.79029401808774e-05 org.cmlibrary
686 15255446 1222 2.73108581082248e-05 de.heise
687 15255118 2228 1.40200068711249e-05 com.googlepages
688 15254918 1672 1.92155195280816e-05 com.technet
689 15254068 302 8.22175508543349e-05 com.getbootstrap
690 15253812 3433 8.87454964811042e-06 edu.temple
691 15252999 2556 1.20687755115901e-05 com.ign
693 15252474 3151 9.66350917742778e-06 com.upi
694 15252276 2074 1.50071783868689e-05 com.nba
695 15252014 848 3.48867990264086e-05 cn.com.sina
696 15251586 1870 1.66343745503257e-05 uk.co.thesun
697 15250778 2105 1.48179911042675e-05 com.discovery
698 15250169 1304 2.71164863688636e-05 com.getpocket
699 15249530 1932 1.5985973562373e-05 com.thedailybeast
700 15249456 894 3.32758542409795e-05 org.gnupg
701 15249338 986 3.03560667284895e-05 net.azurewebsites
702 15249295 1359 2.56231659003632e-05 gov.fcc
703 15249233 1325 2.6533125002868e-05 ru.google
704 15249091 809 3.64185428454994e-05 com.hilton
705 15248094 1121 2.78672834461891e-05 gov.ny
706 15246873 4155 7.2064158065949e-06 edu.syr
707 15246346 1448 2.37502348483709e-05 br.com.google
708 15246265 1523 2.18628778111356e-05 com.zazzle
710 15245258 604 4.19341270866239e-05 com.entrepreneur
711 15245122 2348 1.34109723726176e-05 ch.ethz
712 15244639 4244 7.05352140883181e-06 com.space
713 15244618 3062 9.9552002600227e-06 com.indiewire
714 15243268 3689 8.20683368467091e-06 it.repubblica
715 15242487 1159 2.75783380257729e-05 jp.exblog
716 15241707 1627 1.97344047444238e-05 com.vmware
717 15241153 1772 1.77159273715622e-05 gov.nyc
718 15240597 4469 6.70690376470189e-06 edu.fiu
719 15240265 585 4.24602948328796e-05 com.typeform
720 15239779 1119 2.7880890311812e-05 edu.alamo
721 15239357 576 4.32670162572188e-05 com.metafilter
722 15239100 1874 1.66039378221903e-05 gov.utah
723 15238566 2756 1.10983574160883e-05 edu.pitt
724 15238213 837 3.54670748199058e-05 com.technologyreview
725 15237455 2743 1.11564239851983e-05 ms.1drv
726 15236971 2001 1.54095292080742e-05 com.mercurynews
727 15235587 2528 1.23026764563188e-05 com.urbandictionary
728 15235218 4812 6.26441533285652e-06 uk.ac.st-andrews
731 15234240 500 4.95978355797237e-05 org.freecsstemplates
732 15234119 2310 1.36176106587469e-05 edu.ncsu
733 15233298 774 3.81419726467888e-05 com.newsweek
734 15233113 2515 1.23694068280388e-05 edu.vt
735 15232928 2242 1.39511923347534e-05 org.gimp
736 15232535 2125 1.47205731999793e-05 com.ehow
737 15232422 2766 1.10740320522725e-05 org.greenpeace
739 15231780 1386 2.4798751939996e-05 com.steamcommunity
740 15231759 1681 1.90934768631241e-05 com.mcafee
741 15230059 2579 1.19584044094577e-05 com.irishtimes
742 15229095 1597 2.04331098882185e-05 com.livestream
743 15228828 1364 2.55057605268654e-05 com.stackexchange
744 15228452 1377 2.51087702775784e-05 com.feedly
745 15228080 4305 6.93390584748876e-06 edu.pdx
746 15227949 244 0.000102784590967636 jp.co.yahoo
747 15227896 1879 1.65447602576108e-05 com.zillow
748 15227135 2399 1.30401089444467e-05 com.foxsports
749 15226937 1217 2.73204489809514e-05 com.redbubble
751 15225915 1571 2.09669925267844e-05 com.reference
752 15225808 1775 1.76300650079329e-05 se.google
753 15225166 518 4.78475650600298e-05 gov.copyright
754 15224935 2046 1.51404543018914e-05 uk.co.wired
755 15224708 1053 2.87281963287335e-05 com.templatemonster
756 15223786 2603 1.18487163717953e-05 com.tutsplus
757 15223148 2275 1.38238195338617e-05 edu.uiuc
758 15221995 5134 5.8713899338997e-06 edu.brandeis
759 15221724 2563 1.20373170102433e-05 com.observer
760 15220659 1444 2.37776101286514e-05 com.justgiving
761 15220635 3017 1.00492793873175e-05 gd.is
762 15220238 143 0.000182177431084415 com.people
764 15218272 454 5.43767285380984e-05 com.nypost
765 15217792 2017 1.52947227103117e-05 org.openoffice
766 15217664 1521 2.19029938843643e-05 net.yahoo
767 15217239 515 4.86242565462417e-05 com.photoshelter
768 15216199 1105 2.80631360388941e-05 org.stopidentityfraud
769 15215956 2941 1.03557602124572e-05 edu.nd
770 15215827 1076 2.83930562753374e-05 com.sagepub
771 15215619 792 3.73306683068506e-05 com.cdbaby
772 15215146 3080 9.89422877978489e-06 com.pbworks
773 15215078 3131 9.73273115724507e-06 ch.epfl
774 15214730 1016 2.95913720075838e-05 net.docdroid
775 15214344 3538 8.59892849115483e-06 com.nationalpost
776 15214306 1836 1.6890525363002e-05 nl.xs4all
777 15213130 2531 1.22807841365904e-05 edu.osu
778 15212390 1754 1.80225680175084e-05 com.target
779 15211733 2996 1.01271424887745e-05 ca.ualberta
780 15210252 2264 1.38516035718976e-05 com.britannica
781 15208855 3521 8.66915050464592e-06 com.readwriteweb
782 15208756 2362 1.32853529783148e-05 tv.blip
783 15208449 662 3.95683024324097e-05 com.webnode
784 15207643 533 4.61857724018269e-05 com.informit
785 15207576 588 4.23206675622496e-05 com.houzz
786 15206681 917 3.25050735019443e-05 com.atlassian
787 15206417 703 3.88738057972146e-05 gov.epa
788 15205305 1436 2.39702540982123e-05 com.patreon
789 15204875 1913 1.62263884951756e-05 org.pnas
790 15204526 4602 6.54795414698615e-06 com.plurk
791 15204495 2565 1.20309774714031e-05 com.deadline
792 15204117 2461 1.2694191531719e-05 com.csmonitor
793 15203245 2558 1.20590026406928e-05 com.thefreedictionary
794 15202856 1482 2.29792000644675e-05 com.techradar
795 15201654 366 6.79713039346072e-05 gov.export
796 15201546 1302 2.71523901829089e-05 jp.jugem
797 15201171 3914 7.66665907938012e-06 com.patheos
798 15200965 797 3.70188361564265e-05 com.springer
799 15200633 3161 9.63918862375184e-06 com.fifa
801 15199882 2910 1.04940988571428e-05 uk.co.metro
802 15199828 2796 1.09820826242647e-05 com.podbean
803 15199347 2842 1.07590520070566e-05 com.eu
804 15198865 2341 1.34651543708415e-05 com.9to5mac
805 15198766 3893 7.72829568187162e-06 edu.ucsc
806 15198366 1366 2.54906334920141e-05 gov.sba
807 15197638 2529 1.22968407272032e-05 org.khanacademy
808 15196883 3320 9.15893198278876e-06 com.discovermagazine
809 15196046 1630 1.97102959583915e-05 com.bloglovin
810 15195823 1312 2.68820751313068e-05 gov.dot
811 15194939 2479 1.26030313566866e-05 gov.cia
812 15194836 3913 7.66754130260334e-06 edu.missouri
813 15194521 741 3.86928597648577e-05 gov.sec
814 15194106 462 5.34455013327675e-05 com.adage
816 15194059 1748 1.81076747100426e-05 org.owasp
818 15193361 814 3.62833594255235e-05 org.maven
820 15193182 590 4.22817895742876e-05 com.clicky
822 15192587 1986 1.5564837127422e-05 com.netvibes
823 15191503 1701 1.86858940694352e-05 uk.org.greenend
824 15191192 3079 9.8973806614334e-06 edu.uic
825 15190493 1556 2.11535012115229e-05 net.seesaa
827 15189584 1871 1.66248944998599e-05 com.livescience
828 15189058 2329 1.3520097935398e-05 com.newscientist
829 15189011 2434 1.28219860937364e-05 edu.umass
830 15188283 4481 6.69072560608157e-06 edu.byu
831 15188205 2343 1.34543166872523e-05 edu.gwu
832 15187440 2505 1.24388641430322e-05 com.getsatisfaction
833 15186750 3563 8.49826407908942e-06 edu.ucar
834 15186042 1967 1.57161284089379e-05 com.bestbuy
835 15185601 2618 1.17677918287584e-05 org.slashdot
836 15185560 3302 9.20889985322655e-06 com.lynda
837 15185025 1552 2.12374312638794e-05 com.itunes
838 15183956 898 3.31541720640316e-05 com.linksynergy
839 15182044 220 0.000119236860219388 com.elegantthemes
840 15181895 471 5.27435047435429e-05 com.zenfolio
841 15181614 299 8.36816972799617e-05 kr.flic
842 15181486 4787 6.29442653768555e-06 com.treehugger
843 15181407 4906 6.14399747187076e-06 com.4shared
844 15181277 2775 1.10589713260247e-05 sg.com.google
845 15181256 870 3.38511869472674e-05 com.uservoice
846 15180717 2553 1.20874453751689e-05 au.com.news
847 15180230 2959 1.02691773243877e-05 edu.uoregon
849 15179697 39 0.000963859187439238 com.atlassolutions
850 15179276 1415 2.42376980846539e-05 com.calameo
851 15178765 694 3.89619798369146e-05 gov.ed
853 15178015 1441 2.38849164666484e-05 net.edgesuite
854 15176822 1810 1.72249794008617e-05 com.angelfire
855 15176641 2864 1.06660801729435e-05 com.friendfeed
856 15176459 3346 9.09973283027648e-06 com.squidoo
857 15176231 4362 6.86729655781333e-06 edu.sjsu
858 15175482 534 4.59747379453658e-05 com.list-manage2
859 15175282 639 4.03760068683785e-05 org.w
860 15175276 1976 1.56554998174229e-05 com.billboard
861 15175270 888 3.33229976472426e-05 org.jenkins-ci
862 15174784 1815 1.71639406507681e-05 org.craigslist
863 15173098 1413 2.43275295847814e-05 ch.google
864 15172912 1624 1.98193414781252e-05 com.norton
865 15172759 2499 1.24788748858098e-05 org.gutenberg
866 15172598 3542 8.58529237980699e-06 edu.gmu
867 15172079 4332 6.90179812828641e-06 com.urbanoutfitters
869 15170249 963 3.0982348094362e-05 com.warriorplus
870 15170065 2892 1.05536267729891e-05 gov.lbl
871 15169915 2430 1.28272703930632e-05 ca.ubc
872 15169647 3670 8.24518633348136e-06 edu.emory
873 15169635 1960 1.57546904235563e-05 com.madmimi
874 15169591 2777 1.10579126365468e-05 com.flipboard
875 15168793 2649 1.158087404906e-05 uk.co.express
876 15168561 3044 9.9973459042887e-06 com.groupon
877 15168529 6596 4.61066717637004e-06 fm.ask
878 15168472 1758 1.79602590370848e-05 org.dyndns
879 15168121 940 3.16738878962008e-05 org.osgeo
880 15167971 3402 8.95368354470453e-06 edu.unh
881 15167568 3702 8.16829795347617e-06 com.dreamstime
882 15167447 2174 1.43790430825757e-05 de.mpg
883 15167217 2352 1.33549876043573e-05 com.timeout
884 15167075 2291 1.37930561320996e-05 com.trello
885 15166885 4434 6.75650883193862e-06 net.inquirer
887 15163788 3397 8.97470907356599e-06 tt.db
888 15163403 2568 1.2028676812975e-05 com.smashwords
889 15163045 4009 7.5009851913861e-06 org.lifehack
890 15162790 611 4.15506626046456e-05 com.mapbox
892 15162627 4088 7.33799753523127e-06 com.theta360
894 15162508 2650 1.15685409144789e-05 com.elpais
895 15162189 3984 7.5452462328069e-06 mx.unam
897 15162074 1715 1.84504055261042e-05 com.hyatt
898 15161589 2299 1.37457361006392e-05 ca.qc.gouv
899 15161293 2915 1.04695633262643e-05 edu.uga
900 15160912 2890 1.05582698397468e-05 tv.periscope
901 15160846 2239 1.39665182554598e-05 com.ikea
902 15160700 3231 9.40684172196742e-06 com.rt
903 15160131 1508 2.22312003369202e-05 com.netscape
904 15159916 4084 7.3412350461331e-06 com.dpreview
905 15159557 2969 1.02451694615442e-05 com.waze
906 15159470 3174 9.58306422626019e-06 edu.tufts
908 15158248 1465 2.32953650777979e-05 com.usps
909 15158207 2633 1.16729003842566e-05 org.wnyc
910 15156889 4620 6.52677225898427e-06 com.pbase
911 15156880 2327 1.35354017520418e-05 uk.co.huffingtonpost
912 15156876 2204 1.41810128439695e-05 com.socialmediaexaminer
913 15156485 2452 1.27547105359095e-05 com.miamiherald
914 15156351 1677 1.91454793535634e-05 com.163
915 15155262 1929 1.60291831429757e-05 com.nfl
916 15153991 1525 2.18455988672043e-05 com.fedex
917 15153959 1834 1.69239070862306e-05 com.suntimes
918 15153907 3435 8.87050857955755e-06 net.uk
919 15153765 2846 1.07327970998889e-05 net.earthlink
920 15153389 2575 1.19824685546719e-05 org.eu
921 15151835 1043 2.88478957939967e-05 jp.ac.kobe-u
922 15151429 4014 7.48637493537742e-06 com.macrumors
923 15150887 622 4.12161691159013e-05 com.friendster
925 15150658 4938 6.09241171910296e-06 li.paper
926 15150593 1017 2.94689613731383e-05 org.aclweb
929 15147788 4248 7.04318153363729e-06 org.nypl
930 15146250 3661 8.25885549831049e-06 com.seattlepi
931 15146224 932 3.19355693878378e-05 edu.utep
932 15146129 4091 7.33275777840132e-06 com.esquire
933 15145615 1651 1.94907985965323e-05 com.alibaba
934 15145254 1946 1.58403729551741e-05 com.gallup
935 15145191 975 3.07094414631654e-05 com.gartner
936 15144687 3032 1.00243908229847e-05 org.ibiblio
937 15144604 2206 1.41652445483039e-05 ru.habrahabr
938 15144355 569 4.36758910514297e-05 com.orkut
939 15144010 3765 8.05161795367949e-06 com.psychcentral
941 15142711 999 3.00349367316673e-05 com.tandfonline
942 15142296 4075 7.3622724322498e-06 be.blogspot
943 15141826 1636 1.96036095572849e-05 ru.spb
944 15140942 1191 2.73934146878194e-05 jp.ne.goo
945 15140702 5199 5.81304350399957e-06 io.soup
946 15140439 6368 4.80285781528238e-06 com.autoblog
947 15140096 4769 6.31731333379226e-06 gr.blogspot
948 15140030 1731 1.82779854197245e-05 com.marketingland
949 15139772 3355 9.0739118458068e-06 org.propublica
950 15139731 1382 2.49154931051116e-05 com.unsplash
951 15139224 3672 8.24261668907916e-06 uk.ac.vam
952 15138529 54 0.000677034331007159 net.ovh
953 15138391 1853 1.67490179527703e-05 org.ap
954 15138382 1747 1.81116122309211e-05 de.bayern
955 15138317 2614 1.1780491106467e-05 com.ezinearticles
956 15138275 3532 8.63374503998574e-06 at.ac.univie
957 15138076 1902 1.62814600358615e-05 com.lww
958 15137904 826 3.58116025428148e-05 org.whatbrowser
959 15137802 1909 1.62391225244165e-05 ru.narod
960 15137495 1799 1.73626417117781e-05 de.welt
961 15137128 2527 1.23125015700162e-05 com.vogue
963 15136768 2667 1.14903880876818e-05 com.sacbee
964 15136495 2223 1.40422839026896e-05 com.salon
966 15135193 3370 9.04444445804001e-06 com.howtogeek
967 15135056 3877 7.75786443249133e-06 com.scienceblogs
968 15134830 2576 1.19753053105799e-05 com.speakerdeck
969 15134635 3740 8.0925036532909e-06 edu.utk
970 15134622 2115 1.47665516324275e-05 edu.colorado
971 15134362 2663 1.15263452785689e-05 ca.ctvnews
972 15133333 3216 9.46266143255436e-06 edu.rit
973 15133149 5021 5.99011002035218e-06 com.spreaker
974 15132872 3172 9.58906484578881e-06 com.sbnation
975 15132829 4474 6.70057653000476e-06 com.ravelry
976 15132461 9119 3.36255538243955e-06 com.purevolume
977 15132213 1429 2.40986597355896e-05 org.redcross
978 15131951 1012 2.96536226245788e-05 it.amazon
979 15130530 3830 7.87841169754985e-06 jp.co.japantimes
980 15130523 3947 7.60521847167761e-06 edu.neu
981 15129963 2984 1.01806930403748e-05 org.notepad-plus-plus
982 15129854 1249 2.72621515187053e-05 us.tx.state
983 15129838 5806 5.24730127406629e-06 edu.stonybrook
984 15129190 879 3.36537032047091e-05 com.githubusercontent
985 15128753 3356 9.07349412732171e-06 nz.co.nzherald
986 15128682 4466 6.71188460059978e-06 au.edu.uq
987 15128202 1743 1.81424145083871e-05 com.hulu
989 15127370 3722 8.11380321181256e-06 com.kotaku
990 15127197 2803 1.09596426379084e-05 com.pixlr
991 15126832 2536 1.22403614813801e-05 org.plosone
992 15126819 2319 1.35869429391614e-05 be.skynet
993 15126628 5285 5.73383523576963e-06 org.kiva
994 15126245 5889 5.19218457452368e-06 com.dailykos
995 15125565 1947 1.58394285740344e-05 com.windows
997 15124648 4380 6.83359980472686e-06 com.appleinsider
998 15124445 7272 4.16137009681623e-06 com.zimbio
999 15124312 3493 8.72400390257565e-06 com.sony
1000 15124231 1516 2.20203522611855e-05 com.usnews
1001 15124118 2213 1.41046155978645e-05 org.jstor
1002 15124063 3578 8.44147236834851e-06 br.usp
1003 15123962 1692 1.89005659232104e-05 com.hotmail
1004 15123945 3591 8.42266358295749e-06 org.publicradio
1005 15123923 4299 6.94773919350873e-06 ca.queensu
1007 15122919 3094 9.85256523885283e-06 com.techsmith
1008 15122846 1392 2.47451015821843e-05 com.starwoodhotels
1009 15122669 2332 1.35082248225311e-05 int.itu
1010 15122441 4721 6.38885267967483e-06 ru.msk
1011 15122321 1918 1.61560935000156e-05 com.slack
1012 15122267 3518 8.67216939620092e-06 com.ocregister
1013 15121528 1561 2.11146320845391e-05 com.att
1014 15119594 2419 1.29223390974961e-05 ru.com
1016 15118761 2726 1.12330945363111e-05 fr.lefigaro
1017 15118270 1936 1.59672085303341e-05 com.merriam-webster
1018 15117908 815 3.62635751772531e-05 org.sonatype
1019 15117906 1576 2.08815868514927e-05 com.domain
1020 15117709 819 3.60499120933233e-05 jp.livedoor
1021 15117533 2047 1.51327792586241e-05 com.gettyimages
1022 15117448 2780 1.10566824510243e-05 org.pewresearch
1023 15117248 2146 1.46293466334101e-05 com.xbox
1024 15117204 2336 1.34965990334014e-05 com.philly
1025 15117135 1998 1.54386617149377e-05 com.macworld
1026 15116770 2935 1.03852509028486e-05 com.pandora
1027 15116005 2235 1.39818040735583e-05 com.screencast
1028 15115778 2913 1.04753710838801e-05 cn.com.chinadaily
1029 15115238 1781 1.75423937882377e-05 co.angel
1030 15114938 3611 8.37665878835035e-06 com.softpedia
1031 15114818 3438 8.86482430240979e-06 com.sun-sentinel
1032 15114562 3761 8.0617698796768e-06 com.gq
1033 15113975 4760 6.33100426632562e-06 uk.ac.lse
1034 15113456 5622 5.41405793070476e-06 ie.dcu
1035 15113066 4242 7.05731974728055e-06 cz.cvut
1036 15112227 1905 1.62743170924513e-05 us.mn.state
1037 15111997 4000 7.52083022965871e-06 com.gamespot
1038 15111439 1680 1.90947838704124e-05 net.battle
1039 15111082 4326 6.90589418244931e-06 org.kqed
1040 15110600 2513 1.23787698420637e-05 com.deezer
1041 15110590 6308 4.84743641510709e-06 com.weheartit
1042 15110406 2436 1.28163659354189e-05 us.fed.fs
1043 15110005 6060 5.04386130379262e-06 edu.boisestate
1044 15109716 3658 8.26486260361324e-06 com.itv
1045 15109619 3575 8.44807075839288e-06 com.chromeexperiments
1046 15109222 2427 1.28446958304232e-05 com.ecwid
1047 15109222 1817 1.71096481543741e-05 com.rollingstone
1048 15109090 2162 1.45052881285571e-05 com.lg
1049 15109034 649 3.97925746370024e-05 gov.usa
1050 15108729 1384 2.48508143917158e-05 com.mediapost
1051 15108642 1894 1.63687473890889e-05 org.7-zip
1053 15107528 3451 8.82152744851336e-06 uk.ac.ucl
1056 15106039 1418 2.42315485369215e-05 org.oecd
1057 15105171 4587 6.56637741524037e-06 edu.uconn
1058 15104892 1992 1.54943353581812e-05 com.oup
1059 15104097 2187 1.42705509195955e-05 com.fastcodesign
1060 15104069 1761 1.7949821530063e-05 org.fao
1061 15103689 1966 1.57209150085303e-05 com.ssrn
1062 15103023 612 4.15348394643097e-05 com.ea
1063 15102701 1197 2.73652354802987e-05 site.tenerifeforum
1064 15102234 2456 1.27338596207715e-05 org.gnome
1065 15102132 2812 1.09082842009225e-05 com.dallasnews
1066 15101892 4703 6.41996854624355e-06 edu.baylor
1067 15100957 2625 1.17066663780132e-05 com.upwork
1068 15100938 568 4.3688041709467e-05 com.udacity
1069 15100803 3810 7.92556546534424e-06 com.marthastewart
1071 15099903 108 0.000233897006385183 com.scorecardresearch
1072 15099840 7227 4.18439407525323e-06 com.smore
1073 15099764 2185 1.42753327167622e-05 gov.uscourts
1074 15099487 2331 1.35150828730417e-05 com.thestar
1075 15099386 3796 7.95835809109849e-06 com.mediabistro
1076 15098556 2467 1.26718746433665e-05 com.washingtontimes
1077 15098287 2173 1.43821141052024e-05 gov.fws
1078 15098247 2698 1.13376477961827e-05 de.fu-berlin
1079 15096938 3394 8.97977597433521e-06 org.thinkprogress
1080 15096415 2378 1.31988303928714e-05 uk.co.ebay
1081 15096031 1534 2.16177620913569e-05 org.cancer
1082 15095930 2192 1.42362632550556e-05 com.tunein
1084 15095382 4528 6.63269521044456e-06 net.fusion
1085 15095323 3483 8.75212816606324e-06 com.shutterfly
1086 15094901 2845 1.07360892958791e-05 edu.unl
1087 15094502 1633 1.96647199416434e-05 gov.archives
1088 15094481 3555 8.53632335646553e-06 ca.uwaterloo
1089 15094330 4195 7.13316306266006e-06 com.rebelmouse
1090 15094229 7095 4.27197287772721e-06 nr.co
1091 15094084 1073 2.84596850945921e-05 com.wsoctv
1092 15093234 2440 1.28051282516683e-05 ch.cern
1093 15092967 1376 2.51450016917556e-05 com.magentocommerce
1094 15092862 3005 1.00907350023168e-05 com.blurb
1095 15092691 2730 1.12020560201465e-05 com.medscape
1098 15091627 4891 6.17105152283906e-06 com.thingiverse
1099 15091515 1612 2.00799886698044e-05 com.biblegateway
1100 15091174 3187 9.55435977880778e-06 com.carbonmade
1101 15090152 2328 1.35223797685066e-05 gov.weather
1102 15089995 2202 1.41915895238769e-05 com.canva
1103 15089724 1663 1.93009906939172e-05 com.thehill
1104 15089040 2883 1.06020285008986e-05 com.twilio
1105 15088768 3281 9.2762882038917e-06 tv.arte
1106 15088645 6971 4.35419930787964e-06 sg.blogspot
1107 15087852 4080 7.35568049360429e-06 com.gawker
1108 15087391 7507 4.03564477881975e-06 fr.online
1109 15087176 3396 8.97494140584996e-06 edu.colostate
1110 15086946 5259 5.75702090283342e-06 edu.nau
1111 15086937 3132 9.73185560248581e-06 net.jsfiddle
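
The host names in the table above are written in reversed domain name notation (for example, gov.census stands for census.gov). As a minimal sketch, assuming you only need the ordinary left-to-right host name back, the labels can be flipped as follows; the helper name unreverse_host is hypothetical and not part of any Common Crawl tool:

```python
def unreverse_host(host_rev: str) -> str:
    """Turn a reversed host name such as 'uk.co.eventbrite' into 'eventbrite.co.uk'.

    Hypothetical helper for illustration: it only reverses the dot-separated labels.
    """
    return ".".join(reversed(host_rev.split(".")))


# A few reversed host names taken from the table above.
for host_rev in ["gov.census", "uk.co.eventbrite", "org.notepad-plus-plus"]:
    print(host_rev, "->", unreverse_host(host_rev))
```

Applying the function twice returns the original string, so the same helper also converts ordinary host names into the reversed notation.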

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.

We hope the data proves useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl’s Google Group!