Announcing the Common Crawl Index!

ilyaThis is a guest post by Ilya Kreymer
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional but we are looking for feedback to improve the usefulness and usability of the index for future updates.

Index Format
The index format is relatively simple: It consists of a compressed plaintext index (with one line for each entry) compressed into gzipped chunks, and a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using a wildcard queries:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/

Pagination
For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, and only the first page of results (by default, up to 15000) are returned.

The total number of pages can be retrieved with the showNumPages query:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.

Command-Line Client
For smaller use cases, a simple client side library is available to simplify this process, https://github.com/ikreymer/cdx-index-client This is a simple python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org –show-num-pages
809

To perform the query, simply run and
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py –help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size:
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and can not be split further)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{“blocks”: 4942, “pages”: 989, “pageSize”: 5}

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{“blocks”: 4942, “pages”: 4942, “pageSize”: 1}

In general, blocks / pageSize + 1 = pages. Adjusting the page size can help adjust the parallelization and load of the query as needed.

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {“url”: “http://www.wikipedia.org/”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “length”: “11996”, “offset”: “853671193”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.

Adding the output=json option to the query will ensure the full line is json. Example:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{“urlkey”: “org,wikipedia)/”, “timestamp”: “20150227035757”, “url”: “http://www.wikipedia.org/”, “length”: “11996”, “filename”: “crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz”, “digest”: “PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ”, “offset”: “853671193”}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url
http://wikipedia.org/

or via command-line tool:

./cdx-index-client -c CC-MAIN-2015-11 http://wikipedia.org –fl url

Multiple fields can be also specified, eg. fl=url,length to return only url and warc record length.

For a full reference of available query params, consult the latest CDX Server API reference

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, and the indexing tools from pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to contact with any specific questions.

The service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as what fields to include in the index. Feedback in the form of bug reports/feature requests/questions/suggestions about any aspect of the index is definitely welcome to make the index even more easy to use. Please do not hesitate to get back with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

Evaluating graph computation systems

Frank McSherryThis is a guest blog post by Frank McSherry
Frank McSherry is a computer science researcher active in the area of large scale data analysis. While at Microsoft Research he co-invented differential privacy, and lead the Naiad streaming dataflow project. His current interests involve understanding and improving performance in scalable data processing systems.


The computer science systems and database research communities are abuzz with graph processing systems, large computer systems specialized to process and analyze very large graphs such as global social networks, internet topologies, and the hyperlink structure of the world-wide web. Academic researchers working in these areas have made significant strides in understanding how to process graph-oriented data, but their ability to validate their techniques at scale have been limited by their access to very large graph datasets. The data made available by Common Crawl and Web Data Commons provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.

Background on graph processing

Graphs are an interesting source of data in which each record (an “edge”) references two distinct entities (called “nodes”). In the case of the web graph, for example, each hyperlink can be viewed as an edge with the source and destination web pages as the nodes it references. Many important social graph analyses are functions of the graph structure: the existence of edges between nodes imposes a constraint on the answer, and many computations can be viewed as maintaining or updating per-node state as a function of the incident edges (and the states of adjacent nodes).

Researchers have since built computational engines based around algorithms in which information flows only along graph edges. These systems distribute the large set of edges across multiple threads, processors, and even computers, and for each edge ensure that the information at each node is shared with the other node. To define a computation, a data analyst then supplies the code for what should happen with this information each time it is presented, for example updating the information maintained by each node to reflect what they have learned from others. The separation of computational logic from system specifics allows each system to effect scalable implementations of many popular graph algorithms, without worrying the programmer about issues of data distribution, network communication, or recovery in the presence of machine failures.

Evaluation at scale

As popular as graph processing systems have become, their evaluation has largely either been on small to medium size data sets, or behind the closed doors of corporate data centers. Evaluations on moderately sized data is not without its use, but it does fail to stress the systems in ways that could inform users, programmers, or administrators when faced with truly large data. Worse, from a research perspective, we shouldn’t expect to make progress in improving graph processing systems without relevant evaluation of when they perform well and when they perform badly.

For example, Gonzalez et al., OSDI 2014 evaluated four of the most popular recent systems used for graph processing —Spark, Giraph,GraphLab, and GraphX— on two graph datasets edges each containing more than one billion edges. Their performance numbers on a cluster of 16 machines, comprising 128 processing cores, reveal that their proposed system improved on prior work with the same features, without losing the performance of specialized systems.

Although one billion edges is quite a lot, the datasets still fit comfortably on a modern laptop. To demonstrate the limitations of evaluations on this size of data, we compared the performance numbers reported by Gonzalez et al. with the performance of the same algorithms executed on my laptop, using about one hundred lines of code. Eliding the specific details of the comparison, no graph processing system strictly out-performed the laptop, and some of the systems were consistently slower than the laptop. With a few improvements to the code, the laptop was able to out-perform all systems on all computations on all datasets evaluated by Gonzalez et al.

These conclusions surprised many people, who had assumed that the scalable systems would obviously out-perform a single core on a laptop. Indeed, the main point of these systems is that one can accomplish more by using multiple computers than one can accomplish using just a single computer. If the systems are not yet accomplishing this goal, doing something you couldn’t otherwise do, it is less clear when and why you would use them, especially given the additional resources (and complexity) they entail. Perhaps one simply should not use them, and instead do graph analysis on my laptop instead.

Scaling up evaluation

One common (and fair) response to our initial evaluation was that the datasets we used (those used by the GraphX evaluation) were not sufficiently large to distinguish between good scalable systems and bad scalable systems (including my laptop, presumably). We completely agree (and this was part of the point of using only existing measurements), but researchers are challenged to access to realistic data at meaningful scales, and the graph datasets used by the GraphX evaluation were among the largest publicly available datasets at the time. Given this difficulty, it is perhaps unfair to blame the researchers for the limits of their evaluation.

This situation has now changed with the data provided by Common Crawl and Web Data Commons (who have processed the former’s crawl data into a compact graph dataset. The graph data they provide is substantially larger, by almost two orders of magnitude, than the graph datasets used by Gonzalez et al. As the dataset is made freely available, researchers can use it to evaluate the improvement their systems provide over simpler single-computer alternatives. If the systems do not improve as much as expected, the researchers can now pin-point what is slow, and can either improve their system’s implementation or improve the ideas underlying its design. In either case, the state of the art can now advance in a more principled manner.

To get the ball rolling on evaluation, we took the graph data supplied by Web Data Commons, 128 billion edges relating 4 billion nodes, approximately one terabyte uncompressed, and evaluated my laptop’s performance on this dataset. This collection is substantially larger than other datasets, and is near the limit of what can be processed by my laptop. The data required some further transformation and compression both to fit on my laptop’s drive, and to avoid exhausting memory when the computations were executed. But, the computations do run, and they take time that is only slightly longer than what one might expect based only on the increase in the number of edges.

The laptop’s performance is good news for people who want to run graph computations at scale, and a healthy challenge (and opportunity) for implementors of graph processing systems. These measurements provide an excellent baseline and proof of concept for the scalable systems to target. A laptop can perform the large-scale computation, admittedly under duress, and so the scalable systems should be expected both to perform the computation and to improve on my laptop’s performance. If a system’s performance is lacking, a comparative evaluation should reveal which components are limiting the system, and they can be improved.

Advancing graph processing systems

It should be mentioned that there are already several graph processing systems that do improve on laptop-scale performance. Ligra is a shared-memory (single machine) system whose performance appears to scale quite well from a single-core baseline when graphs fit in to the computer’s random access memory. For larger graphs, FlashGraph uses an array of solid-state drives to provide performance without requiring nearly as much memory. Finally, Naiad distributes computation across multiple computers, but manages to avoid introducing much overhead when it does, scaling performance up from the appropriate single-machine baseline.

In my opinion, each of these systems represent progress in the state of the art, and we should understand what each have done well (and what each do badly). Given our access to these systems and their ideas, as well as sufficient data to evaluate and distinguish good ideas from bad, there is no obvious reason not to expect the state of the art in graph processing to advance significantly and swiftly. In fact, to the extent that we are serious about performance in graph processing systems, we should demand it.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

5 Good Reads in Big Open Data: March 26 2015

  1. Analyzing the Web For the Price of a Sandwich – via Yelp Engineering Blog: a Common Crawl use case from the December 2014 Dataset finds 748 million US phone numbers

    I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.

    Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.


  2. Open Source does NOT mean a lack of security -via Information Age: Businesses are increasingly moving to Open Source platforms to reduce costs and improve efficiency; however, many mistakenly believe that Open Source means a tradeoff in security.


  3. Utility Companies should use Machine Learning– via Intelligent Utility: Machine learning can have a huge impact on energy efficiency, customer usage incentive programs and personalizing the customer experience around energy usage

    Load Curve graph (via Intelligent Utility) demonstrates "Energy Personalities" of customers

    Load Curve graph (via Intelligent Utility) demonstrates “Energy Personalities” of customers


  4. QVC loses lawsuit against Resultly in Web Crawl case via Forbes: under the Computer Fraud & Abuse Act, the courts ruled that Resultly did not demonstrate any intent to damage QVC’s systems, but their overloading of QVC’s servers was a result of “wrinkles in its business operations.”


  5. Can Data Science actually predict the perfect March Madness bracket?– via Sport Techie: (Apparently not)

    Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; allow people to predict before the beginning of the tournament–make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”

    “So there is something there.”

    Still, they have plenty of people producing predictions, which, statistically, that means some of these teams bound to get lucky. The volume exceeds the propensity for the result to be actualized. Over a short interval of time, though, the execution doesn’t necessarily earmark for these data scientists to be deemed experts in any fashion.

    In the end, the odds of forecasting a perfect bracket are slim to none as it gets–predicated on as much luck as it does data science.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

5 Good Reads in Big Open Data: March 20 2015

  1. Startup Orbital Insight uses deep learning and finds financially useful information in aerial imagery– via MIT Technology Review:

    To predict retail sales based on retailers’ parking lots, humans at Orbital Insights use Google Street View images to pinpoint the exact location of the stores’ entrances. Satellite imagery is acquired from a number of commercial suppliers, some of it refreshed daily. Software then monitors the density of cars and the frequency with which they enter the lots.

    Crawford’s company can also use shadows in a city to gather information on rates of construction, especially in secretive places like China. Satellite images could also predict oil yields before they’re officially reported because it’s possible to see how much crude oil is in a container from the height of its lid. Scanning the extent and effects of deforestation would be useful to both investors and environmental groups.


  2. Goodbye to Google Code -via eweek.com: Google is closing it’s open source project. With hosts like GitHub and BitBucket, users have migrated and Google Code is no longer needed.


  3. Trends in Big Data Vs Hadoop Vs Business Intelligence– via Hadoop 360: Visualizing how interest has changed over the years

    Screen Shot 2015-03-19 at 12.26.02 PM
    Image via Hadoop360

  4. Analysis of Common Crawl PDF metadata via PDFinfo.net
    Screen Shot 2015-03-19 at 2.49.16 PM


  5. Open Data should be the new Open Source– via Computerworld:

    But the lack of open data still seriously holds innovation back, and as data becomes more critical, the problem becomes worse.

    For example, think about how hard it is for innovative predictive analytics companies to get off the ground. It’s not that they don’t have the software; it’s that they don’t have the data. There are plenty of excellent open source projects to build on top of (Sci-Py, R, etc.). But the lack of usable data is a huge issue when it comes to testing and training the algorithms in any domain.

    The same exact thing would be true when an entrepreneur starts an e-commerce company. A high quality search engine is crucial in e-commerce and there plenty of great tools to build the search infrastructure such as Lucene, but no good datasets to test and train the ranking and relevance algorithms.

    Which is to say this: There are smart, creative data scientists out there who don’t have the tools to do valuable work.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

5 Good Reads in Big Open Data: March 13 2015

  1. Jürgen Schmidhuber- Ask Me Anything– via Reddit:  Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades. He is the recipient of the 2013 Helmholtz Award of the International Neural Networks Society.

    Many think that intelligence is this awesome, infinitely complex thing. I think it is just the product of a few principles that will be considered very simple in hindsight, so simple that even kids will be able to understand and build intelligent, continually learning, more and more general problem solvers. Partial justification of this belief: (a) there already exist blueprints of universal problem solvers developed in my lab, in the new millennium, which are theoretically optimal in some abstract sense although they consist of just a few formulas.  (b) The principles of our less universal, but still rather general, very practical, program-learning recurrent neural networks can also be described by just a few lines of pseudo-code.


  2. An abridged list of Machine Learning topics -via Startup.ML: great presentation of research, software, talks and more on Deep Learning, Graphical Models, Structured Predictions, Hadoop/Spark, Natural Language Processing and all things Machine Learning.


  3. Deeper Content Analysis with Aspects: Interest Graph Grows Beyond Topics– via Prismatic Blog:  Prismatic opens up their Interest Graph with an aspect tagging API to classify URLS by aspect (structural content) and not just topic

    Via Dave Golland- Prismatic Blog
    Via Dave Golland- Prismatic Blog

  4. Wikipedia’s open letter to the NSA- stop spying on our users! via New York Times:  The NSA tracks your every view and edit to a Wikipedia page, on top of your location and (if they can) who you are. Open knowledge collaboration shouldn’t come at the cost of losing privacy over your very private identity, especially when the cost can be as high as prosecution.


  5. Generational Performance of Amazon EC2’s C3 and C4 families– via GigaOm:

    The results presented here indicate that the C4 virtual machines had 10 to 20 percent higher vCPU performance and approximately 6 GB/s more memory throughput than the C3 VMs across different machine sizes. However, after factoring in the price increases, the price-performance values of the C4 VMs averaged the same as the C3 VMs. Both vCPU performance levels and network throughput displayed high stability over time and across all tested machines. The results highlight Amazon’s effort to provide highly predictable performance outputs and to match its C4 family’s price-performance with that of its earlier generation C3 family.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: March 6 2015

  1. 2015: What do you think about Machines that think?– via Edge:  A.I isn’t so artificial

    With these kind of software challenges, and given the very real technology-driven threats to our species already at hand, why worry about malevolent A.I.? For decades to come, at least, we are clearly more threatened by like trans-species plagues, extreme resource depletion, global warming, and nuclear warfare

    Which is why malevolent A.I. rises in our Promethean fears. It is a proxy for us, at our rational peak, confidently killing ourselves.


  2. What would you do with 139TB of big open data? -via Common Crawl: We’ve just released 1.82 billion web pages for you to discover, build and innovate. Check it out and please email [email protected] to share your work!

  3. Google Makes Overriding Redirection More Difficult  – via Search Engine Land:  Google says this move is to improve local user experience, but is The Right To Be Forgotten the true reason?

  4. Less than 40 percent of the world has ever connected to the internet via Slatethe problems are “infrastructure, affordability and relevance” according to Facebook’s Internet.org. This information may be disheartening to some, but it also shows what tremendous potential the web still has if we can connect the world.

  5. Hadoop gamechangers– via Opensource.com:

    Hadoop, an open source software framework with the funny sounding name, has been a game-changer for organizations by allowing them to store, manage, and analyze massive amounts of data for actionable insights and competitive advantage.

    But this wasn’t always the case.

    Initially, Hadoop implementation required skilled teams of engineers and data scientists, making Hadoop too costly and cumbersome for many organizations. Now, thanks to a number of open source projects, big data analytics with Hadoop has become much more affordable and mainstream.

    Here’s a look at how three open source projects—Hive, Spark, and Presto—have transformed the Hadoop ecosystem.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: February 27 2015

  1. Hadoop is the Glue for Big Data – via StreetWise Journal: Startups trying to build a successful big data infrastructure should “welcome…and be protective” of open source software like Hadoop. The future and innovation of Big Data depends on it.


  2. Topic Models: Past, Present Future -via O’Reilly Data Show Podcast:

    You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.


  3. Border disputes on Europe’s Right To Be Forgotten – via Slate: Is the angle of debate (disruptors vs. regulators) wrong? Should we be thinking of more custom solutions to this global issue?


  4. Flashgraph can analyze massive graphs to the proven tune of 129 billion edges- via the Common Crawl Blog (Flashgraph on GitHub):

    You may ask why we need another graph processing framework while we already have quite a few…FlashGraph seeks performance, capacity, flexibility and ease of programming at the moment when it was created. We hope FlashGraph can have performance comparable to the state-of-art in-memory graph engines while scaling to graphs with hundreds of billions of edges or even trillions of edges. We also hope that FlashGraph can express varieties of algorithms in FlashGraph and hide the complexity of accessing data on SSDs and parallelizing graph algorithms.


  5. The future of the internet is NOT all decided by net neutrality – via The Atlantic: A wonderfully curated net neutrality reading list, including one article where Justice Antonin Scalia tells us the Internet is a pizzeria (he’s right)


    Follow us @CommonCrawl on Twitter for the latest in Big Open Data

Analyzing a Web graph with 129 billion edges using FlashGraph

DaZhengThis is a guest blog post by Da Zheng
Da Zheng is the architect and main developer of the FlashGraph project. He is a PhD student of computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining.   


FlashGraph is a SSD-based graph processing framework for analyzing massive graphs. We have demonstrated that FlashGraph is able to analyze the page-level Web graph constructed from the Common Crawl corpora by the Web Data Commons project. This Web graph has 3.5 billion vertices and 129 billion edges and is the largest graph publicly available in the world. Thanks to the hard work of the Common Crawl and the Web Data Commons project, we are able to demonstrate the scalability and performance of FlashGraph as well as the graph algorithms designed for billion-node graphs.

You may ask why we need another graph processing framework while we already have quite a few, such as Pregel/Giraph, GraphLab/PowerGraph and GraphX. As pointed out by Frank McSherry in his blog 1 & 2, the current distributed graph processing frameworks have substantial overhead in order to scale out; we should seek performance and capacity (the size of a graph that can be processed). On top of the runtime overheads Frank McSherry mentions, these frameworks also have very large memory overhead. For example, as shown in the performance evaluation of the GraphX paper, Giraph cannot even process a graph with 106 million vertices and 3.8 billion edges in a cluster with aggregate memory of 1088 GB. The similar problem exist in others, as shown here. The large memory overhead prevents them from scaling to a larger graph or unnecessarily wastes resources.

FlashGraph seeks performance, capacity, flexibility and ease of programming at the moment when it was created. We hope FlashGraph can have performance comparable to the state-of-art in-memory graph engines while scaling to graphs with hundreds of billions of edges or even trillions of edges. We also hope that FlashGraph can express varieties of algorithms in FlashGraph and hide the complexity of accessing data on SSDs and parallelizing graph algorithms.

To scale graph analysis and achieve in-memory performance, FlashGraph uses the semi-external memory model, which stores algorithmic vertex state in memory and edge lists on SSDs. This model enables in-memory vertex communication while scaling to graphs that exceed memory capacity. Because vertex communication is the main source of computation overhead in many graph algorithms, it is essential to achieve in-memory performance in vertex communication. To optimize data access on SSDs, FlashGraph deploys two I/O optimizations: access edge lists only required by the applications; conservatively merge I/O requests to achieve higher I/O throughput and reduce CPU overhead caused by I/O.

The graph format used by FlashGraph is designed for both efficiency and flexibility. All graph algorithms in FlashGraph use the same graph format, so each graph only needs to be converted into the format once and to be loaded to SSDs once. FlashGraph stores both in-edges and out-edges in a graph image. In the Web graph, an out-edge is a hyperlink from a Web page to another page, and an in-edge is the reverse of a hyperlink. It is necessary to keep an edge twice for a directed graph because some graph algorithms require in-edges, some require out-edges and some require both. For efficiency, in-edges and out-edges of a vertex are stored separately. This reduces data access from SSDs if an algorithm requires only one type of edges.

FlashGraph provides a very flexible vertex-centric programming interface and supports varieties of graph algorithms. The vertex-centric programming interface allow programmers to “think like a vertex”: each vertex maintains some algorithmic state and performs user-defined computation independently. In FlashGraph, a vertex can communicate with any vertices through message passing and read edge lists of any vertices from SSDs. We have implemented a set of graph algorithms such as breadth-first search, PageRank, connected components and triangle counting. All of these graph algorithms implemented in FlashGraph can run on the page-level Web graph in a single commodity machine and complete at an unprecedented speed, as shown in the table below. The performance result also shows that FlashGraph has a very short initialization time even on this massive graph.

Algorithm Runtime (sec) Init time (sec) Memory (GB)
BFS 298 30 22
Betweenness 595 33 81
Triangle counting 7818 31 55
Weakly connected components 461 32 47
PageRank (30 iterations) 2041 33 46
Scan statistics 375 58 83

 

The more detailed design of FlashGraph is documented by the paper published at FAST’15.

We further explore community detection with FlashGraph on billion-node graphs. Here we detect communities with only active vertices. The activity level of a vertex is measured by a locality statistic (the number of edges in the neighborhood of a vertex). Again, we use the large Web graph to demonstrate the scalability and accuracy of our procedure. The key here is to quickly identify the most active vertices in a graph. Having these vertices, we further cluster them into active communities. In our experiment on the paper, we identify 2000 most active vertices in the Web graph and discover five communities. The sizes of community 1 to 5 are n1 = 35, n2 = 1603, n3 = 199, n4 = 42 and n5 = 121 respectively. Community 1 is a collection of websites that are all developed, sold or to be sold by an Internet media company networkmedia. Community 2 are all hyperlinks extracted from a single Pay-level-domain adult website. In the community 3, most links are social media websites and often used in our daily life such as WordPress.org and Google. Community 4 consists of websites related to online shopping such as the shopping giant Amazon and the bookseller AbeBooks. Community 5 is another collection of 121 adult web pages where each web page comes from a different Pay-level-domain in this cluster. In summary, top 5 active communities in the Web Graph are grouped with high topical similarities.

Active community detection is one application that demonstrates the power of FlashGraph. We are looking forward to seeing more cases that people use FlashGraph for mining massive graphs. We are happy to help others develop algorithms to explore the Web graph as well as other graphs of the similar size or even a larger size.

5 Good Reads in Big Open Data: Feb 20 2015

  1. Why The Open Data Platform Is Such A Big Deal for Big Data– via Pivotal P.O.V:

    A thriving ecosystem is the key for real viability of any technology. With lots of eyes on the prize, the technology becomes more stable, offers more capabilities, and importantly, supports greater interoperability across technologies, making it easier to adopt and use, in a shorter amount of time. By creating a formal organization, the Open Data Platform will act as a forcing function to accelerate the maturation of an ecosystem around Big Data.


  2. Machine Learning Could Upend Local Search -via Streetfight: From the Chairman of Common Crawl’s Board of Directors (and Factual CEO) Gil Elbaz on the future of search


  3. On opening up libraries with linked data – via Library Journal: While the rest of the web is turning into the “Web of Data,” libraries and catalogs  are (partially for reasons for a closed culture) struggling to keep up


  4. Interactive map: where are we driving, busing, cabbing, walking to work? via Flowing Data:

    Interactive: How Americans Get to Work
    Image via Flowing Data

  5. On the ongoing debate over the possible dangers of Artificial Intelligence– via Scientific American:

    Current efforts in areas such as computational ‘deep-learning‘ involve algorithms constructing their own probabilistic landscapes for sifting through vast amounts of information. The software is not necessarily hard-wired to ‘know’ the rules ahead of time, but rather to find the rules or to be amenable to being guided to the rules – for example in natural language processing. It’s incredible stuff, but it’s not clear that it is a path to AI that has equivalency to the way humans, or any sentient organisms, think. This has been hotly debated by the likes of Noam Chomsky(on the side of skepticism) and Peter Norvig (on the side of enthusiasm). At a deep level it is a face-off between science focused on underlying simplicity, and science that says nature may not swing that way at all.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

Ross FairbanksThis is a guest blog post by Ross Fairbanks

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project wikireverse.org and why he built it.



What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The results produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links) followed by Spanish, Indonesian and German. In total there are results for 283 languages.

I first heard about Common Crawl in a blog post by Steve Salevan— MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency savings of a large web scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which is the spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, it is difficult to see with no interface to the data. It can be hard to visualize the possibilities and what can be done with the data. For this reason, my project runs an analysis over an entire crawl with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite it’s relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and they appear highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects and hopefully more analyses of these types will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on Github. I’ve written a more detailed post on my blog on the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.


[1] https://wikireverse.org/
[2] https://commoncrawl.org/2011/12/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html