Exercise 2: Topology Analysis of Web Crawl Results

To prepare the data set for this example, I used the Apache Nutch web crawler (which was designed by Doug Cutting and Mike Cafarella before they built Hadoop). Although data ingestion from SQL databases and huge document repositories dominates commercial use cases, a data scientist should also be aware of the advantages of Nutch for scraping web data, alongside those of Apache Flume and Apache Sqoop. For example, instead of using APIs or writing clients to load data from specific web pages, portals, or social media applications, you can use Nutch to fetch the data all at once and then apply specific parsing, such as HTML parsing with jsoup or more advanced triple extraction using Apache Any23.

Our crawl data is loaded into a staging table (in this case, in Apache HBase). Besides the full HTML content, each row contains the page's inlinks and outlinks. Because HBase manages the crawl data (and the state of the crawl dataset), we can add more links, especially inlinks to a particular page, as we crawl more and more data. To distinguish the two types of links in the visualization, we use link type 1 for inlinks and link type 2 for outlinks. Finally, to analyze the overall network, we need both lists in one homogeneous collection, where each link is described by source, target, and type.
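The homogeneous link collection described above can be illustrated with a minimal plain-Python sketch. It is not the Spark code used for the exercise. The `Link` tuple, the `to_link_list` helper, and the example URLs are hypothetical names introduced only for illustration, and the sketch assumes that an inlink points from the linking page to the crawled page.

```python
from typing import List, NamedTuple

# Link types as described in the text: 1 for inlinks, 2 for outlinks.
INLINK, OUTLINK = 1, 2


class Link(NamedTuple):
    """One row of the link list (hypothetical helper type)."""
    source: str
    target: str
    type: int


def to_link_list(page: str, inlinks: List[str], outlinks: List[str]) -> List[Link]:
    """Merge the inlinks and outlinks of one page into a single list.

    Assumption: an inlink runs from the other page to `page`,
    an outlink runs from `page` to the other page.
    """
    links = [Link(src, page, INLINK) for src in inlinks]
    links += [Link(page, dst, OUTLINK) for dst in outlinks]
    return links


# Made-up example data.
links = to_link_list(
    "http://example.org/",
    inlinks=["http://a.example/"],
    outlinks=["http://b.example/", "http://c.example/"],
)
```

Keeping the type as a plain column means the combined list can later be filtered by link type, or colored by it in a visualization, without maintaining two separate tables.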

Creation of Node- and Link-Lists

Using Spark SQL and DataFrames allows for inspection and rearrangement of the data in just a few steps. During the crawl, Nutch aggregates all links of a page in an adjacency list, stored as a map of string pairs. To create a link list from this compact representation, we have to “explode” the data structure: we start with a query that transforms each entry of the map into an individual row.

For our network, we need one column named source, which contains the URL and the page name. This is why we first explode the inlinks map into two “virtual columns” called mt and mp. We can then concatenate both of them to create the new source column.
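The explode-and-concatenate step can be sketched in plain Python as a stand-in for the Spark SQL query, which is not reproduced here. The row layout (an `id` field plus an `inlinks` map), the `explode_inlinks` helper, and the sample values are assumptions made for illustration. In particular, the sketch assumes that the map key (mt) carries the URL part and the map value (mp) the page name, and that the two are joined without a separator.

```python
from typing import Dict, Iterator, List, Tuple


def explode_inlinks(rows: List[Dict]) -> Iterator[Tuple[str, str, int]]:
    """Turn each entry of a page's inlinks map into one (source, target, type) row.

    Like Spark SQL's explode, a page with an empty map produces no rows.
    """
    for row in rows:
        for mt, mp in row["inlinks"].items():
            source = mt + mp           # concatenate both "virtual columns"
            yield (source, row["id"], 1)  # link type 1 marks an inlink


# Made-up rows in an assumed layout; the real table schema may differ.
crawl_rows = [
    {
        "id": "http://example.org/",
        "inlinks": {
            "http://a.example/": "index.html",
            "http://b.example/": "news.html",
        },
    },
    {"id": "http://example.org/about", "inlinks": {}},
]

link_list = list(explode_inlinks(crawl_rows))
```

The same pattern applies to the outlinks map, with source and target swapped and link type 2.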

With this intermediate result, we can return to GraphFrames, with the variable g1 representing the graph built from the Nutch crawl data. As homework, you can apply the same analysis steps as in Exercise 1.
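A GraphFrame is constructed from a vertex table with an `id` column and an edge table with `src` and `dst` columns, so the link list still has to be split into a node list and a link list. The following plain-Python sketch shows one way to derive both. The `build_graph_tables` helper and the sample links are hypothetical, and the actual exercise would do this with DataFrames before creating g1.

```python
from typing import Dict, List, Tuple


def build_graph_tables(
    links: List[Tuple[str, str, int]]
) -> Tuple[List[Dict], List[Dict]]:
    """Derive a node list and a link list from (source, target, type) tuples."""
    # Every URL that appears as a source or a target becomes one node.
    urls = sorted({url for source, target, _ in links for url in (source, target)})
    vertices = [{"id": url} for url in urls]
    # Column names follow the GraphFrames convention: src, dst, plus attributes.
    edges = [{"src": s, "dst": t, "type": link_type} for s, t, link_type in links]
    return vertices, edges


# Made-up sample links: one inlink (type 1) and one outlink (type 2).
sample_links = [
    ("http://a.example/", "http://example.org/", 1),
    ("http://example.org/", "http://b.example/", 2),
]

vertices, edges = build_graph_tables(sample_links)
```

Deriving the nodes from the links guarantees that every edge endpoint has a matching vertex, at the cost of missing isolated pages that have no links at all.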

Conclusion

As you can see, Spark SQL provides access to many different data sources, regardless of whether we use an Apache Hive table in Parquet format or an HBase table accessed via Hive. And for the many use cases in which the existing table layout cannot be used directly, Spark SQL makes filtering, grouping, and projection easy.

Using GraphFrames, it is possible to turn data tables into graphs with just a few lines of code. The full power of the Pregel API, implemented in GraphX, is available in combination with Spark SQL. As a result, raw data stored in Hadoop in different flavors can easily be combined into huge multi-layer graphs, with graph analysis all done in place via Spark.
