Analyzing Flight Delays with Apache Spark GraphFrames and MapR-DB


Apache Spark GraphX made it possible to run graph algorithms within Spark. GraphFrames integrates GraphX with DataFrames, making it possible to perform graph pattern queries without moving data to a specialized graph database.

This blog will help you get started using Apache Spark GraphFrames graph algorithms and graph queries with the MapR-DB JSON document database. We will begin with an overview of graph and GraphFrames concepts; then we will analyze a real flight dataset for January-August 2018 stored in a MapR-DB table.

Graphs provide a powerful way to analyze the connections in a Dataset. GraphX is the Apache Spark component for graph-parallel and data-parallel computations, built upon a branch of mathematics called graph theory. It is a distributed graph processing framework that sits on top of the Spark core. GraphX brings the speed and scalability of parallel, iterative processing to graphs for big datasets. It partitions graphs that are too large to fit in the memory of a single computer among multiple computers in a cluster. In addition, GraphX partitions vertices independently of edges, which avoids the load imbalance often suffered when putting all the edges for a vertex onto a single machine.

With Spark 2.0 and later versions, big improvements were implemented to make Spark easier to program and faster to execute: Spark SQL and the Dataset/DataFrame APIs provide ease of use, space efficiency, and performance gains through Spark SQL's optimized execution engine. GraphFrames extends Spark GraphX with a DataFrame API, making graph analysis easier to use and more efficient, and simplifying data pipelines. This blog is an update to the blog How to Get Started Using Apache Spark GraphX and also an excerpt from the ebook Getting Started with Spark 2.x: From Inception to Production, which you can download to learn more about Spark 2.x.

Overview of Some Graph Concepts

A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects, and the edges are the relationships between them.

An undirected graph is a graph whose edges have no direction: the relationship holds in both directions. An example of an undirected relationship is Facebook friends. If Ted is a friend of Carol, then Carol is also a friend of Ted.

A directed graph is a graph where the edges have a direction associated with them. An example of a directed relationship is a Twitter follow: Carol can follow Oprah without implying that Oprah follows Carol.

GraphFrame Property Graph

Spark GraphFrames support graph computation with a distributed property graph. A property graph is a directed multigraph, which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices.

With GraphFrames, vertices and edges are represented as DataFrames, which adds the advantages of querying with Spark SQL and support for DataFrame data sources like Parquet, JSON, CSV, and also MapR-DB with the MapR-DB Spark Connector.

Graph Algorithms versus Graph Queries

Graph analysis comes in two forms: graph algorithms and graph pattern queries. Let's look at an example of each and at how GraphFrames integrates the two.

PageRank Graph Algorithm

The breakthrough for the creators of the Google search engine was the PageRank graph algorithm, which represents pages as nodes and links as edges and measures the importance of a page by the number and rank of the pages that link to it. The PageRank algorithm is useful for measuring the importance of a vertex in a graph: for example, with Twitter, a tweeter with lots of important followers, or with flights, an airport with lots of connections to airports that themselves have lots of connections.

Many graph algorithms, such as PageRank, shortest paths, and connected components, repeatedly aggregate properties of neighboring vertices. These algorithms can be implemented as a sequence of steps in which vertices pass messages to their neighboring vertices, and the messages are then aggregated at the destination vertex.

You can visualize the PageRank algorithm as each page sending a message with its rank of importance to each page it links to. In the beginning, each page has the same rank, equal to 1 divided by the total number of pages.

The messages are aggregated at each vertex, with the sum of all incoming messages becoming the new page rank. This is calculated iteratively, so that links from more important pages count for more.

Graph Motif Queries

Graph motifs are recurrent patterns in a graph. Graph queries search a graph for all occurrences of a given motif or pattern. As an example, to recommend whom to follow on Twitter, you could search for patterns where A follows B and B follows C, but A does not follow C. Here is a GraphFrames motif query for this pattern, finding the edges from a to b and b to c for which there is no edge from a to c:
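A sketch of the query in the Spark shell, assuming a GraphFrame named graph whose vertices are Twitter users and whose edges are follow relationships:

```scala
// Find users a, b, c where a follows b and b follows c,
// but there is no follow edge from a to c
val candidates = graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
```

The result is a DataFrame with one row per matching pattern, with columns a, b, and c holding the matched vertices.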

GraphFrames integrate graph algorithms and graph queries, enabling optimizations across graph and Spark SQL queries, without requiring data to be moved to specialized graph databases.

Graph Examples

Examples of connected data that can be represented by graphs include:

Recommendation Engines: Recommendation algorithms can use graphs where the nodes are the users and products (with their respective attributes) and the edges are the ratings or purchases of products by users. Graph algorithms can calculate how similarly users rated or purchased similar products.

Fraud: Graphs are useful for fraud detection algorithms in banking, healthcare, and network security. In healthcare, graph algorithms can explore the connections between patients, doctors, and pharmacy prescriptions. In banking, graph algorithms can explore the relationship between credit card applicants and phone numbers and addresses or between credit cards customers and merchant transactions. In network security, graph algorithms can explore data breaches.

These are just some examples of the uses of graphs. Next, we will look at a specific example, using Spark GraphFrames.

Example Flight Scenario

As a simple starting example, we will analyze 3 flights; for each flight, we have the following information:

Originating Airport | Destination Airport | Distance
------------------- | ------------------- | ----------
SFO                 | ORD                 | 1800 miles
ORD                 | DFW                 | 800 miles
DFW                 | SFO                 | 1400 miles

In this scenario, we are going to represent the airports as vertices and flight routes as edges. For our graph, we will have three vertices, each representing an airport. The vertices each have the airport code as the ID, and the city as a property:

Vertex Table for Airports

id  | city
--- | -------------
SFO | San Francisco
ORD | Chicago
DFW | Dallas

The edges have the Source ID, the Destination ID, and the distance as a property.

Edges Table for Routes

Src | Dst | Distance | Delay
--- | --- | -------- | -----
SFO | ORD | 1800     | 40
ORD | DFW | 800      | 0
DFW | SFO | 1400     | 10

Launching the Spark Interactive Shell with GraphFrames

Because GraphFrames is a separate package from Spark, start the Spark shell, specifying the GraphFrames package as shown below:

Below, we create a GraphFrame by supplying a vertex DataFrame and an edge DataFrame. It is also possible to create a GraphFrame with just an edge DataFrame; then the vertices will be equal to the unique src and dst ids from the edge DataFrame.
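The commands below sketch this for the simple three-airport example. The GraphFrames package coordinates in the launch command are an assumption and should be matched to your Spark and Scala versions:

```scala
// Start the Spark shell with the GraphFrames package
// (version coordinates are an assumption; match them to your Spark version):
// $ spark-shell --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Vertex DataFrame: airports, with the airport code as the id
val vertices = Seq(
  ("SFO", "San Francisco"),
  ("ORD", "Chicago"),
  ("DFW", "Dallas")
).toDF("id", "city")

// Edge DataFrame: routes, with src, dst, distance, and delay
val edges = Seq(
  ("SFO", "ORD", 1800, 40),
  ("ORD", "DFW", 800, 0),
  ("DFW", "SFO", 1400, 10)
).toDF("src", "dst", "distance", "delay")

// Create the GraphFrame from vertices and edges
val graph = GraphFrame(vertices, edges)

// Alternatively, build it from edges alone; the vertices are then
// the unique src and dst ids from the edge DataFrame
val graph2 = GraphFrame.fromEdges(edges)
```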

The Spark MapR-DB Connector enables users to perform complex SQL queries and updates on top of MapR-DB using a Spark Dataset, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

With MapR-DB, a table is automatically partitioned into tablets across a cluster by key range, providing for scalable and fast reads and writes by row key. In this use case, the row key (the id) starts with the origin and destination airport codes, so the table is automatically partitioned and sorted by the edge src and dst, with Atlanta (ATL) first.

The Spark MapR-DB Connector leverages the Spark DataSource API. The connector architecture has a connection object in every Spark Executor, allowing for distributed parallel writes, reads, or scans with MapR-DB tablets (partitions).

MapR-DB supports JSON documents as a native data store, making it easy to store, query, and build applications with JSON documents. For the flights MapR-DB table, we have the following JSON schema. We will focus on the id, src, dst, depdelay, and distance attributes. The MapR-DB table will be automatically partitioned and sorted by the row key, or id, which consists of the src, dst, date, carrier, and flight number.
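A sketch of one flight document is shown below; the field names beyond id, src, dst, depdelay, distance, carrier, and crsdephour (for example the date and flight-number fields) are assumptions for illustration:

```json
{
  "_id": "ATL_BOS_2018-01-01_DL_104",
  "id": "ATL_BOS_2018-01-01_DL_104",
  "fldate": "2018-01-01",
  "src": "ATL",
  "dst": "BOS",
  "carrier": "DL",
  "flnum": 104,
  "crsdephour": 17,
  "depdelay": 0.0,
  "distance": 946.0
}
```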

Loading The Flight MapR-DB Table Data into a Spark Dataset

In our use case, edges are the flights between airports. An edge must have src and dst columns and can have any number of additional relationship columns. In our example, an edge consists of:

id                    | src | dst | distance | depdelay | carrier | crsdephour
--------------------- | --- | --- | -------- | -------- | ------- | ----------
SFO_ORD_2017-01-01_AA | SFO | ORD | 1800     | 40       | AA      | 17

Below, we define the flight schema, corresponding to a row in the MapR-DB JSON flight table.
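A sketch of the schema definition, using the attribute names discussed in this post (the fldate field name is an assumption):

```scala
import org.apache.spark.sql.types._

// Scala case class corresponding to one flight document
case class Flight(id: String, fldate: String, src: String, dst: String,
                  distance: Double, depdelay: Double, carrier: String,
                  crsdephour: Int) extends Serializable

// Spark SQL schema for the same row
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("fldate", StringType, true),
  StructField("src", StringType, true),
  StructField("dst", StringType, true),
  StructField("distance", DoubleType, true),
  StructField("depdelay", DoubleType, true),
  StructField("carrier", StringType, true),
  StructField("crsdephour", IntegerType, true)
))
```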

To load data from a MapR-DB JSON table into an Apache Spark Dataset, we invoke the loadFromMapRDB method on a SparkSession object, providing the tableName, schema, and case class. This returns a Dataset of Flight objects:
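A sketch of the load, assuming the MapR-DB Spark Connector is on the classpath and a hypothetical table path /apps/flighttable (the exact connector call signature may vary by connector version):

```scala
import com.mapr.db.spark.sql._   // MapR-DB OJAI connector for Spark

val tableName = "/apps/flighttable"   // hypothetical table path

// Load the MapR-DB JSON table into a Dataset of Flight objects
val flights = spark.loadFromMapRDB(tableName, schema).as[Flight]

flights.show(3)
```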

Create the Property Graph

Again, in this scenario, we are going to represent the airports as vertices and flights as edges. Below, we create a GraphFrame by supplying a vertex DataFrame and an edge DataFrame. The airports and flights DataFrames are then available as graph.vertices and graph.edges. Since GraphFrame vertices and edges are stored as DataFrames, many queries are just DataFrame (or SQL) queries.
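A sketch of the graph creation, assuming an airports vertex DataFrame and the flights Dataset loaded from MapR-DB (both names are illustrative):

```scala
import org.graphframes.GraphFrame

// Airports as vertices, flights as edges
// (the edge DataFrame has the required src and dst columns)
val graph = GraphFrame(airports, flights.toDF)

// Vertices and edges are just DataFrames, so many queries are DataFrame queries:
graph.vertices.show(3)
graph.edges.filter("depdelay > 40").count
```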

You can see the physical plan for a DataFrame query by calling the explain method. In the plan we see projection and filter pushdown, which means that the scanning of the src, dst, and depdelay columns and the filter on the depdelay column are pushed down into MapR-DB: the scanning and filtering take place in MapR-DB before the data is returned to Spark. Projection pushdown minimizes data transfer between MapR-DB and the Spark engine by omitting unnecessary fields from table scans; it is especially beneficial when a table contains many columns. Filter pushdown improves performance by reducing the amount of data passed between MapR-DB and the Spark engine when filtering data.
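A sketch of such a query, assuming the graph GraphFrame defined above:

```scala
// Project three columns and filter on depdelay; with the MapR-DB connector,
// both the projection and the filter are pushed down into MapR-DB
graph.edges
  .select("src", "dst", "depdelay")
  .filter("depdelay > 40")
  .explain
```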

Vertex Degrees

The degree of a vertex is the number of edges that touch the vertex. GraphFrames provides vertex inDegrees, outDegrees, and degrees operations, which determine the number of incoming edges, outgoing edges, and total edges. Using the GraphFrames degrees operations, we can answer questions such as: which airports have the most incoming, outgoing, or total flights?
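A sketch of the degree queries, assuming the graph GraphFrame defined above:

```scala
import org.apache.spark.sql.functions.desc

// Which airports have the most incoming flights?
graph.inDegrees.orderBy(desc("inDegree")).show(3)

// Which airports have the most flights in total (incoming plus outgoing)?
graph.degrees.orderBy(desc("degree")).show(3)
```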

PageRank

Another GraphFrames operation is PageRank, which is based on the Google PageRank algorithm.

PageRank measures the importance of each vertex in a graph by determining which vertices have the most edges with other vertices. In our example, we can use PageRank to determine which airports are the most important, by measuring which airports have the most connections to other airports that themselves have lots of connections. We have to specify the probability tolerance, which is the measure of convergence. Note that the results are similar to those of the degrees operation, but the algorithm is different.
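A sketch of the PageRank run, assuming the graph GraphFrame defined above:

```scala
import org.apache.spark.sql.functions.desc

// Run PageRank with reset probability 0.15 until convergence within tol
val ranks = graph.pageRank.resetProbability(0.15).tol(0.01).run()

// The result is a GraphFrame whose vertices carry a pagerank column
ranks.vertices.orderBy(desc("pagerank")).show(3)
```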

Message Passing via AggregateMessages

Many important graph algorithms are iterative, since properties of vertices depend on properties of their neighbors, which in turn depend on properties of their neighbors. Pregel is an iterative graph processing model, developed at Google, which uses a sequence of iterations of message passing between vertices in a graph. GraphFrames provides aggregateMessages, which implements an aggregation message-passing API based on the Pregel model. GraphFrames aggregateMessages sends messages between vertices and aggregates the message values from the neighboring edges and vertices of each vertex.

The code below shows how to use aggregateMessages to compute the average flight delay by originating airport. The delay for each flight is sent to its src vertex; then the average is calculated at each vertex.
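A sketch of the aggregateMessages computation, assuming the graph GraphFrame defined above:

```scala
import org.graphframes.lib.AggregateMessages
import org.apache.spark.sql.functions.{avg, desc}

val AM = AggregateMessages

// Send each flight's departure delay to its source (originating) vertex,
// then average the messages received at each vertex
val avgDelays = graph.aggregateMessages
  .sendToSrc(AM.edge("depdelay"))
  .agg(avg(AM.msg).as("avgdelay"))

avgDelays.orderBy(desc("avgdelay")).show(3)
```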

Motif Find for Graph Pattern Queries

Motif finding searches for structural patterns in a graph. In this example, we want to find airports with no direct connection. First, we create a subgraph from the flightroutecount DataFrame that we created earlier, which gives us a subgraph with all the possible flight routes. Then we do a find on the pattern to search for routes from a to b and b to c that do not have a route from a to c. Finally, we use a DataFrame filter to remove duplicates. This shows how graph queries can easily be combined with DataFrame operations like filter.
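A sketch of the motif query, assuming flightroutecount is a DataFrame of distinct routes with src and dst columns (as created earlier):

```scala
import org.graphframes.GraphFrame

// Build a subgraph of the distinct flight routes
val routeGraph = GraphFrame.fromEdges(flightroutecount)

// Routes a -> b and b -> c with no direct route a -> c
val noDirect = routeGraph
  .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
  .filter("a.id != c.id")   // drop patterns where a and c are the same airport

noDirect.show(3)
```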

Shortest Path Graph Algorithm

The shortest paths algorithm computes the shortest paths from each vertex to a given sequence of landmark vertices. Here we search for the shortest path from each airport to LGA; the results show that there are no direct flights from LAX, SFO, SEA, and EWR to LGA (their distances are greater than 1).
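A sketch of the shortest paths query, assuming the graph GraphFrame defined above:

```scala
// Shortest path (in number of flights) from each airport to the LGA landmark
val results = graph.shortestPaths.landmarks(Seq("LGA")).run()

// distances is a map from landmark id to path length
results.select("id", "distances").show(5)
```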

Breadth First Search Graph Algorithm

Breadth-first search (BFS) finds the shortest path from beginning vertices to end vertices. The beginning and end vertices are specified as DataFrame expressions, and maxPathLength specifies the limit on the length of the paths. Here we see that there are no direct flights between LAX and LGA.
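A sketch of the BFS query, assuming the graph GraphFrame defined above:

```scala
// Search for a direct flight (path of length 1) from LAX to LGA
val paths = graph.bfs
  .fromExpr("id = 'LAX'")
  .toExpr("id = 'LGA'")
  .maxPathLength(1)
  .run()

paths.show()   // empty: no direct LAX -> LGA flights in this dataset
```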

You can combine motif searching with DataFrame operations. Here we want to find connecting flights between LAX and LGA using a motif find query. We use a motif query to search for the pattern of a flying to c, connecting through b; then we use a DataFrame filter on the results for a = LAX and c = LGA. The results show some flights from LAX to LGA connecting through IAH.
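A sketch of the motif query with a DataFrame filter, assuming the graph GraphFrame defined above:

```scala
// One-stop connections a -> b -> c, filtered to LAX as origin and LGA as destination
val oneStop = graph
  .find("(a)-[ab]->(b); (b)-[bc]->(c)")
  .filter("a.id = 'LAX' and c.id = 'LGA'")

oneStop.select("a.id", "b.id", "c.id", "ab.carrier", "bc.carrier").show(3)
```

Further DataFrame filters can then be chained onto oneStop, for example on the legs' departure hours or carriers.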

Combining a motif find with DataFrame operations, we can narrow these results down further: for example, to connections where the first flight arrives before the second flight departs, and/or to flights with a specific carrier.

Summary

GraphFrames provides a scalable and easy way to query and process large graph datasets, which can be used to solve many types of analysis problems. In this chapter, we gave an overview of the GraphFrames graph processing APIs. We encourage you to try out GraphFrames in more depth on some of your own projects.

All of the components of the use case we just discussed can run on the same cluster with the MapR Data Platform.