Graphs are a natural data structure for many applications, from marketing data to social networks. Representing data as vertices and edges often makes a problem easier to solve and easier to scale. That is why I am super excited about Spark GraphFrames. It is a graph processing package that sits on top of Spark DataFrames and can be considered an extension of GraphX (which is based on RDDs). The cool thing about GraphFrames is that you can run both graph algorithms and graph queries (pattern matching) within the same framework.

The purpose of this post is to show how one can use the power of pattern matching in GraphFrames for recommendation in social networks. For that we use a toy dataset inspired by the data at Table.co.

What is our network?

Consider a network with two types of objects: 1- users 2- tables.

Users can follow each other, and a follow is a one-way connection.

Tables are similar to private chatrooms. They contain both private and public messages. A user can either follow a table, which gives access only to its public messages, or be a member of a table, which gives access to all messages and the privilege to send messages to the other members and followers of the table.

Let us assume we have the following simple network of users and tables:

In this toy graph, we have six users and three tables. A lot can be read from this network. For example, Med follows Joe and is a member of the DesignMe table. Bob follows Jerald and is followed back by Jerald. Bob also follows the DesignMe table.

Put it in GraphFrames

First, let's define this graph in GraphFrames. For that we first need to define two DataFrames: 1- a vertex DataFrame 2- an edge DataFrame.

Vertex DataFrame:

Each node in the graph has a unique id. All nodes have a type field that says whether the node is a user or a table. Furthermore, all nodes have a name, which is their human-readable identity. Finally, each node has an extra field that holds a JSON value and must be interpreted based on the type of the node. The vertex DataFrame for the graph in Figure 1 can be defined as follows:
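As a rough illustration of that schema, here is a plain-Python sketch of vertex rows for the nodes named in the text. The ids (`u1`, `t1`, and so on) and the empty `extra` payloads are placeholders made up for illustration, not the original data; only the column layout (id, type, name, extra) comes from the description above.

```python
import json

# Column layout described above: id, type, name, extra (a JSON string).
VERTEX_COLUMNS = ["id", "type", "name", "extra"]

# Only nodes named in the text are listed; ids and `extra` payloads are
# hypothetical placeholders, not the original dataset.
vertices = [
    ("u1", "user", "Med", json.dumps({})),
    ("u2", "user", "Joe", json.dumps({})),
    ("u3", "user", "Bob", json.dumps({})),
    ("u4", "user", "Jerald", json.dumps({})),
    ("t1", "table", "DesignMe", json.dumps({})),
]


def parse_extra(row):
    """Decode the JSON `extra` field; its meaning depends on the node type."""
    node_type, extra = row[1], json.loads(row[3])
    return node_type, extra


# Ids must be unique, since edges refer to nodes by id.
assert len({row[0] for row in vertices}) == len(vertices)
```

With a Spark session available, rows like these could be handed to `spark.createDataFrame(vertices, VERTEX_COLUMNS)` to get the vertex DataFrame. GraphFrames expects the vertex DataFrame to have a column named `id`.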

Edge DataFrame:

Nodes in the graph can be connected to each other. Depending on the types of the objects involved, a connection can carry different information and properties. What all connections have in common is 1- a source id 2- a destination id 3- the type of the connection.

As mentioned before and depicted in Figure 1, we have three types of connections in our network.

1- person follows a person (connection): one person is following another person.

2- person follows a table (follow): the person only has access to the public messages of the table.

3- person is a member of a table (member): the person can post messages to the table and see all other messages in the table.

Then, the edge information can be described in a DataFrame as follows:
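A minimal plain-Python sketch of such edge rows, limited to the relationships spelled out earlier in the text (Med, Joe, Bob, Jerald, and DesignMe). The ids are hypothetical placeholders and the full toy graph contains more edges than are listed here.

```python
# Column layout described above: source id, destination id, connection type.
EDGE_COLUMNS = ["src", "dst", "type"]

# Hypothetical ids for the nodes named in the text.
MED, JOE, BOB, JERALD, DESIGN_ME = "u1", "u2", "u3", "u4", "t1"

# Only the relationships spelled out in the text; the full toy graph has more.
edges = [
    (MED, JOE, "connection"),     # Med follows Joe
    (MED, DESIGN_ME, "member"),   # Med is a member of DesignMe
    (BOB, JERALD, "connection"),  # Bob follows Jerald
    (JERALD, BOB, "connection"),  # Jerald follows Bob
    (BOB, DESIGN_ME, "follow"),   # Bob follows DesignMe
]


def edges_of_type(rows, edge_type):
    """Return the (src, dst) pairs for one connection type."""
    return [(src, dst) for src, dst, kind in rows if kind == edge_type]
```

GraphFrames expects the edge DataFrame to have columns named `src` and `dst`, which is why the sketch uses those names. Given a vertex DataFrame `v` and an edge DataFrame `e`, the graph itself is then built with `GraphFrame(v, e)`.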

In this query, we ask for two nodes a and b that are both connected to four other generic nodes c1, c2, c3, and c4. The last statement, !(a)-[]->(b), requires that there is no edge from a to b.
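To make the pattern concrete, here is a hedged sketch. `MOTIF` is my reconstruction of a motif string matching the description above (the edge directions from a and b to the c nodes are an assumption); with GraphFrames it would be passed to `g.find(MOTIF)`. The function below it is a plain-Python stand-in that runs without Spark and additionally assumes c1 to c4 must be distinct nodes.

```python
from itertools import permutations

# Assumed reconstruction of the motif described above (edge directions are a
# guess); with GraphFrames it would be passed to `g.find(MOTIF)`.
MOTIF = "; ".join(
    [f"(a)-[]->(c{i}); (b)-[]->(c{i})" for i in range(1, 5)] + ["!(a)-[]->(b)"]
)


def unconnected_pairs_with_common_targets(edges, k=4):
    """Ordered pairs (a, b) sharing >= k distinct targets, with no a->b edge."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)
    return [
        (a, b)
        for a, b in permutations(sorted(targets), 2)
        if len(targets[a] & targets[b]) >= k and b not in targets[a]
    ]


# Hypothetical toy edges: x and y both point at c1..c4, and y also follows x.
toy = [(s, f"c{i}") for s in ("x", "y") for i in range(1, 5)] + [("y", "x")]
pairs = unconnected_pairs_with_common_targets(toy)
```

On this toy input only the pair (x, y) survives: x and y share four targets and x does not point at y, while (y, x) is dropped because y already follows x.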

The last two requirements are a bit more complex because they need a stateful query. It means we need to look at the types of the intermediate nodes and edges and keep a count of the member-type connections to tables.

Since GraphFrames is based on DataFrames, for stateful queries we need to define a filter that is built with sequence operations.
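The idea of carrying state along a sequence of edges can be sketched in plain Python with `functools.reduce`. The matches, the edge names `e1` to `e3`, and the threshold of two memberships are all hypothetical, chosen only to show the fold; they are not the original query.

```python
from functools import reduce

# Hypothetical motif matches: each maps an edge name to that edge's row.
matches = [
    {"e1": {"type": "member"}, "e2": {"type": "member"}, "e3": {"type": "follow"}},
    {"e1": {"type": "follow"}, "e2": {"type": "connection"}, "e3": {"type": "member"}},
]

EDGE_NAMES = ["e1", "e2", "e3"]


def add_if_member(count, edge_type):
    """State update: bump the running count when the edge is a membership."""
    return count + 1 if edge_type == "member" else count


def member_count(match):
    """Fold over the edge names in order, carrying the count as state."""
    return reduce(
        lambda cnt, name: add_if_member(cnt, match[name]["type"]), EDGE_NAMES, 0
    )


# Keep only matches with at least two member-type edges (illustrative threshold).
kept = [m for m in matches if member_count(m) >= 2]
```

In GraphFrames the same fold would be built over Column expressions rather than Python values (for example with `when` and `otherwise` from `pyspark.sql.functions`), and the resulting condition passed to `.filter` on the motif result.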