Friday, 12 September 2014

Data Science at scale: the 1-hop neighbours problem

This post is
technical. An understanding of how systems work and how algorithms are designed is recommended.

As a companion
post to “Calling out the Big Data Scientist”, here is an illustration of the
kind of decisions a data scientist would need to make when carrying out graph
analytics with large data sets.

Consider building a recommendation engine for people/friends on a social network. Many
attributes feed into people recommendations,
including common subjects of interest, places visited, universities
attended together, shared workplaces etc. One of the most important features for
recommendations is the number of friends that the user and the potential
candidate to be recommended have in common. Let’s refer to the user as U and
the recommended person as X. The number of friends that U and X have in common
is the number of intermediaries that U is connected to and that in turn
connect to X. In graph jargon, it is the number of distinct paths that connect
U and X in a single hop. That’s why this problem is also referred to as the
"1-hop neighbours" problem.

For ease of
understanding and without loss of generality, assume that a typical user U has
10 connections, and those friends are typical users themselves in that they
also each have 10 connections. Note that there is a good chance that a typical
user could also be connected to a well-connected user (where well-connected
refers to having a large number of connections, say 1000 or more), for instance
a celebrity or a headhunter or a popular subject matter expert.

To solve the
"1-hop neighbours" problem, a programmer’s approach could be to store the
edge data (i.e. who is connected to whom) in a file and then write some
code. Clearly, the code is the complex part of this approach. We will
not dig deep into this line of approach since it is brute force and there are
more readymade techniques available that will solve the problem with less
effort and time.

An
RDBMS-oriented analyst could model the edge data as a relation with each tuple
having 2 columns - “src” and “dest” - representing people as
vertices, each with an identifier. Other representations for graph data
are possible, but an edge-based representation turns out to work well in
practice for large, real-world graphs.

In this RDBMS
setup, the SQL query to calculate the number of common friends between every
pair of users is as follows:

SELECT L1.src, L2.dest, count(1) AS common_friends_count
FROM edges L1 INNER JOIN edges L2 ON (L1.dest = L2.src)
WHERE L1.src <> L2.dest
GROUP BY L1.src, L2.dest

In brief, the
query joins the edges table (referred to as L1) with itself (referred to as L2,
to disambiguate) so that for every user (i.e. L1’s “src” column) we get his friend
(i.e. L1’s “dest” column for the corresponding tuple), and then uses the L2 copy
to get that friend’s friends (i.e. L2’s “dest” column for all tuples matching
L1's "dest" column). Then, by counting the number of tuples that result from
the join, we get the answer for every pair of users. We also need to avoid
loops back to the same user; the WHERE clause filters those out.
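The self-join can be mirrored in a few lines of plain Python over an illustrative edge list (a sketch, not how a database executes it), which makes the semantics of the query concrete:

```python
from collections import Counter

# Illustrative directed edge list: each tuple is one (src, dest) edge.
edges = [("u", "a"), ("u", "b"), ("a", "x"), ("b", "x"), ("a", "y")]

# Self-join on L1.dest = L2.src, excluding loops (L1.src = L2.dest),
# then count per (src, dest) pair - mirroring the GROUP BY.
counts = Counter(
    (s1, d2)
    for (s1, d1) in edges
    for (s2, d2) in edges
    if d1 == s2 and s1 != d2
)

print(counts[("u", "x")])  # 2: u reaches x via a and via b
```

The nested loop is exactly the brute-force cost the rest of this post is about: every edge is compared against every other edge.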

The SQL query
above performs well for a small graph. Unfortunately, it fails quite
spectacularly on graphs of large real-world networks. This is because of two
reasons:

real-world graphs are sparse (intuition: the number of people each of us
typically knows is far smaller than the total number of people in the
network worldwide);

real-world graphs have network effects (meaning, the user U
who has 10 friends can, by tapping into his friends’ networks, reach at least
100 users, who in turn can tap into their friends’ networks to allow U to reach
at least 1000 users, and so on; since there are many well-connected individuals
in real-world graphs, within a small number of hops a typical user can reach
most of the network - this is the so-called “Six Degrees of Separation”).
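The arithmetic behind the network-effect intuition is simple enough to check directly (ignoring overlap between circles of friends, which real graphs have plenty of, so these are upper bounds on distinct users reached):

```python
# Reach grows geometrically with the number of hops: a user whose
# friends each have AVG_DEGREE friends reaches about AVG_DEGREE**k
# users in k hops (overlap ignored for illustration).
AVG_DEGREE = 10

for hops in range(1, 4):
    print(hops, AVG_DEGREE ** hops)  # 1 -> 10, 2 -> 100, 3 -> 1000
```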

To appreciate
the argument, let us consider first how we would execute this query on a large
graph in a database. To start with, single machine databases are not sufficient
to carry out the analysis on a large graph. Even if the graph can be stored on
disk, it is prohibitively expensive to have enough resources on a single
machine to be able to compute the answer. So, a parallel database, preferably
with a shared-nothing, massively parallel processing architecture, needs to be
considered. In a typical parallel database, data is usually laid out
distributed by the hash of a particular column or set of columns - for the
edges data of 2-tuples, there are only 2 choices, "src" or “dest”,
either of which is reasonable. Let us choose “src” as the distribution column.
This implies that all edges from the same “src” are available together on a
single machine. The various "src" values are spread out roughly
uniformly across the entire cluster.
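Hash distribution on "src" can be sketched as follows (the hash function, cluster size, and edges are illustrative stand-ins, not any particular database's implementation):

```python
import hashlib

NUM_MACHINES = 4

def machine_for(src):
    """Assign a vertex id to a machine by hashing it (sketch)."""
    h = int(hashlib.md5(str(src).encode()).hexdigest(), 16)
    return h % NUM_MACHINES

# All edges sharing a "src" land on the same machine.
edges = [("u", "a"), ("u", "b"), ("v", "a")]
partitions = {m: [] for m in range(NUM_MACHINES)}
for src, dest in edges:
    partitions[machine_for(src)].append((src, dest))
```

A well-connected user with a million edges would send all million tuples to one machine, which is exactly the skew problem discussed next.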

The
distribution of the graph on a column across the parallel database is itself a
problem. This is due to the presence of well-connected users. The specific
machines that store the edge data for such well-connected users are assigned a
much larger chunk of data than other machines resulting in skew. For a parallel
database, skew is a real performance killer. Let us set this aside and revisit
the graph distribution issue later.

Now, consider the
processing side of the argument. Let us get into the operations of a parallel
database involved in the execution of this query:

In order to execute the join of the edges
table with itself, the join needs to be executed locally on every machine. This
requires having the data necessary for the join to be present locally. This in
turn requires creating a copy of edges that is distributed on the “dest”
column. The dominant cost in this operation is the cost of transferring the
entire edges table once over the network.

Each machine in the cluster then executes
the join locally. The join produces significantly more tuples than
the input tables due to the network effect of expanding from the circle of friends
to the circle of friends of friends. The dominant cost here is the join itself,
which can be executed using any of a number of well-known relational algebra
join techniques like hash joins, sort-merge joins, hybrid hash joins etc.

Since the aggregation in the query happens
on columns that are different from those used in the join condition, the output
of the join will need to be redistributed. The dominant cost here is the cost
of transferring the output table from the join over the network.

The aggregation is carried out on the
redistributed join output. The dominant cost depends on the algorithm used to
perform the aggregation, which could be any of a number of well-known
techniques like hash aggregations, sort aggregations, hybrid hash aggregations
etc.
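The four steps above can be strung together as a toy end-to-end simulation (illustrative data; `part` is a stand-in for the database's hash function and `NUM_MACHINES` is arbitrary):

```python
from collections import Counter, defaultdict

NUM_MACHINES = 2
edges = [("u", "a"), ("u", "b"), ("a", "x"), ("b", "x")]

def part(key):
    """Stand-in for the database's hash-distribution function."""
    return hash(key) % NUM_MACHINES

# Base table layout: edges distributed on "src".
by_src = defaultdict(list)
for s, d in edges:
    by_src[part(s)].append((s, d))

# Step 1: a copy of edges redistributed on "dest" (network transfer).
by_dest = defaultdict(list)
for s, d in edges:
    by_dest[part(d)].append((s, d))

# Step 2: local join on each machine on L1.dest = L2.src;
# co-location holds because part(d1) == part(s2) whenever d1 == s2.
joined = []
for m in range(NUM_MACHINES):
    for s1, d1 in by_dest[m]:
        for s2, d2 in by_src[m]:
            if d1 == s2 and s1 != d2:
                joined.append((s1, d2))

# Steps 3-4: redistribute the join output by (src, dest) and aggregate.
counts = Counter(joined)
print(counts[("u", "x")])  # 2
```

Note that `joined` - the friends-of-friends tuples - is the large intermediate result whose transfer and aggregation dominate the cost on real graphs.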

The join
operation in step 2 is more expensive than it needs to be, considering the sparsity
of the graph. The network effects present in real-world graphs produce a large
amount of join output, which adds to the latency of step 2 and has an even
more significant impact on the aggregation operation. The aggregation operation
is again more expensive than it needs to be, for the same sparsity reasons.
As an illustration of sparsity: even though network effects mean a
user U could reach 100 friends in 1 hop starting from his 10 friends,
notice that computing the number of friends user U has in common with any of
those 100 users does not depend on the rest of the network.

By adopting a
scalable approach and a platform that is optimised for processing large graphs,
the "1-hop neighbours" problem becomes easier to solve and to get the
performance right. The popularity of bulk synchronous parallel (BSP) models, or
vertex-centric processing, stems from the benefits they provide for
large-scale processing of real-world graphs. Notable among the implementations are Google’s Pregel and its many clones, including Apache Giraph,
Teradata Aster SQL-GR, and GraphLab. In
this model, each vertex becomes an independent thread of computation that
processes the messages sent to it by its friends and then sends
messages, if any, to its friends. The BSP model guarantees that a vertex
receives only the messages that are sent to it and that a message is sent only
to the vertex it is meant for.

For the task of
solving the “1-hop neighbours" problem, an implementation using the BSP
model has the following 3 iterations, after which the model stops execution
since there are no more messages:

Each vertex sends each of its neighbours a
message identifying the connecting edge, of the form “src”-”dest”. The dominant
cost in this operation is the cost of transferring the entire edges table once
over the network.

In the next iteration, a vertex broadcasts
each message that it receives to every one of its neighbours. The dominant cost
in this operation is the cost of transferring the output messages which
are the tuples corresponding to the friends of friends.

In the last iteration, each vertex
aggregates over its messages the number of occurrences of each “src”, sets
itself as the "dest" and constructs the output as the
"src"-"dest"-count.

For a large
graph, iterations 1 and 2 have nearly the same cost as steps 1 and 3 in the
parallel database approach. Note that there is no join operation in the BSP
model as vertex-to-vertex message routing is provided by the platform.
Iteration 3 in the BSP approach carries out the aggregation operation but needs
to only consider those rows that influence the output for that pair of
users.

Comparing the
aggregation steps of the two approaches at a deeper level: step 4 in the parallel
database approach aggregates over all rows. Each machine in the cluster
executes the aggregation locally and needs a memory structure
(typically, a hash table) large enough to hold all pairs of vertices that occur on
that machine. When the number of distinct pairs (and their counts) is likely to
exceed available memory, the database engine decides to adopt a disk-based
approach to carry out the aggregation. When that happens (which it does for
large graphs), the same input data needs to be scanned multiple times. This
translates to significant overheads compared to the aggregation in the BSP
model.

In the BSP
model, since each vertex is a thread of computation, iteration 3 carries out the
aggregation easily in memory, scanning only the friends-of-friends tuples that
the vertex receives. The typical vertex uses very little memory in carrying out
a hash aggregation, exploiting the sparsity observation. For the extremely connected
users, it is possible that the number of distinct pairs is far larger and so a
disk-based aggregation would be required. Again, since each vertex’s
aggregation operation operates on only its friends of friends, the disk-based
aggregation would need to store, retrieve and scan multiple times only the
tuples of that vertex. Even putting together the aggregations that happen
across all vertices, the resources used in the BSP model are less than in the
parallel database approach.

In summary, the
BSP model is equivalent to the parallel database approach in terms of network
cost of data transfer, avoids the join processing cost almost completely, and
minimises the cost of the aggregation step by exploiting graph sparsity. For a
large graph, even though network costs are significant, the data processing costs
are significant too. Coupled with the potential skew issues of distributing the
graph over the database, the parallel database approach is a poorer fit
than the BSP model for the "1-hop neighbours" problem.

A quick note on
graph distribution and the potential problems of skew: the BSP model allows for
much more flexibility in dealing with the problem of distributing real-world
graphs. This is not to say that the skew problem disappears altogether, only
that the flexibility allows modelling the distribution in such a way as to
minimise the impact of skew.

A recent blog post over at Intel's Science and Technology Center (ISTC) for Big Data presents results comparing the performance of a special class of parallel databases
(namely, a column store) with that of a BSP model on two graph-processing
problems: computing PageRank and computing shortest paths. The post goes on
to claim that the "1-hop neighbours" problem too is best solved in a
database, using SQL to express the solution. As we saw earlier in this post,
though the claim that the SQL is easy to write is indeed true, it is far from
obvious that databases would perform as well as a BSP model. Performance in
practice depends not on theory alone but also on how the systems are implemented (for examples
of BSP model implementations that are faster than Apache Giraph, see GoFFish and Giraphx).
For the “1-hop neighbours” problem on massive graphs, our past experience has
in fact been contrary to what the ISTC post claims - namely, a database by itself is not the right
vehicle for the "1-hop neighbours" problem as compared to the BSP model.

Using the right
tool for the right job can go a long way in solving a problem, and that is very
much true for data science as well. The relational model along with SQL has
many advantages owing to the maturity of the industry. By exploiting specific properties of the problem
(including number of vertices, number of edges, connectivity etc.), there are
other approaches on parallel databases that a data scientist could design for
solving the “1-hop neighbours” problem that could be faster than the BSP model - for instance,
using custom user-defined functions and keeping a copy of the edges table on every
machine in the cluster, provided the graph is not too large. A big data
scientist learns this by studying the properties of the problem and then
prototyping and experimenting with these variants.