News and stories from data mining research in Aalto

Scalable facility location for massive graphs on pregel-like systems

Our paper on designing a distributed algorithm for solving the facility-location problem was accepted at the CIKM 2015 conference, and will be presented in Melbourne next week.

What is the facility-location problem? Facility location is a classic problem, first studied in the field of operations research. In the problem setting, we are given a set of ‘facilities’ and a set of ‘locations’ and the goal is to find a mapping of the locations to the facilities such that a certain objective function is minimized. The objective function models the operating cost of serving the locations with a set of selected facilities, and it includes two terms: a cost term for opening a new facility, and a cost term for serving a location with an open facility.

A classic example is the placement of warehouses in a city. Consider a supermarket chain, with stores at various places in the city. Lets say we want to place warehouses (facilities) in order to serve the various supermarkets (locations) such that the overall operational cost is minimized. The operational cost is the cost of opening and maintaining new warehouses plus the transportation cost from warehouses to supermarkets.

Limitations of existing methods. The facility-location problem also has numerous applications in other fields such as information retrieval, social-network analysis. and network planning. Though this problem (and numerous variations) have been studied extensively, most existing algorithms have two drawbacks: (i) they assume that the input is a full distance matrix between facilities and locations, which requires n^2 space and time to be materialized, (ii) they are sequential and assume that the input data fit in memory. These assumptions are a bottleneck when dealing with large, real-life graphs.

Our solution. To address these issues, we designed a fully-distributed algorithm that allows us to solve the facility-location problem for graphs with billions of edges. In our paper, we make the following major contributions:

We provide the first ever distributed solution to the facility-location problem, thus making it applicable to process massive web-scale graphs.

We do not compromise on the approximation guarantees provided by the state of the art sequential & PRAM algorithms.

Our solution uses a sparse graph-based representation, which does not assume the availability of distances between all pairs of vertices. This is critical for our algorithm to be applicable to real-world graphs.

We also provide the first open-source Giraph implementations of All Distances Sketching (useful for neighborhood estimation) and Maximal Independent Set computation.

Our algorithm to solve the facility-location problem is divided into three phases. I will describe the three stages briefly now. For more details, please refer to our paper.

All Distances Sketches (pre-processing)

All Distances Sketches (ADS) help in approximating neighborhood estimation for weighted and unweighted graphs.

We compute the ADS for an input graph as a pre-processing step and use it later in our algorithm. Since we represent the input as a graph, ADS helps in answering the question “How many vertices are at a distance d from vertex v?”

Facility opening

Each vertex tries to expand a (virtual) ball around it, in parallel.

When the ball corresponding to a vertex is big enough to reach a sufficient number of other vertices, we declare it as an ‘open facility’ and the vertices it reaches as ‘frozen locations’.

This stage continues until all vertices are either open facilities or frozen locations.

Facility selection (Maximal Independent Set (MIS) computation)

Because of the inherent parallel nature of our algorithm, each location could be assigned to multiple facilities (which is not allowed in our setting).

To solve this, we construct a special graph on the set of open facilities and run a Maximal Independent set algorithm on that graph.

The resulting MIS gives us the output to the Facility Location problem.

We evaluated our implementation on a large Hadoop cluster (thanks to Yahoo!) of 500 machines using Apache Giraph. Our experiments on large real and artificial graphs show the scalability of our approach to graphs with 10’s of millions of vertices and billions of edges.

We had to overcome many practical challenges while designing and implementing the algorithm on Giraph. Since we are dealing with large-scale graphs, on a distributed setting, even seemingly simple stuff such as function calls can be complicated, because of issues with coordination between machines. We needed to reduce the communication overhead and we tried to design our algorithms such that the number of messages exchanged between machines is minimized. No assumptions could be made that the graph fits in the memory, so we had to design algorithms on graphs without actually constructing them.