Sergey Edunov, Software Engineer at Facebook gave a great talk on how and why his company generating large-scale social graphs. The underlying reasons to start such an ambitious project are capacity planning to make sure that their system will be able to handle a graph that keeps growing year after year and fair evaluation of their system against the ones being implemented by other companies.

He explained that Facebook has been testing other benchmarks such as graph500 but it doesn’t work for them since it doesn’t represent social graphs (it’s based on Kronecker graphs) and the algorithms used are not relevant for the company either. Edunov commented that they’re mainly interested in Friend of Friend counts, PageRank, Community Detection or Local Clustering Coefficient amongst others. To test some of the benchmarks out there the only thing they did was try different applications that they use on the real graph expecting similar behaviours but they found out that, for the most complex algorithms (balanced partitioning for example), Kronecker-based benchmarks were completely inefficient. After this test, they looked at the different benchmarks available and generated a set of requirements to test them out:

Match the graph size. No scalability, no fit for them.

Match degree distribution.

Match joint degree and clustering coefficient.

Match high-level application metrics.

None of the existing benchmarks matched their criteria so they decided to build Darwini, their own Apache Giraph that scales to hundreds of machines. Darwini is also capable of generating graphs with trillion edges, joint degree-clustering coefficient distribution and local degree-clustering coefficient and shows better accuracy in performance than the original graph.

Darwini’s algorithm works following 4 steps:

Generate vertices with your desired local clustering and regional coefficient.

Group vertices that expect the same amount of triangles together.

Generate random edges within each of the previously mentioned groups.

Create random edges between each set of groups.

Testing shows that Darwini performs similarly as the BTER graph but they’re trying to improve the results they currently have.

For the slides Sergey presented and his whole presentation see below. Remember that the 9th TUC Meeting will be held at at SAP's HQ in Walldorf, Germany the 9-10th of February. Start planning your assistance!