Distributed processing of large graphs in python

Graph theory could potentially make a big impact on how we conduct business. Imagine that you wish to maximize the reach of a promotion by leveraging your customers' influence: they advocate your products and bring their friends on board. The same logic of harnessing a person's network applies to purchase recommendation, customer-behavior analysis, and fraud detection.

Running analyses on large graphs was not trivial for many companies until recently. The field has made significant strides in the last five years, and scalable graph computation is now the norm. You can run graph computations out-of-core (no memory constraints) and in parallel (across multiple machines), especially in Spark, which is spreading like wildfire.

A lot of people are familiar with GraphX, a fairly solid implementation of scalable graphs in Spark. GraphX is interesting, but the project seems to be orphaned. The good news is that there is now an alternative: GraphFrames, a new data structure that takes the best parts of DataFrames and graphs.
In this talk, I explain how to use GraphFrames from Python in Spark 2.0, with an example that uses personalized PageRank for recommendations.
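To make the recommendation example concrete, here is a minimal pure-Python sketch of personalized PageRank via power iteration, the same algorithm GraphFrames exposes as `g.pageRank(resetProbability=..., sourceId=...)`. The toy graph, damping factor, and iteration count below are illustrative choices, not values from the talk.

```python
# Minimal sketch of personalized PageRank (power iteration).
# In GraphFrames the equivalent call on a GraphFrame g is:
#   g.pageRank(resetProbability=0.15, sourceId="a", maxIter=50)

def personalized_pagerank(edges, source, damping=0.85, iters=50):
    """edges: list of (src, dst) pairs; source: the seed vertex."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for s, d in edges:
        out[s].append(d)
    # All mass starts at the seed vertex.
    rank = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, r in rank.items():
            if out[n]:
                # Spread a damped share of rank along out-edges.
                share = damping * r / len(out[n])
                for d in out[n]:
                    nxt[d] += share
            else:
                # Dangling nodes return their mass to the seed.
                nxt[source] += damping * r
        # Restart at the seed with probability (1 - damping):
        # this is what makes the ranking "personalized".
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
scores = personalized_pagerank(edges, source="a")
# Vertices well connected to the seed "a" get higher scores,
# which is the basis for using them as recommendations.
```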

25.
Sampling in Networks
Note that sampling in networks is fraught with difficulties. One cannot simply sample the edges and nodes and expect the sample to be representative of the original network. In the graph below, a sample that missed node 1 or node 2 would disconnect the two clusters and would not have the same properties as the original network.
[Figure: two clusters joined through nodes 1 and 2]
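The failure mode on this slide can be reproduced in a few lines. Below is an illustrative toy graph (not the slide's actual figure): two triangles joined through a single bridge node. Any node sample that drops the bridge disconnects the clusters, so the sample no longer shares the original network's properties.

```python
# Two clusters connected only through one "bridge" node.
from collections import deque

edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a1"),   # cluster A
         ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),   # cluster B
         ("a1", "bridge"), ("bridge", "b1")]          # the only link

def is_connected(edges):
    """BFS from an arbitrary node; connected iff every node is reached."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = list(adj)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(nodes)

def drop_node(edges, node):
    """Simulate a sample that missed `node`."""
    return [(u, v) for u, v in edges if node not in (u, v)]

print(is_connected(edges))                       # True
print(is_connected(drop_node(edges, "bridge")))  # False: clusters split
```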

45.
A partitioned, distributed graph processing engine
is significantly more complex and difficult to build

46.
GraphX and GraphFrames (new in Spark 2.0)
• GraphX is to RDDs as GraphFrames are to DataFrames
• GraphX is lower level, and its API is Scala-only; GraphFrames are very new
• GraphFrames are not designed to be a graph database, like Neo4j. Nodes and edges can contain metadata, but the query engine is not as complete as Cypher

47.
Advantages of GraphFrames
• GraphFrames have a Python API
• GraphFrames give you simple querying for free: vertices and edges are stored as DataFrames, so many queries are just DataFrame (or SQL) queries
• They contain most of the algorithms in GraphX, but the API is less well-tested
• You use the PySpark shell instead of spark-shell
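The "querying for free" point follows from the representation: a GraphFrame is just two tables, vertices (with an `id` column) and edges (with `src`/`dst` columns). The sketch below uses pandas as a stand-in so it runs without a Spark cluster; in Spark the same query is `g.edges.filter("weight > 0.5")` or plain SQL. The column `weight` and the example values are illustrative, not from the talk.

```python
# A graph as two tables, in the GraphFrames layout:
# vertices have "id", edges have "src" and "dst".
import pandas as pd

vertices = pd.DataFrame({"id": ["a", "b", "c"],
                         "name": ["Alice", "Bob", "Carol"]})
edges = pd.DataFrame({"src": ["a", "b", "a"],
                      "dst": ["b", "c", "c"],
                      "weight": [0.9, 0.3, 0.7]})

# "Which strong edges leave vertex a?" is an ordinary table filter,
# no graph-specific query language needed:
strong_from_a = edges[(edges.src == "a") & (edges.weight > 0.5)]
print(strong_from_a.dst.tolist())  # ['b', 'c']
```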

50.
Summary of implementation, benefits
• Graph theory is a really flexible way to represent a problem
• Data structures to represent graphs are mature
• You can now do out-of-core, distributed graph analysis for cheap
• Implementations exist even for state-of-the-art methods

51.
Summary, finding a problem
• We live in an age of abundance (methods, data, hardware, ideas)
• Finding the question is more than half of the battle
• I had about a week to prepare this talk, but I managed to put together something that showcases what you can do with large graphs today, and it could work as a startup idea
• My question is not great, because you cannot demonstrate that it works until you use it (a common problem for unsupervised methods)

52.
The question: When should I tweet
to influence the right account?
Or ‘beat Buffer at their own game’

53.
References: Drawing graphs
• Graphs in this slide set have been drawn with Gephi
• If you use a Zeppelin notebook, you can draw graphs with:
drawGraph(org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 32, 60))

54.
25 videos explaining ML on Spark, 50 more to come. A bunch on GraphX
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-scala-and-spark

56.
About learning new tech over seven weekends
• You have time and enjoy using it to learn alone: learn it ‘the hard way’
• You are extremely motivated and talented, and have money: apply for DSR
• You want your weekends for yourself, and you are already very good but want to switch jobs: apply for codekitt