Neo4j and Apache Spark


Goals

There are various ways to use Neo4j and Apache Spark together to good effect. Here we list some approaches and point to solutions that enable you to leverage your Spark infrastructure with Neo4j.

Prerequisites

You should have a sound understanding of both Apache Spark and Neo4j: their data models, data processing paradigms, and APIs. This will let you leverage them together effectively.


Overview

General Observations

Apache Spark is a clustered, in-memory data processing solution that scales processing of large datasets easily across many machines. It also comes with GraphX and GraphFrames, two frameworks for running graph compute operations on your data.

You can integrate Neo4j with Spark in a variety of ways.
One approach is to use Spark to pre-process (aggregate, filter, convert) your raw data before importing it into Neo4j.

Spark can also serve as an external graph compute solution: you export data for selected subgraphs from Neo4j to Spark, compute the analytic results there, and write them back to Neo4j to be used in your Neo4j operations and Cypher queries, as the sketch below illustrates.
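As a hedged illustration of the compute step of that cycle, the following sketch runs five PageRank iterations with plain GraphX over an edge list assumed to have already been exported from Neo4j; the file names and the one-pair-per-line CSV format are assumptions made for the example, not a prescribed export format.

import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.Graph

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("neo4j-graphx-pagerank")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical export: one "sourceId,targetId" pair per line,
    // produced by a Cypher query over the subgraph of interest.
    val edges = sc.textFile("exported-relationships.csv").map { line =>
      val Array(src, dst) = line.split(",")
      (src.toLong, dst.toLong)
    }

    // Build a GraphX graph and run 5 static PageRank iterations,
    // matching the iteration count discussed below.
    val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
    val ranks = graph.staticPageRank(numIter = 5).vertices

    // Persist the scores; a follow-up step would write them back
    // to Neo4j as node properties.
    ranks.map { case (id, rank) => s"$id,$rank" }
      .saveAsTextFile("pagerank-scores")

    spark.stop()
  }
}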

Neo4j itself is capable of running graph processing on medium to large graphs quickly.
For instance, the graph-processing project demonstrates that PageRank (5 iterations) on the dbpedia dataset (10M nodes, 125M relationships) runs in about 20 seconds as a Neo4j server extension or user-defined procedure.
Spark might be better suited for larger datasets or more intensive compute operations.

Similar load operations are available for DataFrames and GraphX.
The GraphX integration also allows writing results back to Neo4j with a save operation.
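A sketch of that round trip might look like the following. The Neo4jGraph helper and its loadGraph/saveGraph signatures are assumptions based on the older neo4j-spark-connector (2.x line), so check them against the connector version you use; the connection details are taken from the Spark configuration.

import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.lib.PageRank
import org.neo4j.spark.Neo4jGraph

// Assumption: Neo4jGraph comes from the neo4j-spark-connector and
// the Spark config carries the bolt URL and credentials.

// Load the (:Person)-[:KNOWS]->(:Person) subgraph into GraphX.
val graph: Graph[Long, String] =
  Neo4jGraph.loadGraph(sc, "Person", Seq("KNOWS"), "Person")

// Run five PageRank iterations on the loaded subgraph.
val ranked = PageRank.run(graph, numIter = 5)

// Write the scores back to Neo4j as a "rank" node property.
Neo4jGraph.saveGraph(sc, ranked, "rank")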

To use GraphFrames you have to declare it as a package when starting the Spark shell.
Then you can load a GraphFrame with graph data from Neo4j and run graph algorithms or pattern matching on it (the latter will be slower than in Neo4j).
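For example, a minimal sketch for the Spark shell follows. The package coordinates are an assumption (pick the build matching your Spark and Scala versions), and the inline vertex/edge DataFrames stand in for data you would in practice load from Neo4j.

// Started with e.g.:
// spark-shell --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
import org.graphframes.GraphFrame

// Assumption: vertices/edges would normally be read from Neo4j;
// inline data keeps the sketch self-contained.
val vertices = spark.createDataFrame(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
)).toDF("id", "name")
val edges = spark.createDataFrame(Seq(
  (1L, 2L, "KNOWS"), (2L, 3L, "KNOWS")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Graph algorithm: PageRank, 5 iterations as above.
val ranks = g.pageRank.resetProbability(0.15).maxIter(5).run()
ranks.vertices.select("id", "pagerank").show()

// Pattern matching (motif finding); this is slower than running
// the equivalent Cypher pattern inside Neo4j.
g.find("(a)-[e]->(b)").show()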

Spark for Data Preprocessing

One example of pre-processing raw data (the Chicago Crime dataset) into a format well suited for import into Neo4j was demonstrated by Mark Needham.
He combined a number of functions into a Spark job that takes the existing data, cleans and aggregates it, and outputs fragments that are later recombined into larger files.
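A hedged sketch of that kind of pre-processing job is shown below; the column names and file paths are made up for the example and do not reproduce the original dataset's schema or Mark Needham's actual code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CrimePreprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("crime-csv-preprocess")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: the raw crimes CSV with a header row.
    val raw = spark.read
      .option("header", "true")
      .csv("crimes.csv")

    // Clean: drop rows missing required fields, normalize case.
    val cleaned = raw
      .na.drop(Seq("ID", "Primary Type"))
      .withColumn("crimeType", lower(col("Primary Type")))

    // Aggregate: one row per crime type, ready to become
    // (:CrimeType) nodes in a Neo4j import file.
    cleaned.groupBy("crimeType")
      .count()
      .write
      .option("header", "true")
      .csv("crime-type-nodes") // fragments, recombined later

    spark.stop()
  }
}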
