Neo4j and Cassandra

Goals

The goal of this guide is to export data from Cassandra, convert to a property graph model and insert into Neo4j.
To accomplish this we will use the Neo4j Cassandra data import tool, a prototype command line application that enables translation from a column-oriented data model to a property graph data model.

The Neo4j Cassandra data import tool is in its infancy and currently has many limitations.
It is a simple prototype meant to support a limited data model. We’d appreciate any feedback you might have; please raise an issue on the GitHub project.

Prerequisites

You should understand Cassandra and Neo4j and be familiar with column-oriented and property graph data models.
You will need Cassandra, Neo4j and Python 3.x installed.

Figure 1. Translating from a column-oriented data model to a property graph.

Neo4j Cassandra Data Import Tool

Overview

The main goal of the Neo4j Cassandra data import tool is to provide a direct way to map a Cassandra schema to Neo4j and import result sets from Cassandra columns into the Neo4j property graph model, generating a meaningful representation of nodes and relationships. You specify this translation by populating a YAML file, based on the Cassandra schema, that describes how the data should be mapped from a column-oriented data model to a property graph. As an intermediate step, the tool exports data from Cassandra into CSV format using the Cassandra Python driver. It then generates LOAD CSV Cypher statements from the specified data model mapping to load the data into Neo4j. The following sections guide you through this process and provide some mapping examples.
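The intermediate CSV step can be pictured with a minimal Python sketch. It is not the tool's actual code: the function name rows_to_csv, the column names and the sample rows are all hypothetical stand-ins for a Cassandra result set, and only the standard library is used.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize dict-like rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keep only the mapped columns, in a stable order.
        writer.writerow({col: row[col] for col in columns})
    return buffer.getvalue()


# Hypothetical rows standing in for a Cassandra result set.
sample_rows = [
    {"track_id": "t1", "track": "Song A", "artist": "Artist X"},
    {"track_id": "t2", "track": "Song, B", "artist": "Artist Y"},
]

csv_text = rows_to_csv(sample_rows, ["track_id", "track", "artist"])
print(csv_text)
```

Using csv.DictWriter rather than joining strings by hand means values that contain commas or quotes are escaped correctly, which matters because LOAD CSV reads the header line to name each field.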

Populating an initial Cassandra Database

We will use a sample database of musicians and songs, which is included with the tool. Go to the db_gen directory, start the Cassandra shell cqlsh and run the command SOURCE '/playlist.cql'. You can also provide the absolute path of the file. This populates your Cassandra database with a sample Tracks and Artists database.

Inspect the Cassandra schema

After populating your initial database, you must generate a file that maps the Cassandra schema to a graph. Do the following:

In the project directory, navigate to the connector/ subfolder.

Run the script connector.py by invoking python connector.py parse -k playlist.

Some output files will be generated. At this stage, take a look at the generated schema.yaml file. It contains a YAML representation of the Cassandra schema, with placeholders for specifying how to convert this schema into a Neo4j property graph data model.

The next step consists of populating the placeholders in this file with mapping information. Check out the next section for more information.

Configure data model mappings

In order to import data into Neo4j the mapping from Cassandra schema to Neo4j property graph must be specified. This is done by populating the placeholders in the generated schema.yaml file.

An example of a filled-in YAML file can be found at connector/schema.yaml.example.

Important points to consider when mapping:

This first version does not have strong error handling, so please be aware of the following aspects:

If you map a field as a relationship between two nodes, mark the field with r in both tables. In the example, note that artist is mapped as r in both tables, playlist.track_by_artist and playlist.track_by_id. In this initial version, keys must have the same name to indicate a relationship.

Regarding unique constraints: make sure that no more than one node will have a given value for the property you select for the constraint. The u marker only works on lines that are marked with PRIMARY KEY. For example, PRIMARY KEY (first_letter {p}, artist {u}) selects artist as a constraint, so there cannot be more than one node with the same artist.

To avoid performance issues, try to promote fields to constraints where doing so would reduce the number of nodes created (while keeping the model meaningful).
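Because the tool has little error handling, it can help to check a mapping before importing. The following Python sketch illustrates the two rules above: it is not part of the tool, all function names are hypothetical, and the per-table dictionaries are an assumed in-memory form of the mapping rather than the real schema.yaml layout. Only the brace syntax of the PRIMARY KEY line is taken from the guide.

```python
import re

# Matches "<column> {<marker>}" pairs such as "artist {u}".
PAIR = re.compile(r"(\w+)\s*\{(\w*)\}")


def parse_key_markers(line):
    """Return {column: marker} for a PRIMARY KEY line."""
    inside = line[line.index("(") + 1 : line.rindex(")")]
    return dict(PAIR.findall(inside))


def unique_columns(line):
    """List the key columns marked with u (unique constraint)."""
    return [col for col, marker in parse_key_markers(line).items() if marker == "u"]


def unpaired_relationships(tables):
    """Return columns marked r in fewer than two tables.

    tables is an assumed structure: {table_name: {column: marker}}.
    """
    counts = {}
    for columns in tables.values():
        for column, marker in columns.items():
            if marker == "r":
                counts[column] = counts.get(column, 0) + 1
    return sorted(col for col, n in counts.items() if n < 2)


key_line = "PRIMARY KEY (first_letter {p}, artist {u})"
markers = parse_key_markers(key_line)
constrained = unique_columns(key_line)

# Hypothetical mapping: genre is marked r in only one table.
tables = {
    "playlist.track_by_artist": {"artist": "r", "genre": "r"},
    "playlist.track_by_id": {"artist": "r"},
}
problems = unpaired_relationships(tables)
print(markers, constrained, problems)
```

Here artist passes the relationship check because it is marked r in both tables, while genre is reported because it is marked in only one.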

Import to Neo4j

After populating the empty brackets, save the file and run the script connector.py, now specifying the tables you wish to export from Cassandra:

Mapping data from Cassandra to Neo4j

The YAML file will be parsed into Cypher queries. A file called cypher_ will be generated in your directory. It contains the Cypher queries that create the nodes and relationships of the graph structure. Once generated, the queries are automatically executed by Py2Neo using the Neo4j connection parameters specified in schema.yaml.
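The execution step can be sketched roughly as follows. This is an illustration rather than the tool's implementation: it assumes the generated statements are separated by semicolons, which the guide does not state, and the names split_statements and run_all are hypothetical. The run argument is any callable that executes one statement; with Py2Neo, the run method of a Graph object could be passed, which is noted only in a comment so the sketch stays runnable without a database.

```python
def split_statements(text):
    """Split Cypher text on ';' and drop empty fragments.

    Naive on purpose: a ';' inside a string literal would be split too.
    """
    return [part.strip() for part in text.split(";") if part.strip()]


def run_all(statements, run):
    """Execute each statement with the supplied callable; return the count."""
    for statement in statements:
        run(statement)
    return len(statements)


# Hypothetical generated content; a real file would be read from disk.
generated = "CREATE (a:Artist {name: 'X'});\n\nCREATE (t:Track {title: 'Y'});\n"

statements = split_statements(generated)

# Stand-in runner that records statements instead of contacting Neo4j.
# With Py2Neo (assumption): graph = Graph(uri, auth=(user, password)),
# then pass graph.run as the runner.
executed = []
count = run_all(statements, executed.append)
print(count, executed)
```

Passing the runner in as a callable keeps the splitting logic testable on its own and leaves the connection details to the caller.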

Using the sample Artists and Tracks dataset, we get Track nodes and Artist nodes, connected through their artist fields. We also placed a constraint on the artist name, so that no two nodes can have the same artist name.
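To make this graph shape concrete, here is a hedged Python sketch that builds Cypher of the kind such a mapping could produce. The statements the tool really generates may differ: the property names, the PERFORMED relationship type, the CSV column names and the file URL are all assumptions for illustration. The constraint uses the older CREATE CONSTRAINT ON ... ASSERT form; recent Neo4j versions use CREATE CONSTRAINT ... FOR ... REQUIRE instead.

```python
def build_constraint(label, prop):
    """Unique-constraint statement in the older Neo4j syntax."""
    return f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"


def build_load_csv(csv_url):
    """LOAD CSV statement that merges artists and tracks and links them.

    Property names and the PERFORMED relationship type are hypothetical.
    """
    return (
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row\n"
        "MERGE (a:Artist {name: row.artist})\n"
        "MERGE (t:Track {id: row.track_id})\n"
        "SET t.title = row.track\n"
        "MERGE (a)-[:PERFORMED]->(t)"
    )


constraint = build_constraint("Artist", "name")
query = build_load_csv("file:///tracks.csv")
print(constraint)
print(query)
```

MERGE on the artist name is what keeps one Artist node per name, matching the uniqueness constraint, while each CSV row contributes one Track node and one relationship.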

Figure 3. Property graph data from sample playlist database
