I finally found some time to test the Neo4j connector for Kafka. Specifically, I am showing here how to use the consumer in Neo4j to consume data from Kafka. I have not found a sample on the web, so I thought I'd show one here.

MySQL: I have data in a MySQL database containing information about airlines and airports, and which airline flies from which origin airport to which destination airport.

Nifi: I use Apache Nifi to listen for changes to the MySQL database tables and to send the data to Kafka.

Neo4j: The database is configured to consume data from the three topics created by Nifi.

MySQL: Here are the tables:

MySQL Tables

and here are the first 10 rows from each table:

MySQL Data

The id column is an autoincrement value, and last_update is the insert or last-update timestamp for the records in each table. The airlines_airports table records which airline flies from which origin airport to which destination airport.

Nifi: Updates to the MySQL tables result in an update of the last_update column of the relevant record. Nifi will pick up the changed records and send them to Kafka in JSON format.

My Dataflow looks like this:

Three QueryDatabaseTableRecord processors watch for changes to the three MySQL tables. The UpdateAttribute processors simply define the name of the Kafka topic. Finally, the records are sent to Kafka using the PublishKafkaRecord processor.

Neo4j: I have adjusted the Neo4j configuration as documented (see link at the beginning). First, I added the Kafka configuration at the end of the neo4j.conf file:

File: neo4j.conf
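The actual configuration is in the screenshot. As a rough sketch, assuming the neo4j-streams plugin with everything running on localhost (hostnames, ports and group ID here are my assumptions), it could look like this:

```conf
# Kafka connection settings for the neo4j-streams plugin
kafka.zookeeper.connect=localhost:2181
kafka.bootstrap.servers=localhost:9092
kafka.group.id=neo4j

# enable the consumer (sink) side of the plugin
streams.sink.enabled=true
```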

This configures ZooKeeper, the Kafka brokers, the consumer group ID and a few other settings. After this, I added three Cypher statements to process the data from the three Kafka topics:

Cypher: Airports

Cypher: Airlines

Cypher: Airlines-Airports
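The original statements are in the screenshots above. As a hedged sketch: in the neo4j-streams plugin, each sink statement is attached to a topic via a `streams.sink.topic.cypher.<topic>` property, where `event` is the message payload. The topic names, field names and the relationship modelling below are my assumptions, not the original configuration:

```conf
# one Cypher statement per topic; "event" is the incoming JSON message
streams.sink.topic.cypher.airports=MERGE (a:Airport {id: event.id}) SET a.name = event.name, a.last_update = event.last_update
streams.sink.topic.cypher.airlines=MERGE (a:Airline {id: event.id}) SET a.name = event.name, a.last_update = event.last_update
streams.sink.topic.cypher.airlines-airports=MATCH (o:Airport {id: event.origin_id}) MATCH (d:Airport {id: event.destination_id}) MATCH (al:Airline {id: event.airline_id}) MERGE (o)-[:FLIGHT {airline_id: al.id}]->(d)
```

One possible design choice, shown here, is to store the route as a relationship between the two airports and carry the airline's ID as a relationship property; other modellings (e.g. separate relationships from the airline to each airport) are equally valid.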

The first two Cypher statements do a merge on the Airport or Airline node based on the ID of the record. If a node with the relevant ID exists, it is updated; otherwise it is created.

The last Cypher statement creates the relationships between the airports and the airlines: which airline flies from which airport (origin) to which other airport (destination).

So this is my data pipeline: MySQL holds the data, and any updates are made there. The changes are picked up by Nifi, which sends them to the relevant Kafka topic. And because I configured the three Cypher statements in the Neo4j config, Neo4j consumes any messages that arrive in the three Kafka topics. So if there are any changes in the MySQL data, they will automatically arrive in Neo4j.

Once the data is available or updated in Neo4j, I can run, for example, a query to see where Swiss (airline code LX) flies to from Zurich (airport code ZRH).

Cypher: Find destination of Swiss departing from Zurich
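The query itself is in the screenshot. A sketch of what it could look like, assuming the airline and airport codes are stored in an `iata` property and the routes are `FLIGHT` relationships carrying an `airline_id` (all of these names are my assumptions):

```cypher
// find all airports that Swiss (LX) serves from Zurich (ZRH)
MATCH (al:Airline {iata: 'LX'})
MATCH (:Airport {iata: 'ZRH'})-[:FLIGHT {airline_id: al.id}]->(dest:Airport)
RETURN dest.name AS destination
```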

The result would then look like this:

As you can see, configuring Neo4j to use Kafka as a streaming source is straightforward. The developers of the connector made a good choice in using Cypher as the connecting part between Kafka topics and Neo4j. This way, you have the greatest flexibility to handle the data from Kafka using the full power of Cypher.

Besides Kafka and Neo4j, Apache Nifi is used for dataflow management. It is a very good tool for dataflows: flexible, scalable, equipped with many connectors, and strong when it comes to schemas (inheriting, inferring), data provenance and routing the data to various target systems.

A short follow-up on the last post. Sometimes a picture says more than a thousand words, so I have visualized the advantages of metadata injection.

In the screenshot below you can see that with different input files - because they differ e.g. in the fields and data types, the separator used or maybe the encoding - the ETL logic is duplicated. Although the same basic logic applies, we need to create multiple transformations because of the file differences.

The result is that if something changes or the transformation is extended, it has to be done in multiple places. And if you get additional input files with yet another structure, you have even more duplication. All this has a bad influence on agility and also on quality.

Of course, you could make an effort to generalize some of the logic that all transformations have in common. In that case you would not have as many duplicates. But it makes the overall ETL more complicated, because you have things that are different and things that are common. With a growing number of input files (and differences) this also quickly becomes complicated or even unmanageable.

If you use metadata injection, the work to analyze the differences between the input files still has to be done. But the positive aspect is that you can reduce the number of transformations. Instead, you define metadata - e.g. in simple CSV files. This approach is much cleaner and simpler. Simplicity always pays off in maintenance and when you share your development work with others, and agility and overall quality will benefit as well.

In this case, when you get additional input files in yet other formats and with different data types, you won't have to touch your ETL logic. You simply define the metadata according to the input file(s) and you are done. So the more different files you have, the more you will benefit from this solution.

In Pentaho PDI (ETL), metadata injection is available for many of the steps (plugins). Once you have understood the concept and used it two or three times, it becomes an easy alternative to hardcoding file structures and duplicating logic. Of course, only if metadata injection makes sense in your use case.

Welcome to part two. I have been experimenting quite a lot recently with combining Pentaho PDI (Kettle) - the Pentaho ETL tool - with Neo4j. One thing is obvious: when you have more than a couple of nodes and relationships to create, and you do it from e.g. CSV files, you quickly start to duplicate a lot of things - e.g. reading the file and outputting it to Neo4j.

Metadata Injection: But there is a way around this duplication: PDI - being the great ETL tool it is - supports metadata injection. So instead of hardcoding field names, data types and various other things, these can be injected into a step at runtime. So when you have different CSV files for different nodes - because they have different attributes and data types - you can define this metadata (e.g. in a file). When a node shall be created, the relevant metadata is used at runtime to fill the PDI steps, and with the next node exactly the same is done. This avoids the duplication I talked about previously.

Actually, many steps in PDI support metadata injection. Not only for fields and data types, but also for filenames, formatting (separators, enclosure), encoding (e.g. UTF-8) and much more.

The plan: What I plan to do is use metadata injection in PDI to load data about Neo4j nodes from CSV files into Kafka. Then I will have another process that consumes data from different Kafka topics (one for each Neo4j node type) and creates, merges or updates the nodes and also the relationships. This I will show in part 3 of this blog series.

The messages all have an "event_type" attribute. This indicates which type of event (transaction) applies: insert, create, merge or delete.

I put it into Kafka for several reasons:

I can wipe my graph, reset my Kafka consumer to the beginning of the topic and "replay" all events that happened, which in turn recreates my graph from scratch.

I can feed Kafka e.g. from a file or do it manually; the consumer side of the messages will not change if I change this end. I could also e.g. hook up a database (with CDC - Change Data Capture) and get the data from there.

Kafka gives me real-time processing capabilities. Data that arrives in Kafka as events can immediately be consumed and trigger an update of my graph.

Use Case: The graph I will create stores information about source systems, target systems, connectors, servers, clusters, people and much more. It shows how these individual objects are connected to each other, which in turn allows queries to be run on them.

So the first step is to read the CSV files for the different nodes and output the data to Kafka - one topic per node type. At the beginning I defined which nodes I need and their relationships. Then I created several CSV files, each containing a header that defines the attributes of the node, followed by some data.

Here is an example:

The "type" field will be used as the label of the node in Neo4j. The "event_type" is the type of transaction to be performed. Later, the "source_system_id" will be used to create a relationship from this node (ITApplicationOwner) to the SourceSystem node.
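The resulting Kafka message for such a row could then look roughly like this; the field values and the exact attribute set are made up for illustration, only "type", "event_type" and "source_system_id" are taken from the description above:

```json
{
  "type": "ITApplicationOwner",
  "event_type": "merge",
  "id": "ao-1",
  "name": "Jane Doe",
  "source_system_id": "src-42"
}
```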

Then I have a PDI transformation which contains the metadata injection step and several other parts that deliver the metadata to this step. Inside the metadata injection step, the metadata is mapped to the relevant fields in the steps of another (child) transformation.

This is what the transformation looks like. In the subtransformation there are steps to:

read the relevant CSV file

concat two fields to construct the key of the Kafka message

format the data to JSON format

depending on the "environment" setting (DEV or PROD), send the result either to the log or to Kafka.

So the steps that surround the metadata injection step in the middle provide information about which fields are used in the CSV file, which delimiter is used, which encoding and more. For the JSON output step, the fields and data types are also required, and e.g. for the step that concatenates the two fields, I need to provide the name of the resulting field.

Now the subtransformation:

From the metadata injection step, the "CSV file input" step above gets the definition of the fields in the CSV file and also their data types. Next, the two fields are concatenated, then the "environment" setting is evaluated and the flow continues to either write to the log or send the message to Kafka.

Let's have a look at the "CSV file input" step:

You can see that - at the bottom - there are no fields and data types defined. Usually - in a transformation without metadata injection - one would have to define the fields that the CSV file is made up of. But here it is empty. These settings are injected into this step at runtime, as are the "delimiter", "enclosure" and "File encoding" settings.

For the "ITApplicationOwner" node and the CSV file shown further above, I have defined this metadata file:

The first line is the header row. Each of the following lines defines a field and its data type; they are injected into the "Name" and "Type" fields of the step shown in the previous screenshot.
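The actual metadata file is in the screenshot; as an illustration of its shape, a file of this kind could look as follows (field names beyond "type", "event_type" and "source_system_id" are my assumptions):

```csv
Name,Type
id,String
type,String
event_type,String
name,String
source_system_id,String
```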

And this is what metadata injection does: in this case, I only have to define the metadata of each CSV file I want to use. At runtime this structure is injected, so that I can have lots of differently formatted files and still need just one flow and logic to process them.

As you can see above, there is also a configuration in the "CSV file input" step for the "Filename" - the file that shall be processed. This does not come from the metadata. Instead, I have defined parameters for it. When I run the whole transformation, I specify the file (filename) I want to process before it kicks off, and this information is inserted appropriately. This way - at runtime - I can dynamically load different files. And of course this could be scripted and scheduled to process many files.

Here is the log info for the processed "ITApplicationOwner" CSV file that will later become a node in Neo4j. Below you can see the key and message (in JSON) that would be sent to Kafka:

Snippet from logfile

And when I run this exact transformation again, but with a different file (here: "SourceSystem") with different metadata, this comes out in the log:

Snippet 2 from logfile

As you can see, Pentaho PDI and metadata injection are immensely helpful for avoiding duplicate work and hardcoding. And as such, they are a clear plus for data quality: simpler flows and logic are easier to control and maintain.

It is a different way of constructing an ETL, but it is far more efficient than creating several flows and logics, e.g. one flow per CSV file and node type. There you end up doing the same things over and over again, with just a little difference in the file structure.

There is not so much Neo4j in here today, other than that I have prepared the data to be sent to Kafka. But the next part of this series will also use PDI (with metadata injection) to consume the data from Kafka and send it to Neo4j to construct a graph.

It will be a somewhat universal process that allows the Neo4j graph to be updated based on messages that arrive in Kafka. So a message that is sent to Kafka (from the console, a file or maybe a database) is immediately processed by PDI and updates the graph (database).

One important point in the design of a graph - the nodes and relationships - is the idea of sharing or not sharing nodes. It may be obvious, but I wanted to collect my thoughts here anyway: partly so that others can benefit, and partly to stimulate discussion about graph design.

It comes down to the question of the uniqueness of objects. E.g. you might have two objects that have the same name. One might be tempted to create one node, to which multiple relationships subsequently point. But the question is: are they the same object, or are they independent objects in their own right which just happen to have the same name?

For example: A company "Xtec" has an office in Australia and one in Germany. Both have a department that is named "Sales". And then the "Sales" department is connected to multiple customers. The main objects here are marked in bold.

So here are my objects/nodes in Neo4j: red for the company, blue for customers, green for the offices and yellow for departments. I created three dummy customers here.

Please bear with me - I oversimplify the case a little for the sake of the length and complexity of the post.

It is important to define and name the main objects. Whereas a relational database is a technical representation of the real world, the nodes and relationships in a graph represent the real world directly.

Here is the graph to it. I connected company and offices.

Next I connect offices and departments. So the graph looks now like this:

OK. This is not what I wanted. Both offices connect to the same department. If you then connect the customers to the sales department, how are you going to know which customer belongs to which department? Implicitly, we really have two departments: one at the office in Australia and one at the office in Germany. They are not the same objects - they just carry the same name.

Why did I show this? It is about the uniqueness of objects. Objects that are not the same must be distinguishable from each other (at least in this case). Typically you would have a unique identifier - e.g. a unique number. In the relational database world this is a best practice, and joins are made using these unique identifiers. It assures uniqueness but also helps performance, as joins on numeric values are more efficient than e.g. on strings. And in the graph world you also need a unique identifier to distinguish - like here - between two things that carry the same name but are actually different.

Reusing nodes in Neo4j happens quickly when you use "merge". Depending on which attributes you use, you either create a new node or reuse an existing one.
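A quick sketch of this behavior (label and property names are just examples):

```cypher
// both statements use the same pattern: the second MERGE finds the
// node created by the first and reuses it instead of creating another
MERGE (d1:Department {name: 'Sales'})
MERGE (d2:Department {name: 'Sales'})
// d1 and d2 now refer to one single node in the graph
```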

Ideally - in the source data - you have a unique identifier. So when you create the nodes, you specify the unique identifier as an attribute. Likewise, when you connect the node to others, a match is done on this identifier to retrieve the correct node and establish the relationship.

In other cases, a unique number might not be available, typically when you are using CSV files: the "interesting" data was exported to CSV, but not the unique IDs. So how can you ensure uniqueness? In this case we have the sales department in Australia and the sales department in Germany. When we create the department nodes, we can specify an additional attribute for the office. This makes the departments unique. It is really just the same in the real world: if you speak about sales and there are multiple offices, one would immediately ask "Where?".

Could you just simply create two separate nodes? Yes. But then if you query the graph, you will get both sales departments back, as they are not distinguished. This can cause problems - just as an example - when counting.

Here is the Cypher code to create the departments. I use a composite key to distinguish them:

Cypher
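The original statement is in the screenshot. A hedged reconstruction, with the label and relationship names (`Office`, `Department`, `HAS_DEPARTMENT`) as my assumptions:

```cypher
// the additional office property makes the two Sales departments distinct
MATCH (au:Office {name: 'Australia'})
MATCH (de:Office {name: 'Germany'})
MERGE (au)-[:HAS_DEPARTMENT]->(:Department {name: 'Sales', office: 'Australia'})
MERGE (de)-[:HAS_DEPARTMENT]->(:Department {name: 'Sales', office: 'Germany'})
```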

And this is what the graph looks like now:

We can now connect the customers to the relevant department by specifying which sales department is meant.

I hope you got the point I was trying to make here: you will find yourself asking what makes a certain node unique, as it influences your queries and also whether your design is logically correct. Otherwise you will question the results (e.g. the number of nodes).

And of course, in other cases you will want to share nodes. But this needs a deliberate decision when the graph is designed and the nodes are created. Depending on the use case, there might be multiple ways to design the graph.

Let me know your thoughts or how you approach this topic when designing a graph.

Neo4j is currently getting a lot of attention. I started using Neo4j a while ago and have many use cases in mind. And although I am very fluent with various Linux tools to prepare and transform data, that approach is quite cumbersome and time-consuming.

Pentaho PDI is the ideal tool to feed Neo4j: you can join data from different sources, clean it, transform and format it, and add data from lookups. The possibilities are nearly unlimited. You have the full power of an ETL tool, with an easy-to-use GUI, to transform the data according to your requirements.

I am also a long-time Pentaho PDI user, and as a Neo4j plugin is now available, I thought I'd give it a try. I will do a series of blog entries here to show how PDI and Neo4j can work together.

This first part shows how to load a CSV file in PDI and create nodes and relationships in Neo4j. Here the Neo4j Output step is used. I am not going into the details of how to work with PDI - there is a lot of documentation available online, and also videos.

To follow along, you will need Pentaho PDI installed. Once you have done that, go to https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output/releases/ and download the file Neo4JOutput-4.0.0-beta1.zip. Unzip the file into the PDI "plugins" folder. Now start PDI (spoon.sh or spoon.bat) and create a new transformation. You should now see the available Neo4j steps.

I use a simple CSV file with 7733 lines of data and 1 header row. The file contains data on airports with the following columns:

id - a unique running number of the airport

name - the name of the airport

description - the description of the airport

country - the name of the country

iata - three letter IATA code of the airport

icao - four letter ICAO code of the airport

location - latitude/longitude of the airport (separated by a comma)

elevation - elevation of the airport

The columns are separated by a semicolon.

Here is some sample data:

1;Goroka;Goroka;Papua New Guinea;GKA;AYGA;-6.08169,145.39188;5282
2;Madang;Madang;Papua New Guinea;MAG;AYMD;-5.20708,145.7887;20
3;Mount Hagen;Mount Hagen;Papua New Guinea;HGU;AYMH;-5.82679,144.29586;5388
4;Nadzab;Nadzab;Papua New Guinea;LAE;AYNZ;-6.56983,146.72624;239
5;Port Moresby Jacksons Intl;Port Moresby;Papua New Guinea;POM;AYPY;-9.44338,147.22005;146
6;Wewak Intl;Wewak;Papua New Guinea;WWK;AYWK;-3.58383,143.66919;19
7;Narsarsuaq;Narssarssuaq;Greenland;UAK;BGBW;61.16052,-45.42598;112
8;Nuuk;Godthaab;Greenland;GOH;BGGH;64.19092,-51.67806;283
9;Sondre Stromfjord;Sondrestrom;Greenland;SFJ;BGSF;67.01697,-50.68932;165
10;Thule Air Base;Thule;Greenland;THU;BGTL;76.5312,-68.70316;251

From this data I will create nodes with a label "Airport" and nodes with the label "Country". And finally a relationship between the two node types is created.

So here is the PDI transformation:

I use a "CSV file input" step to read the input data. The location column contains the latitude and longitude of the airport. I split these values into two fields using the "Split Fields" step. Next I add two constants: label_airport and label_country. They are used as the labels of the two node types "Airport" and "Country". The last step is the output to Neo4j. Here is what the dialog of the step looks like:

The first action is to define the connection to the Neo4j server. My Neo4j instance runs on localhost using the default ports: 7474 for the browser and 7687 for the Bolt protocol. Finally, I added the username and password required to access the database.

Once the connection is defined, we can start by defining which fields shall be used for the "Airport" node.

I have selected that indexes shall be created. They are defined based on the fields marked as "Primary". In this case, the unique "id" column from the CSV file is used.

I have also selected "Use CREATE instead of MERGE". Merge checks whether the node exists and, if so, updates it. Create skips this check and simply creates a new node. So create is faster, but if you want to upsert nodes (create them if they don't exist, otherwise update them), uncheck this option.

Further down, I selected the field "label_airport" to be used as the label. I defined these values in the "Add Constants" step.

The last step is to define the properties of the node. The quickest way is to click "Get Fields": PDI fills in all fields that are inputs to this step. Under "Property Name", the name of the property in Neo4j can be defined. I then removed all fields that should not be used, such as the labels. I also removed the "country" field, because a separate node will be created for countries. Note that the "id" and "elevation" fields are defined as integer, and "latitude" and "longitude" as float. This will later allow us to use Neo4j's spatial functions, e.g. to calculate the distance between two points.

We can now go ahead and output the data to Neo4j by running the transformation (save it first!). When I switch to my browser and display the airport nodes, I see the following: 7733 nodes have been created (here 25 are displayed).

If I select one airport node, I see the following details:

Back in PDI, we will now create the country nodes. Open the Neo4j Output step and uncheck "Use CREATE instead of MERGE". Previously we created the airport nodes, but now we want to 1) update the airport nodes that already exist and 2) create the additional country nodes. Then I selected the "To Node". This node is very simple: we select the value of the "label_country" field as the label and the "country" field as the only property. We change the property name to "name" and set it as primary, so an index will be created for the name property.

Click OK and then run the transformation again. Then switch to the browser.

Here some country nodes are displayed. Back in PDI, we will now do the last step and have the relationships between country and airport nodes created.

This is very easy: we add the value for the relationship between the nodes in the "Relationship value" text field. I added "locatedIn" to indicate that an airport is located in a country.
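What the output step does here is roughly equivalent to Cypher like the following, shown for the first sample row. This is only a sketch of the effect, not the exact statements the plugin generates:

```cypher
// per input row: upsert the airport and the country, then link them
MERGE (a:Airport {id: 1})
  SET a.name = 'Goroka', a.iata = 'GKA', a.icao = 'AYGA',
      a.latitude = -6.08169, a.longitude = 145.39188, a.elevation = 5282
MERGE (c:Country {name: 'Papua New Guinea'})
MERGE (a)-[:locatedIn]->(c)
```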

Close the dialog and re-run the transformation. In the browser we will now find the relationships. In the middle is the country, surrounded by the airports in this country, each connected with a "locatedIn" relationship.

So we now have the nodes and their relation and we can now go ahead and run some Cypher queries in the browser to get more out of the data. As we have the latitude/longitude values of each airport, we can e.g. calculate the distance between two points. Here is the query:
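The query from the screenshot could look roughly like this, assuming the "icao", "latitude" and "longitude" properties defined earlier (in newer Neo4j versions, `distance()` is called `point.distance()`):

```cypher
// distance between Hamburg (EDDH) and Zurich (LSZH) in kilometers
MATCH (h:Airport {icao: 'EDDH'}), (z:Airport {icao: 'LSZH'})
WITH point({latitude: h.latitude, longitude: h.longitude}) AS p1,
     point({latitude: z.latitude, longitude: z.longitude}) AS p2
RETURN round(distance(p1, p2) / 1000) AS distance_km
```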

So I match the airport Hamburg (ICAO code EDDH) and the airport Zurich (ICAO code LSZH) and define two points, p1 and p2, from the result. Then I calculate the distance (in meters), convert it to kilometers and round the result. Below you can see that the distance between the two points is 694 km.

This was the first part. There are some things well worth noting here:

- you can rerun the PDI transformation at any time to update Neo4j

- you can extend the PDI transformation, e.g. to add more data from other sources or to create more nodes and relationships

- I created the nodes and relationships step by step, but of course this can be done in one go

- there are other Neo4j steps in PDI - I will tackle those in other blog posts

Whenever you have data about relations between objects, Neo4j might be a good candidate to store this data, as it is very fast at answering complex questions about relations. It answers these questions in an easier and much more elegant way than relational databases. The more relations you have, or the more complex they are, the more likely it is that Neo4j will make life easier for you. And you have seen that Pentaho PDI with the Neo4j steps is easy to use and a powerful combination with Neo4j. Apart from being very flexible, versatile and scalable, PDI also allows you to use metadata injection - a powerful feature when you e.g. have lots of Excel sheets with different structures and you don't want to hardcode the structures into the transformation process.