Jump Start with the Graph Database – Neo4j

What is Neo4j

Neo4j is an open-source graph database implemented in Java by Neo4j, Inc. It is accessible from software written in other languages using the Cypher Query Language (CQL), either through a transactional HTTP endpoint or through the binary ‘bolt’ protocol. The developers describe Neo4j as an ACID-compliant (Atomicity, Consistency, Isolation, Durability) transactional database with native graph storage and processing. At the time of writing, Neo4j is one of the most popular graph databases.

It is dual-licensed: GPLv3 and AGPLv3 / commercial. Neo4j comes in 3 editions:

Community: free, but limited to running on a single node because it lacks clustering; it also has no hot backups.

Enterprise: requires buying a license unless the application built on top of it is open-sourced; it does not have the limitations of the Community Edition and allows clustering, hot backups, and monitoring.

Government: extends the Enterprise Edition with additional government-specific services, including FISMA-related certification and accreditation support.

Components of Neo4j

Neo4j stores the data in the form of either an edge, a node, or an attribute. Each node and edge can have any number of attributes. Both the nodes and edges can be labelled. Labels can be used to narrow searches.

Nodes: The entities in the graph. A node is comparable to a record in an RDBMS, and its label (e.g. Asset, Customer) plays a role similar to a table name.

Properties: Key-value pairs that can be attached to both nodes and relationships; they hold the actual data. E.g. the node ‘Asset’ can have properties like ‘Height’, ‘Manufacturing Date’ etc.
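As a minimal sketch of how these components fit together, the Cypher statement below creates two labelled nodes with properties and connects them with a relationship that carries its own property. The ‘OWNS’ relationship type and all property names and values are hypothetical and used only for illustration.

```cypher
// Two labelled nodes, each with properties (names and values are made up)
CREATE (c:Customer {Name: 'Acme Corp'})
CREATE (a:Asset {Height: 12.5, Manufacturing_Date: '2017-06-01'})
// A relationship (edge) between them, which can hold properties too
CREATE (c)-[:OWNS {Since: 2018}]->(a)
RETURN c, a
```

The date is stored as a plain string here because this is only a sketch; how dates are best represented depends on the Neo4j version in use.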

Use Case: Employee Skills DB

We wanted to build a database in Neo4j to understand how people in the organization are connected through their personal skillsets, certifications, domain expertise, clients and projects. We also wanted a simple interface that would help HR, the leadership team and project managers allocate the right resources based on the technical requirements of a project. It would let us look up the skills of every person in the organization, search for the right employee based on a list of skillsets, find connections (shortest path) between two persons, etc.
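As a rough preview of the ‘shortest path’ requirement, Cypher's built-in shortestPath function can find the shortest chain of relationships between two people. This is only a sketch: it assumes Person nodes identified by an ‘Alias’ property (as in the data model described later), and the two alias values are hypothetical.

```cypher
// Find the shortest connection between two persons across any
// relationship type, following at most 6 hops in either direction.
MATCH (a:Person {Alias: 'alice'}), (b:Person {Alias: 'bob'})
MATCH p = shortestPath((a)-[*..6]-(b))
RETURN p
```

The hop limit of 6 is an arbitrary choice to keep the search bounded; it can be raised or lowered to suit the size of the graph.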

For hosting Neo4j, we launched a new EC2 instance in a VPC on AWS and installed Neo4j Community Edition v3.2.6 on that server. Once Neo4j was installed, we created a new database instance and built the Neo4j nodes and relationships. The Neo4j interface is accessed through the URL http://localhost:7474 or http://127.0.0.1:7474 with the default or configured login credentials.

Picture 1 – the AWS EC2 instance we launched for hosting Neo4j

Picture 2 – Launching the Neo4j custom database

Creating the Data Model

For our use case, we needed to create nodes for the master data and relationships to connect them. The table below describes the data model.

| SL # | Name | Type | Description | Field Names |
|---|---|---|---|---|
| 1 | Person | Node | Master list of all people | Alias, First_Name, Last_Name, Email, Designation, Joining_Date |
| 2 | Skill | Node | Master list of all skills and skillsets | Skill_Alias, Skill_Detail, Parent_Skill_Alias |
| 3 | Certification | Node | Master list of all industry certifications | Certification_Name, Certifying_Company |
| 4 | Industry | Node | Master list of all industries | Industry_Name, Industry_Description |
| 5 | Domain | Node | Master list of all business functions | Domain_Name, Domain_Description |
| 6 | Reports_To | Relationship | Reporting hierarchy of the organization, team or project | Source_Person_Alias, Target_Person_Alias, Role |
| 7 | Knows | Relationship | Skillset expertise list for people | Alias, Skill_Alias, Knowledge_Level |
| 8 | Certified_As | Relationship | Certification list for people | Alias, Certification_Name |
| 9 | Worked_For | Relationship | People and Industry relationship | Alias, Industry_Name |

Table 1 – Data Details

We used the commands below to create the data model. First, we cleared all existing nodes and relationships, if any.

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r
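The same cleanup can also be written more compactly with DETACH DELETE, which removes each node together with any relationships attached to it. This is an alternative form, not the command used in the original exercise.

```cypher
// Delete every node and, with it, every relationship attached to it
MATCH (n) DETACH DELETE n
```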

Next, the nodes are created. Data can be loaded through individual CREATE commands as well, but we chose to load it from CSV files since that is faster and easier to manage in the future. To import CSV files located on the local computer (the EC2 machine in our case), the files need to be placed inside the ‘import’ folder at the database location (mentioned in Picture 2 above). If the ‘import’ folder is not present, it needs to be created.

We created an index on First_Name for fast searching on the Person node.
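A minimal sketch of such a load for the Person node is shown below. The file name ‘persons.csv’ is hypothetical, and the sketch assumes the CSV has a header row whose column names match the field names in Table 1. The index statement uses the syntax of Neo4j 3.x; later versions use a different CREATE INDEX form. Each statement is run separately.

```cypher
// Load Person nodes from a CSV file in the 'import' folder
// (hypothetical file name; headers assumed to match Table 1)
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CREATE (:Person {
  Alias: row.Alias,
  First_Name: row.First_Name,
  Last_Name: row.Last_Name,
  Email: row.Email,
  Designation: row.Designation,
  Joining_Date: row.Joining_Date
});

// Index on First_Name for faster lookups (Neo4j 3.x syntax)
CREATE INDEX ON :Person(First_Name);
```

The other master nodes (Skill, Certification, Industry, Domain) could be loaded the same way, each from its own file. LOAD CSV reads every value as a string, so numeric or date fields would need explicit conversion if required.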

Since there would be many records in the relationships, we decided to load those from CSV files as well for easier management. ‘org_structure.csv’ contains the relationships between persons, expressed through the ‘Alias’ field of the Person node.
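A relationship load of this kind might look like the sketch below. It assumes that ‘org_structure.csv’ has a header row with the columns listed for Reports_To in Table 1 (Source_Person_Alias, Target_Person_Alias, Role) and that the Person nodes have already been loaded.

```cypher
// For each CSV row, look up both persons by alias and connect them
LOAD CSV WITH HEADERS FROM 'file:///org_structure.csv' AS row
MATCH (src:Person {Alias: row.Source_Person_Alias})
MATCH (tgt:Person {Alias: row.Target_Person_Alias})
CREATE (src)-[:REPORTS_TO {Role: row.Role}]->(tgt);
```

Rows whose aliases do not match an existing Person node are silently skipped by the MATCH clauses, so comparing the number of created relationships with the number of CSV rows is a useful sanity check.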

Once all these commands have run successfully, we can validate the data with sample commands like the ones below. They show the graph output limited to the first 25 records of the given node or relationship.

MATCH (n:<NODE NAME>) RETURN n LIMIT 25

MATCH p=()-[r:<RELATIONSHIP NAME>]->() RETURN p LIMIT 25
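Beyond eyeballing 25 records, a quick count per label and per relationship type can help confirm that every CSV was loaded completely. The two statements below are a sketch of that check, run separately; the column aliases are arbitrary.

```cypher
// Number of nodes per label combination
MATCH (n) RETURN labels(n) AS label, count(*) AS nodes;

// Number of relationships per relationship type
MATCH ()-[r]->() RETURN type(r) AS relationship, count(*) AS total;
```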

Query to Know Employee / Team Hierarchy

To see the graph for the team hierarchy, we can either click on the ‘REPORTS_TO’ relationship in the Neo4j interface or run the command below.

MATCH p=()-[r:REPORTS_TO]->() RETURN p

Query to Know All Employees’ Skillset

We wanted to know all the skills of the employees. The command below shows the graph output with all the skills for all people from the ‘KNOWS’ relationship. It shows both the person alias and their associated skills.

MATCH p=()-[r:KNOWS]->() RETURN p

Query to Know All Employees’ Certifications

We wanted to know all the certifications of the employees. The command below shows the graph output with all the certifications for all people from the ‘CERTIFIED_AS’ relationship. It shows both the person alias and their associated certifications.

MATCH p=()-[r:CERTIFIED_AS]->() RETURN p

Query to Find Who Has a Particular Skillset

We wanted to know the list of people who have ‘AWS’ as a skillset for a particular project requirement. The query for this shows the graph output with all the people who have ‘AWS’ through the ‘KNOWS’ relationship, along with the person names and all their skills. If we want to show only the persons, we need to use RETURN a.
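One way such a query could be written is sketched below. It assumes the skill is stored on a Skill node whose ‘Skill_Alias’ property is ‘AWS’, and it uses the variable a for the person so that the RETURN a variant applies. The two statements are run separately.

```cypher
// People who know AWS, shown together with all of their skills
MATCH (a:Person)-[:KNOWS]->(:Skill {Skill_Alias: 'AWS'})
MATCH p = (a)-[:KNOWS]->(:Skill)
RETURN p;

// Variant: only the person nodes who know AWS
MATCH (a:Person)-[:KNOWS]->(:Skill {Skill_Alias: 'AWS'})
RETURN a;
```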

Conclusion

Neo4j is highly efficient in managing data with many interconnecting relationships. Its data model doesn’t usually require a predefined schema: unlike with a traditional DBMS, we don’t need to create the database structure before loading the data. Neo4j is a “schema-optional” DBMS, where the data is the structure.

We wanted to create a basic employee database and quickly query it through Neo4j to understand the functionality of this popular graph database. The whole exercise took less than 2 hours. If you want to build a complete end-to-end application, Neo4j provides a lot of functionality to support more advanced requirements.

Neo4j is extremely well suited for social networking applications like Facebook, Twitter, etc. But there are many other areas where Neo4j excels. Here are some of the areas that Neo4j can be used for: