Importing CSV Data into Neo4j


Goals

This article shows how to import CSV data into Neo4j and offers solutions to potential issues that might arise during this process.


CSV Data Quality

Real World Data Considerations

Real-world data is messy.
Don’t assume that what someone told you is in a CSV file is actually in there; don’t rely on format descriptions, consistency, or correct quoting.
Only trust data validity that you have checked yourself.

Common Pitfalls

BOM byte order mark (3 bytes, EF BB BF, in UTF-8) at the beginning of a file ← remove it

Binary zeros or other non-text characters dispersed throughout the file ← remove them

Inconsistent line breaks – mixed Windows and Unix line breaks ← make them consistent, ideally Unix-style

Stray quotes – standalone double or single quotes in the middle of non-quoted text, or non-escaped quotes in quoted text ← escape or remove stray quotes

Tools

There are a number of tools that help you check and validate your CSV data files.

The most basic are hexdump and the hex mode of editors like vi, emacs, UltraEdit, and Notepad++.

But there are other, more user-friendly tools available.

CSVKit

CSVKit is a set of Python tools that provide statistics (csvstat), search (csvgrep), and more for your CSV files.
csvstat in particular is very helpful for getting an overview of and statistics about your file; if you know your domain and data, you can immediately spot inaccuracies.
Field length is important, as excess field length indicates stray quotes.
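
For example, getting per-column statistics for a file is a one-liner (the file name is a placeholder):

csvstat orders.csv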

CSVLint

CSVLint is an online service to validate CSV files.
You can upload the file or provide a URL to load it from, and there is an option to provide additional schema information.

Papa Parse

Papa Parse is a comprehensive JavaScript library for CSV parsing that allows you to stream CSV data and files, even in the browser, and provides good, human-readable error reporting on issues.
In the Demo area you can test your files yourself, or use the library directly.
There is also a fork called BabyParse that runs on Node.js.

Cypher

What Cypher sees is what will be imported, so you can use that to your advantage.
You can use LOAD CSV without creating graph structure to just output samples, counts, or distributions.
This also makes it possible to detect incorrect header column counts, delimiters, quotes, escapes, or misspellings of header names.

// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);

// check the first few raw lines
LOAD CSV FROM "file-url" AS line
RETURN line
LIMIT 5;

// check the first 5 lines with header mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line
RETURN line
LIMIT 5;

LOAD CSV for Medium Sized Datasets

The real secret of LOAD CSV: it is not just a basic data-ingestion mechanism, but a full ETL power tool. Why?

It combines multiple aspects in a single operation, as the sketch after this list shows:

loading/ingesting of CSV data from a URI

direct mapping of input data into a complex graph/domain structure

data conversion

support for complex computations

creation or merging of data, relationships, and structure
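
To make this concrete, here is a minimal sketch of a single statement that loads a file, converts values, and creates or merges nodes and relationships; the persons.csv file and its id, name, age, and city columns are hypothetical:

LOAD CSV WITH HEADERS FROM "file:///persons.csv" AS row
MERGE (p:Person {id: row.id})          // create or merge nodes
SET p.name = row.name,
    p.age = toInt(row.age)             // data conversion: string to integer
MERGE (c:City {name: row.city})
MERGE (p)-[:LIVES_IN]->(c);            // map rows directly into graph structure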

Important Tips for LOAD CSV

Always use the latest version of Neo4j; it will most probably be faster than earlier ones.

Data Quality and Conversion

See the data quality section above

Empty fields have to be skipped or replaced with default values during LOAD CSV

All data from the CSV file is read as a string; you have to use toInt, toFloat, split, or similar functions to convert it

Split arrays in a cell by delimiter using split (combine with extract for conversions)
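
A minimal sketch combining these points (the products.csv file and its columns are hypothetical):

LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
WITH row WHERE row.id IS NOT NULL                             // skip rows with an empty id
MERGE (p:Product {id: row.id})
SET p.name = coalesce(row.name, "n/a"),                       // default value for an empty field
    p.price = toFloat(row.price),                             // convert string to float
    p.sizes = extract(s IN split(row.sizes, ";") | toInt(s)); // split an array cell and convert each entry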

Indexing and Performance

Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on

Always MATCH and MERGE on a single label and the indexed primary-key property

Prefix your load statements with USING PERIODIC COMMIT 10000

If possible, separate node creation from relationship creation into different statements.
If your import is slow or runs into memory issues, see Mark’s blog post on Eager loading.
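
Following these tips, an import might be structured like this sketch (labels, properties, and file names are placeholders):

// the unique constraint also creates an index on :Person(id)
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

// nodes first …
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///persons.csv" AS row
MERGE (p:Person {id: row.id});

// … then relationships, matching on the single indexed property
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///friendships.csv" AS row
MATCH (a:Person {id: row.from})
MATCH (b:Person {id: row.to})
MERGE (a)-[:KNOWS]->(b);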

Memory Config

Make sure to set the heap size to a sufficient value.
You do this by setting dbms.memory.heap.initial_size and dbms.memory.heap.max_size to at least 4G in neo4j.conf.

Ensure that the page cache is sufficient.
You do this by setting dbms.memory.pagecache.size in neo4j.conf.
Ideally, the page cache should be large enough to keep the whole database in memory.
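
In neo4j.conf this could look as follows (the page-cache value is only an example; size it to your actual store):

dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=8G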

File-URLs

Make sure to use the right URLs for files: file:///data.csv or https://host/path/data.csv.
File URLs are relative to the import directory (for the location of the import directory, refer to the Neo4j Operations Manual).

Use bin/cypher-shell instead of the browser for better control.
By default it connects to the database running on localhost, but you can point it to a database anywhere over the network.
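
For example (address and credentials are placeholders):

bin/cypher-shell -a bolt://localhost:7687 -u neo4j -p secret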

Step by Step Example for LOAD CSV

In our guide on ETL import from a relational database, we explain how to import CSV data step by step, from data modeling and creating indexes to writing the individual LOAD CSV statements.

Webinar “LOAD CSV in the Real World”

In this very hands-on webinar, Nicole White shows how to use LOAD CSV to import a real-world dataset (consumer complaints from consumerfinance.gov) into Neo4j.

After a quick modeling discussion, she walks through the steps of preparing indexes and constraints and then imports one part of the dataset at a time into Neo4j.

Super Fast Batch Importer for Huge Datasets

For the initial bulk import of huge datasets, Neo4j also ships with the neo4j-import command-line tool, which is much faster than LOAD CSV.
If you call the neo4j-import script without parameters, it will list a comprehensive help page.
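
Putting the parameters described below together, an invocation for the retail example might look like this sketch (the exact file names and the :ORDERED relationship type are illustrative):

bin/neo4j-import --into retail.db --id-type string \
  --nodes:Customer customers.csv --nodes products.csv \
  --nodes orders_header.csv,orders1.csv,orders2.csv \
  --relationships:CONTAINS order_details.csv \
  --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv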

The --into retail.db parameter is obviously the target database directory, which must not contain an existing database.

The repeated --nodes and --relationships parameters each take a group of multiple (potentially split) CSV files for the same entity, i.e. files with the same column structure.

All files in a group are treated as if they could just be concatenated into a single large file.
A header row in the first file of the group is required; it may even be kept in its own single-line file, which is easier to handle and edit than a multi-gigabyte text file.
Compressed files are supported too.

The customers.csv file is imported directly as nodes with the :Customer label, and the properties are taken directly from the file.

The same goes for the products, where the node labels are taken from the :LABEL column.

The order nodes are taken from 3 files: one header file and two content files.

Line-item relationships typed :CONTAINS are created from order_details.csv, relating orders to the contained products via their IDs.

Orders are connected to customers by using the orders CSV files again, but this time with a different header, which :IGNOREs the non-relevant columns.

The --id-type string parameter indicates that all :ID columns contain alphanumeric values (there is an optimization for numeric-only IDs).

The column names are used as property names for your nodes and relationships; there is some extra markup for specific columns:

name:ID – global id column by which the node is looked up for later reconnecting

if the property name is left off, the value will not be stored as a property (it is temporary); this is what --id-type refers to

if you have repeated ids across entities, you have to provide the entity (id-group) in parentheses, like :ID(Order)

if your ids are globally unique, you can leave that off

:LABEL – label column for nodes; multiple labels can be separated by a delimiter
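
As a sketch, header lines using this markup might look as follows; the file and column names are illustrative, :START_ID and :END_ID mark a relationship’s endpoints, amount:int converts that field to an integer, and the relationship type here would come from --relationships:CONTAINS on the command line.

A node header for customers.csv:

customerId:ID(Customer),name,:LABEL

A relationship header for order_details.csv:

:START_ID(Order),amount:int,:END_ID(Product)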