Cleaning CSV Files in Bash

Hi, friends, and welcome back to my summer road trip through the world of healthcare. For those who are new to my adventure, this is the third part of the blog series. Catch up right here with Part 1 and Part 2.

I am using Neo4j to connect the multiple stakeholders of healthcare and hope to gain some interesting insights into the healthcare industry by the end of my exploration. This blog series demonstrates the entire process from data modeling and ETL to exploratory analysis and more. In the previous two posts, I discussed data modeling and how to integrate XML data to Neo4j by using APOC, you can find every single detail about the project on Github.

This week, I will be working with CSV files. If you are using Neo4j for the first time (like me), I can tell you honestly that loading CSV files to Neo4j is a lot easier than loading XML files. But don’t get too optimistic about it unless your data is perfectly clean. Now, let me show you the steps I used to successfully load the CSV files.

1. Get the Data

This week, our data covers information on drugs, drug manufacturers, providers and prescriptions. You can download the same data from these sources:

As the healthcare provider data gave me the most problems, I will use this data as a demonstration in this blog post.

2. Display the Data:

A. What Does the Data Look Like?

head npidata_20050523-20160612.csv

Wow, the data looks a little bit crazy, and because of that, I will not overwhelm you by copying the result here. However, I learned three characteristics about the data by displaying the first 10 rows:

The data has a header, and the header contains white space.

The data has many columns (we will find out how many soon).

The data has a lot of empty values.

B. How Many Rows Are in the Data?

wc -l npidata_20050523-20160612.csv

Results:

4923673 npidata_20050523-20160612.csv

Each row of the data represents a registered provider in the United States from 2005 to 2016.

C. How Many Columns Are in the Data?

head -n 1 npidata_20050523-20160612.csv|awk -F',' '{print NF}'

Results:

329

Now you see my point where I said the data is a little bit crazy. But don’t panic — most of these columns do not contain values, and we only need to extract a few columns to load them to my healthcare graph.

D. Remove the Header From the Data

sed 1d npidata_20050523-20160612.csv > provider.csv

This will delete the first line and save the content to a new file named provider.csv. The original file will not be changed.

The
FOREACH statement is used to mutate each element in a collection. Here I use
CASE WHEN to group the data into two collections of rows: the rows with
col[1] = 1 and the rows with
col[1] = 2. For each row in the
col[1]=1 group, I use the
FOREACH statement to set the
firstName,
lastName,
credential and
gender properties, and for each row in the
col[1]=2 group, I set the property
orgName.

C. Fix the Fields Containing Delimiters

Running the Cypher query above returns an error:

At /Users/yaqi/Documents/Neo4j/test_0802/import/provider.csv:113696 - there's a field starting with a quote
and whereas it ends that quote there seems to be characters in that field after that ending quote.
That isn't supported. This is what I read: 'PRESIDENT","9'

By using periodic commit, you can set up a number of transactions to be committed. It helps to prevent from using large amounts of memory when loading large CSV files.

4. Conclusion

Now, I have successfully loaded the healthcare provider data into Neo4j. The processes of loading drug, drug manufacturer, and prescription data are very similar. I also created the relationship
WRITES for the nodes
:Provider and
:Prescription based on the NPI information contained in both files. By now, all the data is stored in the
graph database.

Let’s take a look at the healthcare graph data model again:

I hope you find this blog post helpful. Sometimes cleaning large CSV files can be tricky, but using the command line to manipulate the files can make the work go faster. In the next blog post, I will show you how to link data when you have limited resources. Specifically, I will demonstrate how I created the relationship (:Prescription)-[:PRESCRIBE]->(:Drug) and (:Drug Firm)-[BRANDS]->(:Drug). Stay tuned, and I’ll see you soon!