Flight routes graph database with Neo4j

Data and cutdown versions

In this tutorial I import a list of flight routes obtained from openflights.org into a Neo4j graph database. I use three files available from this source:

airlines.dat

airports.dat

routes.dat

As the names suggest, these contain lists of airlines, airports and routes in CSV format. Every route is linked to an airline and has a source airport and a destination airport. I have placed these files in a data folder, and a simple line count with wc -l data/* gives the number of records in each:

6048 airlines.dat
8107 airports.dat
67663 routes.dat

To make testing quicker during development I have created a cutdown version of the data using a Python 2.7 script. I write a function that takes an input CSV file, an output file to write (actually overwrite), the column names, and a predicate, which is a lambda expression on a row that returns a boolean. If the predicate is True for a given row, the row is kept in the cutdown file. Finally, if column_to_return is passed in, a Python set is returned with all the values of this column in the rows that were kept.
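A minimal sketch of such a function is shown below. It is not the original script: it is written for Python 3 rather than Python 2.7 so that it runs on a current interpreter, and the name cutdown, the parameter names other than column_to_return, and the choice to return an empty set when column_to_return is omitted are all assumptions made for illustration.

```python
import csv


def cutdown(infile, outfile, fieldnames, predicate, column_to_return=None):
    """Copy the rows of infile that satisfy predicate into outfile.

    Returns the set of values found in column_to_return among the kept
    rows (an empty set when column_to_return is None; that is an assumption).
    """
    kept_values = set()
    with open(infile, newline="", encoding="utf-8") as src, \
            open(outfile, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, fieldnames=fieldnames)
        # extrasaction="ignore" drops any columns beyond those named in fieldnames
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        for row in reader:
            if predicate(row):
                writer.writerow(row)
                if column_to_return is not None:
                    kept_values.add(row[column_to_return])
    return kept_values
```

Opening the output file in "w" mode is what makes the function overwrite any earlier cutdown file.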

Next I filter all airports and airlines in the United Kingdom. Because the cutdown function returns a set with all the ids I keep, I can use these sets in the predicate for the routes. I keep only the routes for these airports and airlines.
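The chaining of the filters can be sketched on a few in-memory rows. The ids, the column names (airline_id, source_id, destination_id) and the rule that a route is kept only when its airline and both of its airports are in the kept sets are assumptions for illustration; the sample values are made up and are not taken from the data files.

```python
# Sets of ids as the airport and airline filtering steps would return them
# (made-up sample values).
uk_airports = {"10", "11"}
uk_airlines = {"200"}


def is_uk_route(row):
    """Predicate for routes: airline and both airports must have been kept."""
    return (row["airline_id"] in uk_airlines
            and row["source_id"] in uk_airports
            and row["destination_id"] in uk_airports)


# Hypothetical route rows, shaped like the dicts csv.DictReader yields.
routes = [
    {"airline_id": "200", "source_id": "10", "destination_id": "11"},  # kept
    {"airline_id": "200", "source_id": "10", "destination_id": "99"},  # foreign airport
    {"airline_id": "300", "source_id": "10", "destination_id": "11"},  # foreign airline
]

uk_routes = [row for row in routes if is_uk_route(row)]
print(len(uk_routes))
```

If you would rather keep routes that touch a UK airport at only one end, replace the relevant `and` with `or` in the predicate.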

Neo4j server installation

Populating the graph database

To extract the data from the CSV files and store it in Neo4j I wrote a make_graph.py script. I have used Python 2.7 and py2neo version 2.0.4. I will show the pieces one by one, and you can put them together in the order they are presented. Alternatively, the code for this post can be found in this GitHub repo.

I use py2neo to talk to Neo4j, and also define some functions in order to calculate the distance between two airports and place this information on the route node.

The logic for creating airports and airlines is similar, so I write a common function that reads the sourcefile, a CSV file with the columns given in fieldnames, and then creates nodes with the given label on the given graph. In our case the label will be either ‘Airline’ or ‘Airport’.

Notice that the function expects that there is an ‘id’ column, and it will return a dictionary of all the nodes created with their ‘id’ being the key. I keep the nodes, as I will later on need to connect them to routes.
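A rough sketch of this common function follows, written for Python 3 rather than the Python 2.7 used in the post. The function name create_nodes and the node_factory parameter are assumptions added here so the logic can be tried without a running server; by default the sketch falls back to py2neo's Node class and the graph's create method. How a created node gets bound to the server differs between py2neo versions, so treat the py2neo part as an outline rather than the original code.

```python
import csv


def create_nodes(graph, label, sourcefile, fieldnames, node_factory=None):
    """Create one node per CSV row and return them in a dict keyed by 'id'."""
    if node_factory is None:
        # Imported lazily so the function can be exercised without py2neo.
        from py2neo import Node
        node_factory = Node
    nodes = {}
    with open(sourcefile, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, fieldnames=fieldnames):
            # Drop the None key that DictReader uses for unnamed extra columns.
            properties = {k: v for k, v in row.items() if k is not None}
            node = node_factory(label, **properties)
            graph.create(node)
            nodes[row["id"]] = node
    return nodes
```

The returned dictionary is what later lets a route row look up its airport and airline nodes by id.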

For a given pair of airports there might be several routes connecting them, and for each route there should also be one in the opposite direction. I keep a dictionary of distances to save me from computing them again.

known_distances = {}

Given two (latitude, longitude) pairs I can calculate the distance between the two points on the map like this (code taken from this Stack Overflow answer):
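The snippet itself is not reproduced here. As a stand-in, the sketch below uses the haversine formula, a common way to compute great-circle distance; whether the original used exactly this formula is an assumption, as are the function name distance_km and the mean Earth radius of 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def distance_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))
```

Values read from the CSV files are strings, so convert them with float before passing them in.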

Next, I write a function that takes two airport nodes and checks whether their distance has already been computed and stored in the known_distances dictionary. The key of this dictionary is the pair of airport ids. If the distance is not found, I extract the latitude and longitude information and calculate it. Before returning the result I store it under both orderings of the airport ids.
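The caching step might look roughly like this. The function name airport_distance, the property names latitude and longitude, and the distance_fn parameter are assumptions; distance_fn stands for any two-point distance function, such as a haversine implementation, and plain dicts stand in for the airport nodes.

```python
known_distances = {}


def airport_distance(airport_a, airport_b, distance_fn):
    """Return the distance between two airports, computing it at most once per pair."""
    key = (airport_a["id"], airport_b["id"])
    if key in known_distances:
        return known_distances[key]
    point_a = (float(airport_a["latitude"]), float(airport_a["longitude"]))
    point_b = (float(airport_b["latitude"]), float(airport_b["longitude"]))
    distance = distance_fn(point_a, point_b)
    # Store under both orderings so the return route hits the cache too.
    known_distances[key] = distance
    known_distances[key[::-1]] = distance
    return distance
```

Storing both orderings doubles the size of the dictionary but avoids having to normalise the key on every lookup.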

Creating a route node is slightly more complicated than creating the airline and airport nodes. First, notice that I pass in the dictionaries with the airline and airport nodes. I connect the route to its source airport with a FROM relationship and to its destination airport with a TO relationship. If either of the two airport ids is missing, I ignore the route. I connect a route node to its airline via an OF relationship. If the airline id is missing, I still create the route node but do not connect it to an airline.
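The decision logic described here can be sketched without touching the database. The helper below is hypothetical: the name route_links and the column names source_id, destination_id and airline_id are assumptions. It only works out which relationships a route should get; in the real script each (type, target) pair would become a relationship from the new route node to the target node, for example with py2neo's Relationship(route, rel_type, target).

```python
def route_links(row, airports, airlines):
    """Return the (relationship type, target node) pairs for a route row.

    Returns None when either airport is unknown, meaning the route is skipped.
    """
    source = airports.get(row["source_id"])
    destination = airports.get(row["destination_id"])
    if source is None or destination is None:
        return None  # ignore routes with a missing airport
    links = [("FROM", source), ("TO", destination)]
    airline = airlines.get(row["airline_id"])
    if airline is not None:
        links.append(("OF", airline))  # the airline link is optional
    return links
```

For example, with airports = {"1": "ATH", "2": "NRT"} and an empty airlines dictionary, a row from "1" to "2" yields only the FROM and TO pairs.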

Make sure you change the neo4j_uri if you are running your server on a different machine or if you are not using the default port. Also change the paths to the .dat files if you have placed them somewhere else, or if you just want to use the cutdown versions to make a smaller graph.

Note that this might take a long time to run, as I am inserting nodes one by one. For more efficient ways of importing CSV files via Cypher have a look here.

Querying the graph

I will run Cypher queries in the Neo4j browser, which is available at http://localhost:7474/browser/. If the Neo4j server is running on a different machine, replace localhost with the hostname or IP address of that machine.

I want to fly from Athens, Greece to Narita Airport in Tokyo, Japan. First I check for direct flights:

Airports are purple nodes, routes are green, and airlines are red. I have customised the airport and airline nodes to show their names instead of their id. If you click on a node of a specific label, a window appears with its properties.

In this window there is a style tab with an “eye” symbol; it is the second tab. There you can customise the caption of that type of node, its size and its colour.

The start and end of the journey are the airports in the middle. There are pairs of routes that are linked to the two airports and the stopover airport. Stopover airports form a ring around the start and end airports. Each stopover airport is connected to two legs, and the two legs are connected to a single airline. Therefore, the airline that stops over at the specific airport is drawn close to the stopover airport in the graph.

This is just an example query. To learn more about the Cypher query language you can start here.