In the post we talked about the features provided by Neo4j we really liked and found to be a really good fit for our project, such as:

The Neo4j Browser UI, which is very neat and clear;

The way in which biological data could be represented as a graph structure in an intuitive way that is easy to browse;

The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”) — which is supported by the multi-labels feature. In the current InterMine PostgreSQL database, Gene, BioEntity and Sequence feature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.

We imported FlyMine data that is the subset involved in the queries used for benchmarking; we created 3.7 million nodes.

For the overlapping queries, we use a “view”, a sort of temporary table. For this test we only included genes (~ 600,000) and not all sequence features in FlyMine.

We created indexes only on properties relevant to the queries we run for the comparison. Unfortunately we couldn’t create either indexes using functions ( e.g. lower(gene.name) ) or composite indexes as this is not possible using the Cypher query language.

Method

Neo4j provides different tools and languages to retrieve the data stored. We used the Neo4j’s REST API endpoint allowing querying with Cypher, the Neo4j’s query language.

We used some curl options to check how long queries took. The execution time has been calculated as time_starttransfer – time_pretransfer.

For PostgreSQL, we’ve used psql and turned on the timing.

In some cases, we have not been able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, to retrieve the GO terms applied to orthologue genes, more than one SQL query is executed versus one only Cypher query executed in Neo4j.

In these cases, we wrote Neo4j server REST extensions using Neo4j Java APIs to implement the queries. We compared them with the InterMine web services. We clearly know that it’s not a fair comparison: the Neo4j server extension has been implemented to execute only a specific query where InterMine Web service (WS) is able to run any query, but we wanted to experiment and see how far apart Neo4j and Postgres are in term of performance. For Neo4J, we’d also eventually need to add a Java layer to manage dynamic models and queries. This will necessarily slow down the query execution time.

Scripts and server REST extensions wrote for benchmarking are in github.

Results

All genes

Show all genes.

psql (SQL)

Neo4j endpoint (Cypher)

Notes

1200 ms

5 ms

Return all properties

1400 ms

1400 ms

Return all properties order by primary identifier

360 ms

12 ms

Return primary identifier and symbol

85 ms

5 ms

Return genes count

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

Representative example of the gene query – the real one has thousands of results!

psql (SQL)

Neo4j endpoint (Cypher)

Notes

80 ms

4 ms

Return all properties

110 ms

84 ms

Return all properties order by primary identifier

20 ms

10 ms

Return primary identifier and symbol

GOterm -> Gene

Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.

psql (SQL)

Neo4j endpoint (Cypher)

InterMine Web services

Notes

15 ms

16 ms

37 ms

protein binding

28 ms

15 ms

38 ms

cellular_component

4.7 ms

6 ms

29 ms

nucleoplasm

Gene -> Orthologue + Go term

Show GO terms applied to orthologues of a specific gene.

We can not compare the complete queries exactly, but we can compare a simplified version of this. The table below shows the execution time to retrieve all the orthologues (and the organism which the orthologues belong to) of the gene with symbol “tws” but not the GO terms.

psql (SQL)

Neo4j endpoint (Cypher)

Notes

2 ms

3 ms

No JOIN with organism

3 ms

4 ms

JOIN with organism

To obtain the GO terms associated with the orthologues, we’ve run the Cypher query, using the Neo4j endpoint, and the server REST extension, implemented using Neo4j Java APIs and compared with the InterMine WS.

Neo4j endpoint (Cypher)

Server extension (Java API)

Intermine Web services

11.3 ms

12 ms

35 ms

As we said before, we have to keep in mind that InterMine WS accepts any query and the comparison is not the most appropriate.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.

Created 32405 OVERLAPS relationships (only for Gene) to replace the view in the current database. Using OVERLAPS relations is faster than doing calculations on the the query.

The table below shows the execution time using the constraint lookup=CG11566.

Neo4j endpoint (Cypher)

Server extension (Java API)

Intermine WS

3.5 ms

3.5 ms

30 ms

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusion based on hard evidence; having said this, what we have seen is quite encouraging as Neo4j has performed well enough with real InterMine data and typical queries to warrant further and more thorough investigations.

After the excitement around BiVi, I’d be remiss if I didn’t discuss all the work put into both a BioJS presentation at BiVi, and BioJS Workshop in the afternoon after BiVi.

We’re already avid BioJS fans at InterMine, because BioJS provides easy plug-in visualisations (for example, cytoscape). I’d expected that a Venn diagram depicting the BioJS crowd would intersect almost perfectly with the BiVi crowd, so I was surprised to find that they were actually completely separate groups.

The difference was explained to me by Manny Corpas as follows: While BioJS, given the (mostly) browser nature of Javascript, is indeed about visualisations – not all of it is dedicated to visual things. BioJS modules can be related to data parsing, for example.

On the other side of things, BiVi is about visualisation – no matter the language. Indeed, quite a few of the demos we saw at BiVi were desktop or server based, and unrelated to Javascript at all.

The workshop covered the basics of Javascript development, and shown how to include/interact with BioJS components on a webpage – but the most interesting sessions for me (as someone who makes a living out of writing Javascript, among other things) was definitely the session at the end where we were talked through creating our very own BioJS component.

Dennis Schwartz bravely live-coded a pie chart using d3.js on a projector – not an easy task! We started by setting up the scaffolding of the project using the BioJS Slush generator. This created examples, set up a build process, and ensured the BioJS pre-requisites were present, like licence and tags (which allows the biojs registry sniffer to pick up biojs packages from the npm registry). Despite only having an hour or so to get it all done, by the end we had each coded a functioning basic component.

The workshop finished off nicely with group pizza to feed hungry biojshackers. Unfortunately I was unable to attend the hackathon the next day, but if its quality was anything like the workshop, I’m sure it must have been a fabulous success.