In previous posts I mentioned importing the Tate metadata into two databases – Neo4j and MongoDB. Why am I doing this?

Martin Fowler popularised the term polyglot persistence, by which he means choosing the right database model / engine for each particular data requirement, even within a single application. Take, for example, my Tate-to-Neo4j import program (written in Java). I needed to create a lookup store to cross-reference the graph node ids that had been generated when inserting artists, movements, subjects, etc., so that they could be reused during the import. I chose Redis, a lightning-fast key-value store. One application, two different data requirements, two databases – key-value and graph.

With regard to the social web app, it is entirely conceivable that I could handle all of the data storage in the Neo4j graph database. However, I don’t see that as a practical architecture – for performance, it is best to run Neo4j in embedded mode, which makes the database unavailable to other applications. So trying to use Neo4j as a general-purpose database isn’t ideal. It is ideal, of course, for all the highly-related data, e.g. friend-of-a-friend queries, exploring recommendations from user “appreciations”, personal galleries, and similarity between artworks and artists using shortest paths. Each node would contain just a label and an identifier, no additional properties (yes, Neo4j is a property-graph database, but that doesn’t mean I have to store lots of properties). Relationships, on the other hand, will often have properties – for example, a user has been a friend of another user since a given date and time, or a user appreciated an artwork, rating it 4 stars. But for the mundane stuff, like retrieving a complete artist or artwork metadata record, something else is likely to be better suited.
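As a quick sketch of what a relationship with properties might look like (the labels, relationship type and property names here are placeholders, not a final schema):

```cypher
// Placeholder schema -- an "appreciation" carrying its own properties.
MATCH (u:User {username: 'some_user'}), (w:Artwork {title: 'Oriana'})
CREATE (u)-[:APPRECIATED {rating: 4, at: timestamp()}]->(w)
```

The nodes stay lean; the rating and timestamp live on the relationship itself.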

In my previous post, I updated the instructions for importing Tate metadata into MongoDB on Linux, which was itself an update of an older post for Windows. For the aggregated data of the Tate JSON documents, MongoDB is ideal – the import is a direct one. For simple retrievals – displaying a record on a webpage, for example – this is efficient. Everything is retrieved in one fast get operation, with no joins to worry about. For the supplementary data, such as other recommended artworks, a service based on the Neo4j database can be used.

And the Tate themselves will use their own software and databases to manage their data – that’s not the focus of my project, which imports data from other sources.

This approach allows me to wrap the Neo4j-based social / recommendations application within a number of web-services, e.g. using REST. That the Neo4j database is embedded and inaccessible from other applications is then not such a concern – I just have to code the exposed services using, for example, Spring MVC.

What if I want direct access to the graph database for data analytics / visualisation? Shutting down the web-services application and taking a copy of the database for that purpose would be the way to go. The copy would be fast (a few hundred MB at most) and the web-services application could be restarted with little downtime.

I know many developers in the digital humanities will be familiar and comfortable with the relational data model. And that’s fine – an RDBMS will get the job done, but it does lead to difficulties when developing applications, such as the impedance mismatch between relations and objects. This is something NOSQL databases, like MongoDB and Neo4j, address, allowing for more rapid development by taking advantage of data-to-object mapping.

Just after the publication of the Tate metadata on GitHub, I wrote a post about how to import the JSON documents into MongoDB using a Windows command. Here I update that with how to do the same on Linux.

The command below assumes the collection has been cloned to /home/larkin/git/collection, but you should change this to wherever you cloned it.

This assumes a database named “tate” in your MongoDB instance and that artists will go into an “artists” collection. For the artworks, just rename artist(s) to artwork(s).
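The generating step would look something like the following sketch (it assumes GNU find, mongoimport on the PATH, and the paths and names given above):

```shell
# Sketch: emit one mongoimport command per artist JSON file into artist_import.sh.
# Assumes GNU find (for -printf); adjust COLLECTION to wherever you cloned the repo.
COLLECTION="${COLLECTION:-/home/larkin/git/collection}"
if [ -d "$COLLECTION/artists" ]; then
  find "$COLLECTION/artists" -name '*.json' \
    -printf 'mongoimport --db tate --collection artists --file "%p"\n' \
    > artist_import.sh
fi
```

One mongoimport invocation per file is exactly why the process is slow – mongoimport itself has no recursive mode.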

You can run the resulting script by issuing the following command:

sh artist_import.sh

It’s not a very efficient process, but will eventually get the job done in about half an hour, give or take. I’ll follow up with further configuration of these documents and collections in MongoDB to aid searching, etc.

The previous post detailed (without going into code) how I imported the Tate’s metadata into a Neo4j database, and that is still a work in progress. And the Tate metadata is only the starting point. I plan on adding many more node types as I develop an overall social media platform with elements of crowdsourcing and gamification. However, there is already some highly-connected data to play around with. While most of the queries I can run on this data using Neo4j’s Cypher language could also be done using SQL, I can still do some interesting discovery in a very visual way – a way which SQL doesn’t lend itself to.

Example 1: Shortest paths between the artists William Johnstone and John, Augustus, OM (not sure how to rearrange that particular name).

What’s returned in visual format by the Neo4j browser is the following graph (you really need to click on the image to expand it):

The code snippet allShortestPaths((a1)-[*]-(a2)) retrieves all paths with a length equal to that of the shortest path. In this case there are actually 570 such paths of length 4, involving 82 separate nodes and 267 relationships.
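The full query isn’t reproduced here, but it would have had roughly this shape (the label, property and name values are my assumptions):

```cypher
// Hypothetical reconstruction -- label and property names are assumptions.
MATCH (a1:Artist {name: 'William Johnstone'}),
      (a2:Artist {name: 'John, Augustus, OM'})
MATCH p = allShortestPaths((a1)-[*]-(a2))
RETURN p
```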

Visually, we can immediately see what links the two artists. That they might have something like paper, oil paint and canvas in common is not a surprise or particularly useful, but that they share hope, cloud, tree and hill might be interesting.

The Cypher query is pretty dynamic. I don’t need to specify the types of nodes (using their labels) in between the two artists. With SQL, I would need to specify all the tables I wanted to join in the query.

It would be entirely possible to do this one in SQL in a relational database using multiple joins. It would likely be slower to run, though, and would lack the visual element of the Cypher query. Just look at the MATCH criteria:

The (a) in the middle are the artists we want to match. We go in two directions from the artist: to the place of birth, and to the subjects via the artworks. SQL isn’t as semantically rich, lacking the arrows / connectors of Cypher.
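A sketch of that MATCH shape (the relationship types and labels here are placeholders for whatever the import actually created):

```cypher
// Placeholder schema -- two directions out from the artist (a):
// left to the birthplace, right to the subjects via the artworks.
MATCH (place:Place)<-[:BORN_IN]-(a:Artist)-[:CONTRIBUTES_TO]->(w:Artwork)-[:HAS_SUBJECT]->(s:Subject)
RETURN place, a, w, s
LIMIT 25
```

The arrows make the traversal directions visible in a way a list of JOIN clauses never could.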

Example 3: Some recommendations.

Though the social aspect of the platform has yet to be developed, and it is on this that much of the recommendation engine will be based, we can make some initial additions to the artist and artwork recommendations. For example, if we were viewing an artwork featuring a bridge (let’s go with “Oriana”), we may be interested in other artworks featuring a bridge.
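A first cut of that recommendation is just a shared-subject match – sketched here with assumed labels and relationship types:

```cypher
// Assumed schema: artworks linked to subjects via HAS_SUBJECT.
MATCH (current:Artwork {title: 'Oriana'})-[:HAS_SUBJECT]->(:Subject {name: 'bridge'})<-[:HAS_SUBJECT]-(rec:Artwork)
RETURN rec.title
```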

There are, however, 3,742 other artworks featuring a bridge, so you can see that the algorithm is a bit simplistic so far.

We could increase the connectedness of the current artwork and recommended artworks. For example, what about ones that share the same movements? We can extend the previous example (i.e. we are currently viewing the artwork Oriana which features a bridge).
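Sketching that extension (again with placeholder labels and types), we can rank the bridge artworks by how many movements they share with Oriana:

```cypher
// Placeholder schema -- shared subject plus a count of shared movements.
MATCH (current:Artwork {title: 'Oriana'})-[:HAS_SUBJECT]->(:Subject {name: 'bridge'})<-[:HAS_SUBJECT]-(rec:Artwork)
MATCH (current)-[:IN_MOVEMENT]->(m:Movement)<-[:IN_MOVEMENT]-(rec)
RETURN rec.title, count(m) AS sharedMovements
ORDER BY sharedMovements DESC
LIMIT 10
```

Requiring at least one shared movement immediately thins out those 3,742 candidates, and the count gives a crude relevance ordering.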

In the previous post about importing the Tate’s open metadata into a Neo4j graph database, I mentioned that I was using Python. I quickly discovered a major performance issue when using Python to batch import into Neo4j. The main issue is that to use Python, I had to run the Neo4j database as a server, requiring REST-based access. This introduced a huge overhead and it crawled along processing a handful of artworks per minute. With nearly 70,000 artworks, it might have taken days for the import to run. Rather than spend too much time investigating whether this approach could be optimised to the point it would be usable, I decided to go with the obvious choice of development language for Neo4j – Java.

Neo4j is a Java-based database. One doesn’t always associate the words Java and performance in the same sentence – isn’t that the realm of C and C++ (Redis, for example, is C-based)? But in Neo4j’s case, Java doesn’t seem to hold it back. No method of access to Neo4j is as fast as using the Java libraries directly. However, while Neo4j is lightning fast when retrieving nodes (graph traversal), there is a big overhead when writing data. Unlike a relational database, where the links between tables are logical ones using foreign keys and writing data is a relatively inexpensive operation, the links between nodes in Neo4j are physical ones requiring their own data structures to be stored, which generally results in higher write latency. On top of that, Neo4j is transactional, which adds yet more overhead when writing. For a batch process handling around 100,000 nodes and up to a million or more relationships, this is a non-runner.

The solution is Neo4j’s BatchInserter. This effectively bypasses Neo4j’s database management system and all its transaction handling and goes directly to the files on the disk. It is orders of magnitude faster when writing. However, because I was creating relationships between nodes, I needed a way to store the physical node ids for later lookup. The basic algorithm of my Tate batch importer is as follows:

For each artist read from its JSON file:
    Add the artist node
    For each movement within the artist:
        If the movement node was not already added, add the new movement node
        Link the movement node to the artist
    End For
    If the artist's place of birth was not already added, add the new place node
    Link the place of birth node to the artist
End For

For each artwork read from its JSON file:
    Add the artwork node
    Connect the artwork to the artist
    If the catalogue group was not already added, add the new catalogue group
    Link the artwork to the catalogue group
    For each movement within the artwork:
        If the movement node was not already added, add the new movement node
        Link the artwork to the movement
    End For
    For each subject within the artwork (having traversed from top level to second level to third level):
        If the subject node was not already added, add the new subject node (as a person node if a "named individual")
        Link the artwork to the subject
    End For
    If the classification was not already added, add the new classification node
    Link the artwork to the classification node
    Parse the medium into its elements based on separators (commas, spaces, "and", "on")
    For each medium element:
        If the medium node was not already added, add the new medium node
        Link the artwork to the medium
    End For
End For

The main issue with the algorithm is the frequent lookups to see if a node was already added – there are currently close to a million of them. Because I am using Neo4j’s BatchInserter, no facility to search for a node is available. What the BatchInserter does provide, though, is a physical node id for each added node. The solution is to store the ids from the JSON files (each artist, artwork, movement and subject has an id; others, such as medium and classification, just use the name as a key) along with the physical node ids generated by the BatchInserter in some kind of lookup store. The store I chose was Redis, a very fast key-value database. I built the latest version from source code and ran it on the local host. It’s a very simple storage mechanism – I just prefix each of my keys with something like “artist:” or “movement:” and store the key and the value in one giant lookup table. For example, using redis-cli to issue a command to look up the physical node for the artist with the id 1234:

127.0.0.1:6379> get "artist:1234"
"4235"

While it is possible to run my importer on a Windows-based machine (e.g. using Vagrant to host the Redis service within a headless Ubuntu instance), the best option is a Unix-based machine. For example, using the Vagrant solution on Windows [Core i3 laptop with 12GB RAM and an SSD drive], the process took about 23 minutes, which is more than tolerable. [Update: I installed Microsoft’s Windows-compiled version of Redis, which reduced the overall time to about 8 minutes.] However, the same process took about 3 minutes on a similarly specced Linux Mint PC [Core i3 with 8GB RAM]. [Note: on a 7-year-old Linux laptop with 4GB RAM it took 10 minutes.]

3 minutes, then, to read through about 73,000 JSON files, map them to Java objects (using Jackson), create 93,631 nodes and 719,766 relationships (edges) in Neo4j, and use Redis to store 24,766 key-value pairs with almost 1.2 million key lookups. Doesn’t sound like too many bottlenecks there!

When I write graphing, I am not referring to graphing of mathematical functions or the production of fancy bar charts, though there is a highly visual component to it. I mean the creation of a graph database. The graph data model is one of the more niche NOSQL data models and is suited to the querying of highly-related / highly-connected data. By creating a graph of cultural heritage metadata (for artworks, museum objects, etc), the hope is that either connections between ‘entities’ can be discovered, or they can be discovered more easily or more quickly than before.

While the most obvious relationship between entities in the Tate metadata is the Artist-ContributesTo->Artwork (in the Tate metadata, there can be multiple contributors to an artwork), many additional entities can be conceived and connected. Examples include medium (paper, graphite, watercolour, etc), movement (Young British Artists, Pre-Raphaelite Brotherhood, etc), subject (house, tree, man, etc) and others. Artists can be related to movements, artworks can be related to mediums (I use mediums rather than media to avoid confusion with, well, the media) and subjects.

The following is a subset of the graph showing how several artworks relate to mediums. As you can see, even this very small subset is very dense. While the result is highly visual, the purpose of the research isn’t all about visualisation.

In the following image, a subset of the subject hierarchy is shown.

In the Tate metadata, the subjects are a three-level hierarchy with the top level being a generic categorisation of the subject, the middle being more specific, then the third / bottom level being the one that specifically applies to the artwork. Initially, I will connect artworks only to the third-level / most specific subjects, but for performance purposes, I will look at connecting artworks to all three levels in the subject hierarchy. This would allow for more generic searches, e.g. search for artworks containing “animals: mammals” rather than having to specify “cow”, “sheep”, “horse”, etc.
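Once artworks are linked to every level, a category-level search becomes a single-hop match – sketched here with placeholder labels:

```cypher
// Placeholder schema: artwork linked directly to the mid-level category.
MATCH (a:Artwork)-[:HAS_SUBJECT]->(:Subject {name: 'animals: mammals'})
RETURN a.title
```

Without the extra links, the same search would have to traverse down through the hierarchy to every leaf subject.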

More specifics in the next post, such as my use of Python, the py2neo module for Python, the Neo4j graph database, and more.

I decided to investigate. I discovered that the dataset is stored on GitHub and is available in CSV and JSON formats. The JSON version is the most comprehensive.

This was perfect for my teaching requirements. This coming week I am covering the concept of the document-oriented data model. JSON is ideally suited to this and can be loaded into most document-oriented databases, such as the one I will be using by way of example, MongoDB.

I encountered a slight issue, however. The dataset is organised into subfolders, with each artist and artwork in its own individual JSON file. MongoDB’s import command only works with a single file at a time. So I had to devise a way to recursively drill down through the folder structure downloaded from GitHub.

I achieved this by searching for a Windows shell command (it would have been easier to find a Linux solution) that would recursively list the files in a folder. I then had to prepend the mongoimport command to each line to import the documents into a collection in a new database. I did this to create an artworks collection and an artists collection.

If using Windows, open the command prompt, change directory into the artworks folder and issue the following command (this assumes MongoDB is installed in C:\mongodb and that you want to use the database tate):
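A hedged reconstruction of that command (the mongoimport flags are standard; the rest is my guess at the original one-liner):

```bat
:: Hypothetical reconstruction -- recursively list the files, prepending the
:: mongoimport call to each line as it is written to out.txt.
:: At the command prompt use %f; inside a batch file, double it to %%f.
:: out.txt may end up listing itself; delete that line afterwards.
for /r %f in (*) do @echo C:\mongodb\bin\mongoimport --db tate --collection artworks --file "%f" >> out.txt
```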

[Note: afterwards, you need to delete the listing of the out.txt file from out.txt]

Repeat for the artists collection, changing the collection to artists and going into the artists folder in the collection downloaded from GitHub.

The easiest thing to do then is to copy the contents of the out.txt file, paste them into the command prompt, and watch it add thousands of documents – difficult, though, given the size of the artworks listing. Otherwise, develop it into a batch script. Linux is the better platform for that.