Share this:

If you are working with the Apache TinkerPopTM framework for graph computing, you might want to produce, edit, and save graphs, or parts of graphs, outside the graph database. To accomplish this, you might want a standardized format for a graph representation that is both machine- and human-readable. You might want features for easily moving between that format and the graph database itself. You might want to consider using GraphSON.

GraphSON is a JSON-based representation for graphs. It is especially useful to store graphs that are going to be used with TinkerPopTM systems, because Gremlin (the query language for TinkerPopTM graphs) has a GraphSON Reader/Writer that can be used for bulk upload and download in the Gremlin console. Gremlin also has a Reader/Writer for GraphML (XML-based) and Gryo (Kryo-based).

Unfortunately, I could not find any sort of standardized documentation for GraphSON, so I decided to compile a summary of my research into a single document that would help answer all the questions I had when I started working with it.

When to Use GraphSON

The advantages of GraphSON are that it is the most human-readable option of the three supported Gremlin I/O formats (i.e. GraphSON, GraphML and Gryo), JSON is widely used, and there is some support for it from graph-related applications outside of TinkerPopTM. The disadvantage is that it is verbose and has some redundancy, and is therefore not very memory efficient, so it may not be the best choice when storage is a limiting factor.

The TinkerPopTM documentation advises that it is generally best used in two cases:

A text format of the graph or its elements is desired (e.g. debugging, usage in source control, etc.)

The graph or its elements need to be consumed by code that is not JVM-based (e.g. JavaScript, Python, .NET, etc.)

Formatting GraphSON

The documentation of the Gremlin GraphSON Reader/Writer mentions nothing about the requirements for formatting GraphSON in a way that is readable to it; the resources outside of TinkerPopTM that reference GraphSON (usually in connection to a specific DB, like Titan) have examples, but they are sometimes incompatible with the TinkerPopTM/Gremlin Console built-in reader. The only example of GraphSON in the documentation is a single vertex, not a graph, but the download of the Gremlin Console includes four examples of GraphSON files (in fact, these four graphs are described in all of the different file formats for which Gremlin has a Reader/Writer). From these four examples, I have deduced the following rules and conventions, which I have been able to apply successfully to making my own graphs that can be parsed by the Reader:

Vertex Rules and Conventions

The file consists of a list of vertices, where each vertex is a dictionary that maps property names to property values, or sub-dictionaries with property values. Each line of the file contains exactly one vertex. The vertices are listed in order of unique ID, which, by convention, starts at 1 and increases sequentially.

“id” maps to a unique integer id (starting at 1, by convention, and increasing sequentially)

“label” maps to a string that represents the label associated with that vertex (think of this being like the vertex’s type, e.g. “person”)

“inE” and “outE” map to dictionaries that map the label (again, think type, e.g. “knows” may be a label between two people) of an edge to a list of edges of that type that involve the vertex of which these are sub-dictionaries. “inE” and “outE” encode directionality of the edges as the names suggest (“inE” contains the edges that terminate at the vertex, “outE” contains the edges that originate at the vertex). If a vertex has only one type of edge associated with it (e.g. only in edges), then the other key-value pair (e.g. “outE”:{}) may be omitted from this vertex.

“properties” maps to a dictionary where the keys are the labels of the properties and the values are dictionaries that contain the key “id” mapping to an integer ID for that property value (used for indexing) and the key “value” mapping to a value for that property. Every property value needs to have an ID (separate from the IDs for vertices and edges, these can start at 0), otherwise an error will occur in the GraphSON reader (“Error: property value cannot be null”). Example: "properties":{"name":{"id": 0, "value":"marko"}, "age":{"id":1, "value":29}}. The IDs are unique, and by convention, start at 0 and increase sequentially.

Edge Rules and Conventions

“id” maps to a unique integer ID. By convention, the edge IDs start immediately after the greatest vertex ID and increase sequentially, so if a graph has 6 vertices and 6 edges, the vertex IDs would be 1-6 and the edge IDs would be 7-12.

“inV” and “outV” map to the vertex IDs on each side of the edge. “outV” maps to the ID of the vertex from which the edge originates and “inV” maps to the ID of the vertex at which the edge terminates. For example, the edge 1-knows->2 would look like this: {"outV":1, "inV":2, "properties":{}}. Since in a GraphSON format, the edges are all listed as a property of vertices, and nowhere else, one of “outV” or “inV” is already implicit based on whether it is in the list mapped to by “inE” or “outE” and the ID of the vertex on that line, and, thus, may be omitted. For example, with the edge described above, in the dictionary that describes vertex 1, the edge would be present in the list of edges mapped to by “outE”, so “outV”:1 is implicit and may be omitted. It need not be omitted though, and is convenient to leave in some cases.

“properties” maps to a dictionary which maps the label of a property to its value. For example, properties:{"weight":0.5}. These properties are not indexed, and each edge has only one property associated to it.

Because of the redundancy of encoding the same edge in both the “outE” field of the vertex from which it emanates and in the “inE” field of the vertex at which it terminates, each edge will appear twice in the GraphSON format; the two instances of the edge should be identical, since both instances represent the same edge.

Example GraphSON Structure

The TinkerPopTM Modern example (data/tinkerpop-modern.json in the directory apache-tinkerpop-gremlin-console-3.2.5, i.e. the directory created by downloading the Gremlin Console):

Next Steps

Notes

A whole GraphSON file itself is not valid JSON. Rather, each line in the file is valid JSON. So, if you want to parse a file yourself, you should write or read the file using a JSON writer/reader line by line.

The examples, rules, and conventions here are designed to be compatible with the GraphSON Reader in the Gremlin console. For certain implementations, this Reader may be disabled, so it may be necessary to parse GraphSON yourself or utilize a bulk upload feature, which may have a slightly different format.

There may be a way to customize the reader in such a way as to follow the rules and format described above less closely, but the documentation about customization is vague and some of the provided code on the TinkerPopTM page is syntactically invalid for the Gremlin console.

There is a feature highlighted heavily in the Gremlin documentation, which is that GraphSON can embed the types (Java types) of the properties encoded in the graph. From the documentation page:

One of the important configuration options of the GraphSONReader and GraphSONWriter is the ability to embed type information into the output. By embedding the types, it becomes possible to serialize a graph without losing type information that might be important when being consumed by another source. This is useful for making sure numeric values are interpreted correctly by the application that loads the data.