Guest Post by Jenny Kim
The purpose of this post is to walk through a Titan cluster setup and highlight some key gotchas I've learned along the way. This walkthrough uses the following versions of each software package:

Versions

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the Datastax Community AMI requires M1.Large instances at minimum, the exact instance type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.

Part 1: Set up a Cassandra cluster

I followed Titan's EC2 instructions for standing up Titan on a Cassandra cluster using the Datastax Auto-Clustering AMI:

Step 1: Setting up Security Group

Navigate to the EC2 Console Dashboard, then click on Security Groups under Network & Security.

Create a new security group. Click Inbound. Set the “Create a new rule” dropdown menu to “Custom TCP rule”.

Add a rule for port 22 from source 0.0.0.0/0.

Add a rule for ports 1024-65535 from the security group members. If you don’t want to open all unprivileged ports among security group members, then at least open 7000, 7199, and 9160 among security group members.

Tip: the “Source” dropdown will autocomplete security group identifiers once “sg” is typed in the box, so you needn’t have the exact value ready beforehand.
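If you would rather script this step than click through the console, the same three actions can be sketched with the AWS CLI. This is only a sketch and not part of the original setup: the group name `titan-cluster` is a placeholder, it assumes the AWS CLI is installed and configured, and it assumes the group can be referenced by name (EC2-Classic or a default VPC). By default it runs in dry-run mode and only prints the commands.

```shell
#!/bin/sh
# Sketch only: SG_NAME is a hypothetical placeholder; pick your own group name.
SG_NAME="${SG_NAME:-titan-cluster}"
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_titan_security_group() {
    # Create the group itself.
    run aws ec2 create-security-group \
        --group-name "$SG_NAME" \
        --description "Titan Cassandra cluster"
    # SSH (port 22) from anywhere, matching the console rule above.
    run aws ec2 authorize-security-group-ingress \
        --group-name "$SG_NAME" --protocol tcp --port 22 --cidr 0.0.0.0/0
    # All unprivileged ports, but only between members of this group.
    run aws ec2 authorize-security-group-ingress \
        --group-name "$SG_NAME" --protocol tcp --port 1024-65535 \
        --source-group "$SG_NAME"
}

create_titan_security_group
```

Set `DRY_RUN=0` to execute the commands for real. To open only 7000, 7199, and 9160 instead of the full unprivileged range, replace the last rule with one call per port.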

[number-of-instances] in this configuration must match the number of EC2 instances configured on the previous wizard page (i.e., 2). [cassandra-cluster-name] can be any string used for identification. For example:

On the Tags page of the Request Instances Wizard you can apply any desired configurations. These tags exist only at the EC2 administrative level and have no effect on the Cassandra daemons’ configuration or operation.

It is useful here to set a tag for ElasticSearch to discover this node when identifying its cluster nodes. We will revisit this tag in the ElasticSearch section.

On the Create Key Pair page of the Request Instances Wizard, either select an existing key pair or create a new one. The PEM file containing the private half of the selected key pair will be required to connect to these instances.

On the Configure Firewall page of the Request Instances Wizard, select the security group created earlier.

Review and launch instances on the final wizard page. The AMI will take a few minutes to load.

The database cache settings should be enabled in a Production environment. Full documentation is found here: https://github.com/thinkaurelius/titan/wiki/Database-Cache. For our purposes, we will just enable the db-cache, set the clean time (milliseconds to wait to clean cache), cache-time (max milliseconds to hold items in cache), and cache-size (percentage of total heap space available to the JVM that Titan runs in).
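As a rough sketch of what enabling these four settings can look like, the helper below appends them to a Titan properties file. The values are illustrative only, and the key names (`cache.db-cache` and its siblings) are assumed from the Titan 0.4.x Database Cache documentation linked above, so verify them against that page for your Titan version.

```shell
#!/bin/sh
# Append example db-cache settings to the properties file given as $1.
# Key names are assumed from the Titan 0.4.x Database Cache wiki; values are examples only.
append_db_cache_settings() {
    cat >> "$1" <<'EOF'
# Enable the database-level cache
cache.db-cache = true
# Milliseconds to wait before cleaning the cache
cache.db-cache-clean-wait = 50
# Maximum milliseconds to hold items in the cache
cache.db-cache-time = 10000
# Share of the JVM heap available to the cache (0.25 = 25%)
cache.db-cache-size = 0.25
EOF
}
```

Call it with the path of the properties file your Titan instance actually loads, for example `append_db_cache_settings conf/titan-cassandra-es.properties` (a hypothetical path).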

You should see Gremlin connect to the Cassandra cluster and return a blank Gremlin prompt. Success! Keep Gremlin open for the next Step.

Step 3: Run Indices

Before we add any data to our graph, we need to do a one-time setup of any Titan and ElasticSearch property and label indices. This must be done with caution because in Titan, indices cannot be modified, dropped, or added on existing properties (see Titan Limitations).

Note that we created a script for our indices and tracked it in GitHub so we can quickly adapt our indices when updating and reloading a new graph. Also keep in mind that Titan 0.4.1 has a new index syntax that is not backwards compatible with the old Titan 0.3.2 syntax.

In the Gremlin shell, copy and paste the indices script. If all indices run successfully, commit, shut down, and exit:

gremlin> g.commit()
gremlin> g.shutdown()
gremlin> exit
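For readers who have not seen the Titan 0.4.1 syntax, an indices script might look roughly like the one written by the helper below. Saving it to a file makes it easy to track in version control and paste into the Gremlin shell. Everything specific here is an assumption made for illustration, not our actual script: the property keys `name` and `description`, the label `knows`, and the ElasticSearch index name `search`, which would have to match the index name in your Titan configuration.

```shell
#!/bin/sh
# Write an example indices script to the file given as $1.
# All key, label, and index names below are hypothetical.
write_indices_script() {
    cat > "$1" <<'EOF'
// Standard Titan index on a unique vertex property
g.makeKey('name').dataType(String.class).indexed(Vertex.class).unique().make()
// Property indexed through the external index backend named 'search' (assumed name)
g.makeKey('description').dataType(String.class).indexed('search', Vertex.class).make()
// Edge label
g.makeLabel('knows').make()
EOF
}
```

For example, `write_indices_script indices.groovy` writes the script to a file named `indices.groovy` (a placeholder name) whose contents can then be pasted at the Gremlin prompt.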

(Optional) Part 3: Load GraphSON

If you are doing a bulk load of GraphSON into Titan, you can do so via Faunus or Gremlin. The GraphSON format for each method is unique, so you will need to ensure that your GraphSON format adheres to the expected rules. This walkthrough will focus on the Gremlin GraphSON load.

Save the GraphSON file (e.g., gremlin.graphson.json) to the root of the Titan directory.

Note that the above configuration assumes that you are using the Datastax AMI which includes a raid0 directory, and thus logs to the /raid0/log/rexster directory. You must create this directory before starting the script:

mkdir -p /raid0/log/rexster

Save the file and start Rexster with Upstart:

sudo start rexster

Check the log to ensure successful startup:

cd /raid0/log/rexster
tail rexster.log

Success!

You now have a fully configured Cassandra, ElasticSearch, Titan, Rexster setup on a single node. Once you've applied this configuration to all your nodes, you can start Rexster Server on the entire cluster, set up an ELB that contains your Rexster instances binding port 80 to 8182, and start accepting Rexster requests from the ELB's domain.
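Before pointing the ELB at a node, it can help to confirm that Rexster is actually answering on port 8182. Here is a minimal sketch, assuming `curl` is installed and that Rexster's REST API lists its graphs under the `/graphs` path on that port. The function names are made up for this example.

```shell
#!/bin/sh
# Build the URL of Rexster's graph listing for a host ($1) and optional port ($2, default 8182).
rexster_url() {
    echo "http://$1:${2:-8182}/graphs"
}

# Succeed only if the host answers the request without an HTTP error within 5 seconds.
rexster_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$(rexster_url "$1" "$2")"
}
```

Running `rexster_is_up localhost && echo ok` on each node, or looping over your node addresses from one machine inside the security group, gives a quick pass/fail check before the ELB starts routing traffic.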

Jenny Kim

Jenny Kim is a senior software engineer at Cobrain, where she works with the data science team. Jenny graduated from the University of Maryland with a B.S. in Computer Science and a B.A. in American Studies. She earned her Master's in Information Systems Technology from The George Washington University in December 2013. In her free time, Jenny enjoys volunteering at local film festivals, obsessive vacuuming, and relaxing with the family Shih Tzu.