A critical piece of that architecture is the ability for Terracotta BigMemory to act as a fast In-Memory buffer, accessible by both the real-time world and the batch world...effectively bridging the gap between the 2.
The Terracotta BigMemory-Hadoop connector is at the center of that piece, allowing hadoop to write seamlessly to BigMemory.

For general information, please refer to existing writings about this connector:

But in this post, I want to be "hands-on" and enable you to see it running for yourself on your own development box. I've outlined the 5 major steps to successfully install and test the Hadoop-to-BigMemory connector on your own development platform.
I'll be using as a guide the code I put together for "AFCEA Cyber Symposium Plugfest", available on github at https://github.com/lanimall/cyberplugfest

2 - Clone the git repository to get the cyberplugfest code:

In the rest of the article, we will assume that $CYBERPLUGFEST_CODE_HOME is the root install directory for the code.

3 - Extract the hadoop connector somewhere on your development box.

The content of the package has some simple instructions as well as a "wordcount" map reduce package.
If you want to explore and follow the default instructions + sample word count program, it works fine…but please note that I took some liberties when it comes to my setup…and these will be explained in this article.
In the rest of the article, we will assume that $TC_HADOOP_HOME is the root install directory of the terracotta hadoop connector.

5 - Install Hadoop

I used the pseudo distributed mode for development…Tuning and configuring hadoop is outside the scope of this article…but should certainly be explored as a "go further" step. The apache page http://hadoop.apache.org/docs/stable/single_node_setup.html is good to get started on that...
In the rest of the article, we will assume that $HADOOP_INSTALL is the root install directory of Apache Hadoop

7 - Start Hadoop in pseudo-distributed mode

Ok, at this point, you should have all the software pieces (big memory max and hadoop) ready and running in the background. Now it's time to build a map/reduce job that will output something in Terracotta BigMemory. For the "AFCEA Cyber Symposium Plugfest" which this article is based on, I decided to build a simple "Mean Calculation" map/reduce job…the idea being that the job would run on a schedule, calculate the mean for all the transactions per Vendor, and output the calculated mean per vendor into a Terracotta BigMemory cache.

1 - Let's explore the application-context.xml

a - Specify the output cache name for the BigMemory hadoop job

In the <hdp:configuration><hdp:configuration>, make sure to add the "bigmemory.output.cache" entry that specifies the output cache. Since our output cache is "vendorAvgSpend", it should basically be: bigmemory.output.cache=vendorAvgSpendNOTE: I use Maven resource plugin, so this value is actually specific in the pom.xml (in the property "hadoop.output.cache")

For hdjob-vendoraverage, it's the standard "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"

output-path

It is not needed for the hadoop BigMemory job since it does not write onto HDFS...

reducer

For the hadoop BigMemory job, a different reducer implementation is needed (org.terracotta.pocs.cyberplugfest.VendorSalesAvgReducerBigMemory) because you need to return an object of type "BigmemoryElementWritable"

For hdjob-vendoraverage job, the reducer returns an object of type"Text".

files

In the hdjob-vendoraverage-bm job, you need to add the terracotta license file so the hadoop job can connect to the terracotta bigmemory (enterprise feature)

c - Specify the job to run.

Done in the <hdp:job-runner ...> tag. You can switch back and forth to see the difference...

In this ehcache.xml file, you'll notice our hadoop output cache (as well as several other caches that are NOT used by the hadoop jobs). The one thing that is needed is that it must be a "distributed" cache - in other word, the data will be stored on the BigMemory Max Server instance that should be already running on your development box (The "" and "" tags specifies that)

Master Step 3 - Prepare the sample data

In the real demo scenario, I use Apache Flume (http://flume.apache.org) to "funnel" near real-time the generated sample data into HDFS…But for the purpose of this test, it can all work fine with some sample data. All we need to do is import the data into our local HDFS.

Extract the sample at: $CYBERPLUGFEST_CODE_HOME/HadoopJobs/SampleTransactionsData/sample-data.zip.
It should create a "flume" folder with the following hierarchy:

flume/

events/

13-07-17/

events.* (those are the files with the comma separated data)

Navigate to $CYBERPLUGFEST_CODE_HOME/HadoopJobs/SampleTransactionsData/
Run the hadoop shell "put" command to add all these files into HDFS:

$HADOOP_INSTALL/bin/hadoop dfs -put flume/ .

Once done, verify that the data is in HDFS by running (This should bring a lot of event files…):

$HADOOP_INSTALL/bin/hadoop dfs -ls flume/events/13-07-17/

Master Step 4 - Compile and Run the hadoop job

Before that though, make sure the maven properties are right for your environment (i.e. hadoop name node url, hadoop job tracker url, terracotta url, cache name, etc…). These properties are specified towards the end of the pom file, in the maven profiles I created for that event (dev profile is for my local, prod profile is to deploy in the amazon ec2 cloud)

Then, navigate to the $CYBERPLUGFEST_CODE_HOME/HadoopJobs folder and run:

mvn clean package appassembler:assemble

This should build without an issue, and create a "PlugfestHadoopApp" executable script (the maven appassembler plugin helps with that) in $CYBERPLUGFEST_CODE_HOME/HadoopJobs/target/appassembler/bin folder.

Depending on your platform (window or nix), chose the right script (sh or bat) and run:

Final Words

Using this hadoop-to-bigmemory connector, you can truly start to think: "I can now access all my BigData insights at micro-second speed directly from within all my enterprise applications, AND confidently rely on the fact that these insights will be updated automatically whenever you hadoop jobs are running next".