The context

Last summer, I started an open source project called boontadata-streams. It is an environment where one can compare big data streaming engines such as Apache Flink, Apache Storm, or Apache Spark Streaming, to name a few.

This second result is better, and all the differences can be explained: those in the 2016-12-05 14:53:35 time window happen because no further event was sent that could trigger the calculation of this time window; the 2016-12-05 14:51:10 time window for cat4 corresponds to an event that was sent too late (on purpose) by the IoT simulator, so the streaming engine cannot take it into account.
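To illustrate why that late cat4 event is dropped, here is a toy sketch (not the boontadata code) of event-time window semantics: with no allowed lateness, an event whose window has already fired cannot be counted.

```java
// Hypothetical illustration, not the actual boontadata code.
public class LateEventCheck {
    // windowEnd: end timestamp of the window the event belongs to.
    // arrivalWatermark: the current watermark when the event arrives.
    public static boolean isDropped(long windowEnd, long arrivalWatermark) {
        // Once the watermark has passed the end of the window, the window
        // has fired; with no allowed lateness, the late event is ignored.
        return arrivalWatermark >= windowEnd;
    }
}
```

The cat4 event above falls into the 14:51:10 window but arrives after the watermark has moved past it, so it is dropped.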

The project leverages Docker containers and can run on a single Ubuntu VM with ~14 GB of RAM. Of course, you are welcome to contribute. We may even let you use one of our Azure VMs while you develop. You can contact me at contact at boontadata.io.

The pom and the code

In order to write that, I relied on sample code from the Internet about Flink consuming Kafka and Flink writing to Cassandra. Still, it took me some time to put everything together.

What consumed most of my time was the Maven configuration file: pom.xml.

I’m not a Java specialist, so I gave up on optimizing the JAR size and preferred to build an uber-jar.
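For reference, the usual way to produce an uber-jar with Maven is the shade plugin. This is a generic sketch (the version number is just an example), not the exact configuration from the boontadata pom:

```xml
<!-- Sketch of a minimal maven-shade-plugin setup; version is an example -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <!-- bind the shade goal to the package phase -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

Running `mvn package` then bundles the dependencies into a single JAR that can be submitted to Flink.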

I couldn’t make the Kafka 0.10 client work, so I used Kafka 0.8.2. This version of the client needs to communicate with Zookeeper. Still, it can read from Kafka 0.10.
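The practical consequence is that the consumer properties must include a Zookeeper address in addition to the brokers. A sketch of such a configuration (host names are placeholders, not the actual boontadata topology):

```java
import java.util.Properties;

public class KafkaConsumerConfig {
    // Builds the consumer properties expected by the Kafka 0.8 client.
    // Host names and group id below are placeholders; adapt to your setup.
    public static Properties kafka08Properties() {
        Properties props = new Properties();
        // the 0.8 client keeps consumer offsets in Zookeeper,
        // so it needs the Zookeeper connection string
        props.setProperty("zookeeper.connect", "zookeeper1:2181");
        props.setProperty("bootstrap.servers", "kafka1:9092");
        props.setProperty("group.id", "boontadata-flink");
        return props;
    }
}
```

These properties would then be handed to the Flink Kafka 0.8 consumer connector when building the stream.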

The Flink job:

- creates a time window in order to remove duplicates (the key is the message id)
- creates a time window in order to aggregate (based on device id and message category)
- sends some debugging information about events and their time windows to a Cassandra debug table
- aggregates data
- writes the results to the Cassandra agg_events table
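The core of those steps — deduplicate on the message id, then aggregate per (device id, category) — can be illustrated on plain Java collections. This is a simplified sketch of the logic, not the windowed Flink code, and the semicolon-separated event format is hypothetical:

```java
import java.util.*;

public class DedupAggregate {
    // Hypothetical event shape: "messageId;deviceId;category;measure".
    // First occurrence of a message id wins (deduplication), then the
    // measure is summed per (deviceId, category) pair.
    public static Map<String, Long> aggregate(List<String> events) {
        Set<String> seenIds = new HashSet<>();
        Map<String, Long> sums = new LinkedHashMap<>();
        for (String e : events) {
            String[] f = e.split(";");
            if (!seenIds.add(f[0])) {
                continue; // duplicate message id: skip it
            }
            String key = f[1] + "|" + f[2];
            sums.merge(key, Long.parseLong(f[3]), Long::sum);
        }
        return sums;
    }
}
```

In the real job, both steps run inside time windows and the resulting rows go to the Cassandra agg_events table.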

Feel free to visit the GitHub repo, use the code as samples, fix it and create a pull request when you find ways to improve it, or add your own implementation for other streaming engines such as Storm or Spark Streaming.

A copy of the most significant pieces of code

The rest of this post is a copy of the most significant pieces of code.

Conclusion

We have containers in a common network which is described only through Docker means. This works on a single host, and also on multiple hosts (in the example, one host had 1 container, and another host had 2 of the 3 containers).

Once you’ve created the cluster, go to https://{yourclustername}-node0.{install-location}.cloudapp.azure.com:9443 and connect.

In my case, I named the cluster mapr34 and installed it in the North Europe region, so it is https://mapr34-node0.northeurope.cloudapp.azure.com:9443.

NB: this mapr34-node0.northeurope.cloudapp.azure.com host name can be found in the portal when you browse the resource group where the cluster is. It’s attached to the public IP of the node.

Use mapr as the username, and the password you provided in step 2 of the wizard as the password.

Select each node and check the disks where you want to install the distributed file system. /dev/sdb1 is the cache disk. The 1023 GB disks are VHDs.

Use the Next button to move on.

Click the Install button to start the installation process.

After a number of minutes, the installation completes.

On the final step, you can find a link to a short name. Unless you’ve created an SSH tunnel to your cluster, you may need to use the long name instead. In this example, where my cluster is called mapr34 and is installed in North Europe, the URL is https://mapr34node1:8443/. I replace it with https://mapr34-node1.northeurope.cloudapp.azure.com:8443.

NB: this mapr34-node1.northeurope.cloudapp.azure.com host name can be found in the portal when you browse the resource group where the cluster is. It’s attached to the public IP of the node.

You connect with the same credentials as before: mapr/{the password you provided in step 2 of the creation wizard}.

Now that the MapR file system is installed, let’s see it as HDFS. Let’s also check whether we can access Azure Blob storage.

If you go back to the installation page you’ll have the option to install additional services:

When you’re done, you can stop the services before shutting down the Azure virtual machines. If there are many nodes, you may want to use the Azure PowerShell module or the Azure Command Line Interface (Azure CLI). You can find them in the resources section of azure.com.

You may also prefer to remove all the resources that constitute the cluster: VMs, storage, vNet, and so on. Of course, all the data will be removed as well, so you are asked to type the resource group name before deleting it.

Conclusion

We saw how to create a MapR cluster in Azure. You just have to enter a few parameters in friendly Web interfaces and wait for the cloud and MapR to create everything for you!