Monday, April 1, 2013

Akka, Finagle and Storm are three new open source frameworks for distributed, parallel and concurrent programming. They all run on the JVM and work well with Java and Scala.

They are very useful for many common problems:

Real-time analytics

Complex websites with different inputs and outputs

Finance

Multiplayer games

Big data

Akka, Finagle and Storm are all very elegant solutions optimized for different problems. It can be confusing which framework to use for which problem. I hope that I can clarify this.

The 30-second history of parallel programming

Parallel / concurrent programming is hard. It had rudimentary support in C and C++ in the 1990s.

In 1995 Java made simple concurrent programming on one machine much easier by adopting the monitor primitive. Still, if you had more than a few threads you could easily get deadlocks, and it did not solve the bigger problem of spreading a computation over many machines.

MapReduce and Hadoop

Hadoop, the open source implementation of Google's MapReduce, is the best-known method for distributed parallel programming.
It does batch computation by translating an algorithm into a sequence of map and reduce steps.
Map is fully parallel.
Reduce collects the results from the map step.
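
As a rough illustration of the two steps, here is a minimal sketch in plain Java. It does not use the Hadoop API; the log line format, the class name and the method names are assumptions made up for this example. The map step extracts a key from each line independently, and the reduce step collects the keys and counts them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of the map and reduce steps; this is not the Hadoop API.
class MapReduceSketch {

    // Map step: each line is handled independently, so lines can be processed in parallel.
    // Assumed (hypothetical) log format: the first token is the level, e.g. "ERROR disk full".
    static String mapToLevel(String line) {
        return line.split(" ", 2)[0];
    }

    // Reduce step: collect the mapped keys and count how often each one occurs.
    static Map<String, Long> countByLevel(List<String> lines) {
        return lines.parallelStream()
                .map(MapReduceSketch::mapToLevel)
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "INFO started",
                "ERROR disk full",
                "INFO request served",
                "ERROR timeout");
        System.out.println(countByLevel(lines));
    }
}
```

On one machine the parallel stream spreads the map step over cores; Hadoop applies the same idea across many machines and adds storage and scheduling on top.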

Hadoop has a steep learning curve. You should only use it when you have no other choice. Hadoop has a heavy stack with a lot of dependencies.

Hadoop has long response times, and it is not suited for real-time responses.

The Akka, Finagle and Storm frameworks are all easier to use than Hadoop and are suited for real-time responses.

Storm

Storm was created at BackType, which was acquired by Twitter, and was open sourced in 2011. It is written in Clojure and Java, but it works well with Scala. It is well suited for doing statistics and analytics on massive streams of data.

Storm can describe streaming computation very simply: You make a graph of your computation with input data sources called spouts at the top. Below them are computation nodes called bolts, which can depend on any spout or bolt above them, but you cannot have cycles. The graph is called a topology.

Features

Storm will deal with communication between machines and bolts

Consistent hashing to spread computation to the right instance of a bolt

Error recovery due to hardware or network failure

Storm does not handle computations that fail due to inherent errors well

Can do analytics on Twitter scale input, literally

You can create a bolt in a non-JVM language as long as it talks Thrift

Built-in support for Ruby, Python, and Fancy
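
To make the idea of routing by key concrete, here is a minimal sketch in plain Java. It is not Storm's implementation: it uses simple hash-modulo partitioning, which is less sophisticated than true consistent hashing, and the class and method names are hypothetical. The point is only that the same key always lands on the same bolt instance, so that instance can keep the state for that key.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative key-based routing (hash modulo); not Storm's implementation.
class KeyRoutingSketch {

    // Pick one of `instances` bolt instances for a key.
    // Math.floorMod keeps the index non-negative even when hashCode() is negative.
    static int instanceFor(String key, int instances) {
        return Math.floorMod(key.hashCode(), instances);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "akka", "storm", "finagle", "akka");
        for (String word : words) {
            System.out.println(word + " -> bolt instance " + instanceFor(word, 4));
        }
    }
}
```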

Word counter example in Storm

The hello world example for Hadoop is to count word frequencies in a big expanding text corpus.
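
Here is a minimal sketch of how the word count looks as a streaming computation. It is written in plain Java and does not use the Storm API; the names are hypothetical stand-ins for a spout that emits sentences, a bolt that splits them into words, and a bolt that keeps running counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a spout -> split bolt -> count bolt pipeline; not the Storm API.
class StreamingWordCountSketch {

    // "Split bolt": turn one sentence into lowercase words.
    static List<String> split(String sentence) {
        return Arrays.asList(sentence.toLowerCase().trim().split("\\s+"));
    }

    // "Count bolt": keeps running totals that are updated as each word arrives.
    static final class CountBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        void execute(String word) {
            counts.merge(word, 1, Integer::sum);
        }

        int countOf(String word) {
            return counts.getOrDefault(word, 0);
        }
    }

    // Feed sentences through the pipeline one at a time, as a spout would emit them.
    static CountBolt run(List<String> sentences) {
        CountBolt counter = new CountBolt();
        for (String sentence : sentences) {
            for (String word : split(sentence)) {
                counter.execute(word);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        CountBolt counter = run(Arrays.asList("the cow jumped", "over the moon"));
        System.out.println("the = " + counter.countOf("the"));
    }
}
```

Unlike a batch job, the counts are usable at any moment while the corpus keeps growing. In a real topology the framework would run many instances of each bolt on different machines.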

Turns computation into a Future, a monadic composable asynchronous computation.
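
To show what composing Futures looks like, here is a minimal sketch that uses the JDK's CompletableFuture as a stand-in. Twitter's libraries, including Finagle, have their own Future type (com.twitter.util.Future), which this does not use. The two lookup functions are hypothetical and only simulate asynchronous calls.

```java
import java.util.concurrent.CompletableFuture;

// Uses the JDK's CompletableFuture as a stand-in for a composable Future.
class FutureCompositionSketch {

    // Hypothetical asynchronous lookups, simulated with supplyAsync.
    static CompletableFuture<Integer> fetchUserId(String name) {
        return CompletableFuture.supplyAsync(() -> name.length());
    }

    static CompletableFuture<String> fetchGreeting(int userId) {
        return CompletableFuture.supplyAsync(() -> "hello user " + userId);
    }

    // thenCompose chains a second asynchronous call onto the first (monadic bind);
    // exceptionally supplies a fallback value if either step fails.
    static CompletableFuture<String> greet(String name) {
        return fetchUserId(name)
                .thenCompose(FutureCompositionSketch::fetchGreeting)
                .exceptionally(error -> "fallback greeting");
    }

    public static void main(String[] args) {
        System.out.println(greet("sami").join());
    }
}
```

Nothing blocks until join() is called at the very end, and the failure handling is written once for the whole chain instead of after every call.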

Use cases

Complex website using different services and supporting many different protocols

Web crawler

Side by side comparisons

Akka vs. Finagle

Akka and Finagle can both do two-way communication.

Akka is great as long as everything lives in one actor system or in multiple remote actor systems. It is very simple to have them communicate.

Finagle is much more flexible if you have heterogeneous services and you need different protocols and fallbacks between servers.

Akka vs. Storm

Akka is better for actors that talk back and forth, but you have to keep track of the actors, make strategies for setting up different actor systems on different servers, and make asynchronous requests to those actor systems. Akka is more flexible than Storm but there is also more to keep track of.
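
For readers who have not used actors, here is a minimal actor-style sketch in plain Java. It is not the Akka API; the class and method names are hypothetical. An actor owns private state and handles one message at a time, and callers either send a message and move on, or ask and get the answer back asynchronously.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal actor-style sketch in plain Java; not the Akka API.
class ActorSketch {

    // A single-threaded executor plays the role of the actor's mailbox:
    // messages are handled one at a time, in the order they were sent.
    static final class CounterActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        private int total = 0; // only ever touched by the mailbox thread

        // Fire-and-forget message.
        void tellAdd(int amount) {
            mailbox.execute(() -> total += amount);
        }

        // Request-reply message: the answer comes back asynchronously.
        CompletableFuture<Integer> askTotal() {
            return CompletableFuture.supplyAsync(() -> total, mailbox);
        }

        void stop() {
            mailbox.shutdown();
        }
    }

    static int demo() {
        CounterActor counter = new CounterActor();
        counter.tellAdd(2);
        counter.tellAdd(3);
        int result = counter.askTotal().join();
        counter.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Even in this toy version the caller has to hold a reference to the actor and manage its lifecycle, which is the bookkeeping that grows once actors are spread over several servers.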

Storm is for computations that move from upstream sources to different downstream sinks. It is very simple to set this up in Storm so that it runs the computation over many distributed servers.

Finagle vs. Storm

Finagle and Storm both handle failover between machines well.

Finagle does heterogeneous two-way communication.

Storm does homogeneous one-way communication, but does it in a simple way.

Serialization of objects

Serialization of objects is a problem for all of these, since you have to send objects between different machines, and the default Java serialization has problems.
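
One well-known issue with default Java serialization is the size of the output, since class metadata is written along with the field values. Here is a minimal sketch to observe this by comparing it with a hand-written encoding of the same data. The WordCount message type is hypothetical, and the manual encoding is only a baseline for comparison, not a recommendation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Compares default Java serialization with a hand-written encoding of the same data.
class SerializationSizeSketch {

    // Hypothetical message type, used only for this comparison.
    static final class WordCount implements Serializable {
        private static final long serialVersionUID = 1L;
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Default Java serialization: writes class metadata along with the field values.
    static int javaSerializedSize(WordCount message) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Hand-written encoding: only the field values, in a fixed order.
    static int manualSize(WordCount message) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(message.word);
            out.writeInt(message.count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        WordCount message = new WordCount("storm", 42);
        System.out.println("default: " + javaSerializedSize(message) + " bytes");
        System.out.println("manual:  " + manualSize(message) + " bytes");
    }
}
```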

My experience

It has been hard for me to decide what framework is the better fit in a given situation, despite a lot of reading and experimenting.

I started working on an analytics program and Storm was a great fit and much simpler than Hadoop.

I moved to Akka since I needed to:

Incorporate more independent subsystems, all written in Scala

Make asynchronous calls to external services that could fail

Set up on-demand services

Now I have to integrate this analytics program with other internal and external services, some in Scala and some in other languages. I am considering whether Finagle might be a better fit for this, even though Akka is easier in a homogeneous environment.

Afterthought

When I went to school, parallel programming was a buzzword. The reasoning was:

Parallel programming works like the brain and it will solve our computational problems

We did not really know how you would program it. Now it is finally coming of age.

Akka, Finagle, Storm and MapReduce are different elegant solutions to distributed parallel programming. They all use ideas from functional programming.

Storm and Akka are used for different things. Typical use cases for Storm and Akka:

Storm: Analytics on log files.

Akka: Low latency ad tech. You need a lot of actors that can act very fast, and you need some way to update the content in them and pass out the result.

If you need to communicate with the outside world you should look at Spray: https://github.com/spray/spray

Spray can turn an Akka actor into an HTTP client or server.

So if you need Storm you should not try to do it in Akka.

Hope this answered your question.

About Me

My interests are natural language processing, machine learning, programming language design, artificial intelligence and science didactics.
Author of an open source image processing project called ShapeLogic: https://github.com/sami-badawi/shapelogic-scala.
I have worked in NLP for several years, but spent many years working in the cubicles, at: Goldman Sachs with market risk, Fitch / Algorithmics with operational risk, BlackRock with mortgage backed securities, DoubleClick with Internet advertisement infrastructure, Zyrinx / Scavenger with game development. I have a master of science in mathematics and computer science from University of Copenhagen. For work I have been using these programming languages: Scala, Python, Java, C++, C, C#, F#, Mathematica, Haskell, JavaScript, TypeScript, Clojure, Perl, R, Ruby, Slang, Ab Initio (ETL), VBA. Plus many more programming languages for play.