LinkedIn open sources stream-processing engine Samza, its take on Storm

LinkedIn has open sourced a technology called Samza, which the company uses to process data in real time. It sounds an awful lot like Storm — the de facto stream-processing technology for web properties that has a home inside Twitter — only Samza is built on top of Hadoop and utilizes LinkedIn’s homemade Kafka messaging system.

Still, Storm and Samza are rather similar. As LinkedIn’s Chris Riccomini wrote in the blog post introducing Samza, “[It] helps you build applications that process feeds of messages—update databases, compute counts and other aggregations, transform messages, and a lot more.” Those are classic Storm applications and, indeed, the Samza documentation includes a page dedicated to comparing the two systems.

When LinkedIn was spreading the word of Samza through various forums and other online communities last month, one commenter on Grokbase noted the possible benefits of Samza:

“Like many we use Storm for near real-time processing our Kafka based streams. In addition we send this data to Hadoop for offline analysis. Consolidating these three environments to one is a win by itself.”

On paper, Samza seems like a good idea because of that consolidation and because of how well it appears to marry its two big components. Its Apache Software Foundation project home page lays out some of its features and explains how Kafka and YARN (the resource-management framework at the core of Hadoop version 2.0) work together. Among the highlights are:

Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.

Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
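
To make those two guarantees concrete, here is a minimal Python sketch of the idea. It is not Samza's actual API, and every name in it (PartitionedLog, CountingProcessor, run_with_restart) is hypothetical. An in-memory list stands in for a Kafka partition, a small supervisor loop stands in for the restart role described for YARN, and a saved offset lets a restarted processor pick up where the failed one stopped, so messages are handled in write order and none are skipped.

```python
from collections import defaultdict


class PartitionedLog:
    """In-memory stand-in for a Kafka-like topic: append-only, ordered per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        self.partitions[partition].append(message)

    def read(self, partition, offset):
        """Return the messages at or after `offset`, in the order they were written."""
        return self.partitions[partition][offset:]


class CountingProcessor:
    """Counts messages per key and records the next offset to read."""

    def __init__(self, log, partition, checkpoints, counts):
        self.log = log
        self.partition = partition
        self.checkpoints = checkpoints  # assumed durable: partition -> next offset
        self.counts = counts            # assumed durable: key -> count

    def run(self, fail_after=None):
        start = self.checkpoints.get(self.partition, 0)
        for i, key in enumerate(self.log.read(self.partition, start)):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated machine failure")
            # Simplification: state and checkpoint are updated back to back.
            # A real system has to handle a crash between these two writes.
            self.counts[key] += 1
            self.checkpoints[self.partition] = start + i + 1


def run_with_restart(make_processor, failures):
    """Supervisor stand-in: start a fresh processor whenever one fails."""
    restarts = 0
    pending = list(failures)
    while True:
        fail_after = pending.pop(0) if pending else None
        try:
            make_processor().run(fail_after=fail_after)
            return restarts
        except RuntimeError:
            restarts += 1


log = PartitionedLog(num_partitions=1)
for key in ["view", "click", "view", "view", "click"]:
    log.append(0, key)

checkpoints, counts = {}, defaultdict(int)
restarts = run_with_restart(
    lambda: CountingProcessor(log, 0, checkpoints, counts),
    failures=[2],  # the first processor dies before handling its third message
)
print(restarts, dict(counts), checkpoints)
```

The first processor handles two messages and then fails. The replacement reads the saved offset and resumes from the third message, so the final counts match what an uninterrupted run would have produced.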

If Samza points to a bigger picture, though, it’s that YARN appears to be living up to the hype the Hadoop community has been heaping upon it for the past 18 months. It runs Storm, it runs Samza, and it could potentially run a lot of other things. This matters because a lot of software vendors, big, small and in between, have banked their big data futures (some, their entire futures) on Hadoop as the platform that will ultimately carry the day.

But you don’t have to take my word for any of this. If you’re in London, swing by our Structure: Europe conference taking place Wednesday and Thursday, or watch the live stream wherever you are. We’ll have LinkedIn’s data engineering boss Bhaskar Ghosh, among a slew of other speakers responsible at one point or another for managing some of the biggest systems on the web. We’ll also have technology executives from international corporations such as BMW. And I’m sure they all have an opinion on big data and the right technologies for doing it.