Hadoop Summit 2015 Takeaway: The Lambda Architecture

Laurent Bride joined Talend in 2014 as Chief Technical Officer. He came with 17 years of software experience, during which he held various individual, management and executive roles in customer support and product development. Most recently, Laurent was CTO at Axway, where he was responsible for R&D, Innovation and Product Management. He has also spent more than nine years in Silicon Valley, working for Business Objects and then SAP. Laurent holds an engineering degree in mathematics and computer science from EISTI.

It has been a couple of weeks since I got back from the Hadoop Summit in San Jose and I wanted to share a few highlights that I believe validate the direction Talend has taken over the past couple of years.

Coming out of the Summit I really felt that as an industry we were beginning to move beyond the delivery of exciting innovative technologies for Hadoop insiders, to solutions that address real business problems. These next-generation solutions emphasize a strong focus on Enterprise requirements in terms of scalability, elasticity, hybrid deployment, security and robust overall governance.

From my perspective (biased of course!), the dominant themes at the Summit gravitated around:

Spark (the champion) stands out from the crowd because of its ability to address both batch and near real-time (micro batch in the case of Spark) data processing with great performance through its in-memory approach.

Spark is also continuously improving its platform by adding key components to appeal to more Data Scientists (on top of MLlib for machine learning, SparkR was added in the 1.4 release) and to expand its Hadoop footprint.

Spark projects in the Enterprise are on the rise, and Spark is slowly replacing MapReduce for Batch Processing in the minds of developers. IBM’s recent endorsement and commitment to put 3500 researchers and developers on Spark-related projects will probably accelerate Spark adoption in the hearts of Enterprise architects.

But, because there’s a champion, there must also be a contender…

This year, I was particularly impressed by the new Apache Flink project, which attempts to address some of Spark’s drawbacks like:

- Not being a YARN first class citizen yet

- Being Micro Batch (good in 95% of the cases) versus pure streaming

- Memory Management, which Flink aims to improve and make easier
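To make the micro-batch versus pure streaming distinction concrete, here is a minimal sketch in plain Python. It uses neither Spark nor Flink, and the function names (`process_per_record`, `micro_batches`) are hypothetical, chosen only for illustration. It also assumes batches are cut by record count to keep the example short, whereas Spark Streaming cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def process_per_record(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Pure streaming style: each event is handled as soon as it arrives."""
    for event in events:
        handle(event)


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch style: buffer events and hand them over in small groups.

    Assumption: batches are cut by count here; a real engine such as
    Spark Streaming cuts them by a configured time interval instead.
    """
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, partially filled batch


events = [1, 2, 3, 4, 5]

# Per-record: the handler is called five times, once per event.
seen: List[int] = []
process_per_record(events, seen.append)

# Micro-batch: downstream logic runs once per group of events.
batches = list(micro_batches(events, batch_size=2))
```

In the per-record version, latency is bounded by how fast one event can be handled. In the micro-batch version, an event waits until its batch closes, which is the trade-off behind the "good in 95% of the cases" remark.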

If you look at Flink’s “marchitecture”, you can almost draw a one-to-one link between its modules and Spark’s. It’s the same story when it comes to their APIs: they are very similar.

With our Talend 5.6 platform, we delivered a few Spark components in Tech Preview. Since then, we have doubled down on our Spark investments, and our upcoming 6.0 release will add many new components to support almost any use case, batch or real-time. From a batch perspective, with 6.0, it will be easier to convert your MapReduce jobs into Spark jobs and gain significant performance improvements along the way.

It’s worth highlighting that the very famous and advanced tMap component will be available for Spark Batch and Streaming, allowing advanced Spark transformation, filtering and data routing from single or multiple sources to single or multiple destinations.
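For readers who have not used a mapping component of this kind, the pattern described above (transform, filter and route records from several sources to several destinations) can be sketched in plain Python. This is not tMap itself and not code generated by Talend. The `route` function, the field names and the 100 threshold are all assumptions made up for the illustration.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def route(sources: Iterable[Iterable[Record]]) -> Dict[str, List[Record]]:
    """Read several sources, then filter, transform and route each record."""
    outputs: Dict[str, List[Record]] = {"large_orders": [], "small_orders": []}
    for source in sources:
        for record in source:
            # Filtering: drop records with no amount.
            if record.get("amount") is None:
                continue
            # Transformation: normalize the customer name.
            transformed = {**record, "customer": str(record["customer"]).upper()}
            # Routing: pick a destination based on the (hypothetical) threshold.
            key = "large_orders" if transformed["amount"] >= 100 else "small_orders"
            outputs[key].append(transformed)
    return outputs


# Two hypothetical input sources feeding two output destinations.
web = [{"customer": "ann", "amount": 150}, {"customer": "bob", "amount": None}]
store = [{"customer": "cy", "amount": 40}]
result = route([web, store])
```

On a cluster, the same three steps would be expressed as distributed operations rather than Python loops, but the shape of the logic is the same.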

As always, and because we believe native code running directly on the cluster is better than going through proprietary layers, we are generating native Spark code, allowing our customers to benefit from the continuous performance improvements of their Hadoop data processing frameworks.