The past, present, and future of streaming: Flink, Spark, and the gang

Reactive, real-time applications require real-time, eventful data flows. This is the premise on which a number of streaming frameworks have proliferated. The latest milestone was adding ACID capabilities, so let us take stock of where we are in this journey down the stream -- or river.


To begin with, as Baer noted, there is an API for Flink that can be downloaded from GitHub, but it only works for a single stream. The version with the "runner" for multiple parallel streams is part of the data Artisans Platform - the commercial incarnation of Flink.

This is not at all surprising, as data Artisans, the vendor that provides support for Flink and employs a big part of its full-time contributors, has an open core policy. That's a very common policy in the open source world, and one that data Artisans/Flink's main competitor, Databricks/Apache Spark, is also taking.

How many streaming engines does the world need?

As Baer would say, how many streaming engines does the world need? Good question, which may also be rephrased as two follow-up questions: How many vendors can survive doing what data Artisans and Databricks do, and how do you choose a streaming engine?

The answer to the first question is exactly two, at this point: data Artisans and Databricks. A third competitor, DataTorrent, and its Apache Apex engine, which we covered a while back, went belly up. Seems like the unusual "we'll do anything including building on our competitor's engine" message was one last effort to stay afloat by adopting an approach more apt to a consultancy than a vendor behind an open source project.

Either way, this means there are a number of orphans in the open-source streaming solutions space now: Platforms without a vendor to provide support, a hardened version, and steer their development. Besides Apex, the list also includes Apache Storm and Apache Samza. Storm is older and more mature than Samza, and also has some support from Hortonworks.

Hortonworks' core business is not streaming, however, and if you want to use Storm and have enterprise support levels, it seems you'll have to go for the entire Hortonworks stack, too. We don't know whether Hortonworks has plans to step up for Storm, but we don't have any such signals at this point.

There are also a number of closed-source solutions for streaming, but it looks like they have an uphill battle to fight. They may have their merits and a customer base to show for it, but much of that is based on legacy contracts and relationships. In a "try before you buy," fast-paced, open-source world, and an expanding market for streaming, winning new contracts won't be easy.

And then we also have the cloud vendors, of course: AWS with Kinesis, Google Cloud with Dataflow, and Azure with Stream Analytics. The usual motif plays out here, as well. These engines may or may not be the ones best suited to your needs. But if you're already using AWS, Google Cloud, or Azure, they will make it really easy and tempting for you to sign up and integrate their streaming solution in your applications.

Streaming engines adoption and competition

In a discussion of the streaming market, Kostas Tzoumas, data Artisans' CEO, was clear about what he sees as the biggest competition for data Artisans: Legacy. Tzoumas deliberately refrained from comparing data Artisans/Flink to other options, focusing instead on the company's efforts to reach out and scale up in terms of evangelizing and sales.

His views resonated with many Flink Forward attendees, including some of data Artisans' most high-profile clients. Delegates with loads of hands-on technical experience from the likes of Alibaba, Netflix, and Microsoft all emphasized that changing the paradigm and learning to work with streaming is something they have to master, and advocate for, every day.

Some of their comments were about the need to have streaming work with all the reliability that is a given in the batch world, to learn to program in a more thoughtful way compared to single-threaded applications, and to raise the abstraction level. data Artisans seems to be listening, judging from what is on its agenda.

The evolution of streaming. (Image: Data Artisans)

We already mentioned the introduction of ACID to cater for reliability, which was to a large extent driven by the requirements of large financial and eCommerce organizations that use the data Artisans Platform. Another major bet for Flink is the advance toward the unification of APIs for streaming and batch, which Alibaba has been working on and which is about to be integrated into the core Flink codebase.

Flink has a number of APIs -- data streams, data sets, process functions, the table API, and as of late, SQL, which developers can use for different aspects of their processing. Ideally, people would like to use SQL for everything. This would not only simplify the lives of developers, but also make Flink more approachable for non-technical users.

The need to make data Artisans sustainable may have something to do with other choices made too. The fact that data Artisans Platform is not available in the cloud, for example, is a striking difference with Databricks, which touts a cloud-only strategy for its own platform, playing the iPaaS card.

But when your main clients are behemoths with their own infrastructure, as seems to be the case for data Artisans, offering them a cloud version makes less sense. That may also explain Tzoumas' comment when he said that they do not compete with Databricks/Spark much. Not that Flink is not attractive for smaller organizations, but the story of using Flink plus some support and consulting, rather than the data Artisans Platform, was one we heard more often from them.

Data Artisans and Apache Flink going forward

Apache Flink's (twin) versions 1.4 and 1.5 were of the kind that introduce somewhat unglamorous, not very popular, but highly needed improvements. They were all about production deployment and stability options, and they meant some backwards compatibility had to be broken. This is why we heard of many users still running 1.3, even though improvements in 1.6, mostly in streaming SQL, tempted some to take the plunge and upgrade.

Now, that hard, unglamorous work is mostly over. One important part that data Artisans aims to address is the containerization of Flink, or being able to use it as a library with Docker and Kubernetes, in what they call Reactive mode.

Other items on the agenda for the near future include auto-scaling, time-versioned table joins (a much needed feature in a world where data is constantly updated), and SQL for pattern analysis. SQL has been extended with the MATCH_RECOGNIZE capability toward this end, and data Artisans wants to bring this to Flink.
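
To make the idea of a time-versioned table join more concrete, here is a minimal sketch in plain Python. It is not Flink code and does not reflect how Flink implements the feature; the names (VersionedTable, join_as_of) and the currency-rate data are hypothetical, chosen only for illustration. The sketch pairs each event with the version of a table row that was valid at the event's timestamp, rather than with the latest version.

```python
from bisect import bisect_right
from collections import defaultdict


class VersionedTable:
    """Hypothetical helper: keeps every version of a keyed value, ordered by valid-from time."""

    def __init__(self):
        self._times = defaultdict(list)   # key -> sorted list of valid-from timestamps
        self._values = defaultdict(list)  # key -> values aligned with _times

    def upsert(self, key, valid_from, value):
        # Insert the new version at its sorted position so lookups can binary-search.
        idx = bisect_right(self._times[key], valid_from)
        self._times[key].insert(idx, valid_from)
        self._values[key].insert(idx, value)

    def as_of(self, key, timestamp):
        # Return the latest version valid at `timestamp`, or None if none existed yet.
        times = self._times.get(key, [])
        idx = bisect_right(times, timestamp)
        return self._values[key][idx - 1] if idx else None


def join_as_of(events, table):
    """Pair each (key, timestamp, amount) event with the value valid at its timestamp."""
    return [(key, ts, amount, table.as_of(key, ts)) for key, ts, amount in events]


# Illustrative data: the EUR rate changes at time 10.
rates = VersionedTable()
rates.upsert("EUR", 0, 1.10)
rates.upsert("EUR", 10, 1.20)

orders = [("EUR", 5, 100), ("EUR", 12, 100), ("USD", 3, 50)]
joined = join_as_of(orders, rates)
```

The order at time 5 is matched with the rate that was valid from time 0, the order at time 12 with the rate valid from time 10, and the USD order finds no rate at all. A join against only the current table state would have applied the newest rate to both EUR orders.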

Another interesting direction is opening up to Python via Apache Beam. Although Beam and Flink are conceptually rather close, as data Artisans CTO Stephan Ewen noted, up to now Flink did not have any tangible benefits to reap by being aligned with Beam. But support for Python is changing that.

Beam is introducing a framework through which APIs in languages other than Java can be supported, and Python is the first one. According to the Apache Beam people, this comes without unbearable compromises in execution speed compared to Java -- something like 10 percent in the scenarios they have been able to test.

Databricks/Spark on the other hand has had support for Python for a while now, which may help explain what we perceive as a broad differentiation between the two platforms: Flink is used more as a fast processing stateful engine, with ACID reinforcing its position as the integration hub for the real-time enterprise, while Spark is used more as a data science -- analytics backbone, with Python and notebook integration contributing to its popularity.

Of course, there are overlaps, and things are not as clear-cut as that. In any case, it is worth noting that data Artisans' ACID support is patented and part of the data Artisans Platform, which means that unlike stateful streaming, Databricks will not be able to introduce it in its own platform as easily. Regardless, Databricks and Spark have been making progress on their own trajectory, and we will be sharing more on that soon.

