The next steps for Spark in the cloud

Serdar Yegulalp
June 8, 2016

Simply having Spark in the cloud isn't enough. What matters is what it can connect to and how easy it is to use.

Over the course of the last couple of years, Apache Spark has enjoyed explosive growth in both usage and mind share. These days, any self-respecting big data offering is obliged to either connect to or make use of it.

Now comes the hard part: Turning Spark into a commodity. More than that, it has to live up to its promise of being the most convenient, versatile, and fast-moving data processing framework around.

There are two obvious ways to do that in this cloud-centric world: Host Spark as a service or build connectivity to Spark into an existing service. Several such approaches were unveiled this week at Spark Summit 2016, and they say as much about the companies offering them as they do about Spark's meteoric ascent.

Microsoft

Microsoft has pinned a growing share of its future on the success of Azure, and in turn on the success of Azure's roster of big data tools. Therefore, Spark has been made a first-class citizen in Power BI, Azure HDInsight, and the Azure-hosted R Server.

Power BI is Microsoft's attempt -- emphasis on "attempt" -- at creating a Tableau-like data visualization service, while Azure HDInsight is an Azure-hosted Hadoop/R/HBase/Storm-as-a-service offering. For tools like those, lacking Spark support is like a bike lacking pedals.

Microsoft is also rolling the dice on a bleeding-edge Spark feature, the recently revamped Structured Streaming component, which allows its data to stream directly into Power BI. Structured Streaming is not only a significant upgrade to Spark's streaming framework, but also a competitor to other data streaming technologies (such as Apache Storm). So far it's relatively unproven in production, and it already faces competition from the likes of Project Apex.

This is more a reflection of Microsoft's confidence in Spark generally than in Structured Streaming specifically. The sheer amount of momentum around Spark ought to ensure that any issues with Structured Streaming are ironed out in time -- whether or not Microsoft contributes any direct work to such a project.

IBM

Until now, IBM has leveraged Spark by making it a component of already established services -- e.g., Bluemix. IBM's next step, though, will be to provide Spark and a slew of related tools in an environment that is more free-form and interactive: the IBM Data Science Experience. It's essentially an online data IDE, where a user can interactively manipulate data and code -- Spark for analytics, Python/Scala/R for programming -- add in data sources from Bluemix, and publish the results for others to examine.