At this year’s Strata + Hadoop World, Syncsort’s Paige Roberts caught up with Jules Damji (@2twitme), the Spark Community Evangelist for Databricks, for a long conversation. In this second post of our four-part interview, they discuss how the Spark and Hadoop technologies and communities are merging over time, and how that’s creating a world straight out of a science fiction novel, where artificial intelligence is becoming commonplace.

Paige Roberts: One thing I’ve noticed over the last few years is that, to a certain extent, the Spark and Hadoop communities seem to be merging. We just had a Hadoop-focused conference, and yet half the sessions were about Spark. Why do you think that is?

Jules Damji: Apache Spark is such an integral part of Big Data because it allows people to process large-scale data very quickly. It allows people to run different workloads on a single, unified engine. That’s one of the main attractions.

If you look at the history of Big Data, you had all these different systems and you had to stitch them together to do your end-to-end job pipeline. It was difficult. You had to learn five different systems.

Another reason people are rallying around Apache Spark is that it works very well with the Hadoop ecosystem. You can store your data in HDFS or S3 or whatever. The API works well with the storage level. It works well with the applications. Apache Spark talks to BI tools, to Sqoop, to all these third-party data ingestion tools. And it can be deployed in different environments as well. You can have it running on YARN, on its own cluster, or on Mesos.

These dimensions of Apache Spark’s flexibility make it an integral part of Hadoop or Big Data in general. Today there’s not a single conversation that’s happening in the world where Big Data and Apache Spark are not mentioned in the same sentence.

Roberts: Right. I see that, too.

Damji: We are in the Big Data era. Data is coming in fast, and we need real-time, end-to-end solutions. If I get data, I should be able to make a decision fast. I should be able to consult my machine learning model in a split second, or interact with my stored data. One of the things that Apache Spark provides through Structured Streaming is the ability to write a continuous application.

Today, you heard Reynold Xin speak about the ability to write fault-tolerant applications that let you interact with streaming data and query it as if you were querying your old, static data. It gives you the ability to do ad hoc analysis on the fly. Before, you could only do this long after you finished collecting the data. Now, you can do it instantly. That’s one thing.

The other thing I see is that Artificial Intelligence (AI) has come to the fore, and Spark is going to play a big role in democratizing Big Data and AI.

Yes, exactly: self-driving cars, image and voice recognition, recommendation engines, and so much more. At the center of all that is the ability to do advanced analytics quickly. The ability to employ popular frameworks like TensorFlow with Apache Spark, to do machine learning at scale with Apache Spark’s library, to build deep neural networks, and to do computational analysis quickly. That brings us into this new era of Artificial Intelligence. Some of these AI systems used to be science fiction. Now, they are taking realistic form.

The science fiction novels that I read as a kid are now old hat. Yeah, we did that last year.

You will see Apache Spark playing a more and more integral role in this Big Data and Artificial Intelligence era, what I call the Zeitgeist of Big Data. At the core is the ability to process a lot of data fast, to manage large clusters seamlessly, to transform data at immense speed, and to process myriad kinds of data, such as text, video, unstructured, and structured data, all through a single processing engine like Spark.

Streaming, batch, …

We’re streaming, we’re doing batch. Before, all these different systems had different formats of data, and different engines.

Yeah, different engines, and different APIs…

Right. But now you have a unified API. You have workloads that run on the same engine, and that makes things a little easier. It’s the stepping stone to this powerful digital revolution. No previous industrial revolution produced so many fast-moving technology trends and innovations as this digital one. In less than a few years, I mean, look at what we are going through with Apache Spark.

Yeah, it’s amazing!

And 10 years from now you might have something else that’s different from Spark, but for the next five, we will see Apache Spark growing. We’ll see more and more intelligent applications built on top of the machine learning techniques that Apache Spark facilitates and catalyzes. And we’ll see huge performance improvements.

Tungsten is Spark’s second-generation execution engine, delivering 10X to 40X the performance. The need is there, and the need is not new. What’s new is data coming in at enormous velocity, so you need the capacity to process it instantly. To do that, you need very performant distributed systems. And I think you and I are both living in the heart of this data Zeitgeist.

This revolution has been a lot like being in the center of a tornado. Everything around you changes so quickly. So, how do you like the Strata conference?

Oh, this has been wonderful. Like you said, this is a Big Data Hadoop conference, and seeing how many Apache Spark talks there were was amazing. It’s a testament that Spark is an integral part of Big Data.

It certainly is, yes.

It is. Spark Summit is growing fast, too. Big Data and Apache Spark have become a very symbiotic relationship. It’s very complementary. You can’t really talk about Big Data and not talk about Apache Spark.