The new major version release of Spark has been getting a lot of attention in the Big Data community. One of the most significant strides forward has been the introduction of Structured Streaming.

At the last Strata + Hadoop World in San Jose, Syncsort’s Big Data Product Manager, Paige Roberts sat down with Reynold Xin (@rxin) to get the details on the driving factors behind Spark 2.0 and its newest features. Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. He had just finished giving a presentation on the full history of Spark from taking inspirations from mainframe databases to the cutting edge features of Spark 2.0.

In part 1 of this interview, he talked about some of the driving factors behind this major change in Spark. In today’s part 2, Reynold Xin gives us some good information on the differences between stream and Structured Streaming; how to integrate Structured Streaming with Apache Kafka; and some hints about the future of Spark.

Reynold Xin: Sure. In many ways, you can think of it as the RDD API and the DataFrame API. The old stream was built on top of the RDD API. The way it works is to keep re-running the RDD API over and over again. This fact is reflected in the API itself.

Whereas Structured Streaming moves the API higher level. It just asks the user, “What kind of business logic do you want to happen?” And then the engine will automatically incrementalize the operation. For example, in Structured Streaming, you just say, “I want a running sum on my data.” Versus if you go back to Spark Streaming, you have to think, “How do I compute a running sum? Well, the way I compute a running some is for each of the batches to compute a sum, and then I’ll be summing them all up myself.” So, this is one big difference.

The other big difference is Structured Streaming takes the transactional concept as a first-class citizen. Essentially, the users don’t have to worry about how to guarantee exactly once delivery. And, data integrity is a first-class concern. We made it very difficult for users to screw it up. Because in the past we have seen with Spark Streaming, while exactly once was possible to do, often the users would screw up data integrity accidentally.

Last, but not least, the API and the integration with the batch component, it just works with batch. The API is the same, so you can write to the same data destination. You can read it back directly. You get stuff that actually makes sense. It’s just a lot easier.

Roberts: Yeah. Okay, um, so again more ease of use and I see the emphasis on data integrity. You also get the batch and streaming together. That’s nice, that you don’t have to re-rewrite. I guess before Spark Streaming was very micro-batch, and it sounds like Structured Streaming can do true streaming processing?

Xin: From the API point of view there is no concept of a batch. Now from the execution engine point of view, it is still going through micro-batching. But the comparison of micro-batch and true streaming is kind of a misnomer in my mind. Typically, people think if you do event at the time, that’s real streaming.

If you do three events at a time …

Right, if you do a bunch of events at a time, it’s batch. But in reality, everything else out there is batch. There’s no true event-at-a-time engine, because processing one event at a time has enormous overhead. It’s basically not practical when you have a huge amount of data. So, every other engine actually batches to some degree.

Alright. Okay, so I saw in your presentation that you guys integrate with Kafka really well.

Yeah.

Is that integration difficult to accomplish? I mean if you’re feeding into Kafka, you’re going into Structured Streaming, and you’re using the DataFrame API, is there a special procedure or something?

It just works out of the box. We took care of all the details so the user doesn’t have to worry about it. All they need to do is spark.readstream and then the Kafka stream information, and put in the topic you want to subscribe to, and now you’ve got a DataFrame.

That’s really simple.

In many cases, it even automatically infers a schema. For example if you have JSON data coming in, Spark will infer the schema automatically. You usually don’t have to even declare it. Although for sanity’s sake, I would recommend users do declare it. Because you can’t guarantee that your data will be perfect and you’d want the right error to surface when they are not.

Does it integrate at all with the schema registry for Kafka?

It doesn’t currently integrate with that part. I think that part is fairly new.

It is very new, yeah. Okay well, you mentioned the Tungsten project and I saw you said something about in the future, it’s looking to move into other execution platforms like GPUS, and other ways to make that even more efficient, or at least more flexible. How far ahead is that? Is that like way ahead or…

It’s in an exploratory stage right now. And there are also different teams outside of Databricks looking at that. It’s a pretty major project. We should only do it if it makes sense. For example, sometimes it might not make sense at all because all the data is not in a specific part of the processing but rather for example, reading IO. What we have found, at least for a lot of the Databricks projects, IO is currently the biggest bottleneck, so we have a lot of work that’s coming that will address IO performance. And then uh maybe processing becomes the next bottleneck in there.

Yeah, it seems to ricochet back and forth. It’s CPU bound, now it’s I/O bound, now it’s CPU bound …

When you optimize one more, the other one becomes more devolved.

Yeah. So, is there anything exciting coming up fairly soon that you would like to talk about?

Oh, yeah. For open source Spark, I think there are a few things. One is that Structured Streaming will GA, hopefully soon. It will be a pretty important milestone of the project. Another thing is we are looking at how we can make Spark more useable on essentially a single node laptop. This includes, for example, being able to publish Spark to the Python package index. So the users can just go to pip install Pyspark, and Spark shows up on your laptop. It’s becoming more efficient, and this broadens the addressable users. Acquiring that kind of market, but addressable users for the open source project. It would be really nice if, with a single tool, you could process a small amount of data on your laptop, and then when you want to scale up to a larger amount of data, it just runs on the cloud.

Being able to move from laptop to server to cluster to cloud is something Syncsort has been all about for a while now so we’re really happy to see that. We call it Design Once, Deploy Anywhere. So, that’s great to hear! Is there anything else you wanted to mention?