Shortly after Strata in New York last year, Syncsort Big Data Product Manager Paige Roberts caught up with Sean Anderson, who is in charge of product marketing for data science and engineering at Cloudera. In earlier parts of this interview series, he provided a lot of information on new features and improvements in Spark 2.0, including Spark on the Cloud and Spark Structured Streaming. In this third and final portion of the interview, Sean Anderson dug into two new projects, Apache Livy and Apache Spot.

Sean Anderson: Recently, we helped build and launch an open source project called Apache Livy.

Livy is an open-source REST service for Apache Spark jobs, and it has some great features along the lines of remote snippet and job execution. So I can take a very specific snippet of code, and use Apache Livy as a web interface to get that into my Spark context.

Paige Roberts: Wow. I didn’t know about that one at all. That’s brand new to me. So, is it sort of Oozie-like or … Can you give me some more detail?

Anderson: It’s an open source REST service for Apache Spark. It’s in Cloudera Labs right now, and it’s up for inclusion at Apache.org. We think it will be included in the Apache projects pretty soon.

Apache Livy is a REST service. It’s specifically valuable for long-running Spark contexts where multiple production jobs are present. It gives you the ability to manage the multiple contexts simultaneously. You can run them on the cluster via YARN using the Livy service if you want to get better fault tolerance. You can submit jobs, and there’s some better security integration for that as well.
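As a rough illustration of the remote snippet execution Anderson describes, here is a minimal Python sketch of the two HTTP requests involved: one that asks Livy to start an interactive session (a Spark context), and one that submits a code snippet to that session. The `/sessions` and `/sessions/{id}/statements` paths follow Livy's documented REST API. The host and port, the helper names, and the sample snippet are assumptions made for illustration, and none of this comes from the interview itself.

```python
import json
from urllib import request

# Assumption: a Livy server reachable locally on port 8998.
LIVY_URL = "http://localhost:8998"


def build_post(path, payload, base_url=LIVY_URL):
    """Build (but do not send) a JSON POST request for the Livy REST service."""
    return request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """Request asking Livy to start an interactive session of the given kind."""
    return build_post("/sessions", {"kind": kind})


def run_snippet_request(session_id, code):
    """Request that runs a code snippet inside an existing session."""
    return build_post(f"/sessions/{session_id}/statements", {"code": code})


def send(req):
    """Send a prepared request and decode the JSON reply (needs a live server)."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running server, one would call `send(create_session_request())`, read the session `id` from the reply, wait until the session reports that it is idle, and then call `send(run_snippet_request(session_id, "sc.parallelize(range(100)).sum()"))`. Building the requests separately from sending them keeps the sketch easy to inspect without a cluster.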

Roberts: How do you see that getting used?

Anderson: For us, at Cloudera, it’s really all about: How do we make sure that we can iterate and develop on Spark workloads that are existing, without having to take them out of production? Livy allows us to do that in a pretty nice, elegant way.

Roberts: I’ll have to do more research on that one. I was just talking to Doug Cutting at the Cloudera customer award ceremony, and he told me about Spot, which I had not heard of before. It seems like all I’ve got to do is talk to you guys, and I get great name ideas for my dogs, and also learn about all the new cool tech! [laughs]

Anderson: [laughs] Yes! Apache Spot was previously called ONI, or Open Network Insight. It’s a pretty cool project. It’s something we launched in collaboration with Intel, and it is now incubating.

Spot aims to be a common platform for cyber security and network intrusion detection, with Hadoop as the underlying platform. Traditionally we see SEM systems that only cover a couple of network endpoints. But increasingly, people need the ability to ingest massive amounts of data, and to coalesce that with other sources. So Apache Spot is pretty nice, and it is gaining traction in record time.

Roberts: So it supports multiple endpoints, not just particular types of hardware?

Anderson: Right. In the same way that you can mix sources from an analyst’s perspective, you can do that with Apache Spot. So you may have network flows, you may have DNS, you may have proxy logs. They’re all streaming into a centralized system. There may be some machine learning or some event monitoring that’s happening on that system. Then you have the ability to operationalize those into scoring systems, stuff like that. It really helps you build these robust cyber capabilities.

Roberts: That sounds like something that’s going to make a lot of IT ops guys happy. So, are these new projects where you guys are putting your energy these days?

Anderson: Yeah. For Cloudera, things like streaming, machine learning and Spark in the Cloud are going to be big areas of focus. We see these as really robust capabilities that are not only evolving the Spark ecosystem; we also now have production customers that demand very specific streaming performance, or that have high demands on the number of machine learning algorithms they can launch. So that’s just going to be a huge focus for us moving forward. We often follow the lead of our customer heroics, and we see them gravitating to streaming and machine learning solutions.