Back in March at Strata + Hadoop World, Syncsort’s Paige Roberts caught up with Jules Damji (@2twitme), the Spark Community Evangelist for Databricks, for a long conversation. In today’s first part of this four-part series, we look at a theme that came up again and again: the importance of the Apache Spark community to the continued development of Apache Spark, including the depth of Databricks’ relationship with that community and how it affects the development of Spark over time.

Paige Roberts: Let’s start by having you introduce yourself.

Jules Damji: As the Spark Community Evangelist at Databricks, my job is to reach out to the developers of Apache Spark and evangelize both the benefits of Apache Spark and of the Databricks Unified Analytics Platform.

More importantly, my job is to listen to the community: it’s a two-way conversation. I think that’s an important part of any advocacy. If you really want to win the hearts of the developers, you must understand what their needs are. That’s the crucial link that I provide between the community and the Spark engineering group within Databricks. It’s an important role and I just love it.

Roberts: For three years, I had the title of Hadoop and Analytics Evangelist, so I totally understand the role. How do you perceive your relationship to the Apache Spark community?

Damji: When Databricks was founded, we realized that Apache Spark was this new big thing. We wanted to give it to the community. We wanted to make sure that the community could contribute to it. We believed from early on that innovation happens in collaboration, not in isolation.

So even though we produce and contribute a lot of the code, we also take a lot of the community code, and one of the ways we do that is through evangelism. I go out there and go to meet-ups. I listen a lot to find out what the community wants, and I bring those requests from the developers and community back as feedback to the Databricks engineering team. Then Databricks and the community open pull requests on GitHub to build those new features.

So, what Databricks as a company brings to the community is stewardship: ascertaining, with the help of the community, what new features are needed, and ensuring those features get in through the release cycles run by the PMC release managers. Like today, we attended the Structured Streaming talk, and there were a lot of things that came up that we hadn’t thought about. We hear from our customers, we hear from our vendors, we hear from the community, and we hear from developers who are working with other customers who are building these applications.

Roberts: Can you give me an example of a feature like that?

Damji: Okay. For example, today, watermarking is a feature that came about from a customer need. What do you do with events after you drop them? In certain fraud detection scenarios, we want to keep the events. Suppose you need a certain event after 10 minutes. What we do right now is just drop those events. But there might be a need for auditing, for example, in the government sector.

Roberts: Good point.

Damji: What if there was a dispute of some sort? If you don’t have that particular record, because you dropped it, there’s no way to reclaim it. There’s no way to investigate. So, having those watermark events is one example of where the community raises the issue and we have the ‘Aha’ moment and we say, “Yes, we should implement that.” That’s a big thing that Databricks does. Databricks brings leadership and stewardship to the Apache Spark community.
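
Editor’s note: to make the watermarking discussion concrete, here is a minimal sketch in plain Python. It is not the Spark Structured Streaming API and does not reproduce how Spark itself handles late data. It only illustrates the idea raised above: events that arrive later than an allowed delay are set aside for auditing rather than discarded. The names `Event`, `WatermarkRouter`, `max_lateness`, and `audit_log` are hypothetical, and the 10-minute threshold simply mirrors the example in the conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    key: str
    event_time: datetime  # when the event happened, not when it arrived


@dataclass
class WatermarkRouter:
    """Toy watermark logic (hypothetical helper, not part of Apache Spark)."""
    max_lateness: timedelta
    max_event_time: Optional[datetime] = None
    accepted: List[Event] = field(default_factory=list)
    audit_log: List[Event] = field(default_factory=list)

    def process(self, event: Event) -> bool:
        """Return True if the event is on time, False if it is too late."""
        # Track the latest event time seen so far.
        if self.max_event_time is None or event.event_time > self.max_event_time:
            self.max_event_time = event.event_time
        # The watermark trails the latest event time by the allowed lateness.
        watermark = self.max_event_time - self.max_lateness
        if event.event_time < watermark:
            # Too late for normal processing: keep it for auditing
            # instead of dropping it.
            self.audit_log.append(event)
            return False
        self.accepted.append(event)
        return True


# Usage with an arbitrary start time chosen only for illustration.
t0 = datetime(2017, 1, 1, 12, 0)
router = WatermarkRouter(max_lateness=timedelta(minutes=10))
results = [
    router.process(Event("a", t0)),                          # on time
    router.process(Event("b", t0 + timedelta(minutes=20))),  # advances watermark to 12:10
    router.process(Event("c", t0 + timedelta(minutes=5))),   # older than watermark: audited
    router.process(Event("d", t0 + timedelta(minutes=15))),  # within 10 minutes: on time
]
```

In this sketch, event `c` ends up in `router.audit_log` rather than disappearing, which is the kind of record a dispute or audit would need.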

Roberts: What else does Databricks bring to the table?

Damji: Well, if you run our Databricks Unified Analytics Platform, we have the core competency to make your Spark experience the best, in more ways than one. And obviously, there are some additional technologies and benefits that come with the Databricks platform, added on top of Apache Spark, that make the Unified Analytics Platform the best place to run Apache Spark. But at the core, it will always be powered and anchored by Apache Spark. It will always work in conjunction with the community. The whole premise of Databricks from the beginning is: community is vital to the vibrancy of any technology.