Thursday, 9 July 2015

Podcast Interview with Jaroslaw Palka, Allegro Group

Waw. Since we started doing these Neo4j Graph Database podcasts, I have spoken to 30 (!!) different people. Ah-may-zhing!!! They have been truly wonderful conversations, all of them, and I have truly enjoyed this ride :) ...

Today I am publishing the 30th episode, which is a great conversation with Jaroslaw Palka, of Allegro Group. Jaroslaw is a long-time member of the Neo4j ecosystem, with a lot of interesting perspectives on it. Here's the recording:

Here's the transcript of our conversation:

RVB: Hello, everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here I am again recording a podcast for our graph database, Neo4j podcast series. Today I have a guest joining me on Skype, all the way from Poland. Hello, Jaroslaw, Jaroslaw Palka.

JP: Hello.

RVB: Hi, and thanks for joining me. I appreciate it. Very cool. Jaroslaw, you've been in the Neo4j ecosystem for a while. Do you mind introducing yourself, and what's your relationship to the wonderful world of graph databases?

JP: I work in Krakow, and I have been working with the JVM and Java since, I think, '99. I worked for many years as an architect and coach, doing executive trainings for people and separate organizations. My journey with graphs, I think it started in 2005 or '06 - it's hard to remember all the dates [chuckles] - when, at one of the organizations, we were trying to migrate a large database which was supporting online flight shopping - so basically searching for flights, the best flights, the shortest or the cheapest flights - and we were trying to migrate from the relational database to graphs. Because what we found out is that basically the structure we were working with is a graph, and the problems we were solving are typical graph problems: finding the shortest, the cheapest, but it is not always the shortest path. Sometimes you are looking for the quickest flight so you don't have a lot of stops, or you are looking for the cheapest flight, or you are looking for a flight with specific airlines or--
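To make the flight-search point concrete, here is a minimal sketch of how "cheapest" and "fewest stops" can be the same graph search with a different cost per edge. It is an illustration only, not the system described in the interview: the airports, prices, durations, and the `best_route` helper are all made up for the example, and it uses plain Dijkstra over an in-memory list rather than a graph database.

```python
import heapq

# Hypothetical flight data: (origin, destination, price, duration_hours)
FLIGHTS = [
    ("KRK", "WAW", 50, 1.0),
    ("WAW", "LHR", 120, 2.5),
    ("KRK", "LHR", 200, 2.5),
    ("KRK", "FRA", 60, 1.5),
    ("FRA", "LHR", 70, 1.5),
]

def best_route(flights, source, target, cost):
    """Dijkstra's algorithm; `cost` maps a flight tuple to a non-negative number."""
    graph = {}
    for flight in flights:
        graph.setdefault(flight[0], []).append(flight)
    queue = [(0, source, [source])]  # (total cost so far, airport, path taken)
    settled = set()
    while queue:
        total, airport, path = heapq.heappop(queue)
        if airport == target:
            return total, path
        if airport in settled:
            continue
        settled.add(airport)
        for flight in graph.get(airport, []):
            heapq.heappush(queue, (total + cost(flight), flight[1], path + [flight[1]]))
    return None  # no route exists

# Same graph, same search, different notion of "best".
cheapest = best_route(FLIGHTS, "KRK", "LHR", cost=lambda f: f[2])      # sum of prices
fewest_stops = best_route(FLIGHTS, "KRK", "LHR", cost=lambda f: 1)     # number of legs
quickest = best_route(FLIGHTS, "KRK", "LHR", cost=lambda f: f[3])      # flying time, ignoring layovers
```

With this toy data the cheapest route goes via FRA, while the route with the fewest legs is the direct flight, which is the distinction made above: the "best" path depends on which edge property is being minimized.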

RVB: So that use case sort of got you going in the world of graph databases. You've done some other use cases as well, right? You were telling me about recommendations, access control, all that wonderful stuff as well.

JP: Yes. The problem is that when you start with it, from my perspective, after first project, I started to see graphs everywhere, and everything in my life started to be either a node or an edge [chuckles].

RVB: I know the feeling, Jaroslaw.

JP: Yeah, I think this is one of the so-called dangers of graph databases and thinking in graphs: that it is pretty easy to translate your problem into the structure of a graph. I think this is the most appealing thing for me, that you don't need specialized training and you don't need to read tons of books about the model, because it is so natural to think about things this way.

RVB: That sort of leads me into the second topic that I always ask on this podcast series. What do you like about graphs? What is so good about it in your opinion? I'm hearing the modeling advantages that you just mentioned right there. Want to give us your perspective there?

JP: Yeah, sure. So first, modelling: it is important but also quite easy. If you don't get into too much detail about directed and undirected graphs, hypergraphs and all this stuff, and just focus on graphs, you can explain to a nontechnical person how it works. It's pretty easy. When you work with business people, you soon get a common vocabulary. It is really easy to explain. You don't need advanced modelling tools, just whiteboards and a brain to start drawing and planning the graph.

JP: The second thing, which is really close to my heart: I truly believe in emergent architecture, so I don't believe we can plan everything ahead of time and know how it will work when the requirements change fast and the customer sometimes doesn't really know what he needs and we discover what he needs. One wonderful thing is that I can build my initial structure with all its connections and nodes, and over time I can evolve the structure of the graph. The one thing I especially like is that graphs really start pretty simple, but as I start to write queries and add Cypher to it, I start to see that there are actually connections I hadn't seen, so I can now materialize those connections and enrich my graph with additional connections or additional nodes, just to make sure that my Cypher query is the fastest possible query I can have.
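The idea of materializing a connection discovered while writing queries can be sketched as follows. This is a toy in-memory illustration under stated assumptions, not Neo4j or Cypher: the `BOUGHT` and `ALSO_BOUGHT` relationship names, the sample data, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# edges[rel_type][node] -> set of neighbours; a toy in-memory stand-in for a graph store.
edges = defaultdict(lambda: defaultdict(set))

def connect(a, rel_type, b):
    """Add a directed relationship a -[rel_type]-> b."""
    edges[rel_type][a].add(b)

# Initial, simple structure: customers and the products they bought.
connect("alice", "BOUGHT", "book")
connect("bob", "BOUGHT", "book")
connect("bob", "BOUGHT", "lamp")

def also_bought_two_hop(product):
    """Walk product <-BOUGHT- customer -BOUGHT-> other product on every call."""
    buyers = {c for c, items in edges["BOUGHT"].items() if product in items}
    return {other for c in buyers for other in edges["BOUGHT"][c]} - {product}

def materialize_also_bought():
    """Store the two-hop result as a direct ALSO_BOUGHT relationship."""
    products = {p for items in edges["BOUGHT"].values() for p in items}
    for product in products:
        for other in also_bought_two_hop(product):
            connect(product, "ALSO_BOUGHT", other)

materialize_also_bought()

# After materializing, the question becomes a single lookup instead of a traversal.
direct = edges["ALSO_BOUGHT"]["book"]
```

The trade-off in this sketch is the usual one for derived data: reads get cheaper, but the extra relationships have to be kept in sync whenever the underlying `BOUGHT` data changes.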

RVB: There's actually a really good match with things like Agile development methodologies and those types of things. Is that what I'm hearing?

JP: Yes, this is for me really important, that I don't have to plan everything ahead and I can build the queries, build the database, as I need, as the product changes.

RVB: I think that's a really cool perspective. I think you're totally right about it so I really appreciate that. Where do you think this is going, Jaroslaw? Do you have any wishes or ideas about where this technology should be going in the next couple of years? Anything you want to throw in there?

JP: Yeah, sure. I think that the biggest challenge, and it's not only a Neo4j problem because, let's be honest [chuckles], it's not the only graph database engine in the world, but for me, it's the best because I really like Cypher and this is one of the best decisions to the language based on others to query a database. I think the problem with the graph model - and we all need to think really hard about it - is the size of the data sets we are dealing with. As you know, we don't have a good way to split a graph into sub-graphs and put those on separate machines, and because of the connected nature of the data, we need to be able to squeeze our data set onto one machine.

JP: Yeah, yeah. We don't have a good program for it developed still so [chuckles] no strong theoretical foundation. I think the guys from academia need to meet with the people that work with graphs...

RVB: Well, a lot of work has been done around partitioning graphs specifically. In the general case it's extremely difficult, and you can actually almost prove that it's impossible. But there has been a lot of work, also at Neo4j, on coming up with partitioning algorithms that will be specific to your domain. So if you would tell us more about your data, then we would be able to make much more sensible judgments about which data should go on which machine. As you probably know, we've done our homework already and we're hoping that will lead into a product in a future version of Neo4j. But it is early days still. It's a very complicated [crosstalk]--

JP: You know, it is pretty easy to partition because you will treat your graph database, Neo4j database, as your usual data source for a separate application, so I think it would be easier, but if you have the partitioned data and you want to run the shortest possible path over the whole graph, that can be tricky [chuckles].

RVB: Yeah, as soon as you hit the machine boundary you have a problem, right? So it's a very difficult problem to solve. We are trying to make a solid dent into that problem and we're-- there's a lot of work going into that. People like Jim Webber are really actively involved with that. But I think it's a difficult one from multiple perspectives. This is just my personal perspective, but on the one hand, there's this hugely complicated problem, and on the other hand you have a situation where the vast majority of users and clients don't really need that. You know what I mean?

JP: Yes, that's true. And one important thing, for example on my first Neo4j project, is that we pushed all of the data we had in SQL, because it was basically a migration from SQL to Neo4j, so we pushed everything to Neo4j, and I think it was the biggest single mistake-- one of the biggest single mistakes in my life, because actually you don't need everything. So I truly believe in polyglot persistence: you push to the graph only the data you will need during the queries. And all the additional, heavy things, you can have in a separate store that you can manage. For us, fortunately, we are in a place where people start to think that having two, three different databases in a single system is not a bad thing, so that's what we see at the moment with Neo4j.
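A minimal sketch of that split, assuming a made-up product catalogue: the graph side holds only the small properties and relationships that queries traverse, while a separate store holds the bulky payload under the same id. The dictionaries below are stand-ins for two different databases, and every name in the example is hypothetical.

```python
# Hypothetical split: the graph keeps only what traversals need,
# a separate key-value store keeps the bulky payload.
graph_nodes = {}      # node id -> small properties used in queries
graph_edges = {}      # node id -> set of related node ids
document_store = {}   # node id -> heavy payload, stand-in for another database

def save_product(product_id, category, description, related=()):
    """Write the query-relevant part to the graph and the rest to the other store."""
    graph_nodes[product_id] = {"category": category}
    graph_edges.setdefault(product_id, set()).update(related)
    document_store[product_id] = {"description": description}

def related_in_category(product_id, category):
    """Run the graph query first, then fetch payloads for the matching ids only."""
    ids = sorted(n for n in graph_edges.get(product_id, ())
                 if graph_nodes.get(n, {}).get("category") == category)
    return [(n, document_store[n]["description"]) for n in ids]

save_product("p2", "books", "A long description of a graph database book ...")
save_product("p3", "lamps", "A long description of a desk lamp ...")
save_product("p1", "books", "A long description of a Java book ...", related=("p2", "p3"))

result = related_in_category("p1", "books")
```

In this sketch the shared id is the only link between the two stores, so the graph stays small and the heavy data is read only for the handful of nodes a query actually returns.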

RVB: That's completely along the lines that we're thinking. Combine different data stores for different problem sets and have a much more task-oriented setup. That's very much a recurring design pattern, I think.

JP: Yeah, so basically this is the pattern I see with this. Neo4j is quite often used as a supplemental database. I know what a tricky index what we can do [chuckles] so people can remain as SQL databases and they are thinking the many Neo4j instances and we are asking different questions because of the different structures in place. So at the moment this is where I see organizations are planning with Neo4j. As an engine, you can ask it really tricky questions, so if you are asking about the future, I think the next step is to push organizations to think that actually your graph database can be your master data. Because at the moment it's mostly SQL, and the database - the name we shouldn't use [chuckles] - is comfortable in this space, and all the things like Neo4j, Cassandra, are just supplementary to the SQL model.

RVB: Very good. Thank you so much for sharing your thoughts on that. I think that was very, very interesting and useful. I really appreciate it, and I think we're going to wrap up the podcast now. I look forward to speaking to you again at one of the future events. Thank you, Jaroslaw.