Friday, February 20, 2015

Thoughts on Spark, Storm, and classpath isolation

Currently I use Apache Spark and Apache Storm, and in both projects I always end up with a library mess. They depend on libraries that I use in my code as well, but mostly in a newer version. For example, I use Google's latest Guava. Everyone else in the Java world uses Guava too, but most likely in a different version. And Guava constantly adds, deprecates, and then removes functionality. So when I deploy my code to Storm or Spark, I might break something, on my side or on theirs, if I am not careful.

I do not want to go into further detail. It just causes a lot of pain. The best way to stay out of trouble is to stick to the versions that Storm and Spark use and to avoid the latest versions. :-(

I originally come from the servlet container world, where this problem is solved. A servlet container runs web applications, and the dependencies of the container and of the web application are isolated from each other. The container exposes to the web application only those dependencies that the servlet specification requires, e.g. the servlet API.

I would expect Spark and Storm to go a similar way in the future: they would run user code with a custom classloader, and only the Spark or Storm API would be visible to the user code to program against.
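To make that idea concrete, here is a minimal sketch of such a classloader in plain Java. It is not how Spark, Storm, or any servlet container actually implements isolation; the class name `IsolatingClassLoader`, the `host` loader, and the `exposedPrefixes` parameter are all made up for illustration. User code loaded through it sees the JDK, its own jars, and only those host packages that were explicitly exposed.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Sketch: user code sees its own jars plus only those host packages
 * that were explicitly exposed. All names here are hypothetical.
 */
public class IsolatingClassLoader extends URLClassLoader {
    private final ClassLoader host;          // loader of the framework itself
    private final String[] exposedPrefixes;  // e.g. the framework's public API packages

    public IsolatingClassLoader(URL[] userJars, ClassLoader host, String... exposedPrefixes) {
        // null parent: only JDK bootstrap classes are inherited. On Java 9+ one
        // could pass ClassLoader.getPlatformClassLoader() to also see platform modules.
        super(userJars, null);
        this.host = host;
        this.exposedPrefixes = exposedPrefixes;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                c = isExposed(name)
                        ? host.loadClass(name)          // API classes come from the host
                        : super.loadClass(name, false); // everything else: JDK, then user jars
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    private boolean isExposed(String name) {
        for (String prefix : exposedPrefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Convenience check: can user code loaded here see the given class? */
    public boolean canSee(String className) {
        try {
            loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

With a loader like this, the framework's own Guava would stay invisible to user code as long as `com.google.common.` is not among the exposed prefixes, and the user's Guava would be loaded from the user jars instead.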

By the way, a simple workaround at the moment is to change the classpath manually. As far as I know, Spark has an option to say explicitly: put user dependencies first on the classpath. And in Storm, one can rewrite the shell script to put the user dependencies first.
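In Spark, to my knowledge, the relevant settings are `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst` (marked experimental; older releases used `spark.files.userClassPathFirst`). What "user dependencies first" means at the classloader level can be sketched in a few lines of Java. This is only an illustration of child-first delegation, not Spark's actual code, and the class name `ChildFirstClassLoader` is an assumption of this sketch.

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Sketch of child-first delegation: look in the user jars before asking
 * the parent, which is the reverse of Java's default order.
 */
public class ChildFirstClassLoader extends URLClassLoader {

    public ChildFirstClassLoader(URL[] userJars, ClassLoader parent) {
        super(userJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core JDK classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        c = findClass(name);              // user jars first
                    } catch (ClassNotFoundException e) {
                        c = super.loadClass(name, false); // then the usual parent-first path
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

Note that the framework's classes still come from the parent loader, so a framework class and a user class can end up holding two different `Class` objects for the same Guava type.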

But those workarounds have a drawback: they might break Storm's or Spark's own dependencies.
They also assume that the classloader searches for classes in a defined order, from left to right in the classpath. The JVM's default application classloader does search in that order, but I do not think one can rely on it in general. There might be a custom classloader that scans the classpath in parallel. In that case, the mess might become even worse if there are two versions of Guava on the classpath.
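Whether two versions of a library really sit on the classpath can at least be checked. The following small Java sketch lists every location from which a given class could be loaded; the class name `ClasspathDuplicates` and the method `locationsOf` are made up for this example, and `com.google.common.base.Optional` is just one Guava class picked as a default probe.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Sketch: find all classpath entries that contain a given class. */
public class ClasspathDuplicates {

    /** Returns every location from which the loader could load the class. */
    public static List<URL> locationsOf(String className, ClassLoader loader) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        // getResources returns all matches, not just the first one that wins.
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        String name = args.length > 0 ? args[0] : "com.google.common.base.Optional";
        List<URL> hits = locationsOf(name, ClasspathDuplicates.class.getClassLoader());
        for (URL hit : hits) {
            System.out.println(hit);
        }
        if (hits.size() > 1) {
            System.out.println("More than one copy of " + name + " is visible.");
        }
    }
}
```

Running this with the same classpath as the deployed job should show whether more than one Guava jar is visible; it does not tell which copy a particular classloader will actually pick.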

(This topic actually touches on something that I have been interested in for quite some time. The virtual world, and definitely the real one, is constantly changing, and with it the libraries and interfaces that we use. One can barely rely on stability. One would have to constantly recompile the world against old and new interfaces to see whether progress is possible without breaking anything. This tickles my mind a lot. On the one hand, I love order, stability, predictability, reproducibility. I like to conserve. On the other hand, that is a very static view of the virtual world, where we tear things down all the time, replace them, or build them differently.)