Adventures with KNIME: pipeline wrapping

One of the items that has been on my to-do list for quite some time is to dig back into the KNIME universe, and interface it with the growing collection of cheminformatics algorithms that I have been assembling over the last few years. Given that my company’s software stack has its own workflow/pipelining infrastructure, but no user interface, it makes a certain amount of sense to look into connecting them together.

This isn’t the first time I have undertaken such a project – the first time involved a similar approach, which enabled functional units of code written in SVL from the Chemical Computing Group’s MOE to be “compiled” into KNIME nodes that could be plugged into an existing setup. A lot of Internet time has passed since I handed it off, but the “KNIMOE” project is apparently still going strong, and I have been hearing rumours through the grapevine that the KNIME project itself is maturing quite nicely.

So back to my current predicament: I have a Java-based cheminformatics toolkit that is proprietary, and has no user interface, and no official commercial distribution model. I use it extensively for a variety of skunkworks projects (like slicing & dicing ChEMBL, and scaffold analysis, among many other things) and driving the molsync.com site. The toolkit has its own internal pipelining system, whereby basic units of functionality are coded up as compact Operation classes, and a bunch of these can be strung together as a Workflow. In the absence of a graphical editor, workflows are created by editing XML files with a text editor, which is hardly fun, but still much more convenient than actually writing real code just to string together existing algorithms in a different order.
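To give a flavour of what hand-editing one of these workflow files looks like, here is a minimal sketch of a hypothetical XML definition – the element and attribute names are invented for illustration, since the toolkit’s actual schema isn’t shown here:

```xml
<!-- Hypothetical sketch: element and attribute names are invented;
     the real toolkit's workflow schema may differ. -->
<workflow>
  <operation type="ReadMolecules" id="in">
    <param name="filename" value="input.sdf"/>
  </operation>
  <operation type="CalculateProperties" id="props" input="in"/>
  <operation type="WriteDataSheet" id="out" input="props">
    <param name="filename" value="output.ds"/>
  </operation>
</workflow>
```

Rearranging the order of existing algorithms is then a matter of shuffling a few elements around, which is tedious in a text editor but far less work than writing new code.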

The workflow infrastructure that I built several years ago for my own company’s toolkit differs from most of the open source node-based graphical options insofar as it is designed for streaming only, and nothing else. This is the opposite of the first few major versions of KNIME, which could only operate by running each step sequentially and saving the intermediate data content each time. The true pipelining approach basically shoves everything through a pipe, and doesn’t save anything in the middle: each processing operation passes on each row to the next step as soon as it is done. Intermediate collections are held in modestly-sized buffers, which exist only in transient memory. Pipelines are typically terminated by a file writer of some kind, which saves the final results.

The benefits of a true pipelining architecture are a high degree of parallelism (i.e. multitasking happens for free) and no need to store an extra copy of the data for each node in the workflow, which lends itself toward high efficiency (though not always). There is heated debate as to whether this generally matters, and of course it depends: sometimes the benefits of true pipelining are negligible (e.g. for small datasets), and sometimes caching intermediates is a good thing (e.g. a long calculation at an early step that should ideally not be repeated). The other drawback of true pipelining is that some kinds of operations are incompatible with it: sorting is a good example, since all of the data has to pile up before anything can be handed off to the next node, which makes it blocking by definition.
The decision to go with true pipelining has influenced the design of all the operations so far, because each of the data flow algorithms has been thought through with the very explicit purpose of making sure that complex problems using large datasets can be solved without breaking the paradigm (unless there’s absolutely no other way, though that happens less often than you might think).
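The streaming model described above can be sketched with standard Java concurrency primitives: each operation runs on its own thread and hands rows to the next stage through a small bounded queue, so no full intermediate table is ever materialized, and a slow consumer applies back-pressure instead of forcing the producer to cache everything. The class and method names below are illustrative only, not the toolkit’s actual API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch of true pipelining; not the toolkit's real API.
public class StreamDemo {
    static final String POISON = "\u0000EOS"; // sentinel marking end-of-stream

    // Pump n rows through a small bounded buffer and count what arrives.
    static int run(int n) throws InterruptedException {
        // Modestly-sized transient buffer between the two stages.
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(8);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) pipe.put("row" + i); // blocks when full
                pipe.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start();
        int count = 0;
        for (String row = pipe.take(); !row.equals(POISON); row = pipe.take())
            count++; // each row is handed downstream as soon as it is ready
        producer.join();
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rows processed: " + run(100)); // prints "rows processed: 100"
    }
}
```

Note that the producer never gets more than eight rows ahead of the consumer, which is exactly why a blocking operation like a sort cannot live inside this loop: it would have to drain the entire stream before emitting anything.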

But back to KNIME: the project itself was designed from the beginning to be extensible by anyone and everyone, and for that reason there is a huge library of 3rd party nodes to choose from, many of them from the major vendors in cheminformatics. A generator project uses Java’s capability for introspection to map each of the Operations from the Molecular Materials Informatics toolkit to a programmatically assembled KNIME node (a new Java file and a corresponding XML file). These auto-generated files reference a handful of classes that perform the work of mapping the flow of molecular datasheets into & out of the KNIME tables, and also provide a custom datatype for its own flavour of molecules. The resulting bundle can be copied into a KNIME plugin collection:
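The code-generation step can be imagined along these lines: reflect over a candidate operation class and emit the skeleton of a KNIME wrapper as source text. Everything here is a made-up sketch of the idea – the real generator also has to produce the accompanying XML descriptor and the actual delegation logic:

```java
import java.lang.reflect.Modifier;

// Hypothetical sketch of reflection-driven node generation; all names
// are invented, and the emitted source is only a skeleton.
public class NodeGenerator {
    // Produce the text of a wrapper class for a concrete operation.
    public static String generateNodeSource(Class<?> op) {
        if (Modifier.isAbstract(op.getModifiers()))
            throw new IllegalArgumentException("cannot wrap abstract operation");
        String name = op.getSimpleName();
        return "public class " + name + "NodeModel extends NodeModel {\n"
             + "    // delegate execution to the wrapped " + name + " operation\n"
             + "}\n";
    }

    public static void main(String[] args) {
        // Any concrete class stands in for an Operation in this sketch.
        System.out.print(generateNodeSource(String.class));
    }
}
```

The appeal of this approach is that adding a new operation to the toolkit automatically yields a new KNIME node on the next generation pass, with no per-node boilerplate to write by hand.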

The various operations are made up of a few sources & sinks (for reading & writing files), and the rest are manipulators, which consume at least one stream, and produce at least one. Running the nodes works just the same way as the rest of the KNIME ecosystem: the wrapper system shuts down the true pipelining, and gathers the intermediate data at each junction (for now anyway: it may be levelled up later to use the newer capabilities).

There are also nodes for import/export to convert the native molecule type to the industry standard SDF/Molfile/CTfile format, which means that these extra nodes can easily play nice with the various other cheminformatics plugins.

Each of the wrapped nodes has its own special viewer, which provides a quick way to look at the content:

This is quite similar to using the built in Interactive Table node, which is extended by a custom viewer for the native molecule type:

(Note the insane non-molecule at the top: this is just a nonsense example to make sure appropriate fields are preserved, even when used improperly.)

At the present time, there are a couple more things to sort out before this is useful, but the immediate driver is to allow me to share some of the functionality that my company has accumulated with some of my collaborators. As for where it goes after that, time will tell.