Revision as of 19:27, 26 August 2011

BioPython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs)[1], which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. Notably, Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven to be useful in formulating probabilistic models of protein and RNA structure.

Such a highly useful and powerful library, which has been used in such projects as TorusDBN[2], Basilisk[3], FB5HMM[4][5] with great success, is the result of the long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, currently Mocapy++ has
no provisions for dynamic loading of custom node types, and a mechanism to plug-in new node types that would not require to modify and recompile the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred as MocapyEXT), with the aim to improve the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.

Project objectives

The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.

Source Code

Progress

Definitions

For the uninitiated to the Mocapy++ terminology, it is necessary to understand what ESS computer and Densities are, as these terms shall be used throughout this article.

An ESS Computer is a model of the ESS class that contains the Expected Sufficient Statistics (ESS) for a given node. It implements how a sample (from the Gibbs sampler) should be added and stored, such that estimation of the distributions parameters can be done. An ESS computer class must be implemented for each node type. Also, it must satisfy the ESSConcept requirements.

Densities is a model of Densities class that implements the estimation and sampling procedures of a given distribution. Densities must satisfy the DensitiesConcept requirements.

ESSConcept

In the following table, X denotes an ESS computer class, and u is a mutable value of X.

ESSConcept requirements

expression

return type

pre/post-condition

u.construct(parent_sizes)

void

the appropriate shape of the ESS is defined

u.estimate(ess)

void

adds a sample point to the ESS

The class mocapy::BippoESS is an example model for an ESS computer.

DensitiesConcept

In the following table, X denotes a Densities computer class, v is a const value of X, and u is a mutable value of X.
Note that ptv stand for "parent and this node’s values": the values of the parents nodes and the value of the node to which the method belongs.

DensitiesConcept requirements

expression

return type

pre/post-condition

u.construct(parent_sizes)

void

parameters (mean, covariance, CPD, etc.) are initialized

u.estimate(ess)

void

the parameters of the node based on the given ESS are estimated

u.sample(ptv)

std::vector<double>

returns a sample based on indicated parent values. Note, that subsequent calls to the sample member function may return different values, so this operation is mutable

v.get_lik(ptv, log)

double

parent)

v.get_parameters()

std::vector<MDArray<double> >

returns the distributions parameters

v.get_node_size()

unsigned int

returns the node size

v.get_output_size()

unsigned int

returns the output size of the node

The class mocapy::BippoDensities is an example model for a Densities computer.

MocapyEXT implementation

Embedding the Python interpreter

For a client program to be able to execute Python code, it is necessary to initialize the scripting environment. It is done by invoking the Py_Initialize() function. The interpreter is released by invoking the Py_Finalize() function.

It is important to note, however, that any statement between Py_Initialize() and Py_Finalize() calls could throw an exception. If an exception is thrown, it must be either handled in a try/catch block, or the program must be terminated. Having this in mind, an previous example program could be made more robust by enveloping the statements between Py_Initialize() and Py_Finalize() in a try/catch block. Py_Finalize() safety could not be overstated. Currently Boost.Python has several global (or function-static) objects whose existence keeps reference counts from dropping to zero until the Boost.Python shared object is unloaded. This can cause a crash because when the reference counts do go to zero, there is no interpreter. This poses a question if such a method of initializing the Python interpreter could be considered to be "easy to use", "safe" or even "non-intrusive", which are the major MocapyEXT design tenets.

The solution to this issue that is easy to use, safe and non-intrusive is surprisingly elegant and showcases the RAII (Resource Acquisition Is Initialization) idiom.

The lifetime of the Python interpreter

The Python interpreter is initialized before the entry to the main function, and is released after the exit from the main function. No additional end-user effort is required, except to include the headers that define the necessary Python plug-in wrappers.

Compile/link dependencies

The Python interpreter shall be linked to the client program (not to the Mocapy++ library). The end-user of the MocapyEXT shall be responsible for additional include and/or library paths to the Python headers and libraries that are necessary to compile and link the client program successfully.

MocapyEXT also instantiates its static data members in such a manner that they are instantiated only once and in the library's compilation unit. In the C++ standard it is explicitly stated that

"... in particular, the initialization (and any associated side-effects) of a static data member does not occur unless the static data member is itself used in a way that requires the definition of the static data member to exist."[cpp_std2003, temp.inst]

The user shall not need to manage the instantiated static data members.

Thread safety guarantees

The MocapyEXT plug-ins have the following thread-safety guarantees:

it is safe to call plug-in member functions from multiple threads;

it is safe to perform concurrent read-only accesses of the passed parameter containers;

it is safe to perform concurrent read/write accesses, if there is only one read or write access to the passed factual parameters at a time.

The module search path

When a module foo that contains ESS or Densities definitions is being imported, the embedded Python interpreter searches for a file named foo.py in the list of directories specified by the environment variable PYTHONPATH and the current path.

Before proceeding to import a module, sys.path list is explicitly appended with the current path. A user that relies on scripts being in some other directory than a client program, should list the paths in the PYTHONPATH environment variable.

Note that because the interpreter also searches in an installation-dependent default path, it is important that the file containing ESS or Densities definitions not have the same name as a standard module.

Plugin registration

The minimal interface import example shows the usage of three plug-in classes: densities_plugin, ess_plugin and aggregate_plugin. The intended usage pattern for densities_plugin and ess_plugin are cases when types ESS and Densities are implemented in separate files -- one-class-per-file. Each class is loaded in their own interpreter context. The purpose of aggregate_plugin is to simplify the management of the the loaded types and to optimize the resources allocated for the embedded Python interpreter as aggregate_plugin creates only one instance of the embedded interpreter for both types ESS and Densities.

Basically, a user provides a name of a python library, a name of class and a list of factual arguments necessary for the construction of the model of the class. The MocapyExt library in turn then returns an instance of the new class.

The given argument list (up to N = 6 arguments are supported) would parametrize the initialization of Densities object. Parameters are forwarded to the Python API and converted using the type conversion facilities provided by the Boost.Python library. The mechanism is general and the user can convert any arbitrary type T to its equivalent in Python.

The parameters are forwarded via const-references. This method accepts and forwards arbitrary typed arguments, at the cost of always treating the argument as const. This solution is typically used for constructor arguments. An esoteric problem with this approach is that it is not possible to form a const reference to a function type, but this deficiency, which is being addressed in Core Issue 295, actually works in our favor, because this prevents ill-intended attempts to pass a function to the Python interpreter.

Any further invocation of ess and dens member functions automatically calls a corresponding class method in Python.

The lifetime of a class instance

The lifetime of ESS computer and Densities objects (respectively, ess and dens) may exceed the lifetime of their respective factory plugin(s). Objects ess and dens preserve reference counted pointers to the Python interpreter, thus a valid Python interpreter instance exists until the destruction of objects ess and dens.

Creation of a node

This is a simple example that showcases two different methods to initialize a plugin node. In this particular case, the module plugin_tests that has been used in the program implements a dummy discrete node of fixed length 1, and always returns [0,] as sampled value.

plugin_node_type is a typedef of the ChildNode class template. Here we also note that the lifetime of a node is independent from the lifetime of plug-in factories.

Node output

A plugin node is Streamable, i.e., it can be output to a std::ostream via the operator

std::ostream& operator<<(std::ostream&, const plugin_node_type&);

The content of the output is the same as repr(Z), where Z is either an ESS computer or Densities object.

Node serialization

The MocapyEXT preserves the old Mocapy++ ChildNode class template serialization behaviour, which means that for ESS computers that do not need to be serialized (this includes kentess, gaussianess, poissoness, and others) no serialization will happen. However, the MocapyEXT necessarily has to serialize the Python plugin that implements an ESS computer in order to preserve the module and the name of the class, so that the ESS computer could be later deserialized.

It also must be noted that after deserialization the ESS computer and Densities will not share the same Python interpreter instance, even if they are implemented in the same module.

Measuring performance

It is interesting to measure how much plugin interface costs compared to the implementation of ESS and Densities computers themselves.

We measure the relative performance of tests named N, S, and A. All tests invoke the member functions of the ESS and Densities computers, albeit no particular computation is carried out.

N test stands for N(ative), which means that pure C++ implementation of ESS and Densities computers were used in the test.

Repeated tests without the construction of ESS and Densities objects show that invocation of a method through the MocapyEXT interface is only ~2,5 times slower than to invoke a member function of natively implemented ESS and Densities objects. It is obvious that in a real-world scenario node construction, sampling, likelihood computation, etc. would be more likely to constitute the better part of the running time as compared to the invocations of methods themselves. Basically, these tests show that member function invocation takes much less time than even such a 'lightweight' operation as node construction. In principle it is now possible to write quite
generic algorithms. Consider this example:

We can be reasonably sure that performance penalty of using MocapyEXT, while not negligible, will not outweight the added value of genericity. And in the pure C++ scenario it does not incur any performance cost.

Roadmap

The MocapyEXT is an ongoing project. In the current state it is usable, however, it is important to stress that the described features are by no means complete and may work differently in the future. Numerous changes have been done in Mocapy++ to support the MocapyEXT infrastructure. Most notably, the Mocapy++ library has been made partially const-correct in the development branch. The importance of const-correctness cannot be overstated as it affects the parameter passing protocol from and to the Python code. The changes, however, have been done transparently. If an end-user did not rely on the side effects of the now immutable operations, she should be able to compile and run the code with minor modifications. Undoubtedly, many other changes await. For example, in some contexts it would be useful to have the ability to clone the Densities by passing their parameters to the constructor:

It is also my belief that the current parameter initialization scheme is ill conceived because the pointer to a random number generator is stored in Densities. It would be much better to pass the random number generator as a parameter. This would singlehandedly solve problems with object ownership in Mocapy++ Python bindings, fix some left-over const-correcness issues and so on. However, this would introduce a breaking change, and a lot of projects depend on Mocapy++.

During the project one of the interesting technical exercises has been to convert the values of the MDArray class template to Python's ndarray. The conversion routines have been written and all of them fully work as expected, however, the MDArray class template has its problems. It is one of the classic examples of failed separation of concerns. Not only it serves as a multi-dimensional array, but it is used in a manner that one would use a Swiss army knife: the MDArray also serves as a matrix, as a vector (in the mathematical sense), it has curiously named member functions like void keep_eye() (I still do not know what it does), it could be rotated, axis permuted, and whatever else one can think of -- all of which is implemented in member functions. MDArray interface is enormous, and this is clearly wrong. Still, touching MDArray is a delicate matter that ought to be addressed with utmost care; most likely by introducing free functions to do the job, and slowly phasing out the member functions.