* Formulate probabilistic models using Python-Mocapy++. Apply the models to solve biological problems. Examples of problems that can be solved using dynamic Bayesian networks include deciding if a pair of sequences is evolutionarily related, finding sequences which are homologous to a known evolutionary family and predicting RNA secondary structure.


* Learn the statistical methods used in Mocapy++ and relevant concepts in structural biology.


== Project Progress ==


=== Options to create Python bindings to C++ code ===


==== Swig ====


There is already an effort to provide bindings for Mocapy++ using Swig. However, Swig is not the best option if performance is a requirement.


The Sage project aims at providing an open source alternative to Mathematica or Maple. Cython was developed in conjunction with Sage (it is an independent project, though), thus it is based on Sage's requirements. They tried Swig, but declined it for performance issues. According to the [http://sage.math.washington.edu/tmp/sage-2.8.12.alpha0/doc/prog/node36.html Sage programming guide] "The idea was to write code in C++ for SAGE that needed to be fast, then wrap it in SWIG. This ground to a halt, because the result was not sufficiently fast. First, there is overhead when writing code in C++ in the first place. Second, SWIG generates several layers of code between Python and the code that does the actual work". This was written back in 2004, but it seems things didn't evolve much.


The only reason I would consider Swig is the possibility of providing Mocapy++ bindings for the BioJava and BioRuby projects in the future.


==== Boost Python ====


Boost Python is comprehensive and well accepted by the Python community. I would choose it for its extensive use and testing; I would decline it for being hard to debug and having a complicated build system. I don't think it would be worth including a Boost dependency just for the sake of creating the Python bindings, but since Mocapy++ already depends on Boost, using it becomes a more attractive option. In my personal experience, Boost Python is very mature and there are no limitations on what one can do with it. When it comes to performance, Cython still outperforms it. Have a look at the [http://blog.behnel.de/index.php?p=38 Cython C++ wrapping benchmarks] and check the timings of [http://www.behnel.de/cycppbench/ Cython against Boost Python]. There are also previous [http://telecom.inescporto.pt/~gjc/pybindgen-benchmarks/ benchmarks comparing Swig and Boost Python].


==== Cython ====


It is considerably faster than other options for creating Python bindings to C++ code, according to several benchmarks available on the web (see the Cython vs. Boost.Python timings linked in the previous section). It is also very clean and simple, yet powerful. Python's documentation on porting extension modules mentions Cython: "If you are writing a new extension module, you might consider Cython."


Cython now has support for efficient interaction with numpy arrays. It is a young but developing language, and I would definitely give it a try for its leanness and speed.


Since Boost is well supported and Mocapy++ already relies on it, we decided to use Boost.Python for the bindings.


For further information see [https://docs.google.com/document/d/1E72Qysp3pMd69hSYfIXJKgBLdeSMbzSol9RYD2rKHlI/edit?hl=pt_BR&authkey=CPmFxK0H Mocapy++Biopython - Box of ideas].


=== Bindings Prototype ===


* Implemented the bindings to provide a minimum subset of functionality, in order to run the implemented examples.


* Compared the performance of C++ and Python versions.


Mocapy++’s interface remained unchanged, so the tests look similar to the ones in Mocapy/examples.


In the prototype the bindings were all implemented in a single module. For the actual implementation, we could mirror the src package structure, having separate bindings for each package such as discrete, inference, etc.


It was possible to implement all the functionality required to run the examples. It was not possible to use the [http://www.boost.org/doc/libs/1_42_0/libs/python/doc/v2/indexing.html vector_indexing_suite] when creating bindings for vectors of MDArrays. A few operators (in the MDArray) must be implemented in order to export indexable C++ containers to Python.


Two Mocapy++ examples that use discrete nodes were implemented in Python. There was no problem in exposing Mocapy’s data structures and algorithms. The performance of the Python version is very close to the original Mocapy++.


For additional details have a look at the [https://docs.google.com/document/d/1JPkCbvJ9Gk3b6LmQ68__UUt4Yz2P1-c286p6FVEqyXs/edit?hl=pt_BR&authkey=CJTCpZgL Mocapy++ Bindings Prototype] report.


=== Bindings Implementation ===


==== Bindings for the core functions and data structures ====


''' Data structures '''


Mocapy uses an internal data structure to represent arrays: MDArray. In order to make it easier for the user to interact with Mocapy's API, it was decided to provide an interface that accepts numpy arrays. Therefore, it was necessary to implement a translation between a numpy array and an MDArray.


The translation from MDArray to Python was done through the use of Boost.Python [http://www.boost.org/doc/libs/1_43_0/libs/python/doc/v2/to_python_converter.html to_python_converter]. We've implemented a template method, convert_MDArray_to_numpy_array, which converts an MDArray of any basic type to a corresponding numpy array. In order to perform the translation, the original array's shape and internal data are copied into a new numpy array.


The numpy array was created using the [http://docs.scipy.org/doc/numpy/reference/c-api.array.html Numpy Array API]. The creation of a new [http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#PyArrayObject PyArrayObject] using existing data (PyArray_SimpleNewFromData) doesn't copy the array data, it just stores a pointer to it. Thus, one can only free the data when there is no reference to the object. This was done through the use of a [http://docs.python.org/c-api/capsule.html Capsule]. Besides encapsulating the data, the capsule also stores a destructor to be used when the array is destroyed. The PyArrayObject has a field named "base" which points to the capsule.
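The same lifetime rule can be seen from pure Python with NumPy views; a small illustration (using only NumPy, not the bindings):

```python
import numpy as np

# A slice shares the owner's memory instead of copying it, and its
# "base" attribute keeps the owner alive -- the same role the capsule
# plays for the data wrapped by PyArray_SimpleNewFromData.
owner = np.arange(6, dtype=np.float64)
view = owner[1:4]             # stores a pointer into owner's data
assert view.base is owner     # owner cannot be freed while view exists
assert not view.flags['OWNDATA']
```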


The translation from Python to C++, i.e. creating an MDArray from a numpy array, is slightly more complex. Boost.Python provides a chunk of memory into which the new C++ object must be constructed in-place. See the [http://misspent.wordpress.com/2009/09/27/how-to-write-boost-python-converters/ How to write boost.python converters] article for more details.

A translation between std::vector of basic types (double, int...) and Python list was also implemented. For std::vector of custom types, such as Node, the translation to a Python list was not performed: if done the same way as for basic types, a type error is raised ("TypeError: No to_python (by-value) converter found for C++ type"). When using [http://www.boost.org/doc/libs/1_46_1/libs/python/doc/v2/indexing.html#vector_indexing_suite_class vector_indexing_suite] this problem is already solved; see [http://mail.python.org/pipermail/cplusplus-sig/2005-July/008865.html Wrapping std::vector<AbstractClass*>]. The only inconvenience of using the vector_indexing_suite is that it creates new types such as vector_Node, instead of using a standard Python list.


The code for the translations is in the [http://mocapy.svn.sourceforge.net/viewvc/mocapy/branches/gSoC11/bindings/mocapy_data_structures.cpp?revision=340&view=markup mocapy_data_structures] module.


''' Core functions '''


The mocapy Python packages follow Mocapy's current source tree. For each package, a shared library with the bindings was created. This makes compilation faster and debugging easier. Also, if a single library were created it wouldn't be possible to define packages.


Each of the libraries is called libmocapy_<nameofthepackage>. For example, libmocapy_gaussian provides bindings for the Gaussian nodes and probability distributions. The libmocapy_data_structures library is used by the other libraries and, therefore, must be imported first. This is done on the Python side: each of the libmocapy_* libraries is imported in the corresponding package. See [http://www.boost.org/doc/libs/1_46_1/libs/python/doc/tutorial/doc/html/python/techniques.html#python.creating_packages Creating Packages].
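The import-order constraint can be pictured with a toy registry (purely illustrative; Boost.Python's real registry is internal): importing libmocapy_data_structures is what registers the shared converters, so it must run before any dependent package is used.

```python
# Toy stand-in for Boost.Python's global converter registry.
registry = {}

def import_libmocapy_data_structures():
    # Importing the module registers the MDArray <-> numpy converters
    # as a side effect of module initialization.
    registry['mdarray'] = 'numpy converter'

def import_libmocapy_gaussian():
    # The gaussian bindings rely on converters registered above.
    assert 'mdarray' in registry, 'import libmocapy_data_structures first'

import_libmocapy_data_structures()
import_libmocapy_gaussian()  # succeeds only after the registration above
```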


The bindings code can be found in the [http://mocapy.svn.sourceforge.net/viewvc/mocapy/branches/gSoC11/bindings/ Bindings directory].


Currently, tests for the newly created interface are being developed. There are a few tests already implemented under the framework package: [http://mocapy.svn.sourceforge.net/viewvc/mocapy/branches/gSoC11/python/mocapy/framework/tests/ mocapy/framework/tests]


==== Bindings for the remaining Mocapy++ functionality ====


''' Data structures '''


While implementing the bindings for the remaining Mocapy++ functionality there were problems with methods that take pointers and references to an mdarray:


* It is not possible to call a method which takes a pointer if the object is created on the python side. See [http://stackoverflow.com/questions/3881457/boostpython-howto-call-a-function-that-expects-a-pointer how to call a function that expects a pointer?].


* It is not possible to automatically translate a non-const reference. The custom rvalue converters only match functions which take the container by value or by const reference.

For further details see [http://www.boost.org/doc/libs/1_46_1/libs/python/doc/v2/faq.html#question2 How can I wrap functions which take C++ containers as arguments?]


The mdarray is created in Python using a numpy.array, which is translated to C++ using [http://www.boost.org/doc/libs/1_42_0/libs/python/doc/v2/faq.html#custom_string custom converters]. The custom converters are registered in the global Boost.Python registry near the top of the module initialization function. Once flow control has passed through the registration code, the conversions from and to Python happen automatically.


Because of these automatic conversions, it was necessary to create wrappers for functions which take pointers as arguments, and to change the functions which take references so that they take const references. Because Mocapy++ is not [http://en.wikipedia.org/wiki/Const-correctness const correct], changes are needed to use the const references properly. While those changes are being made, some const_casts have been used. When using const_cast one must be aware that [http://stackoverflow.com/questions/357600/is-const-cast-safe it is not always safe].


The [http://www.boost.org/doc/libs/1_46_1/libs/python/doc/tutorial/doc/html/python/functions.html#python.call_policies call policies] were also reviewed. When using an incorrect return value policy, you won't get a compile error, but your code will crash at runtime.


''' Examples '''


Mocapy++'s examples were implemented in Python, using the exposed API and data type conversions.


==== Testing and Improving Mocapy Bindings ====


Before integrating with Biopython, some unit testing was required, both to detect possible errors and to make sure future changes that break functionality won't go unnoticed.


For every Python package, a "tests" directory was created which contains the unit tests for each module. The tests created for the framework package, for example, can be found under mocapy/framework/tests.
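As a sketch of what these tests look like (the class and test names here are hypothetical, and the test exercises only the numpy side so the snippet stays self-contained):

```python
import unittest
import numpy

class TestArrayTranslation(unittest.TestCase):
    # The real tests exercise the MDArray converters; this stand-in
    # only checks the dtype contract the converters depend on.
    def test_float_dtype(self):
        x = numpy.array([[1, 2, 3]], dtype=numpy.float64)
        self.assertEqual(x.dtype, numpy.float64)

suite = unittest.TestLoader().loadTestsFromTestCase(TestArrayTranslation)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```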

Pointers returned via [http://wiki.python.org/moin/boost.python/CallPolicy#manage_new_object manage_new_object] are held by auto_ptr, so the transfer of ownership works correctly. When using this call policy, the caller is responsible for deleting the C++ object from the heap. For further details, see [http://www.boost.org/doc/libs/1_46_0/libs/python/doc/v2/faq.html#ownership How can I wrap a function which needs to take ownership of a raw pointer?]


*''' Translation from numpy.array to a float mdarray'''


If the numpy array is an integer array, the translation creates an mdarray<int> and this is passed to a method which expects an mdarray of floats. This generates incorrect results.
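The pitfall is easy to reproduce on the NumPy side (a minimal illustration):

```python
import numpy

# Without an explicit dtype, Python ints yield an integer array, which
# the converter turns into mdarray<int> rather than mdarray<double>.
a = numpy.array([[1, 2, 3, 4, 5, 6]])
b = numpy.array([[1, 2, 3, 4, 5, 6]], dtype=numpy.float64)
assert a.dtype.kind == 'i'        # integer dtype was inferred
assert b.dtype == numpy.float64   # what a float-expecting method needs
```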


The way to deal with that, from the user's perspective, is either using floating point numbers to create the array or setting the dtype parameter when creating the array:

+

<python>

+

x = numpy.array([[1,2,3,4,5,6]], dtype=numpy.float64)

+

</python>



==== Building and Distributing Mocapy as a Package ====


[http://docs.python.org/distutils/index.html Distutils] was used to distribute Mocapy's Python modules.


Besides distributing the Python code, it was also necessary to build the extension modules.


[http://wiki.python.org/moin/boost.python/BuildingExtensions Building Extensions with boost.python] describes ways to build extensions using distutils.


Mocapy's setup.py can be found at http://mocapy.svn.sourceforge.net/viewvc/mocapy/branches/gSoC11/python/setup.py?revision=418&view=markup. The build procedure is:

* Build the mocapy library using cmake (the usual procedure described in the Mocapy docs);

* Issue "python setup.py build" to build the extension modules;

* Issue "python setup.py install" to install the package (normally, the install step also runs the build step in case you didn't).
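A distutils configuration along these lines might look like the following sketch (the package names and data files here are assumptions for illustration; the actual configuration is in the setup.py linked above):

```python
# Illustrative sketch only, not Mocapy's actual setup.py.
from distutils.core import setup

setup(
    name='mocapy',
    version='1.0',
    packages=['mocapy', 'mocapy.framework'],
    # The libmocapy_* extension libraries are built beforehand
    # (cmake + "setup.py build") and shipped alongside the packages.
    package_data={'mocapy': ['*.so']},
)
```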


=== Integration with Biopython ===


==== API Design ====


In order to use Mocapy in conjunction with Biopython, a new module for PDB-specific features was added to Bio.PDB. This is where the API is being designed.


Mocapy is added as an optional dependency in Biopython. Inside the function or module that requires Mocapy, "import mocapy" is wrapped in a try/except block. A MissingPythonDependencyError is issued if the import fails.
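A sketch of the guarded-import pattern (Biopython's real code raises Bio.MissingPythonDependencyError; the class is stubbed here so the snippet stays self-contained):

```python
# Stub of Biopython's exception type, for illustration only.
class MissingPythonDependencyError(ImportError):
    pass

def require_mocapy():
    """Import mocapy on demand, failing with a helpful message."""
    try:
        import mocapy
    except ImportError:
        raise MissingPythonDependencyError(
            "Install the Mocapy++ Python bindings to use this module.")
    return mocapy
```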


Features being studied for inclusion in the module:


* extract the backbone dihedral angles from a given set of structures;


* use this data to train a TorusDBN-like model;


* automatically decide on the best model using the BIC criterion.


==== Barnacle ====

The aim of BARNACLE (BAyesian network model of RNA using Circular distributions and maximum Likelihood Estimation) is to capture both the marginal distributions of each of the angles and the local dependencies between them. Barnacle describes RNA structure in a natural continuous space. It can be used purely as a proposal distribution, but also as an energy term enforcing realistic local conformations. The model combines a dynamic Bayesian network (DBN) with directional statistics.



The DBN represents nine consecutive dihedral angles, where the seven central angles originate from a single nucleotide. Each slice j (a column of three variables) corresponds to one dihedral angle in an RNA fragment. The variables in each slice are: an angle identifier, Dj, a hidden variable, Hj, and an angular variable, Aj. The angle identifier keeps track of which dihedral angle is represented by a slice, while the angular node models the actual dihedral angle value. The hidden nodes induce dependencies between all angles along the sequence (and not just between angles in consecutive slices).


The original source code for Barnacle, which contains an embedded version of Mocapy written in Python, can be found at http://sourceforge.net/projects/barnacle-rna.


The modified version of Barnacle, changed to work with the Mocapy bindings can be found at https://github.com/mchelem/biopython/tree/master/Bio/PDB/Barnacle.

==== TorusDBN ====

TorusDBN aims at predicting the 3D structure of a biomolecule given its amino-acid sequence. It is a continuous probabilistic model of the local sequence–structure preferences of proteins in atomic detail. The backbone of a protein can be represented by a sequence of dihedral angle pairs, φ and ψ, which are well known from the [http://en.wikipedia.org/wiki/Ramachandran_plot Ramachandran plot]. Two angles, both with values ranging from −180° to 180°, define a point on the torus; hence, the backbone structure of a protein can be fully parameterized as a sequence of such points.

The TorusDBN model is originally implemented as part of the backboneDBN package, which is freely available at http://sourceforge.net/projects/phaistos/.


A new version of the TorusDBN model was implemented in the context of this project and can be found at https://github.com/mchelem/biopython/tree/master/Bio/PDB/TorusDBN.


The TorusDBNTrainer can be used to train a model with a given training set:


<python>
trainer = TorusDBNTrainer()
trainer.train(training_set)  # training_set is a list of files
model = trainer.get_model()
</python>


Then the model can be used to sample new sequences:


<python>
model.set_aa('ACDEFGHIK')
model.sample()
print model.get_angles()  # The sampled angles.
</python>


When creating a model, it is possible to create a new DBN specifying the size of the hidden node or loading the DBN from a file.


<python>
model = TorusDBNModel()
model.create_dbn(hidden_node_size=10)
model.save_dbn('test.dbn')
</python>


<python>
model = TorusDBNModel()
model.load_dbn('test.dbn')
model.set_aa('ACDEFGHIK')
model.sample()
print model.get_angles()  # The sampled angles.
</python>


It is also possible to choose the best size for the hidden node using the find_optimal_model method:


<python>
trainer = TorusDBNTrainer()
hidden_node_size, IC = trainer.find_optimal_model(training_set)
model = trainer.get_model()
</python>


IC is either the [http://en.wikipedia.org/wiki/Bayesian_information_criterion Bayesian Information Criterion] (BIC) or the [http://en.wikipedia.org/wiki/Akaike_information_criterion Akaike Information Criterion] (AIC). It defaults to BIC; AIC can be selected by setting the use_aic flag.
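For reference, both criteria trade off fit against model size (standard definitions; the numbers below are made up):

```python
import math

def bic(log_likelihood, n_params, n_samples):
    # BIC = -2*ln(L) + k*ln(n); lower is better.
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def aic(log_likelihood, n_params):
    # AIC = -2*ln(L) + 2*k; penalizes size less than BIC for large n.
    return -2.0 * log_likelihood + 2.0 * n_params

print(aic(-1200.0, 50))        # 2500.0
print(bic(-1200.0, 50, 1000))  # larger: the ln(n) penalty dominates
```

Because of the ln(n) factor, BIC tends to select smaller hidden node sizes than AIC on large training sets.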


For more details on the model API, see the test files: https://github.com/mchelem/biopython/blob/master/Tests/test_TorusDBNTrainer.py and https://github.com/mchelem/biopython/blob/master/Tests/test_TorusDBNModel.py.


=== Performance ===


A few performance measurements were made comparing test cases implemented both in C++ and in Python; all tests were run on the same machine. The profiling tests were made using [http://valgrind.org/info/tools.html#callgrind Callgrind] and visualized using [http://kcachegrind.sourceforge.net/ Kcachegrind].


Here are the average running times of the examples available with Mocapy (10 runs):


{| class="wikitable" border="1"
|-
! Test name
! C++ (s)
! Python (s)
|-
| hmm_simple
| 0.52
| 0.58
|-
| hmm_discrete
| 48.12
| 43.45
|-
| discrete_hmm_with_prior
| 55.95
| 50.09
|-
| hmm_dirichlet
| 340.72
| 353.98
|-
| hmm_factorial
| 0.01
| 0.12
|-
| hmm_gauss_1d
| 53.97
| 63.39
|-
| hmm_gauss
| 16.02
| 16.96
|-
| hmm_multinomial
| 134.64
| 125.83
|-
| hmm_poisson
| 11.00
| 10.60
|-
| hmm_vonmises
| 7.22
| 7.36
|-
| hmm_torus
| 53.79
| 53.65
|-
| hmm_kent
| 61.35
| 61.06
|-
| hmm_bippo
| 40.66
| 41.81
|-
| infenginehmm
| 0.01
| 0.12
|-
| infenginemm
| 0.01
| 0.15
|}


==== TorusDBN ====


Even though the PDB files are read, parsed and transformed into a format Mocapy can understand, the most time-consuming methods are the ones performing mathematical operations during the sampling process (Chebyshev and exp, for example).


[[File:TorusDBN.png|400px|thumb|right|Training of the TorusDBN model ]]


The model has been trained with a training set consisting of about 950 chains with maximum 20% homology, resolution below 1.6 Å and R-factor below 25%. It took about 67 minutes to read and train the whole dataset.


The resulting DBN is available at https://github.com/mchelem/biopython/blob/master/Tests/TorusDBN/pisces_dataset.dbn and can be loaded directly into the model as explained in the TorusDBN section above.


=== Future work ===


The summer is over, but the work continues... There are still a lot of things I intend to work on:


* Test the trained models to check their effectiveness in protein structure prediction.


* Try to reduce dynamic allocation as it is responsible for a lot of running time.


* Guarantee there are no memory leaks in the bindings.

Latest revision as of 15:31, 24 August 2011

Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide a strong support for the field of protein structure prediction, design and simulation.

== Introduction ==

Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three dimensional structure?
One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time-series or sequences of symbols, such as protein sequences.
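To make the idea concrete, here is a minimal pure-Python sketch of a two-state hidden Markov model, the simplest dynamic Bayesian network; all states, symbols and probabilities below are invented for illustration:

```python
import random

random.seed(7)  # reproducible sampling for the example

# Illustrative two-state HMM: hidden states "H"/"L", emitted symbols "A"/"C".
transition = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
emission = {"H": {"A": 0.6, "C": 0.4}, "L": {"A": 0.2, "C": 0.8}}

def sample(distribution):
    """Draw one value from a discrete probability distribution."""
    r, total = random.random(), 0.0
    for value, p in distribution.items():
        total += p
        if r < total:
            return value
    return value  # guard against floating point rounding

def sample_sequence(length, start="H"):
    """Unroll the network over `length` slices, one (state, symbol) per slice."""
    state, states, symbols = start, [], []
    for _ in range(length):
        states.append(state)
        symbols.append(sample(emission[state]))
        state = sample(transition[state])
    return states, symbols

states, symbols = sample_sequence(5)
```

Each sequence index plays the role of one slice of the network; Mocapy++ generalizes this to arbitrary node types (discrete, Gaussian, von Mises, Kent, ...) and learns the distributions from data.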
Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere. Special directional methods are required to take the structure of these sample spaces into account.
The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space.
Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon.
The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database.
Integrating Mocapy++ with Biopython will create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods. It will provide strong support for the field of biomolecular structure prediction, design, and simulation.

== Project Schedule ==

=== Work Plan ===

* Build the theoretical background on the algorithms used in Mocapy++, such as parameter learning of Bayesian networks using Stochastic Expectation Maximization (SEM).
* Study Mocapy++'s use cases:
** Read several papers and attempt to replicate part of the experiments described using Mocapy++.
** Get a better understanding of biological sequence analysis done through probabilistic models of proteins and nucleic acids.
* Work with Mocapy++:
** Understand Mocapy++'s internal architecture and algorithms by exploring its source code and running its test cases.
** Research other applications of Mocapy++ in Bioinformatics.
* Design Mocapy++'s Python interface:
** Explore the source code of Biopython to understand its design and implementation. The Mocapy++ interface to be included in Biopython must be made compatible with the methods of solving problems in Biopython.
** Design a Python interface for Mocapy++, based on its data structures and algorithms. Examine Mocapy++'s use cases and existing test cases to provide guidance for the interface design.
** Formulate probabilistic models using Python-Mocapy++. Apply the models to solve biological problems. Examples of problems that can be solved using dynamic Bayesian networks include deciding if a pair of sequences is evolutionarily related, finding sequences which are homologous to a known evolutionary family and predicting RNA secondary structure.

== Project Code ==

== Project Progress ==

=== Options to create Python bindings to C++ code ===

==== Swig ====

There is already an effort to provide bindings for Mocapy++ using Swig. However, Swig is not the best option when performance is required.
The Sage project aims at providing an open source alternative to Mathematica or Maple. Cython was developed in conjunction with Sage (though it is an independent project), so it is based on Sage's requirements. The Sage developers tried Swig, but rejected it because of performance issues. According to the [http://sage.math.washington.edu/tmp/sage-2.8.12.alpha0/doc/prog/node36.html Sage programming guide], "The idea was to write code in C++ for SAGE that needed to be fast, then wrap it in SWIG. This ground to a halt, because the result was not sufficiently fast. First, there is overhead when writing code in C++ in the first place. Second, SWIG generates several layers of code between Python and the code that does the actual work". This was written back in 2004, but it seems little has changed since.
The only reason I would consider Swig is the possibility of later including Mocapy++ bindings in the BioJava and BioRuby projects.

==== Boost Python ====

Boost Python is comprehensive and well accepted by the Python community. I would go for it for its extensive use and testing. I would decline it for being hard to debug and having a complicated build system. I don't think it would be worth including a Boost dependency just for the sake of creating the Python bindings, but since Mocapy++ already depends on Boost, using it becomes a more attractive option. In my personal experience, Boost Python is very mature and there are no limitations on what one can do with it. When it comes to performance, Cython still outperforms it. Have a look at the Cython C++ wrapping benchmarks and check the timings of Cython against Boost Python. There are also previous benchmarks comparing Swig and Boost Python.

==== Cython ====

It is considerably faster than the other options for creating Python bindings to C++ code, according to several benchmarks available on the web. Check the simple benchmark between Cython and Boost.Python. It is also very clean and simple, yet powerful. Python's documentation on porting extension modules mentions Cython: "If you are writing a new extension module, you might consider Cython."
Cython now has support for efficient interaction with numpy arrays. It is a young but developing language, and I would definitely give it a try for its leanness and speed.

Since Boost is well supported and Mocapy++ already relies on it, we decided to use Boost.Python for the bindings.

* Implemented the bindings to provide a minimum subset of functionality, in order to run the implemented examples.
* Compared the performance of the C++ and Python versions.
* Mocapy++'s interface remained unchanged, so the tests look similar to the ones in Mocapy/examples.

In the prototype the bindings were all implemented in a single module. For the actual implementation, we could mirror the src package structure, having separate bindings for each package such as discrete, inference, etc.

It was possible to implement all the functionality required to run the examples. It was not possible to use the vector_indexing_suite when creating bindings for vectors of MDArrays. A few operators (in the MDArray) must be implemented in order to export indexable C++ containers to Python.

Two Mocapy++ examples that use discrete nodes were implemented in Python. There was no problem in exposing Mocapy’s data structures and algorithms. The performance of the Python version is very close to the original Mocapy++.

=== Bindings Implementation ===

==== Bindings for the core functions and data structures ====

===== Data structures =====

Mocapy uses an internal data structure to represent arrays: MDArray. In order to make it easier for the user to interact with Mocapy's API, it was decided to provide an interface that accepts numpy arrays. Therefore, it was necessary to implement a translation between a numpy array and an MDArray.

The translation from MDArray to python was done through the use of Boost.Python to_python_converter. We've implemented a template method convert_MDArray_to_numpy_array, which converts an MDArray of any basic type to a corresponding numpy array. In order to perform the translation the original array's shape and internal data are copied into a new numpy array.

The numpy array was created using the Numpy Array API. The creation of a new PyArrayObject using existing data (PyArray_SimpleNewFromData) doesn't copy the array data, it just stores a pointer to it. Thus, one can only free the data when there is no reference to the object. This was done through the use of a Capsule. Besides encapsulating the data, the capsule also stores a destructor to be used when the array is destroyed. The PyArrayObject has a field named "base" which points to the capsule.

The translation from Python to C++, i.e. creating an MDArray from a numpy array, is slightly more complex. Boost.Python will provide a chunk of memory into which the new C++ object must be constructed in-place. See the How to write boost.python converters article for more details.

A translation between std::vector of basic types (double, int...) and Python list was also implemented. For std::vector of custom types, such as Node, the translation to a Python list was not performed. If done the same way as for basic types, a type error is raised: "TypeError: No to_python (by-value) converter found for C++ type". When using vector_indexing_suite this problem is already solved. See Wrapping std::vector&lt;AbstractClass*&gt;. The only inconvenience of using the vector_indexing_suite is that it creates new types such as vector_Node, instead of using a standard Python list.

The mocapy Python packages follow Mocapy's current source tree. For each package, a shared library with the bindings was created. This makes compilation faster and debugging easier. Also, if a single library were created it wouldn't be possible to define packages.

Each of the libraries is called libmocapy_<nameofthepackage>. For example, libmocapy_gaussian provides bindings for the gaussian nodes and probability distributions. The libmocapy_data_structures is used by other libraries and, therefore, must be imported first. This is done on the Python side. Each of the libmocapy_* libraries is imported in the corresponding package. See Creating Packages.

The mdarray is created in Python using a numpy.array that is translated to C++ using custom converters. The custom converters are registered in the global Boost.Python registry near the top of the module initialization function. Once flow control has passed through the registration code, the conversions from and to Python happen automatically.

Because of these automatic conversions, it was necessary to create wrappers for functions which take pointers as arguments, and to change the functions which take references to take const references. Because Mocapy++ is not const correct, changes are needed to use the const references properly. While those changes are being made, some const_casts have been used. When using const_cast one must be aware that it is not always safe.

The call policies were also reviewed. When using an incorrect return value policy, you won't get a compile error, but your code will crash at runtime.

Pointers returned via manage_new_object are held by an auto_ptr, so the transfer of ownership works correctly. This call policy is used when the caller is expected to take ownership of the returned C++ object and delete it; the generated Python object assumes that responsibility.

===== Translation from numpy.array to a float mdarray =====

If the numpy array is an integer array, the translation creates an mdarray&lt;int&gt;, and this is passed to a method which expects an mdarray of floats. This generates incorrect results.

The way to deal with this from the user's perspective is either using floating point numbers to create the array or setting the dtype parameter when creating the array:
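For example (a minimal numpy sketch; the array values are arbitrary):

```python
import numpy

# Integer literals silently yield an int array, and hence an mdarray<int>:
wrong = numpy.array([[1, 2], [3, 4]])

# Either write the values as floating point numbers...
ok_floats = numpy.array([[1.0, 2.0], [3.0, 4.0]])

# ...or request a float dtype explicitly at creation time:
ok_dtype = numpy.array([[1, 2], [3, 4]], dtype=numpy.float64)
```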

# Build the mocapy library using cmake (the usual procedure described in the mocapy docs);
# Issue "python setup.py build" to build the extension modules;
# Issue "python setup.py install" to install the package (normally, the install procedure performs the previous step in case you didn't).

=== Integration with Biopython ===

==== API Design ====

In order to use Mocapy in conjunction with Biopython, a new module for PDB-specific features was added to Bio.PDB. This is where the API is being designed.

Mocapy is added as an optional dependency in Biopython. Inside the function or module that requires Mocapy, "import mocapy" is wrapped in a try/except block. A MissingPythonDependencyError is issued if the import fails.
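This import pattern can be sketched as follows; the helper below raises a plain ImportError to stay self-contained, where Biopython's real code raises MissingPythonDependencyError:

```python
def require(module_name, hint):
    """Import an optional dependency, failing with a descriptive message.

    Biopython raises MissingPythonDependencyError in this situation; a plain
    ImportError keeps this sketch self-contained.
    """
    try:
        return __import__(module_name)
    except ImportError:
        raise ImportError("Missing dependency %r. %s" % (module_name, hint))

# Guarded use, since mocapy may not be installed:
try:
    mocapy = require("mocapy", "Install the Mocapy++ bindings to use this module.")
except ImportError:
    mocapy = None
```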

The aim of BARNACLE (BAyesian network model of RNA using Circular distributions and maximum Likelihood Estimation) is to capture both the marginal distributions of each of the angles and the local dependencies between them. Barnacle describes RNA structure in a natural continuous space. It can be used purely as a proposal distribution, but also as an energy term enforcing realistic local conformations. The model combines a dynamic Bayesian network (DBN) with directional statistics.

Barnacle DBN (doi:10.1371/journal.pcbi.1000406.g002)

The DBN represents nine consecutive dihedral angles, where the seven central angles originate from a single nucleotide. Each slice j (a column of three variables) corresponds to one dihedral angle in an RNA fragment. The variables in each slice are: an angle identifier, Dj, a hidden variable, Hj, and an angular variable, Aj. The angle identifier keeps track of which dihedral angle is represented by a slice, while the angular node models the actual dihedral angle value. The hidden nodes induce dependencies between all angles along the sequence (and not just between angles in consecutive slices).

TorusDBN aims at predicting the 3D structure of a biomolecule given its amino acid sequence. It is a continuous probabilistic model of the local sequence–structure preferences of proteins in atomic detail. The backbone of a protein can be represented by a sequence of dihedral angle pairs, φ and ψ, that are well known from the Ramachandran plot. Two angles, both with values ranging from −180° to 180°, define a point on the torus. Hence, the backbone structure of a protein can be fully parameterized as a sequence of such points.
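As a worked illustration of this parameterization, the sketch below embeds a (φ, ψ) pair as a 3D point on a torus surface; the radii R and r are arbitrary illustration values, not Mocapy++ parameters:

```python
import math

def torus_point(phi_deg, psi_deg, R=2.0, r=1.0):
    """Map a dihedral angle pair (in degrees) to a point on a torus.

    R is the major radius and r the minor radius; both are illustrative.
    """
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return ((R + r * math.cos(psi)) * math.cos(phi),
            (R + r * math.cos(psi)) * math.sin(phi),
            r * math.sin(psi))

# Every (phi, psi) pair in [-180, 180) x [-180, 180) lands on the surface,
# so a backbone becomes a sequence of such points.
point = torus_point(0.0, 0.0)  # (3.0, 0.0, 0.0)
```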

=== Performance ===

A few performance measurements were made comparing test cases implemented both in C++ and in Python. The tests were run in a computer with the following specification:
Core 2 Duo T7250 2.00GHz, Memory Dual Channel 4.0GB (2x2048) 667 MHz DDR2 SDRAM, Hard Drive 200GB 7200RPM.

There were no significant performance differences. For both implementations, the methods responsible for consuming the most CPU time were the same.

Here are the average running times of the examples available with Mocapy (averaged over 10 runs):

{|
! Test name !! C++ (s) !! Python (s)
|-
| hmm_simple || 0.52 || 0.58
|-
| hmm_discrete || 48.12 || 43.45
|-
| discrete_hmm_with_prior || 55.95 || 50.09
|-
| hmm_dirichlet || 340.72 || 353.98
|-
| hmm_factorial || 0.01 || 0.12
|-
| hmm_gauss_1d || 53.97 || 63.39
|-
| hmm_gauss || 16.02 || 16.96
|-
| hmm_multinomial || 134.64 || 125.83
|-
| hmm_poisson || 11.00 || 10.60
|-
| hmm_vonmises || 7.22 || 7.36
|-
| hmm_torus || 53.79 || 53.65
|-
| hmm_kent || 61.35 || 61.06
|-
| hmm_bippo || 40.66 || 41.81
|-
| infenginehmm || 0.01 || 0.12
|-
| infenginemm || 0.01 || 0.15
|}

==== TorusDBN ====

Even though the PDB files must be read, parsed and transformed into a format Mocapy can understand, the most time-consuming methods are the ones performing mathematical operations during the sampling process (Chebyshev and exp, for example).

[[File:TorusDBN.png|400px|thumb|right|Training of the TorusDBN model]]

The model has been trained with a training set consisting of about 950 chains with maximum 20% homology, resolution below 1.6 Å and R-factor below 25%. It took about 67 minutes to read and train the whole dataset.