As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2010. We are supporting João Rodrigues in his project, "[[GSOC2010_Joao|Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module]]."

== Introduction ==

In 2009, Biopython was involved with GSoC in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:

The Open Bioinformatics Foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].

Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.

* João Rodrigues worked on [http://www.biopython.org/wiki/GSOC2010_Joao the Structural Biology module Bio.PDB] adding several features used in everyday structural bioinformatics.

Usually, each Biopython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students and mentors to contact the [http://biopython.org/wiki/Mailing_lists mailing lists] with their own ideas for proposals. There is therefore no fixed list of 'available' mentors, since it depends on which projects are proposed each year.

Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].

Past mentors include:

== 2011 Project ideas ==

* [http://casbon.me/ James Casbon]
* [https://github.com/chapmanb Brad Chapman]
* [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]
* [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]
* [http://www.linkedin.com/in/reece Reece Hart]
* [http://nmr.chem.uu.nl/~joao João Rodrigues]

* [http://etal.myweb.uga.edu/ Eric Talevich]

=== Biopython and PyCogent interoperability ===

== Proposals ==

=== 2013 ===

; Rationale : [http://pycogent.sourceforge.net/ PyCogent] and [http://biopython.org/wiki/Main_Page Biopython] are two widely used toolkits for performing computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses, with Biopython concentrating on sequence parsing and retrieval and PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.

The BioPython proposals for the Google Summer of Code 2013 will be published here once discussed. We encourage potential students and mentors to join the [http://biopython.org/wiki/Mailing_lists BioPython mailing lists] and actively participate in these discussions, either by submitting their own ideas or contributing to improving existing ones.

; Approach : The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:

: Much of the low-level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.

; Challenges
: The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.

; Difficulty and needed skills
: Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the Biopython codebase. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.

; Mentors
: [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]
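
The iterator-based design described for SearchIO can be illustrated with a small sketch. It groups the default 12-column BLAST tabular output into one result object per query, without needing any alignment text. The class names <code>Hit</code> and <code>QueryResult</code>, the function <code>parse_blast_tab</code>, and the example rows are hypothetical and chosen for illustration only; they are not the API of any existing Biopython module.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Hit:
    """One query/subject match; no alignment text is required."""
    subject_id: str
    percent_identity: float
    evalue: float
    bit_score: float


@dataclass
class QueryResult:
    """All hits reported for a single query sequence."""
    query_id: str
    hits: List[Hit] = field(default_factory=list)


def parse_blast_tab(lines: Iterable[str]) -> Iterator[QueryResult]:
    """Lazily yield one QueryResult per query from 12-column tabular lines.

    Assumes rows for the same query are adjacent, as in typical BLAST output.
    """
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # Columns: 0 query, 1 subject, 2 % identity, 10 e-value, 11 bit score
        hit = Hit(cols[1], float(cols[2]), float(cols[10]), float(cols[11]))
        if current is None or current.query_id != cols[0]:
            if current is not None:
                yield current
            current = QueryResult(cols[0])
        current.hits.append(hit)
    if current is not None:
        yield current


# Made-up rows, for illustration only.
example = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370",
    "q1\ts2\t80.0\t150\t30\t0\t1\t150\t9\t158\t2e-30\t120",
    "q2\ts9\t99.0\t100\t1\t0\t1\t100\t1\t100\t5e-50\t190",
]
results = list(parse_blast_tab(example))
for result in results:
    print(result.query_id, [hit.subject_id for hit in result.hits])
```

Because the parser is a generator, only one query's hits are held in memory at a time. Random access in the style of Bio.SeqIO.index would need a separate offset index, which this sketch does not attempt.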

; Challenges : This project provides the student with a lot of freedom to create useful interoperability between two feature rich libraries. As opposed to projects which might require churning out more lines of code, the major challenge here will be defining useful APIs and interfaces for existing code. High level inventiveness and coding skill will be required for generating glue code; we feel library integration is an extremely beneficial skill. We also value clear use case based documentation to support the new interfaces.

: Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.

** release code to appropriate community efforts and write short manuscript
** implement web service for HGVS conversion
; Difficulty and needed skills
: Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
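
As a rough illustration of converting a variant between its human-readable form and an object, the sketch below parses only the simplest HGVS-style substitution (accession, coordinate type, position, reference and alternate base). The class <code>Substitution</code>, the function <code>parse_substitution</code>, and the accession <code>NM_000000.0</code> are hypothetical placeholders. A real implementation would need a proper grammar covering the many other HGVS variant types.

```python
import re
from dataclasses import dataclass

# Simplified pattern: accession, coordinate type, position, reference>alternate.
# It deliberately ignores deletions, insertions, intronic offsets, and so on.
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[^:]+):(?P<kind>[cgmn])\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


@dataclass(frozen=True)
class Substitution:
    """A single-base substitution at a 1-based position."""
    accession: str
    kind: str
    position: int
    ref: str
    alt: str

    def __str__(self):
        # Convert the object back to its human-readable representation.
        return f"{self.accession}:{self.kind}.{self.position}{self.ref}>{self.alt}"


def parse_substitution(text):
    """Parse a simple substitution string into a Substitution object."""
    match = _SUBSTITUTION.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return Substitution(**fields)


# The accession below is a made-up placeholder, not a real record.
variant = parse_substitution("NM_000000.0:c.76A>T")
print(variant.position, variant.ref, variant.alt)
print(str(variant))
```

Round-tripping through <code>str()</code> gives a cheap consistency check between the parsed object and its text form.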

: Analysis of protein-protein complex interfaces at the residue level yields significant information on the overall binding process. Such information can be used broadly, for example in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. PROTORP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, PROTORP is not suited for large-scale analysis, and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could easily be used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.

; Approach & Goals
* Add the new module backbone to the current Bio.PDB code base
** Evaluate possible code reuse and call it from the new module
** Try simple calculations to check that the different modules (parsing, for example) and functions work together reliably
* Define a stable benchmark
** Select a few PDB files that differ in interface size and protein size
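
One of the "simple calculations" such a module might start from is a distance-based definition of interface residues. The sketch below is a minimal, pure-Python illustration under stated assumptions: the <code>Atom</code> tuple, the function <code>interface_residues</code>, the 5.0 Å cutoff, and the toy coordinates are all hypothetical. A real implementation would presumably take atoms from Bio.PDB and use a spatial index rather than the all-pairs loop shown here.

```python
import math
from collections import namedtuple

# Minimal stand-in for a parsed atom: chain ID, residue number, (x, y, z).
Atom = namedtuple("Atom", "chain residue_number coord")


def interface_residues(atoms, chain_a, chain_b, cutoff=5.0):
    """Return sorted (chain, residue_number) pairs that have at least one
    atom within `cutoff` of an atom in the other chain."""
    side_a = [atom for atom in atoms if atom.chain == chain_a]
    side_b = [atom for atom in atoms if atom.chain == chain_b]
    contacts = set()
    for a in side_a:
        for b in side_b:
            if math.dist(a.coord, b.coord) <= cutoff:
                contacts.add((a.chain, a.residue_number))
                contacts.add((b.chain, b.residue_number))
    return sorted(contacts)


# Toy coordinates, for illustration only.
atoms = [
    Atom("A", 10, (0.0, 0.0, 0.0)),
    Atom("A", 11, (20.0, 0.0, 0.0)),
    Atom("B", 55, (3.0, 0.0, 0.0)),
    Atom("B", 56, (40.0, 0.0, 0.0)),
]
print(interface_residues(atoms, "A", "B"))
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a benchmark of a few structures but would need replacing for large-scale analysis.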

: Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time-series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere. There must be special directional methods which take into account the structure of the sample spaces. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database. Integrating Mocapy++ with Biopython will create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods. It will provide strong support for the field of biomolecular structure prediction, design, and simulation.
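
To see why angular data need dedicated directional methods, consider averaging two dihedral-like angles on either side of zero. The sketch below is a generic illustration of the circular mean and is not taken from Mocapy++; the function name <code>circular_mean</code> and the example angles are assumptions made for this example.

```python
import math


def circular_mean(angles_deg):
    """Mean direction of angles given in degrees, returned in (-180, 180].

    Each angle is treated as a unit vector on the circle; the mean is the
    direction of the vector sum rather than the average of the numbers.
    """
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum))


angles = [350.0, 10.0]
arithmetic = sum(angles) / len(angles)  # points the opposite way
circular = circular_mean(angles)        # close to 0 degrees

print("arithmetic mean:", arithmetic)
print("circular mean:  ", circular)
```

The arithmetic mean lands at 180 degrees, directly opposite both observations, while the circular mean lies between them near 0 degrees.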

; Approach & Goals
: Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide a strong support for the field of protein structure prediction, design and simulation.

; Mentors
: [http://etal.myweb.uga.edu/ Eric Talevich]
: [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]

; Degree of difficulty and needed skills : Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.

==== [http://biopython.org/wiki/GSOC2011_MocapyExt MocapyExt] ====

; Student
: Justinas V. Daugmaudis
; Rationale
: Biopython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics, notably the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure. This powerful library, which has been used with great success in projects such as TorusDBN, Basilisk, and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and the library has since been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, Mocapy++ currently has no provisions for dynamic loading of custom node types, and a mechanism for plugging in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing users to quickly implement and test new probability distributions, which in turn could substantially reduce development time and effort; users could extend Mocapy++ without modifying or recompiling it. Recognizing this need, T. Hamelryck proposed this project (herein referred to as MocapyEXT), which aims to improve the current Mocapy++ node type extension mechanism.

; Approach & Goals
: The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.

: Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community. Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed edPDB or the more complete Biskit library render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these programs' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.

; Mentors
: [http://etal.myweb.uga.edu/ Eric Talevich]
: [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]
: Diana Jaunzeikare

=== Accessing R phylogenetic tools from Python ===

=== 2009 ===

; Rationale : The [http://www.r-project.org/ R statistical language] is a powerful open-source environment for statistical computation and visualization. [http://www.python.org/ Python] serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using [http://bitbucket.org/lgautier/rpy2/ Rpy2], allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.

==== [http://biopython.org/wiki/PhyloXML PhyloXML] ====

; Rationale
: PhyloXML is an XML format for phylogenetic trees, designed to allow storing information about the trees themselves (such as branch lengths and multiple support values) along with data such as taxonomic and genomic annotations. Connecting these pieces of evolutionary information in a standard format is key for comparative genomics.

A BioPerl driver for phyloXML was created during the 2008 Summer of Code; this project aims to build a similar module for the popular Biopython package.

; Mentors
: [https://github.com/chapmanb Brad Chapman]
: Christian Zmasek
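
To give a feel for the kind of data a phyloXML module has to handle, the sketch below reads clade names and branch lengths from a tiny document using only Python's standard library. The embedded tree and the function name <code>leaf_branch_lengths</code> are made up for illustration, and this is not the parser produced by the project.

```python
import xml.etree.ElementTree as ET

# phyloXML elements live in this XML namespace.
PHYLOXML_NS = {"phy": "http://www.phyloxml.org"}

# A tiny, made-up tree with two named leaves.
EXAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade><name>A</name><branch_length>0.10</branch_length></clade>
      <clade><name>B</name><branch_length>0.25</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def leaf_branch_lengths(xml_text):
    """Map each named clade to its branch length."""
    root = ET.fromstring(xml_text)
    lengths = {}
    for clade in root.iter("{http://www.phyloxml.org}clade"):
        # find() looks only at direct children, so nested clades are
        # handled one at a time as iter() walks the tree.
        name = clade.find("phy:name", PHYLOXML_NS)
        length = clade.find("phy:branch_length", PHYLOXML_NS)
        if name is not None and length is not None:
            lengths[name.text] = float(length.text)
    return lengths


lengths = leaf_branch_lengths(EXAMPLE)
print(lengths)
```

A full module would also need to represent the annotations mentioned above, such as support values and taxonomic data, which this sketch ignores.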

; Approach : Rpy2 contains higher level interfaces to popular R libraries. For instance, the [http://rpy.sourceforge.net/rpy2/doc-2.1/html/graphics.html#package-ggplot2 ggplot2 interface] allows python users to access powerful plotting functionality in R with an intuitive API. Providing similar high level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the [http://bodegaphylo.wikispot.org/Phylogenetics_and_Comparative_Methods_in_R Bodega Bay Marine Lab wiki]. Some examples of R libraries for which integration would be welcomed are:

: I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.

; Challenges : The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.

; Degree of difficulty and needed skills : Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogeny or biogeography. The student has plenty of flexibility to define the project based on their biological interests (e.g. [http://www.warwick.ac.uk/go/peter_cock/python/heatmap/ microarrays and heatmaps]); there is also the possibility to venture far into data visualization once access to analysis methods is made. [http://kiwi.cs.dal.ca/GenGIS/Main_Page GenGIS] can give ideas about what is possible.

=== Mocapy++Biopython: from data to probabilistic models of biomolecules ===

; Rationale : [http://sourceforge.net/projects/mocapy/ Mocapy++] is a machine learning toolkit for training and using [http://en.wikipedia.org/wiki/Bayesian_network Bayesian networks]. Mocapy++ supports the use of [http://en.wikipedia.org/wiki/Directional_statistics directional statistics]; the statistics of angles, orientations and directions. This unique feature of Mocapy++ makes the toolkit especially suited for the formulation of probabilistic models of biomolecular structure. The toolkit has already been used to develop (published and peer reviewed) models of [http://www.pnas.org/content/105/26/8932.abstract?etoc protein] and [http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000406 RNA] structure in atomic detail. Mocapy++ is implemented in C++, and does not provide any Python bindings. The goal of this proposal is to develop an easy-to-use Python interface to Mocapy++, and to integrate this interface with the Biopython project. Through its [http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Bio.PDB] module (initially implemented by the mentor of this proposal, [http://www.binf.ku.dk/research/structural_bioinformatics/ T. Hamelryck]), Biopython provides excellent functionality for data mining of biomolecular structure databases. Integrating Mocapy++ and Biopython would create strong synergy, as it would become quite easy to extract data from the databases, and subsequently use this data to formulate probabilistic models. As such, it would provide a strong impulse to the field of protein structure prediction, design and simulation. Possible applications beyond bioinformatics are obvious, and include probabilistic models of human or animal movement, or any other application that involves directional data.

; Approach : Ideally, the student would first gain some understanding of the theoretical background of the algorithms that are used in Mocapy++, such as parameter learning of Bayesian networks using [http://en.wikipedia.org/wiki/Expectation-maximization_algorithm Stochastic Expectation Maximization (S-EM)]. Next, the student would study some of the use cases of the toolkit, making use of some of the published articles that involve Mocapy++. After becoming familiar with the internals of Mocapy++, Python bindings will then be implemented using the Boost C++ library. Based on the use cases, the student would finally implement some example applications that involve data mining of biomolecular structure using Biopython, the subsequent formulation of probabilistic models using Python-Mocapy++, and its application to some biologically relevant problem. Schematically, the following steps are involved for the student:

:* Gaining some understanding of S-EM and directional statistics
:* Study of Mocapy++ use cases
:* Study of Mocapy++ internals and code
:* Design of interface strategy
:* Implementing Python bindings using Boost
:* Example applications, involving Bio.PDB data mining

; Challenges : The project is highly interdisciplinary, and ideally requires skills in programming (C++, Python, wrapping C++ libraries in Python, Boost), machine learning, knowledge of biomolecular structure and statistics. The project could be extended (for example, by implementing additional functionality in Mocapy++) or limited (for example by limiting the time spent on understanding the theory behind Mocapy++).

; Involved toolkits or projects :

:* [http://biopython.org/wiki/Main_Page Biopython]
:* [http://sourceforge.net/projects/mocapy/ Mocapy++]

; Degree of difficulty and needed skills : Hard. The student needs to be fluent in C++, Python and the [http://www.boost.org C++ Boost library]. Experience with machine learning, Bayesian statistics and biomolecular structure would be clear advantages.

== Past Proposals ==

=== 2012 ===

Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc but for pairwise search results. This would aim to cover EMBOSS muscle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.

Approach

Much of the low level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object heirachy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.

Challenges

The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.

Difficulty and needed skills

Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the BioPython codebase. Experience with all of the command line tools listed would be clear advantages, as would first hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.

Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.

release code to appropriate community efforts and write short manuscript

implement web service for HGVS conversion

Difficulty and needed skills

Easy-to-Medium depending on how many objectives are attempted. The student will need have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.

2011

Analysis of protein-protein complexes interfaces at a residue level yields significant information on the overall binding process. Such information can be broadly used for example in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. Protorop (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool and the elevated number of citations the server has had since its publication acknowledge its importance. However, being a webserver, Protorop is not suited for large-scale analysis and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could be easily used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would be also reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython supports nucleic acids already.

Approach & Goals

Add the new module backbone in current Bio.PDB code base

Evaluate possible code reuse and call it into the new module

Try simple calculations to be sure that there is stability between the different modules (parsing for example) and functions

Define a stable benchmark

Select few PDB files among interface size and proteins size would be different

Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time-series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere. There must be special directional methods which take into account the structure of the sample spaces. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database. Integrating Mocapy++ with Biopython will create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods. It will provide strong support for the field of biomolecular structure prediction, design, and simulation.

Approach & Goals

Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide a strong support for the field of protein structure prediction, design and simulation.

Biopython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public License (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. Notable among these are the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure. The library, which has been used with great success in projects such as TorusDBN, Basilisk and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++.

However, C++ is a statically typed, compiled programming language, which does not facilitate rapid prototyping. Mocapy++ currently has no provision for dynamic loading of custom node types, so a mechanism for plugging in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by making it possible to quickly implement and test new probability distributions, which in turn could substantially reduce development time and effort: the user would be able to extend Mocapy++ without modifying and recompiling it. Recognizing this need, T. Hamelryck proposed this project (herein referred to as MocapyEXT), which aims to improve the current Mocapy++ node type extension mechanism.

Approach & Goals

The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, in which built-in and dynamically loaded node types can be used in a uniform manner. In addition, a user will be able to modify externally implemented, dynamically loaded nodes without recompiling either the client program or the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of existing code, and improve software interoperability while introducing minimal changes to the existing Mocapy++ interface, thus easing acceptance of the changes introduced by MocapyEXT.
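The idea of using built-in and dynamically loaded node types in a uniform manner can be sketched in plain Python as a small registry. This is only an illustration of the general plug-in pattern, not the MocapyEXT design or the Mocapy++ API: the names <code>NODE_TYPES</code>, <code>register</code>, <code>load_plugin</code>, <code>GaussianNode</code> and <code>CircularUniformNode</code> are all hypothetical, and the plug-in is written to a temporary file only so that the example is self-contained.

```python
import importlib.util
import math
import os
import tempfile

# Hypothetical registry mapping a node type name to its class.
NODE_TYPES = {}


def register(name):
    """Class decorator that adds a node type to the registry."""
    def decorator(cls):
        NODE_TYPES[name] = cls
        return cls
    return decorator


@register("gaussian")
class GaussianNode:
    """Stand-in for a node type that ships with the library."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma

    def log_likelihood(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2.0 * math.pi))


def load_plugin(path, module_name="user_node_plugin"):
    """Load a Python source file at run time and let it register node types."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    module.setup(register)  # assumed plug-in entry point


# A user-written plug-in: editing this source requires no recompilation.
PLUGIN_SOURCE = '''
import math

def setup(register):
    @register("circular_uniform")
    class CircularUniformNode:
        """Uniform density on the circle: 1 / (2 * pi) for every angle."""

        def log_likelihood(self, angle):
            return -math.log(2.0 * math.pi)
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    plugin_path = os.path.join(tmp_dir, "circular_uniform.py")
    with open(plugin_path, "w") as handle:
        handle.write(PLUGIN_SOURCE)
    load_plugin(plugin_path)

# Built-in and plug-in node types are now created and used the same way.
for type_name in ("gaussian", "circular_uniform"):
    node = NODE_TYPES[type_name]()
    print(type_name, node.log_likelihood(0.5))
```

In this sketch the client code only looks up names in the registry, so it does not need to know whether a node type was built in or loaded from a user file.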

2010

Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and supports several calculations on macromolecules (neighbour search, RMS), it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal, residue renaming (to account for the different existing nomenclatures) and renumbering would also be greatly appreciated by the community.

Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed edPDB or the more complete Biskit library render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services and software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.
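As a rough illustration of what probing for disulphide bridges involves, the sketch below pairs cysteine SG atoms that lie within a distance cutoff. It is not the implementation produced by the project. The function names are hypothetical, the 2.2 Å cutoff is an assumed value chosen for the example, and the optional helper <code>sg_atoms_from_pdb</code> requires Biopython and a PDB file, so the demonstration runs on made-up coordinates instead.

```python
import itertools
import math


def find_disulphide_candidates(sg_atoms, cutoff=2.2):
    """Return (id_a, id_b, distance) for SG-SG pairs within cutoff angstroms.

    sg_atoms is a list of (residue_id, (x, y, z)) tuples.
    The default cutoff is an illustrative assumption, not a Bio.PDB setting.
    """
    pairs = []
    for (id_a, xyz_a), (id_b, xyz_b) in itertools.combinations(sg_atoms, 2):
        distance = math.dist(xyz_a, xyz_b)
        if distance <= cutoff:
            pairs.append((id_a, id_b, distance))
    return pairs


def sg_atoms_from_pdb(path):
    """Collect cysteine SG coordinates from the first model of a PDB file."""
    from Bio.PDB import PDBParser  # imported here so the demo runs without Biopython

    structure = PDBParser(QUIET=True).get_structure("query", path)
    first_model = next(iter(structure))
    atoms = []
    for residue in first_model.get_residues():
        if residue.get_resname() == "CYS" and "SG" in residue:
            residue_id = (residue.get_parent().id, residue.id[1])
            coord = tuple(float(c) for c in residue["SG"].coord)
            atoms.append((residue_id, coord))
    return atoms


# Made-up coordinates: residues 3 and 40 are close enough to be bonded, 16 is not.
toy_atoms = [
    (("A", 3), (0.0, 0.0, 0.0)),
    (("A", 40), (2.04, 0.0, 0.0)),
    (("A", 16), (10.0, 0.0, 0.0)),
]

bridges = find_disulphide_candidates(toy_atoms)
for id_a, id_b, distance in bridges:
    print(id_a, id_b, round(distance, 2))
```

For a real structure, the list returned by <code>sg_atoms_from_pdb</code> could be passed to <code>find_disulphide_candidates</code> in place of <code>toy_atoms</code>. For large structures, a spatial index such as Bio.PDB's neighbour search would avoid the all-pairs comparison.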

2009

PhyloXML is an XML format for phylogenetic trees, designed to allow storing information about the trees themselves (such as branch lengths and multiple support values) along with data such as taxonomic and genomic annotations. Connecting these pieces of evolutionary information in a standard format is key for comparative genomics.

A BioPerl driver for phyloXML was created during the 2008 Summer of Code; this project aims to build a similar module for the popular Biopython package.
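To show how phyloXML keeps tree information and annotations together, the sketch below parses a small hand-written phyloXML document with Python's standard <code>xml.etree.ElementTree</code> module. It is not the module built in this project; the toy tree, its values and the helper name <code>leaf_records</code> are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# phyloXML elements live in this XML namespace.
NS = {"phy": "http://www.phyloxml.org"}

# A made-up two-leaf tree with branch lengths, a support value and taxonomy.
EXAMPLE = """
<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>toy tree</name>
    <clade>
      <branch_length>0.06</branch_length>
      <confidence type="bootstrap">89</confidence>
      <clade>
        <name>A</name>
        <branch_length>0.102</branch_length>
        <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
        <taxonomy><scientific_name>Mus musculus</scientific_name></taxonomy>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def leaf_records(clade):
    """Recursively collect name, branch length and species for each leaf clade."""
    children = clade.findall("phy:clade", NS)
    if not children:
        length = clade.findtext("phy:branch_length", namespaces=NS)
        return [{
            "name": clade.findtext("phy:name", namespaces=NS),
            "branch_length": float(length) if length is not None else None,
            "species": clade.findtext(
                "phy:taxonomy/phy:scientific_name", namespaces=NS
            ),
        }]
    records = []
    for child in children:
        records.extend(leaf_records(child))
    return records


root = ET.fromstring(EXAMPLE.strip())
top_clade = root.find("phy:phylogeny/phy:clade", NS)
support = top_clade.find("phy:confidence", NS)
leaves = leaf_records(top_clade)

print("support:", support.get("type"), support.text)
for record in leaves:
    print(record)
```

Each leaf record carries both a tree property (the branch length) and an annotation (the species name), which is the kind of combined information the format is designed to hold.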

I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.