O|B|F News » Biopython
Open Source Bioinformatics news (http://news.open-bio.org/news/feed/atom/), updated 2015-03-03

Posted by Peter (http://www.warwick.ac.uk/go/peter_cock/) on 2015-03-03 (http://news.open-bio.org/news/?p=1258)

Last year's Google Summer of Code 2014 was very productive for the OBF, with six students working on Bio* and related bioinformatics projects. We applied to be part of GSoC 2015, but unfortunately were not accepted this year.

On behalf of the OBF, we would like to thank our volunteer GSoC Administrators, Raoul Bonnal and Francesco Strozzi, for organising our application – and all our potential mentors across the Bio* projects who put forward potential project suggestions.

Posted by Hilmar Lapp on 2015-02-23 (http://news.open-bio.org/news/?p=1243)

In 2014, OBF had six students in the Google Summer of Code 2014™ (GSoC) program mentored under its umbrella of Bio* and related open-source bioinformatics community projects: Loris Cro (Bioruby) with mentors Francesco Strozzi and Raoul Bonnal; Evan Parker (Biopython) with mentors Wibowo Arindrarto and Peter Cock; Sarah Berkemer (BioHaskell) with mentors Christian Höner zu Siederdissen and Ketil Malde; and three students contributed to JSBML: Victor Kofia (mentors: Alex Thomas and Sarah Keating), Ibrahim Vazirabad (mentors: Andreas Dräger and Alex Thomas), and Leandro Watanabe (mentors: Nicolas Rodriguez and Chris Myers).

As a change from earlier years in which OBF participated in GSoC as a mentoring organization, in 2014 we purposefully defined our umbrella as much more inclusive of the wider bioinformatics open-source community, bringing it more in line with the annual Bioinformatics Open-Source Conference (BOSC). In part this was also motivated by “paying it forward”, a concept central to growing healthy open-source communities, after the larger domain-agnostic language projects such as SciRuby and PSF had extended an open hand to OBF mentors when OBF did not get admitted as a GSoC mentoring organization in 2013. In the end, four out of the six successful student applications were for projects outside of the traditional core Bio* projects, a result with which everyone won: We had a terrific crop of students, our community grew larger and stronger, and open-source bioinformatics was advanced in a more diverse way than would have been possible otherwise.

In addition to our students, huge kudos also go to our mentors (see above), and to Eric Talevich (Biopython) and Raoul Bonnal (Bioruby), who ran our program participation as administrators. They all invested significant amounts of time on behalf of our community and projects. Thank you!

Below follows a short summary of each of the 2014 student projects, starting with the three JSBML students.

JSBML and GSoC 2014

JSBML is an international community-driven, open-source project to develop a Java API library for reading, writing and manipulating SBML, a data format for representing and exchanging computational models in systems biology. SBML has been in use for over a decade but continues to evolve and grow, and hence so does JSBML. JSBML holds two annual development-oriented workshops, and the three 2014 JSBML GSoC students had the opportunity to participate in and present their work at the autumn event, COMBINE (Computational Modeling in Biology Network), which was held in Los Angeles, California, right at the end of GSoC. Furthermore, a scientific publication on a new JSBML release, currently under review at Bioinformatics, highlights some of the work done by the students. Hence, JSBML’s 2014 participation in GSoC was a great success and experience, both for the students as well as the JSBML project and community.

CellDesigner is a frequently used program in computational systems biology. It features an easy-to-use GUI, powerful graph editing functions, and rich simulation functionality, among others. To facilitate rapid prototyping of new algorithms in third-party applications, CellDesigner provides a plug-in interface that gives Java applications access to its robust interface and other features. However, the design and implementation of the plug-in interface made developing software for it very difficult and time-consuming. To remedy this, a draft version of a JSBML library had been created to allow developing and testing prospective plug-in modules initially as stand-alone software, which can then be turned into a CellDesigner plug-in with very little effort. The goal of Ibrahim’s project was to improve the interface provided by the library and, importantly, to revise it to support access to one of CellDesigner’s most interesting features, graphical network layout. As a result of Ibrahim’s work, new CellDesigner test cases and plug-ins that use this interface have already been implemented, including one that converts between CellDesigner’s proprietary data format and the official SBML layout extension.

Leandro H. Watanabe – “Arrays Package”

The arrays and dynamic package extensions to SBML have been proposed to overcome SBML’s limitation to static models, which is in contrast to the inherently dynamic nature of many biological systems. The goal of Leandro’s project was to implement the arrays package in JSBML. Rather than enabling models with new behaviors to be constructed, the purpose of the arrays package is to represent regular constructs more efficiently and more compactly than SBML core constructs can. To aid the integration of the arrays package into existing tools, Leandro also implemented the option of flattening an arrayed model to use only SBML core constructs, and a validation procedure for array constructs that checks whether a model violates any of the rules imposed on array constructs. As a consequence, his work helped solidify the Arrays Specification document of the SBML standard.

Victor Kofia – “Redesign the implementation of mathematical formulas”

JSBML uses the concept of abstract syntax trees to work with mathematical expressions. For example, the formula k8 · R1 is represented by a syntax tree with a multiplication node at the root and the symbols k8 and R1 as its two children. Originally, JSBML implemented different kinds of formula components all in just one complex class with diverse type attributes, which was prone to introducing errors upon code changes and generally made maintenance of the software difficult. Victor implemented a math package for JSBML, in which different kinds of tree nodes that can occur in formulas (e.g., real numbers or algebraic symbols such as ‘plus’ or ‘minus’) are represented with their own, specialized classes. This has made handling of formulas much more straightforward, and also more efficient. In the future, this new representation could even be used for symbolic or numeric calculations.
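The idea of giving each kind of tree node its own specialized class can be illustrated with a small sketch. It is written in Python rather than Java, and the class names (Node, Real, Symbol, Times, Plus) and the evaluate method are assumptions made for illustration only; they are not the JSBML math package API.

```python
from dataclasses import dataclass


class Node:
    """Base class for all formula tree nodes (hypothetical, not the JSBML API)."""

    def evaluate(self, env):
        raise NotImplementedError


@dataclass
class Real(Node):
    value: float

    def evaluate(self, env):
        return self.value


@dataclass
class Symbol(Node):
    name: str

    def evaluate(self, env):
        # Look the symbol up in a mapping of names to numeric values.
        return env[self.name]


@dataclass
class Times(Node):
    left: Node
    right: Node

    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)


@dataclass
class Plus(Node):
    left: Node
    right: Node

    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)


# The formula k8 * R1 as a tree: a Times node with two Symbol children.
formula = Times(Symbol("k8"), Symbol("R1"))
result = formula.evaluate({"k8": 0.5, "R1": 4.0})
```

Because each node type carries only the fields it needs, adding a new operator means adding a class rather than extending one large class with more type attributes.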

Though Biopython is already equipped with sequence parsers for a wide array of formats, these generally parsed entire records into memory. For large sequences such as entire chromosomes this quickly degrades performance. To allow sequences to be loaded on-demand, Evan designed a general lazy-loading parser by refactoring the existing object model, and then added format-specific modifications to each individual parser. The approach he devised works by pre-indexing the sequence files and then loading only those sequence regions that the user requests. Benchmarking and performance comparisons showed this approach yields significant performance gains when, as is common for genome-scale files, users are interested only in parts of the full sequence. Evan’s code is currently under review by Biopython core developers, and once merged will make parsing large sequences in Biopython much more tractable.
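The pre-index-then-fetch approach can be sketched with the standard library alone. This is a simplified illustration for FASTA files and not Evan's implementation; the function names build_index and fetch and the demonstration file are hypothetical.

```python
import os
import tempfile


def build_index(path):
    """Map each FASTA record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                record_id = line[1:].split()[0].decode("ascii")
                index[record_id] = offset
    return index


def fetch(path, index, record_id, start, end):
    """Return sequence[start:end] of one record, reading only that record."""
    parts = []
    length = 0
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break
            chunk = line.strip()
            parts.append(chunk)
            length += len(chunk)
            if length >= end:
                break  # stop reading as soon as the region is covered
    return b"".join(parts)[start:end].decode("ascii")


# Small demonstration file with two records.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">chr1 demo\nACGTACGT\nTTTTGGGG\n>chr2\nCCCCAAAA\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
region = fetch(demo_path, demo_index, "chr1", 6, 10)
other = fetch(demo_path, demo_index, "chr2", 0, 4)
os.remove(demo_path)
```

The index is built once in a single pass. Each later request seeks directly to the record and stops reading as soon as the requested region is covered, which is where the savings for chromosome-sized records would come from.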

Variant Call Format (VCF) files are commonly generated by genome sequencing projects for sequence variations among different individuals and can get very large. The goal of Loris’ work was to develop code for Bioruby to determine the common variations (i.e., intersections) between multiple individuals and groups of individuals in a fast and scalable way. In the first phase of the project, Loris tested different technologies for storing large VCF files, from which MongoDB emerged as having superior performance. In the second phase Loris developed the code for efficiently storing VCF data into MongoDB, and then implemented algorithms for performing the intersection queries (see Github repo and Loris’ project blog). The code was developed using JRuby and uses the HTS-JDK library to parse the VCF data. In the course of the project, Loris also provided valuable feedback to the HTS-JDK team that led to improvements of the VCF parser and data model. The result of Loris’ GSoC work is now available to the community as a Ruby Gem, which has been tested and used already in large international genome re-sequencing projects, including Gene2Farm and WHEALBI.
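What an intersection query computes can be shown with a small in-memory sketch. It uses Python sets rather than MongoDB, JRuby or HTS-JDK, so it says nothing about the performance of Loris' implementation; the function names and the sample data are made up for illustration.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF file."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def common_variants(*samples):
    """Return the variants present in every sample."""
    key_sets = [variant_keys(lines) for lines in samples]
    return set.intersection(*key_sets) if key_sets else set()


sample_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT,G\t50\tPASS\t.",
]
sample_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t200\t.\tC\tG\t60\tPASS\t.",
    "2\t300\t.\tT\tA\t60\tPASS\t.",
]
shared = common_variants(sample_a, sample_b)
```

Holding every variant in memory does not scale to very large VCF files, which is the problem the database-backed approach described above addresses.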

Sarah Berkemer – “Open source high-performance BioHaskell”

One of the challenges with sequence alignments for the purposes of sequence similarity searches is that for most known genes (i.e., sequences) relatively little is known about their biology, and the few for which a lot is known therefore tend to be only remotely related to a query sequence. Transitive alignments try to ameliorate this by aligning the query sequence against a large body of known but not deeply understood sequences, the intermediate set, which in turn are then aligned against the core of well-understood sequences. However, in contrast to aligning two sequences, aligning a sequence via a vast intermediate data set to a smaller core set is slow and memory-consuming. As part of her GSoC project, Sarah dug deep into the structure of the algorithm, and rewrote core parts to make use of fusing data structures and efficient tree-like data structures (see her project blog). Her work brought down the runtime for a benchmark by a factor of 3, from 31 to 11 minutes, and, arguably even more important, reduced memory consumption from 53 to 22 gigabytes. This now allows running the program on consumer-grade high-memory PCs. With Sarah having finished her Master’s degree (congrats!!) in the meantime, she and her mentors are now in the process of writing a scientific application note and are planning to make the program available as an online web-service.

For the BioHaskell community, a rather small family within the much larger OBF umbrella, the chance to have a student contribute to functional programming for computational biology has been a tremendous opportunity and learning experience as well.

Posted by Peter (http://www.warwick.ac.uk/go/peter_cock/) on 2014-12-17 (http://news.open-bio.org/news/?p=1226)

Source distributions and Windows installers for Biopython 1.65 are now available.

This release of Biopython supports Python 2.6, 2.7, 3.3 and 3.4. It is also tested on PyPy 2.0 to 2.4, PyPy3 version 2.4, and Jython 2.7b2.

The most visible change is that the Biopython sequence objects now use string comparison, rather than Python’s object comparison. This has been planned for a long time, with warning messages in place under Python 2 (the warnings were sadly missing under Python 3).
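The difference between the two comparison styles can be seen in a minimal sketch. The two classes below are hypothetical stand-ins written for illustration; they are not Biopython's sequence classes and leave out everything except equality.

```python
class ObjectComparedSeq:
    """Inherits Python's default comparison, which is based on object identity."""

    def __init__(self, data):
        self.data = data


class StringComparedSeq:
    """Compares by sequence content, like a plain string."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        # Accept both other sequence objects and plain strings.
        other_data = getattr(other, "data", other)
        if not isinstance(other_data, str):
            return NotImplemented
        return self.data == other_data

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash(self.data)


old_style = ObjectComparedSeq("ACGT") == ObjectComparedSeq("ACGT")
new_style = StringComparedSeq("ACGT") == StringComparedSeq("ACGT")
vs_string = StringComparedSeq("ACGT") == "ACGT"
```

With object comparison, two separately created sequences with identical letters compare as unequal. With string comparison they compare as equal, which also changes how such objects behave as dictionary keys and set members.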

The Bio.KEGG and Bio.Graphics modules have been expanded with support for the online KEGG REST API, and parsing, representing and drawing KGML pathways.

The Pterobranchia Mitochondrial genetic code has been added to Bio.Data (and the translation functionality), which is the new NCBI genetic code table 24.

The Bio.SeqIO parser for the ABI capillary file format now exposes all the raw data in the SeqRecord’s annotation as a dictionary. This allows further in-depth analysis by advanced users.

Bio.SearchIO QueryResult objects now allow a Hit to be retrieved using any of its alternative IDs (any IDs listed after the first one, for example as used with the NCBI BLAST NR database).

Bio.SeqUtils.MeltingTemp has been rewritten with new functionality.

The new experimental module Bio.CodonAlign has been renamed Bio.codonalign (and similar lower case PEP8 style module names have been used for the sub-modules within this).

Bio.SeqIO.index_db(…) and Bio.SearchIO.index_db(…) now store any relative filenames relative to the index file, rather than (as before) relative to the current directory at the time the index was built. This makes the indexes less fragile, so that they can be used from other working directories. NOTE: This change is backward compatible (old index files work as before), however relative paths in new indexes will not work on older versions of Biopython!
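The path handling described above can be sketched with os.path. This illustrates the idea only and is not Biopython's code; the helper names path_to_store and path_to_open and the example filenames are assumptions.

```python
import os


def path_to_store(sequence_file, index_file):
    """Return the filename to record in the index.

    Relative filenames are stored relative to the index file's directory;
    absolute filenames are stored unchanged.
    """
    if os.path.isabs(sequence_file):
        return sequence_file
    index_dir = os.path.dirname(os.path.abspath(index_file))
    return os.path.relpath(os.path.abspath(sequence_file), index_dir)


def path_to_open(stored_name, index_file):
    """Resolve a stored filename back to an absolute path."""
    if os.path.isabs(stored_name):
        return stored_name
    index_dir = os.path.dirname(os.path.abspath(index_file))
    return os.path.normpath(os.path.join(index_dir, stored_name))


# Index and sequence file live side by side in a "data" directory.
index_path = os.path.join("data", "reads.idx")
stored = path_to_store(os.path.join("data", "reads.fasta"), index_path)
resolved = path_to_open(stored, index_path)
```

Because the stored name depends only on where the sequence file sits relative to the index, the pair can be opened from any working directory as long as the two files keep their relative positions.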

Many thanks to the Biopython developers and community for making this release possible, especially the following contributors:

Alan Du (first contribution)

Carlos Pena (first contribution)

Colin Lappala (first contribution)

Christian Brueffer

David Bulger (first contribution)

Eric Talevich

Evan Parker (first contribution)

Hongbo Zhu

Kai Blin

Kevin Wu (first contribution)

Leighton Pritchard

Leszek Pryszcz (first contribution)

Markus Piotrowski

Matt Shirley (first contribution)

Mike Cariaso (first contribution)

Peter Cock

Seth Sims (first contribution)

Tiago Antao

Travis Wrightsman (first contribution)

Tyghe Vallard (first contribution)

Vincent Davis

Wibowo ‘Bow’ Arindrarto

Zheng Ruan

This is a longer list of contributors and changes than usual, but it was also a longer gap since our last release.

Posted by Tiago Antao (http://tiago.org) on 2014-05-29 (http://news.open-bio.org/news/?p=1179)

Source distributions and Windows installers for Biopython 1.64 are now available from the downloads page on the official Biopython website and from the Python Package Index (PyPI). This release of Biopython supports Python 2.6 and 2.7, 3.3 and also the new 3.4 version. It is also tested on PyPy 2.0 …

Thanks very much to all the students who applied; we very much appreciate your hard work.

We are now in the GSoC Community Bonding Period. Official work starts on May 23rd, and until then, students should prepare for their projects: get on the project mailing lists, solidify your plans, figure out where all the version control repositories are and which branch or fork you’ll be working on, and start doing preparatory work.

Here’s to a great 2014 Summer of Code,

Eric & Raoul

OBF GSoC 2014 Organization Administrators

Posted by Hilmar Lapp on 2014-01-15, updated 2014-01-23 (http://news.open-bio.org/news/?p=1106)

Update: The deadline for responding has been extended to January 25.

The 2014 Google Summer of Code (GSoC) is coming up soon. The published timeline puts the mentoring organization applications from Feb 3 to 14.

OBF participated on behalf of our member projects in 2010, 2011, and 2012. Those participations were both important and successful. Through them, our projects gained new contributors, new features, and new community members. The mentors involved from our projects learned as much from the experience as the students, and formed bonds. The mentoring organization payment allowed OBF to sponsor community events and infrastructure.

To participate this year, we have to designate 2-3 people as primary and backup organization administrators. This is an important role, and we are looking for people from our community to step forward to serve.

An org admin’s role is in many ways that of a cat herder. The whole team of mentors and admins creates the experience for the students, but it falls on the admin to “keep it together.” Google holds the mentoring organization, not its mentors, accountable for the actions (or non-actions) of its mentors or community, and it falls on the org admin to carry that accountability through to the org’s mentors. The org admin’s responsibilities include:

Working out processes and rules for mentors as well as students that promote transparency and fairness, and protect against late-in-the-game surprises.

Knowing GSoC rules and processes, and making sure ours are consistent with them.

Reminding participants of rules, and enforcing them in the event it is necessary.

Mediating, and sometimes arbitrating between students and mentors when needed.

Ensuring that GSoC timelines are met by everyone.

The person we are looking for genuinely cares about the well-being of our communities, is well organized, stays calm in email storms, communicates clearly, has good people skills, and is generally known as a good listener.

If you are interested in helping us out in this role, please email us (by Jan 25, 2014) a statement at board@open-bio.org explaining how you would fit well in this role, and what your vision for our GSoC participation is. You need not be a developer or programmer to respond, but for now we do require that you have been active in some capacity in at least one of our projects’ communities. Please include in your email a brief summary of such activities even if you are a core developer for one of our projects.

We are looking forward to hearing from you!

Posted by Tiago Antao (http://tiago.org) on 2013-12-06 (http://news.open-bio.org/news/?p=1076)

Source distributions and Windows installers for Biopython 1.63 are now available from the downloads page on the official Biopython website and (soon) from the Python Package Index (PyPI).

The current version removed the requirement of the 2to3 library. This was made possible by dropping Python 2.5 (and Jython 2.5).

This release of Biopython supports Python 2.6 and 2.7, and also Python 3.3.

The Biopython Tutorial & Cookbook, and the docstring examples in the source code, now use the Python 3 style print function in place of the Python 2 style print statement. This language feature is available under Python 2.6 and 2.7 via:

from __future__ import print_function

Similarly we now use the Python 3 style built-in next function in place of the Python 2 style iterators’ .next() method. This language feature is also available under Python 2.6 and 2.7.
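The built-in next function can be illustrated with a short example. The list of codons is made up for the demonstration.

```python
codons = iter(["ATG", "GCC", "TAA"])

# Python 3 style: the built-in next() function (also available in Python 2.6+).
first = next(codons)

# Python 2 only: codons.next() -- this method does not exist on Python 3 iterators.
second = next(codons)

# next() also accepts a default that is returned once the iterator is exhausted.
third = next(codons)
after_end = next(codons, None)
```

Code written with next(iterator) runs unchanged on Python 2.6, 2.7 and 3.x, whereas iterator.next() fails on Python 3.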

The restriction enzyme list in Bio.Restriction has been updated to the December 2013 release of REBASE.

Many thanks to the Biopython developers and community for making this release possible, especially the following contributors:

Chris Mitchell (first contribution)

Christian Brueffer

Eric Talevich

Gokcen Eraslan (first contribution)

Josha Inglis (first contribution)

Konstantin Tretyakov (first contribution)

Lenna Peterson

Martin Mokrejs

Nigel Delaney (first contribution)

Peter Cock

Sergei Lebedev (first contribution)

Tiago Antao

Wayne Decatur (first contribution)

Wibowo ‘Bow’ Arindrarto

Posted by Tiago Antao (http://tiago.org) on 2013-11-12 (http://news.open-bio.org/news/?p=1063)

Source distributions and Windows installers for Biopython 1.63 beta are now available from the downloads page on the official Biopython website.

This is a beta release for testing purposes. The main reason for a beta version is the large number of changes imposed by the removal of the 2to3 library previously required for the support of Python 3.X. This was made possible by dropping Python 2.5 (and Jython 2.5).

This release of Biopython supports Python 2.6 and 2.7, and also Python 3.3.

The Biopython Tutorial & Cookbook, and the docstring examples in the source code, now use the Python 3 style print function in place of the Python 2 style print statement. This language feature is available under Python 2.6 and 2.7 via:

from __future__ import print_function

Similarly we now use the Python 3 style built-in next function in place of the Python 2 style iterators’ .next() method. This language feature is also available under Python 2.6 and 2.7.

Contributors

Many thanks to the Biopython developers and community for making this release possible, especially the following contributors:

Chris Mitchell (first contribution)

Christian Brueffer

Eric Talevich

Josha Inglis (first contribution)

Konstantin Tretyakov (first contribution)

Lenna Peterson

Martin Mokrejs

Nigel Delaney (first contribution)

Peter Cock

Sergei Lebedev (first contribution)

Tiago Antao

Wayne Decatur (first contribution)

Wibowo ‘Bow’ Arindrarto

Posted by Peter (http://www.warwick.ac.uk/go/peter_cock/) on 2013-08-28 (http://news.open-bio.org/news/?p=1035)

Source distributions and Windows installers for Biopython 1.62 are now available from the downloads page on the official Biopython website and from the Python Package Index (PyPI).

This is our first release of Biopython which officially supports Python 3. Specifically, this is supported under Python 3.3. Older versions of Python 3 may still work albeit with some issues, but are not supported.

We still fully support Python 2.5, 2.6, and 2.7. Support under Jython is available for versions 2.5 and 2.7 and under PyPy for versions 1.9 and 2.0. However, unlike CPython, Jython and PyPy support is partial: NumPy and our C extensions are not covered.

Please note that this release marks our last official support for Python 2.5. Beginning with Biopython 1.63, the minimum supported Python version will be 2.6.

Highlights

The translation functions will give a warning on any partial codons (and this will probably become an error in a future release). If you know you are dealing with partial sequences, either pad with “N” to extend the sequence length to a multiple of three, or explicitly trim the sequence.
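The two suggested workarounds can be sketched on plain Python strings. The helper names pad_to_codons and trim_to_codons are hypothetical and are not part of Biopython.

```python
def pad_to_codons(sequence, pad="N"):
    """Extend the sequence with pad characters to a multiple of three."""
    remainder = len(sequence) % 3
    if remainder:
        sequence += pad * (3 - remainder)
    return sequence


def trim_to_codons(sequence):
    """Drop any trailing partial codon."""
    return sequence[: len(sequence) - len(sequence) % 3]


padded = pad_to_codons("ATGGCCA")
trimmed = trim_to_codons("ATGGCCA")
```

Padding keeps the final partial codon, which would be expected to translate as an ambiguous residue, while trimming discards it; which is appropriate depends on whether the trailing bases are meaningful.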

The handling of joins and related complex features in Genbank/EMBL files has been changed with the introduction of a CompoundLocation object. Previously a SeqFeature for something like a multi-exon CDS would have a child SeqFeature (under the sub_features attribute) for each exon. The sub_features property will still be populated for now, but is deprecated and will in future be removed. Please consult the examples in the help (docstrings) and Tutorial.

Thanks to the efforts of Ben Morris, the Phylo module now supports the file formats NeXML and CDAO. The Newick parser is also significantly faster, and can now optionally extract bootstrap values from the Newick comment field (like Molphy and Archaeopteryx do). Nate Sutton added a wrapper for FastTree to Bio.Phylo.Applications.

New module Bio.UniProt adds parsers for the GAF, GPA and GPI formats from UniProt-GOA.

The BioSQL module is now supported in Jython. MySQL and PostgreSQL databases can be used. The relevant JDBC driver should be available in the CLASSPATH.

Feature labels on circular GenomeDiagram figures now support the label_position argument (start, middle or end) in addition to the current default placement. In a change from prior releases, these labels are now placed outside the features, which is consistent with the linear diagrams.

The code for parsing 3D structures in mmCIF files was updated to use the Python standard library’s shlex module instead of C code using flex.
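Why a quote-aware tokenizer such as shlex is a natural fit for this kind of file can be seen from a small example. The data line below is made up in the style of an mmCIF record, and real mmCIF quoting rules differ in places from shell rules, so this is an illustration rather than a description of the Biopython parser.

```python
import shlex

# A made-up data line in the style of an mmCIF atom record; the quoted
# fields contain spaces and must each stay one token.
line = "ATOM 1 N 'N A' ALA \"chain A\" 12.345"

tokens = shlex.split(line)
naive = line.split()
```

Splitting on whitespace alone breaks the quoted values into fragments, whereas shlex.split keeps each quoted value together and removes the surrounding quotes.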

The Bio.Sequencing.Applications module now includes a BWA command line wrapper.

Additionally there have been other minor bug fixes and more unit tests.

Contributors

Many thanks to the Biopython developers and community for making this release possible, especially the following contributors:

Alexander Campbell (first contribution)

Andrea Rizzi (first contribution)

Anthony Mathelier (first contribution)

Ben Morris (first contribution)

Brad Chapman

Christian Brueffer

David Arenillas (first contribution)

David Martin (first contribution)

Eric Talevich

Iddo Friedberg

Jian-Long Huang (first contribution)

Joao Rodrigues

Kai Blin

Lenna Peterson

Michiel de Hoon

Matsuyuki Shirota (first contribution)

Nate Sutton (first contribution)

Peter Cock

Petra Kubincová (first contribution)

Phillip Garland

Saket Choudhary (first contribution)

Tiago Antao

Wibowo ‘Bow’ Arindrarto

Xabier Bello (first contribution)

Posted by Peter (http://www.warwick.ac.uk/go/peter_cock/) on 2013-02-05, updated 2013-02-11 (http://news.open-bio.org/news/?p=998)

Source distributions and Windows installers for Biopython 1.61 are now available from the downloads page on the Biopython website and from the Python Package Index (PyPI). The updated Biopython Tutorial and Cookbook is online (PDF).

Platforms/Deployment

We currently support Python 2.5, 2.6 and 2.7 and also test under Python 3.1, 3.2 and 3.3 (including modules using NumPy), and Jython 2.5 and PyPy 1.9 (Jython and PyPy do not cover NumPy or our C extensions). We are still encouraging early adopters to help test on these platforms, and have included a ‘beta’ installer for Python 3.2 (and Python 3.3 to follow soon) under 32-bit Windows.

Please note we are phasing out support for Python 2.5. We will continue support for at least one further release (Biopython 1.62). This could be extended given feedback from our users. Focusing on Python 2.6 and 2.7 only will make writing Python 3 compatible code easier.

Features

GenomeDiagram has three new sigils (shapes to illustrate features). OCTO shows an octagonal shape, like the existing BOX sigil but with the corners cut off. JAGGY shows a box with jagged edges at the start and end, intended for things like NNNNN regions in draft genomes. Finally BIGARROW is like the existing ARROW sigil but is drawn straddling the axis. This is useful for drawing vertically compact figures where you do not have overlapping genes.

New module Bio.Graphics.ColorSpiral can generate colors along a spiral path through HSV color space. This can be used to make arbitrary ‘rainbow’ scales, for example to color features or cross-links on a GenomeDiagram figure.
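The general idea of sampling colors along a spiral through HSV space can be sketched with the standard library's colorsys module. The function name, its parameters and the choice to vary saturation and value linearly are assumptions for illustration; they are not the Bio.Graphics.ColorSpiral interface.

```python
import colorsys


def spiral_colors(count, turns=0.75, s_min=0.4, s_max=1.0, v_min=0.5, v_max=1.0):
    """Return `count` RGB tuples sampled along a spiral through HSV space.

    Hue advances around the colour wheel while saturation and value rise
    linearly, so neighbouring colours differ in more than hue alone.
    """
    colors = []
    for i in range(count):
        t = i / (count - 1) if count > 1 else 0.0
        hue = (t * turns) % 1.0
        saturation = s_min + t * (s_max - s_min)
        value = v_min + t * (v_max - v_min)
        colors.append(colorsys.hsv_to_rgb(hue, saturation, value))
    return colors


palette = spiral_colors(5)
```

Each returned tuple holds red, green and blue components between 0 and 1, which is a common input format for plotting libraries.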

The Bio.SeqIO module now supports reading sequences from PDB files in two different ways. The “pdb-atom” format determines the sequence as it appears in the structure based on the atom coordinate section of the file (via Bio.PDB, so NumPy is currently required for this). Alternatively, you can use the “pdb-seqres” format to read the complete protein sequence as it is listed in the PDB header, if available.

The Bio.SeqUtils module now has a seq1 function to turn a sequence using three letter amino acid codes into one using the more common one letter codes. This acts as the inverse of the existing seq3 function.
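The conversion itself can be sketched in a few lines. The function three_to_one below is a hypothetical stand-in covering only the 20 standard amino acids; it is not the Bio.SeqUtils implementation, and its handling of unknown codes is an assumption.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(sequence, unknown="X"):
    """Convert a string of three-letter amino acid codes to one-letter codes."""
    if len(sequence) % 3:
        raise ValueError("length must be a multiple of three")
    codes = (sequence[i:i + 3].capitalize() for i in range(0, len(sequence), 3))
    return "".join(THREE_TO_ONE.get(code, unknown) for code in codes)


one_letter = three_to_one("MetAlaIleValMetGlyArg")
```

Normalizing the case of each three-letter chunk lets the same table handle input such as MET, Met or met.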

The multiple-sequence-alignment object used by Bio.AlignIO etc now supports an annotation dictionary. Additional support for per-column annotation is planned, with addition and splicing to work like that for the SeqRecord per-letter annotation.

The Bio.Motif module has been updated and reorganized. To allow for a clean deprecation of the old code, the new motif code is stored in a new module Bio.motifs, and a PendingDeprecationWarning was added to Bio.Motif.

Experimental Code – SearchIO

This release also includes Bow’s Google Summer of Code work writing a unified parsing framework for NCBI BLAST (assorted formats including tabular and XML), HMMER, BLAT, and other sequence searching tools. This is currently available with the new BiopythonExperimentalWarning to indicate that this is still somewhat experimental. We’re bundling it with the main release to get more public feedback, but with the big warning that the API is likely to change. In fact, even the current name of Bio.SearchIO may change since, unless you are familiar with BioPerl, its purpose isn’t immediately clear.