SampleCodes

This document is based on the original 'BioJava in Anger' by Mark Schreiber et al. "BioJava in Anger, A Tutorial and Recipe Book for Those in a Hurry".

TOC

Introduction

BioRuby can be both big and intimidating. For those of us who are in a hurry there really is a whole lot there to get your head around. This document is designed to help you develop BioRuby programs that do 99% of common tasks without needing to read and understand 99% of the BioRuby API.

The page was inspired by various programming cookbooks and follows a "How do I...?" type approach. Each How do I is linked to some example code that does what you want and sometimes more. Basically if you find the code you want and copy and paste it into your program you should be up and running quickly. I have endeavoured to over document the code to make it more obvious what I am doing so some of the code might look a bit bloated.

'BioRuby in Anger' is maintained by Toshiaki Katayama and Pjotr Prins. If you have any suggestions, questions or comments contact the BioRuby mailing list.

These demos are tested with BioRuby 0.5.3 and Ruby 1.6.8 (Partly 1.8.1).

Alphabets and Symbols

In BioRuby, Sequence class inherits String so you can treat Sequence object as a String with various powerful methods implemented in Ruby's String class. You can easily generate DNA and/or Amino Acid sequences to edit, extract subsequence, regexp pattern match on it with usual methods for String object. Sequence class also has methods for splicing, translation, calculate stastical values, window search etc.

There are nothing equivarent to BioJava's Alphabet and/or Symbols in BioRuby, however, BioRuby provides lists of nucleic acids, amino acids and codon tables and use it transparently in appropreate methods?as needed.

How can I make an ambiguous Symbol like Y or R?

The IBU defines standard codes for symbols that are ambiguous such as Y to indicate C or T and R to indicate G or C or N to indicate any nucleotide. BioRuby represents these symbols as the same Bio::Sequence::NA object which can be easily converted to Regular expression that matches components of the ambiguous symbols. In turn, Bio::Sequence::NA object can contain symbols matching one or more component symbols that are valid members of the same alphabet as the Bio::Sequence::NA and are therefore capable of being ambiguous.

Generally an ambiguity symbol is converted to a Regexp object by calling the to_re method from the Bio::Sequence::NA that contains the symbol itself. You don't need to make symbol 'Y' by yourself because it is already built in the Bio::NucleicAcid class as a hash named Bio::NucleicAcid::Names.

Basic Sequence Manipulation

How do I make a Sequence from a String or make a Sequence Object back into a String?

A lot of the time we see sequence represented as a String of characters eg "atgccgtggcatcgaggcatatagc". It's a convenient method for viewing and succinctly representing a more complex biological polymer. BioRuby makes use of a Ruby's String class to represent these biological polymers as Objects. Unlike BioJava's SymbolList, BioRuby's Bio::Sequence inherits String and provide extra methods for the sequence manipulation. We don't have a container class like a BioJava's Sequence class, to store things like the name of the sequence and any features it might have, you can think of to use other container classes such as a Bio::FastaFormat, Bio::GFF, Bio::Features etc. for now (We have a plan to prepare a general container class for this to be compatible with a Sequence class in other Open Bio* projects).

Bio::Sequence class has same capability as a Ruby's String class, it is simple easy to use. You can represent a DNA sequence within the Bio::Sequence::NA class and a protein sequence within the Bio::Sequence::AA class. You can translate DNA sequence into protein sequence with a single method call and can concatenate them with the same method '+' as a String class's.

String to Sequence with comments

Yes, we should prepare a better container class for this. Temporally, you can do this as:

#!/usr/bin/env ruby
require 'bio'
# store a DNA sequence with the name dna_1 in a Bio::FastaFormat object
dna = Bio::Sequence::NA.new("atgctg").to_fasta("dna_1")
# store a RNA sequence with the name rna_1 in a Bio::FastaFormat object
rna = Bio::Sequence::NA.new("augcug").to_fasta("rna_1")
# store a Protein sequence with the name prot_1 in a Bio::FastaFormat object
prot = Bio::Sequence::AA.new("AFHS").to_fasta("prot_1")
# you can extract a name and a sequence stored in a Bio::FastaFormat object.
fasta_seq = Bio::FastaFormat.new(dna)
puts fasta_seq.entry_id # => "dna_1"
puts fasta_seq.naseq # => "atgctg"

Bio::Sequence to String

You don't need to call any method to convert a Bio::Sequence object to use as a String object because it can behave as a String, although you can call a to_s method to stringify explicitly.

How do I get a subsection of a Sequence?

Given a Sequence object we might only be interested in examining the first 10 bases or we might want to get a region between two points. You might also want to print a subsequence to a file or to STDOUT how could you do this?

BioRuby partly uses a biological coordinate system for identifying bases. You can use Bio::Sequence#subseq method to extract subsequence as the first base is numbered 1 and the last base index is equal to the length of the sequence. Other methods that are inherited from a String class use a normal String indexing which starts at 0 and proceedes to length -1. If you attempt to access a region outside of 1..length with a subseq method nil will be returned. Other methods in a String class will behave as a same.

How do I transcribe a DNA Sequence to a RNA Sequence?

In BioRuby, DNA and RNA sequences are stored in the same Bio::Sequence::NA class just using different Alphabets, you can convert from DNA to RNA or RNA to DNA using the rna or dna methods, respectively.

Sequences are immutable so how can I change it's name?

Sequences are not immutable in BioRuby - just use the freeze method to make sequence unchangable.

How can I edit a Sequence or SymbolList?

Sometimes you will want to modify the order of Symbols in a sequence. For example you may wish to delete some bases, insert some bases or overwrite some bases in a DNA sequence. BioRuby's Bio::Sequence object can be edited by any methods inherited from Ruby's String class.

How can I make a sequence motif into a regular expression?

In BioRuby, to make a Sequence into a Ruby's regular expression pattern is done by calling to_re method for the Bio::Sequence::NA object. The generated pattern can even be from an ambiguous sequence such as "acgytnwacrs". For the protein sequences, you can do this by just wrapping a Bio::Sequence::AA object with // as usual for the String to make it Regexp (or use Regexp.new). You can then use this pattern to search Strings for the existence of that pattern.

How do I translate a single codon to a single amino acid?

The general translation example shows how to use the translate method of Bio::Sequence::NA object but most of what goes on is hidden behind the convenience method. If you only want to translate a single codon into a single amino acid you get exposed to a bit more of the gory detail but you also get a chance to figure out more of what is going on under the hood.

#!/usr/bin/env ruby
require 'bio'
# make a 'codon'
codon = Bio::Sequence::NA.new("uug")
# you can translate the codon as described in the previous section.
puts codon.translate # => "L"

How do I use a non standard translation table?

The convenient translate method in Bio::Sequence::NA, used in the general translation example, is not limited to use the "Universal" translation table. The translate method also accepts a translation starting frame and a codon table number as its arguments.

This program automatically detects and reads FASTA format files given as its arguments.

#!/usr/bin/env ruby
require 'bio'
Bio::FlatFile.auto(ARGF) do |ff|
ff.each do |entry|
# do something on each fasta sequence entry
end
end

Similar but specify FASTA format explicitly.

#!/usr/bin/env ruby
require 'bio'
Bio::FlatFile.open(Bio::FastaFormat, ARGV[0]) do |ff|
ff.each do |entry|
# do something on each fasta sequence entry
end
end

How do I read a GenBank, EMBL or SwissProt file?

The Bio::FlatFile class contains methods for reading GenBank, SwissProt and EMBL files. Because any of these files can contain more than one sequence entry Bio::FlatFile can be used with a block to iterate through the individual sequences. One of the attractive features of this model is that the Sequences are only parsed and created as needed so very large collections of sequences can be handled with moderate resources.

Information in the file is stored in the Sequence (for now, this is one of the Bio::GenBank, Bio::EMBL or Bio::SwissProt classes, please wait for our future implementation of the generic Sequence class for this) as Bio::Features or where there is location information as Bio::Locations. Reading GenBank, EMBL or SwissProt

Save the following script as the readflat.rb and execute it with files in GenBank, EMBL or SwissProt format as arguments like:

ruby ./readflat.rb hoge*.seq

The file format is auto detected.

#!/usr/bin/env ruby
require 'bio'
# read the sequence entry by entry through the files listed in ARGV.
sequences = Bio::FlatFile.auto(ARGF)
sequences.each do |seq|
# do stuff with the sequence entry
puts seq.entry_id
end

How do I extract GenBank, EMBL or Swissprot sequences and write them as Fasta?

To perform this task we are going to extend the general reader from the previous demo and include in it the ability to write sequence data in fasta format.

#!/usr/bin/env ruby
require 'bio'
# read the sequence entry by entry through the files listed in ARGV.
entries = Bio::FlatFile.auto(ARGF)
# iterates on each entry to print the fasta formated string.
entries.each do |entry|
name = entry.entry_id
seq = entry.naseq # use aaseq method in the case of protein database
puts seq.to_fasta(name)
end

How do I turn an ABI sequence trace into a BioJava Sequence?

BioRuby doesn't support ABI trace at the moment...

Annotations

How do I list the Annotations in a Sequence?

To be done...

How do I filter a Sequences based on their species (or another Annotation property)?

To be done...

Locations and Features

How do I specify a PointLocation?

To be done...

How do I specify a RangeLocation?

To be done...

How do CircularLocations work?

To be done...

How can I make a Feature?

To be done...

How can I filter Features by type?

To be done...

How can I remove features?

To be done...

BLAST and FASTA

How do I set up a BLAST parser?

A frequent task in bioinformatics is the generation of BLAST search results.

Save the following script as a blastparse.rb and execute it with a file in BLAST XML output format (-m 7) like:

How can I generate a random sequence from a Distribution?

How can I find the amount of information or entropy in a Distribution?

To be done...

What is an easy way to tell if two Distributions have equal weights?

To be done...

How can I make an OrderNDistribution over a custom Alphabet?

To be done...

How can I write a Distribution as XML?

To be done...

Weight Matrices and Dynamic Programming

How do I use a WeightMatrix to find a motif?

To be done...

How do I make a HMMER like profile HMM?

To be done...

How do I set up a custom HMM?

To be done...

User Interfaces

How can I visualize Annotations and Features as a tree?

To be done...

How can I display a Sequence in a GUI?

To be done...

How do I display Sequence coordinates?

To be done...

How can I display features?

To be done...

OBDA

How do I set up BioSQL?

To be done...

Structure I/O and Manipulation

How do I read a structure file?

The current implementation only reads PDB format files, there will hopefully be code for parsing mmCIF and PDB XML files in the future. Adding these new formats will probably change the API for reading and writing structure files.

Given the above caveat, the current way to read a PDB file is to slurp the file into a string which is fed to the constructor of the Bio::PDB class:

How do I access annotations?

The annotations from the PDB file are stored in Bio::PDB::Record objects. You can retrieve a specific annotation by calling the '.record' method of a Bio::PDB object with the name of the annotation as the argument:

structure.record("HEADER")

In fact '.record' returns an array of all Bio::PDB::Records of the given type, so you'll need to call '.first' to get to the actual Bio::PDB::Record. Bio::PDB::Record provides methods to get to the individual data in each record. E.g. The 'HEADER' record provides classification, depDate and idCode methods. You can look at the PDB format to see what fields are available in each record, or just look in the pdb.rb file which has a hash of all the definitions.

So, to get the title of a PDB file you would do it like this:

structure.record('TITLE').first.title

Some records have their own special methods:

structure.remark(num) #Returns hash of the REMARK records based on 'num'
structure.jrnl #Returns a hash of the JRNL records
structure.jrnl('TITL') #Returns an array of all the TITL subfields from
#the JRNL records
structure.helix(id) #Returns the HELIX records based on 'id'
#Returns an array if no 'id' is given
structure.turn #Same interface as '.helix'
structure.sheet #Same interface as '.sheet'
structure.seqres(id) #Returns the sequence given in the SEQRES record
#as a Bio::Sequence::AA object
structure.keywords #returns a flattened lsit of keywords

How do I access the coordinate data?

Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the top. A Bio::PDB object contains one or more Bio::PDB::Model objects which contain one or more Bio::PDB::Chain objects which contain one or more Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.

There are two ways to access the coordinate data in a Bio::PDB object: keyed access and iterators.

Keyed access provides the simplest way to get to data if you know which residues or atoms you're interested in. For instance if you wanted to get hold of chain 'A' you can do it like this:

chain = structure[nil]['A']

The nil refers to the model, which will be nil in crystal structures but may have a number in NMR structures. Every level in the heirarchy can be accessed this way. E.g. to get hold of the CA atom of residue 100 in chain 'A':

structure[nil]['A']['100']['CA']

or if you still have your chain object:

chain['100']['CA']

The residue identifier is not just the residue number. PDB residues can have insertion codes and even negative numbers.

chain['100A']['CA']

could be valid.

Iterators are also provided so you can easily go through all objects in the heirarchy. E.g:

Bio::PDB::Utils.distance requires two Vectors as its arguments. Bio::PDB::Coordinate objects, which are produced by the .xyz call to the Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,
[10,10,10])

And use the .centreOfGravity and .geometricCentre methods, both of which provide a Bio::PDB::Coordinate:

How do I find specific atoms, residues, chains or models?

If the iterators and keyed access aren't powerful enough for you, the finder method is also provided. All the objects in the Bio::PDB hierarchy implement finder. It requires the class of object you wish to find as the first argument, and then a block which is evaluated for each object. If the block returns true then that object is added to an array of 'found' things.

Selection 3 now contains all the HIS atoms within the box and selection 1 contains all the HIS atoms outside the box.

How do I work with ligand data?

Because some PDB files reuse residue numbers that they already used in the ATOM records in the HETATM records, we have to add an extra flag when we want to access HETATM data. The extra flag is simply adding 'LIGAND' to the residue id. E.g Given the PDB file

There should also be an .each_ligand method in the Bio::PDB::Chain class for iterating through the ligands, but this is not implemented in the current cvs version.

How do I work with solvent?

Solvent is defined in the current parser as any HETATM record with HOH as the residue type. This may not be ideal for some files, but deals with 95% of cases which is good enough for me at the moment!

Solvent residues are stored in a separate Bio::PDB::Chain attached to each Bio::PDB::Model. Currently the only methods available are 'addSolvent' which adds a residue to the solvent chain and 'removeSolvent', which sets the whole solvent chain to nil.

An each_solvent method for Bio::PDB::Model has been implemented in my copy but is not in cvs at the moment.

How do I get the sequence of a chain?

There are two types of sequence in a PDB file: the real sequence of whatever protein has been crystalised, and the sequence observable in the structure.

The first type of sequence is available through the SEQRES record (if this is provided in the PDB file). You can access this through the top-level Bio::PDB object:

structure.seqres['A']

provides the sequence given in the SEQRES record for chain A. The observed sequence is obtained from the Bio::PDB::Chain object. E.g.

structure[nil]['A'].atom_seq

Both methods return a Bio::Sequence::AA object.

Disclaimer

This code is generously donated by people who probably have better things to do. Where possible we test it but errors may have crept in. As such, all code and advice here in has no warranty or guarantee of any sort. You didn't pay for it and if you use it we are not responsible for anything that goes wrong. Be a good programmer and test it yourself before unleashing it on your corporate database.

Copyright

The documentation on this site is the property of the people who contributed it. If you wish to use it in a publication please make a request through the BioRuby mailing list.The original version was based on the 'BioJava in Anger' document by Mark Schreider et al.

The code is open-source. A good definition of open-source can be found here. If you agree with that definition then you can use it.