Not Logged In

SeqFindr 0.35.0

SeqFindr - easily create informative genomic feature plots. It’s a
bioinfomagicians tool to detect the presence or absence of genomic features
given a database describing these features & a set of draft and/or complete
genomes. We work with bacterial genomes & as such SeqFindr has only been
tested with bacterial genomes.

I am on vacation from 08/12/14 -> 06/01/15. User support will not happen
in this period. Stupidly I’ve done some releases the day/night before I
leave.

18/11/14: Version 0.4.0 now has new option –remove_empty_cols. It will
strip out entire columns where no hits were detected.

28/07/14: Fixed a bug where axes were shifted when using newer versions
of matplotlib.

Important: Were you using a specific SeqFindr version as a dependency
for you project and it has disappeared from PyPI?

We recently activated a name change of SeqFind*R* to SeqFind*r*. This was to
avoid potential users believing this was a R package. Unfortunately, PyPI
while aware that SeqFindR and SeqFindr were different packages did not like
the potential confusion. As a consequence the only resolution was to delete
SeqFind*R* completely (and losing all PyPI published releases) and registering
SeqFind*r* and starting fresh. All previous 10 releases, while not available
on PyPi are still available on GitHub. If you require a previous release you
can actually do something like this (SeqFindr v0.26):

pip install -e git://github.com/mscook/SeqFindr.git@v0.26

Version 0.31.1 released on 10 July 2014.

We are now testing SeqFindr builds on both Linux & MacOSX systems.

Best use “git log” for a changelog as the changelog for most recent
changes/fixes/enhancements may not be up to date.

Citation

Cite this Github repository if you use SeqFindr to generate figures
for publications:

These libraries will also have dependencies (i.e. atlas, lapack, fortran
compilers, freetype and png). These most likely won’t be installed on
your computer. Please install these before attempting the installation.

Linux (Ubuntu)

SeqFindr uses 3rd party packages that are extremely important for scientific
computing but are notoriously difficult to install. While pip install *
*–user SeqFindr may work we recommend you install these 3rd party packages
using apt-get.

We use the –user option of pip to put SeqFindr in: /home/$USER/.local/bin/
You need to add this location to you ~/.bash_profile.

Add SeqFindr to your path:

$ echo 'export PATH=$PATH:/home/$USER/.local/bin/' >> ~/.bash_profile

Finally install BLAST+:

$ sudo apt-get install ncbi-blast+

Test it:

Run:

$ SeqFindr -h
$ python -c 'import SeqFindr; print SeqFindr'

MacOSX (Mavericks)

You’ll need to have the equivalents of python-dev libatlas-dev liblapack-dev
gfortran libfreetype6-dev libfreetype6 & libpng-dev installed. We had no
problems installing SeqFindr on a recently acquired OSX Mavericks machine
using the homebrew package manager.

Example figure produced by SeqFindr

SeqFindr database files

The SeqFindr database is in multi-fasta format. The header needs to be
formatted with 4 comma separated elements. We concede that inventing
another file format is annoying, but, future versions of SeqFindr will
exploit this information.

The elements headers are:

identifier,

common name (this is taken as the gene label in the plot),

description and

species

The final element, separated by [] contains a classification. This
information is used by SeqFindr to draw different coloured blocks.

The -c (–class_file) option is very useful. Suppose you want to annotate your
sequences of interest with user defined classification values. Simply develop a
file containing the scheme as pass using the -c option (3rd example above).
A sample file for the situation where you had 7 input sequences with the first
3 Fe transporters, the next two Toxins, the next a Misc and the final
sequence is a Toxin would look like this:

Fe transporter
Fe transporter
Fe transporter
Toxin
Toxin
Misc
Toxin

How does SeqFindr determine positive hits

We use the following calculation:

hsp.identities/float(record.query_length) >= tol

Where:

hsp.identities is number of identities in the high-scoring pairs between
the query (database entry) and subject (contig/scaffold/mapping
consensus),

record.query_length is the length of the database entry and,

tol is the cutoff threshold to accept a hit (0.95 default)

For a database entry of 200 bp you can have up to 10 mismatches/gaps without
being penalised.

Why not just use max identity?

Eliminate effects of scaffolding characters/gaps,

Handle poor coverage etc. in mapping consensuses where N characters/gaps
may be introduced

What problems may this approach cause? I’m still looking into it…

Fine grain configuration

SeqFindr can read a configuration file. At the moment you can only redefine
the category colors (suppose you want to use a set of fixed colors instead of
the default randomly generated). The configuration file is expected to expand
in the future.

For example the first row of colors in RGB is:
(51,102,255), (102,51,255), (204,51,255), (255,51,204)

Short PCR primers

In some cases you may want to screen using PCR primers. Please use the –short
option. Here we adjust BLASTn parameters wordsize = 7 & Expect Value = 1000

Tutorial

We provide a script to run all the examples. Note: We have changed the
color generation code. As a consequence the background colors will be
different when running this yourself. The results will not change.

Navigate to the SeqFindr/example directory (from git clone). The following files should be present:

Note: the ordering file is defined using the option –index_file. The
ordering file must contain the same number of strains as the assemblies
directory and the strain names must agree (TODO - add a script to flag issues).

How to generate mapping consensus data

We strongly recommend that you use mapping consensus data. It minimises
the effects of missassembly and collapsed repeats.

We use Nesoni. We use the database file (in multi-fasta format) as the
reference for mapping. Nesoni has no issues with multifasta files as
references (BWA will treat them as separate chromosomes).
The workflow is something like this:

For those of you using a cluster running PBSPro see:
https://github.com/mscook/SeqFindr_nesoni
This is a script that generates a job array, submits and cleans up the
mapping results ready for input to SeqFindr.

The output from the described workflow and SeqFindr_nesoni is a consensus.fa
file which we term the mapping consensus. This file is a multi-fasta file of
the consensus base calls relative to the database sequences.

Caveats:

you will probably want to allow multi-mapping reads (giving –monogamous
no –random yes to nesoni consensus) (this is default for
SeqFindr_nesoni),

The (poor) alignment of reads at the start and the end of the database
genes can result in N base calls. This can result in downstream false
negatives.

SeqFindr now provides a solution to minimise the effects of poor mapping at
the start and end of the given sequences.

By default this strips the 1st and last 10 bases from the mapping consensuses.
We have had good results with this value. Feel free to experiment with
different values (say, -s 0, -s 5, -s 10, -s 15). Please see image-compare
a script we developed to compare the effects of different values of -s on the
resultant figures.