Version-1.6.0 Release Notes, October 2005

The main changes in this release, in addition to the usual round of
bug fixes, are the support in Gap4 for very large databases, for
traces from the "454" instrument and for a new SNP Candidates plot.
In addition to these, substantial speed improvements have been made.

64-bit file size support

With databases getting ever bigger we finally reached the point where
the gap4 database files needed to be larger than 2Gb in filesize. This
posed a problem for the old 32-bit file offset code (mandated by the
format of the .aux file) so we have made Gap4 64-bit aware.

Firstly note that Gap4 still uses 32-bit file sizes by default and so
is fully backwards compatible. To make Gap4 create a 64-bit database,
specify the "-bitsize 64" command line option. Gap4 will
automatically detect the format on subsequent opens, so this option
only needs to be specified at creation time. However note that 64-bit
databases are incompatible with the old 32-bit ones and so cannot be
read by older versions of Gap4. Fortunately copy_db has also been
modified to support conversion from 32-bit to 64-bit and vice versa
(assuming the database isn't too large to fit in 32 bits).
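
As a rough illustration of where the 2Gb ceiling comes from: a signed
32-bit file offset cannot represent positions at or beyond 2^31 bytes.
The sketch below uses Python's struct module purely to show the
arithmetic. It assumes signed offsets (which is consistent with the
2Gb limit) and is not the actual .aux record layout; the function
names are hypothetical.

```python
import struct

# Largest position a signed 32-bit offset can hold (just under 2Gb).
MAX_32BIT_OFFSET = 2**31 - 1


def fits_in_32bit(offset):
    """Return True if offset can be stored as a signed 32-bit integer."""
    try:
        struct.pack("<i", offset)
        return True
    except struct.error:
        return False


def pack_offset(offset, bitsize):
    """Pack an offset as 32 or 64 bits (illustrative encoding only)."""
    fmt = {32: "<i", 64: "<q"}[bitsize]
    return struct.pack(fmt, offset)
```

A 6Gb position such as 6 * 2**30 fails fits_in_32bit() but packs
without error when bitsize is 64, which is the kind of check a
64-bit to 32-bit conversion would need to make before proceeding.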

Dealing with such massive databases also exposed a number of areas
where Gap4 was too slow. We've made considerable improvements in
many places, and these changes should speed up normal 32-bit usage
too.

The SNP Candidates plot

This new plot has been designed to graphically show the locations
where a consensus column has a strong chance of being made up of
multiple sequences, eg due to it being a SNP or a variance in a
collapsed repeat.
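
To make the idea concrete, here is a minimal sketch of flagging
candidate columns in a set of aligned reads. It uses a simple
frequency rule (the second most common base must account for a
minimum fraction of the depth). This is an assumption made for
illustration only; it is not the scoring method used by the SNP
Candidates plot, and the function name, thresholds and the use of a
space for "no coverage" are all hypothetical.

```python
from collections import Counter

UNCOVERED = " "  # assumed marker for positions a read does not cover


def candidate_columns(reads, min_depth=4, min_minor_fraction=0.25):
    """Return column indices where a second base is well supported.

    reads: list of equal-length aligned strings.
    """
    candidates = []
    for col in range(len(reads[0])):
        counts = Counter(r[col] for r in reads if r[col] != UNCOVERED)
        depth = sum(counts.values())
        if depth < min_depth or len(counts) < 2:
            continue  # too shallow, or every read agrees
        (_, _major), (_, minor) = counts.most_common(2)
        if minor / depth >= min_minor_fraction:
            candidates.append(col)
    return candidates
```

With four reads split evenly between two bases at one column, that
column is reported; a single disagreeing read among eight falls below
the default threshold and is treated as a likely sequencing error.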

Stage 2 of this plot is to cluster the differences to attempt to pull
apart the sequences into sets where each set has no or few internal
differences. These sets may then be viewed directly in the contig
editor by colour coding and sorting the sequences. The editor also has
manual control over moving sequences from one set to another, but
currently saving of edits made to this data is not supported.

Finally the consensus sequence for each set may be saved or
alternatively the sets may be split apart to form multiple contigs.

The 454 sequencing machine

This new instrument by 454 Life Sciences is based on pyrosequencing
techniques and so produces a very different style of trace to
traditional Sanger sequencing. The 454 machine SFF format is now
natively supported as is a prototype for the TVF archive (although
neither of these formats has been finalised yet and so both may be
subject to change). Trev and Gap4 support viewing of "454 flowgrams".

The new flowgrams are substantially smaller than traditional traces
which gives rise to a number of interesting issues with file
management. To compensate for this there are now a variety of tools
for packing multiple small traces into a single large archive. This
can be either a unix tar file (with a faster indexing method than
before) or a new ZTA (ZTR Archive) file. See the io_lib package for
more details and related tools.

We are aware that there are a number of issues still to address when
combining pyrosequencing data with traditional Sanger sequencing
data. There is a new replacement for Shuffle Pads which greatly
improves alignments of 454 data produced by Gap4 or Phrap (but isn't
needed if you use 454's own assembler), but there are still issues to
resolve with the use of confidence values. We expect to be
concentrating on this in the coming months.

Gap4

64-bit database size support. Gap4 databases may now grow beyond the
2Gb limit (we've tested 6Gb and in theory it should handle
terabytes, albeit slower). Old 32-bit databases are still supported
and are still the default output format, for backwards
compatibility. Specify "-bitsize 64" on the command line to force
Gap4 to create 64-bit databases.

A new (experimental) plot labelled SNP Candidates. It identifies and
plots likely locations where multiple sequences may have been
assembled together (either SNPs or collapsed repeats) and provides
tools to pull these apart.

The contig editor now has the notion of "sets". A set is basically a
sub-group of a contig, which is useful for exploring, within a
contig, the outcome of pulling apart a collapsed repeat without
actually doing it. The editor is automatically invoked in this manner
by the SNP Candidates output. This is largely experimental, and there
are a number of potential ways to improve interaction. Ideas on a
postcard please!

The editor's Shuffle Pads algorithm has been moved to the main gap4
menu (so it can be scripted non-interactively) and it now does a
substantially better job too (as it is based on the same techniques
used in ReAligner).

Lots of speedups (primarily for the huge 64-bit databases):

Drawing of some consistency graphs;

Experiment file reading on large files (1000 fold on a 1Mb
sequence);

Dealing with lists of contigs (eg "all contigs") is now O(N)
complexity instead of O(N^2). (On databases with 50K contigs and
2.6 million reads, some operations that used to take hours now take
a couple of seconds);

(De)selecting contigs in the contig selector is now much much
faster.
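
As a generic illustration of the O(N^2) to O(N) change mentioned in
the list above (not the actual Gap4 code, whose details are not given
here): looking up each requested contig by scanning a list costs O(N)
per lookup, whereas building a name-to-position table once makes each
lookup constant time. The function names are hypothetical.

```python
def resolve_quadratic(requested, all_contigs):
    """Find each contig's position by scanning the list every time.

    list.index is O(N), so resolving N names is O(N^2) overall.
    """
    return [all_contigs.index(name) for name in requested]


def resolve_linear(requested, all_contigs):
    """Build a name -> position table once, then do O(1) lookups.

    Total work is O(N) for N names (assuming contig names are unique).
    """
    position = {name: i for i, name in enumerate(all_contigs)}
    return [position[name] for name in requested]
```

Both functions return the same result; only the amount of work
differs, and with tens of thousands of contigs the difference
becomes very noticeable.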

Substantial modifications to the gap4 I/O mechanisms (balancing the
"freetree"). The upshot of this is that the worst-case performance
is now dramatically improved and the average performance will be
slightly better too. Complementing a 10Mb contig dropped from about
50 minutes to 1.5 minutes. Disassembling all readings (approx
175000) in this database went from ~3 hours to 2.5 minutes.

Substantial speed up of the contig selector with large databases.

The confidence plot is now substantially faster to plot (typically
10-20 fold).

The contig editor Report Mutations now has a confidence scale for
the "by difference" option.

Dragging a selection in the contig editor now auto-scrolls the
display as appropriate.

Highlight Disagreements now uses ":" to display bases which differ
from the contig but have a low quality value.

Trev

Now shows pyrosequencing traces, eg those from 454.

A new "-trace_scale" option may be used to force a particular Y
scaling instead of auto-scaling. This can be useful when comparing
absolute trace heights across multiple files.

Copy_db

Added a -b option to switch between 32 and 64-bit database sizes.

Io_lib

Substantial speed ups, particularly when dealing with gzipped files
or when extracting data from tar files. This is primarily due to
performing decoding in memory instead of from the disk structures,
but experiment file I/O has had additional (and substantial)
improvements when dealing with very long files (approx 1000 fold
speed increase for reading a 1Mb sequence on the Alpha).

*INCOMPATIBILITY*

The Exp_info structure now has an "mFILE *fp" member instead of
"FILE *fp".

Some functions are no longer external.
These include many ctf functions, ztr_(de)compress,
ztr_chunk_(read/write), be_read_* and be_write_*.

The default search order for RAWDATA has changed: the current
directory is now searched after the rest of RAWDATA instead of
before.

Removed support for the old unix "pack" program as a
compression tool.

Preliminary support for 454 flowgrams, including the SFF format.

Added support for hash indexing of tar files or creation of "solid"
archives. This allows traces to be packed into a single archive with
a fast index for extraction. The new programs hash_tar and
hash_extract replace the old index_tar program.
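
The general idea of indexing a tar file can be sketched as follows.
This is an illustration using Python's standard tarfile module, not
the on-disk index format written by hash_tar or read by hash_extract;
an in-memory dictionary stands in for the hash index and the helper
names are hypothetical.

```python
import io
import tarfile


def make_demo_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(tar_bytes):
    """Scan the archive once, mapping member name -> (data offset, size)."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def extract(tar_bytes, index, name):
    """Fetch one member directly via the index, without rescanning."""
    offset, size = index[name]
    return tar_bytes[offset:offset + size]
```

Once the index exists, fetching a single trace is one lookup plus one
read at a known offset, rather than a linear walk through every
header in the archive.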

Reenabled gzip compression on Windows.

Spin

Under MS Windows spin and EMBOSSwin should coexist happily to the
extent that Spin will spot the EMBOSSwin installation and use it.

Misc

Removed some redundant old ABI utilities: getABIfloat, getABIhex,
getABIraw, getABIstring. All of these can be achieved via
getABIfield.