The Newbler assembler and mapper (gsAssembler, gsMapper) were developed especially for working with reads from the Roche/454 Life Sciences sequencing technology. Newbler is one of the best programs to deal with this type of data, scoring well in the Assemblathon 2 competition. It has been used for many large and small genome assemblies (numerous bacteria, Atlantic cod, bonobo, tomato, to name a few). Recently, Newbler has added support for using multiple sequencing technologies, making it one of the few hybrid assembly programs available. At the Advances in Genome Biology and Technology (AGBT) meeting in 2013, Roche announced having used the Newbler program with a hybrid 454 and Illumina dataset to improve upon the human genome.

However, the Newbler program is not open source. Luckily, researchers only need to fill out an online form to get a free copy of the software. Still, this has hampered the widespread adoption of this program. Newbler, for example, was not included in assembly evaluations like GAGE and GAGE-B. That Roche/454 does not want to make the source code for Newbler available is partly understandable from a commercial standpoint: at least one competitor technology (Life Tech/Ion Torrent) with a similar sequencing error model could benefit from access to the code. In fact, in a blog post, I showed Newbler to be superior to an open-source program when assembling Ion Torrent mate-pair data.

More worrying is that the hundreds of projects that used Newbler as part of the analysis are fundamentally irreproducible without the source code for each of the different versions. This is especially the case for projects, such as the Atlantic cod genome project, that have been given access to development versions of the code, incorporating elements not available to the general community.

Last October, Roche announced it will shut down its 454 sequencing business in mid-2016. Whatever one may feel about this decision, this further strengthens the argument for Roche/454 to make the Newbler source code open source. After the 454 shutdown, Newbler is otherwise likely to disappear too, meaning that large swathes of the literature cannot be recapitulated from the raw data. Also, long after the 454 shutdown, many researchers will have to process their 454 sequencing data, and many may still want to rely on Newbler for that purpose.

There are several other reasons why I feel the research community should be given access to the source code of Newbler. Newbler represents a very valuable contribution to the field of genome assembly and mapping. Software developers can learn from the algorithms and implementations of the Newbler code, opening up for reusing these in other programs. Also, there is the hope that developers will improve upon the program, for example by adding support for other sequencing technologies, or assembling with reads longer than the current maximum of 2 kbp.

So I hereby ask the readers of this blog for help: I have set up an online petition asking for Roche/454 to make the Newbler source code available at the latest at the time of the 454 shutdown. Please sign the petition here. Additionally, spread the word (e.g., on twitter or your own blog). Thanks in advance!

I intend to hand over the results of the petition to a Roche representative at the Advances in Genome Biology and Technology (AGBT) meeting (February 12-15, 2014).

One unfortunate drawback of working with Illumina sequences is the many changes to the format of their fastq read files. The quality scoring has been changed several times since the first Solexa reads became available. It appears they have now settled on the Sanger style, see this Wikipedia entry.

(Source: thepoolandspashoponline.com.au)

Regrettably, with their latest software upgrade (Casava 1.8), the headers (sequence identifiers) in the fastq files have changed. The change is described in the aforementioned Wikipedia entry; basically, some elements have been added, some have changed order, and there are now two parts separated by a space.

I wouldn’t have written this blogpost if this change had not been relevant for newbler: we were lucky enough to enjoy direct reading of Illumina fastq files (with newbler determining the quality scoring type) starting with newbler 2.6. newbler also matches mate-pairs (Illumina read 1 and read 2), so that these can be used as paired-ends by newbler (to build scaffolds). By the way, FASTQ files from the NCBI/EBI Sequence Read Archive are also correctly parsed for mate pairs, but here the filename is used for determining read 1 and read 2.

The new Illumina fastq header (from Casava 1.8 and beyond) still allows direct reading of the fastq files by newbler, but, with the change in the header format, the pairing information is no longer understood. These reads are therefore used as shotgun reads instead.

When I asked 454 Life Sciences about this, they confirmed newbler 2.6’s behaviour on the new Illumina fastq headers, and came with a helpful tip on how to solve this, while we await a new newbler version that fixes this problem. The solution unfortunately requires you to make a copy of the fastq file, with the old-style header. For this, you can use your favorite bioinformatics command or language, but here I use an awk command. It adjusts the ‘@’ header line, but leaves the ‘+’ header line blank (potentially saving some disk space):

Note that the original header is added at the end in brackets. If you do not want/need that, simply remove the space and the ‘(%s)’ just before the ‘\n’ from the command. Also, the flowcell ID is added to the instrument name.
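For those who prefer Python over awk, here is a minimal sketch of the same kind of conversion. It is not the awk command referred to above, and it rests on a few assumptions of mine: that the new-style header has seven colon-separated fields before the space and four after it (as described in the Wikipedia entry), that the flowcell ID is joined to the instrument name with an underscore, and that each fastq record spans exactly four lines. The function names are made up for this example.

```python
def to_old_style_header(header_line):
    """Turn one Casava 1.8+ '@' header line into an old-style header.

    Assumed input layout (from the Wikipedia FASTQ entry):
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    original = header_line.rstrip("\n")[1:]          # drop the leading '@'
    name, _, comment = original.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, is_filtered, control, index = comment.split(":")
    # Assumption: flowcell ID is appended to the instrument name with '_'.
    # The original header is kept at the end, in brackets.
    return "@{}_{}:{}:{}:{}:{}#{}/{} ({})".format(
        instrument, flowcell, lane, tile, x, y, index, read, original)


def convert_fastq(lines):
    """Yield converted lines; assumes strict 4-line fastq records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield to_old_style_header(line)   # the '@' header line
        elif i % 4 == 2:
            yield "+"                         # leave the '+' line blank
        else:
            yield line                        # sequence or quality line
```

To use it, open the new-style fastq file, pass the file handle to convert_fastq() and write each yielded line (plus a newline) to a new file. As with the awk command, test it on a small file first.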

How to know whether newbler accepted your reads as pairs? During the initial ‘Indexing’ step (parsing of the read files), newbler will report for fastq files (on the screen and in the 454NewblerProgress.txt file):

In addition, after assembly the estimated library insert size will be reported in the 454NewblerMetrics.txt file for fastq files with paired reads.

Note that I couldn’t do any extensive testing on the awk command due to a lack of new Illumina fastq files to try. So, use the command at your own risk, and if you find problems please let me know through the comments!

In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. With this post, I will explain these in detail.

454ContigScaffolds.txt and its relation to the 454Scaffolds.txt file
Both these files are in the AGP format, see my earlier post on the 454Scaffolds.txt file. The examples for this post are based on a bacterial genome data set (shotgun and paired end 454 reads), assembled using the -scaffold flag (and newbler 2.6).
The 454Scaffolds.txt file looks different from the one from an assembly without the -scaffold flag:

Instead of ‘contigXXXXX’ in the 6th column, there are sctg_XXXX_YYYY. ‘sctg’ stands for ‘ScaffoldContig’, see below. ‘sctg_0001’ stands for scaffold 1, while the following ‘_0001’ stands for the first contig in this scaffold. So, the 20th contig in scaffold 13 would be sctg_0013_0020. The 454ContigScaffolds.txt file is one line per contig followed by one line for a gap.
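If you want to process these names in a script, the naming scheme is easy to take apart. Below is a small Python sketch, with a function name I made up for the occasion, that splits a ScaffoldContig name into its scaffold number and its position within the scaffold. It only assumes the sctg_XXXX_YYYY pattern described above.

```python
import re

# Pattern for names such as sctg_0013_0020 (scaffold 13, 20th contig).
_SCTG_NAME = re.compile(r"^sctg_(\d+)_(\d+)$")


def parse_scaffold_contig(name):
    """Return (scaffold number, contig position) for a sctg_XXXX_YYYY name."""
    match = _SCTG_NAME.match(name)
    if match is None:
        raise ValueError("not a ScaffoldContig name: {!r}".format(name))
    return int(match.group(1)), int(match.group(2))
```

For example, parse_scaffold_contig("sctg_0013_0020") gives the scaffold number 13 and the contig position 20.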

In the new 454ContigScaffolds.txt file, the corresponding region of scaffold 1 looks like this:

This shows how the -scaffold option works: repeat contigs are placed in gaps, and so-called ‘ScaffoldContigs’ are formed by concatenating the contigs that are now next to each other without gaps in between. The 454ContigScaffolds.txt file shows which contigs are placed where, while the 454Scaffolds.txt file shows the scaffolds as they are built up out of ScaffoldContigs.

If we now add the per-contig depth (from the 454ContigGraph.txt file) to the contigs that make up the ScaffoldContigs, we get:

So, we have a long, 26 kb contig of ‘normal’ depth (40x), followed by four short contigs of quite high depth (203-352x), and after that again one long contig of almost 32 kb of ‘normal’ depth. This looks like four repeat contigs in between long single-copy contigs. Finally, there is a 1.9 kb contig of somewhat lower depth, which I cannot really explain…

Here, there are four long contigs, 4 kb, 0.6 kb, 46.5 kb and 13 kb, of ‘normal’ depth (34-42x), with shorter contigs in between, most of them with high depth (75-204x). Unsurprisingly, a quick blast identified contigs 13 and 15 as being part of putative transposases, proteins known to be present in multiple copies in bacterial genomes…

454ScaffoldContigs.fna and .qual files
These simply list the sequences of the ScaffoldContigs as listed in the 454Scaffolds.txt file.

In conclusion, 454 has tried to offer more complete scaffolds by placing repeats in gaps where possible.

The latest version of newbler, version 2.6, has some welcome additions for input and output. As I have so far only treated de novo assembly, I will skip the updates on the gsMapper (except for mentioning that it is now able to provide a bam file using the -bam option).

FASTQ files support
Newbler could already use sff files (including those from the IonTorrent, by the way!), and fasta/qual files (e.g. Sanger reads). Now, fastq files (a much used format for Next Generation Sequencing, see http://en.wikipedia.org/wiki/Fastq) are also supported. In principle, one can now use any fastq file, including those downloaded from the NCBI Short Read Archive, and import it directly into newbler. Newbler should recognize the quality scoring version, and which reads are paired up (read1 and read2) based on the header text. I did a quick test on one such fastq file, and it seems to work.

Gap filling with repeat contigs
Some contigs, usually with high depth, represent collapsed repeats. These account for many of the gaps in scaffolds. With the new -scaffold flag, you can now ask newbler to place a copy of the repeat in the gap it forms, effectively closing the gap. This potentially leads to much more complete assemblies. Note, however, that a contig from collapsed repeats is the consensus sequence from all occurrences of the repeat. Newbler places an instance of this contig as it is, so if the actual repeat instances in the original genome have sequence variation, it introduces errors in the scaffolds. This you will have to take into account. Two new output files are generated when the -scaffold flag is set, 454ContigScaffolds.txt and 454ScaffoldContigs.fna/qual, which I will describe in the next blog post.

Edge information
New sections have been added to the 454NewblerMetrics.txt file for assemblies that produce scaffolds, called ‘largeContigEndMetrics’, ‘scaffoldGapMetrics’ and ‘scaffoldEndMetrics’. An ‘edge’ represents reads that exit a contig or scaffold and enter another one. These metrics report, for contigs and scaffolds, the number and percentage that have ‘NoEdges’, ‘OneEdge’, ‘TwoEdges’, or ‘ManyEdges’. For gaps in scaffolds, those that have ‘BothNoEdges’, ‘OneNoEdges’, ‘BothOneEdge’ and ‘MultiEdges’ are reported. Although there is little documentation about this, I understand that if there are many contigs (or scaffolds or gaps) with no edges, this could indicate that there are not enough reads (too low coverage) to bridge them.

Increased assembly of splice variants in cDNA assembly
The new -isplit option will look for depth spikes in the read alignment for transcriptome datasets. Such a spike could be the result of a specific splice variant. Setting this flag results in using these spikes for starting the generation of isotigs, potentially resulting in more isotigs.

One of you asked in the comments: “Is there an existing way of converting the 454NewblerMetrics.txt file to a tab-delimited file?”

I have in fact written a script for that. We use it all the time in our group for newbler assemblies, and I am hereby sharing it with you. The perl script, called newblermetrics.pl, needs to be given a 454NewblerMetrics.txt file from a newbler assembly. It works both on shotgun assemblies, with or without paired end data, and on cDNA assemblies (for which it includes the isogroups and isotigs metrics in the output). It will not work on mapping projects (gsMapper/runmapping commands).

The script produces an output like this:

Input
Number of reads 975240
Number of bases 275262092
Number of reads trimmed 1195883 122.6%
Number of bases trimmed 256085747 93.0%
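For readers who would rather roll their own, the general idea can be sketched in a few lines of Python. This is not the newblermetrics.pl script, and it does not produce the same output. It assumes that the metrics file consists of named sections delimited by curly braces on their own lines, with ‘name = value;’ lines inside them and C-style comment lines in between, which is how the files I have seen look. The function name is my own.

```python
def metrics_to_rows(text):
    """Flatten a 454NewblerMetrics.txt-like text into (section, key, value) rows.

    Assumed layout: section names and braces on their own lines,
    'key = value;' lines inside sections, comment lines starting
    with '/*', '**' or '*/'.
    """
    rows = []
    stack = []        # names of the sections we are currently inside
    pending = None    # last bare name seen, used when the next '{' opens
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("/*", "**", "*/")):
            continue                                  # blank or comment line
        if line == "{":
            stack.append(pending or "")
            pending = None
        elif line == "}":
            if stack:
                stack.pop()
        elif "=" in line:
            key, _, value = line.partition("=")
            value = value.strip().rstrip(";").strip('"')
            rows.append((".".join(stack), key.strip(), value))
        else:
            pending = line                            # a section name
    return rows


def rows_to_tsv(rows):
    """Join the rows into tab-delimited text, one metric per line."""
    return "\n".join("\t".join(row) for row in rows)
```

Reading the file with open(...).read(), passing the text to metrics_to_rows() and printing rows_to_tsv() of the result gives one tab-delimited line per metric, which is easy to load into a spreadsheet.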

Sometimes you might observe very short contigs, some even having high read depth. You might see these for example when
– you choose ‘-a 1’ (or ‘-a 0’) as a setting during the assembly, forcing newbler to output all contigs of whatever length (normally the lower limit is 100 bp)
– you run an assembly using the cDNA option, here the lower limit is set to 1
– you use the 454ContigGraph.txt file, in which all contigs of whatever length are listed

The -minlen option requires by default a minimum length of 50 (20 when paired reads are part of the dataset), and the default minimum overlap between reads is 40 bases, so how are contigs so short possible at all?

There appear to be several reasons for these contigs (the information below was kindly provided by the newbler developers; disclaimer: I might have misunderstood them… ):

– microsatellites are very short repeats that the alignment loops through, causing a very short (2bp, 3bp, 4bp) alignment with ultra-high depth.
– very deep alignments (with lots of reads) can cause shattering, caused by accumulation of enough variation to break the alignment into pieces, some of which may be very short
– at the end of contigs, variations in the (light) signal distributions of homopolymers can also cause small contigs ‘breaking off’

Another very strange type of contig is one that mentions ‘numreads=1’ in the fasta header. How can one single read become a contig? It should be labelled a singleton, right? Well, these ‘contigs’ can be explained also…
A multiple read alignment grows when reads are added to it. After such an addition, checks are run on the alignment. Addition of new reads may actually result in an alignment being broken; in some cases a part is taken out and placed in its own alignment. During the detangling phase, reads may also be removed from a set of aligned reads. For these parts taken out of alignments, this may mean that only a single read is left in the alignment. Newbler then keeps this read as a contig (perhaps they should remove these instead, but who am I to complain…).

A singleton read is a read that did not show any significant overlap (by default, a 40 bp window of at least 90% similarity) with any other reads. These ‘numreads=1’ contigs are not singletons as they (or part of them) actually had sufficient overlap for them to have been part of an alignment.
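If you want to know how many of these single-read contigs your assembly contains, you can pick them out of the contig fasta file by looking at the headers. The Python sketch below does that. It only relies on the ‘numreads=’ field in the header mentioned above, and the function name is mine.

```python
import re

_NUMREADS = re.compile(r"numreads=(\d+)")


def single_read_contigs(fasta_lines):
    """Return the names of contigs whose fasta header says numreads=1."""
    names = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue                      # sequence line, skip
        match = _NUMREADS.search(line)
        if match and int(match.group(1)) == 1:
            names.append(line[1:].split()[0])   # name is the first word
    return names
```

Pass it an open file handle of, for example, the 454AllContigs.fna file, and it returns a list of the contig names in question.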

Many people ask about these strange contigs, both in the comments on this blog, and on sites such as seqanswers.com. I hope this post makes the situation around these contigs a bit less confusing…

Recently, newbler version 2.5.3 became available. With this post, I’ll describe the changes between this version and the previous one (2.3). As I have not yet described the gsMapper function of newbler, I here only discuss changes relevant to assembly (gsAssembler, runAssembly).

Read order
Previously, the order in which reads were added using addRun could have an effect on the final outcome of the assembly. Now, newbler will use a ‘canonical’ (fixed) order, regardless of the order of addition. This means that assemblies using the same parameters and the same set of reads will always result in the same assembly output (contig number and sequences etc), regardless of the order of the addition of reads. Note, however, that this only holds when a single cpu is used; repeated multi-cpu assemblies will have slightly different outcomes.

New options
-tr
This option to output the trimmed reads was present already in version 2.3, but hidden, see my previous post.

-sio
This parameter is a great one for assemblies based on very large read datasets (over 4 million reads); ‘sio’ stands for Serial I/O. The option solves the problem of extremely long processing times at the end of assemblies, the ‘Computing signals’ and ‘Generating output’ phases. Here, newbler has to go through all the sff (raw read) input files, so that it can use the exact basecalls and signal strengths for consensus base and signal calculations. Newbler searches for this information in the order in which bases are located in contigs and scaffolds, accessing the read files many times. With -sio, newbler first builds temporary files with the required information in a more efficient order. This speeds up these phases significantly.
Note, however, that up to three times the amount of disc space that the original sff files occupy is required for this process (I have had a long assembly crash at the very end because of lack of disc space…). Also, more memory is needed (the number of reads in your project times 8 kilobytes, with a maximum of 8 Gigabytes). Temporary files will be deleted upon completion of the assembly.

-siom
This option can be used to allow more memory consumption of the -sio option. Use -siom followed by the number of Gigabytes allowed (from 1-1000).

-siod
This option modifies the way -sio operates, in that it minimizes disc space consumption, presumably not at a loss of speed.

-nosio
When rerunning a project (that was run using the -nrm option) for which any of the -sio options was set, by default, the previous -sio option is used again. Using -nosio cancels any other -sio options.

-force
Previous versions of newbler would overwrite an assembly project folder with the same name as specified with the -o option. Newbler 2.5.3 instead exits with an error message that the project folder already exists. Using the -force option allows overwriting the existing project folder.

-urt
This stands for ‘use read tips’ and can be helpful for low coverage assemblies, or low coverage regions in an otherwise high-coverage assembly. The unaligned parts of assembled reads at the ends of contigs can extend significantly beyond the actual contig (the region consisting of multiple aligned reads). With the -urt option, the contig is extended to the end of such reads. The description says ‘the assembler tries to extend the contig to the “tip” (end) of the read which extends unaligned’, but I don’t know what determines whether a try is successful or not.
In addition, very low coverage overlaps can result in contigs, where they normally would not.
The primary use of the -urt option is for transcriptome assemblies, where using the option will help obtaining contigs for rare transcripts (because they have a low coverage). However, also genome assemblies might benefit if they have regions represented with few reads.

New output files for transcriptome assemblies
Assemblies using the -cdna option will have two new output files:

454Isotigs.faa
This file reports the protein sequences of any ORFs (open reading frames) detected in isotigs and contigs (of at least 10 bp). It is up to you to determine which ORF is correct (where the longest one is the most probable for long isotigs…). The ORFs are reported from longest to shortest, and the header lines contain the following tab-separated information:

The amino acid sequences are sorted by length (for each isotig), and are preceded by a description line that consists of the following tab-delimited information:

454IsotigOrfAlign.txt
This file contains the predicted ORF amino acids aligned below the nucleotide sequences for the ORFs reported in the 454Isotigs.faa file. (The example below might look better if text size in your browser is set small…)

Lines showing the nucleotide sequence list the isotig/contig name, the start base position of the part shown, the sequence and end base of the part shown.
The other lines show the frame, a colon (‘:’), start (nucleotide) base, ‘…’ and end base, and an optional asterisk (‘*’) indicating that this ORF is the longest one. This is followed by the start amino acid position of the part shown, the sequence and end amino acid of the part shown.

Both the runMapping and runAssembly programs are able to take in reads from other platforms, at least Sanger reads and Illumina reads. As long as the reads are in fasta format, with an optional quality file, newbler accepts and uses these reads. When the fasta files contain paired end (mate pair) reads, newbler can actually be made to use the pair information.

In general, it is a good idea to clean your fasta sequences before adding them to newbler: remove vectors, linkers, low quality parts of reads, or entire low quality reads first.
Also note that, while for sff files a symbolic link is generated in the assembly or project folder (still present after the program is finished when the -nrm flag is set), fasta files are not included in this way.

1) Unpaired (single-end) Sanger reads
These can simply be added by telling newbler the location of one or more fasta files:

If you have a file with the corresponding quality values, make sure to use the same filename, but change the ending to ‘.qual’ (and put the file in the same folder). Newbler will always check whether there is such a file. So, in the above example, placing your quality file in /data/sanger and calling it reads.qual will do the trick.
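Before starting a long assembly, it can be worth checking that the quality file is actually where newbler will look for it. The Python sketch below derives the expected quality file name from the fasta file name following the rule above (same folder, same name, ending changed to ‘.qual’). The function names and the example file name reads.fasta are my own assumptions.

```python
from pathlib import Path


def expected_qual_file(fasta_path):
    """Return the path where the matching quality file is expected."""
    # Same folder and base name as the fasta file, ending replaced by .qual
    return Path(fasta_path).with_suffix(".qual")


def has_qual_file(fasta_path):
    """True if a quality file exists next to the given fasta file."""
    return expected_qual_file(fasta_path).is_file()
```

So for a (hypothetical) /data/sanger/reads.fasta, the quality file is expected at /data/sanger/reads.qual, and has_qual_file() tells you whether it is there.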

2) Paired Sanger reads
The pairing information needs to be put in the fasta header for each sequence in order for newbler to understand which reads belong together. So, there is no need to join the two reads into one and add the 454 specific paired-end linker (as sometimes is suggested in forums), this actually will not work.
A read whose fasta header looks like this:

>plate12_G08_F template=plate12_G08 dir=F library=fosmid1

tells newbler that it is from a paired end library called ‘fosmid1’, in the forward orientation, and that the sequencing template (e.g. clone, or in this case, fosmid) was ‘plate12_G08’. Newbler then will look for a corresponding reverse read with this fasta header:
>plate12_G08_R template=plate12_G08 dir=R library=fosmid1

You can actually add multiple reads with the same header (if you have duplicate sequencing attempts from the same template), but newbler will only pick the ‘best’ one for assembly or mapping, where the alignment length and similarity is determining which read is best.

The ‘library’ name is used by newbler to group reads in the same way as for sff files (a per-library average insert distance is calculated, see this post). You will see the library name appear in the 454NewblerMetrics.txt file.
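To make the header layout concrete, here is a minimal Python sketch that builds such a header from its three ingredients. It follows the two example headers above, where the read name is the template name followed by ‘_F’ or ‘_R’. That naming of the read itself is simply taken from the example, and the function name is made up.

```python
def paired_read_header(template, direction, library):
    """Build a fasta header carrying the pairing information for newbler.

    direction must be 'F' (forward) or 'R' (reverse).
    """
    if direction not in ("F", "R"):
        raise ValueError("direction must be 'F' or 'R'")
    # Read name = template name plus direction, as in the examples above.
    return ">{0}_{1} template={0} dir={1} library={2}".format(
        template, direction, library)
```

Calling paired_read_header("plate12_G08", "F", "fosmid1") reproduces the forward read header shown above; with "R" you get its mate.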

How you make your fasta headers for paired reads into this format is a bit up to you, as it is very much dependent on the format of the headers of your Sanger reads. It usually requires some scripting, or sed/awk commands. For example, I once had a set of BAC-ends with fasta headers like this:

Note: you have to ‘force’ newbler to take in the reads as paired end reads by including the -p flag:
runAssembly -o project1 -p /data/sanger/paired_reads.fasta /data/sff/EYV886410.sff
At the beginning of the assembly, newbler should then mention this:

3) Sanger reads for closing gaps
If you have a genome assembly for which you did some PCRs to close gaps, and sequenced the PCR products using the Sanger technology, you can actually try to use these reads to have newbler close the gaps for you. I must admit that I have not seen this being done with success yet, but in principle it should work. However, as Newbler needs more than one read in an alignment to build a contig, I would recommend adding several non-identical copies of each read to the assembly. The copies should be non-identical because newbler takes only one copy of identical reads. Making such copies can be done by shifting the start and end position of the copies. For example, for a 600 nt read you could create three copies as such:
– copy 1 from position 1 to 580
– copy 2 from position 10 to 590
– copy 3 from position 20 to 600

Make sure to give each copy a unique fasta header…
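Making such shifted copies is easily scripted. The Python sketch below takes a read name and sequence and returns the three copies from the example above, each with its own name. The ‘_copy1’ style of naming is just something I chose here, and, as said, I have not seen this approach work in practice yet.

```python
def shifted_copies(name, sequence, windows=((1, 580), (10, 590), (20, 600))):
    """Return (name, subsequence) pairs for slightly shifted copies of a read.

    windows holds 1-based, inclusive (start, end) positions; the defaults
    are the ones from the 600 nt example in the text.
    """
    copies = []
    for number, (start, end) in enumerate(windows, start=1):
        copy_name = "{}_copy{}".format(name, number)    # unique per copy
        copies.append((copy_name, sequence[start - 1:end]))
    return copies
```

Each returned pair can then be written out as a fasta record, with the name after the ‘>’ sign and the sequence on the next line.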

4) Illumina reads
In principle, Illumina reads can be added by converting the fastq files to fasta and quality files, with Sanger quality values, adjusting the fasta headers as described above, and feeding them to newbler. However, people who have tried this have so far reported newbler crashing when these reads were being assembled (e.g. here). My only experience is trying to add a tiny amount of Illumina reads to a much larger 454 read dataset, and that worked well.

Again, converting Illumina fastq files to fasta and (Sanger-style) qual files can be done in several ways, see for example the comments to this post or the SEQanswers forums.
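As one illustration of such a conversion, here is a small Python sketch. It assumes strict four-line fastq records and quality characters in the Sanger encoding (ASCII offset 33). For older Illumina files with an offset of 64 you would pass offset=64, which gives the same Phred values as long as the file really uses that encoding. It does not touch the headers, so the pairing information still has to be added as described above. The function name is my own.

```python
def fastq_to_fasta_qual(fastq_lines, offset=33):
    """Convert fastq lines to (fasta text, qual text).

    Assumes 4 lines per record; quality values are written as
    space-separated integers, as in a regular .qual file.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    fasta_parts = []
    qual_parts = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        name = header[1:]                               # drop the '@'
        scores = " ".join(str(ord(char) - offset) for char in quality)
        fasta_parts.append(">{}\n{}\n".format(name, sequence))
        qual_parts.append(">{}\n{}\n".format(name, scores))
    return "".join(fasta_parts), "".join(qual_parts)
```

Write the first returned string to, say, reads.fasta and the second to reads.qual in the same folder, so that newbler picks up the quality values automatically.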

Illumina runs typically come in two files per lane, one for each read direction (forward and reverse). You can also add each of these in a separate file, and newbler will still be pairing the reads up with their mates. As an example for two files from lane 4:

It seems that the commandline I gave above does not work. I have had better success by using the newAssembly/addRun/runProject approach, both for a single Illumina file as well as one file per run half:

Sff files are the standard output of the 454 sequencing machine. ‘sff’ stands for ‘standard flowgram file’. The 454 sequencing method determines the sequence not base by base, but measures homopolymer length (the number of consecutive ‘A’s, ‘C’s, ‘G’s and ‘T’s in a sequence). Nucleotides are flowed over the sequencing plate in a determined order (T-A-C-G) and a light signal is generated during nucleotide incorporation. The strength of the light signal is proportional to the number of bases built in (at least up to a certain number, around 7). As the flow order is always the same, for certain sequences no base can be built in, leading to a signal of strength (+/-) 0.

The sff file contains all the bases, quality values and signal strengths, in contrast to the fna and qual files. Note that sff files can, by definition, contain reads from only one type of chemistry, i.e. either GS 20, GS FLX or GS FLX Titanium reads.

Sff files are binary files, meaning that they cannot be accessed by regular text-based tools. 454 has its own scripts to manipulate sff files and extract information from them (sfffile, sffinfo), but other programs/scripts can also be used to extract information from them. Example programs are sff_extract, flower and sff2fasta, or you can use the Biopython parser; there is nothing for BioPerl yet (I have not tested any of these – use at your own discretion…). When one uses 454’s sffinfo command on an sff file without parameters, all information contained in the file is reported in text format. The remainder of this post will describe that output.

Key length: the length (in bases) of the key sequence that each read starts with, so far always 4

# of Flows: each flow consists of a base that is flowed over the plate; for GS20, there were 168 flows (42 cycles of all four nucleotides), 400 for GS FLX (100 cycles) and 800 for Titanium (200 cycles)

Flowgram code: kind of the version of coding the flowgrams (signal strengths); so far, ‘1’ for all sff files

Flow Chars: a string consisting of ‘# of Flows’ characters (168, 400 or 800) of the bases in flow order (‘TACG’ up to now)

Key Sequence: the first four bases of reads are either added during library preparation (they are the last bases of the ‘A’ adaptor) or they are a part of the control beads. For example, Titanium sample beads have key sequence TACG (default library protocol) or GACT (rapid library protocol), control beads have CATG or ATGC. Control reads never make it into sff files…

>F7K88GK01BMPI0: this is the read name, or “universal accession number.” ‘F7K88G’ encodes the timestamp of the run, ‘K’ is a random character, ’01’ indicates the region (lane) number on the plate, ‘BMPI0′ encodes the x,y location of the read on the plate.

Run prefix: A run folder starts with ‘R’ and the time the run started: R_yyyy_mm_dd_hh_min_

Region #: the region (lane) on the plate the read originated from

XY location: the location of the read on the plate

Run name: R_yyyy_mm_dd_hh_min_sec_machineName_userName_yourrunname

Analysis name: after a run, a subfolder is made with the image/basecalling analysis results, the foldername starts with ‘D’ and the time the analysis started: D_yyyy_mm_dd_hh_min_sec_machineName_analysisType

Full path: of the analysis results that the sff file originated from (on the GS FLX instrument: /data/R_…/D_…)

Read header len: 32 for all files as far as I can tell

Name length: the length of the read name (14), see above

# of bases: the total number of bases called for the read (before clipping)

Clip qual left: the position of the first base to be included after clipping. This is usually 5 because of the first four bases that are the key sequence. In this example, the read had a 10 base MID sequence; the example sff file is the result of splitting the original sff file, and during splitting the MID sequence is ‘removed’, i.e. the clipping point is set beyond the MID end.

Clip qual right: position of the last base before the (quality) clipping.

Clip adap left and right: I actually wouldn’t know what these represent, but perhaps under certain circumstances, adaptors can be ‘removed’ this way.

Flowgram: for each flow, the normalized signal strength, or actually, the homopolymer length estimate, as a floating point number with two digits to the right of the point.

Flow Indexes: the flows actually used for basecalling (excluding flows considered to be ‘0’, i.e. no signal).

Bases: the determined DNA sequence. Lower case bases are before and after the clipping point
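The layout of the read name described above (six characters for the timestamp, one random character, two digits for the region and five characters for the x,y location) can be taken apart with a few lines of Python. The sketch below only splits the name into these parts; it does not decode the timestamp or the coordinates, and the function name is mine.

```python
def split_read_name(name):
    """Split a 14-character 454 read name into its four parts."""
    if len(name) != 14:
        raise ValueError("expected a 14-character read name")
    return {
        "timestamp": name[:6],      # encodes the timestamp of the run
        "random": name[6],          # random character
        "region": int(name[7:9]),   # region (lane) number on the plate
        "xy": name[9:],             # encodes the x,y location on the plate
    }
```

For the example read F7K88GK01BMPI0 this gives timestamp ‘F7K88G’, random character ‘K’, region 1 and location code ‘BMPI0’.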

The figure is a graphic representation of the flowgram, with another example of ‘reading’ the sequence from it. Note that for some signals, the intensity is such that it is hard to determine whether for example there are two or three bases at that position. This inherent property of pyrosequencing leads to the well-known homopolymer (over- and undercall) errors.
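The principle of ‘reading’ a sequence from a flowgram can also be written down as a few lines of Python. The sketch below simply rounds each signal to the nearest whole number and emits that many copies of the base flowed in that flow. This is a simplification of my own to illustrate the idea, not how the 454 software actually calls bases, and the example signal values are invented.

```python
def flowgram_to_bases(flow_chars, flowgram):
    """Naively call bases from a flowgram by rounding each signal.

    flow_chars: bases in flow order, one per flow (e.g. 'TACGTACG')
    flowgram:   one signal strength per flow (e.g. 1.02, 0.03, ...)
    """
    bases = []
    for base, signal in zip(flow_chars, flowgram):
        homopolymer_length = int(signal + 0.5)   # round to nearest integer
        bases.append(base * homopolymer_length)  # 0 means no base built in
    return "".join(bases)


# Invented example: eight flows in T-A-C-G order.
signals = [1.02, 0.03, 0.98, 1.05, 0.01, 2.10, 0.04, 0.95]
sequence = flowgram_to_bases("TACGTACG", signals)   # 'TCGAAG'
```

A signal of, say, 2.48 shows the problem nicely: it is rounded down to two bases here, but could just as well have come from three, which is exactly the kind of homopolymer error mentioned above.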

This post describes the transcriptome specific output files, or differences between the files for a transcriptome assembly relative to a regular assembly. For (aspects of) files not treated in this post, have a look at these previous posts.

Alternative rope splicing (source: Wikimedia commons)

1) 454NewblerMetrics.txt
The differences for this file, relative to the same file for a ‘normal’ assembly (described in this post), are metrics on the isogroups and isotigs:

Besides the number of isogroups, the average and maximum number of contigs per isogroup are listed, as well as number of isogroups with only one contig. Below that, the average and maximum number of isotigs per isogroup are listed, as well as number of isogroups with one isotig.

After the number of isotigs follows the average and maximum number of contigs per isotig, and the number of isotigs with one contig (note the typo…). Below that, the total length of all isotigs combined is listed, as well as the average, N50 and longest isotig length.

2) A note on isogroup numbering
The isogroups are ordered such that the lowest numbers (identifiers, e.g. isogroup00001, isogroup00002, …) are the isogroups that only consist of contigs, followed by the isogroups with two and more isotigs in the middle, and ending with isogroups with only one contig.

3) 454IsotigsLayout.txt
This file represents in a schematic (‘graphical’) way how the isotigs are built up out of contigs.
As an example, let’s take isogroup 156:

(Note that I added ‘_’ symbols to the second and third lines in order to have the columns align as they should; in the real file these are spaces, but I can’t get these to show up here…).
The first row explains that Isogroup 156 (‘gene’) contains 10 contigs forming 10 isotigs.
The table lists the contigs as columns, and the isotigs as rows.
The ‘Length’ row lists the lengths of the individual contigs
The contig numbers are listed below (so, ‘5625’ is really contig05625 in other files).

So, isotig02293 is built up out of the following contigs:
5638, 6778, 5613, 5627, all in the ‘forward’ orientation, represented by the right-pointing ‘>’ symbol.
isotig02294 is built up out of the same contigs, with one difference: contig 5627 is replaced by 5628. Most likely, these two contigs are quite similar, but not similar enough to be collapsed into a single contig.
isotig02299 and isotig02300 are good candidates for splice variants, as they are identical except for the third contig, which is missing in 2300. Note that the last contig of these isotigs is in the reverse orientation (‘<’ symbols).
Isotig02302 is a bit special; it consists of only two contigs, and it is the only one with contig 217…

The isogroups listed last are the ones consisting of only 1 isotig/contig, for example:

4) 454Isotigs.txt
This file is the equivalent of the 454Scaffolds.txt file from a regular assembly (see my post on this file /2010/03/22/newbler-output-ii-contigs-and-scaffolds-sequence-files-and-the-454scaffolds-txt-file/). It follows the ‘AGP’ format.

These files have all the contigs. Note that ‘All’ here refers to contigs from 1bp and longer, usually this file contains by default contigs of 100 bp and larger. Also note that in the 454NewblerMetrics file (described above), the number of All contigs is again different, but when I checked, it does not look like it is representing the contigs with a lower limit of 100 bp…
The fasta headers of the contigs list the usual length and number of reads, but in addition, they list to which isogroup the contig belongs, and the ‘filter status’, as described above for the 454IsotigsLayout.txt file.

7) 454Isotigs.ace and/or consed files; 454AlignmentInfo.tsv
These are as per the normal assembly output, but contain the isotigs, and those contigs that were not included in any isotig.

The header line, though, is based on the UCSC’s reflink.txt file. The manual states “This can be used as an annotation file for further mapping projects of the cDNA / transcriptome assembly products (isotigs and contigs)”, but I have no idea how to actually use this file for that purpose…

A talk I gave at the Dec 2013 Assembly Masterclass at UC Davis. Really licensed under CC0. UPDATED May 2014, for the presentation I gave at the combined SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th 2014

An update of the previous talk with the same title. A talk I gave at the Computational Life Science initiative (University of Oslo) about new High Throughput Sequencing instruments at the Norwegian Sequencing Centre. I also mentioned future upgrades, and the upcoming nanopore sequencing platform of Oxford Nanopore.

A talk I gave at the Microbiology Research Group (University of Oslo) about new High Throughput Sequencing instruments at the Norwegian Sequencing Centre. I also mentioned future upgrades, and the upcoming nanopore sequencing platform of Oxford Nanopore.