Wednesday, March 8, 2017

Release Candidate 1 version 3.2.2_RC1

In the past several weeks I have been using MaSuRCA to create improved assemblies of several genomes of varying difficulty and I was able to detect and fix several performance issues and bugs:

1. run time too long in scaffolder and consensus when using Illumina linking mate pairs or other mater reads such as Sanger or 454
2. failure when combining single end and paired end paired end libraries when USE_LINKING_MATES is turned on
3. accuracy and contiguity of the assembly does not improve uniformly with more PacBio data
4. overlap filtering step after the unitigger takes too long for large data sets
5. mapping in duplicate/haplotype filtering is faster, because I excluded the singleton unitigs from the mapping
6. Illumina mate pairs are now properly used in scaffolding of assemblies that contain PacBio or Nanopore reads

Thanks to all MaSuRCA users who reported their issues to me and I hope I was able to address most of them.

The resulting package is much faster and more accurate. Here is an example of an assembly of Arabidopsis thaliana data set with 50x PacBio data, aligned to the finished genome. Note that the finished sequence is different species but there should not be any large structural differences. The plot shows a line for each contig with start and end positions corresponding tot the alignment positions in the finished genome (X axis) and in the assembly (Y axis). There are three off-diagonal matches which correspond to three small misassemblies. The N50 contig size of this assembly is 8Mb, which is close to about 10Mb one can achieve with 100x PacBio data. The NGA50 -- N50 after breaking at misassemblies is 7.1Mbp.

I also changed the way super-reads are mapped to the PacBio and Nanopore reads. The problem was in connectivity of the super-reads -- to snap two super-reads together to create a mega-read, an overlap of K or more is needed, and we try to maximize K to create longer super-reads e.g. K could be 127bp. For example we have super-reads A_B_C, C_D and D_E_F. To connect super-reads A_B_C and D_E_F to create the mega-read A_B_C_D_E_F, we need to have A_B_C and D_E_F to have exact overlaps with C_D. So if for some reason C_D is lost in the mapping a gap may be created. But the gap does not need to be there because A_B_C and D_E_F may still overlap exactly, with the overlap shorter than K. Now the super-reads are pre-processed to reduce K to 41 which effectively reduces the minimum overlap length for creating mega-reads from super-reads while still keeping the principle of the algorithm intact.

I feel that MaSuRCA mega-reads is not mature enough and tested to give it Release Candidate status. The new MaSuRCA is available here:

ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.2_RC1.tar.gz

Please post any issues with the new version in comments to this blog post.

8 comments:

I have a gel-free mate-pair library with a highly variable insert size (2x100 4000 +-3100) and PE (2x100 350 +-90) PEs, and previous masurca version generate very nice assemblies with N50 ~ 5Mb for a 312Mb plant (polyploid and heterozygous) genome. However there are a lot of NNs in the scaffolds. For instance, for the larger 9Mb scaffold, 4,5Mb are NNs. Interestingy, any other assembler are capable to generate scaffolds larger than 500kbs, or even N50 near 5kbps. My question is, whether this is a probable masurca bug, or indeed it's represent a good and confident assembly.Thanks a lot! I'm testing this new version

Hi, the assembly is likely to be fine but the gap sizes may be inflated due to large stdev of the library, you can simply reset them back to 100bp. The new MaSuRCA version did not change much for Illumina-only assemblies, except for a few bugfixes, most changes are for long-read assemblies.

Hi,For illumina only assemblies with a large amount of mate-pairs reads the cgw ckp07-2SM step is still taking several days to get completed.I have noted that the MergeScaffoldsAggressive runs till it reach the minWeightThreshold of 2. However, when the minWeightThreshold reach between 5 to 7 I already have very large scaffolds. Is there any way to set the minWeightThreshold to 5 or a larger number, just to speed-up this step, and skip the scaffold merging of contigs with low Threshold ?

Hi Aleksey, When running this new version on a butterfly data set with ~12x PacBio + 100x 2x150 Illumina, it crashes the server right when it starts creating the quorum database. I tried the January release on the same data and it now proceeds past that point. I specified 40 CPUs and I'm using RHEL 7. Any idea about what the issue might be?