Sunday, June 12, 2016

Assembling the Mentha Longifolia genome

A group of researchers, including myself, recently made a draft genome assembly for Mentha longifolia. The publication for which is currently under review. This post is my notes for my part of that effort. Basically that means that this is my lab notebook for part of the summer of 2014. It's pretty disorganized and may or may not be useful to anyone ever....

Long story short:
I was trying to figure out how to integrate PacBio reads with Illumina reads.
In the end, I used ECtools to correct the PacBio reads, and then used PBJelly to merge the PacBio reads with an assembly generated from the Illumina reads. It is my understanding that there may be better options for PacBio error correction today, but at the time ECtools seemed to be the best.

Read on for the juicy details....

M. longifolia has been proposed as a model species for mint genetics due to its
relatively small, diploid genome. Here I'm documenting my attempts to
generate a reasonable assembly of the genome from two datasets, a set of
paired end illumina reads, and a set of PacBio reads.

Before
I get to far, I want to know a bit about the data. What is the quality
of the reads, and what are the read length distributions. Based on the
answers to these questions, I will have to cull and trim the data to get
rid of low quality reads and parts of reads.

I have the following files:
filtered_subreads.fasta
The PacBio reads in fasta format. They've already been adapter trimmed
and run through a 0.75 quality threshold.
filtered_subreads.fastq The same reads, but with phred-33 quality scores
mentha_plastome_2.27.12.fas An assembly of the chloroplast genome.
Mint_allcontigs_052311.fa An assembly of the genome based on illumina data (a velvet assembly)
Mlong_all_contigs.fa A variant of the genome assembly based on illumina data (a velvet assembly)
s_5_seqsMerged.fastq
The illumina reads (about 135 million 101 bp reads, paired end) Appears
to be Illumina 1.5+ Phred-64 scoring (there are "B" but not "@"
characters)
s_5_seqsMerged.fastq.gz A zip of the illumina reads

Make a histogram of sequence lengths (based on the biopython tutorial, I'm doing this in ipython):

from Bio import SeqIO
import json
sizes = [len(rec) for rec in SeqIO.parse("filtered_subreads.fasta", "fasta")] #get the length of each sequence
counts = dict()
for x in sizes: #make a dict where keys are size, and values are the number of sequences of that length
counts[x] = counts.get(x,0) + 1
with open('filtered_subreads_lengths.json', 'w') as outfile: #save the dict as a json file, so we can use it later if we want
json.dump(counts, outfile)
len(sizes), min(sizes), max(sizes) # outputs the number of sequences and range of the sequence lengths (4117507, 1, 16013), if not using an interactive interpreter, you could print these

Not
real pretty. Lots of really short sequences, but also some longer ones.
I've got tons of illumina data in the form of 101nt paired end reads.
So where the PacBio data will be most useful will be for reads longer
than 202 nt (or perhaps 101 nt in some regions). So lets see what the
data looks like if we get rid of all the reads shorter than 202 nt.

Looks like we've still got about 2.5 million reads, which is
63% of our original sequences.
The histogram once again looks like exponential decay. If I repeat for
sequences longer than 1 kb, I find that there are 1.27 million of them,
31% of the original reads. Which gives us a ballpark estimate of
(haploid) genome coverage for the ~500 Mb genome of about 3x-5x. If we
use a 500 bp cutoff we get an estimated 6x coverage. From the illumina
data we have 13.5 gigabases which is about 27x coverage.

The
target fragment size for the Illumina library must have been around the
typical 200 bp. Based on an estimate from a CLC-Bio assembly output,
the average fragment size was actually a bit less than that, maybe 150
bp or so.
The short insert size indicates that merging overlapping pairs may improve assemblies. I used FLASH to merge reads with default settings except for switching from phred 33 to phred 64 with the option "-p 64".

There are many potential strategies we can use to try to assemble a draft genome from this data:
1. Use the Celera Assembler and PacBioToCA.
I haven't seen many (any) reported success stories for using this with
largish genomes like this one, but they give instructions for large
genomes in the FAQ. They say "Approximately 100X (or more) of C2
sequencing is ideal." which is way more depth than we've got, but it
couldn't hurt to try. This process is also explained in this PacBio tutorial.
2. Assemble the short-reads by themselves
(by whatever means... There are lots of shortread assemblers out there),
and then try to fill gaps and glue contigs together with the PacBio
reads. Gap filling could be done with PBjelly. Or this blog seems to suggest doing it manually or with custom scripts.
3. Correct the reads with LSC, or PacBioToCA, and do a hybrid assembly with an assembler that can use multiple read sources.
4. Use Cerulean assembler. They don't recommend using it for large genomes, but apparently they're working on that, so it may be worth trying eventually.
5. Use the process described by Allen Van Deynze in his spinach genome talk where
he discusses binning the PacBio reads into a bin of shorter and a bin
of longer reads, then using the shorter reads to correct the longer
reads using FALCON an HBAR-DTK. He had way longer reads and deeper coverage than we've got, but it may be worth it to try anyways.
6. Use the process proposed by Mike Schatz (as mentioned in this blog) and briefly explained in the second half of this presentation. And further elaborated towards the end of this presentation.
Basically, it involves first using Celera Assembler to generate unitigs
from illumina reads, then use those to correct the PacBio reads, then
assemble the PacBio reads (also with Celera). A partial implementation
of this process can be found as ectools.
7. In the process of trying out all of the strategies listed above, I've devised by own strategy. Correct PacBio reads (initially using Celera unitigs from assembling the Illumina data as single end reads, for some reason treating them as paired end reads gave worse unitigs, could also try other methods of generating unitigs). Assemble the corrected PacBio reads with Celera Assembler. Generate a short read assembly using only the Illumina reads (try FLASH merged, and not FLASH merged, also try a range of K-values and assemblers). Merge assemblies with GAM-NGS, then run PBjelly.
8. Like 7, except assemble all the reads (corrected PacBio, FLASHed Illumina) together in one assembly (try Abyss, MaSuRCA, SOAPdeNovo2, Celera, and Minia, CLC, whatever), then PBjelly with uncorrected PacBio reads.

1. Following the tutorial on the PBcR page. First I had to install AMOS,
configure complains about a bunch of dependencies. This was pretty
easy, as Ubuntu has packages for most of them, and CPAN has some.

(for cpan Statistics::Descriptive to install correctly, I had to make cpan install packages system-wide, as explained here).
After all this, BLAT was the only dependency left, and I think I should
be ok without it. make gave an error: "error: ‘getopt’ was not declared
in this scope" but google turned up a solution here
(in src/Align/find-tandem.cc add #include <getopt.h>). After
compiling add it to $PATH, and off we go. Prepare the illumina frg file
with fastqToCA. Compile BLASR and put it in $PATH (this is straight forward). Download the sample data
from the PBcR tutorial. The computer I'm using is similar enough to the
one from the tutorial that I'm just following their configuration
without modification. Change pacbio.spec to include the lines:

myself in the console, and it finally launched BLASR. I believe
the reason it did not work before is because the data directory is on
an NTFS drive that is not currently mounted to allow permissions
changes, although that seems to be possible.
Also, it's hard to tell whether I've set the threading settings well,
I'm using 12 physical cores with hyperthreading allowing up to 24
threads. I kept the threading settings they use in the tutorial,
optimized for a 16 core machine, figuring maybe that would cause my
machine to max out 8 cores, but top shows it using 24 threads at an
average of 50% load on each, so I'm not sure exactly what's going on, it
seems to work well enough though.
To solve the NTFS permissions problem, added the following to
/etc/fstab:

Permissions still not perfect, but good enough I think.
The
BLASR step ran ok, but when it got to the "runCorrection.sh" it started
using too much memory and accessing swap space constantly and going
really slowly. I sent an email to Sergey Koren, the first author on the
paper, and he said that the runCorrection.sh step could take more than
48 gb of RAM on large genomes, however, since it's already done the
alignments, I don't have to repeat that step. All I have to do is kill
the process and move the temporary files (and the input files) to a
higher ram computer and rerun the script. He says that it should pick up
where it left off, so that's what I'll do. Sergey also said that
decreasing the maxGap option would decrease memory usage. Instead of
recompiling AMOS (with all of its myriad dependencies) on the hpc, I'm
just copying over the contents of AMOS/bin that I compiled locally,
which will hopefully be good enough.
I don't know if it was
necessary, but after I moved the temporary files to the hpc, I also
changed all the .frg files to point to the right files in the hpc
directory structure, and I changed paths used in any .sh files (namely
runCorrection.sh and 1-overlapper/overlap.sh) to be correct relative to
the hpc (edit: it appears that this was actually unnecessary I think it
rewrites the sh files anyways, though probably not the frg files). The
first time I ran pacBioToCA on the hpc it gave me the error:
"Error: found output from runCorrection. Stopping to avoid infinite loops. To try again, remove asm.layout.*"
It
looks like the only file I had in temppacbio with that name was
asm.layout.err, so I deleted that file and restarted the process. That
looks like it worked. However, after about 18 hours of running, the
process had generated nearly 3 TB of data, and was using about 300 GB of
RAM. At this point, the process terminated with an error (which I
assume was caused by the overconsumption of system resources, but I
can't find the log, so I don't know for sure). So, I deleted a bunch of
the temporary files, changed the maxGap setting to 0 and put the process
on the queue again If that works, I'll put it back up to 750 or 1000
and see if that works.Puting the maxGap setting to 0 led to significant
reduction in resource utilization, the program got a step further than
it had before, but the flaw in my strategy to just copy over the
binaries from my desktop to the hpc became apparent, as both BLASR and
AMOS died with errors. BLASR compiled easily enough on the hpc, but
first had to compile its dependency hdf5 remembering to use the
--enable-cxx option on ./configure. AMOS also compiled easily, but when
I tried to run one of the executables, I got the same error that I saw
with the build from the other machine: "/lib64/libc.so.6: version `GLIBC_2.14' not found", some generic workarounds described here and here. This error doesn't make any sense because AMOS documentation
claims that it can be compiled and run on RedHat, and (as of this
writing) no version of RedHat has glibc version >=2.14. So, I spent
quite a while trying to make a module that would allow glibc 2.14 to run
on RedHat, only to find that once I got it, AMOS still didn't work.
Apparently the problem was that I was skipping the `make test` step in
the installation, which for reasons I don't understand, is apparently
essential on RedHat, but not essential on my Xubuntu desktop. Anyways,
all of that hassle, I ran pacBioToCA once again, and it got further than
ever before (taking about 15 days to get there), and then crashed. In
the meantime, I read about another strategy (#6 in this list), that
promises to do the same job, but much more efficiently. So I'm giving up
on pacBioToCA for now (and hopefully forever). I think my 500 GB genome
is just too much for it, and it keeps crashing, and the debug cycle is
extremely long (sometimes it runs for 2 weeks, and then crashes... The
obvious solution, I suppose would be to test with a much smaller test
dataset, and once it works for that, switch to the larger dataset. Maybe
when I'm desperate, I'll try that).

2.

Using minia as a shortread assembler: kmergenie (default
settings), recommends a kmer-size of 43, but the graph was jagged
rather than concave, so it may be a bad estimation (it's a diploid
genome, but when I ran kmergenie with the diploid option, it failed).
Ran minia with settings: s_5_seqsMerged.fastq 43 3 500000000 k_43_. Not a
great assembly. contigs: 1,784,832; N50: 201; mean length: 183; min
length: 87; max length: 4782. Tried again with K = 27 (took about 32
hours to run). K = 43 seems to be the better option: contigs: 2,008,767;
min = 55; max = 4789; average = 140.24.

Tried running minia again, this time with the FLASH merged contigs, and a range of kmer sizes from 17 to 51 step 2.

Using SOAPdenovo2:
compiled nicely right out of the box . Rumored to be memory intensive,
so I'll put it on a node with 512GB. Easy to compile. I need to know
more about the Illumina reads before I can configure this properly

ABySS: Needs some external libraries to compile properly (see their quickstart
for details) : boost, sparsehash, and OpenMPI. I already had boost and
OpenMPI, but I had to download and compile sparsehash
(I don't have root access on the hpc, so I had to change the Makefile
so it installed into my home directory, by changing the $prefix
variable). To compile ABySS, I used "./configure
--with-boost=/home/software/boost-1.48_intel/include
--with-mpi=/home/software/mpi/gcc/openmpi-1.4.5
CPPFLAGS=-I/home/langelab/seanrjohnson/local/include", then make. Then
"make install DESTDIR=/home/langelab/seanrjohnson/local" because we
can't install this globally either (that DESTDIR is not set quite right,
I need to investigate how to do that better). Run the following Torque script to try a bunch of different kmer-size assemblies (20 to 64 by 2s)

The maximum assembly size (in contigs >= 500 bp) is small, only
97.04 MB out of a predicted 500 MB genome. When I run the stats with a
minimum contig cutoff of 0, I see that, in fact, most of the genome
seems to be getting covered, albeit mostly with very short sequences
(the Sum column is between 550 MB and 431 MB with higher Kmer values
corresponding to lower sum). Doing more assemblies, this time with kmers
between 40 and 64:

We're getting the best assembly at a kmer size of about 38 it looks
like. I tested to see if using the FLASH joined reads would improve the
assembly quality.

3. LSC always seemed to use way more memory than the
10GB they
claim on the website. In the Bowtie step, it launches 10 instances of
bowtie2, each of which use about 5 GB of RAM, which is enough to put my
computer over the edge, and slow everything way down. I decided to try
again on a high-RAM hpc node. First I need to get python, numpy, and
scipy installed there, which is kind of a pain, because I don't have
root access, and the hpc is not directly connected to the internet, so I
have do download packages locally, and then scp them over and compile
them. Compiling scipy and numpy looks like a pain (and I don't know if I
have the right Fortran compiler), so I'll take the alternative path of
compiling SAGE, which looks easier, and includes scipy and numpy. I had all of the prereqs except 'texinfo', which was easy enough to compile
(difference: ./configure --prefix=/home/langelab/seanrjohnson/local).
(SAGE is a complicated package that I think actually builds most of its
environment from scratch). Compiling SAGE worked, so now I just have to
point LSC to the SAGE bin directory
(/home/langelab/seanrjohnson/sage-6.1.1/local/bin) for its python
executable, and I should be good to go. Not that simple it seems, after a
bit of finagling with environment variables, I finally did get it
working:
add the following to .bash_profile

(update: turns out messing with LD_LIBRARY_PATH may not be such a great idea)

Running
LSC gave me the error "bowtie2-build: /lib64/libc.so.6: version
`GLIBC_2.14' not found (required by bowtie2-build)" which I fixed by
recompiling bowtie2 on the hpc (I think at first I had just copied the
binaries that I had compiled locally which apparently was a bad idea).
Now bowtie2-build worked, but bowtie2 itself gave the error
"GetOptionsFromArray" is not exported by the Getopt::Long module" Which I
found out is because the version of Perl on the hpc is old (5.8.8, whereas it must be >= 5.10
for that option to work). So... now I have to compile the newest
version of Perl from source on the hpc: honestly not so bad, just
followed the instructions here
(pointing it to a directory in my home folder of course), it took
forever to compile and test, but it worked. After all that, the LSC test
data ran, so I submitted my job to the scheduler: kept the default
settings in run.cfg except for pointing it to the right files, and
setting it to use 32 cores. On the high RAM node, each of the 31
bowtie2-align-s jobs it ran took about 10 gb of RAM. It ended up taking
longer than the 24 hours I scheduled, so I tried again with 6 days of
computer time, and that failed too. Maybe I should try again with 20
days?

4. Never tried this.

5. Never tried this either.

6. First I made sure Celera Assembler (8.1) was on the system and working. I then converted my Illumina reads into frg format.

This run didn't do quite what I expected. For one thing, it
didn't seem to recognize that the reads are paired-end, as it reported:
"ReadsWithNoMate 51342111(100.00%)"
But, that doesn't matter because I'm just after unitigs, which it seems to have done just fine.

I
found out why it didn't realize they were paired end, I should have run
fastqtoca with the "mates" parameter instead of the "reads" parameter. I
don't actually know for sure what the insert size is, but based on
FLASH and the CLC-Bio output, I can guess it is something like 150 +/-
150 or so (although it has a long right tail, and its left tail cuts off
abruptly at 101):

Turns out, for illumina reads longer than 160 bp, you need to use
the illumina-long technology specifier. So I reran fastq-CA for the
FLASH merged reads, using "-technology illumina-long", and "-libraryname
s_5_seqsMergedFlashed" to get rid of that second error which came about
because I used the same library name in both frg files.

Fixing that and rerunning. asm file is the same as above except that I added the line:
stopAfter=unitigger
so that it won't take so much time.
After
a complete run, you'll get a utg.fasta file with the unitigs. When you
stop it early, it doesn't get that far, so you've got to extract the
unitigs yourself. I believe the command to do this is:
tigStore -g flash_s_5_seqsMerged.gkpStore -t flash_s_5_seqsMerged.tigStore 8.1 -U -d consensus > sequences.utg.fasta
Nope...
that's definitely not it. There must be some redundant unitigs in
there, because the that file is way too big, and has a total length of
1.9 gigabases.

Running to completion (luckily runCA is good about starting back up where it left off).

Interestingly,
the entire assembly using the FLASH merged reads, and using raw reads
as pairs came out much worse than the assembly with paired end reads
assembled as though they were single end reads (at least by the general
statistics, perhaps there is less misassmbly). Also it is notable that I changed two variables here, I used FLASH merged reads, and I also
had it treat paired end reads as paired end reads rather than single
reads. Maybe it was the second change, not the first that resulted in
the worse assembly.

Step 8 of the
ectools tutorial is written for some kind of grid computing system that
I don't have, so I rewrote steps 8 and 9 to use GNU parallel instead.

#! /bin/bash
export TMPDIR=/media/langelab/data/longifolia_genome/celera_assembly/longifolia_correct/temp_dir #be sure to make this directory

I tried running that on my local computer, but it would have taken way too long, so I
modified it once again to run on the campus HPC that uses the Torque
scheduler (hint: if running multiple directories at the same time be
sure that their temp files are mapped to different folders, otherwise
there will be collisions as the process correcting one folder overwrites
the temp files for the process correcting a different folder). I used
the unitigs from my first Celera assembly of the illumina data as the
basis for the correction (because it seemed to be the highest quality
set of unitigs I have. I wonder if unitigs from an assembler other than Celera would work very well. Perhaps an entire Minia assembly would work nicely as unitigs.) Happily, it finished, with a total run time of about 6 days (Celera assembly of short-reads plus
correction of long reads) which is way shorter than the
run time of PacBioToCA was (before I killed it out of hopelessness). Unfortunately, we're down to about 3.5x PacBio coverage (from the 6x of the input).

The readme says that Celera with default settings should work pretty well, so I'll give that a shot (the only thing I'm worried about is that it will take forever because it won't be set to use lots of cores).

Assembly finished in just a couple days. According to the statistics, there are 30198 scaffolds, representing 92.4 megabases of sequence. N50=3168. Not great, but pretty good compared to what I've been seeing with the other assemblies.

To see if I could recover some of the data lost in the correction of the PacBio reads, I ran PBjelly to try to join the assembled scaffolds into bigger scaffolds.

8. MaSurCA using corrected PacBio reads, and flashed illumina reads:
Use default parameters, try running it on my desktop, hope it doesn't use more than 48 gb of RAM. MaSurCA got to the stage of assembly where it ran the Celera Assembler, it crashed, I believe because it can't actually handle PacBio reads, according to the paper:
"We note that the modified version of CABOG 6.1 used in MaSuRCA is not capable of supporting the long high-error-rate reads
generated by the PacBio technology."
I was hoping it could handle corrected PacBio reads, but apparently it can't do that either. It kept complaining about sequence length being longer than 2047, and eventually crashed, presumably for that reason, but perhaps for some other reason.

To try to get around that, I chunked my PacBio corrected reads into overlapping chunks of length 2047. This got much further, but eventually it also crashed at runCA step 3, with the error: " at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 1121
main::caFailure('109 overlap correction jobs failed; remove /media/langelab/da...', undef) called at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 3019
main::overlapCorrection() called at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 5343"
in the files "CA/3-overlapcorrection/*.err" I get the following:
"/home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/CA/Linux-amd64/bin/correct-olaps: AS_configure()-- AS_CGW_ERROR_RATE set to 0.25
Quality Threshold = 1.50%
Starting Read_Frags ()
Starting Correct_Frags ()
Starting Read_Olaps ()
Starting qsort ()
Starting Redo_Olaps ()
Aborted (core dumped)"

Tried again with just a subset of the paired end reads (the ones that FLASH could not merge) and no single end reads (that is: no FLASH merged reads, and no PacBio reads). This didn't work either. I give up on MaSuRCA...

Velvet is also capable of hybrid assemblies. It uses the de-brujin graph strategy rather than the overlap strategy, so it probably doesn't handle long reads as well as MaSuRCA, but it's worth a shot: