Friday, 8 March 2013

Closing gaps in assemblies

I've learnt from my colleagues at Sanger how to use a couple of programs that close gaps in assemblies, including Gapfiller and IMAGE. Both are simple to run:

Gapfiller
To run Gapfiller, you need to provide a 'lib.txt' file containing the parameters, eg. this line (just one line):
lib bowtie SRR022868_1.fastq SRR022868_2.fastq 3594 411 FR
This means that the reads are in the fastq files SRR022868_1.fastq and SRR022868_2.fastq (for the forward and reverse reads respectively), that the mean insert size for the library sequenced is 3594 bp, and that the standard deviation is 411 bp. 'FR' means that each read-pair consists of a forward read and a reverse read.

To run Gapfiller you can type:

% /software/pathogen/external/apps/usr/local/GapFiller_v1-10_linux-x86_64/GapFiller.pl -l lib.txt -s assembly.fa
where /software/pathogen/external/apps/usr/local/GapFiller_v1-10_linux-x86_64/GapFiller.pl is the path to where Gapfiller is installed, lib.txt is the lib.txt file described above, and assembly.fa is the assembly file.

Gapfiller closes gaps in assemblies, but requires that the gaps to be closed have roughly the correct size, ie. the correct number of Ns in the assembly fasta file (or else it won't be able to close them).
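Since the gaps here are just runs of Ns in the fasta file, it can be handy to count them before and after gap closing. Below is a minimal sketch in Python of how I might do that. The function names gap_lengths and gaps_per_scaffold are my own (they are not part of Gapfiller), and I'm assuming that every run of Ns (upper or lower case) in a sequence is one gap.

```python
import re
from typing import Dict, Iterable, List


def gap_lengths(sequence: str) -> List[int]:
    """Return the length of every run of Ns (a gap) in one sequence."""
    return [len(match.group()) for match in re.finditer(r"[Nn]+", sequence)]


def gaps_per_scaffold(fasta_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each scaffold name to its gap lengths, given the lines of a fasta file."""
    sequences: Dict[str, List[str]] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header line as the scaffold name
            fields = line[1:].split()
            name = fields[0] if fields else ""
            sequences[name] = []
        elif line and name is not None:
            sequences[name].append(line)
    # Join the wrapped sequence lines first, so that a gap that spans
    # a line break is counted once rather than twice
    return {name: gap_lengths("".join(chunks)) for name, chunks in sequences.items()}
```

For example, calling gaps_per_scaffold(open("assembly.fa")) on the assembly before running Gapfiller, and again on the gap-closed assembly afterwards, shows which gaps were closed or shrunk.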

I find that Gapfiller needs quite a lot of memory, eg. 2 Gbyte.

IMAGE
IMAGE was written by my former colleague Jason Tsai at Sanger.

IMAGE closes gaps in assemblies, but unlike Gapfiller, it doesn't require that the gap is roughly the correct size (number of Ns) already.
This is the description of IMAGE in the PAGIT paper:

"IMAGE is an approach that uses Illumina paired-end reads to extend contigs and close gaps within the scaffolds of a genome assembly. It functions in an iterative manner: at each step it identifies pairs of short reads such that one of the pair maps to a contig end, whereas the other hangs into a gap. It then performs local assemblies using these mapped reads, thus extending the contig ends and creating small contig islands in the gaps. The process is repeated until contiguous sequence closes the gaps, or until there are no more mapping read pairs. IMAGE is able to close gaps using exactly the same data set that was used in the original assembly"
and: "Depending on the repetitive nature of the genome, assembly quality and the coverage depth of the paired-end reads used by IMAGE, up to 50% of gaps can be closed. When using Illumina data, IMAGE can only run with paired reads with inserts of a few hundred base pairs."

To run IMAGE, you can type:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image.pl -scaffolds assembly.fa -dir_prefix ite -automode 1 -prefix SRR022868
where /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image.pl is the path to where you have installed IMAGE, assembly.fa is your assembly fasta file, and '-prefix SRR022868' tells IMAGE that the fastq files are called SRR022868_1.fastq and SRR022868_2.fastq.

I found that IMAGE needs about 2 Gbyte of memory to run.

When IMAGE has finished running, you need to run the following command in the directory where you ran IMAGE:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image_run_summary.pl ite

Then in the final 'ite' directory (eg. 'ite20', for the default of 20 iterations of IMAGE), you run:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/contigs2scaffolds.pl new.fa new.read.placed 200 500 scaffolds
The '500' in this command means that scaffolds of <500 bp are discarded.
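
If you don't want to hard-code the name of the final 'ite' directory, here is a small sketch in Python for picking it out from a directory listing. The function name final_iteration_dir is my own (it is not part of IMAGE), and I'm assuming that the iteration directories are named with the '-dir_prefix' value followed by the iteration number, as in 'ite20' above.

```python
import re
from typing import Iterable, Optional


def final_iteration_dir(names: Iterable[str], prefix: str = "ite") -> Optional[str]:
    """Return the name made of prefix + number that has the highest number."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    best_number, best_name = -1, None
    for name in names:
        # Only accept names that are exactly the prefix followed by digits
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) > best_number:
            best_number, best_name = int(match.group(1)), name
    return best_name
```

For example, final_iteration_dir(os.listdir(".")) (after 'import os') in the directory where you ran IMAGE. The numbers are compared as integers rather than as text, because an alphabetical sort would put 'ite9' after 'ite20'.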