To run these scripts, you will need to have installed GABI and the following
software (note: matching the version numbers may not be critical, but we
record them here to document what we have successfully tested):

If you would like to regenerate the 2000 read-pair subset, run the
00-subset-data.sh script. This will download a very large dataset from
Illumina (two 3.5GB files) and run it through the filter_illumina tool from
BioLite. It will also download the reference assembly NC_001422 from the
NCBI Nucleotide Database (also included in the repo as
phix-reference-NC_001422.fa). Alternatively, you can skip this step and use
the subset.*.fq.gz files that we have already generated and included in this
directory.
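As a quick sanity check on the subset files, you can count their reads; this is a minimal sketch (a FASTQ record is 4 lines, and the specific file name passed in is up to you, e.g. one of the subset.*.fq.gz files):

```shell
#!/bin/sh
# Count the reads in a gzipped FASTQ file (4 lines per record).
count_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# For a 2000 read-pair subset, each mate file should report 2000 reads.
```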

The 01-assemble.sh script will construct the assembly graph and run Velvet and
SGA on the subset. The assembly graph output should have two clusters
(cluster0.graphml and cluster1.graphml) which are reverse complements of
each other, and should each have 110 nodes and 148 edges. The script will also
generate figures (compare?.pdf) showing which edges in the assembly graph are
also present in the Velvet and SGA assemblies and the NCBI reference sequence.
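To verify the cluster sizes stated above, you can count the node and edge elements directly in the GraphML output; this is a rough sketch that assumes each element sits on its own line, as GraphML writers typically emit:

```shell
#!/bin/sh
# Count <node> and <edge> elements in a GraphML file.
graphml_counts() {
  printf '%s nodes, %s edges\n' \
    "$(grep -c '<node ' "$1")" \
    "$(grep -c '<edge ' "$1")"
}

# cluster0.graphml and cluster1.graphml should each give: 110 nodes, 148 edges
```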

The 02[a-c]-sample.sh scripts are the most computationally intensive and
run the sampler as three independent chains. They include directives for
the SLURM cluster management system; you may need to adapt these to your own
cluster environment. By default, we use cluster0.graphml from above and run
20,000 sample iterations, with an estimated runtime of ~7 hours and peak memory
usage of 10GB for each chain, each running on an 8-core Intel Xeon E5540
(2.53 GHz) node. If you want to run a quick test, you could reduce the
iterations to 1,000, although this will probably fail to converge. On our
cluster, we would launch the three scripts with:
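On a SLURM cluster, the usual pattern is to submit each script with sbatch, which reads the #SBATCH directives embedded in the script. This is an illustrative sketch, not the exact command used; the echo makes it a dry run:

```shell
#!/bin/sh
# Dry-run SLURM submission of the three sampling chains.
# Drop the echo to actually submit.
submit_chains() {
  for chain in a b c; do
    echo sbatch "02${chain}-sample.sh"
  done
}
```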

Once the sampling has finished, use the 03-report.sh script to generate a
report in the directory report-X, where X is the current Unix timestamp. You
can compare this report against the sample one provided in the
report-1399918695 directory. Figures 3 and 5a-b in the paper are from this
report.
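Since the directory suffix is a Unix timestamp, you can recover when any report was generated; a small sketch assuming GNU date (the -d option is not portable to BSD/macOS date):

```shell
#!/bin/sh
# Convert a report-<timestamp> directory name back to a UTC date (GNU date).
report_date() {
  date -u -d "@${1#report-}" +'%Y-%m-%d %H:%M:%S UTC'
}

# e.g. report_date report-1399918695 shows when the sample report was made.
```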

The remaining scripts 04 through 08 run a study of different priors, and are
also computationally intensive, with an estimated runtime of ~7 hours and peak
memory usage of 10GB for each run. We would submit these on our cluster with:
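As with the sampling chains, the exact submission command is scheduler-specific; a SLURM-style sketch over scripts 04 through 08 (the glob is an assumption about the script names, and the echo makes this a dry run):

```shell
#!/bin/sh
# Dry-run SLURM submission for the prior-study scripts (04 through 08).
# Note: if no script matches, the glob is passed through unexpanded.
submit_prior_study() {
  for s in 0[45678]-*.sh; do
    echo sbatch "$s"
  done
}
```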