gSSR Processing

For the primary purpose of discovering SSRs, we performed low coverage sequencing of ten hardwood tree species. DNA from each tree was extracted and turned into an individually barcoded library from ten hardwood trees. The ten samples were run on a single lane of an Illumina HiSeq. The reads are 101bp paired ends, from libraries with an average fragment size of 180 bases. This should yield an overlap of about 20 bases between the forward and reverse reads, thus enabling reconstruction of the entire 180 base fragment. To discover high quality SSRs, a new processing pipeline was created, comprised of publicly available software and custom perl scripts. The perl scripts are available through a GitHub repository and carry a GPL2 license.

Neither read was reverse complemented from the original file, and we would normally expect to have to reverse complement the reverse read to find the overlap. If you choose to filter these out, a perl script was written to do just that:

2. FLASH

3. Find SSRs pre-assembly

Details of usage and caveats can be found in the comments in the header of the script, or by running it without parameters on the command line. This perl script will identify sequences that meet the following requirements:

A repeat motif meeting one of the following patterns:

2 bp motifs repeated from 8 to 40 times

3 bp motifs repeated from 7 to 30 times

4 bp motifs repeated from 6 to 20 times

No part of the repeat falls within 15 bases of either end of the sequence

This strategy was designed to allow for primers design. If your reads differ in length, you may wish to change these parameters in the script. This script will filter sequences with two SSRs (adjacent or not) into a separate file to allow the user to decide whether or not to use them. They can be polymorphic, but are not preferred. Finally tip, for using this pipeline, use the -a option and output fasta files for fastq files in order to preceed to step 5.

4. Assembly

The goal here is to collapse identical SSRs into a single read. Many assemblers would work, but I recommend using a high degree of sequence similarity during assembly to collapse only truly identical loci. For this project we used CAP3 with the parameter -p 95.

To produce a set of unique loci, the contigs and singlets are combined into a single fasta file as input to step 7.

5. Mask low complexity regions

Masking is used to create higher quality primers that are more likely to be unique in the genome. I recommend dustmasker from NCBI, and for this project we used level 1.

6. Find ssrs post-assembly and design primers

This script requires the user to specify a path to the primer3 executable and to the primer3 config file at the top of the script.

As before, details of usage and caveats can be found in the comments in the header of the script, or by running it without parameters on the command line. This perl script will identify sequences that meet the same criteria set out in step 4. In addition, it uses Primer3 to design primers around each SSR. The primer3 parameters can be modified in the global variable section at the top of the script, but currently look like this:

Primer optimum size = 20

Primer minimum size = 18

Primer maximum size = 25

Number of Ns acceptable in primer = 0

Product size range = 100-200

Primer optimum Tm = 60.0

Primer minimum Tm = 55.0

Primer maximum Tm = 65.0

Primer minimum GC content = 40

Primer minimum GC content = 60

Primer max Poly X = 3

Primer GC clamp = 2

The script will produce extensive output for further examination. Please see script header for details.