Evolutionary Dynamics Across Species Ranges

Tag Archives: software

As part of my postdoc with Gideon Bradburd, I’m using his new software package conStruct (bioRxiv; GitHub) to analyze dozens of genomic datasets. conStruct requires three input files: 1) genetic data, 2) coordinate data (longitude in the first column, latitude in the second), and 3) a pairwise distance matrix with the same number of sites as in the coordinate data. Files 2 and 3 are straight forward; but it took me a little time to be able to go from a regular STRUCTURE file to a conStruct file. So below is the R code I’m using to do this conversion.

Now if your data is not already in STRUCTURE two row format (i.e. two rows per sample), then you’ll need to get there as a starting place. I used PGDSpider to make the STRUCTURE files WITH a header row and a column to denote sampling site.
(PLINK, bless its heart, makes 1 row 1 column STRUCTURE files, and I’m not coding out of that.) Remember you want to denote sampling sites for conStruct and not putative populations. I then replaced the two blank headers for columns one and two with “SampleID” and “PopID.”

conStruct can take data as counts or frequencies. The code below makes a table of frequencies for one allele (doesn’t matter major or minor, derived or ancestral) for each sampling site for each locus.

I have written this as a loop to process multiple input files at once. You can remove the for loop and start at “str <- read.table()” if you only have one file.

As I noted in the code, my friend Jake Burkhart wrote the internal for loop that makes the frequency table. He originally wrote the loop to make pseudo-SNP datasets out of microsatellite data. Which means, if you want to run conStruct on a microsatellite dataset, you can print all of the loci (instead of just one of the biallelic SNPs), then keep processing the frequencies at each sampling site. Note, conStruct will throw an error if there are fewer loci than samples, which shows up more readily when using pseudo-SNP data from (even highly polymorphic) microsatellites.

I want to use BayesAss on a large SNP dataset generated with RADseq. But I found out when I went to convert the data into the .immaq format that my favorite converter, PGDspider, would only convert the first 40 loci. I didn’t get 10,000s of loci for nothing, so that wasn’t going to work. But a second problem was that BayesAss 3.0.3 only allows 240 SNPs anyways.

Obviously I’m not the only person with this problem. And thanks to Steve Mussmann there’s a solution. Steve re-wrote BayesAss 3.0.4 to be able to handle large SNP datasets, as well as a program to convert the STRUCTURE files from pyRAD into .immaq input files for BayesAss.

Since I have STRUCTURE files from STACKS and not pyRAD, I had to do a little conversion. My messy code is below, but leave a comment if you are a wiz with an elegant solution.

Turning STRUCTURE Output from STACKS into STRUCTURE Output from pyRAD
First, output data from STACKS in STRUCTURE format (.str). Remove the first two rows from this output (STACKS header and loci identifiers). Then, print the first column (sample names), insert five empty columns to match pyRAD. Do not print the second column from the STACKS STRUCTURE output because that is the population code from your Population Map input into STACKS.

awk '{print $1 "\t" "\t" "\t" "\t" "\t" "\t"}' data.str > test.out

Next, print out the remainder of your data starting at column 3 (i.e. first locus) in the original dataset (awk code from here). Then use paste to concatenate the two files into the .str file you will convert.

Convert .str to .immac Using a Custom PERL Script
If you already had data from pyRAD and did not have to do the steps above, you can just convert your .str file to an .immac using Steve’s script: str2immaq.pl. However, if you have data from STACKS then you need to modify lines 39-52 in the original script to the following:

Since STACKS exports missing data as 0 (whereas pyRAD exports missing data as -9), this change removes converting missing data from -9 to 0. Steve also wrote this change. Save the change in the perl script, then use the script to convert data from .str to .immac. Now you’ve got an input file for BayesAss.