Tuesday, August 16, 2011

Do-It-Yourself Dodecad v 2.0

There are many new features in DIYDodecad 2.0, including by-chromosome and by-segment ancestry analysis, and a visualization tool that can be used in conjunction with it. Simple admixture analysis as in version 1.0 is, of course, also included.

You can download the RAR archive from Google Docs (File->Download original) or Sendspace. DIYDodecad runs on Windows and on 32- and 64-bit Linux.

Here is chromosome 3 of a project participant for 3 components of interest.

A different region on Chromosome 4 on a different set of components:

Bug reports/suggestions for improvement are always welcome at dodecad@gmail.com

NOTE: In Windows, the result files and the dv3.par file will not look good if you open them with Notepad, because Linux and Windows represent new lines differently. So, you should view and edit these files using another text editor (e.g., Wordpad, Textpad, Word), but make sure you always save them as plain text (.txt) files.
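The newline difference can also be handled with a short script instead of a different editor. What follows is a minimal sketch, not part of DIYDodecad: the function names and the output file name are made up for illustration. It writes a copy with Windows-style line endings (CR LF) so that Notepad displays it properly. Whether DIYDodecad itself accepts a dv3.par with Windows line endings is not stated here, so the sketch converts into a separate copy meant for viewing only.

```python
def to_windows_newlines(data: bytes) -> bytes:
    """Return data with every line ending normalised to CR LF."""
    # Collapse existing CR LF pairs first so we never produce CR CR LF.
    unix = data.replace(b"\r\n", b"\n")
    return unix.replace(b"\n", b"\r\n")


def convert_file(src: str, dst: str) -> None:
    """Write a CR LF copy of src to dst, leaving src untouched."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(to_windows_newlines(data))


if __name__ == "__main__":
    # Hypothetical usage: make a Notepad-friendly copy of the parameter file.
    convert_file("dv3.par", "dv3_view.txt")
```

Working on bytes rather than text avoids any guess about the file's character encoding.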

I'd love to try this out, but can't download the DIYDodecadWin.exe file. I get a "404 page not found" error. I tried saving just that file to Google Docs, and got an "error, please try again later" message.

Thank you once again for wonderful work. I have two questions, though. Are the genomewide results from your v3 spreadsheet better than the ones we get from the DIY? (Also, ought the genomewide membership scores from both DIYs be the same?)

This is really a step forward for the resolution and precision of personal genome analysis. And it is even easy to use. Dienekes continues to amaze us all, I'm sure.

My own results came up as predominantly Western European on all chromosomes, but with some variation. It would be nice to compare the results with other project participants, but I wonder how that could be realised.

Are the genomewide results from your v3 spreadsheet better than the ones we get from the DIY?

The underlying model is the same; it's only the termination condition that is different. I would say that for comparing oneself with the reference populations the spreadsheet values are the best, because they've been calculated in exactly the same way as the reference populations, but the differences are really minuscule.

(Also, ought the genomewide membership scores from both DIYs be the same?)

In theory, no, as 2.0 uses a different termination condition than 1.0. Empirically I've run 2.0 on about a dozen participants and I've yet to see a case where the difference between the two is more than 0.01-0.02%. For the same number of iterations both produce the same answer, although 2.0 may do a different number of iterations than 1.0 because of the different termination criteria.
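To illustrate why two stopping rules give the same answer whenever they happen to run the same number of iterations, here is a toy sketch. It is not DIYDodecad's code, and the post does not say which update rule or which termination criteria either version uses. The sketch assumes a standard EM update for admixture proportions with the component allele frequencies held fixed; the frequencies, the genotype, and all function names are invented for illustration.

```python
def em_step(q, freqs, genotype):
    """One EM update of admixture proportions q (allele frequencies fixed).

    freqs[m][j] is the frequency of the counted allele at SNP j in component m;
    genotype[j] is the number of copies (0, 1 or 2) carried by the individual.
    """
    k = len(q)
    new_q = [0.0] * k
    for j, g in enumerate(genotype):
        p_mix = sum(q[m] * freqs[m][j] for m in range(k))
        for m in range(k):
            # Share of the g counted alleles attributed to component m ...
            new_q[m] += g * q[m] * freqs[m][j] / p_mix
            # ... and share of the (2 - g) other alleles.
            new_q[m] += (2 - g) * q[m] * (1 - freqs[m][j]) / (1 - p_mix)
    total = 2 * len(genotype)
    return [v / total for v in new_q]


def run_fixed(q, freqs, genotype, n_iter):
    """Stopping rule A: a fixed number of iterations."""
    for _ in range(n_iter):
        q = em_step(q, freqs, genotype)
    return q


def run_until_converged(q, freqs, genotype, tol=1e-6, max_iter=10000):
    """Stopping rule B: stop once no proportion changes by more than tol."""
    for i in range(1, max_iter + 1):
        new_q = em_step(q, freqs, genotype)
        change = max(abs(a - b) for a, b in zip(new_q, q))
        q = new_q
        if change < tol:
            break
    return q, i


# Toy data: 2 components, 5 SNPs (made-up numbers).
FREQS = [[0.9, 0.8, 0.1, 0.2, 0.7],
         [0.1, 0.3, 0.9, 0.6, 0.2]]
GENOTYPE = [2, 1, 0, 1, 2]
START = [0.5, 0.5]
```

Both rules apply the same update, so `run_fixed(START, FREQS, GENOTYPE, n)` matches the result of `run_until_converged` exactly when `n` equals the iteration count the latter reports; the results differ only when the two rules stop at different points.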

In the future it might be possible to create markers/types from one strand/side of the chromosome only, so that all 46 are treated as the individual packs they are.

That the SNPs are used to make types/markers in a two-dimensional way instead of a one-dimensional one, as explained, will make it more difficult to tell small variation from noise. It's probably a tedious job, but it might be possible to put all the single strands into a PCA as individual variables, though it would require the full sequence of several genomes.

Thank you for making this available. So far, I've analyzed my DNA using genomewide, bychr, and byseg modes, and it's been really interesting to see the results. However, after I used paint_byseg on my first chromosome, the data that was in byseg.txt was lost; now it's just some incomplete data for my first chromosome. I'm not sure what I did wrong.

However, after I used paint_byseg on my first chromosome, the data that was in byseg.txt was lost; now it's just some incomplete data for my first chromosome. I'm not sure what I did wrong.

paint_byseg doesn't change the byseg.txt file. You probably started DIYDodecad in byseg mode, which started overwriting the byseg.txt file. If you want to keep the byseg.txt file and try different parameters for byseg mode (different window size/advance step), rename byseg.txt to something else, because the file is overwritten every time the program runs in byseg mode.
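The renaming step is easy to forget, so it can be scripted. The helper below is a minimal sketch and not part of DIYDodecad: the function name and the timestamped naming scheme are made up. It renames an existing byseg.txt before the next byseg run so that earlier results are kept.

```python
import os
import time


def backup_output(path="byseg.txt"):
    """Rename an existing output file so the next run cannot overwrite it.

    Returns the new file name, or None if there was nothing to rename.
    """
    if not os.path.exists(path):
        return None
    root, ext = os.path.splitext(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    candidate = f"{root}_{stamp}{ext}"
    n = 1
    # Avoid clobbering an earlier backup made within the same second.
    while os.path.exists(candidate):
        candidate = f"{root}_{stamp}_{n}{ext}"
        n += 1
    os.rename(path, candidate)
    return candidate
```

Calling `backup_output()` in the working directory just before starting DIYDodecad in byseg mode leaves a file such as `byseg_20110816-120000.txt` behind, which can be renamed back to byseg.txt later if it needs to be painted again.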

You probably messed up with dv3.par in some way. Replace it with the fresh dv3.par from the RAR and edit it as per the instructions by replacing "genomewide" with what is necessary for the different modes. Don't use Notepad to edit.
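Switching modes can also be done without opening an editor at all. The sketch below assumes only what is described above: that the fresh dv3.par contains the literal word "genomewide" exactly once and that the other modes are selected by replacing it (for example with "bychr" or "byseg"). The function names are hypothetical, and the layout of the rest of the file is not assumed. Line endings are preserved as they are.

```python
import re


def set_mode(par_text, new_mode, old_mode="genomewide"):
    """Return par_text with the single occurrence of old_mode replaced."""
    pattern = r"\b" + re.escape(old_mode) + r"\b"
    # A function replacement keeps new_mode literal (no backslash escapes).
    new_text, count = re.subn(pattern, lambda m: new_mode, par_text)
    if count != 1:
        raise ValueError(
            f"expected exactly one '{old_mode}' in the file, found {count}"
        )
    return new_text


def edit_par_file(path, new_mode, old_mode="genomewide"):
    """Rewrite the parameter file in place, keeping its newline style."""
    # newline="" stops Python from translating line endings on read or write.
    with open(path, newline="") as f:
        text = f.read()
    with open(path, "w", newline="") as f:
        f.write(set_mode(text, new_mode, old_mode))
```

The function refuses to write anything if the word is missing or appears more than once, which is usually a sign that the file has already been edited and should be replaced with the fresh copy from the RAR.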

At line 142 of file DIYDodecad.f90 (Unit 50 "genotype.txt")
Traceback: not available, compile with -ftrace=frame or -ftrace=full
Fortran runtime error: End of file
Warning message:
running command 'DIYDodecadWin dv3.par' had status 2

The problem is that Dodecad 2.0 doesn't work with V2 files. V2 is processed OK only in Dodecad V 1.0. V 1.0 doesn't yield correct results with the V2 chip, though, insisting on 25.09% Neo_African and 19.62% Palaeo_African ancestry in a person with no African ancestry at all and zero Africa in Ancestry Painting. Again, this applies only to V2; with Illumina FTDNA and V3 the result is next to perfect even when No. of iterations = 1.

I meant, 1D-1. It says, "1 total iterations". I did everything as in the Readme several times, even reinstalled R. Nothing. It does standardize the V2 file OK, and the file is processed OK afterwards with V 1.0. I get that same message every time I try to process it in V 2.0. Changing the files in the working directory to V 1.0 solves the problem instantly.

Finally, you can get that same message even with a V3 Illumina raw data file by deleting any line at the beginning of the genotype.txt file. Deleting lines at the end of it does nothing; the file is still readable.

Hi Dienekes, How do your population groups correspond to the old physical anthropological groups in Europe of Nordic, Mediterranean, East Baltic, Alpine and Dinaric? Is West European close to Nordic? Med is Med. East Baltic and Alpine is East European? Not sure where to venture with Dinaric? I'm guessing that West Asian would correspond to the Hither Asiatic? Thanks Again, Jonathan

I don't see a particularly strong correspondence. Alpine and Dinaric do not appear to correspond to anything that comes out of ADMIXTURE runs. There are resemblances between some of the other components (e.g., Sardinia with Mediterranean, Balto-Slavs with Deniker's Eastern race, or Arabs with Orientalids), but I wouldn't push them too far. At present we know almost nothing about how genes affect craniofacial morphology.

I don't know as to whether you'll cater to this request, but I was wondering whether you could create a Do-it-yourself calculator using the components of the initial K=10 analyses that you used to carry out for participants. That would be very, very useful.

Regarding what Jon Dibble said: maybe it will get more precise when we use more SNPs. This analysis is actually done with 166,462 now, judging by the *.F files (wc -l *.F). Am I right, Dienekes? Is there a possibility to work with more SNPs? Maybe use the available 23andMe data from all over the world?

When one combines all available Illumina datasets and performs LD-based pruning one is left with something like 100-200k SNPs, depending on how aggressive their pruning and genotype rate filter is. More SNPs are useless for ADMIXTURE-type analyses.
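For readers unfamiliar with LD-based pruning, here is a simplified sketch of the idea. It is a greedy variant written for illustration, not the procedure used to build the Dodecad calculators (the exact pruning tool and settings are not given here); the function names, window size, and threshold are assumptions. A SNP is kept only if its squared correlation with every nearby SNP already kept stays at or below a threshold, so a lower threshold means more aggressive pruning and fewer SNPs.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, r2_max=0.2):
    """Greedy LD pruning.

    genotypes: list of SNPs in genomic order, each a list with one
    0/1/2 value per individual. Returns the indices of the SNPs kept.
    """
    kept = []
    for i, snp in enumerate(genotypes):
        # Only compare against kept SNPs that lie within the window.
        nearby = [k for k in kept if i - k <= window]
        if all(r_squared(snp, genotypes[k]) <= r2_max for k in nearby):
            kept.append(i)
    return kept


# Toy data: 3 SNPs typed in 6 individuals (made-up values).
TOY = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # perfect copy of the first SNP, so it is pruned
    [1, 1, 0, 2, 1, 1],  # r^2 = 0.125 with the first SNP
]
```

With the toy data, `ld_prune(TOY)` keeps the first and third SNPs, while the stricter `ld_prune(TOY, r2_max=0.1)` keeps only the first.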

I have run the genomewide analysis, and now I would like to run the bychr analysis (as well as use the visualization with the paint_bychr). From the readme file, I've figured out that I would need to modify the dv3.par file to say "bychr" instead of "genomewide." However, I wasn't sure what program to use to open and modify this .par file. Do you have a suggestion? Also, will modifying this file and running the "dv3" analysis with the new file work, or would I need to create a new calculator, as you have outlined here: http://dodecad.blogspot.com/2011/08/how-to-make-your-own-calculator-for.html ? That may be a bit out of my reach right now... Thanks for your help!

You can use Wordpad/Word or any other editor to edit the .par file. And no, you don't need to worry about making your own calculator. Also, you should probably get version 2.1 of DIYDodecad, if you haven't already; the link is at the top of this post.

Cool thanks Dienekes, at last I got it installed and played some with params in anticipation of some of the potentially more interesting genomes in the family (mine doesn't really have any of those captivating ancestry legends but it's the first trial set I got)

One surprise is that genotype.txt has data on X-chromosomal SNPs but the output of the program doesn't use them? For males at least, X chromosome segments would come from a narrowed subset of ancestors, and it would have been an interesting reality check. Specifically in my case, my mggm was a Pomor, who formed a distinct ethno-cultural subgroup of the Russians in the Far North, strongly influenced by their Finnic neighbors. So I was curious about Finnish segments in the autosomes (and found some with MGLP pars, if one can put much faith in its setup), but even more so, on the X (alas, no analysis there!)
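The "narrowed subset of ancestors" can be made concrete with a short calculation. A male receives his X only from his mother, while a female receives one X from each parent, so the number of genealogical ancestors who could have contributed X-chromosome DNA grows much more slowly than the full 2, 4, 8, ... The function name below is made up for illustration.

```python
def x_ancestors(generations, sex="male"):
    """Count possible X-contributing ancestors in each generation back.

    Element 0 of the result is the parents' generation, element 1 the
    grandparents', and so on.
    """
    males, females = (1, 0) if sex == "male" else (0, 1)
    counts = []
    for _ in range(generations):
        # Fathers come only from females (a male's X is purely maternal);
        # mothers come from both males and females of the current generation.
        males, females = females, males + females
        counts.append(males + females)
    return counts
```

For a male, `x_ancestors(4)` gives `[1, 2, 3, 5]`: only 3 of the 8 great-grandparents and 5 of the 16 great-great-grandparents can appear on the X, which is why X segments would make a useful cross-check on a specific maternal-side ancestor.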

BTW my would-be Finnish segments are in strong anti-correlation with the Ashkenazi Jewish segments from 23andMe's ancestry_finder spreadsheet. Which is intuitively very much as expected. But it brings another question: why can't we use AJ as a reference population in DIYDodecad? Is it because there is too little public data?

genotype.txt is just converting the company files into a common format that can be understood by DIYDodecad. The SNPs used can be found in the .alleles file of each calculator, and do not include X chromosome SNPs.

>> why can't we use AJ as a reference population in DIYDodecad?

The DIYDodecad calculators are based on the components that emerge from admixture analysis over public/Dodecad datasets, including 2 Ashkenazi Jewish ones. Ashkenazi-specific clusters do emerge at higher K numbers; these may be interesting for genealogists, but not for prehistorians, because the Ashkenazi Jewish cluster reflects the relatively recent isolation followed by expansion of AJ communities (in, say, the last 1-2 millennia).

Yes, I understand that each calculator uses a *sub*set of SNPs, and that the X-chromosomal SNPs aren't included, but it still surprises me that they didn't provoke more curiosity. Of course just forcing them into existing autosomal models may be a somewhat flawed way of analysis, for example because of zygosity considerations, or because the expected number of recombinations depends on the number of females in the lineage ... but intuitively, I'd think that one can see many interesting things just by feeding X-SNPs directly into the computational kitchen of Dodecad.

It's like in 20th-century genetics: feeding genes of unknown partial penetrance or partial codominance into classic linkage calculations may have been a flawed way of analysis, yet it was still useful.

Very interesting thought about the dichotomy of "genealogists vs. prehistorians". Of course the original Dodecad was a prehistory-centered Cavalli-Sforzian quest for the primordial components of large (and largely drift-less) populations, but what is the real audience of the DIY tool? Aren't the most nontrivial DIYDodecad results expected right in the "time gap" between classic genealogy and deep prehistory? (I.e., far enough back in time for personal records to fail, but recent enough for admixtures to correlate with the historical, cultural, and linguistic record of recent migrations and hybridizations.)

Of course the allelic footprints of drift in somewhat-mixed, semi-endogamous populations (like AJs or Pomors of my example) will only become informative, in the larger population pattern, after a large number of lower-K PCs have already been accounted for. That doesn't make them any less informative for a person's quest for "pre-genealogic history" of recently admixed individuals (or for almost anybody seriously curious about painting the chromosomes and divining which segment came from what event in the past)

I've run the calculator on my results and was a bit surprised at the output. FTDNA is saying 96.75% Orcadian and 3.25% African (which is about twice the African that I was expecting but close enough), but Dodecad comes back with:
12.83% East_European
51.14% West_European
22.94% Mediterranean
1.89% Neo_African
7.54% West_Asian
1.53% South_Asian
0.29% Southwest_Asian
0.25% Northwest_African
1.58% Palaeo_African

I know that the answer is fundamentally "different reference populations", but I'm surprised at the large Mediterranean component that DIYDodecad is reporting vs. zero for FTDNA. I was expecting the values to shift somewhat, but this seems odd. Do I have the wrong end of the stick in expecting FTDNA to report something Mediterranean if DIYDodecad comes back this strongly?


You may cite, quote, or reproduce articles on this site for non-commercial purposes, provided that you attribute them to Dienekes Pontikos and provide a link either to the main page of this blog or to the individual blog entry you are referring to.