OK, we have a first cut of mthap with PhyloTree Build 14 support, but this is just the first version that seems to mostly work. There are likely many bugs that will need to be worked out, so please try it out and let me know of anything that seems odd. There are 1030 new haplogroups in the new build, so making sure everything works will be a challenge. Be sure to cross-check with the PhyloTree website to make sure what mthap comes up with is reasonable.

Known problems: Builds 1-2 are completely broken. Builds 3-5 are broken for some groups. Builds 6-13 seem to be OK. Build 14 has a quick hack for the new .XC insertion notation, and haplogroups L3e1c and M38e are not working properly yet. There are a few other things, mostly minor or cosmetic. Anything else, let me know. I'll probably go through a few updates over the next several days while we get the kinks out.

Yes, it is normal to have multiple matches with the same rank. That just means that there is not a significant difference between the scores, so they are tied for 3rd place in your example.

Here are some interesting things about the new build:

Build 14 has 3550 haplogroups, up from 2520 in Build 13, an increase of 1030 haplogroups, or about 41%. This means that a lot of you will have a new haplogroup in this new build.

rCRS is now H2a2a1; previously it was H2a2a.

Some haplogroups have C repeat insertions of widely varying length. Previous builds made an attempt to get the right number of insertions for haplogroups that have them, but there's often too much variation to accurately reflect in the haplogroup tree. Build 14 introduces a new .XC notation which indicates an arbitrary number of insertions (e.g. 573.XC instead of 571.1C 571.2C ... or 573.1CC ...). (As previously mentioned, mthap doesn't fully support this yet.)
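To make the .XC idea concrete, here is a minimal sketch of how a wildcard insertion marker could be compared against the insertions reported in a sample. This is not mthap's actual code; the function name `matches_xc` and the matching rule (same position, inserted bases all equal to the marker's base, any count) are assumptions made for illustration only.

```python
import re

# Hypothetical pattern for insertion labels such as "573.1C", "573.1CC" or "573.XC".
# Group 1: position, group 2: insertion index (or X wildcard), group 3: inserted base(s).
_INSERTION = re.compile(r"^(\d+)\.(\d+|X)([ACGT]+)$")


def matches_xc(tree_marker, observed):
    """Return True if any observed insertion satisfies a wildcard marker.

    Assumption: a marker like "573.XC" is satisfied by one or more C
    insertions at position 573, regardless of how many there are.
    """
    marker = _INSERTION.match(tree_marker)
    if marker is None or marker.group(2) != "X":
        raise ValueError("not a .X insertion marker: %r" % tree_marker)
    position, base = marker.group(1), marker.group(3)

    for label in observed:
        found = _INSERTION.match(label)
        if found is None or found.group(2) == "X":
            continue  # substitutions etc. are not insertions; skip them
        same_position = found.group(1) == position
        same_base = set(found.group(3)) == set(base)
        if same_position and same_base:
            return True
    return False
```

With this sketch, a sample reporting 573.1C and 573.2C and a sample reporting 573.1CC would both satisfy 573.XC, while a sample with no insertion at 573 would not.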

This build also marks the first step in transitioning from the Revised Cambridge Reference Sequence (rCRS) to the new Reconstructed Sapiens Reference Sequence (RSRS), which represents a common ancestor rather than an arbitrary human rather far down the phylogenetic tree. After the Build 14 issues are worked out, I'll be looking at how to support this new convention.

Hi James, Thanks a bunch! I tried it and it works just fine with my FASTA file. Because my classification has not changed (still C4a1), I guess I may not be a good test subject. By the way, do you plan to switch the interface over to evolution from RSRS instead of rCRS?

The bulk of the changes appear to be in European haplogroups, as the vast majority of new sequences are from FTDNA, whose customer base is largely those of European ancestry. Those with Asian, African or Native American haplogroups have less chance of any change in designation. That said, my haplogroup did not change either, though that is to be expected as I have no FGS matches and only one private mutation.

RSRS support will require extensive changes, but it is something I plan to work on as it would make some things much more logical in the long run.

Thank you James for your continuous work! The merging to RSRS is a logical path and should be done by everyone. Here is a result from mthap version 0.18pre2 (2012-04-10); haplogroup data version PhyloTree Build 14 (2012-04-05) +mods for an actual 23andme v3 raw data test file:

My private 16223T is actually a back-mutation from the original at hg R, and someone else in the U5a1a1 FGS Project shares 152C 16294T, so hopefully that will make it into a future build if the other donor authorized Dr. Behar to use his/her sequence in this paper. I'm anxiously waiting for GenBank to release their latest data set to find out if it's actually there. The phylogeny provided by http://www.mtdnacommunity.org currently lists this haplogroup [U5a1a1 T152C!] as "TBD" with no associated GenBank accession numbers. T152C! of course indicates that my 152C is a back mutation from the RSRS (the original C152T mutation took place at hg L2'3'4'5'6, so here's to vindication), and of course there's also the C16192T / T16192C back-mutation from hg U5 and between hgs U5a1 and U5a1a. What can I say, my maternal line prefers the old ways.
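For anyone reading labels like T152C! by hand, a small sketch of how such a label could be split into its parts may help. It assumes the RSRS-style form of ancestral base, position, derived base, with each trailing "!" marking a back mutation; the function name `parse_mutation` and the returned field names are hypothetical and not taken from mthap or any other tool.

```python
import re

# Hypothetical pattern: ancestral base, position, derived base, then zero or
# more "!" characters marking a back mutation (e.g. "T152C!").
_MUTATION = re.compile(r"^([ACGT])(\d+)([ACGT])(!*)$")


def parse_mutation(label):
    """Split an RSRS-style substitution label into its components."""
    match = _MUTATION.match(label)
    if match is None:
        raise ValueError("unrecognised mutation label: %r" % label)
    return {
        "ancestral": match.group(1),
        "position": int(match.group(2)),
        "derived": match.group(3),
        # Number of "!" marks; at least one means a reversion to an earlier state.
        "back_mutations": len(match.group(4)),
    }
```

Under these assumptions, T152C! reads as position 152 going from T back to C with one back-mutation mark, and C152T reads as the original forward change at the same position.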

Alpeu wrote: I hope this is one of the major steps to a fine graded tree and the resolution will be enough for all major uses in the future. Look at this statement in Behar, Oven et al. (2012):

Approaching a Perfect Phylogeny [...] First, an almost final level of resolution for a number of western Eurasian clades was achieved, and the nodes of ancestral and derived haplogroups are often differentiated by a single mutation.

I expect that Build 14 will include most of the terminal nodes for younger haplogroups such as the major daughters of H, but for the older haplogroups, e.g., daughters of U, I think there is a lot of structure that remains to be discovered. It appears that a large portion of the FTDNA customers did not authorize the use of their data, so I believe we still have many unnamed nodes in the data in several of the U projects.

Vince - about 152, it seems to be a frequent mutation site, so I'm not sure how reliable it is for specifying a new haplogroup. I'm guessing that is why it has not been included in Build 14 or in earlier versions.

Gail

Hi Gail and all,

I don't like to be disagreeable (hah!), but I think there's still a lot of structure to be discovered. If you go back to Behar's original K tree from 2007, there were NO sequences on it from Ireland. From FTDNA, there are now lots of Irish sequences, and their haplotypes are getting somewhat predictable. But I have several K Project members from the Middle East. Their sequences tend to be unique each time. Had one from Turkey yesterday whose nearest match was from Italy. The Italian one has three additional coding-region mutations, which represents thousands of years. Think of all the structure to be found between those two. I think it's way too early to close up my discovering-new-subclades business. There is a limit to the structure; if you test everybody, the structure ends - until you test their newborns.

Not sure why you say 152C has not been included in the PhyloTree. It certainly is in the definitions of several K subclades, usually in conjunction with other mutations. "Recurrent" does not mean "unstable." (Trying out that phrase.)