So let's try a dendrogram of all these populations' average admixture results. Instead of using regular Euclidean distance, I used some weighting based on Fst distances between admixture components, very similar to what Palisto did.

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or a closely related one. For example, let's look at the top 5 matches for Velamas_M:

| Recipient | Velamas_M | Pulliyar_M | North_Kannadi | Chamar_M | Piramalai_Kallars_M |
|---|---|---|---|---|---|
| Velamas_M | 1265.77 | 1259.38 | 1256.06 | 1255.6 | 1254.74 |
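Reading off the top matches for one recipient amounts to sorting a single row of the chunkcount matrix. Here is a minimal Python sketch, assuming the row has already been loaded into a dict of dicts; the `top_donors` helper is a made-up name for illustration, and the data holds only the five Velamas_M values quoted above.

```python
def top_donors(chunkcounts, recipient, n=5):
    """Return the n (donor, count) pairs with the highest chunk counts
    for one recipient row, largest first."""
    row = chunkcounts[recipient]
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Illustrative data: only the five Velamas_M values quoted in the post,
# deliberately stored out of order.
chunkcounts = {
    "Velamas_M": {
        "Chamar_M": 1255.6,
        "Velamas_M": 1265.77,
        "North_Kannadi": 1256.06,
        "Piramalai_Kallars_M": 1254.74,
        "Pulliyar_M": 1259.38,
    }
}

for donor, count in top_donors(chunkcounts, "Velamas_M"):
    print(f"{donor}\t{count}")
```

With a full matrix loaded, the same call works for any recipient row.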

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

| Recipient | Chamar_M | Velamas_M | UP_Scheduled_Caste_M | Piramalai_Kallars_M | Muslim_M |
|---|---|---|---|---|---|
| Pathan | 1229.91 | 1229.56 | 1229.53 | 1229.32 | 1229.27 |

Do Pathans match Chamar the best? Pathans don't show up as a donor till #11.

| Recipient | Chamar_M | Piramalai_Kallars_M | Pulliyar_M | Velamas_M | North_Kannadi |
|---|---|---|---|---|---|
| Sindhi | 1234.09 | 1234.08 | 1233.85 | 1233.6 | 1233.55 |

Again, Sindhis as donors are #12.

| Recipient | Pulliyar_M | Chamar_M | North_Kannadi | Kol_M | Piramalai_Kallars_M |
|---|---|---|---|---|---|
| Brahmins_UP_M | 1244.6 | 1244.53 | 1243.44 | 1242.88 | 1241.94 |

The same Brahmins_UP_M are #13 as donors.

| Recipient | Pulliyar_M | Chamar_M | North_Kannadi | Kol_M | Piramalai_Kallars_M |
|---|---|---|---|---|---|
| Kshatriya_M | 1247.72 | 1247.36 | 1246.42 | 1244.98 | 1244.56 |

And Kshatriya_M are #12 as donors.

| Recipient | Pulliyar_M | Chamar_M | North_Kannadi | Kol_M | Piramalai_Kallars_M |
|---|---|---|---|---|---|
| Muslim_M | 1255.96 | 1255.36 | 1253.96 | 1251.74 | 1250.86 |

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
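To make the pattern concrete, we can count how often each donor appears across the five top-5 lists above. This is just a quick Python tally of the donor names already shown; the `donor_frequency` helper is a made-up name for illustration.

```python
from collections import Counter

# Top 5 donors for each recipient, copied from the results above.
top5 = {
    "Pathan": ["Chamar_M", "Velamas_M", "UP_Scheduled_Caste_M",
               "Piramalai_Kallars_M", "Muslim_M"],
    "Sindhi": ["Chamar_M", "Piramalai_Kallars_M", "Pulliyar_M",
               "Velamas_M", "North_Kannadi"],
    "Brahmins_UP_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                      "Kol_M", "Piramalai_Kallars_M"],
    "Kshatriya_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                    "Kol_M", "Piramalai_Kallars_M"],
    "Muslim_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                 "Kol_M", "Piramalai_Kallars_M"],
}


def donor_frequency(top_lists):
    """Count in how many recipients' top lists each donor appears."""
    counts = Counter()
    for donors in top_lists.values():
        counts.update(donors)
    return counts


freq = donor_frequency(top5)
for donor, k in freq.most_common():
    print(f"{donor}: in {k} of {len(top5)} top-5 lists")
```

Chamar_M and Piramalai_Kallars_M appear in all five lists, and Pulliyar_M and North_Kannadi in four of the five.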

Now compare these to my results (with a larger South Asian dataset). The top 10 matches for Pathans are:

pathan

punjabi-jatt

bhatia

haryana-jatt

rajasthani-brahmin

punjabi

balochi

kashmiri

punjabi-brahmin

sindhi

For Sindhis,

sindhi

bhatia

balochi

makrani

brahui

punjabi-jatt

haryana-jatt

meghawal

pathan

punjabi

For Brahmins from Uttar Pradesh,

bihari-brahmin

haryana-jatt

brahmin-uttar-pradesh

punjabi-jatt

kurmi

sourastrian

bengali-brahmin

bihari-kayastha

bhatia

up-brahmin

For Kshatriyas,

bihari-brahmin

kurmi

meena

kshatriya

rajasthani-brahmin

haryana-jatt

punjabi-jatt

bengali-brahmin

kerala-muslim

sourastrian

For Muslims,

muslim

chamar

kol

oriya

uttar-pradesh-scheduled-caste

bihari-muslim

sourastrian

brahmin-uttaranchal

dusadh

bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.

The "chunkcounts" file has the donors in columns and recipients in rows. Here's a heat map of the same.
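For anyone who wants to work with such a matrix directly, here is a rough Python sketch of a reader. The layout is an assumption on my part: whitespace-delimited text, optional comment lines starting with `#`, a header line with a corner label followed by donor names, then one line per recipient. The real file may differ in detail, and `read_chunkcounts` and the sample data are made up for illustration.

```python
import io


def read_chunkcounts(handle):
    """Parse a whitespace-delimited matrix with recipients in rows and
    donors in columns into {recipient: {donor: count}}."""
    donors = None
    matrix = {}
    for line in handle:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if donors is None:
            donors = fields[1:]  # header: corner label, then donor names
            continue
        recipient, values = fields[0], fields[1:]
        matrix[recipient] = dict(zip(donors, map(float, values)))
    return matrix


# Hypothetical two-population file, for illustration only.
sample = io.StringIO(
    "Recipient PopA PopB\n"
    "PopA 10.5 2.0\n"
    "PopB 3.0 12.25\n"
)
matrix = read_chunkcounts(sample)
print(matrix["PopA"]["PopB"])  # chunks donated by PopB to recipient PopA
```

Keeping the outer key as the recipient mirrors the file: one row is one recipient, and the inner dict holds its donors.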

fineSTRUCTURE classified these 615 individuals into 89 clusters. I have named these clusters for convenience; however, the names do not imply that everyone in, say, the Punjab cluster is Punjabi.

While I created the cluster tree at the top of the spreadsheet, here's how the clusters are related.

The most interesting thing is how Gujarati A (likely Patels) are an out-group to everyone else. Another major grouping is that of the Baloch, Brahui and Makrani, along with 4 Sindhis (perhaps from one of the Baloch tribes of Sindh?).

The Punjabis, Sindhis and Pathans get better classification here than they did last time.

The Punjab cluster includes 3 Gujarati B, 4 Pathans, 2 Singapore Indians, Punjabis, Haryanvis, Kashmiris, and a Rajasthani Brahmin. Even using this method, HRP0036, who is half-Sri Lankan and half-German/Polish, was classified in the same cluster.

The Dharkar and Kanjar could not be separated at all here. According to Metspalu:

There are three second degree relatives groups in our sample: ..snip.. [Kanjar evo_37 and Dharkar HA023]. Again the last pair needs further explanation. The Dharkar and Kanjar practice a nomadic lifestyle and were living side by side at the time of sampling. As the ethnic border between the two is permeable we cannot rule out neither our error during sample collection and/or subsequent labelling nor shifted self-identity.

The inter-cluster heat map:

And you can see the chunkcounts donated from each cluster to recipient individuals in a spreadsheet.

fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.

Note that I have named the clusters. That's just a shorthand so we don't have to refer to them by cluster number. Instead I used the population with the largest number of individuals in a cluster to label that cluster.
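The naming rule is a simple majority vote within each cluster. Here is a minimal Python sketch of that rule; the `label_clusters` helper, the sample IDs and the assignments are all hypothetical, and ties go to whichever population is seen first.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_of, population_of):
    """Label each cluster with the population that has the most members in it."""
    members = defaultdict(Counter)
    for sample, cluster in cluster_of.items():
        members[cluster][population_of[sample]] += 1
    # most_common(1) returns the single largest (population, count) pair.
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


# Hypothetical assignments, for illustration only.
cluster_of = {"S1": 1, "S2": 1, "S3": 1, "S4": 2, "S5": 2}
population_of = {
    "S1": "punjabi", "S2": "punjabi", "S3": "pathan",
    "S4": "sindhi", "S5": "sindhi",
}

print(label_clusters(cluster_of, population_of))
```

Here cluster 1 gets the label "punjabi" even though one of its three members is a Pathan, which is why a label says nothing about any individual in the cluster.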

Here's the cluster-level coancestry heat map.

And the pairwise coincidence:

And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.

UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.

For an individual, the value under a specific cluster shows the probability of that person belonging to that cluster. For example, HRP0152 has a 58% probability of belonging to cluster CL8 and 42% probability of being in cluster CL14.

For the populations in the first sheet, I added up the probabilities of all the samples in that population to get the expected number of individuals of that ethnicity belonging to a specific cluster.
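Adding up the probabilities works like this in Python. The HRP0152 values are the ones quoted above; the second sample, the population name and the `expected_cluster_counts` helper are hypothetical and only there to show the summation.

```python
from collections import defaultdict


def expected_cluster_counts(probabilities, population_of):
    """Sum per-sample membership probabilities within each population to get
    the expected number of that population's samples in each cluster."""
    totals = defaultdict(lambda: defaultdict(float))
    for sample, probs in probabilities.items():
        pop = population_of[sample]
        for cluster, p in probs.items():
            totals[pop][cluster] += p
    return {pop: dict(clusters) for pop, clusters in totals.items()}


probabilities = {
    "HRP0152": {"CL8": 0.58, "CL14": 0.42},  # values quoted in the post
    "sample_x": {"CL8": 1.0},                # hypothetical sample
}
# Hypothetical population labels, for illustration only.
population_of = {"HRP0152": "pop_a", "sample_x": "pop_a"}

for pop, clusters in expected_cluster_counts(probabilities, population_of).items():
    for cluster, value in sorted(clusters.items()):
        print(f"{pop}\t{cluster}\t{value:.2f}")
```

Each sample's probabilities sum to 1, so a population's expected counts across all clusters add up to its sample size.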

In the second sheet, I have listed all the individual samples' clustering results.

There are some outliers who didn't belong in any cluster: HRP0001 (me, of course), 7 (out of 18) Makranis, 4 (out of 23) Sindhis, 3 (all) Great Andamanese, 1 (out of 20) Balochi, 1 (out of 4) Madiga, and 1 (only) Onge.

Then I ran mclust on the first 70 dimensions. The resulting 156 clusters can be seen in a spreadsheet.

For individuals belonging to Harappa Ancestry Project, the value in a column shows that person's probability of being in that cluster. So if there is a 1 in CL15 for example, then that person has a 100% probability of being in Cluster CL15.

For the reference population groups, I have added up the probabilities for all the individuals belonging to that group.