Genetics and South Asia


Continuing with the Eurasian ChromoPainter analysis, here is the zip file containing the chunk counts that were donated by an individual in a column to an individual in a row. Please note that this is an all-against-all analysis, so it does not directly show the direction of gene flow. Also, the IDs I used here are based on ethnicity (except for harappann which are mixed Harappa Project participants). If you want to find out your ethnicID, take a look at this spreadsheet which has the appropriate mapping.

Since fineStructure classified these 2,001 individuals into 203 populations, it's easier to look at the chunk counts averaged over these populations.
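As a rough sketch of that averaging step (not the script actually used for this post), the individual-level matrix can be collapsed into population-level cells by averaging over every recipient/donor pair that falls in each pair of populations. The function name, the nested-dict layout and the choice to skip self-pairs are assumptions made for illustration.

```python
from collections import defaultdict

def average_by_population(chunks, pop_of):
    """Average individual-level chunk counts into population-level cells.

    chunks[recipient][donor] is the chunk count copied by `recipient` from `donor`;
    pop_of[individual] is that individual's population label.
    """
    sums = defaultdict(lambda: defaultdict(float))
    pairs = defaultdict(lambda: defaultdict(int))
    for recipient, row in chunks.items():
        for donor, count in row.items():
            if donor == recipient:
                continue  # assumption: self-pairs are left out of the average
            r_pop, d_pop = pop_of[recipient], pop_of[donor]
            sums[r_pop][d_pop] += count
            pairs[r_pop][d_pop] += 1
    # Divide each summed cell by the number of individual pairs behind it.
    return {
        r_pop: {d_pop: sums[r_pop][d_pop] / pairs[r_pop][d_pop] for d_pop in row}
        for r_pop, row in sums.items()
    }
```

Rows of the result are recipient populations and the inner keys are donor populations, matching the orientation of the individual-level file.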

From top to bottom (recipient) and left to right (donor), the five major branches are South Asian, European, Near Eastern & Western Asian, Inner Asian/Siberian, and East Asian respectively.

Here are the top 50 populations that have donated chunks to the Kalash (Pop133).

The three bars at the bottom are for the 3 different (closely related) Kalash clusters. The clusters donating the most after that are the Burusho, Sindhi, Pathan, etc. The top non-South Asian donor (Tajik Pop116) is at #21 and the next one is also Tajik (Pop95) at #38.
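A ranking like this can be read off one row of the population-averaged matrix. The sketch below assumes the row is a plain dict from donor label to average chunk count; both helper names are made up for illustration, and the "group" passed to the second one would be whatever set of labels you consider South Asian.

```python
def top_donors(row, n=50):
    """Return the n largest (donor, chunk count) pairs from one recipient row."""
    return sorted(row.items(), key=lambda item: item[1], reverse=True)[:n]

def first_rank_outside(ranked, group):
    """1-based rank of the first donor that is not in `group`, or None."""
    for rank, (donor, _) in enumerate(ranked, start=1):
        if donor not in group:
            return rank
    return None
```

Calling `first_rank_outside(top_donors(row, n=len(row)), south_asian_labels)` gives the position of the top non-South Asian donor in that row.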

Now here are the top donors for the Pathans (Pop148).

Interestingly, the number of chunks donated to Pathans from Balochi, Brahui and Sindhi seems to be a bit more than from Punjabis. Again, Tajiks are the closest non-South Asian group at #55 and #59, followed by Kurds at #62 and the Iranians/Kurds cluster (Pop172) at #63.

For the Punjabis, the top donors (other than Punjabis, of course) are Sindhis and Gujarati-B. The top non-South Asian donors are Tajiks at #65 & #67, Iranians/Kurds at #69, Turkmen at #70, Kurds at #73 and Lezgin at #75.

Now for Pop181 (2 Baloch and 9 Brahui).

The Baloch/Brahui are more inbred compared to Punjabis and Pathans. After the top donors from Baloch, Brahui and Makrani, we get Sindhis, Pathans, Velama and Punjabis. The top non-South Asian donor populations are Iranian/Kurd at #28, Turkmen at #33, Turk/Kurd (Pop162) at #35, Iranian Jews at #39, Kurd at #41, a lone Saudi at #42, Iraqi Jews at #43, Tajik at #44, Druze at #47, Armenians at #48, and Samaritan at #50. So it seems that the Baloch and Brahui are a lot more West Asian than other groups in Pakistan/NW India.

The top donors, after Pop129, are Iyengar Brahmins and a group consisting of other South Indian Brahmins, Kerala Christians and Nairs, and then Velama. The Dusadh are the top North Indian donor, followed by Gujarati-B and Chamar. The top non-South Asian donor is Tajik at #73.

Some months ago, I decided to run a big ChromoPainter analysis of the Eurasian samples I have. I removed from my dataset not only all Sub-Saharan Africans, but also North Africans and anyone else with more than 2% African admixture (which unfortunately included me).

Since the number of samples was still too large, I picked 25 random individuals from each non-South-Asian ethnicity while keeping all South Asians. I also tried to remove all close relatives and those with a high missing genotyping rate.
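The sample selection described above could be sketched roughly as follows. This is an illustration only: the record layout (`id`, `ethnicity`, `african` as a fraction), the function name and the fixed seed are assumptions, and the removal of close relatives and high-missingness samples is not shown.

```python
import random

def subsample(samples, south_asian, max_per_group=25, max_african=0.02, seed=0):
    """Drop samples above the African admixture cutoff, then cap each
    non-South-Asian ethnicity at max_per_group random individuals."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_group = {}
    for sample in samples:
        if sample["african"] > max_african:
            continue  # more than 2% African admixture: excluded
        by_group.setdefault(sample["ethnicity"], []).append(sample)

    kept = []
    for ethnicity, members in by_group.items():
        if ethnicity in south_asian or len(members) <= max_per_group:
            kept.extend(members)  # keep all South Asians and all small groups
        else:
            kept.extend(rng.sample(members, max_per_group))
    return kept
```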

In the end, I had 254,576 SNPs for 2,001 samples belonging to 197 ethnic groups.

Then I got busy and the results sat on my computer for more than a month.

Now let's look at the ChromoPainter/fineStructure analysis. Due to my time constraints, I am going to present them in several posts.

Today, let's look at the fineStructure clustering run on the chunkcount output of ChromoPainter. It divided the individuals into 203 populations. Here's the spreadsheet containing the group and individual population clustering.

And here is the dendrogram showing the relationship of the clusters/populations computed by fineStructure.

Dienekes ran ChromoPainter/fineSTRUCTURE analysis of South Asians along with some West Eurasian populations, something I had neglected to do in my own South Asian run.

Using Dienekes' data, I was trying to figure out which South Asian populations had more DNA chunks in common with other groups when I ran into something strange. Looking at the chunkcount spreadsheet, if we focus on a recipient population (i.e., one row), we can see which populations contributed more "chunks". For most populations, the results are as expected: the top donor is either the same population or some close population. For example, let's look at the top 5 matches for Velamas_M:

Recipient: Velamas_M
Donor                  Chunk count
Velamas_M              1265.77
Pulliyar_M             1259.38
North_Kannadi          1256.06
Chamar_M               1255.6
Piramalai_Kallars_M    1254.74

However, when we do the same for Pathans, Sindhis, Uttar Pradesh Brahmins, Kshatriyas and Muslims, we get strange results.

Recipient: Pathan
Donor                  Chunk count
Chamar_M               1229.91
Velamas_M              1229.56
UP_Scheduled_Caste_M   1229.53
Piramalai_Kallars_M    1229.32
Muslim_M               1229.27

Do Pathans match Chamar the best? Pathans themselves don't show up as a donor until #11.
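One way to check this systematically is to compute where each population ranks as a donor in its own row. The helper below is a hypothetical sketch; the example row is the Velamas_M top 5 quoted earlier in this post, where the self-match comes first as expected.

```python
def donor_rank(row, donor):
    """1-based position of `donor` when the row is sorted by chunk count, largest first."""
    ranked = sorted(row, key=row.get, reverse=True)
    return ranked.index(donor) + 1

# Top 5 donors for the Velamas_M recipient row, as quoted in the post.
velamas_row = {
    "Velamas_M": 1265.77,
    "Pulliyar_M": 1259.38,
    "North_Kannadi": 1256.06,
    "Chamar_M": 1255.6,
    "Piramalai_Kallars_M": 1254.74,
}
self_rank = donor_rank(velamas_row, "Velamas_M")  # rank of the self-match
```

Running `donor_rank` with the recipient's own label over every full row would flag the populations whose self-match is not at or near the top.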

Recipient: Sindhi
Donor                  Chunk count
Chamar_M               1234.09
Piramalai_Kallars_M    1234.08
Pulliyar_M             1233.85
Velamas_M              1233.6
North_Kannadi          1233.55

Again, Sindhis as donors are #12.

Recipient: Brahmins_UP_M
Donor                  Chunk count
Pulliyar_M             1244.6
Chamar_M               1244.53
North_Kannadi          1243.44
Kol_M                  1242.88
Piramalai_Kallars_M    1241.94

The same Brahmins_UP_M are #13 as donors.

Recipient: Kshatriya_M
Donor                  Chunk count
Pulliyar_M             1247.72
Chamar_M               1247.36
North_Kannadi          1246.42
Kol_M                  1244.98
Piramalai_Kallars_M    1244.56

And #12.

Recipient: Muslim_M
Donor                  Chunk count
Pulliyar_M             1255.96
Chamar_M               1255.36
North_Kannadi          1253.96
Kol_M                  1251.74
Piramalai_Kallars_M    1250.86

Muslim_M are #8 as donors.

There is a pattern here among the top donors for these populations. The same populations show up time and again.
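To make the repetition concrete, here is a small tally of how often each donor appears across the five top-5 lists quoted above. The donor lists are copied from this post; the function name is just for illustration.

```python
from collections import Counter

# Top-5 donor lists for the five recipients discussed above.
top5 = {
    "Pathan": ["Chamar_M", "Velamas_M", "UP_Scheduled_Caste_M",
               "Piramalai_Kallars_M", "Muslim_M"],
    "Sindhi": ["Chamar_M", "Piramalai_Kallars_M", "Pulliyar_M",
               "Velamas_M", "North_Kannadi"],
    "Brahmins_UP_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                      "Kol_M", "Piramalai_Kallars_M"],
    "Kshatriya_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                    "Kol_M", "Piramalai_Kallars_M"],
    "Muslim_M": ["Pulliyar_M", "Chamar_M", "North_Kannadi",
                 "Kol_M", "Piramalai_Kallars_M"],
}

def recurring_donors(top_lists):
    """Count in how many recipients' top lists each donor appears."""
    return Counter(donor for donors in top_lists.values() for donor in donors)

counts = recurring_donors(top5)
```

Chamar_M and Piramalai_Kallars_M appear in all five lists, and Pulliyar_M and North_Kannadi in four.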

Now compare this with my results (with a larger South Asian dataset). The top 10 matches for Pathans are:

1. pathan
2. punjabi-jatt
3. bhatia
4. haryana-jatt
5. rajasthani-brahmin
6. punjabi
7. balochi
8. kashmiri
9. punjabi-brahmin
10. sindhi

For Sindhis,

1. sindhi
2. bhatia
3. balochi
4. makrani
5. brahui
6. punjabi-jatt
7. haryana-jatt
8. meghawal
9. pathan
10. punjabi

For Brahmins from Uttar Pradesh,

1. bihari-brahmin
2. haryana-jatt
3. brahmin-uttar-pradesh
4. punjabi-jatt
5. kurmi
6. sourastrian
7. bengali-brahmin
8. bihari-kayastha
9. bhatia
10. up-brahmin

For Kshatriyas,

1. bihari-brahmin
2. kurmi
3. meena
4. kshatriya
5. rajasthani-brahmin
6. haryana-jatt
7. punjabi-jatt
8. bengali-brahmin
9. kerala-muslim
10. sourastrian

For Muslims,

1. muslim
2. chamar
3. kol
4. oriya
5. uttar-pradesh-scheduled-caste
6. bihari-muslim
7. sourastrian
8. brahmin-uttaranchal
9. dusadh
10. bihari-brahmin

If Dienekes can post a chunkcount file for the clusters computed by fineSTRUCTURE, maybe we can try to figure out what happened.

The "chunkcounts" file has the donors in columns and recipients in rows. Here's a heat map of the same.
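For anyone who wants to work with such a file directly, here is a minimal reader sketch. It assumes a whitespace-delimited layout with a header row whose first field labels the recipient column and whose remaining fields are donor IDs, and it skips any line starting with `#`; check your own file before relying on this, since the exact layout is an assumption here and the function name is made up.

```python
def parse_chunkcounts(lines):
    """Parse a donors-in-columns, recipients-in-rows table into nested dicts.

    Returns counts[recipient][donor] as floats.
    """
    rows = [
        line.split()
        for line in lines
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    ]
    header, body = rows[0], rows[1:]
    donors = header[1:]  # header[0] is the label of the recipient column
    return {
        fields[0]: dict(zip(donors, (float(value) for value in fields[1:])))
        for fields in body
    }
```

With a real file this would be called as `parse_chunkcounts(open(path))`, where `path` points at your chunkcounts output.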

fineSTRUCTURE classified these 615 individuals into 89 clusters. I have named these clusters for convenience; however, the names do not imply that everyone in, say, the Punjab cluster is Punjabi.

I included the cluster tree at the top of the spreadsheet, but here's how the clusters are related.

The most interesting thing is how Gujarati A (likely Patels) are an out-group to everyone else. Another major grouping is that of the Baloch, Brahui and Makrani, along with 4 Sindhis (who might belong to one of the Baloch tribes of Sindh?).

The Punjabis, Sindhis and Pathan get better classification here than they did last time.

The Punjab cluster includes 3 Gujarati B, 4 Pathans, 2 Singapore Indians, Punjabis, Haryanvis, Kashmiris, and a Rajasthani Brahmin. Even using this method, HRP0036, who is half-Sri Lankan and half-German/Polish, was classified in the same cluster.

The Dharkar and Kanjar could not be separated at all here. According to Metspalu:

There are three second degree relatives groups in our sample: ..snip.. [Kanjar evo_37 and Dharkar HA023]. Again the last pair needs further explanation. The Dharkar and Kanjar practice a nomadic lifestyle and were living side by side at the time of sampling. As the ethnic border between the two is permeable we cannot rule out neither our error during sample collection and/or subsequent labelling nor shifted self-identity.

The inter-cluster heat map:

And you can see the chunkcounts donated from each cluster to recipient individuals in a spreadsheet.

fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.

Note that I have named the clusters. That's just a shorthand so we don't have to refer to them by cluster number. Instead I used the population with the largest number of individuals in a cluster to label that cluster.
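That labelling rule is simple enough to sketch in a few lines. The function and argument names below are hypothetical; ties are broken by whichever population is encountered first, which is an arbitrary choice.

```python
from collections import Counter

def label_clusters(cluster_of, pop_of):
    """Label each cluster with the population contributing the most individuals.

    cluster_of[individual] is a cluster number; pop_of[individual] is a population label.
    """
    members = {}
    for individual, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(pop_of[individual])
    # most_common(1) returns [(label, count)] for the largest population.
    return {
        cluster: Counter(pops).most_common(1)[0][0]
        for cluster, pops in members.items()
    }
```

A cluster labelled this way can still contain individuals from other populations, which is why the label is only a shorthand.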

Here's the cluster-level coancestry heat map.

And the pairwise coincidence:

And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.

UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.