Clustering plots

BJ Knaus and NJ Grünwald

Introduction

Two frequent research questions are “how many groups are in my data?” and “which individuals belong to which groups?” Clustering algorithms attempt to address these. They include software outside of the R environment such as Structure (but see strataG), fastStructure, and admixture. Within the R environment, we’ve frequently used discriminant analysis of principal components (DAPC). We’ve discussed how to implement this analysis elsewhere. Several steps of this analysis produce results of their own. Here we present a way to show this information in a single graphic.

K-means clustering

We’ll start by performing K-means clustering over a range of values of K, repeating the analysis 10 times for each value so we can explore how variable the results are.
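As a rough sketch of that loop: the block below runs `find.clusters()` from adegenet ten times and stores the summary statistic (BIC by default) for each K in a matrix. The `nancycats` data set that ships with adegenet, `n.pca = 40`, `maxK = 10` and the object names are my assumptions for illustration. Substitute your own data and settings.

```r
library(adegenet)

data(nancycats)  # stand-in genind object bundled with adegenet (assumption)

maxK  <- 10  # largest K to evaluate (assumption)
nReps <- 10  # replicate runs for each K

# One row per replicate, one column per value of K.
bicMat <- matrix(nrow = nReps, ncol = maxK,
                 dimnames = list(NULL, as.character(seq_len(maxK))))

for (i in seq_len(nReps)) {
  # choose.n.clust = FALSE suppresses the interactive prompt;
  # Kstat holds the statistic for K = 1..max.n.clust.
  grp <- find.clusters(nancycats, n.pca = 40,
                       choose.n.clust = FALSE, max.n.clust = maxK)
  bicMat[i, ] <- grp$Kstat
}

# The spread of each box shows the run-to-run variability at that K.
boxplot(bicMat, xlab = "Number of groups (K)", ylab = "BIC")
```

Keeping every replicate, rather than only the mean, lets the plot show whether an apparent minimum is stable across runs.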

DAPC

In general, it is recommended to explore several values of K. The find.clusters() function includes some stochasticity. When we’re at the figure creation step we’ll need consistency, so we’ll set a seed. If you’re at an earlier stage in your analysis you should comment the set.seed() call out to explore how sensitive your results are to the seed.
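One way this could look for K = 2 to 5 is sketched below. Resetting the seed before each `find.clusters()` call makes the group assignments repeatable. The data set (`nancycats`), the seed value, `n.pca = 40` and the list names are assumptions I chose for the example, not recommendations.

```r
library(adegenet)

data(nancycats)  # stand-in genind object bundled with adegenet (assumption)

my_k   <- 2:5                              # values of K to explore (assumption)
grp_l  <- vector("list", length(my_k))     # cluster assignments, one per K
dapc_l <- vector("list", length(my_k))     # DAPC results, one per K

for (i in seq_along(my_k)) {
  set.seed(9)  # comment out to see how sensitive the groups are to the seed
  grp_l[[i]]  <- find.clusters(nancycats, n.pca = 40, n.clust = my_k[i])
  # At most K - 1 discriminant functions exist for K groups.
  dapc_l[[i]] <- dapc(nancycats, pop = grp_l[[i]]$grp,
                      n.pca = 40, n.da = my_k[i] - 1)
}
```

Each element of `dapc_l` then carries a `posterior` matrix (samples in rows, groups in columns) that can be used for the assignment plots.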

Another popular perspective is to look at barplots of the posterior probabilities of group assignment for each sample. Here we’ll use “facets” to separate the different values of K. First we’ll need to combine our data into a single long form data.frame. We’ll also add our geographic information for additional faceting.
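The reshaping step might be sketched as follows. To keep the example self-contained I simulate the posterior matrices. In a real analysis each would be the `posterior` element of a DAPC result. The sample names, the `Region` labels and the column names are all hypothetical.

```r
library(ggplot2)

set.seed(1)
samples <- paste0("s", 1:12)
region  <- rep(c("North", "South"), each = 6)  # hypothetical geographic labels
my_k    <- 2:4

# Simulated stand-ins for the DAPC posterior matrices:
# one samples-by-K matrix per K, with each row summing to 1.
post_l <- lapply(my_k, function(k) {
  m <- matrix(runif(length(samples) * k), ncol = k,
              dimnames = list(samples, as.character(seq_len(k))))
  m / rowSums(m)
})

# Stack the matrices into one long-form data.frame (one row per sample/group/K).
long_df <- do.call(rbind, lapply(seq_along(my_k), function(i) {
  p <- post_l[[i]]
  data.frame(
    Sample    = rep(rownames(p), times = ncol(p)),
    Region    = rep(region, times = ncol(p)),
    K         = my_k[i],
    Group     = rep(colnames(p), each = nrow(p)),
    Posterior = as.vector(p)  # column-major, matching Sample and Group above
  )
}))

# Stacked bars per sample; rows of facets are K, columns are regions.
p_bars <- ggplot(long_df, aes(x = Sample, y = Posterior, fill = Group)) +
  geom_col() +
  facet_grid(K ~ Region, scales = "free_x", space = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p_bars)
```

Long form is what makes the faceting simple: K and region are ordinary columns, so `facet_grid()` can split on them directly.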

The group names in the assignment test are arbitrary. This means that what is group 1 in the K = 2 facet may be very similar to group 4 in the K = 3 facet. For example, group 2 at K = 4 appears to be group 5 at K = 5. At present I handle this with a manual step, which I’ve omitted here. If your goal is a “publication quality” graphic, you should invest some time in standardizing the group names. This is also why I set the seed above: the manual relabeling only holds if the groups come out the same on every run.
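If you would rather not relabel by hand, one possible approach (not the manual step I actually used) is to rename the groups at each K so they reuse the label of the group they overlap most with at the previous K. The helper `match_labels()` below is hypothetical. It is a greedy sketch that assumes group labels are "1", "2", ... and that the reference has no more groups than the new clustering.

```r
# Hypothetical helper: relabel `new` so its groups reuse the labels of `ref`
# wherever the two clusterings share the most samples (greedy matching).
match_labels <- function(ref, new) {
  tab    <- table(new, ref)          # rows: new groups, columns: reference groups
  new_lv <- rownames(tab)
  ref_lv <- colnames(tab)
  map    <- setNames(rep(NA_character_, length(new_lv)), new_lv)

  # Repeatedly pair off the largest remaining overlap.
  while (any(tab > 0)) {
    idx <- which(tab == max(tab), arr.ind = TRUE)[1, ]
    map[new_lv[idx[1]]] <- ref_lv[idx[2]]
    tab[idx[1], ] <- 0
    tab[, idx[2]] <- 0
  }

  # New groups without a counterpart receive the labels still unused.
  unused <- setdiff(as.character(seq_along(new_lv)), map)
  map[is.na(map)] <- unused[seq_len(sum(is.na(map)))]

  factor(unname(map[as.character(new)]))
}

# Toy example: the same samples clustered at K = 2 and K = 3.
ref <- factor(c(1, 1, 1, 2, 2, 2, 2, 2))
new <- factor(c(3, 3, 3, 1, 1, 1, 2, 2))
relabelled <- match_labels(ref, new)
as.character(relabelled)
# "1" "1" "1" "2" "2" "2" "3" "3"
```

Greedy matching can pair groups poorly when a group splits fairly evenly, so I would still check the result by eye before publishing.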

Multi-panel plot

We can now put all of this together into one plot. We’ll use the package ggpubr, which I learned to use from the STHDA blog.

Note that I’ve commented out two lines. These can be uncommented to write a publication quality graphic to a file. If I did that while building the website, the graphic would be sent to the file instead of appearing on this page, so I’ve commented them out so that they’re there when I need them.
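A minimal sketch of the layout, using toy plots in place of the real panels, could look like this. The data, the object names, the file name and the TIFF dimensions are placeholders I made up. The nesting of `ggarrange()` calls puts panels A and B side by side above a full-width panel C.

```r
library(ggplot2)
library(ggpubr)

# Toy stand-ins for the three panels (hypothetical data).
df <- data.frame(x = 1:10, y = (1:10)^2, g = rep(c("a", "b"), 5))
p_kmeans  <- ggplot(df, aes(x = factor(x), y = y)) + geom_point()
p_scatter <- ggplot(df, aes(x = x, y = y, colour = g)) + geom_point()
p_bars    <- ggplot(df, aes(x = factor(x), y = y, fill = g)) + geom_col()

# Uncomment the tiff() and dev.off() lines to write the figure to a file.
# tiff("clustering_panels.tiff", width = 6.5, height = 6.5, units = "in", res = 300)
fig <- ggarrange(
  ggarrange(p_kmeans, p_scatter, ncol = 2, labels = c("A", "B")),  # top row
  p_bars,                                                          # bottom row
  nrow = 2, labels = c("", "C")
)
print(fig)
# dev.off()
```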

We now have our justification for choosing a K value (A. K-means clustering), the relationships among the groups at the highest K value we’ve explored (B. scatter plot of discriminant functions), and the probability of each individual being assigned to each group (C. bar plots of posterior probabilities). Note that these data come from clonal plant pathogens. Your data may look very different.