Category Archives: Statistics

Post navigation

Well they’ve been running around on the flat expanses of the early Holocene lake bed with impressively large machines, whacking down and gathering the soybeans and corn. This puts dirt clods on the roads that cause one on a road bike at dusk to weave and swear, but I digress. The Farmer’s Almanac indicates says that it must therefore be about World Series time, which in turn is just about approximately guaranteed to initiate various comments regarding the role of luck, good or bad, in deciding important baseball game outcomes.

There are several important things to be blurted out on this important topic and with the Series at it’s climax and the leaves a fallin’ now’s the time, the time is now.

It was Bill James, the baseball “sabermetric” grandpa and chief guru, who came up with the basic idea some time ago, though not with the questionable terminology applied to it I think, which I believe came later from certain disciples who knelt at his feet.

The basic idea starts off well enough but from there goes into a kind of low-key downhill slide, not unlike the truck that you didn’t bother setting the park brake for because you thought the street grade was flat but found out otherwise a few feet down the sidewalk. At which point you also discover that the bumper height of said truck does not necessarily match that of a Mercedes.

The concept applies not just to baseball but anything involving integer scores. Basic idea is as follows (see here). Your team plays 162 baseball games, 25 soccer matches or whatever, and of course you keep score of each. You then compute the fraction S^x/(S^x + A^x), where using the baseball case, S = runs scored, A = runs allowed and x = an exponent that varies depending on the data used (i.e. the teams and years used). You do this for each team in the league and also compute each team’s winning percentage (WP = W/G, where W = number of wins and G = games played in the season(s)). A nonlinear regression/optimization returns the optimal value of x, given the data. The resulting fraction is known as the “pythagorean expectation” of winning percentage, claiming to inform us of how many games a given team “should” have won and lost over that time, given their total runs scored and allowed.

Note first that the value of x depends on the data used: the relationship is entirely empirically derived, and exponents ranging from (at least) 1.8 to 2.0 have resulted. There is no statistical theory here whatsoever, and in no description of “the pythag” have I ever seen any mention of such. This is a shame because (1) there can and should be, and (2) it seems likely that most “sabermatricians” don’t have any idea as to how or why. Maybe not all, but I haven’t seen any discuss the matter. Specifically, this is a classic case for application of Poisson-derived expectations.

However the lack of theory is one, but not really the main, point here. More at issue are the highly questionable interpretations of the causes of observed deviations from pythag expectations, where the rolling truck smashes out the grill and lights of the Mercedes.

You should base an analysis like this on the Poisson distribution for at least two very strong reasons. First, interpretations of the pythag always involve random chance. That is, the underlying view is that departures of a given team’s won-loss record from pythag expectation is always attributed to the action of randommness–random chance. Great, if you want to go down that road, that’s exactly what the Poisson distribution is designed to address. Secondly, it will give you additional information regarding the role of chance that you cannot get from “the pythag”.

Indeed, the Poisson gives the expected distribution of integer-valued data around a known mean, under the assumption that random deviations from that mean are solely the result of sampling error, which in turn results from the combination of Complete Spatial Randomness (CSR) complete randomness of the objects, relative to the mean value and the size of the sampling frame. In our context, the sampling frame is a single game and the objects of analysis are the runs scored, and allowed, in each game. The point is that the Poisson is inherently designed to test just exactly what the SABR-toothers are wanting to test. But they don’t use it–they instead opt for the fully ad-hoc pythag estimator (or slight variations thereof). Always.

So, you’ve got a team’s total runs scored and allowed over its season. You divide that by the number of games played to give you the mean of each. That’s all you need–the Poisson is a single parameter distribution, the variance being a function of the mean. Now you use that computer in front of you for what it’s really ideal at–doing a whole bunch of calculations really fast–to simply draw from the runs scored, and runs allowed, distributions, randomly, say 100,000 times or whatever, to estimate your team’s real expected won-loss record under a fully random score distribution process. But you can also do more–you can test whether either the runs scored or allowed distribution fits the Poisson very well, using a chi-square goodness-of-fit test. And that’s important because it tells you basically, whether or not they are homogeneous random processes–processes in which the data generating process is unchanging through the season. In sports terms: it tells you the degree to which the team’s performance over the year, offensive and defensive, came from the same basic conditions (i.e. unchanging team performance quality/ability).

The biggest issue remains however–interpretation. I don’t how it all got started, but somewhere, somebody decided that a positive departure from “the pythag” (more wins than expected) equated to “good luck” and negative departures to “bad luck”. Luck being the operative word here. Actually I do know the origin–it’s a straight forward conclusion from attributing all deviations from expectation to “chance”. The problem is that many of these deviations are not in fact due to chance, and if you analyze the data using the Poisson as described above, you will have evidence of when it is, and is not, the case.

For example, a team that wins more close games than it “should”, games won by say just one or two runs, while getting badly smoked in a small subset of other games, will appear to benefit from “good luck”, according to the pythag approach. But using the Poisson approach, you can identify whether or not a team’s basic quality likely changed at various times during the season. Furthermore, you can also examine whether the joint distribution of events (runs scored, runs allowed), follows random expectation, given their individual distributions. If they do not, then you know that some non-random process is going on. For example, that team that wins (or loses) more than it’s expected share of close games most likely has some ability to win (or lose) close games–something about the way the team plays explains it, not random chance. There are many particular explanations, in terms of team skill and strategy, that can explain such results, and more specific data on a team’s players’ performance can lend evidence to the various possibilities.

So, the whole “luck” explanation that certain elements of the sabermetric crowd are quite fond of and have accepted as the Gospel of James, may be quite suspect at best, or outright wrong. I should add however that if the Indians win the series, it’s skill all the way while if the Cubs win it’ll most likely be due to luck.

I’ve been thinking some more about this issue–the idea that selection should tend to favor those genotypes with the smallest temporal variations in fitness, for a given mean fitness value (above 1.00). It’s taken some time to work through this and get a grip on what’s going on and some additional points have emerged.

The first point is that although I surely don’t know the entire history, the idea appears to be strictly mathematically derived, from modeling: theoretical. At least, that’s how it appears from the several descriptions that I’ve read, including Orr’s, and this one. These all discuss mathematics–geometric and arithmetic means, absolute and relative fitness, etc., making no mention of any empirical origins.

The reason should be evident from Orr’s experimental description, in which he sets up ultra-simplified conditions in which the several other important factors that can alter genotype frequencies over generations, are made unvarying. The point is that in a real world experimental test you would also have to control for these things, either experimentally or statistically, and that would not be easy. It’s hard to see why anybody would go to such trouble if the theory weren’t there to suggest the possibility in the first place. There is much more to say on the issue of empirical evidence. Given that it’s an accepted idea, and that testing it as the generalization it claims to be is difficult, then the theoretical foundation had better be very solid. Well, I can readily conceive of two strictly theoretically-based reasons of why the idea might well be suspect. For time’s sake, I’ll focus on just one of those here.

The underlying basis of the argument is that, if a growth rate (interest rate, absolute fitness, whatever) is perfectly constant over time, the product of the series gives the total change at the final time point, but if it is made non-constant, by varying it around that rate, then the final value–and thus the geometric mean–will decline. The larger the variance around the point, the greater the decline. For example, suppose a 2% increase of quantity A(0) per unit time interval (g), that is, F = 1.020. Measuring time in generations here, after g = 35 generations, A(35) = F^g = 1.020^35 = 2.0; A is doubled in 35 generations. The geometric (and arithmetic) mean over the 35 years is 1.020, because all the yearly rates are identical. Now cause F to instead vary around 1.02 by setting it as the mean of a normal distribution with some arbitrarily chosen standard deviation, say 0.2. The geometric mean of the series will then drop (on average, asymptotically) to just below 1.0 (~ 0.9993). Since the geometric mean is what matters, genotype A will then not increase at all–it will instead stay about the same.

This is a very informative result. Using and extending it, now imagine an idealized population with two genotypes, A and B, in a temporally unvarying selection environment, with equal starting frequencies, A = B = 0.50. Since the environment doesn’t vary, there is no selection on either, that is F.A = F.B = 1.0 and they will thus maintain equal relative frequencies over time. Now impose a varying selection environment where sometimes conditions favor survival of A, other times B. We would then repeat the above exercise, except that now the mean of the distribution we construct is 1.000, not 1.020. The resulting geometric mean fitness of each genotype is now 0.9788 (just replace 1.02 with 1.00 in the above code).

So what’s going to happen? Extinction, that’s what. After 35 generations, each will be down to 0.9788^35 = 0.473 of it’s starting value, on average, and on the way to zero. The generalization is that any population having genotypes of ~ equal arithmetic mean (absolute) fitness and normally distributed values around that mean, will have all genotypes driven to extinction, and at a rate proportional to the magnitude of the variance. If instead, one genotype has an arithmetic mean fitness above 1.00 a threshold value determined by it’s mean and variance, while all others are below it, then the former will be driven to fixation and the latter to extinction. These results are not tenable–this is decidedly not what we see in nature. We instead see lots of genetic variation, including vast amounts maintained over vast expanses of time. I grant that this is a fairly rough and crude test of the idea, but not an unreasonable one. Note that this also points up the potentially serious problem caused by using relative, instead of absolute, fitness, but I won’t get into that now.

Extinction of course happens in nature all the time, but what we observe in nature is the result of successful selection–populations and species that survived. We know, without question, that environments vary–wildly, any and all aspects thereof, at all scales, often. And we also know without question that selection certainly can and does filter out the most fit genotypes in those environments. Those processes are all operating but we don’t observe a world in which alleles are either eliminated or fixed. The above examples cannot be accurate mathematical descriptions of a surviving species’ variation in fitness over time–something’s wrong.

The “something wrong” is the designation of normally distributed variation, or more exactly, symmetrically distributed variation. To keep a geometric mean from departing from it’s no-variance value, one must skew the distribution around the mean value, such that values above it (x) are inverses (1/x) (mean/x) of those below it–that is the only way to create a stable geometric mean while varying the individual values. [EDIT: more accurately, the mean must equal the product of the values below the mean, multiplied by the mean divided by the product of the values above the mean, but the values will be skewed in any case.] Mathematically, the way to do so is to work with the logarithms of the original values–the log of the geometric mean is designated as the mean of normally distributed logarithms of the individual values, of whatever size variance one wants. Exponentiation of the sum of the logarithms will equal the product of the fitness series.

Hopefully, what I’m driving at is emerging. If the variance structure must obey this mathematical necessity to preserve a genotype’s mean fitness at 1.00, while still allowing the individual series values to vary…then why should we not expect the same to hold true when the mean geometric fitness is not equal to 1.00? I would argue that that’s exactly what we should expect, and that Gillespie’s original arguments–and Orr’s, and others’ summaries thereof–are not particularly defensible theoretical expectations of what is likely to be happening in nature. Specifically, the idea that the variance in fitness around an arithmetic mean should necessarily arise from symmetrically (normally) distributed values, is questionable.

As alluded to above, there is (at least) a second theoretical argument as well, but I don’t have time to get into it now (nor for this one for that matter). Suffice it to say that it involves simultaneous temporal changes in total population size and selective environments. All this without even broaching the entire hornet’s nest of empirically testing the idea, a topic reviewed five years ago by Simons. For starters, it’s not clear to me just how conservative “bet hedging” could ever be distinguished from the effects of phenotypic plasticity.

Last week at the blog Dynamic Ecologyit was argued that natural selection behaves like a “risk-averse” money investor. That is, assuming that fitness varies over time (due to e.g. changing environmental variables or other selective factors), natural selection favors situations in which the mean fitness is maximized while the variance is minimized. The idea is explained in this short paper by Orr (2007), whose goal was to explain previous findings (Gillespie, 1973) intuitively. This presumes that knowledge of investor behavior is commonplace, but for my money, an examination of the math details and assumptions is what’s really needed.

I often get obsessed with certain topics, especially statistical and mathematical ones. Lately I’ve been thinking about the Poisson distribution a lot, as it figures heavily in analyses of random populations, and thus also in assessing deviations therefrom, a topic that I’ve been heavily involved for a while, relative to historic and current tree populations. I also find that often a specific question will arise, related to whatever it is I’m doing, that is best (and often quickest) answered in a way that gives me the most confidence, by direct simulation, for which I use R.

The topic of varying scales of variation, and the information that can be obtained by analysis thereof, is a pretty interesting one IMO. When the generating processes for a phenomenon of interest are complex, or poorly understood for whatever reason, one can (and should!) obtain valuable information regarding likelihoods of various hypothetical process drivers, by multi-scale analysis–essentially, obtaining evidence for the scale(s) at which departures from randomness are greatest and using that information to suggest, or constrain, possible explanations. There are a number of ways to go about doing so.

The Poisson distribution is the appropriate descriptor of a homogeneous (constant) rate process whose individual event outcomes are random. An “under-dispersed” population at a particular scale of analysis will be more “regular” in its arrangement than expected by a random process, and it is clear that in such a situation there must necessarily also be under-dispersion at at least some other scales as well, both smaller and larger. To illustrate via an extreme example, suppose some location that gets 36 inches of precipitation (P) per year on average, distributed as exactly three inches per month, every month. The probability of such a result arising, when P varies randomly (Poisson) at any sub-monthly scale, is extremely low. It won’t occur over any extended period of time. The same principle holds, though muted, if there is some monthly variance around 3.0.

In an over-dispersed (“clustered”) population, again at some defined scale, the situation is different. Such a population will also be over-dispersed at smaller scales, but not necessarily at larger ones, at least not at the same intensity, and there should be some unknown scale at which the variation reduces to the Poisson. This means that a Poisson population is not necessarily Poisson at smaller scales, but it should be so at larger scales. That is, it should “scale up” according to Poisson expectation, i.e. with the same rate but a greater absolute number, and variance therein, per sample unit.

But does it? Or rather, what does R think about the matter?

Well, here’s what I get, using the example case of a mean annual P/yr of 36″ and 100,000 simulated monthly, or weekly, sums from randomly sampling the Poisson expectation at sub-interval (weekly or daily) time scales.

*However, I also evaluated gamma expectations, mainly as a check and/or curiosity (the gamma interpolates between the Poisson’s integer values). I didn’t always get the exact correspondence as expected, and I’m not really sure why. Close, but not close enough to be due to rounding errors, so that’s kind of interesting, but not enough to pursue further.

Funding for this post was provided in equal part by the French Fish Ectoderm and Statistics Foundation and the American Association for Advancement of Amalgamated and Aggregated Associations. These organizations are solely responsible for any errors herein, and associated management related consequences.

So, without getting into the reasons, I’m reading through the entry in the International Encyclopedia of Statistical Science on “Statistical Fallacies: Misconceptions and Myths”, written by one “Shlomo Sawilowsky, Professor, Wayne State University, Detroit MI, USA”. Within the entry, 20 such fallacies are each briefly described.

Sawilowsky introduces the topic by stating:

Compilations and illustrations of statistical fallacies, misconceptions, and myths abound…The statistical faux pas is appealing, intuitive, logical, and persuasive, but demonstrably false. They are uniformly presented based on authority and supported based on assertion…these errors spontaneously regenerate every few years, propagating in peer reviewed journal articles…and dissident literature. Some of the most egregious and grievous are noted below.

Great, let’s get after it then.

He then gets into his list, which proceeds through a set of +/- standard types of issues, including misunderstanding of the Central Limit Theorem, Type I errors, p values, effect sizes and etc. Up comes item 14:

14. Chi-square
(a) We live in a Chi-square society due to political correctness that dictates equality of outcome instead of equality of opportunity. The test of independence version of this statistic is accepted sans voire dire by many legal systems as the single most important arbiter of truth, justice, and salvation. It has been asserted that any statistical difference between (often even nonrandomly selected) samples of ethnicity, gender, or other demographic as compared with (often even inaccurate, incomplete, and outdated) census data is primae faciea evidence of institutional racism, sexism, or other ism. A plaintiff allegation that is supportable by a significant Chi-square is often accepted by the court (judges and juries) praesumptio iuris et de iure. Similarly, the goodness of fit version of this statistic is also placed on an unwarranted pedestal.

Bingo Shlomo!!

Now this is exactly what I want from my encyclopedia entries: a strictly apolitical, logical description of the issue at hand. In fact, I hope to delve deep into other statistical writings of Dr. Sawilowsky to gain, hopefully, even better insights than this one.

Postscript: I’m not really bent out of shape on this, and would indeed read his works (especially this one: Sawilowsky, S. (2003) Deconstructing arguments from the case against hypothesis testing. J. Mod. Appl. Stat. Meth. 2(2):467-474). I can readily overlook ideologically driven examples like this to get at the substance I’m after, but I do wonder how a professional statistician worked that into an encyclopedia entry.

I note also that the supposed “screening fallacy” popular on certain blogs is not included in the list…and I’m not the least bit surprised.

This post constitutes a wrap-up and summary of this series of articles on clustering data.

The main point thereof is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to obtain evidence of whether the observed data are explainable by random processes alone, versus processes that lead to whatever structure may have been observed in the data.

But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book, parts of it freely available here), I find that he explains what I was trying to better, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion, where he writes (p. 124):

…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.

This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.

The relevance of this quote is that a distribution that is over-dispersed, that is, one that has longer right or left (or both) tails than expected from a Poisson distribution having a given mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy, in the form of an “over-dispersion parameter” (k). In that case, the process that is varying is doing so smoothly (as defined by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, one where there are say, just two driving process states, a bi-modal distribution of observations will result.

Slapping a clustering algorithm on the latter will return two clusters whose distinction is truly meaningful–the two sets of values were likely generated by two different generating parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even to the final number of clusters delineated, but not with respect to the idea that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Observation of the data, followed by some maximum likelihood curve fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely/bimodally.

I have not written any post, or series thereof, in which I had more questions or made me think more precisely about purposes and products of statistical approaches/algorithms, than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive even; I encourage you to read through this one as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a fully obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc) frequently do not in fact have a clear and very exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.

The problem is with the example I used, specifically, that a set of values appearing to follow a smooth gradient, such as the integers 0:9 in the example, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. This is not necessarily the case, which can only be grasped in the context of an apriori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without having this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well), does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether or not the observations are explainable by a single homogeneous process, or not.

Things get interesting, given this perspective, and I can demonstrate why using say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. We observe and expect (for a cdf at 5% quantile steps):

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But…this result must mean that there is indeed some structuring force on the data to cause this result–some departure from the homogeneous state that the Poisson assumes. We can’t know, from just this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent as to why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straight-forward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.

It’s not always easy to hit all the important points in explaining an unfamiliar topic and I need to step back and mention a couple of important but omitted points.

The first of these is that we can estimate the expected distribution of individual values, from a known mean, assuming a random distribution of values, which, since the mean must be obtained from a set of individual values, means we can compare the expected and observed values, and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, and people are thus generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it’s not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise limited at one end of the distribution (most often by the value of zero).

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This fact is interesting, given that both are defined by just one parameter (the overall mean) instead of two, as is the case for the normal (mean and standard deviation). So, they are simpler, and yet are more accurate over more situations than is the normal–not an everyday occurrence in modeling. Instead, for whatever reason, there’s historically been a lot of effort devoted to transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as in e.g. the log-normal distribution. But this is ad-hoc methodology that brings with it other problems, including back transformation.

The second point is hopefully more obvious. This is that although it is easy to just look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and at any rate we want to assign a probability to this non-randomness.

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial, negative binomial etc.) are very important in analyzing non-experimental data in particular. Indeed, this point relates to the issue of forward versus inverse modeling, and to issues in legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are, in terms of method and approach and consequent confidence of inference. This is no small issue, constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structures might exist just in the observed data themselves, structures that can provide clues as to just what is going on. The standard knock on data mining and inverse modeling approaches more generally is that of the possibility of false positive results–concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find this objection to be more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of this possibility thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data–patterns explainable by random processes at only a very low probability–one can immediately gain important clues as to just what possible causal factors to examine more closely instead of going on a random fishing expedition. A lot of examples can be given here, but I’m thinking ecologically and in ecology there are many variables that vary in a highly discontinuous way, and this affects the way we have to consider things. This concept applies not only to biotic processes, which are inherently structured by the various aggregational processes inherent in populations and communities of organisms, but to various biophysical thresholds and inflection points as well, whose operation over large scales of space of time are often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities can occur–chemical, hydrologic, climatic, and of course biological.

So, the search for non-random patterns within observed data sets–before ever even considering the possible drivers of those patterns–is, depending on the level of apriori knowledge of the system in question, a potentially very important activity. In fact, I would argue that this is the most natural and efficient way to proceed in running down cause and effect in complex systems. And it is also one requiring a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…

In part one of what has quickly developed into an enthralling series, I made the points that (1) at least some important software doesn’t provide a probability value for cluster outputs, and (2) that while it’s possible to cluster any data set, univariate or multivariate, into clearly distinct groups, so doing doesn’t necessarily mean anything important. Such outputs only tell us something useful if there is some actual structure in the data, and the clustering algorithm can detect it.

But just what is “structure” in the data? The univariate case is simplest because with multivariate data, structure can have two different aspects. But in either situation we can take the standard statistical stance that structure is the detectable departure from random expectation, at some defined probability criterion (p value). The Poisson and gamma distributions define this expectation, the former for count (integer valued) data, and the latter for continuous data. By “expectation” I mean the expected distribution of values across the full data. If we have a calculated overall mean value, i.e. an overall rate, the Poisson and gamma then define this distribution, assuming each value is measured over an identical sampling interval. With the Poisson, the latter takes the form of a fixed denominator, whereas with the gamma it takes the form of a fixed numerator.

Using the familiar example of density (number of objects per unit space or time), the Poisson fixes the unit space while the integer number of objects in each unit varies, whereas the gamma fixes the integer rank of the objects that will be measured to from random starting points, with the distance to each such object (and corresponding area thereof) varying. The two approaches are just flip sides of the same coin really, but with very important practical considerations related to both data collection and mathematical bias. Without getting heavily into the data collection issue here, the Poisson approach–counting the number of objects in areas of pre-defined size–can get you into real trouble in terms of time efficiency (regardless of whether density tends to be low or high). This consideration is very important in opting for distance-based sampling and the use of the gamma distribution over area-based sampling and use of the Poisson.

But returning to the original problem as discussed in part one, the two standard clustering approaches–k-means and hierarchical–are always going to return groupings that are of low probability of random occurrence, no matter what “natural structure” in the data there may actually be. The solution, it seems to me, is to instead evaluate relative probabilities: the probability of the values within each group being Poisson or gamma distributed, relative to the probability of the overall distribution of values. In each case these probabilities are determined by a goodness-of-fit test, namely a Chi-square test for count (integer) data and a Kolmogorov-Smirnov test for continuous data. If there is in fact some natural structure in the data–that is, groups of values that are overly similar (or dissimilar) to each other than that defined by the Poisson or gamma–then this relative probability (or odds ratio if you like), will be maximized at the clustering solution that most closely reflects the actual structure in the data, this solution being defined by (1) the number of groups, and (2) the membership of each. It is a maximum likelihood approach to the problem.

If there is little or no actual structure in the data, then these odds ratios computed across different numbers of final groups will show no clearly defensible maximal value, but rather a broad, flat plateau in which all the ratios are similar, varying from each other only randomly. But when there is real structure therein, there will be a ratio that is quantifiably higher than all others, a unimodal response with a peak value. The statistical significance of this maximum can be evaluated with the Likelihood Ratio test or something similar, though I haven’t thought very hard about that issue yet.

Moving from the univariate case, to the multivariate, ain’t not no big deal really, in terms of the above discussion–it just requires averaging those odds ratios over all variables. But multivariate data does introduce a second, subtle aspect into what we mean by the term “data structure”, in the following respect. It is a possible situation wherein no variable in the data shows clear evidence of structure, per the above approach, when in fact there very much is such, but of a different kind. That outcome would occur whenever particular pairs (or larger groups) of variables are correlated with each other (above random expectation), even though the values for each such variable are in fact Poisson/gamma distributed overall. That is, there is a statistically defensible relationship between variables across sample units, but no detectable variation in values within each variable, across those sample units.

Such an outcome would provide definite evidence of behavioral similarity among variables even in the absence of a structuring of those variables by some latent (unmeasured) variable. I think it would be interesting to know how often such a situation occurs in different types of ecological and other systems, and I’m pretty sure nobody’s done any such analysis. Bear in mind however that I also once thought, at about 4:30 AM on a sleep deprived week if I remember right, that it would be interesting to see if I could beat the Tahoe casinos at blackjack based on quick probability considerations.

I hope the above has made at least some sense and you have not damaged your computer by say, throwing your coffee mug through the screen, or yelled something untoward, at volume, within earshot of those who might take offense. The Institute hereby disavows any responsibility, liability or other legal or financial connection to such events, past or future.

In ecology and other sciences, grouping similar objects together for further analytical purposes, or just as an end in itself, is a fundamental task, one accomplished by cluster analysis, one of the most fundamental tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly gets enormous, and it is necessary therefore to both (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusters, and (2) choosing the best solution from among those that seem at least reasonable.

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definitions of “distance”. This distance matrix increases in size as a function of (n-1)(n/2), or roughly a squared function of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which typically the average scientist does not.

The k-means clustering algorithm works differently–it doesn’t use a distance matrix. Instead it chooses a number of random cluster starting points (“centers”) and then measures the distance to all objects from those points, and agglomerates stepwise according to which objects are closest to which centers. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then, the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all possible cluster numbers that one deems reasonable, with “reasonable” often being less than obvious.

When I first did a k-means cluster analysis, years ago, I did it in SPSS and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed to be a pretty major shortcoming, possibly inexcusable. Now I’m working in R, and I find…the same thing. In R, the two workhorse clustering algorithms, both in the main stats package are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. In neither method is the probability of the solution given as part of the output. So, it wasn’t just SPSS–if R doesn’t provide it, then it’s quite possible that no statistical software program (SAS, S-Plus, SigmaStat, etc.) does so, although I don’t know for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”, function optCluster in the package of the same name. But that function, while definitely useful, only appears to provide a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, but with no actual probabilities attached to any of them. What I’m after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place say, 0 to 3 in one group and 4 to 9 in the other, or similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9} it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.

This point is critical, as it indicates that we should seek a clustering evaluation method that is based in an algorithm capable of making this discrimination between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that is most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method), follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages stats and cluster.

Well we’re long overdue for another installment of “Ask the Self-Appointed Experts“, or at least for the question part. In today’s edition a follower from Two Forks Montana wrestles with the following conundrum, inviting others to help, or at least reassure him that he is not confused alone. He writes:

I know this issue is all over AM talk radio but the inmates pretty clearly run the asylum there and I’m more confused on the following issue than ever.

It is well known that, given a known rate process, the gamma distribution defines the expected values from random starting points to the nth “nearest” object, and so inversely, we can estimate unknown rates by such values. For example, in a forest of randomly distributed trees, the circular area, a, defined by the distance to the nth closest tree, will estimate tree density. But as Skellam (1952), Moore (1954) and Pollard (1971) showed analytically, these estimates are biased, in inverse magnitude to the value of n, specifically, as n/(n-1) for n > 1. Thus, the distance to, say, the 2nd closest tree will correspond to the area represented by one tree, not two. All very well and good.

Now, the mean of the integration of the gamma distribution from 0 to 1, for a known rate, should return the mean area a, but when I closely approximate the integral (in R, which can’t integrate), I seem to get bias-corrected values reflecting the rates, rather than the biased values reflecting the areas (a) to the nth object. I’m flummoxed and not a little aggravated. Do they know what they’re doing there at R headquarters, or is it me that’s got all turned about the wrong way? If I can’t even trust the values from the statistical distributions in R, then just what can I trust there? I tried taking my mind off the matter by following the Winnipeg Jets (WJPCA, 2015), but man, one can just take only so much of that and I sure as hell ain’t going to follow Edmonton. The ice fishing seems to help, at least until the alcohol wears off, but really there should be an 800 number I think. If you can put me onto some clues I would be most grateful.

This post is about estimating a rate process, particularly when certain required input data are missing or critical assumptions are unmet. Although it’s quite unrelated to recent posts (here and here) I’m actually headed somewhere definite with the collection of them, if you can bear with me. If you can’t bear with me, I can bear that.

As we know, estimating rate processes is a very common task in science, where “rate” is defined generally as a measured change in one variable per unit measure of another, usually expressed as a ratio or fraction. In the discussion that follows I’ll use object density–the number of objects of interest per unit area–to illustrate the concepts, but they are applicable to rates of any type.

Obtaining rate estimates, with rare exceptions, requires empirical sampling over the domain of interest, followed by some mathematical operation on the samples, often an averaging. Sampling, however, can take two opposing approaches, as determined by the measurement units of the denominator and the numerator. Using our density example, in the first approach we tally the objects occurring within samples of some apriori defined area (the denominator thus fixed), and then simply average the values over all samples. In the second distance sampling (DS) approach, we instead measure to the nth closest objects from random starting points, with the value of n chosen apriori, and thus with the numerator fixed and the denominator varying. Low values of n are typically chosen for convenience, and distances are converted to density using a suitable density estimator.

This DS approach has definite advantages. For one, it is time efficient, as no plot delineations are required, and decisions on which objects to tally or which have already been so, are typically easier. A second advantage is that the total sample size is independent of object density, and a third is an often superior scatter of samples through the study area, leading to a better characterization of spatial pattern.

The data from the two approaches are modeled by two well-established statistical models, the Poisson and the gamma, respectively. With the Poisson, the number of objects falling in each of a collection of plots will follow a Poisson distribution, the exact values determined by the overall density. With the second approach, the distances to the nth closest objects follow a gamma distribution.

Either approach requires that objects be randomly located throughout the study area for unbiased estimates, but there is a second potential bias source with distance sampling (DS), and hence a down-side. This bias has magnitude of (n+1)/n, and so e.g., measuring to the n = 1st closest object will bias the density estimate by a factor of 2x, to the n = 2nd closest by 3/2, etc. This bias is easily removed simply by multiplying by the inverse, n/(n+1), or equivalently, measuring to the 2nd closest objects will give the area corresponding to unit density, that is, the area at which object density equals one. Which is kind of interesting, assuming you’ve made it this far and nothing more enthralling occupies your mind at the moment.

The shape of the gamma distribution varies with n. When n = 1, the gamma assumes a negative exponential shape, and when n > 1 it is unimodal, strongly skewed at low values of n, but of decreasing skew (more normal) at higher n. The upshot is that one can diagnose n if it is unknown, because these distributions will all differ, at least somewhat, especially in their higher moments, such as the variance and skew. However, discriminatory power decreases with increasing n, and the approach also assumes the objects occur spatially randomly, which they of course may in fact not.

If our rate denominator measurement space is defined on more than one dimension–as it is in our example (area, two dimensions)–we can subdivide it into equally sized sectors and measure distances to the nth closest object within each sector. Sectors here would equate to angles emanating from the sample point, and this “angle-order” method increases the sample size at each sample point. Supposing four sectors, for a given value of n, the four measurements at a point give (four choose two) = six possible ratios between the objects, these ratios being defined as the further distance over the closer. The distributions of these six values, over a collection of sample points, are then collectively discriminatory for n.

Usually both the rate and its homogeneity are unknown, and we need the latter to get the former. If a non-random pattern exists, the non-randomness can be quantified (in several ways) if we know n, and density estimates made using suitable corrections. If we don’t know n but do know that the objects are in fact randomly arranged, we can still infer density, although not with nearly as much precision as when we do know n. Interestingly, the bias arising from a clustered non-random pattern is of the same direction as that from an under-estimate of n, both leading to density under-estimates. Similarly, a tendency toward spatial pattern regularity gives a bias in the same direction as an overestimate of n.

These last facts can be useful, because our main interest is not n or the spatial pattern, but rather the rate (density), and the effects of uncertainties in them are not additive, and can even be compensating, depending on the situation. This serves to greatly constrain the total uncertainty in the rate estimate, even when these critical pieces of input information are lacking. I’m going to try to expound on that in the next post in this utterly enthralling series.

Support for this post has been provided by the Society for Public Education of Not Very Well Recognized or Cared About Issues. Nothing in the post should be construed, or mis-construed, as in any way necessarily reflecting the views, opinions, sentiments, thoughts, conceptual leanings, quasi-conscious daydreaming or water cooler banter of said Society, or really of anyone in particular other than me, and even that is open to debate. You may now return to your regularly scheduled life.

Well not really–I discussed instead the very closely related topic of enumerating the outcomes of a given probabilistic process. And actually not so much a discussion as a monologue. But couldn’t somebody please address that other issue, it’s just been badly neglected… 🙂

Anyway, enumerating the possible allocations of n objects into q groups is rarely an end in itself; the probability, p of each is usually what we want. This is a multinomial probability (MP) problem when q > 2, and binomial (BP) when q = 2, in which we know apriori the per-trial p values and want to determine probabilities of the various possible outcomes over some number, n, of such trials. In the given example, the per-trial probabilities of group membership are all equal (1/6) and we want to know the probability of each possible result from n = 15 trials.

One has to be careful in defining exactly what “trials” and “sample sizes” constitute in these models though, because the number of trials can be nested. We could for example, conduct n2 = 100 higher level trials, in each of which the results from n1 = 2 lower level trials are combined. This is best exemplified by Hardy-Weinberg analysis in population genetics; a lower level trial consists of randomly choosing n1 = 2 alleles from the population and combining them into a genotype. This is repeated n2 times and the expected genotype frequencies, under certain assumption of population behavior, are then compared to the observed, to test whether the assumptions are likely met or not. If only two possible outcomes of a single trial (alleles in this case) exist in the population, the model is binomial, and if more than two, multinomial.

There are two types of MP/BP models, corresponding to whether or not group identity matters. When it does, the BP/MP coefficients determine the expected probabilities of each specific outcome. For n objects, q groups and group sizes a through f, these equal the number of permutations, as given by n! / (a!b!c!d!e!f!), where “!” is the factorial operator and 0! = 1 by definition. This formula is extremely useful; without it we’d have to enumerate all permutations of a given BP/MP process. And this would choke us quickly, as such values become astronomical in a hurry: with n = 15 and q = 6, we already have 6^15 = 470 billion possible permutations.

When group identity doesn’t matter, only the numerical distribution matters, and this will decrease the total number of coefficients but increase the value of each of them. For example, in locating the n = 50 closest trees to a random sampling point, one may want to know only the expected numerical distribution across the q angle sectors around the point. In that case, the allocation [2,1,1,0] into groups [a,b,c,d] would be identical to [0,1,1,2] and to 10 others, and these thus have to be aggregated. The number of aggregations is given by the number of permutations of the observed group sizes, which in turn depends on their variation. When all differ, e.g. [0,1,2,3,4,5] for n = 15, the number of equivalent outcomes is maximized, equaling q + (q-1) + (q-2)…+ 1 (in this case, 21) [Edit: oops, confused that with what follows; the number of permutations there is given by q!]. When some but not all of the group sizes are identical it’s more complex, as determined by products of factorials and permutations. When the number of identical group sizes is maximized, the equivalent outcomes are minimized, always at either q-1 or 1. In this case there are 6 variations of [n,0,0,0,0,0].

To get the desired probability density function, the raw MP/BP coefficients, obtained by either method, are standardized to sum to 1.0.

Next time I’m going to discuss a general solution to the problem of estimating the otherwise unknown relationships between pairs of objects involved in a rate process, such as density, the number of objects per unit area or time. These can be deduced analytically using multinomial and gamma probability models in conjunction.

It will make you call your neighbors over for a party and group discussion. If it doesn’t you get a full refund.

Suppose you have n objects and want to enumerate all possible unique allocations thereof among q groups, where “unique” is defined by numerical allocation only, independent of actual group membership for each object. Each group can take any size from 0 to n, contingent on the total in the other q-1 groups. I’m working on a problem where I have to do this; there might well be an available method for doing this, but one can waste a lot of time looking for such things in my experience, you typically learn a lot more when you have to solve a problem yourself. Anyway I couldn’t find one readily so I wrote this R function as a script. Feel free to use if it’s, well, useful.

There are many things I don’t understand regarding how various folks do things, and here’s a good example. It falls, to my mind, within what Brian McGill at the Dynamic Ecology blog last year called “statistical machismo”: the use of statistical methods that are more complicated than necessary in order to appear advanced or sophisticated, when these are either unnecessary (but trendy), or worse, clearly not the best choice for the problem at hand. “Cutting edge”, their practitioners like to think of them, but I’ve got no problem in calling them sophistry that stems from a lack of real statistical understanding, combined with a willingness to do whatever will maximize the chances of getting published, of which there rarely seems to be a shortage.

I’ve had, for some time, a growing suspicion that much of the use of Bayesian statistical methods in science falls pretty squarely in this category. That of course doesn’t mean I’m right, especially since I do not fully understand everything about modern Bayesian methods, but I get the basic ideas, and the following example is a good illustration of why I think that way. It relates to my recent cogitations on the design of a general, model-based partitioning (clustering) algorithm for a common type of categorical data: data in which each sample is represented by only a small fraction of the total number of categories. In such cases, clear associations between the various categories is far from obvious.

I started thinking about the issue in relation to the estimation of forest tree community types in some widespread and very important historical tree data sets, where each sample contains individuals from, at most, only two to four tree taxa (usually, species), when there may be upwards of 10 to 30 such taxa in the population over some large landscape area (earlier post on this topic here) However, the issue has by far its greatest application in the field of population genetics, specifically w.r.t. the goal of identifying cryptic population structure–that is identifiable groups of individuals who are breeding primarily or entirely among themselves (“demes”), leading to allele and genotype frequencies that vary characteristically from deme to deme, demes which are not otherwise readily identifiable by external phenotypic characters. These are groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase. The similarity with the tree data is that at each gene locus for any given diploid individual–which represents our sample–you have only two alleles, even though many such may occur in some larger, defined population.

In 2000, Pritchard et al. published what would have to be considered a landmark study, given that it’s been cited nearly 14,000 times since. This comes out to about 2.5 citations per day; I wouldn’t have guessed that so many popgen papers were even published at that kind of rate. The paper introduces a method and program (“STRUCTURE”) for the above-stated task, one based on Bayesian techniques, using Markov Chain Monte Carlo (MCMC), which is an iterative method for estimating parameters of the posterior distribution when no analytical techniques, or approximations thereof, are available. Furthermore, the paper has spawned several spin-off papers introducing various modifications, but all based on the same basic Bayesian/MCMC approach. And each of those has gotten hundreds to thousands of citations as well.

I freely admit that I am an unabashed maximum likelihood (ML) type of thinker when it comes to statistical inference and model selection. I’m far from convinced that Bayesianism offers any clear, definitive advantage over ML methods, while appearing to be saddled with some complex, time-consuming and uncertain estimation techniques (like MCMC), which is most definitely a disadvantage. To my view, Bayesianism as a whole might well fall within Brian’s machismo category, at least as employed in current practice, if not in its fundamental tenets. I very much doubt that many people who use it do so for any reason other than that a lot of others are using it, and so they just go with the flow, thinking all is kosher. Scientists do that a lot, at least until they develop their own understanding of the issues.

As I was thinking through the problem, it seemed to me pretty clear that, although a strict analytical solution was indeed not possible, one based on a ML approach, as heavily guided by expectations from binomial/multinomial probability and divisive clustering, was the way to go. Indeed, I can’t see any other logical and algorithmically efficient way to go about solving this type of problem. The underlying goal and assumptions remain the same as Pritchard et al’s, namely to find groups that approximate Hardy-Weinberg equilibrium, and which therefore represent approximately randomly mating groups. And there is also still a “Monte Carlo” procedure involved, but it’s quite different: always guided by a definite strategy, and much less intense and random than in the Bayesian/MCMC approach. As far as I can tell, nobody’s taken this approach (although I just found an Iowa State student’s dissertation from last year that might), and I don’t know why. I thought it was recognized that defaulting to a uniform (i.e. uninformative) prior probability distribution–because you really have no idea otherwise, or worse, when the idea of some “prior distribution” doesn’t even make sense to begin with–and you have quite a few parameters to estimate, that MCMC algorithms can be very slow to converge (if at all), and to do so to potentially unstable estimates at that. But that’s exactly what the authors did, and there are other limitations of the approach also, such as having to constrain the total number of possible demes to begin with–presumably because the algorithm would choke on the number of possible solutions otherwise.

These are the kinds of things I run into far more often than is desirable, and which generate a certain mix of confusion, frustration and depression. If I keep working on this topic–which I find fascinating and which, being statistical, generalizes to different fields, but which I really don’t have time for–I’ll post more details about the two approaches. The fan mail has been clamoring for posts on this topic. Or ask questions if you’re entirely desperate for something to do while waiting for the Final Four to go at it.