[Q] Universals, Statistics

Eddy Ruys <eddy.ruys at let.uu.nl> writes:
> Dear all,
>
> Bluntly: I would guess that all typological generalizations
> obtained by extracting (statistical) patterns, implicational
> or otherwise, from a given database of languages are meaningless.
>
> The reason I seem to feel this way is that, if those patterns
> are obtained in this manner, and then stated as generalizations over
> language types, one proceeds in a post-hoc manner. The correct
> procedure, I would guess, would be to first hypothesize that
> a certain pattern must exist, and then attempt to disconfirm this
> hypothesis on the basis of a data set. After reading those articles
> (but perhaps this is where my mistake lies) I was left with the
> feeling that this is not the way people proceed.
>
> It is as though (borrowing an example from Richard Feynman)
> I were to observe the license-plate number ANZ 192 on my way
> to work, then calculate the unlikelihood of observing
> exactly this plate, and conclude there is some significance
> to the observation, requiring an explanation.
> Even if this event itself were relatively unlikely (say, it's rare
> for a plate to start with three alphabetical characters), given the
> number of events I observe every day, some unlikely ones have a good
> chance of occurring. If I didn't decide beforehand that I was
> looking for rare license-plates (not rare hairdos), the observation
> is not interesting.
> Likewise, given the number of possible patterns in a data set, some
> statistically unlikely patterns will occur, even if the data set
> were completely random and there existed no underlying laws or
> tendencies governing human language variation.

Yes, this is an issue that comes up in any study involving the
post-hoc generation of hypotheses from sets of empirical observations.
However, this doesn't mean that it is impossible to calculate a degree
of confidence in a hypothesis generated this way. Statisticians are
well aware of the problem of "hallucinating" patterns in data and in
many cases have designed principled correcting factors. [1]

In the particular case you mention, where you scour a dataset looking
for SOME statistically significant pattern, there is a principled
correction to be made (called the Bonferroni correction). In essence
it takes into account the fact that if you are testing N true null
hypotheses on a single dataset, on average N/20 of them will be
rejected at the 5% significance level by sheer chance alone; the
correction is to test each individual hypothesis at the 5%/N level
instead.
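
To make the N/20 arithmetic concrete, here is a minimal Python sketch
(standard library only). It is an illustration under stated
assumptions, not a recipe: the function names are made up for this
example, the tests are assumed to be independent of each other, and
instead of running real tests it relies on the fact that, for a
continuous test statistic, a p-value under a true null hypothesis is
uniformly distributed on [0, 1].

```python
import random

ALPHA = 0.05  # the 5% significance level discussed above


def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of true nulls rejected by chance: N * alpha (N/20 at 5%)."""
    return n_tests * alpha


def familywise_error(n_tests, per_test_alpha):
    """Chance of at least one false rejection, ASSUMING independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test level that keeps the family-wise error at or below alpha."""
    return alpha / n_tests


def simulate_rejections(n_tests, threshold, trials=2000, seed=0):
    """Average number of rejections per trial when every null is true.

    Each p-value is drawn with rng.random() (uniform on [0, 1]) rather
    than computed from data; this is a simplifying assumption.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sum(rng.random() < threshold for _ in range(n_tests))
    return total / trials


if __name__ == "__main__":
    n = 100
    print("expected false positives:", expected_false_positives(n))
    print("P(at least one), uncorrected:", familywise_error(n, ALPHA))
    print("P(at least one), Bonferroni:",
          familywise_error(n, bonferroni_threshold(n)))
    print("simulated rejections, uncorrected:", simulate_rejections(n, ALPHA))
    print("simulated rejections, Bonferroni:",
          simulate_rejections(n, bonferroni_threshold(n)))
```

Note that every function here takes the number of tests as an
argument, so the sketch only works if you can say what N is.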

However, actually USING this correction (or, in many cases, other
corrections) requires a clear notion of how many null hypotheses you
are testing.
In the typical case for linguistic typology (as well as work in many
other non-linguistic disciplines), however, you are searching a
dataset for ANYTHING interesting and it is not clear how many null
hypotheses you could say you are testing. So of course you're right
that the ideal is developing hypotheses on one dataset and testing on
another. Unfortunately, when the datapoint is a whole language, large
reliable datasets are hard to come by.

On another note, a somewhat orthogonal complaint is that when
individual languages are datapoints, they are not really fully
independent from each other due to the relatedness of different
languages, and so the datasets are effectively somewhat smaller and
less symmetrical than they seem. This is a hotly debated topic in
linguistic typology. (In anthropology, where it's also an issue, it
is known as Galton's problem.)

Best,
Roger Levy

Footnotes:
[1] One of the simplest examples of a correcting factor arises when
estimating the variance of some attribute of a population: the
variance of that attribute within any sample of the population will on
average be an underestimate (the smaller the sample, the greater the
underestimate), which is why the sample variance is conventionally
computed with a divisor of n - 1 rather than n. In the extreme case of
a sample size of 1, there is obviously no variance within the sample.
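
The underestimate in [1] is easy to see in a small simulation. The
following Python sketch (standard library only) is an illustration
under assumptions of my choosing: the function name is hypothetical,
and the population is taken to be a standard normal distribution,
whose true variance is 1. It averages the uncorrected within-sample
variance (divisor n) over many samples and compares it with the same
quantity rescaled by n / (n - 1).

```python
import random
import statistics


def mean_variance_estimates(sample_size, trials=20000, seed=0):
    """Average the naive and the corrected variance estimates over many samples.

    Samples are drawn from a standard normal population (true variance 1);
    that choice of population is an assumption made for illustration.
    Returns (naive_mean, corrected_mean); corrected_mean is None when
    sample_size is 1, because the n - 1 divisor is then zero.
    """
    rng = random.Random(seed)
    naive_total = 0.0
    corrected_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        naive = statistics.pvariance(sample)  # divides by n
        naive_total += naive
        if sample_size > 1:
            # Rescaling by n / (n - 1) is equivalent to dividing by n - 1.
            corrected_total += naive * sample_size / (sample_size - 1)
    naive_mean = naive_total / trials
    corrected_mean = corrected_total / trials if sample_size > 1 else None
    return naive_mean, corrected_mean


if __name__ == "__main__":
    for n in (1, 2, 5, 30):
        naive, corrected = mean_variance_estimates(n)
        print(n, naive, corrected)
```

With a sample size of 1 the naive estimate is always exactly zero,
which is the extreme case mentioned in the footnote; as the sample
size grows, the naive average should approach the true value.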