Hey, guys! So, on another forum, I got into a discussion of whether linguistics is a real science. https://philosophicalvegan.com/viewtopi ... 613#p42530 My nickname there is Teo123. So, what do you think about that issue? As you can see by reading the last few posts in that thread, I think that saying linguistics is not a real science is an insane and profoundly anti-scientific assertion. The real question is where people get this nonsense from. I got a comment saying that years ago on yet another forum; I just didn't pay much attention to it. It appears as though quite a few people believe that.

Linguistics is multidisciplinary. Most linguists don't have a background in science, but some do. Since it focuses on the details of human culture, it is more of a social science (like archaeology) than a natural science (like chemistry). It's still "real".

Linguistics is a useful area of knowledge, helping us maintain the consistent meaning of formal statements over time, itself a crucial aspect of the sciences. So really, it's a choice of whether to classify it as a science, or to define science such that useful bodies of knowledge that science relies on are derived entirely from outside of it. I've no strong feelings on this taxonomic choice, though I should point out that the latter would play directly into the hands of chaos magicians.

Yeah, I'd put it in the category of "social science" because it's not as data-intensive as the hard sciences. But it's still definitely "science" in the sense that it asks and answers questions, uses hypotheses and models, and can make predictions.

If you don't buy the "predictions" part, I just read an article that predicted the plural second person pronoun "Y'all" would soon fill a gap in the English language, same as using "they" as a gender neutral pronoun is doing. Some writers are even arguing for its acceptance:

Yes, but it's always flagged as a "southern US regional" thing. I think what the article was talking about was putting it in the same class as "you're" or "they'll" or "I'm" as an established and accepted contraction.

"Linguistics" is such a broad field that it intersects with major swaths of both "natural" and "social" sciences - at one end, you have studies looking at how different evolutionary pathways in species resulted in the ability to make different kinds of sounds and using medical imaging technology to examine the relationship between the way bits of your vocal tract move and the frequencies that show up when you say different words; at another end, you have people looking at the varying meanings of the word "gay" across different generations, or the cultural and historical development of emojis.

All of those fields still develop experiments, collect data, and apply statistical analysis to draw conclusions and test hypotheses. They're all scientific, it's just that some of the science is more like what you'd call "hard science" if you were so inclined, while some of it isn't. If you think that linguistics is just figuring out how to write words in IPA or philosophically musing about how "library" and "liberty" sound kind of similar, then you don't understand what it really is.

I mean, what we were discussing is that I have studied Croatian toponyms for years, and I came to the idea that it is possible to prove the etymologies of some toponyms beyond a reasonable doubt using statistics. Some patterns in Croatian toponyms, such as that the first two consonants in river names tend to be 'k' (first) and 'r' (second), seem to have p-values as low as 1/10'000 (you can see the details about that on my web-page). And that discussion on the other forum ended with brimstoneSalad (the moderator) telling me he thinks I am bad at math and at the philosophy of science. I responded with "Who cares? Here are the arguments in a form that can be evaluated. It doesn't matter how good I am at mathematics if I present mathematical arguments in a form that can be evaluated." He responded with "Why should I bother to read from you? Since you are bad at math and you don't understand the philosophy of science, you are very unlikely to be correct." I said: "You might as well say you didn't believe the Theory of relativity because Einstein was allegedly bad at math." And he responded... by banning me for trolling (comparing myself to Einstein). And by "philosophy of science" he means some modified version of Comte's philosophy, including its political parts.

So, one of the most striking patterns I see in Croatian toponyms is that 5 river names start with k(+vowel)+r: Krka (once in Croatia, and once in Slovenia, but near Croatia), Korana, Karašica (2 rivers with the same name, both in Croatia), Krapina and Krbavica. So, since there are around 20 consonants in a language, there are around 20*20=400 possible consonant pairs for a river name to begin with. While not all consonants in a language occur equally often, an average language has slightly more than 20 consonants, so we can assume those two errors approximately offset each other. Let's assume there are around 100 river names in Croatia (it's a safe assumption, since many river names both repeat themselves and have obvious Slavic roots, so they don't count here). What's the probability the same consonant pair will appear 5 times if the names were really random? Let me be clear, we are not asking what's the probability that this particular pattern of k(+vowel)+r would appear in the river names, we are asking what's the probability that any such pattern would appear. This is the equivalent of asking: If you randomly choose a number between 1 and 400, 100 times, what's the probability you will choose the same number n times? So, because of the Birthday Paradox, it's indeed hard to calculate that probability analytically, but I've made a computer program that estimates that numerically. Here it goes in the C programming language (my favorite compiler for C is TCC, by the way):

So, that program estimates the probability of there being a consonant pair appearing at the beginning of the river names number_of_repeats times, where number_of_repeats is entered on the keyboard. The probability of there appearing to be a common two-consonant prefix in a certain number of rivers falls rapidly as that number grows. The result will vary a little from run to run, but it will always be approximately the same. So, here are some example sessions:

Because the probability of some consonant pair appearing at the beginning of the river names at least once is, by definition, 100% (unless we assume there is a language in which river names are made only of vowels).

The probability of there being any consonant pair repeating itself 4 times at the beginning of 100 random river names is significantly less than 1 percent. Taking there to be 6 such river names might be unfair, because the names of the two rivers named Karašica are exactly the same and it's possible that one was named after the other. Otherwise, the probability of there being a consonant pair repeating itself six times in random river names is so low the program can't estimate it:

So, the P-value of that pattern is indeed astoundingly low (less than 1/10000). I can not only claim that, if those river names come from the same language, there was a word *kar~kur meaning "to flow" with nearly 100% certainty, I can also claim they do come from the same language with nearly 100% certainty. The probabilities of some other patterns I found are indeed harder to model, but I expect their P-values to be similarly low.
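
For readers who want to reproduce this kind of estimate, here is a minimal Monte Carlo sketch of the simulation described above. It is written in Python rather than C and is not the program from this post; the function name `estimate_repeat_probability`, its parameters, the trial count, and the reading of "appears n times" as "appears at least n times" are all assumptions.

```python
import random
from collections import Counter


def estimate_repeat_probability(num_pairs=400, num_names=100, repeats=6,
                                trials=10_000, seed=0):
    """Estimate P(some consonant pair starts at least `repeats` of the names).

    Assumes every name draws its initial pair uniformly and independently
    from `num_pairs` equally likely pairs (the model stated in the post).
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = 0
    for _ in range(trials):
        # One simulated set of river names: count how often each pair occurs.
        counts = Counter(rng.randrange(num_pairs) for _ in range(num_names))
        if max(counts.values()) >= repeats:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 6):
        print(n, estimate_repeat_probability(repeats=n))
```

With `repeats=1` the estimate is exactly 1.0, matching the remark above that some pair always appears at least once. Very small probabilities need far more trials than the default before the estimate settles.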

The first error is in assuming that there are approximately 400 roughly equally likely word-initial consonant pairs in Croatian. You acknowledge that some combinations don't occur but massively underestimate the number. Most combinations that are valid will have an R or L at the second position or an S at the first position. Now, for the first two consonants in words that start CVC, your estimate is much better, but the combinations are still nowhere near equally common, which brings me to your next error.

Your null hypothesis here is that words are created in a way mathematically analogous to picking letters at random that satisfy orthographic rules. That is not the case. Compare the share of popular English first names starting with J to the share of words in general that do. There may indeed be interesting linguistic reasons for this (perhaps due to the popularity of Biblical names derived from Hebrew starting with Y), but the point is there are many possible explanations, and the fact that the letters are not distributed randomly cannot, on its own, support any specific explanation. It only supports the position that the pattern is not a coincidence of purely random assortment.

Right, but "I can't see it" is not a scientific argument. You should search the literature to see how much is already known about the etymologies of these river names and of other river names. You should see whether the k-r pattern is specific to rivers or if it also shows up in other vocabulary. You should see if it is unique to Croatian, or if it is a feature of other languages in the Balkans. There is a lot you could investigate to test your hypothesis or formulate more. This is what makes linguistics scientific, not the use of numbers.

Well, mainstream linguistics considers the river name "Karašica" to be related to the obsolete Croatian word "karaš", for "goldfish", but there are no goldfish there (and there is, as far as I know, no evidence that there have ever been), and neither is that the name for goldfish in the local dialect (I speak that dialect). Another commonly cited etymology is that "Karašica" comes from Turkish "kara su" (black water), but that agrees neither with historical phonology nor with basic facts (Karašica, the tributary to Dunav, is remarkably clear). Mainstream linguistics connects the name "Krbavica" with the Russian word "khrbat", meaning "mountain", as if the river was named after the mountain (even though the mountain has, as far as I am aware, never been called that way), and as if the Proto-Slavic *x ever turned into /k/ in the local dialect. Mainstream linguistics connects the river name "Korana" with the Celtic word "carnos" (stone). Mainstream linguistics connects the river name "Krapina" with the Latin word "carpa", meaning the carp fish (and why the long 'i' in the Latin suffix wouldn't turn into front yer remains, as far as I know, unexplained). And, as far as I know, nobody has suggested any etymology for the river name "Krka".

FlatAssembler wrote:Well, mainstream linguistics considers the river name "Karašica" to be related to the obsolete Croatian word "karaš", for "goldfish", but there are no goldfish there (and there is, as far as I know, no evidence that there have ever been), and neither is that the name for goldfish in the local dialect (I speak that dialect).

Well, you say the word is obsolete, so it certainly is not the local dialect of today. But it's definitely a reasonable guess, since the word is almost exactly the same.

Interestingly, Wikipedia suggests another possibility, but it is not cited. It claims there was a PIE word (not Illyrian) *(s)ker (not *kar) meaning "to cut" (not "to flow"). I don't know if I really buy that etymology either, but at least *(s)ker is a real word reconstructed based on a lot of evidence, not just the names of a few Croatian rivers. And it is another example of how your observation can be explained in multiple ways.

Another commonly cited etymology is that "Karašica" comes from Turkish "kara su" (black water), but that agrees neither with historical phonology nor with basic facts (Karašica, the tributary to Dunav, is remarkably clear).

And I live near a happy town called "Chagrin," and nobody knows why. Sometimes toponyms are not very accurate.

Mainstream linguistics connects the name "Krbavica" with the Russian word "khrbat", meaning "mountain", as if the river was named after the mountain (even though the mountain has, as far as I am aware, never been called that way), and as if the Proto-Slavic *x ever turned into /k/ in the local dialect. Mainstream linguistics connects the river name "Korana" with the Celtic word "carnos" (stone). Mainstream linguistics connects the river name "Krapina" with the Latin word "carpa", meaning the carp fish (and why the long 'i' in the Latin suffix wouldn't turn into front yer remains, as far as I know, unexplained). And, as far as I know, nobody has suggested any etymology for the river name "Krka".

So, doesn't Occam's razor strongly support my theory?

No, it seems like most of these names already have plausible etymologies. You do not provide any unique etymologies at all, simply claiming that all the k-r at the beginning mean "flow" and the rest you just don't know, apparently tons of deviations on that theme. That does not make "fewer assumptions," because it leaves all the relevant detail of every word unexplained except two letters, nor can it predict when a river should start with those letters and when it should not. I would sooner buy the established etymologies, put forward by professional linguists who had some evidence for their assertions, than your assertion which is apparently based on no evidence at all.

As I said, it's only extraordinary if we assume all letters should be equally common at the start of words and that words are formed completely at random, both of which are obviously not true. So your data sort of demonstrate that fact, but nothing more.

FlatAssembler wrote:Well, the p-value of 1/10'000 makes dismissing that pattern as a coincidence quite an extraordinary, if not a silly, claim, doesn't it?

Only as far as it gives evidence against your null hypothesis that the words were formed by pulling letters from a bag. That's not a typical null hypothesis in linguistics, though. p-values are a tricky thing to interpret if you don't set things up properly.

Based on this chart of Croatian character frequency, picking two random consonants should result in 'k' then 'r' about 1/150 of the time, rather than 1/400. That'll be different if we focus more specifically on consecutive consonants starting at the beginning of a word, but it will work for an illustrative comparison:

Using the Poisson approximation to the binomial distribution, the likelihood of getting a particular 1/400 chance event 6+ times in 100 is about 2.7e-7. The likelihood of getting a particular 1/150 chance event 6+ times in 100 is about 6.9e-5. It's over 250 times more likely, in other words.
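
The two figures can be checked with a few lines of Python. This is only a sketch of the Poisson calculation named above; `poisson_tail` is a hypothetical helper, and the rate is taken to be 100 trials times the per-trial probability.

```python
import math


def poisson_tail(lam, k_min):
    """P(X >= k_min) for X ~ Poisson(lam), as 1 minus the lower terms."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                for k in range(k_min))
    return 1.0 - below


if __name__ == "__main__":
    p_400 = poisson_tail(100 / 400, 6)  # a particular 1-in-400 event, 6+ times
    p_150 = poisson_tail(100 / 150, 6)  # a particular 1-in-150 event, 6+ times
    print(p_400, p_150, p_150 / p_400)
```

Note that this is the probability for one pre-chosen pair; it does not account for the fact that any pair could have turned out to be the surprising one.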

Unless stated otherwise, I do not care whether a statement, by itself, constitutes a persuasive political argument. I care whether it's true.---If this post has math that doesn't work for you, use TeX the World for Firefox or Chrome

Well, yes, in texts, a handful of words occur much more commonly than other words do (in Croatian, or any other language), and the sounds that occur in those words are more common in texts than the sounds that don't happen to occur in those words. What does that have to do with what we can expect to happen in unrelated toponyms? Chances are, many of the Croatian river names are Pre-Indo-European in origin. Almost no river changed its name in the last 2000 years since it was first mentioned (even some of those that appear to be semantically transparent in Croatian, such as Vuka, in fact come from an older language, since Vuka was called Ulca in antiquity), and we can expect that the river names were changing at the same rate further in the past. They are layered through different languages with different phoneme frequencies, and those phoneme frequencies will, apart from the fact that vowels occur more commonly than consonants, cancel each other out. From the perspective of modern languages, the river names are formed completely randomly. Do you think I am missing something?

FlatAssembler wrote:Well, yes, in texts, a handful of words occur much more commonly than other words do (in Croatian, or any other language), and the sounds that occur in those words are more common in texts than the sounds that don't happen to occur in those words. What does that have to do with what we can expect to happen in unrelated toponyms?

No, that is not the effect we are talking about. It is not the distribution of words that is in question but the distribution of letters in words. Some initial letter combinations show up in far more words than other combinations do. Your null hypothesis was that they should all be exactly equally likely, and you claim to have proved the null hypothesis false, but that does not allow you to substitute whatever other hypothesis you choose. We already knew that some letters showed up more than others; that is not a surprising result.

Chances are, many of the Croatian river names are Pre-Indo-European in origin. Almost no river changed its name in the last 2000 years since it was first mentioned (even some of those that appear to be semantically transparent in Croatian, such as Vuka, in fact come from an older language, since Vuka was called Ulca in antiquity), and we can expect that the river names were changing at the same rate further in the past. They are layered through different languages with different phoneme frequencies, and those phoneme frequencies will, apart from the fact that vowels occur more commonly than consonants, cancel each other out. From the perspective of modern languages, the river names are formed completely randomly. Do you think I am missing something?

I am certain you are missing something, because phoneme and letter frequencies have never been close to uniform in this context in any language, including PIE, and these effects do not cancel out, particularly within a single language family and in words you claim are very old.

I don't really understand what you doubt. Do you agree that the frequencies of letters will be more uniform if you count in a dictionary than if you count in a text (full of pronouns and particles)? Do you agree that the Croatian toponyms, especially river names, come from different languages from different language families? Do you agree that different languages have, apart from vowels being more common than consonants, very different phoneme frequencies? Do you agree that those phoneme frequencies of different languages will mostly cancel each other out?

FlatAssembler wrote:Do you agree that the frequencies of letters will be more uniform if you count in a dictionary than if you count in a text (full of pronouns and particles)?

No.

The distribution is likely to be different, but there's no reason to assume it's more uniform. Certain sounds and letters are more common than others whether you count in a natural-language corpus or a word list.

For example, we would expect approximately zero English toponyms to start with "hrv" because that's not a combination that's used in English, so words that start like that in other languages get changed when adopted into English. You don't need any real or hypothesized etymological roots to explain the disproportionate lack of that trigram.

Edit: Here are some statistics about an English 15500-word list. More than 20% of words on the list start with p or s. Less than 2.5% start with x, y, or j.

FlatAssembler wrote:Besides, even if we assume there are only 150 consonant pairs that are likely to occur, my program says that the p-value is 0.85%. That's still quite significant, isn't it?

Also no, because we're not just talking about consonant pairs, we're talking about the first two consonants of a word, which has yet a different distribution.

16/50 US states start with M or N, which constitute only 1/13 of the alphabet. This would randomly happen less than 8 times in a million.

You're doing the equivalent of looking at that fact and hypothesizing a proto-language where an initial nasal means something like "state" or "region".

FlatAssembler wrote:frequencies of letters will be more uniform if you count in a dictionary than if you count in a text

In the spellcheck dictionary for Croatian that I found here, 9185 out of 568803 entries start with 'k' and have 'r' as the next consonant. This is in fact even less uniform than the 1/150 figure I used before, and becomes more disproportionate if you remove acronyms from the rest of the list to bring down the total.

All things considered, it looks like almost 1 in 60 Croatian words start with k(v)r, and "small" differences in the probability make enormous differences when you count large sets.

I just randomly generated 100,000 lists of 100 integers in the range 1-60. 33,218 of them had 6 or more matches.

So unless you've got reason to believe a significant number of those other k.r words also come from this alleged "flow" root, you appear to have a p-value of 33.2%.
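
As a rough analytic cross-check on the simulated figure, one can treat the per-value counts as if they were independent binomials. That independence is an approximation (the counts have to sum to 100), and the function names below are assumptions for illustration only.

```python
from math import comb


def binomial_upper_tail(n, p, k_min):
    """Exact P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))


def prob_any_value_repeats(num_values, num_draws, repeats):
    """Approximate P(at least one value occurs >= `repeats` times).

    Treats the per-value counts as independent; in reality they are
    slightly negatively correlated because they sum to `num_draws`.
    """
    q = binomial_upper_tail(num_draws, 1 / num_values, repeats)
    return 1 - (1 - q) ** num_values


if __name__ == "__main__":
    print(prob_any_value_repeats(60, 100, 6))   # about one in three
    print(prob_any_value_repeats(400, 100, 6))  # on the order of 1e-4
```

The gap between the two printed values shows how strongly the result depends on the assumed per-name probability (1/60 versus 1/400).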

we would expect approximately zero English toponyms to start with "hrv" because that's not a combination that's used in English, so words that start like that in other languages get changed when adopted into English.

Well, yes, but there are words in English whose first two consonants are h and r: hear, hair, here, hereafter... And those probably exist in any language that has 'h' and 'r' sounds. Now, I am not aware that there are any words in English whose first three consonants are h, r and v, in that order. And you really shouldn't expect there to be any such word in English by probability alone: some three-consonant combinations will happen to occur in a language, and some won't. The only words in Croatian I can think of whose first three consonants are 'h', 'r' and 'v' are "Hrvatska" (Croatian name for Croatia) and "hrvati" (to fight).

More than 20% of words on the list start with p or s.

I must admit I haven't really looked into it very much. When I think about it, it seems pretty obvious: some prefixes are much more common than others. Quite a few words can be formed by adding para-, pseudo- or sub- (or its equivalent suf-) to an existing word. The dictionary you cite doesn't include many of the words starting with the Latin prefix re- (not even the word "repeat"). Once you include those, such as in the Concise Oxford Dictionary, the 'r' becomes the third most common letter for a word to start with, and the most common consonant for a word to start with. Also, quite a few words in Croatian can be formed by adding the ka- prefix ("near" or "towards") to an existing word. Is the same true in toponyms? Well, maybe "Knin" means "on the road towards Nin", and maybe "Karašica" means "flowing towards *Rašica", where *Rašica (perhaps an unattested name for Danube) is to be compared to the river names such as Raša or Raška. It sounds like folk-etymology, but once you take into account how many words are formed by adding a prefix to an existing word, it doesn't sound as ridiculous.

This would randomly happen less than 8 times in a million.

I am not sure how you calculated that; a simple computer simulation tells me that the p-value is around 4/10'000. You won't get p-values that low for patterns in toponyms. Besides, I don't quite see why it's absurd to propose that something like *mans once meant "state" in many native American languages. It doesn't need to be from a proto-language, such words easily get borrowed between languages.

So unless you've got reason to believe a significant number of those other k.r words also come from this alleged "flow" root, you appear to have a p-value of 33.2%.

Well, maybe the words "krv" (blood) and "kretati" (to move) were borrowed into Proto-Slavic from Illyrian. The word "krv" certainly isn't related to Latin "sanguis" and Greek "haima". However, the word "krenuti" appears to contain the typical Slavic prefix k(a)- (towards). It's hard to tell, but probably no, they are probably not Illyrian in origin. As for the p-value, like I've said, I believe it's established that different languages, apart from the fact that vowels are more common than consonants, have drastically different phoneme frequencies, and that Croatian river names come from different languages and probably from different language families. So, again, don't you think those phoneme frequencies, in different languages the toponyms come from, can be assumed to cancel each other out?

That is in fact less than 0.6 times in a million. I'm not sure what calculation gmal did or what you did. The function pbinom computes the probability Pr[X≤15] for a random variable X ~ Binomial(50, 1/13), so taking the complement should give Pr[X>15] = Pr[X≥16].
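
The same complement can be sketched in Python. Here `binom_cdf` is a hypothetical helper standing in for R's `pbinom` (it is not a library function), and the direct sum is included only as a sanity check on the subtraction.

```python
from math import comb


def binom_cdf(k, n, p):
    """Pr[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def binom_upper_tail(k_min, n, p):
    """Pr[X >= k_min], summed directly rather than via the complement."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k_min, n + 1))


if __name__ == "__main__":
    # 16 or more of 50 states landing on letters that make up 1/13 of the alphabet
    via_complement = 1 - binom_cdf(15, 50, 1 / 13)
    direct = binom_upper_tail(16, 50, 1 / 13)
    print(via_complement, direct)
```

Both routes should agree to many decimal places, and the result is the probability for one pre-chosen pair of letters only.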

Besides, I don't quite see why it's absurd to propose that something like *mans once meant "state" in many native American languages. It doesn't need to be from a proto-language, such words easily get borrowed between languages.

It may not be absurd, but it is definitely factually incorrect. That's sort of the point: you have to bring your knowledge of a language and its history to bear when studying it; simply performing a test which rejects a null hypothesis of randomly assorted letters won't get you anywhere. If someone proposed this "states start MN because they borrowed native American words that started with 'man'" hypothesis, they would be both unjustified and wrong, but they would be patterning themselves after your example.

So unless you've got reason to believe a significant number of those other k.r words also come from this alleged "flow" root, you appear to have a p-value of 33.2%.

Well, maybe the words "krv" (blood) and "kretati" (to move) were borrowed into Proto-Slavic from Illyrian. The word "krv" certainly isn't related to Latin "sanguis" and Greek "haima". However, the word "krenuti" appears to contain the typical Slavic prefix k(a)- (towards). It's hard to tell, but probably no, they are probably not Illyrian in origin. As for the p-value, like I've said, I believe it's established that different languages, apart from the fact that vowels are more common than consonants, have drastically different phoneme frequencies, and that Croatian river names come from different languages and probably from different language families. So, again, don't you think those phoneme frequencies, in different languages the toponyms come from, can be assumed to cancel each other out?

You just saw that they clearly don't cancel out in this exact case: Croatian words starting with kr. There are any of a number of possible reasons for it, but your proposal cannot explain the vast majority of such words, only specifically rivers, and you have no specific evidence for the case of rivers. That is to say, there is no evidence that rivers are especially likely in Croatian to start with kr. There is rather evidence that Croatian words in general are likely to start with kr, including rivers. And using this correct null hypothesis gives a p-value of 33.2%, i.e. statistically meaningless.

I multiplied the result by 13 as a rough approximation of the probability that any of the 13 consecutive pairs of letters would be surprising at that frequency. That overcounts cases of two pairs both occurring 16+ times, which is a pretty negligible probability, but also you and I both ignored the possibility of nonconsecutive pairs adding up to 16 out of 50.

The reason FlatAssembler is using a simulation and mentioned the birthday problem is that we are essentially making multiple comparisons and have to account for that in our statistical calculations. The correct question is the probability that anything as or more "surprising" than what we observe would happen.

So fine, maybe the correct value is 0.0004. Blindly positing an etymological root on that basis is still ridiculous. Especially when you remember that "New" and "North" account for half the N's with no further explanation needed.

Especially when you remember that "New" and "North" account for half the N's with no further explanation needed.

And why wouldn't that be an English folk-etymology? You know, like the Croatian river name "Vuka" appears to come from the word "vuk" (wolf), when it actually comes from the ancient name "Ulca". And like the toponyms "Vučica", "Vukovar" and "Vučedol" also appear to come from the Croatian word for "wolf", when they are actually probably named after the Vuka river. And like the river name "Cetina" appears to derive from "cetan" (cold), when in fact it comes from the ancient name "Centona". And like the island name "Lastovo" appears to come from the Croatian word "lasta" (swallow bird), when it in fact comes from the ancient name "Ladesta". And there are many more such examples.

Because we already know the etymology of most (or all?) of the states' names. "Montana" comes from Spanish for "mountain". "Minnesota" comes from the name of a river (as do several other states). Even if we didn't specifically know that the initial "mni" part means "water" in Dakota, it would be silly to assume a river's name starts with a word meaning "state".

And as I said above, "New" and "North" account for half the N states. Those words definitely don't mean "state" in any language.

There are any of a number of possible reasons for it, but your proposal cannot explain the vast majority of such words, only specifically rivers, and you have no specific evidence for the case of rivers.

And the fact that the river name "Krka" was attested by Strabo in the 7th book in the 5th chapter of Geography as "Corcoras", way before Croatian language even existed, doesn't count?

What should it count for? Yeah, some of those rivers were named a long time ago. That in itself doesn't provide evidence for anything.

FlatAssembler wrote:

Especially when you remember that "New" and "North" account for half the N's with no further explanation needed.

And why wouldn't that be an English folk-etymology?

Because we know when they were named and why, and often by whom. North Carolina and North Dakota are so named because they're the northern of two states with the same name. New York and New Jersey (and other places in the world like New Brunswick and Nova Scotia and New Zealand and New South Wales and on and on) are so named because colonists were an exceptionally uncreative bunch.

I used US states as an example specifically because there's an explicit written record of the etymology of their names, and here you are still insisting on a ridiculous hypothesis.

Linguistics absolutely can be a real science, but what you're doing is pseudoscience.

It provides the evidence that at least some river names with that k-r-pattern don't come from Croatian. So, why assume others do, especially since none of those have obvious Croatian etymologies?

I used US states as an example specifically because there's an explicit written record of the etymology of their names

And we also have written records, by Strabo, that the name Pharos (ancient name for Hvar) comes from the name of the Greek island of Paros (because people from Paros allegedly colonized it), and that the name Issa (ancient name for Vis) comes from the name of the Illyrian king Ionius. And we have written records, by Constantine Porphyrogenitus, that the name "Croatia" comes from the Greek word for land, "khora". Do you believe those assertions? Written records are not God's word when it comes to etymologies.

OK, we aren't assuming anything. We are pointing out that you have so far based your entire hypothesis on the assertion that Croatian names of rivers are particularly likely to start kr, when they aren't. There is literally nothing to explain.

FlatAssembler wrote: Written records are not God's word when it comes to etymologies.

I don't mean written hypotheses, I mean actual written *records* of how those states were named. The Carolinas were named for King Charles. North Carolina was named for being to the north of South Carolina. Maryland was named for Queen Mary. That's not a false etymology hypothesized later based on ignorance, that's actually what the people who named those places said they were naming them for.

You might be able to argue that we're incorrect about some of the indigenous sources of state names, but states that English (or French or Spanish) settlers named with straightforward English (or French or Spanish) words are really not up for debate. You're just engaging in stubborn denialism about this, because it's inconvenient for your pet theory.


We are pointing out that you have so far based your entire hypothesis on the assertion that Croatian names of rivers are particularly likely to start kr, when they aren't. There is literally nothing to explain.

They are disproportionately likely to start with "kr", even if we take that the probability of that is 1/60. Assuming there are 100 river names in Croatia without an evident Croatian etymology, we should expect 1 or 2 of them to start with "kr", not 6 of them. No problem there.

The question is how we should calculate the p-value. You are arguing that we should assign a high a-priori probability to a river name starting with "k(V)r-", because a disproportionate number of Croatian words in general start with "k(V)r-", even though we have no reason to think those river names come from Croatian. In fact, at least one of those river names with the k-r-pattern was attested before Croatian even existed.

If we are going to assume those river names come from Croatian, shouldn't we assign a high a-priori probability to a v-d-pattern (as in the Croatian word for water, "voda", and Proto-Indo-European *wed), an r-k-pattern (as in the Croatian word for river, "rijeka"), a t-ć-pattern (as in the Croatian word for flow, "teći"), a hypothetical p-l-pattern (as in the Proto-Indo-European for flow, *plew), a hypothetical h-p-pattern (as in the Proto-Indo-European for body of water, *h2ep), and a hypothetical d-n-pattern (as in the Proto-Indo-European for river, *danu)? In contrast, the k-r-pattern seems to be more common in Croatian river names than any of those patterns (the only Croatian river names apparently named after the Croatian for "river" are, as far as I know, Rečina and Ričica, and the only Croatian river name fitting the d-n-pattern is the Croatian name for the Danube, "Dunav").

Why do we then accept that *danu once did mean "river", yet we don't accept that *kar~kur once meant something related to river?

I don't mean written hypotheses, I mean actual written *records* of how those states were named.

Fine, maybe. Still, supposing that *mans meant "state" is different than supposing *kar~kur meant "to flow" for a few reasons. First, as you also note, "North" and "New" are actual English words that would plausibly be a part of a name of a state. The same is not true for the Croatian river names matching the k-r-pattern. Second, as you also note, we have good reasons to think "Mississippi" at first didn't refer to a state, but to a river. There is no good reason to think the Croatian river names matching the k-r-pattern didn't refer to those rivers all along.

There are at least five river names in Croatia that appear to be derived from the Croatian word for birch, "breza". You can perhaps also claim *brez meant "river" in some language, but since there is a Croatian word from which those river names can somewhat plausibly be derived, the argument for that wouldn't be as strong as the argument for *kar~kur meaning that.

We are pointing out that you have so far based your entire hypothesis on the assertion that Croatian names of rivers are particularly likely to start kr, when they aren't. There is literally nothing to explain.

They are disproportionately likely to start with "kr", even if we take that the probability of that is 1/60. Assuming there are 100 river names in Croatia without an evident Croatian etymology, we should expect 1 or 2 of them to start with "kr", not 6 of them. No problem there.

Careful you don't fall for exactly the same trick you warned Eebster against a few posts ago. We'd expect 1 or 2 to start with a particular 1/60 probability pair of sounds, and would be very surprised to find that particular pair 6 times. But as I already explained above, we'd expect some 1-in-60 probability event to happen 6 times out of 100 about 1/3 of the time.
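For anyone who wants to check that figure, here is a rough sketch in Python. It uses the numbers from this thread (100 river names, a 1/60 chance for any one particular onset) plus two simplifications that are assumptions of the sketch, not something established above: that there are exactly 60 equally likely onsets, and that their counts can be treated as independent. The names binom_tail, p_specific and p_any are made up for the illustration.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


N_RIVERS = 100    # river names without an evident etymology (figure from the thread)
P_ONSET = 1 / 60  # chance of one particular onset such as "kr" (figure from the thread)
N_ONSETS = 60     # ASSUMPTION: 60 equally likely onsets
THRESHOLD = 6     # how many times the onset was observed

# Chance that one onset chosen in advance shows up at least 6 times.
p_specific = binom_tail(N_RIVERS, P_ONSET, THRESHOLD)

# Chance that at least one of the 60 onsets shows up at least 6 times.
# ASSUMPTION: the counts are treated as independent, which is only approximate.
p_any = 1.0 - (1.0 - p_specific) ** N_ONSETS

print(f"one pre-chosen onset: {p_specific:.4f}")
print(f"any onset at all:     {p_any:.2f}")
```

Under those assumptions, the chance for one pre-chosen onset comes out well under 1%, while the chance for some onset or other comes out at roughly a third. Different assumptions about onset frequencies would move both numbers.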

It would be surprising if two people in a group of 23 shared your birthday, but it's not surprising to find a pair that share a birthday.
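The same contrast can be worked out for birthdays. This is a minimal sketch assuming 365 equally likely birthdays (no leap years) and reading "a group of 23" as 23 other people when checking against one fixed date; both function names are hypothetical.

```python
def p_any_shared_birthday(group: int, days: int = 365) -> float:
    """Chance that at least two people in the group share some birthday."""
    p_all_distinct = 1.0
    for i in range(group):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_two_share_given_day(others: int, days: int = 365) -> float:
    """Chance that at least two of `others` people are born on one fixed date."""
    p = 1 / days
    p_none = (1 - p) ** others
    p_exactly_one = others * p * (1 - p) ** (others - 1)
    return 1.0 - p_none - p_exactly_one


print(f"some pair shares a birthday: {p_any_shared_birthday(23):.3f}")
print(f"two share one fixed date:    {p_two_share_given_day(23):.4f}")
```

With these assumptions the first probability is a little over one half, and the second is a fraction of one percent. That is the gap between "some coincidence happened" and "this particular coincidence happened".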

The question is how we should calculate the p-value. You are arguing that we should assign a high a-priori probability to a river name starting with "k(V)r-", because a disproportionate number of Croatian words in general start with "k(V)r-", even though we have no reason to think those river names come from Croatian. In fact, at least one of those river names with the k-r-pattern was attested before Croatian even existed.

As I explained when I first brought up common and uncommon letter sequences, certain phonemes and spellings are more common in a language even among borrowed words, because languages typically change the words they borrow. English doesn't have words with 'r' between two other consonants (without any vowels) because that's not how we spell words in English. When we borrow words spelled like that in their native languages, we add vowels to get things like "Serb" and "Krishna". Words that originally started with /mn/ either lose one of the sounds (as in "mnemonic") or gain a vowel (as in "Minnesota"), because /mn/ isn't a way that words start in English.

In addition, you have no basis for assuming that all or most of the other 9185 k+r words in my sample come originally from Croatian, either. After all, toponyms aren't the only words languages borrow from each other. If you think my fraction is wrong, you're welcome to go through the list yourself and figure out the statistics for k+r among only proper nouns, or among only toponyms.
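Going through a list like that could look something like the sketch below. It is only an illustration: the function name onset_fraction, the regular expression used for "k(V)r-", and the vowel set are assumptions, and the six river names are a toy list for demonstration, not the sample discussed in this thread.

```python
import re


def onset_fraction(words, pattern=r"^k[aeiou]?r"):
    """Return the fraction of non-empty words whose start matches the pattern.

    The default pattern is one ASSUMED reading of "k(V)r-":
    'k', then an optional vowel, then 'r'.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if rx.match(w))
    return hits / len(words)


# Toy list for illustration only; swap in proper nouns or toponyms to compare.
toy_names = ["Krka", "Korana", "Sava", "Drava", "Kupa", "Krapina"]
print(onset_fraction(toy_names))
```

Running the same function over all words, over proper nouns only, and over toponyms only would give the three baseline fractions to compare, which is the statistic the argument actually depends on.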

In any case, our point is not that you're definitely wrong, but simply that you haven't yet figured out the actually relevant statistics you should be using in your comparison. You're assuming that letter frequency should be uniform in river names or in borrowed words generally with no basis for that assumption. My alternate numbers have all been for illustration purposes only. I don't know for sure that the appropriate p-value really is 0.33. You might be able to make a convincing argument that it's much lower than that. But so far, you haven't made any such argument. All you've done is hand-waving and making baseless assumptions.

I don't mean written hypotheses, I mean actual written *records* of how those states were named.

Fine, maybe. Still, supposing that *mans meant "state" is different than supposing *kar~kur meant "to flow" for a few reasons. First, as you also note, "North" and "New" are actual English words that would plausibly be a part of a name of a state. The same is not true for the Croatian river names matching the k-r-pattern. Second, as you also note, we have good reasons to think "Mississippi" at first didn't refer to a state, but to a river. There is no good reason to think the Croatian river names matching the k-r-pattern didn't refer to those rivers all along.

There are at least five river names in Croatia that appear to be derived from the Croatian word for birch, "breza". You can perhaps also claim *brez meant "river" in some language, but since there is a Croatian word from which those river names can somewhat plausibly be derived, the argument for that wouldn't be as strong as the argument for *kar~kur meaning that.

So you're saying all of those k-r rivers have no reasonable existing etymological explanation? If nothing else, it's not really honest to count the repeated name twice. That would be like saying it's astonishingly unusual for not one but two things in the US to be called exactly "Mississippi".
