What do voles and Orkney have to do with one another? One thing somebody knowledgeable about British wildlife might be able to tell you is that Orkney is home to a unique variety of the common European vole (Microtus arvalis) called the Orkney vole.

The most remarkable thing about the Orkney vole is that the common European vole isn’t found anywhere else in the British Isles, nor in Scandinavia—it’s a continental European animal. That raises the question of how a population of them ended up in Orkney. During the last ice age, Orkney was covered by a glacier and would have been uninhabitable by voles; and after the ice retreated, Orkney was separated from Great Britain straight away; there were never any land bridges that would have allowed voles from Great Britain to colonize Orkney. Besides, there is no evidence that M. arvalis was ever present on Great Britain, nor is there any evidence that voles other than M. arvalis were ever present on Orkney; none of the three species that inhabit Great Britain today (the field vole, Microtus agrestis, the bank vole, Myodes glareolus, and the water vole, Arvicola amphibius) were able to colonize Orkney, even though they were able to colonize some islands that were originally connected to Great Britain by land bridges (Haynes, Jaarola & Searle, 2003). The only plausible hypothesis is that the Orkney voles were introduced into Orkney by humans.

But if the Orkney voles were introduced, they were introduced at a very early date: the earliest discovered Orkney vole remains have been carbon-dated to c. 3100 BC (Martínková et al., 2013), around the same time Skara Brae was first occupied, to put that in context. The only other mammals on the British Isles known to have been introduced at a similarly ancient date or earlier are the domestic dog and the domestic bovids (cattle, sheep, goats); even the house mouse is not known to have been present before c. 500 BC (Montgomery, 2014)! The motivation for the introduction remains mysterious: voles might have been transported accidentally in livestock fodder imported from the Continent, or they might have been deliberately introduced as pets, food sources, etc.; we can only speculate. It’s interesting to note that the people of Orkney at this time seem to have been rather influential, as they introduced the Grooved Ware pottery style to other parts of the British Isles.

Anyway, there is in fact another interesting connection between voles and Orkney, which has to do with the word ‘vole’ itself. Something you might be aware of if you’ve looked at old books on British wildlife is that ‘vole’ is kind of a neologism. Traditionally, voles were not thought of as a different sort of animal from mice and rats. The relatively large animal we usually call the water vole today, Arvicola amphibius, was called the ‘water rat’ (as it still is sometimes today), or less commonly the ‘water mouse’. The smaller field vole, Microtus agrestis, was often just the ‘field mouse’, not distinguished from Apodemus sylvaticus, although it was sometimes distinguished as the ‘water mouse’ or the ‘short-tailed field mouse’ (as opposed to the ‘long-tailed field mouse’ A. sylvaticus—if you’ve ever wondered why people still call A. sylvaticus the ‘long-tailed field mouse’, even though its tail isn’t much longer than that of other British mice, that’s probably why!) The bank vole, Myodes glareolus, seems not to have been distinguished from the field vole before 1832 (the two species are similar in appearance, one distinction being that whereas the bank vole’s tail is about half its body length, the field vole’s tail is about 30% to 40% of its body length).

As an example, a reference to a species of vole as a ‘mouse’ can be found in the 1910 edition of the Encyclopedia Britannica:

The snow-mouse (Arvicola nivalis) is confined to the alpine and snow regions. (vol. 1, p. 754, under “Alps”)

Today that would be ‘the snow vole (Chionomys nivalis)’.

A number of other small British mammals were traditionally subsumed under the ‘mouse’ category, namely:

Shrews, which were often referred to as shrewmice from the 16th to the 19th centuries, although ‘shrew’ on its own is the older word (it is attested in Old English, but its ultimate origin is unknown).

Bats, which in older language could also be referred to by a number of whimsical compound words, the oldest and most common being rearmouse, from a now-obsolete verb meaning ‘stir’, but also rattlemouse, flindermouse, flickermouse, flittermouse and fluttermouse. The word rearmouse is still used today in the strange language of heraldry.

And, of course, dormice, which are still referred to by a compound ending in ‘-mouse’, although we generally don’t think of them as true mice today. The origin of the ‘dor-’ prefix is uncertain; the word is first attested c. 1425. There was an Old English word sisemūs for ‘dormouse’ whose origins are similarly mysterious, but the -mūs element is clearly ‘mouse’.

There is still some indeterminacy about the boundaries of the ‘mouse’ category when non-British rodent species are included: for example, are birch mice mice?

So, where did the word ‘vole’ come from? Well, according to the OED, it was first used in a book called History of the Orkney Islands (available from archive.org), published in 1805 and written by one George Barry, who was not a native of Orkney but a minister who preached there. In a list of the animals that inhabit Orkney, we find the following entry (alongside entries for the Shrew Mouse ſorex araneus, the [unqualified] Mouse mus muſculus, and the [unqualified] Field Mouse mus sylvaticus):

The Short-tailed Field Mouse, (mus agreſtis, Lin. Syſt.) which with us has the name of the vole mouſe, is very often found in marſhy grounds that are covered with moſs and ſhort heath, in which it makes roads or tracks of about three inches in breadth, and ſometimes miles in length, much worn by continual treading, and warped into a thouſand different directions. (p. 320)

So George Barry knew vole mouse as the local, Orkney dialectal word for the Orkney vole, which he was used to calling a ‘short-tailed field mouse’ (evidently he wasn’t aware that the Orkney voles were actually of a different species from the Scottish M. agrestis; I don’t know when the Orkney voles’ distinctiveness was first identified). Now, given that vole mouse was an Orkney dialect word, its further etymology is straightforward: the vole element is from Old Norse vǫllr ‘field’ (cf. English wold, German Wald ‘forest’), via the Norse dialect once spoken in Orkney and Shetland (sometimes known as ‘Norn’). So the Norse, like the English, thought of voles as ‘field mice’. The word vole is therefore the only English word I know of that has been borrowed from Norn and isn’t about something specific to Orkney or Shetland.

Of course, Barry only introduced vole mouse as an Orcadianism; he wasn’t proposing that the word be used to replace ‘short-tailed field mouse’. The person responsible for that seems to have been the author of the next quotation in the OED, from an 1828 book titled A History of British Animals by University of Edinburgh graduate John Fleming (available from archive.org). On p. 23, under an entry for the genus Arvicola, Fleming notes that

The species of this genus differ from the true mice, with which the older authors confounded them, by the superior size of the head, the shortness of the tail, and the coarseness of the fur.

He doesn’t explain where he got the name vole from, nor does he seem to reference Barry’s work at all, but he does list alternative common names of each of the two vole species he identifies. The species Arvicola aquatica, which he names the ‘Water Vole’ for the first time, is noted to also be called the ‘Water Rat’, ‘Llygoden y dwfr’ (in Welsh) or ‘Radan uisque’ (in Scottish Gaelic). The species Arvicola agrestis, which he names the ‘Field Vole’ for the first time, is noted to be also called the ‘Short-tailed mouse’, ‘Llygoden gwlla’r maes’ (in Welsh), or “Vole-mouse in Orkney”.

Fleming also separated the shrews, bats and dormice from the true mice, thus establishing the division of the British mammals into the basic one-word-labelled categories that we are familiar with today. With respect to the other British mammals, the naturalists seem to have found the traditional names to be sufficiently precise: for example, each of the three quite similar species of the genus Mustela has its own name, M. erminea being the stoat, M. nivalis being the weasel, and M. putorius being the polecat.

Fleming still didn’t distinguish the field vole and the bank vole; that innovation was made by one Mr. Yarrell in 1832, who exhibited specimens of each to the Zoological Society, demonstrated their distinctiveness and gave the ‘bank vole’ (his coinage) the Latin name Arvicola riparia. It was later found that the British bank vole was the same species as a German one described by von Schreber in 1780 as Clethrionomys glareolus, and so that name took priority (and just recently, during the 2010s, the name Myodes has come to be favoured for the genus over Clethrionomys—I don’t know why exactly).

In the report of Yarrell’s presentation in the Proceedings of the Zoological Society the animals are referred to as the ‘field Campagnol’ and ‘bank Campagnol’, so the French borrowing campagnol (‘thing of the field’, still the current French word for ‘vole’) seems to have been favoured by some during the 19th century, although Fleming’s recognition of voles as distinct from mice was universally accepted. The word ‘vole’ was used by other authors such as Thomas Bell in A History of British Quadrupeds including the Cetacea (1837), and eventually the Orcadian word seems to have prevailed and entered ordinary as well as naturalists’ usage.

Epistemic status: just a half-baked idea, which ought to be developed into something more complete, but since I’m probably not going to do that anytime soon I figured I’d publish it now just to get it out there.

Consider a statement such as (1) below.

(1) Cats are animals.

I’m used to interpreting statements such as (1) using a certain method which I’m going to call the “truth-functional method”. Its key characteristic is, as suggested by the name, that statements are supposed to be interpreted as truth functions, so that a hypothetical being which knew everything (had perfect information) would be able to assign a truth value—true or false—to every statement. There are two problems which prevent truth values being assigned straightforwardly to statements in practice.

The first is that nobody has perfect information. There is always some uncertainty of the sort which I’m going to call “truth-uncertainty”. Therefore, it’s often (or maybe even always) impossible to determine a statement’s truth value exactly. All one can do is have a “degree of belief” in the statement, though this degree of belief may be meaningfully said to be “close to truth” or “close to falsth”[1] or equally far from both. People disagree about how exactly degrees of belief should be thought about, but there’s a very influential school of thought (the Bayesian school of thought) which holds that degrees of belief are best thought about as probabilities, obeying the laws of probability theory. So, for a given statement and a given amount of available information, the goal for somebody practising the truth-functional method is to assign a degree of belief to the statement. At least inside the Bayesian school, there has been a lot of thought about how this process should work, so that truth-uncertainty is the relatively well-understood sort of uncertainty.

But there’s a second problem, which is that often (maybe even always) it’s unclear exactly what the statement means. To be more exact (the preceding sentence was an exemplification of itself), when you hear a statement, it’s often unclear exactly which truth function the statement is supposed to be interpreted as; and depending on which truth function it’s interpreted as, the degree of belief you assign to it will be different. This is the problem of meaning-uncertainty, and it seems to be rather less well-understood. Indeed, it’s probably not conventional to think about it as an uncertainty problem at all in the same way as truth-uncertainty. In the aforementioned scenario where you hear the statement carrying the meaning-uncertainty being made by somebody else, the typical response is to ask the statement-maker to clarify exactly what they mean (to operationalize, to use the technical term). There is of course an implicit assumption here that the statement-maker will always have a unique truth function in their mind when they make their statement; meaning-uncertainty is a problem that exists only on the receiving end, due to imperfect linguistic encoding. If the statement-maker doesn’t have a unique truth function in mind, and they don’t care to invent one, then their statement is taken as content-free, and not engaged with.

I wonder if this is the right approach. My experience is that meaning-uncertainty exists not only on the receiving end, but also very much on the sending end too; I very often find myself saying things but not knowing quite what I would mean by them, but nevertheless feeling that they ought to be said, that making these statements does somehow contribute to the truth-seeking process. Now I could just be motivatedly deluded about the value of my utterances, but let’s run with the thought. One thing that makes me particularly inclined towards this stance is that sometimes I find myself resisting operationalizing my statements, like there’s something crucial being lost when I operationalize and restrict myself to just one truth function. If you draw the analogy with truth-uncertainty, operationalization is like just saying whether a statement is true or false, rather than giving the degree of belief. Now one of the great virtues of the Bayesian school of thought (although it would be shared by any similarly well-developed school of thought on what degrees of belief are exactly) is arguably that, by making it clearer exactly what degrees of belief are, it seems to make people a lot more comfortable with thinking about degrees of belief rather than just true vs. false, and thus dealing with truth-uncertainty. Perhaps, then, what’s needed is some sort of well-developed concept of “meaning distributions”, analogous to degrees of belief, that will allow everybody to get comfortable dealing with meaning-uncertainty. Or perhaps this analogy is a bad one; that’s a possibility.

Aside 1. Just as truth-uncertainty almost always exists to some degree, I’m fairly sure meaning-uncertainty almost always exists to some degree; operationalization is never entirely complete. There’s a lot of meaning-uncertainty in statement (1), for example, and it doesn’t seem to completely go away no matter how much you operationalize.

Aside 2. The concept of meaning-uncertainty doesn’t seem to be as necessarily tied up with the truth-functional model to me as that of truth-uncertainty; one can imagine statements being modelled as some other sort of thing, but you’d still have to deal with exactly which example of the other sort of thing any given statement was, so there’d still be meaning-uncertainty of a sort. For example, even if you don’t see ought-statements as truth-functional, as opposed to is-statements, you can still talk about the meaning-uncertainty of an ought-statement, if not its truth-uncertainty.

Aside 3. Another way of dealing with meaning-uncertainty might be to go around the problem, and interpret statements using something other than the truth-functional method.

Footnotes

^ I’m inventing this word by analogy with “truth” because I get fed up with always having to decide whether to use “falsehood” or “falsity”.

I have a Tumblr blog which I use for writing short-form things that aren’t necessarily of any lasting value. But occasionally things do end up there that might be worth reading, so I intend to make an organized list of links to Tumblr posts that might be interesting to readers of this blog every year or so. The last time I did this was in December 2015 (here on WordPress and here on Tumblr), and I have been posting on Tumblr at a higher rate since then, so the list in this post is rather long, and I’ve organized it into subsections to make it more manageable. Only posts from December 2015 onwards are included; for earlier posts, see the earlier lists.

All of this information is from the amazingly comprehensive book English Pronunciation, 1500–1700 (Volume II) by E. J. Dobson, published in 1968, which I will unfortunately have to return to the library soon.

The transcriptions of ModE pronunciations are not meant to reflect any accent in particular but to provide enough information to allow the pronunciation in any given accent to be deduced, given sufficient knowledge about the accent.

I use the acute accent to indicate primary stress and the grave accent to indicate secondary stress in phonetic transcriptions. I don’t like the standard IPA notation.

⁂

Oh, the holly bears a blossom
As white as the lily flower
And Mary bore sweet Jesus Christ
To be our sweet saviour
— “The Holly and the Ivy”, as sung by Shirley Collins and the Young Tradition

In ModE flower is [fláwr], but saviour is [séjvjər]; the two words don’t rhyme. But they rhymed in EModE, because saviour was pronounced with secondary stress on its final syllable, as [séjvjə̀wr], while flower was pronounced [flə́wr].

The OF suffix -our (often spelt -or in English, as in emperor and conqueror) was pronounced /-ur/; I don’t know if it was phonetically short or long, and I don’t know whether it had any stress in OF, but it was certainly borrowed into ME as long [-ùːr] quite regularly, and regularly bore a secondary stress. In general borrowings into ME and EModE seem to have always been given a secondary stress somewhere, in a position chosen so as to minimize the number of adjacent unstressed syllables in the word. The [-ùːr] ending became [-ə̀wr] by the Great Vowel Shift in EModE, and then would have become [-àwr] in ModE, except that it (universally, as far as I know) lost its secondary stress.

English shows a consistent tendency for secondary stress to disappear over time. Native English words don’t generally have secondary stress, and you could see secondary stress as a sort of protection against the phonetic degradation brought about by English’s native vowel reduction processes, serving to prevent the word from getting too dissimilar from its foreign pronunciation too quickly. Eventually, however, the word (or really suffix, in this case, since saviour, emperor and conqueror all develop in the same way) gets fully nativized, which means loss of the secondary stress and concomitant vowel reduction. According to Dobson, words probably acquired their secondary stress-less variants more or less immediately after borrowing if they were used in ordinary speech at all, but educated speech betrays no loss of secondary stress until the 17th century (he’s speaking generally here, not just about the [-ə̀wr] suffix). Disyllabic words were quickest to lose their secondary stresses, trisyllabic words (such as saviour) a bit slower, and in words with more than three syllables secondary stress often survives to the present day (there are some dialect differences, too: the suffix -ary, as in necessary, is pronounced [-ɛ̀ri] in General American but [-əri] in RP, and often just [-ri] in more colloquial British English).

The pronunciation [-ə̀wr] is recorded as late as 1665 by Owen Price (The Vocal Organ). William Salesbury (1547–1567) spells the suffix as -wr in Welsh orthography, which could reflect a pronunciation [-ùːr] or [-ur]; the former would be the result of occasional failure of the Great Vowel Shift before final [r] as in pour, tour, while the latter would be the probable initial result of vowel reduction. John Hart (1551–1570) has [-urz] in governors. So the [-ə̀wr] pronunciation was in current use throughout the 17th century, although the reduced forms were already being used occasionally in Standard English during the 16th. Exactly when [-ə̀wr] became obsolete, I don’t know (because Dobson doesn’t cover the ModE period).

⁂

Bold General Wolfe to his men did say
Come lads and follow without delay
To yonder mountain that is so high
Don’t be down-hearted
For we’ll gain the victory
— “General Wolfe” as sung by the Copper Family

Our king went forth to Normandy
With grace and might of chivalry
The God for him wrought marvelously
Wherefore England may call and cry
— “Agincourt Carol” as sung by Maddy Prior and June Tabor

This is another case where loss of secondary stress is the culprit. The words victory, Normandy and chivalry are all borrowings of OF words ending in -ie /-i/. They would therefore have ended up having [-àj] in ModE, like cry, had it not been for the loss of the secondary stress. For the -y suffix this occurred quite early in everyday speech, already in late ME, but the secondarily stressed variants survived to be used in poetry and song for quite a while longer. Alexander Gil’s Logonomia Anglica (1619) explicitly remarks that pronouncing three-syllable, initially-stressed words ending in -y with [-ə̀j] is something that can be done in poetry but not in prose. Dobson says that apart from Gil’s, there are few mentions of this feature of poetic speech during the 17th century; we can perhaps take this as an indication that it was becoming unusual to pronounce -y as [-ə̀j] even in poetry. I don’t know exactly how long the feature lasted. But General Wolfe is a folk song whose earliest possible year of composition can be identified (1759, the date of General Wolfe’s death), so the feature seems to have been present well into the 18th century.

⁂

They’ve let him stand till midsummer day
Till he looked both pale and wan
And Barleycorn, he’s grown a beard
And so become a man
— “John Barleycorn” as sung by The Young Tradition

In ModE wan is pronounced [wɒ́n], with a different vowel from man [mán]. But wan used to have the same vowel as man; the influence of the preceding [w] resulted in rounding to an o-vowel. The origins of this change are traced by Dobson to the East of England during the 15th century. There is evidence of the change from the Paston Letters (a collection of correspondence between members of the Norfolk gentry between 1422 and 1509) and the Cely Papers (a collection of correspondence between wealthy wool merchants owning estates in Essex between 1475 and 1488); the Cely Papers only exhibit the change in the word was, but the change is more extensive in the Paston Letters and in fact seems to have applied before the other labial consonants [b], [f] and [v] too for these letters’ writers.

There is no evidence of the change in Standard English until 1617, when Robert Robinson in The Art of Pronunciation notes that was, wast (as in thou wast) and what have [ɒ́] rather than [á]. The initial restriction of the change to unstressed function words, as in the Cely Papers, suggests the change did indeed spread from the Eastern dialects. Later phoneticians during the 17th century record the [ɒ́] pronunciation in more and more words, but the change is not regular at this point; for example, Christopher Cooper (1687) has [ɒ́] in watch but not in wan. According to Dobson, relatively literary words such as wan and quality, not often used in everyday speech, did not reliably have [ɒ́] until the late 18th century.

Note that the change also applied after [wr] in wrath, and that words in which a velar consonant ([k], [g] or [ŋ]) followed the vowel were regular exceptions (cf. wax, wag, twang).

⁂

I’ll go down in some lonesome valley
Where no man on earth shall e’er me find
Where the pretty little small birds do change their voices
And every moment blows blusterous winds
— “The Banks of the Sweet Primroses” as sung by the Copper family

The expected ModE pronunciation of OE wind ‘wind’ would be [wájnd], resulting in homophony with find. Indeed, as far as I know, every other monosyllabic word with OE -ind has [-ájnd] in Modern English (mind, grind, bind, kind, hind, rind, …), resulting from an early ME sound change that lengthened final-syllable vowels before [nd] and various other clusters containing two voiced consonants at the same place of articulation (e.g. [-ld] as in wild).

It turns out that [wájnd] did use to be the pronunciation of wind for a long time. The OED entry for wind, written in the early 20th century, actually says that the word is still commonly taken to rhyme with [-ajnd] by “modern poets”; and Bob Copper and co. can be heard pronouncing winds as [wájndz] in their recording of “The Banks of the Sweet Primroses”. The [wínd] pronunciation reportedly became usual in Standard English only in the 17th century. It is hypothesized to be a result of backformation from the derivatives windy and windmill, in which lengthening never occurred because the [nd] cluster was not in word-final position. It is unlikely to be due to avoidance of homophony with the verb wind, because the words spent several centuries being homophonous without any issues arising.

⁂

Meeting is pleasure but parting is a grief
And an inconstant lover is worse than a thief
A thief can but rob me and take all I have
But an inconstant lover sends me to the grave
— “The Cuckoo”, as sung by Anne Briggs

As the spelling suggests, the word have used to rhyme with grave. The word was confusingly variable in form in ME, but one of its forms was [haːvə] (rhyming with grave) and another one was [havə]. The latter could have been derived from the former by vowel reduction when the word was unstressed, but this is not the only possible source of it (e.g. another one would be analogy with the second-person singular form hast, where the a was in a closed syllable and therefore would have been short); there does not seem to be any consistent conditioning by stress in the forms recorded by 16th- and 17th-century phoneticians, who use both forms quite often. There are some who have conditioning by stress, such as Gil, who explicitly describes [hǽːv] as the stressed form and [hav] as the unstressed form. I don’t know how long [hǽːv] (and its later forms, [hɛ́ːv], [héːv], [héjv]) remained a variant usable in Standard English, but according to the Traditional Ballad Index, “The Cuckoo” is attested no earlier than 1769.

⁂

Now the day being gone and the night coming on
Those two little babies sat under a stone
They sobbed and they sighed, they sat there and cried
Those two little babies, they laid down and died
— “Babes in the Wood” as sung by the Copper family

In EModE there was occasional shortening of stressed [ɔ́ː], so that it developed into ModE [ɒ́] rather than [ów] as normal. It is a rather irregular and mysterious process; examples of it which have survived into ModE include gone (< OE ġegān), cloth (< OE clāþ) and hot (< OE hāt). The 16th- and 17th-century phoneticians record many other words which once had variants with shortening that have not survived to the present day, such as both, loaf, rode, broad and groat. Dobson mentions that Elisha Coles (1675–1679) “knew some variant, perhaps ŏ in stone”; the verse from “Babes in the Wood” above would be additional evidence that stone at some point by some people was pronounced as [stɒn], thus rhyming with on. As far as I know, there is no way it could have been the other way round, with on having [ɔ́ː]; the word on has always had a short vowel.

⁂

“So come riddle to me, dear mother,” he said
“Come riddle it all as one
Whether I should marry with Fair Eleanor
Or bring the brown girl home” (× 2)

“Well, the brown girl, she has riches and land
Fair Eleanor, she has none
And so I charge you do my bidding
And bring the brown girl home” (× 2)
— “Lord Thomas and Fair Eleanor” as sung by Peter Bellamy

In “Lord Thomas and Fair Eleanor”, the rhymes on the final consonant are often imperfect (although the consonants are always phonetically similar). These two verses, however, are the only ones where the vowels aren’t the same in the modern pronunciation—and there’s good reason to think they were the same once.

The words one and none are closely related. The OE word for ‘one’ was ān; the OE word for ‘none’ was nān; the OE word for ‘not’ was ne; the second is simply the result of adding the third as a prefix to the first: ‘not one’.

OE ā normally becomes ME [ɔ́ː] and then ModE [ów] in stressed syllables. If it had done that in one and none, they’d be near-rhymes with home today, save for the difference in the final nasals’ places of articulation. Indeed, in only, which is a derivative of one with the -ly suffix added, we have [ów] in ModE. But the standard ModE pronunciations of one and none are [wʌ́n] and [nʌ́n] respectively. There are also variant forms [wɒ́n] and [nɒ́n] widespread across England. How did this happen? As usual, Dobson has answers.

The [nɒ́n] variant is the easiest one to explain, at least if we consider it in isolation from the others. It’s just the result of sporadic [ɔ́ː]-shortening before [n], as in gone (see above on the on–stone rhyme). As for [nʌ́n]: ModE [ʌ] is the ordinary reflex of short ME [u], but there is a sporadic [úː]-shortening change in EModE besides the sporadic [ɔ́ː]-shortening one. This change is quite common and reflected in many ModE words such as blood, flood, good, book, cook, wool, although I don’t think there are any where it happens before n. So perhaps [nɔ́ːn] underwent a shift to [nóːn] somehow during the ME period, which would become [núːn] by the Great Vowel Shift. As it happens there is some evidence for such a shift in ME from occasional rhymes in ME texts, such as hoom ‘home’ with doom ‘doom’ and forsothe ‘forsooth’ with bothe ‘both’ in the Canterbury Tales. However, there is especially solid evidence for it in the environment after [w], in which environment most instances of ME [ɔ́ː] exhibit raising that has passed into Standard English (e.g. who < OE hwā, two < OE twā, ooze < OE wāse; woe is an exception in ModE, although it, too, is listed as a homophone of woo occasionally by Early Modern phoneticians). Note that although all these examples happen to have lost the [w], presumably by absorption into the following [úː] after the Great Vowel Shift occurred, there are words such as womb with EModE [úː] which have retained their [w], and phoneticians in the 16th and 17th centuries record pronunciations of who and two with retained [w]. So if ME [ɔ́ːn] ‘one’ somehow became [wɔ́ːn], and then raising to [wóːn] occurred due to the [w], then this vowel would be likely to spread by analogy to its derivative [nɔ́ːn], allowing for the emergence of [wʌ́n] and [nʌ́n] in ModE. The ModE [wɒ́n] and [nɒ́n] pronunciations can be accounted for by assuming the continued existence of an un-raised [wɔ́ːn] variant in EModE alongside [wúːn].

As it happens there is a late ME tendency for [j] to be inserted before long mid front vowels and, a little less commonly, for [w] to be inserted before word-initial long mid back vowels. This glide insertion only happened in initial syllables, and usually only when the vowel was word-initial or the word began with [h]; but there are occasional examples before other consonants such as John Hart’s [mjɛ́ːn] for mean. The Hymn of the Virgin (uncertain date, 14th century), which is written in Welsh orthography and therefore more phonetically transparent than usual, evidences [j] in earth. John Hart records [j] in heal and here, besides mean, and [w] in whole (< OE hāl). 17th-century phoneticians record many instances of [j]- and [w]-insertion, giving spellings such as yer for ‘ere’, yerb for ‘herb’, wuts for ‘oats’ (this one also has shortening), but they frequently condemn these pronunciations as “barbarous”. Christopher Cooper (1687) even mentions a pronunciation wun for ‘one’, although not without condemning it for its barbarousness. The general picture seems to be that glide insertion was widespread in dialects, and filtered into Standard English to some degree during the 16th century, but there was a strong reaction against it during the 17th century and it mostly disappeared. The exception, of course, is the word one, for which, according to Dobson, the [wʌ́n] pronunciation became normal around 1700. The [nʌ́n] pronunciation for ‘none’ is first recorded by William Turner in The Art of Spelling and Reading English (1710).

Finally, I should mention that sporadic [úː]-shortening is also recorded as applying to home, resulting in the pronunciation [hʌ́m]; and Turner has this pronunciation, as do many English traditional dialects. So it’s possible that the rhyme in “Lord Thomas and Fair Eleanor” is due to this change having applied to home, rather than preservation of the conservative [-ówn] forms of one and none.

Consider the following very simple game: a Bernoulli trial (a trial which results in one of two possible outcomes, labelled “success” and “failure”) is carried out with success probability $p$. Beforehand, you are told the value of $p$ and asked to give a definite prediction of the trial’s outcome. That is, you have to predict either success or failure; just saying “the probability of success is $p$” is not enough. You win if and only if you predict the correct outcome.

Here are two reasonable-sounding strategies for this game:

1. If $p > 0.5$, predict success. If $p < 0.5$, predict failure. If $p = 0.5$, predict success with probability 0.5 and failure with probability 0.5.

2. Predict success with probability $p$ and failure with probability $1 - p$.

In game-theoretic language, the difference between strategies 1 and 2 is that strategy 1 involves the use of a pure strategy if possible, i.e. one in which the choice of what to predict is made deterministically, while strategy 2 is always mixed, i.e. the choice of what to predict is made randomly.

But which is better? Note that the answer may depend on the value of $p$. Try to think about it for a minute before moving on to the next paragraph.

If $p = 0.5$, then the strategies are identical and therefore both equally good.

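If $p \ne 0.5$, let $q$ be the probability of the more probable outcome (i.e. $q = p$ if $p > 0.5$ and $q = 1 - p$ if $p < 0.5$). If the more probable outcome happens, then you win for sure under strategy 1 but you only have probability $q$ of winning under strategy 2. If the less probable outcome happens, then you lose for sure under strategy 1 but you still have probability $1 - q$ of winning under strategy 2. Therefore the probability of winning is $q$ under strategy 1 and $q^2 + (1 - q)^2$ under strategy 2. So strategy 1 is better than strategy 2 if and only if

$$q > q^2 + (1 - q)^2,$$

i.e.

$$2q^2 - 3q + 1 < 0.$$

This quadratic inequality holds if and only if $\frac{1}{2} < q < 1$ (the left-hand side factorizes as $(2q - 1)(q - 1)$). But $q$ is the probability of the more probable outcome, and therefore $q > \frac{1}{2}$ for sure. Therefore, strategy 1 is always better if $p \ne 0.5$, apart from the degenerate cases $p = 0$ and $p = 1$ (where $q = 1$), in which the two strategies make the same prediction and do equally well.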
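The comparison can be checked numerically. What follows is only a minimal sketch: the function names (`win_prob_strategy1`, `win_prob_strategy2`, `strategy1`, `strategy2`, `simulate`) are invented here for illustration, and the simulation simply plays the game many times with a fixed seed.

```python
import random

def win_prob_strategy1(p):
    """Exact win probability when predicting the more probable outcome."""
    return max(p, 1 - p)

def win_prob_strategy2(p):
    """Exact win probability when predicting success with probability p."""
    return p * p + (1 - p) * (1 - p)

def strategy1(p):
    # Probability of predicting success: pure where possible, a coin flip at p = 0.5.
    return 1.0 if p > 0.5 else 0.0 if p < 0.5 else 0.5

def strategy2(p):
    # Always mixed: predict success with probability p.
    return p

def simulate(p, strategy, trials=100_000, rng=None):
    """Estimate a strategy's win probability by playing the game repeatedly."""
    rng = rng or random.Random(0)
    wins = 0
    for _ in range(trials):
        outcome = rng.random() < p               # True means success
        prediction = rng.random() < strategy(p)  # True means "predict success"
        wins += outcome == prediction
    return wins / trials

for p in (0.5, 0.6, 0.9):
    print(p,
          win_prob_strategy1(p), simulate(p, strategy1),
          win_prob_strategy2(p), simulate(p, strategy2))
```

For each value of p the simulated proportions should sit close to the exact values, with strategy 1 ahead whenever p is not 0.5.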

I find this result weird and a little counterintuitive when it’s stated so abstractly. It seems to me like the most natural way of obtaining a definite value from the distribution—drawing randomly from the distribution—should be the best one.

But I guess it does make sense, if you think about it as applying to a concrete situation. For example, if you were on a jury and you thought there was a $\frac{1}{1024}$ probability that the defendant was guilty, it would be crazy to then flip 10 coins and precommit to arguing for the defendant’s guilt if every one of them came up heads. The other jurors would think you were mad (and probably be very angry with you, if they did all come up heads).

The result has interesting implications for how people should act on their beliefs. If you believe that degrees of belief can be usefully modelled as probabilities, and you try to apply this in everyday reasoning, you will often be faced with the problem of deciding whether to act in accordance with a belief’s truth even if you only place a certain probability $p$ on that belief being true. Should you always act in accordance with the belief if $p > 0.5$, or should you have probability $p$ of acting in accordance with it at any given time? Until I wrote this post it wasn’t obvious to me, but the result in this post suggests you should do the former.

I do wonder if there is anything strategy 2 is good for, though. Comment if you have an idea!

I’ve been thinking about how to use a computer program to randomly generate a probability distribution of a given finite size. This has turned out to be an interesting problem.

My first idea was to generate n − 1 uniform variates in [0, 1] (where n is the desired size), sort them, add 0 to the front of the list and 1 to the back of the list, and then take the non-negative differences between the adjacent variates. In Python code:
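A minimal sketch of this first idea, written to match the description above and the name randpd1 used in the rest of the post. The local names `cuts` and `points` are just illustrative.

```python
import random

def randpd1(size):
    # Draw size - 1 independent uniform variates in [0, 1) and sort them.
    cuts = sorted(random.random() for _ in range(size - 1))
    # Add 0 to the front and 1 to the back, giving size + 1 points.
    points = [0.0] + cuts + [1.0]
    # The gaps between adjacent points are non-negative and sum to 1.
    return [b - a for a, b in zip(points, points[1:])]

print(randpd1(3))
```

Because the points are sorted, every difference is non-negative, and the differences telescope to 1 − 0 = 1 (up to floating-point rounding).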

My second idea was to simply generate n uniform variates in [0, 1], add them together, and take the ratio of each individual variate to the sum.

import random

def randpd2(size):
    # Draw `size` uniform variates, then divide each one by their sum.
    variates = [random.random() for i in range(size)]
    s = sum(variates)
    return [i/s for i in variates]

Both of these functions do reliably generate probability distributions, i.e. lists of non-negative real numbers (encoded as Python float objects) that sum to 1, although very large lists generated by randpd2 sometimes sum to something slightly different from 1 due to floating-point imprecision.

But do they both generate a random probability distribution? In precise terms: for a given size argument, are the probability distributions of the return values randpd1(size) and randpd2(size) always uniform?

I don’t really know how to answer this question. In fact, it’s not even clear to me that there is a uniform distribution over the probability distributions of size n for every positive integer n. The problem is that the probability distributions of size n are the solutions in $[0, 1]^n$ of the equation $x_1 + x_2 + \dots + x_n = 1$, where $x_1$, $x_2$, … and $x_n$ are dummy variables, and therefore they comprise a set S whose dimension is n − 1 (not n). Because S is missing a dimension, continuous probability distributions over it cannot be defined in the usual way via probability density mappings on $\mathbb{R}^n$. Any such mapping would have to assign probability density 0 to every point in S, because for every such point x, there’s a whole additional dimension’s worth of points in every neighbourhood of x which are not in S. But then the integral of the probability density mapping over S would be 0, not 1, and it would not be a probability density mapping.

But perhaps you can map S onto a subset of $\mathbb{R}^{n - 1}$, and do something with a uniform distribution over the image. In any case, I’m finding thinking about this very confusing, so I’ll leave it for readers to ponder over. Given that I don’t currently know what a uniform probability distribution over the probability distributions of size n even looks like, I don’t know how to test whether one exists.

I can look at the marginal distributions of the individual items in the returned values of randpd1 and randpd2. But these marginal distributions are not straightforwardly related to the joint distribution of the list as a whole. In particular, uniformity of the joint distribution does not imply uniformity of the marginal distributions, and uniformity of the marginal distributions does not imply uniformity of the joint distribution.

But it’s still interesting to look at the marginal distributions. First of all, they allow validation of another desirable property of the two functions: the marginal distributions are the same for each item (regardless of its position in the list). I’m not going to demonstrate this here because it would be tedious, but it does look like this is the case. Therefore we can speak of “the marginal distribution” without reference to any particular item. Second, they reveal that randpd1 and randpd2 do not do exactly the same thing. The marginal distributions are different for the two functions. Let’s first look just at the case where size is 2.

>>> data1 = [randpd1(2)[0] for i in range(100000)]
>>> plt.hist(data1)

>>> data2 = [randpd2(2)[0] for i in range(100000)]
>>> plt.hist(data2)

The first plot looks like it’s been generated from a uniform distribution over [0, 1]; the second plot looks like it’s been generated from a non-uniform distribution which concentrates the probability density at 1/2. It’s easy to see why the distribution is uniform for randpd1: the function works by generating a single uniform variate p and then returning [p, 1 - p], and given that the distribution of p is uniform the distribution of 1 - p is also uniform. The function randpd2, on the other hand, works by generating two uniform variates p and q and returning [p/(p + q), q/(p + q)]. However, I don’t know what the distribution of p/(p + q) and q/(p + q) is exactly, given that p and q are uniformly distributed. This is another thing I hope readers who know more about probability and statistics than me might be able to enlighten me on.

Here are the graphs for size 3:

>>> data1 = [randpd1(3)[0] for i in range(100000)]
>>> plt.hist(data1)

>>> data2 = [randpd2(3)[0] for i in range(100000)]
>>> plt.hist(data2)

The marginal distribution for randpd1 is no longer uniform; rather, it’s right triangular, with the right angle at the point (0, 0) and height 2. That means that, roughly speaking, a given item in the list returned by randpd1(3) is twice as likely to be close to 0 as it is to be close to 1/2, and half as likely to be close to 3/4 as it is to be close to 1/2.

In general, the marginal distribution for randpd1 is the distribution of the minimum of a sample of uniform variates in [0, 1] of size n − 1, where n is the value of size. This is because randpd1 works by generating such a sample, and the minimum of that sample always ends up being the first item in the returned list, and the marginal distributions of the other items are the same as the marginal distribution of the first item.

It turns out to be not too difficult to derive an exact formula for this distribution. For every $x \in [0, 1]$, the minimum is greater than $x$ if and only if all $n - 1$ variates are greater than $x$. Therefore the probabilities of these two events are the same. The probability of an individual variate being greater than $x$ is $1 - x$ (because, given that the variate is uniformly distributed, $x$ is the probability that the variate is less than or equal to $x$) and therefore, given that the variates are independent of each other, the probability of all being greater than $x$ is $(1 - x)^{n - 1}$. It follows that the probability of the minimum being less than or equal to $x$ is $1 - (1 - x)^{n - 1}$. That is, the cumulative distribution mapping (CDM) $f$ of the marginal distribution for randpd1 is given by

$$f(x) = 1 - (1 - x)^{n - 1}.$$

The probability distribution defined by this CDM is a well-known one called the beta distribution with parameters (1, n − 1). That’s a nice result!
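The formula can be compared against simulated data. The sketch below is an illustration only: it assumes randpd1 is implemented as described earlier (sorted uniform variates, then adjacent differences), and the helper names `empirical_cdf` and `cdm` are made up here.

```python
import random

def randpd1(size):
    # Assumed implementation, following the description of the first idea.
    cuts = sorted(random.random() for _ in range(size - 1))
    points = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(points, points[1:])]

def empirical_cdf(sample, x):
    """Proportion of the sample that is less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

def cdm(x, n):
    """The derived formula: 1 - (1 - x)^(n - 1)."""
    return 1 - (1 - x) ** (n - 1)

random.seed(0)
n = 4
first_items = [randpd1(n)[0] for _ in range(100_000)]
for x in (0.1, 0.25, 0.5, 0.75):
    # The two columns should agree to roughly two decimal places.
    print(x, round(empirical_cdf(first_items, x), 3), round(cdm(x, n), 3))
```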

The marginal distribution for randpd2, on the other hand, is similar to the one for size 2 except that the mean is now something like 1/3 rather than 1/2, and because the support is still the whole interval [0, 1] this pushes the bulk of the distribution to the left, leaving a longer tail on the right. Again, I don’t know how to characterize this distribution exactly. Here are the graphs for sizes 4 and 5:

>>> data = [randpd2(4)[0] for i in range(100000)]
>>> plt.hist(data)

>>> data = [randpd2(5)[0] for i in range(100000)]
>>> plt.hist(data)

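It looks like the marginal distribution generally has mean 1/n, or something close to that, for every positive integer n (in fact the mean must be exactly 1/n, because the n items are identically distributed and always sum to 1), while still having density approaching 0 at the left limit.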

In conclusion… this post doesn’t have a conclusion, it just has a bunch of unanswered questions which I’d like to know the answers to.

Is the concept of a uniform distribution over the probability distributions of size n sensical?

If so, do the returned values of randpd1 and randpd2 have that distribution?

If not, what distributions do they have?

What’s the exact form of the marginal distribution for randpd2?

Which is better: randpd1 or randpd2? Or if one isn’t clearly better than the other, what is each one best suited to being used for?

Are there any other algorithms one could use to generate a random probability distribution?

One of the classes I’m taking this term is about modelling the evolution of communication systems. Everything in the class is done via simulation, which is probably the best way to do it, and certainly necessary at the point where it starts to involve genetic algorithms and such. However, some of the earlier content in the class dealt with problems that I suspected were solvable by a purely mathematical approach, so as somebody with a maths degree I felt it necessary to rise to the challenge and try to derive the solutions mathematically. This post is my attempt to do that.

Let us begin by thinking very abstractly about a system which takes something in and gives something out. Suppose there is a finite, positive number m of things which may be taken in (possible inputs), which we shall call input 1, input 2, … and input m. Suppose likewise that there is a finite, positive number n of things which may be given out (possible outputs), which we shall call output 1, output 2, … and output n.

One way in which the behavior of such a system could be modelled is as a straightforward mapping from inputs to outputs. However, this might be too deterministic: perhaps the system doesn’t always output the same output for a given input. So let’s use a more general model, and think of the system as a mapping from inputs to probability distributions over outputs. For every pair (i, j) of integers such that 1 ≤ i ≤ m and 1 ≤ j ≤ n, let $p_{i,j}$ denote the probability that input i is mapped to output j. The mapping as a whole is determined by the mn probabilities of the form $p_{i,j}$, and therefore it can be thought of as an m-by-n matrix A:

$$A = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,n} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m,1} & p_{m,2} & \cdots & p_{m,n} \end{pmatrix}$$

The rows of A correspond to the possible inputs and the columns of A correspond to the possible outputs. Probabilities are non-negative real numbers, so A is a non-negative real matrix. Also, the probabilities of mutually exclusive, exhaustive outcomes sum to 1, so the sum of each row of A is 1. This condition can be expressed as a system of linear equations:

$$\begin{aligned} p_{1,1} + p_{1,2} + \dots + p_{1,n} &= 1 \\ p_{2,1} + p_{2,2} + \dots + p_{2,n} &= 1 \\ &\;\;\vdots \\ p_{m,1} + p_{m,2} + \dots + p_{m,n} &= 1 \end{aligned}$$

Alternatively, and more compactly, it may be expressed as the matrix equation

$$Ax = y$$

where x is the n-dimensional vector whose components are all equal to 1 and y is the m-dimensional vector whose components are all equal to 1.

In general, if x is an n-dimensional vector, and we think of x as a random variable determined by the output of the system, then Ax is the vector of expected values of x conditional on each input. That is, for every integer i such that 1 ≤ i ≤ m, the ith component of Ax is the expected value of x conditional on input i being the input to the system.

Accordingly, if we have not just one, but p n-dimensional vectors $x_1$, $x_2$, … and $x_p$ (where p is a positive integer), we can think of these p vectors as the columns of an n-by-p matrix B, and then we can read off all the expected values from the matrix product

$$AB$$

like so: for every pair (i, k) of integers such that 1 ≤ i ≤ m and 1 ≤ k ≤ p, the (i, k) entry of AB is the expected value of $x_k$ conditional on input i being the input to the system.

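In the case where B (with (j, k) entry $b_{j,k}$) happens to be another non-negative real matrix such that

$$b_{j,1} + b_{j,2} + \dots + b_{j,p} = 1 \quad \text{for every integer } j \text{ such that } 1 \le j \le n,$$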

so that the entries of B can be interpreted as probabilities, the matrix B as a whole can be interpreted as another input-output system whose possible inputs happen to be the same as the possible outputs of A. In order to emphasize this identity, let us now call the possible outputs of A (= the possible inputs of B) the signals: signal 1, signal 2, … and signal n. The other things—the possible inputs of A, and the possible outputs of B—can be thought of as meanings. Note that there is no need at the moment for the input meanings (the possible inputs of A) to be the same as the output meanings (the possible outputs of B); we make a distinction between the input meanings and the output meanings.

Together, A and B can be thought of as comprising a “product system” which works like this: an input meaning goes into A, a signal comes out of A, the signal goes into B, and an output meaning comes out of B. For every integer k such that 1 ≤ k ≤ p, the random variable $x_k$ (the kth column of B) can now be interpreted as the probability of the product system outputting output meaning k, as a random variable whose value is determined by the signal. That is, for every integer j such that 1 ≤ j ≤ n, the jth component of $x_k$ (the (j, k) entry of B) is the probability of output meaning k coming out if the signal happens to be signal j. It follows by the law of total probability that the probability of output meaning k coming out, if i is the input meaning, is the expected value of $x_k$ conditional on i being the input meaning. Now, by what we said a couple of paragraphs above, we have that for every integer i such that 1 ≤ i ≤ m, the expected value of $x_k$ conditional on i being the input meaning is the (i, k) entry of AB. So the “product system”, as a matrix, is the matrix product AB. That’s why we call it the “product system”, see? 🙂

In the case where the possible input meanings are the same as the possible output meanings and m = p, we may think about the “product system” as a communicative dyad. The speaker is A, the hearer is B. The speaker is trying to express a meaning, the input meaning, and producing a signal in order to do so, and the hearer is interpreting that signal to have some meaning, the output meaning. The output meaning the hearer understands is not necessarily the same as the input meaning the speaker was trying to express. If it is different, we may regard the communication as unsuccessful; if it is the same, we may regard the communication as successful.

The key question is: what is the probability that the communication is successful? Given the considerations above, it’s very easy to answer. If the input meaning is i, we’re just looking for the probability of output meaning i given this input meaning. That probability is simply the (i, i) entry of AB, i.e. the ith entry along AB’s main diagonal.

What if the input meaning isn’t fixed? Then the answer will in general depend on the probability distribution over the possible input meanings. But in the simplest case, where the distribution is uniform (no input meaning is any more probable than any other), the probability of successful communication is just the mean of the input meaning-specific probabilities, that is, the sum of the main diagonal entries of AB, divided by m (the number of the main diagonal entries, i.e. the number of meanings). In linear algebra, we call the sum of the main diagonal entries of a square matrix its trace, and we denote it by tr(C) where C is the matrix. So our formula for the communication success probability p is

$$p = \frac{\operatorname{tr}(AB)}{m}. \tag{2}$$

If the probability distribution over the input meanings isn’t uniform, the probability of successful communication is just the weighted average of the input meaning-specific probabilities, with the weights being the respective input meaning probabilities. The general formula can therefore be written as

$$p = \operatorname{tr}(ABD), \tag{3}$$

where D is the diagonal matrix of size m whose main diagonal is the probability distribution over the input meanings (i.e. for every integer i such that 1 ≤ i ≤ m, the ith diagonal entry of D is the probability of input meaning i being the one the speaker tries to express). It doesn’t matter whether D is left-multiplied or right-multiplied, because the trace of the product is the same in either case. In the case where the probability distribution over the input meanings is uniform the diagonal entries of D are all equal to $\frac{1}{m}$, i.e. $D = \frac{1}{m} I_m$, where $I_m$ is the identity matrix of size m, and therefore (3) reduces to (2).
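To make the formula concrete, here is a small worked example in plain Python using exact fractions. It is only a sketch: the matrices, the two input distributions and the helper names `matmul` and `success_probability` are all invented for illustration.

```python
from fractions import Fraction as F

def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][j] * Q[j][k] for j in range(len(Q)))
             for k in range(len(Q[0]))] for i in range(len(P))]

def success_probability(A, B, d):
    """tr(DAB), where D is the diagonal matrix whose main diagonal is d."""
    AB = matmul(A, B)
    return sum(d[i] * AB[i][i] for i in range(len(d)))

# Speaker: rows are meanings, columns are signals (2 meanings, 2 signals).
A = [[F(3, 4), F(1, 4)],
     [F(1, 2), F(1, 2)]]
# Hearer: rows are signals, columns are meanings.
B = [[F(1), F(0)],
     [F(1, 3), F(2, 3)]]

uniform = [F(1, 2), F(1, 2)]
skewed = [F(9, 10), F(1, 10)]

print(success_probability(A, B, uniform))  # 7/12, the same as tr(AB) / 2
print(success_probability(A, B, skewed))   # 47/60
```

Here AB has main diagonal (5/6, 1/3), so the uniform case gives (5/6 + 1/3)/2 = 7/12, while weighting the first meaning more heavily raises the success probability to 47/60.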

To leave you fully convinced that this formula works, here are some simulations. The 5 graphs below were generated using a Python script which you can view on GitHub. Each one involves 3 possible meanings, 3 possible signals, randomly-generated speaker and hearer matrices and a randomly-generated probability distribution over the input meanings. If you look at the code, you’ll see that the blue line is generated by simulating communication in the obvious way, by randomly drawing an input meaning, randomly drawing a signal based on that particular input meaning, and finally randomly drawing an output meaning based on that particular signal. The position on the x-axis corresponds to the number of trials (individual simulated communicative acts) carried out so far and the position on the y-axis corresponds to the proportion of those trials involving a successful communication (one where the output meaning ended up being the same as the input meaning). For each graph, there were 10 sets of 500 trials; each individual set of trials corresponds to one of the light blue lines, while the darker blue line gives the results averaged over those ten sets. The horizontal green line indicates the success probability as calculated by our formula. This should be close to the success proportion for a large number of trials, so we should see the blue and green lines converging on the right side of each graph. That is what we see, so the formula works.
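The simulation procedure just described can be sketched in a few lines. This is not the script linked above, only a stand-alone illustration under assumed inputs: the matrices, the input distribution and the function name `simulate_success` are invented here.

```python
import random

def simulate_success(A, B, d, trials=50_000, seed=0):
    """Proportion of simulated communicative acts that succeed."""
    rng = random.Random(seed)
    meanings = list(range(len(d)))
    signals = list(range(len(B)))
    successes = 0
    for _ in range(trials):
        i = rng.choices(meanings, weights=d)[0]     # input meaning
        j = rng.choices(signals, weights=A[i])[0]   # signal produced by the speaker
        k = rng.choices(meanings, weights=B[j])[0]  # meaning understood by the hearer
        successes += (i == k)
    return successes / trials

A = [[0.75, 0.25], [0.5, 0.5]]   # speaker: meanings -> signals
B = [[1.0, 0.0], [1 / 3, 2 / 3]] # hearer: signals -> meanings
d = [0.9, 0.1]                   # distribution over input meanings

# Success probability from the trace formula, written out entry by entry.
formula = sum(d[i] * sum(A[i][j] * B[j][i] for j in range(len(B)))
              for i in range(len(d)))
print(simulate_success(A, B, d), formula)
```

With this many trials the simulated proportion should land within about a percentage point of the value given by the formula.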

I’ve split this post into two parts because it would be really long otherwise. Part 2 will be coming up later, hopefully.

Surprisingness

Let’s think about how we might define surprisingness as a mathematical quantity.

Surprisingness is a property of events as perceived by observers. After the event occurs, the observer is surprised to a certain degree. The degree of surprisingness depends on how likely the observer thought it was that the event would occur, or to put it briefly the subjective probability of the event. In fact, to a reasonable approximation, at least, the degree of surprisingness depends only on the subjective probability of the event. The greater the subjective probability, the less the surprise.

For example, if you flip a coin and it comes up heads, this is not very surprising: the probability of this event was $\frac{1}{2}$, which is reasonably high. But if you flip ten coins and they all come up heads, you are much more surprised, due to the fact that the probability of ten heads in ten flips is only

$$\left(\frac{1}{2}\right)^{10} = \frac{1}{1024}.$$

This is a much smaller quantity than $\frac{1}{2}$, hence the large increase in surprise. Even if you are unable to do the calculation and work out that the probability of ten heads in ten flips is exactly $\frac{1}{1024}$, you are probably aware on an intuitive level that it is a lot smaller than $\frac{1}{2}$.

These considerations suggest surprisingness might be definable as a mapping $S$ on $[0, 1]$ (the set of the real numbers between 0 and 1, inclusive), such that for every $p \in [0, 1]$ (i.e. every member $p$ of $[0, 1]$), the value $S(p)$ is the common surprisingness of the events of subjective probability $p$. Listed below are three properties such a mapping $S$ should have, if it is to be in accordance with the everyday meaning of “surprise”.

Property 1. For every $p \in [0, 1]$ (i.e. every member $p$ of $[0, 1]$), the value $S(p)$ should be a non-negative real number, or $+\infty$ (positive infinity).

Justification. Surprisingness seems to be reasonably well-modelled as (to use statistics jargon) an interval variable: two surprisingness values can be compared to see which is less and which is greater, there is a sensical notion of “distance” between surprisingness values, etc. Therefore, it makes sense for surprisingness to be represented by a real number. Moreover, there is a natural minimum surprisingness, namely the surprisingness of an event of subjective probability 1 (i.e. which the observer was sure would happen)—such an event is not at all surprising, and it makes sense to speak of being, e.g., “twice as surprised” by one event compared to another with reference to this natural minimum. To use statistics jargon again, surprisingness is a ratio variable. Naturally, it makes sense to set the minimum value at 0.

Property 2. The mapping $S$ should be strictly decreasing. That is, for every pair $p$, $q$ of members of $[0, 1]$ such that $p < q$, we should have $S(q) < S(p)$.

Justification. As said above, events of high subjective probability are less surprising than those of low subjective probability.

Property 3. For every pair $p$, $q$ of members of $[0, 1]$, we should have

$$S(pq) = S(p) + S(q).$$

Justification. Suppose $A$ and $B$ are independent events of subjective probabilities $p$ and $q$ respectively. Then, assuming subjective probabilities are assigned in a consistent way, obeying the laws of probability, the subjective probability of $A \cap B$ (the event that both $A$ and $B$ occur) is $pq$ and therefore the surprisingness of $A \cap B$ is $S(pq)$. But given that $A$ and $B$ are independent (so the observation of one does not change the observer’s subjective probability of the other), the surprisingness of $A \cap B$ overall, i.e. $S(pq)$, should be the same as the total surprisingness of the individual observations of $A$ and $B$, i.e. $S(p) + S(q)$.

Remarkably, these three properties determine the form of $S$ almost exactly (up to a vertical scale factor). Feel free to skip the proof below; it is somewhat long and tedious and doesn’t use any ideas that will be used later on in the post, and I haven’t put in much effort to make it easy to follow.

Theorem 1. The mappings $S$ on $[0, 1]$ having properties 1–3 above are those of the form $p \mapsto -\log_b p$ (i.e. those mappings $S$ on $[0, 1]$ such that $S(p) = -\log_b p$ for every $p \in [0, 1]$), where $b$ is a real number greater than 1.

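Proof. Suppose $S$ is a mapping on $[0, 1]$ having properties 1–3 above. Let $\varphi$ be the composition of $S$ and $\exp$. Then the domain of $\varphi$ is the set of the $x \in \mathbb{R}$ such that $e^x \in [0, 1]$, i.e. $x \le 0$; and for every pair $x$, $y$ of non-positive real numbers, we have

$$\varphi(x + y) = S(e^{x + y}) = S(e^x e^y) = S(e^x) + S(e^y) = \varphi(x) + \varphi(y).$$

In the case where $x = y = 0$, we see that $\varphi(0 + 0) = \varphi(0) + \varphi(0)$, and we also have $\varphi(0 + 0) = \varphi(0)$ because $0 + 0 = 0$, so combining the two we have $\varphi(0) = \varphi(0) + \varphi(0)$, from which it follows by subtraction of $\varphi(0)$ from both sides that $\varphi(0) = 0$. Similarly, for every non-positive $x \in \mathbb{R}$ and every non-negative $n \in \mathbb{Z}$ (the set $\mathbb{Z}$ is the set of the integers) we have $\varphi((n + 1)x) = \varphi(nx + x) = \varphi(nx) + \varphi(x)$, and if we assume as an inductive hypothesis that $\varphi(nx) = n\varphi(x)$ it follows that $\varphi((n + 1)x) = (n + 1)\varphi(x)$. This proves by induction that $\varphi(nx) = n\varphi(x)$.

Now, let $c = \varphi(-1)$. For every non-positive $x \in \mathbb{Q}$ (the set $\mathbb{Q}$ is the set of the rational numbers), there is a non-positive $m \in \mathbb{Z}$ and a positive $n \in \mathbb{Z}$ such that $x = \frac{m}{n}$, and therefore

$$n\varphi(x) = \varphi(nx) = \varphi(m) = \varphi((-m)(-1)) = -m\varphi(-1) = -mc,$$

from which it follows that $\varphi(x) = -\frac{m}{n}c = -cx$.

The equation $\varphi(x) = -cx$ in fact holds for every non-positive $x \in \mathbb{R}$, not just for $x \in \mathbb{Q}$. To prove this, first, observe that $\varphi$ is strictly decreasing, because for every pair $x$, $y$ of non-positive real numbers such that $x < y$, we have $e^x < e^y$ and therefore $\varphi(x) > \varphi(y)$ (the mapping $S$ being strictly decreasing by property 2). Second, suppose $x$ is a non-positive real number and $\varphi(x) \ne -cx$. Then $-\frac{\varphi(x)}{c} \ne x$, i.e. either $-\frac{\varphi(x)}{c} < x$ (case 1) or $x < -\frac{\varphi(x)}{c}$ (case 2). Because every non-empty interval contains a rational number, it follows that there is an $a \in \mathbb{Q}$ such that $-\frac{\varphi(x)}{c} < a < x$ (in case 1) or $x < a < -\frac{\varphi(x)}{c}$ (in case 2).

In both cases, we have $a \le 0$. This is obvious in case 1, because $a < x \le 0$. As for case 2, observe first that $\varphi$ is non-negative-valued and therefore $\varphi(x) \ge 0$. If $c$ is positive, it will then follow that $-\frac{\varphi(x)}{c} \le 0$ and therefore $a < 0$ (because $a < -\frac{\varphi(x)}{c}$). To prove that $c$ is positive, observe that $\varphi$ is strictly decreasing, and we have $-1 < 0$, i.e. $\varphi(-1) > \varphi(0)$, and $\varphi(0) = 0$, i.e. $c > 0$.

Because $a \le 0$ in both cases, we have $\varphi(a) = -ca$. And by the strict decreasingness of $\varphi$ it follows that either $-ca > \varphi(x)$ (in case 1) or $-ca < \varphi(x)$ (in case 2). Rearranging these two inequalities gives us $a < -\frac{\varphi(x)}{c}$ (in case 1) or $a > -\frac{\varphi(x)}{c}$ (in case 2). But we already have $-\frac{\varphi(x)}{c} < a$ (in case 1) or $a < -\frac{\varphi(x)}{c}$ (in case 2), so in both cases there is a contradiction. It follows that $\varphi(x) \ne -cx$ cannot hold; we must have $\varphi(x) = -cx$.

Finishing off, note that for every $p \in (0, 1]$, we have $S(p) = \varphi(\ln p) = -c \ln p$. Let $b = e^{1/c}$; then $b > 1$ because $c > 0$, and $\ln b = \frac{1}{c}$, i.e. $c = \frac{1}{\ln b}$, from which it follows that $S(p) = -\frac{\ln p}{\ln b} = -\log_b p$ for every $p \in (0, 1]$. From this we can conclude that for every mapping $S$ on $[0, 1]$ such that properties 1–3 hold, we have

$$S(p) = -\log_b p,$$

where $b$ is a real number greater than 1.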

In the rest of this post, suppose $b$ is a real number greater than 1. The surprisingness mapping $S$ will be defined by

$$S(p) = -\log_b p.$$

It doesn’t really make any essential difference to the nature of $S$ which value of $b$ is chosen, because for every $p \in [0, 1]$, we have $S(p) = -\frac{\log p}{\log b}$ (where the symbol $\log$ with no base specified refers to the natural logarithm, the one to base $e$), and therefore the choice of base amounts to a choice of vertical scaling. Below are three plots of surprisingness against probability for three different bases (2, $e$ and 10); you can see that the shape of the graph is the same for each base, but the vertical scaling is different.

Note that regardless of the choice of base, we have

$$S(1) = -\log_b 1 = 0, \qquad S(0) = -\log_b 0 = +\infty.$$

That is, events thought sure to occur are entirely unsurprising, and events thought sure not to occur are infinitely surprising when they do happen.

Also,

$$S\left(\frac{1}{b}\right) = -\log_b \frac{1}{b} = 1.$$

I’m inclined to say on this basis that 2 is the most sensible choice of base, because of all the possible probabilities other than 0 and 1, the one which is most special and thus most deserving of corresponding to the unit surprisingness is probably $\frac{1}{2}$, the probability exactly halfway between 0 and 1. However, base $e$ is also quite convenient because it simplifies some of the formulae we’ll see later on in this post.
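As a small illustration, the surprisingness mapping is easy to write down in Python. This is just a sketch: the function name `surprisingness` and the default base of 2 are choices made here, and the value at probability 0 is taken to be positive infinity, as above.

```python
import math

def surprisingness(p, base=2):
    """Minus the base-b logarithm of p, computed as log_b(1/p)."""
    if p == 0:
        return math.inf
    return math.log(1 / p, base)

print(surprisingness(1))         # a sure thing is not surprising at all
print(surprisingness(0.5))       # one flip coming up heads
print(surprisingness(1 / 1024))  # ten heads in ten flips
# Property 3: surprisingness adds up over independent events.
print(surprisingness(0.5 * 0.25), surprisingness(0.5) + surprisingness(0.25))
```

In base 2, ten heads in ten flips comes out at about ten times the surprisingness of a single head, which is property 3 at work.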

Entropy

First, two basic definitions from probability theory (for readers who may not be familiar with them). Suppose $X$ is a finite set.

Definition 1. The probability distributions on $X$ are the positive real-valued mappings $f$ on $X$ such that

$$\sum_{x \in X} f(x) = 1. \tag{1}$$

The probability distributions on $X$ can be thought of as random generators of members of $X$. Each member of $X$ is generated with probability equal to the positive real number less than or equal to 1 associated with that member. The probabilities of generation for each member have to add up to 1, and this is expressed by equation (1). Note that the requirement that the probabilities add up to 1 implies the requirement that the probabilities are each less than or equal to 1, so there is no need to explicitly state the latter requirement in the definition.

Henceforth, suppose $f$ is a probability distribution on $X$.

Definition 2. For every real-valued mapping $\tau$ on $X$, the expected value or mean $\mathrm{E}_f(\tau)$ of the transformation of $f$ by $\tau$ is given by the formula

$$\mathrm{E}_f(\tau) = \sum_{x \in X} f(x) \tau(x). \tag{2}$$

Let $x_1$, $x_2$, … and $x_n$ be the members of $X$ and let $p_1 = f(x_1)$, $p_2 = f(x_2)$, … and $p_n = f(x_n)$. Then for every real-valued mapping $\tau$ on $X$, the expected value $\mathrm{E}_f(\tau)$ is the weighted average of $\tau(x_1)$, $\tau(x_2)$, … and $\tau(x_n)$, with the weights being the respective probabilities $p_1$, $p_2$, … and $p_n$.

If a very large number $N$ of values are generated (independently) by $f$, then the average value under $\tau$ of these values can be calculated as

$$\frac{k_1 \tau(x_1) + k_2 \tau(x_2) + \dots + k_n \tau(x_n)}{N} = \frac{k_1}{N} \tau(x_1) + \frac{k_2}{N} \tau(x_2) + \dots + \frac{k_n}{N} \tau(x_n),$$

where $k_1$, $k_2$, … and $k_n$ are the frequencies of the values $x_1$, $x_2$, … and $x_n$ in the sample. Given that $N$ is very large it is very likely that the relative frequencies $\frac{k_1}{N}$, $\frac{k_2}{N}$, … and $\frac{k_n}{N}$ will be close to $p_1$, $p_2$, … and $p_n$, respectively, and therefore the average will be close to $\mathrm{E}_f(\tau)$.

Now, the definition of the key concept of this post:

Definition 3. The expected surprisingness or entropy $\mathrm{H}(f)$ (that’s the Greek capital letter eta, not the Latin capital letter H) of $f$ is $\mathrm{E}_f(S \circ f)$ (here, $S \circ f$ is the composition of $S$ and $f$, i.e. $x \mapsto S(f(x))$).

The use of the word “entropy” here is closely related to the use of the same word in thermodynamics. However, I won’t go into the connection here (I don’t really know enough about thermodynamics to be able to talk intelligently about it).

Having the concept of entropy to hand gives us a fun new question we can ask about any given probability distribution: what is its entropy, or, more evocatively, how surprising is it?

By (2), we have

$$\mathrm{H}(f) = \sum_{x \in X} f(x) S(f(x)). \tag{3}$$

Using the definition of $S$, this can also be written as

$$\mathrm{H}(f) = -\sum_{x \in X} f(x) \log_b f(x).$$

Using logarithm properties, it can even be written as

$$\mathrm{H}(f) = \log_b \frac{1}{\prod_{x \in X} f(x)^{f(x)}}.$$

This shows that $\mathrm{H}(f)$ is the logarithm of the reciprocal of

$$\prod_{x \in X} f(x)^{f(x)}, \tag{5}$$

which is the geometric weighted average of $p_1$, $p_2$, … and $p_n$, with the weights being identical to the values averaged. A geometric weighted average is similar to an ordinary weighted average, except that the values averaged are multiplied rather than added and the weights are exponents rather than factors. The product (5) can therefore be thought of as the “expected probability” of the value generated by $f$. To the extent that $f$ is likely to generate one of the members of $X$ which has a low individual probability of being generated, the product (5) is small and accordingly $\mathrm{H}(f)$ is large. This may help give you a slightly more concrete idea why $\mathrm{H}(f)$ can be thought of as the expected surprisingness of $f$.
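Both ways of computing the entropy are easy to try out. The sketch below uses two made-up distributions (a fair die and a loaded one); the function names `entropy` and `expected_probability` are chosen here for illustration, and the base is fixed at 2.

```python
import math

def entropy(probs, base=2):
    """Expected surprisingness: the sum of -p * log_b(p) over the probabilities."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

def expected_probability(probs):
    """Geometric weighted average of the probabilities, weighted by themselves."""
    return math.prod(p ** p for p in probs)

fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

for probs in (fair_die, loaded_die):
    # The two printed values should agree up to floating-point error.
    print(entropy(probs), math.log(1 / expected_probability(probs), 2))
```

The loaded die has the higher “expected probability” and therefore the lower entropy: on average, its rolls are less surprising.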

Note that the value of $\mathrm{H}(f)$ is determined completely by the probabilities $p_1$, $p_2$, … and $p_n$. The values $x_1$, $x_2$, … and $x_n$ are irrelevant in and of themselves. This is evident from the formula (3), in which the expression $x$ only appears as a sub-expression of $f(x)$. To understand how $\mathrm{H}(f)$ is determined by these probabilities, it helps to look at the graph of the mapping $h \colon p \mapsto -p \log_b p$, which I have plotted below. The entropy $\mathrm{H}(f)$ is a sum of exactly $n$ values of this mapping.

It can be seen from the graph that the mapping $h$ has a maximum value. Using calculus, we can figure out what this maximum value is and exactly where it is attained.

Theorem 1. The maximum value attained by $h$ is $\frac{\log_b e}{e}$, and this maximum is attained at the point $\frac{1}{e}$ (and nowhere else).

Proof. The derivative of $h$ is $p \mapsto -\frac{1 + \log p}{\log b}$ (by the product rule), which is positive if $p < \frac{1}{e}$ (because then $\log p < -1$), equal to 0 if $p = \frac{1}{e}$ (because then $\log p = -1$), and negative if $p > \frac{1}{e}$ (because then $\log p > -1$). Therefore $h$ increases in value from the vicinity of 0 to $\frac{1}{e}$ and decreases in value from $\frac{1}{e}$ towards 1, which means it attains a maximum at $\frac{1}{e}$. The value there is $-\frac{1}{e} \log_b \frac{1}{e} = \frac{\log_b e}{e}$.
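The location and size of the maximum can also be checked numerically. This is a rough sketch using a grid search rather than calculus; the name `h` stands for the mapping $p \mapsto -p \log_b p$, with the value at 0 taken to be 0 and the base fixed at 2 for the example.

```python
import math

def h(p, base=2):
    """The mapping p -> -p * log_b(p), with the value at 0 taken to be 0."""
    return 0.0 if p == 0 else -p * math.log(p, base)

# Evaluate h on a fine grid over [0, 1] and pick the grid point with the largest value.
grid = [k / 10_000 for k in range(10_001)]
best = max(grid, key=h)

print(best, 1 / math.e)                       # location of the maximum
print(h(best), math.log(math.e, 2) / math.e)  # value of the maximum
```

The grid point found should be within one grid step of 1/e, and the value there should match $\frac{\log_2 e}{e}$ to several decimal places.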

Because $\mathrm{H}(f)$ is a sum of $n$ values of the mapping $p \mapsto -p \log_b p$, we may conclude that the inequality

$$\mathrm{H}(f) \le \frac{n \log_b e}{e}$$

always holds. However, this upper bound on the value of $\mathrm{H}(f)$ can be improved, as we’ll see below. After all, in proving that it holds we’ve made no use of equation (1).

Cross entropy

The concept of cross entropy is a useful generalization of the concept of entropy.

Suppose another probability distribution on . Think of and as two different models of the random member-of- generator: is the right model (or a more right model, if you don’t like the idea of one single right model), and is an observer’s working model. The probabilities , , … and can be thought of as the real probabilities of generation of , , … and , respectively, while the probabilities , , … and can be thought of as the observer’s subjective probabilities. Although the real probabilities determine what the observer observes, the subjective probabilities are what determine how surprised the observer is by what they observe. Therefore, if the observer calculates entropy by averaging their surprisingness over a large number of observations, they will get something close to

i.e. . This quantity is called the cross entropy from to and denoted . Note that if then ; that’s why the concept of cross entropy is a generalization of the concept of entropy. Note also that is not necessarily the same as .

Your intuition should tell you that will always be greater than , if . Why? Think about it: is the expected surprisingness if the observer has the wrong model, is the expected surprisingness if the observer has the right model. Having the right model should lead to the observer being better able to predict which member of is generated and thus to the observer being less likely to be surprised.
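A small numerical experiment makes this intuition concrete. The sketch below assumes base-2 logarithms and two made-up distributions, `real` and `subjective`, over three values; it also assumes the observer's model assigns nonzero probability to everything that can actually happen.

```python
import math

def cross_entropy(real, subjective):
    """Average surprisingness when values follow `real` but surprise is judged by `subjective`."""
    return sum(p * math.log2(1 / q) for p, q in zip(real, subjective) if p > 0)

# Hypothetical distributions: the real one is skewed, the observer assumes uniformity.
real = [0.7, 0.2, 0.1]
subjective = [1 / 3, 1 / 3, 1 / 3]

right_model = cross_entropy(real, real)        # this is just the entropy of `real`
wrong_model = cross_entropy(real, subjective)  # observer uses the wrong model
swapped = cross_entropy(subjective, real)      # roles of the two distributions exchanged

print(right_model, wrong_model, swapped)
```

In this example the wrong model yields the larger average surprise, and swapping the two arguments gives a different number, which illustrates that cross entropy is not symmetric.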

It is indeed the case that

if (which further reassures us that is a very good mathematical model of the intuitive notion of surprisingness). The inequality (6) is called Gibbs’ inequality. In order to prove it, first, observe that it may be rewritten as

Now, the quantity on the left-hand side of (7) has a name of its own: it’s called the Kullback-Leibler divergence from to and denoted . It measures the “penalty” in increased surprisingness the observer gets for having the wrong model; it can also be thought of as a measure of how different the probability distributions and are from each other, hence the name “divergence”. As for the “Kullback-Leibler” part, that’s just because mathematicians have come up with lots of different ways of measuring how different two probability distributions are from each other and Kullback and Leibler were the mathematicians who came up with this particular measure. I won’t be referring to any other such measures in this post, however, so whenever I need to refer to Kullback-Leibler divergence again I’ll just refer to it as “divergence”.

So Gibbs’ inequality, reformulated, states that the divergence between two unequal probability distributions is always positive. To prove this, it’s helpful to first write out an explicit expression for :

Second, we prove a lemma.

Lemma 1. For every positive , we have , where is the natural logarithm (logarithm to base ) of , with equality if and only if .

Proof. The inequality in question is equivalent to (by multiplication of both sides by ). The right-hand side of this inequality is equal to . Consider the mapping . The derivative of this mapping is , which is negative if (because then and therefore ), equal to 0 if (because then ) and positive if (because then and therefore ). Therefore is strictly decreasing on and strictly increasing on . It follows that its minimum value is attained at the point . And that minimum value is .

Using Lemma 1, we have

for every , with equality if and only if , i.e. . Given that , there is at least one such that and therefore (8) holds without equality. It follows that

which proves Gibbs’ inequality.
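Gibbs’ inequality can be spot-checked on randomly generated distributions. This is a sketch, not a proof: the helper names, the use of base-2 logarithms, the number of trials, and the fixed random seed are all assumptions chosen for illustration.

```python
import math
import random

def divergence(real, subjective):
    """Kullback-Leibler divergence: cross entropy minus entropy, written as one sum."""
    return sum(p * math.log2(p / q) for p, q in zip(real, subjective) if p > 0)

def random_distribution(size, rng):
    """A random probability distribution with strictly positive probabilities."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is repeatable
divergences = [
    divergence(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(1000)
]

same = random_distribution(5, rng)
print(min(divergences), divergence(same, same))
```

Every pair of distinct distributions should give a positive divergence, while a distribution compared with itself gives zero.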

Gibbs’ inequality is quite powerful and useful. For example, it can be used to figure out what the maximum possible entropy is. Suppose is such that , regardless of the value of . Then if , we have by Gibbs’ inequality, and therefore is the maximum entropy possible for probability distributions on and is the unique probability distribution on whose entropy is . Is there a probability distribution on such that , regardless of the value of ? There is indeed. Let be the uniform distribution on , i.e. the mapping (remember, is the number of members has). Then

so is a probability distribution, and regardless of the value of we have

Therefore, the maximum entropy possible on is , and the uniform distribution on is the probability distribution which attains this maximum.

A consequence of this is that the divergence of from the uniform distribution on is given by

which is just the negation of plus a constant (well, a constant depending on the size of the set which is distributed over). Therefore, among probability distributions on specifically, entropy can be thought of as a measure of divergence from the uniform distribution. Among probability distributions in general, entropy is a measure of both divergence from the uniform distribution and the size of the distributed-over set.
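The two facts just derived, that the uniform distribution attains the maximum entropy and that divergence from it is a constant minus the entropy, can be verified on a small example. The sketch assumes base-2 logarithms; the distribution `p` and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Expected surprisingness in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def divergence(real, subjective):
    """Kullback-Leibler divergence from `subjective` to `real`, in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(real, subjective) if p > 0)

p = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution over four values
n = len(p)
uniform = [1 / n] * n

max_entropy = entropy(uniform)           # should equal log2(n)
from_uniform = divergence(p, uniform)    # should equal log2(n) - entropy(p)

print(max_entropy, from_uniform, math.log2(n) - entropy(p))
```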

Philosophical implications

So far, we’ve seen that the informal concept of surprisingness can be formalized mathematically with quite a high degree of success—one might even say a surprising degree of success, which is fitting—and that’s pretty neat. But there are also some deeper philosophical issues related to all this. I’ve avoided talking about them up till now because philosophy is not really my field, but I couldn’t let them go completely unmentioned.

Suppose that you know that one of values , , … and values (where is a positive integer) will be generated by some probability distribution , but you don’t know the probabilities; you have absolutely no information about the probability distribution other than the set of values it may generate. What should your subjective probability distribution be? A possible answer to this question is that you shouldn’t have one—you simply don’t know the probabilities, and that’s all that can be said. And that’s reasonable enough, especially if you don’t like the concept of subjective probability or think it’s incoherent (e.g. if you’re a strict frequentist when it comes to interpreting probability). But if you accept that subjective probabilities are things it makes sense to talk about, well, the whole point of subjective probabilities is that they represent states of knowledge under uncertainty, so there’s no reason in principle to avoid having a subjective probability distribution just because this particular state of knowledge is particularly uncertain. Of course this subjective probability distribution may change as you gather more information—think of it as your “best guess” to start with, to be improved later on.

There are some people who’d say that you can choose your initial “best guess” on an essentially arbitrary basis, according to your own personal whims. No matter which subjective probability distribution you choose to start with, it will get better and better at modelling reality as you change it as you gather more information; the initial state isn’t really important. The initial state is of interest only in that if we have some updating mechanism in mind, we’d like to be able to prove that convergence to the real probability distribution will always happen independently of the initial state.

There is another position which can be taken, however, which is that there is in fact a certain objectively best subjective probability distribution to start with. This position is associated with the marquis de Laplace (1749–1827), who wrote a classic text on probability theory, the Théorie analytique des probabilités (Laplace was also a great contributor to many other fields of mathematics, as mathematicians back then tended to be—they didn’t do specialization back then). In Laplace’s opinion, the correct distribution to start with was the uniform distribution. That is, given that we know nothing about the probabilities, assume they are all the same (after all, we have no reason to believe any one is larger than any other). This principle is called the principle of indifference.

The concept of probability as a whole can be based on the idea of the principle of indifference. The idea would be that on some level, any probability distribution is over a set of interchangeable values and therefore uniform. However, we are often interested only in whether one of a particular class of values is generated (not in which particular member of that class) and the probability of interest in that case is the sum of the probabilities of each of the values in that class, which, because the underlying distribution is uniform, can also be expressed as the ratio of the number of values in the class to the total number of values which may be generated. I don’t know how far this is how Laplace thought of probability; I don’t want to be too hasty to attribute views to a 19th-century author which might be out of context or just incorrect (after all, I can’t read French so all my information about Laplace is second-hand).

It’s not hard to argue with the principle of indifference. It’s quite difficult to think of any reasonable justification for it at all. It was indeed attacked in vehement terms by later probability theorists, such as John Venn (1834–1923) (that’s right, the Venn diagram guy). In modern times, however, it has been championed by the statistical physicist E. T. Jaynes (1922–1998), who also came up with an interesting extension of it.

In the mathematical section of this post, we saw that the uniform distribution over , , … and was the one with the most entropy, i.e. the one which is “most surprising on average”. Therefore, a natural generalization of the principle of indifference would be to say that in any circumstance in which a subjective probability distribution must be assumed in a situation of uncertainty, one should assume whichever distribution has the most entropy while still being compatible with what is known. For example, if it is known what the mean is then the assumed distribution should be the one with the most entropy among all distributions having that specific mean. This is called the principle of maximum entropy or MaxEnt for short.
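To make the mean-constrained example concrete, here is a sketch for a die-like variable taking the values 1 to 6. It relies on a standard result that is not derived in this post: among distributions on a finite set of numbers with a prescribed mean, the entropy maximiser has probabilities proportional to exp(λ·value) for some real λ. The function name, the bisection bounds, and the target means are assumptions for illustration.

```python
import math

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iterations=200):
    """Find the distribution p(v) proportional to exp(lam * v) whose mean is target_mean.

    The mean is an increasing function of lam, so bisection on lam suffices.
    """
    def distribution(lam):
        shift = max(lam * v for v in values)  # subtract the max exponent to avoid overflow
        weights = [math.exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(v * p for v, p in zip(values, distribution(lam)))

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2)

values = [1, 2, 3, 4, 5, 6]
fair = maxent_with_mean(values, 3.5)    # mean of a fair die: should come out uniform
loaded = maxent_with_mean(values, 4.5)  # higher mean: weight shifts towards large values

print(fair)
print(loaded)
```

When the prescribed mean is the one the uniform distribution already has, MaxEnt reduces to the principle of indifference; any other mean tilts the probabilities smoothly in one direction.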

The MaxEnt principle makes sense intuitively, sort of, in a way that the principle of indifference on its own doesn’t. If you don’t know something about the probability distribution, you should expect to be surprised more often by the values it generates than if you do know that something. It’s still on fairly shaky ground, though, and I don’t know how far it is accepted by people other than Jaynes as a normative principle, as opposed to just one way in which you might choose the subjective probability distribution to start with, in competition with other ways of doing so on the basis of strictly pragmatic considerations (which seems to be how a lot of people in the practical applications side of things view it). In any case it gives us a philosophical motivation for examining the mathematical problem of finding the probability distributions that have the most entropy given particular constraints. The mathematical problem is interesting itself, but the philosophical connection makes it more interesting.

Addendum

In part 2 of this post, I’m going to describe how the famous normal distribution can be characterized as a maximum entropy distribution: namely, the normal distribution with mean $\mu$ (a real number) and standard deviation $\sigma$ (a positive real number) is the absolutely continuous probability distribution over $\mathbb{R}$ with the most entropy among all absolutely continuous probability distributions over $\mathbb{R}$ with mean $\mu$ and standard deviation $\sigma$. That roughly means that by MaxEnt, if you know nothing about a continuous probability distribution other than that it has a particular mean and a particular standard deviation, your best guess to start with is that it’s normal. You can understand the famous Central Limit Theorem from that perspective: as you add up independent, identically distributed random variables, the mean and variance of the distribution in question will be carried over into the sum (elementary statistical theory tells us that the mean of any sum of random variables is the sum of the individual means, and the variance of any sum of independent random variables is the sum of the individual variances), but every other distinctive property of the distribution is gradually kneaded out by the summing process, so that as the sum gets larger and larger all we can say about the probability distribution of the sum is that it has this particular mean and this particular variance. I intend to finish off part 2 with a proof of the Central Limit Theorem from this perspective, although that might be a little ambitious. Before that, though, the other big thing which I need to cover in part 2 is defining entropy in the case of a continuous probability distribution: I’ve only been talking about discrete probability distributions in part 1, and it turns out the extension is not entirely a straightforward matter.

A non-negative real- or ∞-valued mapping f on a field of sets is said to be finitely additive if and only if for every pair of disjoint sets A and B in the field, we have

$$f(A \cup B) = f(A) + f(B).$$

The most important examples of finitely additive mappings are measures, including probability measures, although not every finitely additive mapping is a measure (measures are mappings on σ-algebras, which are a special sort of field of sets, that are countably additive, which is a stronger property than finite additivity).

From the definition it is immediately evident that finite additivity allows us to express the value under a finitely additive mapping f on a field of sets of any binary union of disjoint sets in the field in terms of the values under f of the individual sets. In fact, the same can be said for unions of any arity, provided the sets are pairwise disjoint. For every field of sets, every finitely additive mapping f on it, every nonnegative integer n [1] and every n-tuple (A1, A2, …, An) of pairwise disjoint sets in the field, we have

$$f\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{k=1}^{n} f(A_k). \tag{1}$$

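But what about unions of sets in the field that are not necessarily pairwise disjoint? Can the values under f of such unions be expressed in terms of the values under f of the individual sets then? The answer is no. However, such unions’ values under f can be expressed in terms of the values under f of the individual sets and their intersections, by what is known as the inclusion-exclusion principle. For every nonnegative integer n and every n-tuple (A1, A2, …, An) of sets in the field, we have

$$f\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{\emptyset \ne S \subseteq \{1, \dots, n\}} (-1)^{|S| - 1} f\left(\bigcap_{k \in S} A_k\right). \tag{2}$$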

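The sum on the right-hand side of (2) is a rather complicated one, cumbersome to write down as well as computationally expensive to compute by virtue of its large number of terms (one for every non-empty subset of $\{1, \dots, n\}$, and there are $2^n - 1$ of those). Therefore, it is also convenient to use Bonferroni’s inequalities, which say that for every nonnegative integer m, we have

$$f\left(\bigcup_{k=1}^{n} A_k\right) \gtreqless \sum_{\substack{S \subseteq \{1, \dots, n\} \\ 1 \le |S| \le m}} (-1)^{|S| - 1} f\left(\bigcap_{k \in S} A_k\right), \tag{3}$$

with the sign $\gtreqless$ standing for “greater than or equal to, if m is even; less than or equal to, if m is odd”. The sum on the right-hand side of (3) has terms only for every non-empty subset of $\{1, \dots, n\}$ with no more than m members, and there are only $\sum_{k=1}^{m} \binom{n}{k}$ of those, which is far fewer than $2^n - 1$ when m is small compared with n. In particular, if m = 1 the terms are just the values under f of the individual sets A1, A2, … and An and therefore we have

$$f\left(\bigcup_{k=1}^{n} A_k\right) \le \sum_{k=1}^{n} f(A_k),$$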

which is Boole’s inequality.

Note that Bonferroni’s inequalities hold when m ≥ n as well as when m < n. When m ≥ n, the sum on the RHS of (3) is exactly the same as the sum on the RHS of (2). Because there are both even and odd integers m such that m ≥ n, and any quantity which is both less than or equal to and greater than or equal to another quantity has to be equal to it, it follows that Bonferroni’s inequalities imply that the equation (2) holds and thus generalize the inclusion-exclusion principle.
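The alternating behaviour of the truncated sums is easy to watch on a concrete example. The sketch below takes f to be the cardinality of a finite set (which is finitely additive) and uses four made-up sets; the function name `truncated_sum` is an illustrative choice.

```python
from itertools import combinations

def truncated_sum(sets, m):
    """Sum of (-1)^(|S|-1) * |intersection| over non-empty index subsets S with |S| <= m."""
    total = 0
    for size in range(1, min(m, len(sets)) + 1):
        for chosen in combinations(sets, size):
            total += (-1) ** (size - 1) * len(set.intersection(*chosen))
    return total

# Hypothetical overlapping sets; f is taken to be cardinality.
sets = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}, {1, 7, 8}]
union_size = len(set.union(*sets))

# Truncations for m = 0, 1, ..., n.
partial = [truncated_sum(sets, m) for m in range(len(sets) + 1)]
print(union_size, partial)  # prints: 8 [0, 14, 7, 8, 8]
```

The even-indexed truncations (0, 7, 8) never exceed the size of the union and the odd-indexed ones (14, 8) never fall below it. Once m reaches n the truncated sum is the full inclusion-exclusion sum and equals the size of the union exactly.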

In order to prove that Bonferroni’s inequalities hold, and thus prove the inclusion-exclusion principle, we can use induction. First, consider the case where m = 0. The only subset of $\{1, \dots, n\}$ of cardinality 0 is the empty one, so in this case the sum on the right-hand side of (3) is empty and (3) therefore reduces to the statement that

$$f\left(\bigcup_{k=1}^{n} A_k\right) \ge 0,$$

which is true because f is non-negative- or ∞-valued.

Now, suppose M is a nonnegative integer and Bonferroni’s inequalities hold in the case where m = M, and consider the case where m = M + 1. We shall use another inductive proof, within the inductive proof we’re currently carrying out, to show that Bonferroni’s inequalities hold for every nonnegative integer n when m has this particular value. In the case where n = 0, the left-hand side of (3) reduces to f(∅) and the right-hand side is the empty sum once again, because there are no non-empty subsets of $\{1, \dots, n\}$ (when n = 0 this is the set of the integers greater than or equal to 1 and less than or equal to 0, and there are of course no such integers). Because f(∅) = 0 it follows that (3) is true, regardless of the direction of the inequality required.

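As for the successor case, suppose that N is a nonnegative integer and Bonferroni’s inequalities hold in the case where n = N. Consider the case where n = N + 1. It is helpful at this point to write (3) for the given values of m and n as

$$(-1)^{M+1}\,\bigl(f(A) - a\bigr) \ge 0,$$

where

$$A = \bigcup_{k=1}^{N+1} A_k, \qquad a = \sum_{\substack{S \subseteq \{1, \dots, N+1\} \\ 1 \le |S| \le M+1}} (-1)^{|S| - 1} f\left(\bigcap_{k \in S} A_k\right).$$

If we also let

$$B = \bigcup_{k=1}^{N} A_k, \qquad b = \sum_{\substack{S \subseteq \{1, \dots, N\} \\ 1 \le |S| \le M+1}} (-1)^{|S| - 1} f\left(\bigcap_{k \in S} A_k\right),$$

then we have $(-1)^{M+1}(f(B) - b) \ge 0$ because Bonferroni’s inequalities hold in the case where n = N. Now, how are f(A) and a related to f(B) and b?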

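We obviously have $A = B \cup A_{N+1}$, but $B$ and $A_{N+1}$ are not necessarily disjoint so this doesn’t immediately tell us anything about the relationship of f(A) and f(B). However, $A_{N+1}$ is certainly disjoint from $B \setminus A_{N+1}$, and the union of these two sets is still $A$, so we have

$$f(A) = f(B \setminus A_{N+1}) + f(A_{N+1}). \tag{4}$$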

That’s about all we can usefully say for the moment. As for a and b, well, the terms of b are a subset of those of a so it’s quite easy to write down the difference a − b. If we manipulate that difference a little bit, we can start getting it to look like something that could occur on the right-hand side of (3).

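Let

$$C = \bigcup_{k=1}^{N} (A_k \cap A_{N+1}), \qquad c = \sum_{\substack{S \subseteq \{1, \dots, N\} \\ 1 \le |S| \le M}} (-1)^{|S| - 1} f\left(\bigcap_{k \in S} (A_k \cap A_{N+1})\right).$$

Then we have $(-1)^{M}(f(C) - c) \ge 0$, because Bonferroni’s inequalities hold in the case where m = M, and we have $a - b = f(A_{N+1}) - c$.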

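Now, if we use the distributivity of intersection over union we can rewrite $C$ as $B \cap A_{N+1}$. It follows that $B$ is the disjoint union of $C$ and the set $B \setminus A_{N+1}$ which turned up above when we were contemplating the relationship of f(A) and f(B), and therefore we have $f(B) = f(C) + f(B \setminus A_{N+1})$. Using this new equation we may rewrite (4) as

$$f(A) = f(B) - f(C) + f(A_{N+1}),$$

from which it follows that $f(A) - f(B) = f(A_{N+1}) - f(C)$, a nicely analogous equation to $a - b = f(A_{N+1}) - c$. Finally, let us add the two quantities $(-1)^{M+1}(f(B) - b)$ and $(-1)^{M}(f(C) - c)$, which we know to be non-negative. The sum of two non-negative quantities is non-negative also, so we have

$$(-1)^{M+1}\bigl((f(B) - f(C)) - (b - c)\bigr) \ge 0.$$

Because $f(B) - f(C) = f(A) - f(A_{N+1})$ and $b - c = a - f(A_{N+1})$, the left-hand side is equal to $(-1)^{M+1}(f(A) - a)$, so this is exactly the inequality we set out to prove for n = N + 1.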

Consider an entity (for example, a language) which may or may not have a particular property (for example, obligatory coding of grammatical number). For convenience and interpretation-neutrality, we shall say that the entity is positive if it has this property and negative if it does not have this property. Consider the entity as it changes over the course of a number of events (for example, transmissions of the language from one generation to another) in which the entity’s state (whether it is positive or negative) may or may not change. For every nonnegative integer , let represent the entity’s state after exactly events have occurred, with negativity being represented by 0 and positivity being represented by 1. The initial state is a constant parameter of the model, but the states at other times are random variables whose “success” probabilities (i.e. values of 1 under their probability mass functions) are determined by and the other parameters of the model.

The other parameters of the model, besides , are denoted by and . These represent the probabilities that an event will change the state from negative to positive or from positive to negative, respectively. They are assumed to be constant across events—this assumption can be thought of as an interpretation of the uniformitarian principle familiar from historical linguistics and other fields. I shall call a change of state from negative to positive a gain and a change of state from positive to negative a loss, so that can be thought of as the gain rate per event and can be thought of as the loss rate per event.

Note that the gain resp. loss probability is / only if the state is negative resp. positive as the event begins. If the state is already positive resp. negative as the event begins then it is impossible for a further gain resp. loss to occur and therefore the gain resp. loss probability is 0 (but the loss resp. gain probability is /). Thus the random variables , , , … are not necessarily independent of one another.

I am aware that there’s a name for a sequence of random variables that are not necessarily independent of one another, namely “stochastic process”. However, that is about the extent of what I know about stochastic processes. I think the thing I’m talking about in this post is a very simple example of a stochastic process–an appropriate name for it would be the gain-loss process. If you know something about stochastic processes it might seem very trivial, but it was an interesting problem for me to try to figure out knowing nothing already about stochastic processes.

1.2. The solution

Suppose is a nonnegative integer and consider the state after exactly events have occurred. If the entity is negative as the th event begins, the probability of gain during the th event is . If the entity is positive as the th event begins, the probability of loss during the th event is . Now, as the th event begins, exactly events have already occurred. Therefore the probability that the entity is negative as the th event begins is and the probability that the entity is positive as the th event begins is . It follows by the law of total probability that

This recurrence relation can be solved using the highly sophisticated method of “use it to find general equations for the first few terms in the sequence, extrapolate the pattern, and confirm that the extrapolation is valid using a proof by induction”. I’ll spare you the laborious first phase, and just show you the second and third. The solution is

Just so you can check that this is correct, the proofs by induction for the separate cases are given below.
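Independently of the induction proofs, the recurrence can be compared with a closed form numerically. This sketch assumes the closed form is the gain probability times a geometric sum when the initial state is negative, and one minus the loss probability times the same sum when it is positive; the names `gain`, `loss`, `recurrence` and `closed_form`, and the parameter values, are all hypothetical.

```python
def recurrence(initial_state, gain, loss, events):
    """Success probabilities after 0, 1, ..., `events` events, via the law of total probability."""
    probs = [float(initial_state)]
    for _ in range(events):
        prev = probs[-1]
        # negative -> positive with probability gain; positive stays positive with 1 - loss
        probs.append((1 - prev) * gain + prev * (1 - loss))
    return probs

def closed_form(initial_state, gain, loss, n):
    """Assumed closed form: a geometric sum with common ratio 1 - gain - loss."""
    series = sum((1 - gain - loss) ** k for k in range(n))
    return gain * series if initial_state == 0 else 1 - loss * series

gain, loss = 0.1, 0.3  # hypothetical per-event gain and loss probabilities
for start in (0, 1):
    stepwise = recurrence(start, gain, loss, 20)
    direct = [closed_form(start, gain, loss, n) for n in range(21)]
    print(start, max(abs(s - d) for s, d in zip(stepwise, direct)))
```

The printed discrepancies should be at the level of floating-point rounding for both initial states.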

Case 1 (negative initial state). Base case. The expression

evaluates to 0 if , because the sum is empty.

Successor case. For every nonnegative integer such that

we have

Case 2 (positive initial state). Base case. The expression

evaluates to 1 if , because the sum is empty.

Successor case. For every nonnegative integer such that

we have

I don’t know if there is any way to make sense of why exactly these equations are the way they are; if you have any ideas, I’d be interested to hear your comments. There is a nice way I can see of understanding the difference between the two cases. Consider an additional gain-loss process which changes in tandem with the gain-loss process that we’ve been considering up till just now, so that its state is always the opposite of that of . Then the gain rate of is (because if gains, loses) and the loss rate of is (because if loses, gains). And for every nonnegative integer , if we let denote the state of after exactly events have occurred, then

because if and only if . Of course, we can also rearrange this equation as .

Now, we can use the equation for Case 1 above, but with the appropriate variable names for substituted in, to see that

and it then follows that

Anyway, you may have noticed that the sum

which appears in both of the equations for is a geometric progression whose common ratio is . If , then and therefore (because and are probabilities, and therefore non-negative). The probability is then simply constant at 0 if (because gain is impossible) and constant at 1 if (because loss is impossible). Outside of this very trivial case, we have , and therefore the geometric progression may be written as a fraction as per the well-known formula:

It follows that

From these equations it is easy to see the limiting behaviour of the gain-loss process as the number of events approaches infinity. If , then and therefore (because and are probabilities, and therefore not greater than 1). The equations in this case reduce to

which show that the state simply alternates deterministically back and forth between positive and negative (because is 0 if is even and 1 if is odd and is 1 if is even and 0 if is odd).

Otherwise, we have and therefore

Now the equations for and above are the same apart from the term in the numerator which contains as a factor, as well as another factor which is independent of . Therefore, regardless of the value of ,

This is a nice result: if is sufficiently large, the dependence of on , , … and is negligible and its success probability is negligibly different from . That it is this exact quantity sort of makes sense: it’s the ratio of the gain rate to the theoretical rate of change of state in either direction that we would get if both a gain and loss could occur in a single event.
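The fading influence of the initial state can be seen by evaluating the success probability for growing numbers of events. The sketch assumes the limit is the gain probability divided by the sum of the gain and loss probabilities, as the ratio described above suggests; `gain`, `loss` and `success_probability` are hypothetical names and the parameter values are made up.

```python
def success_probability(initial_state, gain, loss, n):
    """Probability of being positive after n events, using the geometric series formula."""
    if gain + loss == 0:
        return float(initial_state)  # nothing can ever change
    ratio = 1 - gain - loss
    series = (1 - ratio ** n) / (gain + loss)
    return gain * series if initial_state == 0 else 1 - loss * series

gain, loss = 0.1, 0.3
limit = gain / (gain + loss)

for n in (1, 5, 20, 100):
    from_negative = success_probability(0, gain, loss, n)
    from_positive = success_probability(1, gain, loss, n)
    print(n, from_negative, from_positive, limit)
```

Both columns approach the same value from opposite sides, with the gap shrinking by the factor 1 − gain − loss at every event.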

In case you like graphs, here’s a graph of the process with , , and 500 events. The x-axis is the number of events that have occurred and the y-axis is the observed frequency, divided by 1000, of the state being positive after this number of events has occurred (for the blue line) or the probability of the state being positive according to the equations described in this post (for the green line). If you want to, you can view the Python code that I used to generate this graph (which is actually capable of simulating multiple-trait interactions, although I haven’t tried solving it in that case) on GitHub.
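For readers who would rather experiment than read the linked code, here is a separate minimal Monte Carlo sketch of a single-trait gain-loss process. It is not the code used for the graph: the function name, the parameter values, the number of runs, and the random seed are all assumptions made for illustration.

```python
import random

def simulate(initial_state, gain, loss, events, runs, seed=0):
    """Observed frequency of the positive state after each event count, over many runs."""
    rng = random.Random(seed)
    positive_counts = [0] * (events + 1)
    for _ in range(runs):
        state = initial_state
        positive_counts[0] += state
        for n in range(1, events + 1):
            if state == 0:
                if rng.random() < gain:   # a gain can only happen from the negative state
                    state = 1
            elif rng.random() < loss:     # a loss can only happen from the positive state
                state = 0
            positive_counts[n] += state
    return [count / runs for count in positive_counts]

frequencies = simulate(initial_state=0, gain=0.1, loss=0.3, events=50, runs=4000)
print(frequencies[0], frequencies[10], frequencies[-1])
```

With these made-up parameters the observed frequency should settle close to gain / (gain + loss), up to sampling noise.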

2. The continuous process

2.1. The problem

Let us now consider the same process, but continuous rather than discrete. That is, rather than the gains and losses occurring over the course of a discrete sequence of events, we now have a continuous interval in time, during which at any point losses and gains might occur instantaneously. The state of the process at time shall be denoted . Although multiple gains and losses may occur during an arbitrary subinterval, we may assume for the purpose of approximation that during sufficiently short subintervals only one gain or loss, or none, may occur, and the probabilities of gain and loss are directly proportional to the length of the subinterval. Let be the constant of proportionality for gain and let be the constant of proportionality for loss. These are the continuous model’s analogues of the and parameters in the discrete model. Note that they may be greater than 1, unlike and .
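The approximating assumption can be turned directly into a computation: split the interval from time 0 to a chosen time into many short subintervals and treat each one as a discrete event whose gain and loss probabilities are the rates multiplied by the subinterval length. The sketch below only shows the resulting values settling down as the subdivision gets finer; `gain_rate`, `loss_rate`, the function name and the numbers used are hypothetical.

```python
def discretised_probability(initial_state, gain_rate, loss_rate, t, n):
    """Probability of being positive at time t when [0, t] is cut into n equal subintervals."""
    step = t / n
    gain = gain_rate * step   # per-subinterval gain probability (needs to be at most 1)
    loss = loss_rate * step   # per-subinterval loss probability (needs to be at most 1)
    p = float(initial_state)
    for _ in range(n):
        p = (1 - p) * gain + p * (1 - loss)
    return p

# Rates may exceed 1, as long as the subintervals are short enough.
estimates = {
    n: discretised_probability(0, gain_rate=2.0, loss_rate=3.0, t=0.5, n=n)
    for n in (10, 100, 1000, 10000, 100000)
}
for n, value in estimates.items():
    print(n, value)
```

Successive refinements change the answer less and less, which is what makes it reasonable to talk about a limiting continuous process at all.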

2.2. The solution

Suppose is a non-negative real number and is a positive integer. Let . The interval in time from time 0 to time can be divided up into subintervals of length . If is small enough, so that the approximating assumptions described in the previous paragraph can be made, then the subintervals can be regarded as discrete events, during each of which gain occurs with probability if the state at the start point of the subinterval is negative and loss occurs with probability if the state at the start point of the subinterval is positive. For every integer between 0 and inclusive, let denote the state of this discrete approximation of the process at time . Then for every integer between 0 and (inclusive) we have

provided and are not both equal to 0 (in which case, just as in the discrete case, the state remains constant at whatever the initial state was).

Many of the factors in this equation can be cancelled out, giving us

Now consider the case where in the limit approaches . Note that approaches 0 at the same time, because , and therefore the limit of is not simply 0 as in the discrete case. If we rewrite the expression as

and make the substitution , giving us

then we see that the limit is in fact , an exponential function of . It follows that

This is a pretty interesting result. I initially thought that the continuous process would just have the solution , completely independent of and , based on the idea that it could be viewed as a discrete process with an infinitely large number of events within every interval of time, so that it would constantly behave like the discrete process does in the limit as the number of events approaches infinity. In fact it turns out that it still behaves like the discrete process, with the effect of the initial state never quite disappearing—although it does of course disappear in the limit as approaches , because approaches 0: