Thursday, May 10, 2007

Scrabble point, letter distributions and actual English usage

A few days ago I got sucked into a game of Scrabble with my mother and grandmother. The matriarchs are both prolific players, but being so audacious, I expected to win anyway. Instead, my mother and I tied, even after exhausting the rules for tiebreaking.

I place blame on the old American Heritage Dictionary we were using. Clearly written by troglodytes, it fails to recognize "dreg" as a word, recognizing the noun exclusively as the plural "dregs" (are there nouns that have no singular form, only a plural form?). And I blame drawing two Ls late in the game. I whined that for being worth only one point apiece, they are difficult to use.

Point scores correlate inversely with the letter frequency distribution (LFD) at a statistically significant .71. That's fairly rigorous, especially given the number of tiles worth a single point (68 of 98, or 69%), with the most frequently used single-pointer, E, appearing more than three times as often as the least common, L.

The game's tile distribution is even more impressive, correlating with LFD at .91. Throwing points into the mix for a multivariate analysis shores it up a smidgeon to .92.
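For anyone who wants to try this kind of check, here is a minimal sketch in Python. The tile counts and point values are those of the standard English-language set (blanks excluded). The frequency table is an assumption on my part: commonly cited approximate percentages for written English, not necessarily the LFD used for the figures above, so the correlations it prints need not match them exactly.

```python
from math import sqrt

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Standard English Scrabble set, A..Z, blanks excluded.
TILES = [9, 2, 2, 4, 12, 2, 3, 2, 9, 1, 1, 4, 2,
         6, 8, 2, 1, 6, 4, 6, 4, 2, 2, 1, 2, 1]
POINTS = [1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
          1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10]
# ASSUMPTION: approximate letter frequencies (percent) in written English.
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97,
        0.15, 0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99,
        6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_points = pearson(POINTS, FREQ)  # expected to be negative (inverse relation)
r_tiles = pearson(TILES, FREQ)    # expected to be positive
print(f"points vs LFD: {r_points:.2f}")
print(f"tiles  vs LFD: {r_tiles:.2f}")
```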

Still, about 15% of the variation in LFD is left unexplained (a correlation of .92 accounts for roughly 85% of it). Consequently, some letters are mathematically more favorable even after both frequency and point value are taken into consideration. The index that follows was created by building a regression formula from the data described above, comparing predicted LFD with actual LFD, dividing the former by the latter, and multiplying by 100 for readability. The higher an index score, the better the letter is, after controlling for its scoring value and tile frequency. From the best to the worst:

Keep in mind that although the scores vary greatly, they are relative to one another (with a SD of 49 index points)--Butts did a tremendous job mirroring actual LFD in the game he created.
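The index construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: an ordinary least squares fit of LFD on tile count and point value, solved by hand through the normal equations, with the same assumed frequency table of commonly cited approximate percentages (not necessarily the data behind the scores in this post). The ratio is taken as predicted over actual, as the description reads; swapping numerator and denominator reverses the ranking.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
TILES = [9, 2, 2, 4, 12, 2, 3, 2, 9, 1, 1, 4, 2,
         6, 8, 2, 1, 6, 4, 6, 4, 2, 2, 1, 2, 1]
POINTS = [1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
          1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10]
# ASSUMPTION: approximate letter frequencies (percent) in written English.
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97,
        0.15, 0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99,
        6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix rows: intercept, tile count, point value.
rows = [[1.0, float(t), float(p)] for t, p in zip(TILES, POINTS)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, FREQ)) for i in range(3)]
beta = solve(xtx, xty)  # [intercept, tile coefficient, point coefficient]

predicted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
# Index = predicted LFD / actual LFD * 100. A linear fit can predict a
# negative frequency for very rare letters, which makes that index negative.
index = {ltr: 100.0 * p / a for ltr, p, a in zip(LETTERS, predicted, FREQ)}

for letter, score in sorted(index.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{letter}: {score:7.1f}")
```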

To further evidence his impressiveness, consider that the LFD says nothing about how various letters appear in the English language, only their respective frequencies. Some letters are more difficult to use than their absolute real world frequency suggests.

H, to Butts' credit, probably offers the best illustration. Using the above methodology, it is the most propitious letter to draw. But while nearly one-third of the English alphabet is used more often than H, the letter is included in the two most frequently occurring digraphs (TH, HE), as well as the first and third most frequently used trigraphs (THE, THA). H needs a little more help from his friends than, say, M does.

Butts astutely restricted the frequency of S. Due to plurality, it's the game's most reliably independently playable letter. Adding an S to the end of another word sets you up to perpendicularly drop a word beginning with S or use the S at the end of a newly-created word coming from the left or from above.

As mentioned previously, the index scores are based on letter frequencies in written English, not on various letters' utility in a game of Scrabble. X appears to be the most burdensome letter to draw, but an open A allows for the use of the X and an easy nine points. V appears to be the most unfavorable letter to draw in most cases.

With a combined index score of 87 (excluding Y), Butts overloaded the game with vowels. But especially late in the game, vowels often remedy the frustration entailed in trying to play even two- or three-letter words--words in which at least one-half and one-third of the letters required are vowels.

Keeping the points for each letter the same, a mathematically 'optimal' tile distribution (with a parenthetical variance from the original game) follows:

Dropping the rare letters will increase playability, but reduce the tileset's total point value. Whether or not the ease of constructing longer words of more frequent intersection will overcome the point deficiency is difficult to discern.
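One simple way to build a frequency-proportional tileset, and to see what it does to the total point value, is sketched below. The largest-remainder rounding is my stand-in here and not necessarily how the distribution above was derived, and the frequency table is again an assumed set of commonly cited approximate percentages for written English.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
TILES = [9, 2, 2, 4, 12, 2, 3, 2, 9, 1, 1, 4, 2,
         6, 8, 2, 1, 6, 4, 6, 4, 2, 2, 1, 2, 1]
POINTS = [1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
          1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10]
# ASSUMPTION: approximate letter frequencies (percent) in written English.
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97,
        0.15, 0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99,
        6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]


def apportion(freq, total):
    """Split `total` tiles in proportion to `freq` (largest-remainder rule)."""
    scale = total / sum(freq)
    quotas = [f * scale for f in freq]
    counts = [int(q) for q in quotas]  # round every quota down first
    leftover = total - sum(counts)
    # Hand the remaining tiles to the letters with the largest fractions.
    by_fraction = sorted(range(len(freq)),
                         key=lambda i: quotas[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def total_points(tiles, points):
    """Sum of tile count times point value over all letters."""
    return sum(t * p for t, p in zip(tiles, points))


new_tiles = apportion(FREQ, sum(TILES))
for letter, old, new in zip(LETTERS, TILES, new_tiles):
    print(f"{letter}: {new:2d} ({new - old:+d})")
print("original total points:", total_points(TILES, POINTS))
print("proposed total points:", total_points(new_tiles, POINTS))
```

Letters whose proportional share rounds down to zero drop out of the set entirely under this rule, which is where the loss in total point value comes from.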

The family has a couple of gamesets. I'll have to insist my mix be used every other time to see if an appreciable difference in total score per game develops!

Scrabblers are likely speculating that Super Scrabble, introduced three years ago, does even better. Increasing n, in this case by doubling the number of playable tiles, certainly allows for that. But rather than simply doubling the frequency of each letter and augmenting the board size, Mattel appears to have calibrated letter frequencies to more accurately reflect actual written usage (although the points for each letter remain unchanged). The game's tile distribution correlates with LFD at .96 (including points strengthens the correlation marginally at the third decimal place).

The index scores by letter for Super Scrabble, as well as the rest of the data used, are available here.

The free online dictionary does list 'barrack' as a noun, noting that it is often used in the plural. But I've not heard it used singularly. Rickets is a disease--is it necessarily plural, or, like Taurus, just ends in S?

Nice piece of work; no epigonic effort, this. Would be good to see a *score* frequency distribution (SFD) too, both for words played in a game, and for total game scores - would make for some nice histograms, and would allow scrabble dabblers like me to see just how far down the rank they stand.

Also, with the *word* SFD, you could then compare the distribution as played with that derived from a dictionary list. Of course a complicating factor there is the double/triple letters/words in Scrabble - all part of the fun.

Perhaps scrabulous on facebook could be persuaded to share their records for such a worthy undertaking....

The facebook connection is about the only way I'd be able to get a hold of that information, if I'm understanding you correctly. What exactly do you mean by SFD? The average score received for a word that uses X letter?