Archive

Several people, notably Robert Teague and Philip Neal, have theorized that the cipher is based on anagrams. The idea is that to decipher the VMs text, you first convert the VMs symbols to alphabet letters, and then you find an anagram of those letters that makes a valid plaintext word.

Variations on this theme are that e.g. some of the plaintext letters in the target word are allowed to be missing and can be inferred from “context”, and/or that the VMs words are anagrams of the plaintext words that have letters arranged alphabetically.

For this analysis, we look at Folio68r3 in particular:

The labels on the stars are, from the ten o’clock pie slice going anti-clockwise, in the Voyn 101 encoding:

<68r3.pieslice_1>61oe7a9.8oay9.oae1coe=

<68r3.pieslice_2>oh1o89.8ayaee.ok98*.oK98=

<68r3.pieslice_3>ohoe19.1G9.179,9h9=

<68r3.pieslice_4>ohos.okoy9=

Anagrams: Combinations/Permutations

If you make the reasonable assumption that all the labels on the star shapes in f68r3 are star names, and you assume a target language, then this puts severe constraints on the cipher, since there are only so many star names and only so many cipher schemes that can be consistent amongst all the stars.

For each of the 12 star labels on f68r3 (61oe7a9 8oay9 oae1coe oh1o89 8ayaee ok98* oK98 ohoe19 1G9 179,9h9 ohos okoy9 ) there are a number of possible anagrams of the symbols (anagrams):

It appears like there is a lot of freedom: the first label “61oe7a9” can be rearranged 5040 different ways – and you can pick any letter for any of the 7 symbols! However, when you start actually choosing the mapping between symbols and alphabet characters, and require that at least one of the anagrams for each deciphered label appears in a dictionary of star names, you rapidly eliminate many of the possibilities.

This is still true even if you allow that a single symbol in the label can map to two alphabet characters – the problem appears to be still over-constrained

For example, say you pick a mapping for the first label:

S: 6 1 o e 7 a 9
R: d r e ba an a l
61oe7a9 -> drebaanal -> aldebaran

then you want to apply this to the next label “8oay9”. You already have the cipher for symbols o, a and 9 … they translate to e, a and l respectively – you know that the star name must contain e, a and l. You know that the star name must be between 5 and 7 characters long, and you just need to pick suitable letters for 8 and y.

So far so good. Now we come to the third label: “oae1coe” which we can decipher all but one symbol of, to: “eabar?eba”.

Unhappily there are no 9 or 10 letter stars in the dictionary that can match this letter assortment. and so this eliminates our initial choice of mapping that gave us “aldebaran” for the first word, and “alcyone” for the second, because it cannot produce a dictionary word for the third label. Back to square one.

You can keep on doing this: selecting mappings, trying them out on the first two or three labels, and finding that there is no fit to the dictionary.

Although at first sight there seems to be a lot of flexibility from the use of anagrams in the cipher, in practice if an anagram of each and all the deciphered words is required to appear
in a dictionary, then that severely constrains the solution space.

Allowing Missing Characters

Let’s assume that each of the VMs symbols maps to a single plaintext alphabet letter. Which 16 of the 26 letters in the alphabet will we choose to map? Let’s look at our list of star names and find the 16 most used alphabet letters:

a i r e l s h n u t b k m d o c

(shown in order of frequency). How many different ways can we map the 16 VMs symbols to these 16 letters? That is:

Factorial 16 = 16! = 20,922,789,888,000

which is about 21 trillion (give or take), and is quite a lot. To explore this huge space of possibilities, we can use a Monte Carlo method. Basically we write a program to shuffle the 16 alphabet letters, look to see how that mapping works, then shuffle again, and so on. As we go, we keep track of the best mapping found so far.

Suppose we have the following mapping (cipher):

S: o 9 1 e a 8 h y 7 k 6 c * K G s
R: a e n c i l h r o s u d k t b m

Let’s go through the f68r3 star labels and convert them to plaintext using the cipher. After we convert them, let’s look the plaintext characters up in our list of star names using a “fuzzy match”. Here are the results:

Just how interesting/plausible/believable is this? We can make a control experiment by using a dictionary of dog breeds of about the same size: does mapping the VMs labels on f68r3 to star names produce a better fit than mapping the labels to dog breeds?

We run the program for a few million mappings, first with the star names, then with the dog breed names. For about 4 million mappings, there is a slightly better (9 out of 12) mapping to dog breeds compared with star names (8/12):

Of course, this doesn’t disprove that the labels on f68r3 are in fact star names: it may well be that one of the 16! combinations produces a perfect fit, with “61oe7a9” deciphered to “aldebaran” and “8oay9” deciphered to “alcyone” etc..

Character Position Based Cipher

Looking just at the star labels in Folio 68r, we can extract the letter/symbol frequencies as a function of position in the label.

In other words, the most likely symbol to find in the first position of any VMs label is “o” (this is the Voyn_101 encoding). Second most likely is “1” and so on. For the second symbol in the label, the most likely is again “o” in first position, then “k”, and so on.

This is another take on the Crust/Mantle business.

Clearly, the symbols at the ends of the words have a completely different distribution to those at the front.

So the plaintext has a quite different letter distribution: basically “a” is always the most likely, regardless of the position in the word.

A hypothesis is that there is a different mapping being used depending on the character position in the word. E.g. if I am enciphering “aldebaran” I would use one mapping for the first “a”, a different mapping for the “l” and so on.

If we also allow a VMs symbol to map to two plaintext characters, we can generate a “likely” mapping as follows:

Using this set of mappings, we can translate all the labels to plaintext. The results are disappointing: there are no matches in the list of starnames to the deciphered labels, even allowing anagrams.

But we can then start shuffling the mapping order around for each character position, and use a Monte Carlo approach to see what we come up with. This is what I am trying now. It’s not clear to me that I should allow anagrams in the solutions, since shuffling the letters destroys the ordering in the tables above, that I have assumed.

Hypothesis

The following hypothesis occurred to me while I was investigating a cipher theory proposed by Rich Santa Coloma. (This is not a new idea amongst Voynich researchers, but it was new to me!)

The VMs “words” are codes for plaintext character groups, probably trigraphs, digraphs and single characters.

How does one use this system?

1) Take each word in the plaintext
2) Break it up into a sequence of one or more trigraphs, digraphs and single characters by referring to a code table
3) Write the code for each, separated by a space, and terminate the last tri/di-graph/character code by a VMs “9”.

The labels are probably treated differently: there may well be a separate set of codes just for the labels.

As an example, take the following “sentence” of 33 “words” from the Herbal folios:

Each of these words is built of one or more codes. E.g. the first word in the list above is “h1cok 2oe 1c” and may be deciphered as
h1cok = “qui”,
2oe = “de”
1c = “m”

to make the Latin word “quidem”.

An interesting feature of this cipher/code is that you may have several choices of how to split each plaintext word into tri/di/mono-graphs, but without ambiguity for the decipherer. This may be an explanation for the different frequency distributions between the VMs folios and Currier hands: they were written by different scribes who tended to split the plaintext words differently.

Does the Theory fit the Data, for Latin?

We first take a substantial body of text from the VMs, e.g. the Recipes folios, and feed it through an application code that extracts all the VMs words, and groups them according to the procedure described above, using one or more arbitrary characters as word ending marks. Typically we use VMs “9”. Each sentence so derived is analysed: each of the tokens is analysed for n-gram content and frequencies are tallied.

At the end of the processing, the n-grams are sorted into frequency order: the most frequent n-grams appear first in the list.

At this point the application moves to its second stage. It ingests a large list of Latin phrases, generated by Knox (thanks, Knox!) and processes each word in each unique phrase for n-gram content, so extracting the n-gram frequencies for Latin. The phrases are placed in a sorted list: shortest first. The n-grams are sorted by frequency, most frequent first.

The third stage of the application is to generate a set of Genetic Algorithm chromosomes. Each chromosome takes the Top N n-grams from the Voynich n-gram list and pairs them with a random selection of the n-grams from the Latin list.

For example, for a Chromosome of length 15 (in fact the GA uses much longer lengths, typically 200) the following table might be used:

The chromosomes are “scored” by having them translate/decipher a training set of sentences from the input VMs folios. To calculate the score of each chromosome for each sentence, the sentence word tokens are converted to Latin n-grams using the chromosome’s table. Then the tokens are joined together to form the plaintext words. The plaintext words are looked up in the Latin dictionary: the chromosome’s score is increased for valid words, and decreased for invalid words. Once all the words in the sentence have been deciphered in this way, it is compared with each of the Latin phrases: if a Latin phrase appears in the sentence, the score of the chromosome is increased substantially.

The best chromosome found by a Monte Carlo method (basically generating random chromosomes, and retaining the best scoring chromosome) is placed at the top of a list, and then the remaining chromosomes needed for the Genetic Algorithm are generated.

The GA phase now begins: the chromosomes are genetically altered, mated and selected to optimise the best chromosome’s score on the training sentences. This phase is compute intensive.

In this report, the GA has been running for 311 “epochs” (each epoch is a new generation of chromosomes). The cost (score) of the best chromosome is 62.8, whereas the average score of all the chromosomes in the population is 61.2. In this Epoch, there has been no change to the best chromosome since the last Epoch (“same 1”), 21% of the chromosomes have been mutated, a fresh chromosome (“New 1”) was inserted at this Epoch (to ensure diversity – this is not usually done in GA, but I find it produces more reliable training). “MS 15” means that the maximum number of no-change Epochs seen so far has been 15 … the larger this number is, the more stagnant the chromosome pool is, and the nearer to a solution we are.

The following line shows in detail how the best chromosome has scored: its table produces 128 valid Latin words, from a total of 408 translations i.e. about 31%. In the 25 sentences being used in training, 40 common Latin phrases have been found.

The next two lines show the first 15 n-grams in the mapping that the chromosome is using.

Then the status report shows how the chromosome fared on translating a sentence picked at random from the VMs folios. Since the GA is being trained only on the first few sentences, the remainder are essentially “unseen”, and so a valid, sensible translation in a non-trained sentence is significant.

The sentence picked is number 129 (the training set is the first 25 sentences in this run, so number 129 is well outside that). The VMs source sentence is shown with hyphens “-” separating the tokens that make up words. E.g. “2o ok1c” is the first word. Beneath is the Latin translation. A Latin word followed by a single quote means that that word appears in the Latin dictionary, and is thus valid. A star appearing after a set of valid Latin words indicates that the Latin phrase made up by the words is common, or at least appears in Knox’s list.

A Caution

"Students who have approached the Voynich text from the point of view of the professional cryptanalyst have been led on at first by a deceptive surface appearance of simplicity, only to bog down sooner or later in an exasperating quagmire of paradoxes and enigmas that reveal themselves one by one as the analysis proceeds."
- Mary d'Imperio