How about a “Verbose Homophonic cipher”?

I’ve had a bit of a hiatus from the VMs, but it’s always popping up in my mind and niggling me, even when I haven’t got time to spend on it. The latest niggle was the idea that the VMs scribe used a set of simple tables that showed how to convert plaintext letters into codes. So, in an example table, letter “A” is written “4oh”, letter “B” is written “8am” and so on. Also, spaces in the plaintext have their own code. Veteran VMs researcher Philip Neal informed me that this is called a “verbose homophonic cipher”.

Elaborating on the idea: the scribe uses one of the set of tables for each folio s/he is writing. To encipher the plaintext onto the folio, it’s simply a matter of writing down the VMs “word” for each letter in the plaintext word. If there is space left on the line for the next plaintext word, the scribe writes down the code for a space, followed by the codes for the letters in the next word. Long spaces are written by repeating the code for a space … Otherwise, the next word starts on the next line, and so on.

On the next folio, a different table may be used.
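As a toy illustration of this scheme, here is a minimal sketch in Python; the code table entries are invented for the example (only the letter/space correspondence is taken from the description above, not any real VMs code words):

```python
# Invented per-folio code table: each plaintext letter, and the space,
# gets its own VMs "word". These entries are illustrative only.
TABLE = {
    " ": "8am",
    "a": "4oh",
    "b": "oe",
}

def encipher(plaintext, table):
    """Write down the code word for each plaintext letter (and space)."""
    return " ".join(table[ch] for ch in plaintext)

print(encipher("ab ba", TABLE))  # every letter becomes one VMs "word"
```

On a different folio, a different `TABLE` would simply be swapped in.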

It’s hard to imagine the justification for such a scheme, but it does appear (at least initially) to fit some of the features of the VMs script (especially the repeating VMs words often seen).

I made a quick test that looks at VMs word frequencies on a single folio (in the Recipes section, which has the densest text). This showed a word frequency distribution similar to the letter frequency distribution of Latin, apart from the most frequently occurring word, which is much more frequent than the rest and which, I suggest, codes for a space in the cipher.
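The counting step of that quick test amounts to no more than ranking word counts on a folio, along these lines (the mini-folio here is invented, not real VMs text):

```python
from collections import Counter

def word_frequencies(folio_text):
    """Rank the VMs words on a folio by frequency, most common first."""
    return Counter(folio_text.split()).most_common()

# invented mini-folio, not a real transcription
print(word_frequencies("8am ay 8am okoe 8am ay"))
```

The resulting rank/frequency list can then be compared against a letter frequency table for the candidate plaintext language.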

However, on a typical folio, there are usually many more VMs words than there are plaintext letters. So the scheme has to be extended to allow the scribe a choice between several different VMs words to encode a single letter. Each table must have a set of words appearing in each plaintext letter column. Something like this:

Plaintext:   (space)        a                 b               …

VMs words:   8am ay okoe    4ohoe 2ay 1coe    faiis 4ay oka   …
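Using the example table above, the homophonic version of the toy encoder just picks one of the candidate words for each plaintext symbol; apart from the table contents (taken from the example), everything here is invented for illustration:

```python
import random

# Homophone table from the example above: each plaintext symbol
# (including the space) has several candidate VMs words.
HOMOPHONES = {
    " ": ["8am", "ay", "okoe"],
    "a": ["4ohoe", "2ay", "1coe"],
    "b": ["faiis", "4ay", "oka"],
}

def encipher(plaintext, table, rng=random):
    """Choose one of the candidate code words for each letter."""
    return " ".join(rng.choice(table[ch]) for ch in plaintext)

print(encipher("ab ba", HOMOPHONES))  # e.g. "2ay oka 8am faiis 1coe"
```

Because the choice is free, the same plaintext can produce many different VMs word sequences.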

If this is indeed the scheme, one would expect to see patterns in the VMs word sequences that match patterns seen in the letter sequences of e.g. Latin words. Also, as Philip Neal pointed out, patterns like “word1 word2 word2 word1” would indicate a plaintext letter sequence of either “vowel consonant consonant vowel” or vice versa.

Looking through the whole of the VMs for sequence patterns (on the same line of text), I found the following:

* There are no four-word sequences that repeat at all.

* There are only four three-word sequences that repeat, and each only twice.

* There are no sequences at all of the form “xyyx”.

(all of which I find rather surprising, and thought provoking).
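The search I ran can be sketched like this (the sample lines are made up, not real VMs transcription):

```python
from collections import Counter

def repeated_ngrams(lines, n):
    """Count n-word sequences occurring on the same line; keep only repeats."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c > 1}

def xyyx_patterns(lines):
    """Find four-word runs of the form x y y x (with x != y) on one line."""
    hits = []
    for line in lines:
        w = line.split()
        for i in range(len(w) - 3):
            a, b, c, d = w[i:i + 4]
            if a == d and b == c and a != b:
                hits.append((a, b))
    return hits

# made-up sample: one repeated 2-gram and one "xyyx" run
sample = ["8am ay ay 8am", "4oh oe 4oh oe"]
print(repeated_ngrams(sample, 2))
print(xyyx_patterns(sample))
```

Run over the whole VMs line by line with n = 3 and n = 4, this is the test that produced the (null) results above.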

So it looks like this hypothesis is dead in the water, and can be ticked off that long list of “things it might have been but in fact don’t fit”!

(It turns out that Elmar Vogt has been working on a related, but more sophisticated, idea which he describes on his blog and is called a “Stroke Theory”.)

* A “verbose cipher”: *at least some* of the single plaintext letters get mapped to (what appear to be) multiple letters in the target alphabet.

* A “pure verbose cipher”: *all* of the individual plaintext letters get mapped to (what appear to be) multiple letters in the target alphabet.

* A “homophonic cipher”: a cipher where the encipherer has a choice of possible ciphertext shapes for each plaintext letter.

* A “verbose homophonic cipher”: a cipher where the encipherer has a choice of possible ciphertext sequences (some of which appear to be multiple shapes) for each plaintext letter.

To my eyes, common Voynichese sequences (such as qo / ol / al / or / ar / ee / eee / am / an / ain / aiin / air / aiir, and even o + gallows and y + gallows) do give every indication of being in verbose cipher: and I think the stats bear this out. But the plaintext can’t then be ‘simple’ language, because the average enciphered word length would be substantially longer than we see.

In this way, positing Voynichese as “verbose enciphered shorthand” kind of balances the overall equation. So on the one hand, its shorthand aspect is shortening the text (but introducing a few extra tokens); while on the other hand, its verbose aspect is bulking it back out again.

However, even though this helps explain a great deal of how Voynichese letters form the patterns they do, there is – as you point out – very probably a yet further layer of obfuscation going on that functions to prevent long sequences being repeated. Now, I really don’t think that this extra layer will turn out to be anything as complex as a full-on polyalpha (because we seem to have many universal features of the cipher that do remain constant). However, there seems to be something added in to the mix to some of the characters (probably gallows-related) that is ~just enough~ to disrupt our stats gathering.

And that’s pretty much where the Voynichese verbose cipher reasoning chain currently halts. Just so you know there’s nothing new under the sun! 🙂

Thanks, Nick – useful comments, as usual. I am aware of your theories about nulls and the other letter features, as you have explained them before. However, using them with a GA (genetic algorithm) I was not able to find a good match to the languages I tried, so there is probably an extra level of obfuscation in play, if indeed you are correct.

A certain text generated by substituting synthetic words for end-to-end character 2-grams has more than 2000 types and more than 20000 tokens. The ten most common 2-grams have more than one substitute. A link to the stats is in the “Website” box for comments to the blog.
This is a simplified form of generated text in which:
1) more than ten 2-grams have multiple substitutes
2) there is transposition of fractionated strings
Word series stats for a text generated in this manner will vary with different source languages.
The length and number of repeated word series and/or the average number of times word series repeat can be reduced by controlling “1” and/or “2”, above.
A simpler method would be to forget about “1” and “2” but use a simple transposition of 2-grams before testing with GA against the same source language from which the character 2-grams were obtained. I suggest beginning with no transposition.
If this explanation is not clear and if you are interested, send me an E-mail.
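As I understand the description above, the generation step works roughly like this; the synthetic word shapes (w0, w0_b, …) and the odd-length padding rule are my own placeholders, not the ones actually used:

```python
import random

def to_2grams(text):
    """Split text into end-to-end (non-overlapping) character 2-grams."""
    text = text.replace(" ", "")      # fractionation ignores word breaks here
    if len(text) % 2:
        text += "x"                   # pad an odd-length text (placeholder rule)
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def build_table(grams, homophones_for=()):
    """Give each 2-gram a synthetic word; listed grams get extra substitutes."""
    table = {}
    for i, g in enumerate(sorted(set(grams))):
        table[g] = ["w%d" % i]
        if g in homophones_for:       # e.g. the ten most common 2-grams
            table[g].append("w%d_b" % i)
    return table

def generate(text, table, rng=random):
    """Substitute a synthetic word for each 2-gram of the source text."""
    return " ".join(rng.choice(table[g]) for g in to_2grams(text))
```

Transposition of the fractionated strings (point “2”) would then be a further shuffle applied to the 2-gram list before substitution.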

The code was an attempt to mimic language. It has more repeated word n-grams than the VMs. It does not accumulate as many unique words as the texts I have studied, including the VMs. The VMs is well within the ballpark of known writing. The unaltered code is not. Fractionation and/or a larger CT (ciphertext) vocabulary can correct that. We can’t be sure the accumulation of uniques in the VMs isn’t distorted by alternate word forms, errors of writing, errors in reading, and illegible glyphs. Missing and out-of-order pages do not matter for this. If all that could be overcome, I believe the VMs would still be within the range of ordinary text.

In one sense, code eliminates the problem of the peculiar word structure. However, the question of vocabulary construction remains. Whether we have an enciphered code or a cipher-only, there are not enough single letters to map to discrete (non-overlapping) n-grams in the VMs. Worse than that, we have not found a set of discrete n-grams. It’s obvious to me that the problem can only be overcome with the concept of unwritten glyphs. That strays from the topic here. Other problems, which have been discussed, in matching the VMs are more difficult. The best we can hope for with GA at this point is a significantly better than random match to a language. If that happens, we will be partially right about some of the VMs characteristics, if not about how they happened. This we can try without assuming a post-Fifteenth-Century mindset in the development of the VMs script.

This is good to hear – I think we are on the same page (folio)! Having spent many pleasurable hours checking various exotic cipher and code ideas, none of them remotely fits when using a GA (except, notably, an nGram mapping with the language of Dante as the plaintext, a form of early Italian, which produces results significantly better than all other languages tried, including Latin, German, English, Spanish, Dutch, Chinese etc. – see below).

My faith in the GA technique stems from the fact that it very quickly gives an idea of how well a code/cipher theory fits the VMs text.

A significant problem is the machine transcriptions we have of the VMs. Basically (as you and I have found out before) they differ substantially, to the extent that statistics obtained with, say, EVA do not match well with statistics obtained with, say, Voyn_101. A particular problem is glyph bloat … my opinion is that GC’s Voyn_101 transcription contains many more glyphs than the scribes were actually using. Little differences between the ways of writing “9” for example, are classified as different glyphs. This plays havoc with statistical analysis.

So I have a procedure that filters the Voyn_101 and remaps e.g. those multiple “9” glyphs to the same glyph.
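The filter is essentially a character remap; a minimal sketch, in which the variant glyph codes are placeholders rather than actual Voyn_101 code points:

```python
# Placeholder remap table: several written variants of "9" collapse
# to one canonical glyph. The variant characters here are invented.
REMAP = {
    "(": "9",
    ")": "9",
}

def filter_transcription(text, remap=REMAP):
    """Replace each variant glyph with its canonical form."""
    return "".join(remap.get(ch, ch) for ch in text)

print(filter_transcription("8am( 8am) 8am9"))  # all three become "8am9"
```

Statistics are then gathered on the filtered transcription rather than the raw one.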

Anyway, your idea of plaintext letter doublets mapping to VMs glyphs is excellent. We need something like this to account for multiple recurring VMs “words” like “8am 8am 8am” and to allow a sufficiently large vocabulary for the cipher/code system. It perhaps couples with a set of code pages, one of which is selected for use at the start of each folio.

But how would this fit with the labels? Most of the labels are single Voynich “words”. These would decipher as plaintext letters or letter pairs, which is an odd way of labeling things if those labels don’t appear in the surrounding text. (E.g. one can imagine writing “The herb marked as ‘A’ is deadly nightshade”, and placing an “A” next to the drawing … but we don’t see labels used like this in the VMs text.)

Here is an extract of the Dante Alighieri text that matches decently using nGrams to the VMs:

I should have paid attention to labels all along. Verbosity definitely is a problem. Another set of code pages might contain whole word substitutions for nouns and labels are, perhaps, all nouns. That’s the best I can do in trying to plug holes. In the running text, a word re-location system could cause repeated words of high frequency.

A Caution

"Students who have approached the Voynich text from the point of view of the professional cryptanalyst have been led on at first by a deceptive surface appearance of simplicity, only to bog down sooner or later in an exasperating quagmire of paradoxes and enigmas that reveal themselves one by one as the analysis proceeds."
- Mary d'Imperio