I'm starting to learn about cryptanalysis and I am having a bit of difficulty understanding the Kasiski test's index of coincidence. I have a book (Cryptography Theory And Practice by Douglas Stinson) about it that I'm going through, but it seems to skip something, which is causing my confusion. More likely it's me who is constantly missing a line or something.

and uses CHR to find the possible keyword length. It occurs 5 times in the total ciphertext, which I did not type out completely. To show that the key is most likely 5 characters, the book also shows the index of coincidence for words of length m, from 1 to 5.

What I don't get is how the book obtained the indices of coincidence for each value of m.

With m = 1, the index of coincidence is 0.045. With m = 2, the two
indices are 0.046 and 0.041. With m = 3, we get 0.043, 0.050, 0.047.
With m = 4, we have indices 0.042, 0.039, 0.046, 0.040. Then trying
m = 5, we obtain the values 0.063, 0.068, 0.069, 0.061 and 0.072.

Even with the formula for the index of coincidence, I don't see how these values were obtained. Can someone please explain?

1 Answer

Kasiski's test and the index of coincidence are used to attack a Vigenère cipher (or other polyalphabetic ciphers with a small alphabet and a short key) - they both try to find the length of the keyword.

Kasiski's test gives probable prime factors of the keyword length, while the coincidence index test gives an estimate of the keyword length itself.

The original Kasiski test

This is what was done with CHR in your example: We look at repeated sequences of three or more characters, and at which distances they occur. We collect all these distances, and look at the prime factors of these.

The idea is that probably such a repeated sequence comes from the same plain text sequence, which then randomly hit the same keyword position. They will only hit the same position if their distance is a multiple of the keyword length.

Of course, we can also get random hits here (where different plaintext sequences were encrypted with different key sequences and happened to give the same ciphertext sequence) - so be careful which prime factors you select. Either combine this method with Friedman's test mentioned below, or try several candidate keyword lengths in the further analysis.
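As a rough illustration of the distance-collecting step, here is a minimal Python sketch. The function name `repeated_ngram_distances` and the toy ciphertext are made up for this example (the toy string is built so that CHR repeats at known positions); it is not the ciphertext from the book.

```python
from collections import defaultdict
from functools import reduce
from math import gcd


def repeated_ngram_distances(ciphertext, n=3):
    """Map each n-gram that occurs more than once to the distances
    between its consecutive occurrences."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    return {
        gram: [b - a for a, b in zip(pos, pos[1:])]
        for gram, pos in positions.items()
        if len(pos) > 1
    }


# Toy ciphertext (hypothetical): "CHR" starts at positions 0, 10 and 25.
toy = "CHR" + "XYZABQW" + "CHR" + "LMNOPQRSTUVW" + "CHR" + "ZZ"
distances = repeated_ngram_distances(toy)

# The keyword length should divide every "genuine" distance,
# so the gcd of the distances is a natural candidate.
chr_gcd = reduce(gcd, distances["CHR"])
print(distances["CHR"], chr_gcd)  # [10, 15] 5
```

On a real ciphertext some of the distances will be accidental, so taking the gcd over all of them blindly can collapse to 1; looking at which factors show up most often is usually more robust.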

The coincidence index

The coincidence index of a text is defined as

$$I = \frac{\sum_{i \neq j} \delta(x_i, x_j)}{k(k-1)} = \frac{\sum_{l=1}^{n} m_l (m_l - 1)}{k(k-1)}$$

Here $n$ is the size of the alphabet, $m_l$ the number of occurrences of the character $l$, and $k$ the total size of the text. ($\delta(x_i, x_j) = 1$ if the characters at positions $i$ and $j$ are equal, else 0. For calculation the second sum is more convenient.)

The coincidence index of a totally random text would be about $1/n$ (roughly 0.0385 for a 26-letter alphabet; this is also essentially the minimum), while for natural language texts it is higher (0.067 for English, a bit higher for German). For a ciphertext encrypted with a monoalphabetic cipher it is the same as for the original plaintext; for polyalphabetic ciphers (like Vigenère) it lies between those two values.

(Actually, there are different definitions of coincidence index, but the values you have seem to use this one.)
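To make the definition concrete, here is a small Python sketch of the letter-count form (sum of $m_l(m_l - 1)$ divided by $k(k-1)$). The function name `coincidence_index` is my own choice, and the sketch assumes the text has already been reduced to the letters of the alphabet (no spaces or punctuation).

```python
from collections import Counter


def coincidence_index(text):
    """Coincidence index: sum of m_l * (m_l - 1) over all letters l,
    divided by k * (k - 1), where k is the length of the text."""
    k = len(text)
    if k < 2:
        raise ValueError("need at least two characters")
    counts = Counter(text)  # m_l for every letter that occurs
    return sum(m * (m - 1) for m in counts.values()) / (k * (k - 1))


print(coincidence_index("AAAA"))  # 1.0 (every pair of positions matches)
print(coincidence_index("ABCD"))  # 0.0 (no pair of positions matches)
```

Letters that never occur contribute nothing to the sum, so it is enough to iterate over the letters actually present.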

Column-wise coincidence index to get the keyword length

You now guess a key length h and write the ciphertext in h columns. Then (if the guess is right) all letters in a column are encrypted with the same key letter, i.e. monoalphabetically. We now calculate the coincidence index of each column individually. If the guess is good, each column's coincidence index is close to that of the corresponding plaintext column (which, if the column is long enough, is a good approximation of the coincidence index of the language used).

In your example, for the lengths 1, 2, 3 and 4 the coincidence indices are only around 0.045, which is not much higher than 0.0385, the expected value for a uniformly random distribution over 26 letters. For length 5, you see a sudden jump: the indices are around the value for English (0.067). For 6, it will likely get lower again.

This will give you results quite similar to the Kasiski test, but it is easier to automate.
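The list of values for each m in the question should come out of exactly this procedure: one coincidence index per column. A minimal Python sketch, assuming the ciphertext is a string of letters only (`column_indices` and the toy string are hypothetical names and data, not taken from the book):

```python
from collections import Counter


def coincidence_index(text):
    """Sum of m_l * (m_l - 1) over all letters, divided by k * (k - 1)."""
    k = len(text)
    if k < 2:
        raise ValueError("need at least two characters")
    counts = Counter(text)
    return sum(m * (m - 1) for m in counts.values()) / (k * (k - 1))


def column_indices(ciphertext, h):
    """Write the ciphertext in h columns (column i holds the letters at
    positions i, i + h, i + 2h, ...) and return each column's index."""
    columns = [ciphertext[i::h] for i in range(h)]
    return [coincidence_index(column) for column in columns]


# Toy input: with h = 2 each column contains a single repeated letter.
toy = "ABABABAB"
for h in range(1, 4):
    print(h, [round(ic, 3) for ic in column_indices(toy, h)])
```

Running the same loop over the full ciphertext from the book, with h from 1 to 5, should give one, two, three, four and five values respectively, which is the shape of the list quoted in the question. Note that the sketch raises an error if a guess h is so large that some column has fewer than two letters.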

Friedman's method

Actually, Friedman derived from this a formula which spares us the calculation of each column's coincidence index - we just need the global coincidence index of the ciphertext (and the one of the natural language) to get an estimation of the keyword length.
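For orientation, here is a Python sketch of the commonly cited form of this estimate. The default constants (0.067 for English plaintext and 1/26 for uniformly random letters) and the name `friedman_key_length` are assumptions on my part; other sources use slightly different constants, and the exact presentation may differ from the one referred to in this answer.

```python
def friedman_key_length(ic, k, kappa_plain=0.067, kappa_random=1 / 26):
    """Estimate the keyword length from the global coincidence index `ic`
    of a ciphertext of length k.

    kappa_plain:  coincidence index of the plaintext language (assumed English)
    kappa_random: coincidence index of uniformly random letters (1/n, n = 26)
    """
    numerator = (kappa_plain - kappa_random) * k
    denominator = (k - 1) * ic - kappa_random * k + kappa_plain
    return numerator / denominator


# Consistency check: build the index we would expect for key length 5
# (pairs in the same column match with kappa_plain, all others with
# kappa_random), then recover the length from it.
kp, kr, k, h = 0.067, 1 / 26, 1000, 5
expected_ic = (kp * (k - h) + kr * k * (h - 1)) / (h * (k - 1))
print(round(friedman_key_length(expected_ic, k), 6))  # 5.0
```

In practice the estimate is quite sensitive to small changes in the measured index, so it is better treated as a hint to combine with the column-wise test than as an exact answer.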

While the English Wikipedia page does not mention this formula, the German-language Wikipedia shows this: