2024-09-03

n-gram frequencies as multi-dimensional Markov chains?

Why measure n-gram frequencies when you could just guess them instead?


When carrying out elementary cryptanalysis on some text, you might look at the individual letter frequencies to see what type of cipher it might be - if you've got a bit of a spiky distribution, it's probably a monoalphabetic substitution cipher, unless "e" is the most common followed by "t" and so on, in which case it's probably a transposition cipher. If it's flat, you've probably got yourself a polyalphabetic substitution cipher.

You might also look at bigram frequencies, since they provide more statistically rich information than letter frequencies, which can be useful when carrying out some sort of automated attack that aims to maximise the textual fitness of some decryption. You might also look at the trigrams and so on. But to do this, you need to measure these frequencies, which generally means writing code to do it for you. It's not difficult, but it is a bit tedious.
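For concreteness, here's roughly what that tedious counting step looks like in Python. This is a quick sketch rather than anything polished; the function name and the letters-only, lowercase preprocessing are just my choices.

```python
from collections import Counter

def ngram_frequencies(text: str, n: int) -> dict[str, float]:
    """Relative frequencies of the n-grams of `text`, ignoring non-letter characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# ngram_frequencies(ciphertext, 1) gives letter frequencies,
# ngram_frequencies(ciphertext, 2) gives bigram frequencies, and so on.
```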

So, what happens if you're a mathematician looking for some fun in these preliminary stages of cryptanalysis? I think I may have a solution.

Let's say we have an alphabet $a$ of length $k$ (for English, $k=26$). Then we might represent the letter frequencies of a text as a column vector $V$ of height $k$, where $V_i$ is the relative frequency of the $i$th letter of the alphabet (sadly not zero-indexed). Note that $\sum_{i=1}^k V_i=1$. Then we might represent bigram frequencies as a $k\times k$ matrix $M$, where $M_{i,j}$ is the relative frequency of the bigram $a_j a_i$, i.e. $a_j$ followed by $a_i$. Then we can divide each element of the matrix by the sum of the elements in its column, giving a new $k\times k$ matrix $T_2$, $\;(T_2)_{i,j}=\frac{M_{i,j}}{\sum_{p=1}^k M_{p,j}}$. Note that the sum of the elements in each column of $T_2$ is 1, and thus this is a left-stochastic matrix.
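If you want to see the same construction in code, a sketch might look something like this (assuming the usual 26-letter alphabet and numpy; the guard against empty columns is my own addition, so that letters which never appear don't cause a division by zero):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def bigram_stochastic_matrix(text: str) -> np.ndarray:
    """Build M from bigram counts, then column-normalise it into T2."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    letters = [c for c in text.lower() if c in index]
    M = np.zeros((k, k))
    for first, second in zip(letters, letters[1:]):
        # M[i, j] counts occurrences of letter j followed by letter i
        M[index[second], index[first]] += 1
    column_sums = M.sum(axis=0, keepdims=True)
    column_sums[column_sums == 0] = 1  # avoid dividing by zero for unseen letters
    return M / column_sums             # each column of T2 now sums to 1
```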

We can therefore think of these bigram frequencies as a Markov chain, where each letter has a particular probability of moving to some other letter. If we begin with some arbitrary letter distribution $V$, then we can plug it into this Markov chain to predict what the distribution of the letters following these letters might be. If we keep applying this, then we'll eventually come to a stationary state which we can plug into the network and get exactly the same distribution back out. The stationary state of a Markov chain is typically denoted as $\pi$, so we will adopt that notation too. Note that $T_2\pi=\pi$. This stationary state must be the letter distribution of our alphabet. $\pi$ is also any one column of $\lim_{s\to\infty}T_2^s$ (for $s\in\mathbb{Z}^+$).
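In code, the "keep applying it" idea is just power iteration, something like the sketch below (reusing the $T_2$ from the previous sketch, and assuming the chain is well-behaved enough, i.e. irreducible and aperiodic, for the iteration to settle down):

```python
import numpy as np

def stationary_distribution(T2: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Approximate pi such that T2 @ pi == pi by repeatedly applying the chain."""
    k = T2.shape[0]
    pi = np.full(k, 1.0 / k)   # start from a uniform guess; any distribution works
    for _ in range(iterations):
        pi = T2 @ pi           # T2 is left-stochastic, so pi stays a probability vector
    return pi
```

Equivalently, you could ask numpy for the eigenvector of $T_2$ with eigenvalue 1 and normalise it, but the iterative version matches the "keep plugging it back in" picture more directly.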

Now let's reconsider $T_2$ as a function (multiplying by a matrix is, after all, just applying a function, and a linear one at that, which is why it distributes over addition):

$$ T_2\!:\ \mathbb{R}^k\mapsto\mathbb{R}^k,\ T_2(v)_i=\left\langle v,\left(\begin{matrix} (T_2)_{i,1} \\ \vdots \\ (T_2)_{i,k} \end{matrix}\right)\right\rangle, $$

where $\langle a,\ b\rangle$ denotes the inner product of two vectors. We can extend this to some $n$-dimensional stochastic tensor $T_n\in\mathbb{R}^{k^n}$,

$$ T_n\!:\ \mathbb{R}^{k^{(n-1)}}\mapsto\mathbb{R}^k,\ T_n(\gamma)_i=\frac{1}{k^{n-2}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_n)_{i,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_n)_{i,\ \ldots,\ k} \end{matrix}\right)\right\rangle, $$

whereby some input $(n-1)$-dimensional tensor $\gamma$ is mapped over the last $(n-1)$ dimensions of $T_n$ (everything except the output index $i$), the inner product calculated and placed into the output vector, and the output vector divided by $k^{n-2}$ in order to maintain the property that the sum of all elements in the vector is equal to 1. If we define a 'stretching' function $\Gamma_q:\ \mathbb{R}^k\mapsto\mathbb{R}^{k^q}$ which repeats the input vector over the 'columns' of a tensor, then we define the stationary state $\pi$ of the $n$-tensor as being the vector for which $T_n\circ\Gamma_{n-1}(\pi)=\pi$ (the stretched copy of $\pi$ needs $(n-1)$ dimensions to match the domain of $T_n$).

A proof that such a solution exists, and is unique, will be left as an exercise for the reader (mainly since I don't have such a proof, but I'm somewhat confident that it works). Such a tensor may be constructed in a similar fashion to how the stochastic matrix for bigrams was constructed. We could think of this tensor as a multi-dimensional Markov chain, where we have a network that represents the probability of $a_\alpha a_\beta$ being followed by $a_\gamma$. I'm not sure how such a network could be represented graphically.
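To make the trigram case ($n=3$) concrete, here's a sketch of how I'd build $T_3$ and hunt for its stationary state numerically. The interpretation of $\Gamma_2$ as "tile $\pi$ over the spare axis", and the renormalisation at each step, are my assumptions rather than anything proved above:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def trigram_tensor(text: str) -> np.ndarray:
    """T3[c, a, b]: estimated probability that the pair (a, b) is followed by c."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    letters = [ch for ch in text.lower() if ch in index]
    counts = np.zeros((k, k, k))
    for a, b, c in zip(letters, letters[1:], letters[2:]):
        counts[index[c], index[a], index[b]] += 1
    sums = counts.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1                        # pairs never seen: their column stays zero
    return counts / sums                       # each observed (a, b) 'column' now sums to 1

def stationary_from_trigrams(T3: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Iterate pi -> T3(Gamma_2(pi)) and renormalise, hoping for a fixed point."""
    pi = np.full(k, 1.0 / k)
    for _ in range(iterations):
        gamma = np.tile(pi, (k, 1))                   # Gamma_2: gamma[a, b] = pi[b] for every a
        pi = np.einsum("cab,ab->c", T3, gamma) / k    # inner product, plus the 1/k^(n-2) factor
        pi = pi / pi.sum()                            # keep it a probability distribution
    return pi

# Usage: pi = stationary_from_trigrams(trigram_tensor(some_long_text))
```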

But what if we're not interested in letter frequencies? What if we have a burning desire to derive bigram frequencies from a 5-gram frequency distribution? Or, indeed, any $p$-gram distribution from some $n$-gram distribution? Don't worry, we can do that. Just define $T_{n,p}\in\mathbb{R}^{k^n}$ as

$$ T_{n,p}\!:\ \mathbb{R}^{k^{(n-p)}}\mapsto\mathbb{R}^{k^p},\ T_{n,p}(\gamma)_{i_1,\ \ldots,\ i_p}=\frac{1}{k^{n-p-1}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_{n,p})_{i,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_{n,p})_{i,\ \ldots,\ k} \end{matrix}\right)\right\rangle,\quad p\leq n-p $$

and then $T_n=T_{n,1}$. Then adapt our stretching function to $\Gamma_{r,q}:\ \mathbb{R}^{k^r}\mapsto\mathbb{R}^{k^q}$ (so that $\Gamma_q=\Gamma_{1,q}$), and we find that the $p$-gram distribution $\pi_p$ derived from some $n$-gram distribution is such that $T_{n,p}\circ\Gamma_{p,\,n-p}(\pi_p)=\pi_p$ (again, stretching $\pi_p$ up to the $(n-p)$-dimensional domain of $T_{n,p}$).
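Since I haven't tested this, here's the sort of numerical experiment that could confirm or refute it, using $n=4$, $p=2$ (so that $p\leq n-p$ and the stretching map is just the identity). Everything here, from the tensor layout to the renormalisation step, is my own guess at what the definitions above pin down:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def four_gram_pair_tensor(text: str) -> np.ndarray:
    """T[i1, i2, j1, j2]: estimated probability that the pair (j1, j2) is followed by (i1, i2)."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    letters = [ch for ch in text.lower() if ch in index]
    counts = np.zeros((k, k, k, k))
    for a, b, c, d in zip(letters, letters[1:], letters[2:], letters[3:]):
        counts[index[c], index[d], index[a], index[b]] += 1
    sums = counts.sum(axis=(0, 1), keepdims=True)
    sums[sums == 0] = 1
    return counts / sums                      # normalise over the 'output' pair (i1, i2)

def bigram_distribution_from_4grams(T42: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Iterate pi2 -> T_{4,2}(pi2), with Gamma_{2,2} taken as the identity, and renormalise."""
    pi2 = np.full((k, k), 1.0 / k**2)
    for _ in range(iterations):
        pi2 = np.einsum("cdab,ab->cd", T42, pi2) / k  # the 1/k^(n-p-1) factor with n=4, p=2
        pi2 = pi2 / pi2.sum()
    return pi2

# Compare the resulting fixed point against bigram frequencies measured directly from
# the same text to see whether the p-gram-from-n-gram idea actually holds up.
```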

Or rather, I think it works. I haven't actually tried that last bit out properly, but I'm reasonably certain it works. Certainly more than 50% certain, but I won't commit to anything stronger than that. I think it looks pretty reasonable, although I will admit that the dimensions of each bit aren't particularly clear, and I should really find a nice way of notating that... Actually, all of the notation in this post has been a bit unclear, I think! Well done if you've managed to decipher my inane babbling.

So, what was the point of all this again? I'm not too sure, but I think it's kind of fun. I initially came up with this concept while doing some research into zero-plaintext attacks on Hill ciphers, but I couldn't find a practical use for it. Let me know if you find one, or if you manage to show that any of the techniques described here definitely works or definitely doesn't.