2024-09-03

n-gram frequencies as multi-dimensional Markov chains?

Why measure n-gram frequencies when you could just guess them instead?


When carrying out elementary cryptanalysis on some text, you might look at the individual letter frequencies to see what type of cipher it might be - if you've got a bit of a spiky distribution, it's probably a monoalphabetic substitution cipher, unless "e" is the most common followed by "t" and so on, in which case it's probably a transposition cipher. If it's flat, you've probably got yourself a polyalphabetic substitution cipher.

You might also look at bigram frequencies, since they provide more statistically rich information than letter frequencies, which can be useful when carrying out some sort of automated attack that aims to maximise the textual fitness of some decryption. You might also look at the trigrams and so on. But to do this, you need to measure these frequencies, which generally means writing code to do it for you. It's not difficult, but it is a bit tedious.
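For concreteness, here's roughly what that tedious counting step looks like in Python. This is a quick sketch rather than anything polished; the function name and the letters-only, lowercase preprocessing are just my choices.

```python
from collections import Counter

def ngram_frequencies(text: str, n: int) -> dict[str, float]:
    """Relative frequencies of the n-grams of `text`, ignoring non-letter characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# ngram_frequencies(ciphertext, 1) gives letter frequencies,
# ngram_frequencies(ciphertext, 2) gives bigram frequencies, and so on.
```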

So, what happens if you're a mathematician looking for some fun in these preliminary stages of cryptanalysis? I think I may have a solution.

Let's say we have an alphabet $a$ of length $k$ (for English, $k=26$). Then we might represent the letter frequencies of a text as a column vector $V$ of height $k$, where $V_i$ is the relative frequency of the $i$th letter of the alphabet (sadly not zero-indexed). Note that $\sum_{i=1}^k V_i=1$. Then we might represent bigram frequencies as a $k\times k$ matrix $M$, where $M_{i,j}$ is the relative frequency of the bigram $a_j a_i$, i.e. $a_j$ followed by $a_i$. Then we can divide each element of the matrix by the sum of the elements in its column, giving a new $k\times k$ matrix $T_2$, $\;(T_2)_{i,j}=\frac{M_{i,j}}{\sum_{p=1}^k M_{p,j}}$. Note that the sum of the elements in each column of $T_2$ is 1, and thus this is a left-stochastic matrix.
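If you want to see the same construction in code, a sketch might look something like this (assuming the usual 26-letter alphabet and numpy; the guard against empty columns is my own addition, so that letters which never appear don't cause a division by zero):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def bigram_stochastic_matrix(text: str) -> np.ndarray:
    """Build M from bigram counts, then column-normalise it into T2."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    letters = [c for c in text.lower() if c in index]
    M = np.zeros((k, k))
    for first, second in zip(letters, letters[1:]):
        # M[i, j] counts occurrences of letter j followed by letter i
        M[index[second], index[first]] += 1
    column_sums = M.sum(axis=0, keepdims=True)
    column_sums[column_sums == 0] = 1  # avoid dividing by zero for unseen letters
    return M / column_sums             # each column of T2 now sums to 1
```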

We can therefore think of these bigram frequencies as a Markov chain, where each letter has a particular probability of moving to some other letter. If we begin with some arbitrary letter distribution $V$, then we can plug it into this Markov chain to predict what the distribution of the letters following these letters might be. If we keep applying this, then we'll eventually come to a stationary state which we can plug into the network and get exactly the same distribution back out. The stationary state of a Markov chain is typically denoted as $\pi$, so we will adopt that notation too. Note that $T_2\pi=\pi$. This stationary state must be the letter distribution of our alphabet. $\pi$ is also any one column of $\lim_{s\to\infty}T_2^s$ (for $s\in\mathbb{Z}^+$).
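In code, the "keep applying it" idea is just power iteration, something like the sketch below (reusing the $T_2$ from the previous sketch, and assuming the chain is well-behaved enough, i.e. irreducible and aperiodic, for the iteration to settle down):

```python
import numpy as np

def stationary_distribution(T2: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Approximate pi such that T2 @ pi == pi by repeatedly applying the chain."""
    k = T2.shape[0]
    pi = np.full(k, 1.0 / k)   # start from a uniform guess; any distribution works
    for _ in range(iterations):
        pi = T2 @ pi           # T2 is left-stochastic, so pi stays a probability vector
    return pi
```

Equivalently, you could ask numpy for the eigenvector of $T_2$ with eigenvalue 1 and normalise it, but the iterative version matches the "keep plugging it back in" picture more directly.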

Now let's reconsider $T_2$ as a function (multiplying by a matrix is, after all, just applying a function, and a linear one at that, which is why it distributes over addition):

$$ T_2\!:\ \mathbb{R}^k\mapsto\mathbb{R}^k,\ T_2(v)_i=\left\langle v,\left(\begin{matrix} (T_2)_{i,1} \\ \vdots \\ (T_2)_{i,k} \end{matrix}\right)\right\rangle, $$

where $\langle a,\ b\rangle$ denotes the inner product of two vectors. We can extend this to some $n$-dimensional stochastic tensor $T_n\in\mathbb{R}^{k^n}$,

$$ T_n\!:\ \mathbb{R}^{k^{(n-1)}}\mapsto\mathbb{R}^k,\ T_n(\gamma)_i=\frac{1}{k^{n-2}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_n)_{i,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_n)_{i,\ \ldots,\ k} \end{matrix}\right)\right\rangle, $$

whereby some input $(n-1)$-dimensional tensor $\gamma$ is mapped over the last $(n-1)$ dimensions of $T_n$ (everything except the output index $i$), the inner product calculated and placed into the output vector, and the output vector divided by $k^{n-2}$ in order to maintain the property that the sum of all elements in the vector is equal to 1. If we define a 'stretching' function $\Gamma_q:\ \mathbb{R}^k\mapsto\mathbb{R}^{k^q}$ which repeats the input vector over the 'columns' of a tensor, then we define the stationary state $\pi$ of the $n$-tensor as being the vector for which $T_n\circ\Gamma_{n-1}(\pi)=\pi$ (the stretched copy of $\pi$ needs $(n-1)$ dimensions to match the domain of $T_n$).

A proof that such a solution exists, and is unique, will be left as an exercise for the reader (mainly since I don't have such a proof, but I'm somewhat confident that it works). Such a tensor may be constructed in a similar fashion to how the stochastic matrix for bigrams was constructed. We could think of this tensor as a multi-dimensional Markov chain, where we have a network that represents the probability of $a_\alpha a_\beta$ being followed by $a_\gamma$. I'm not sure how such a network could be represented graphically.
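To make the trigram case ($n=3$) concrete, here's a sketch of how I'd build $T_3$ and hunt for its stationary state numerically. The interpretation of $\Gamma_2$ as "tile $\pi$ over the spare axis", and the renormalisation at each step, are my assumptions rather than anything proved above:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def trigram_tensor(text: str) -> np.ndarray:
    """T3[c, a, b]: estimated probability that the pair (a, b) is followed by c."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    letters = [ch for ch in text.lower() if ch in index]
    counts = np.zeros((k, k, k))
    for a, b, c in zip(letters, letters[1:], letters[2:]):
        counts[index[c], index[a], index[b]] += 1
    sums = counts.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1                        # pairs never seen: their column stays zero
    return counts / sums                       # each observed (a, b) 'column' now sums to 1

def stationary_from_trigrams(T3: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Iterate pi -> T3(Gamma_2(pi)) and renormalise, hoping for a fixed point."""
    pi = np.full(k, 1.0 / k)
    for _ in range(iterations):
        gamma = np.tile(pi, (k, 1))                   # Gamma_2: gamma[a, b] = pi[b] for every a
        pi = np.einsum("cab,ab->c", T3, gamma) / k    # inner product, plus the 1/k^(n-2) factor
        pi = pi / pi.sum()                            # keep it a probability distribution
    return pi

# Usage: pi = stationary_from_trigrams(trigram_tensor(some_long_text))
```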

But what if we're not interested in letter frequencies? What if we have a burning desire to derive bigram frequencies from a 5-gram frequency distribution? Or, indeed, any $p$-gram distribution from some $n$-gram distribution? Don't worry, we can do that. Just define $T_{n,p}\in\mathbb{R}^{k^n}$ as

$$ T_{n,p}\!:\ \mathbb{R}^{k^{(n-p)}}\mapsto\mathbb{R}^{k^p},\ T_{n,p}(\gamma)_{i_1,\ \ldots,\ i_p}=\frac{1}{k^{n-p-1}}\cdot\left\langle \gamma,\left(\begin{matrix} (T_{n,p})_{i,\ \ldots,\ 1} & \cdots & \\ \vdots & \ddots & \\ & & (T_{n,p})_{i,\ \ldots,\ k} \end{matrix}\right)\right\rangle,\quad p\leq n-p $$

and then $T_n=T_{n,1}$. Then adapt our stretching function to $\Gamma_{r,q}:\ \mathbb{R}^{k^r}\mapsto\mathbb{R}^{k^q}$ (so that $\Gamma_q=\Gamma_{1,q}$), and we find that the $p$-gram distribution $\pi_p$ derived from some $n$-gram distribution is such that $T_{n,p}\circ\Gamma_{p,\,n-p}(\pi_p)=\pi_p$ (again, stretching $\pi_p$ up to the $(n-p)$-dimensional domain of $T_{n,p}$).
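Since I haven't tested this, here's the sort of numerical experiment that could confirm or refute it, using $n=4$, $p=2$ (so that $p\leq n-p$ and the stretching map is just the identity). Everything here, from the tensor layout to the renormalisation step, is my own guess at what the definitions above pin down:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
k = len(ALPHABET)

def four_gram_pair_tensor(text: str) -> np.ndarray:
    """T[i1, i2, j1, j2]: estimated probability that the pair (j1, j2) is followed by (i1, i2)."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    letters = [ch for ch in text.lower() if ch in index]
    counts = np.zeros((k, k, k, k))
    for a, b, c, d in zip(letters, letters[1:], letters[2:], letters[3:]):
        counts[index[c], index[d], index[a], index[b]] += 1
    sums = counts.sum(axis=(0, 1), keepdims=True)
    sums[sums == 0] = 1
    return counts / sums                      # normalise over the 'output' pair (i1, i2)

def bigram_distribution_from_4grams(T42: np.ndarray, iterations: int = 1000) -> np.ndarray:
    """Iterate pi2 -> T_{4,2}(pi2), with Gamma_{2,2} taken as the identity, and renormalise."""
    pi2 = np.full((k, k), 1.0 / k**2)
    for _ in range(iterations):
        pi2 = np.einsum("cdab,ab->cd", T42, pi2) / k  # the 1/k^(n-p-1) factor with n=4, p=2
        pi2 = pi2 / pi2.sum()
    return pi2

# Compare the resulting fixed point against bigram frequencies measured directly from
# the same text to see whether the p-gram-from-n-gram idea actually holds up.
```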

Or rather, I think it works. I haven't actually tried that last bit out properly, but I'm reasonably certain it works. Certainly more than 50% certain, but I won't commit to anything stronger than that. I think it looks pretty reasonable, although I will admit that the dimensions of each bit aren't particularly clear, and I should really find a nice way of notating that... Actually, all of the notation in this post has been a bit unclear, I think! Well done if you've managed to decipher my inane babbling.

So, what was the point of all this again? I'm not too sure, but I think it's kind of fun. I initially came up with this concept while doing some research into zero-plaintext attacks on Hill ciphers, but I couldn't find a practical use for it. Let me know if you find one, or if you manage to show that any of the techniques described here definitely works or definitely doesn't.