Cryptanalysis with N-Grams

This post is the third post in a series on computing with natural language data sets. For the first two posts, see the relevant section of our main content page.

A Childish Bit of Fun

In this post, we focus on the problem of decoding substitution ciphers. First, we’ll describe a few techniques humans use to crack ciphers. We’ll find these unsatisfactory, and move on to a simplistic algorithm which does a local search on the space of all possible decryptions, where we utilize our word segmentation algorithm from last time to determine the likelihood that a decryption is correct. We will continue this series’s trend of working in Python, so that we can reuse our code from previous posts. Finally, we’ll experiment by running the code on actual substitution ciphers used in history. And next time, we’ll work on improving the search algorithm for speed and accuracy. As usual, all of the code used in this blog post is available on this blog’s Github page.

So let’s get down to it.

Definition: A message which is human readable is called plaintext, and an encrypted message is called ciphertext.

When I was but a young lad, I enjoyed working through the puzzles in a certain book on long car rides. I wasn’t able to find the book on Amazon, but the premise was simple: every page was another famous quote obfuscated by a substitution cipher, and the reader’s goal was to decipher the quote by hand. In other words, each puzzle provided the ciphertext, and you had to find the corresponding plaintext. Most people intuitively understand the idea behind a substitution cipher, but we can also define it cleanly with the terminology from our post on metrics on words. (If the word “monoid” scares you, skip the mathematical definition and read the example below first.)

Definition: Let $ \Sigma$ be a fixed alphabet. A substitution key is a bijection $ f: \Sigma \to \Sigma$. A substitution cipher is the induced monoid homomorphism $ \overline{f}$ on $ \Sigma^*$, the set of all strings of letters in $ \Sigma$.

Then for any plaintext message $ w \in \Sigma^*$, the ciphertext is precisely $ \overline{f}(w)$, and for any encrypted message $ w_{\textup{enc}} \in \Sigma^*$, the corresponding plaintext message is $ \overline{f}^{-1}(w_{\textup{enc}})$. Note that $ \overline{f}$ is uniquely determined by $ f$, so that the problem of deciphering a message reduces to determining the correct key.

To explain this in plain English, the substitution key is a matching-up of letters of the English alphabet with letters of the English alphabet. For instance, we might take A to G, B to S, C to Q, and so on. Note that the matching need not be symmetric: A might go to G while G goes to some letter other than A. When encrypting a plaintext message, we just replace each A with a G, each B with an S, and continue until we have accounted for every letter in the message. The “bijection” condition above just means that every letter is associated with some other letter, and there is no conflict of association (for instance, we can’t have both A and B associated with L, because this would yield multiple possible decryptions of an encrypted message, and vice versa). Finally, the “induced monoid homomorphism” simply means that to encrypt a message, we use the key to encrypt it letter-by-letter. Here is an example:

Plaintext message: holy spumoni batman
Substitution key (associations are vertical):
                   abcdefghijklmnopqrstuvwxyz
                   nopqrstuvwxyzabcdefghijklm

Ciphertext: ubyl fchzbav ongzna

This particular key has a famous name (rot13), because it is simply a rotation of the alphabet by 13 letters. Also, we note that if we call $ r$ the substitution cipher induced by this key, then we see $ r$ has the very special property that $ r = r^{-1}$ or equivalently $ r^2 = \textup{id}_{\Sigma^*}$. So applying the encryption method twice actually gives us back the original plaintext message. But of course a rotation cipher is too simplistic; in general a substitution key can match any two letters together to make for a more complex code.
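
We can check this involution property directly with a few lines of Python. This is a throwaway sketch, separate from the encryption code we build below; it stores the letter-by-letter association in a dictionary and applies it character by character, leaving spaces alone.

alphabet = "abcdefghijklmnopqrstuvwxyz"
rot13Key = alphabet[13:] + alphabet[:13]    # "nopqrstuvwxyzabcdefghijklm"
table = dict(zip(alphabet, rot13Key))       # the letter-by-letter association

def applyRot13(msg):
   # substitute each letter according to the key; leave non-letters untouched
   return ''.join(table.get(c, c) for c in msg)

>>> applyRot13("holy spumoni batman")
'ubyl fchzbav ongzna'
>>> applyRot13(applyRot13("holy spumoni batman"))
'holy spumoni batman'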

Decoding the substitution ciphers in my childhood puzzle book involved a few tricks, which when combined and applied (more or less at random) likely yielded the right decryption. These included:

  • Looking at one-letter words, and picking I, A, and occasionally O as the substitution.
  • Looking at short two letter words and three letter words, and trying words like “to,” “an,” and “the” in their place. In other words, partially decipher part of the text, and see if using that partial substitution leads to absurd decryptions of other parts of the message.
  • Looking for doubled letters, and replacing them with common doubled letters like “ee,” “tt,” “ss,” “ff,” and “ll.”
  • Finding the most common letter, and trying to substitute it with common letters like e, s, t, r, l, or n.

Unfortunately, most of these rely heavily on a few cop-outs. First, the text from this puzzle book included punctuation, word spaces, and other distinguishing features of the English language. In real life, ciphers are usually just one fat block of text, or separated into blocks of a fixed width; when decrypted correctly, a human’s natural ability at word segmentation makes the message obvious. Second, I knew ahead of time that the plaintext message was a famous quote! I had inside knowledge about the content of the message, and so decryption came more easily. Often encoded messages are not full sentences, and often the people doing the encrypting will strip out common words, but still maintain the sensibility of the decoded phrase (e.g. “dropoff midnight joeys bar back door”). Furthermore, the text could deliberately have spelling errors and other types of message adulteration to avoid decryption.

In other words, the messages in the puzzle book were flawless and designed to be easy to solve. We are more interested in designing a solver which maintains quality in the midst of imperfection and trickery.

That being said, the patterns we used as a child give insight into how we might construct an algorithm to decrypt messages. During manual decryption, one would often get very close to the solution and notice that one incorrectly substituted two letters, but that the rest of the message is correct. By twiddling the incorrectly substituted letters, one would arrive at the correct decryption, and pat oneself on the back. This is the key behind the algorithm that follows, in that we will start with a random decryption, and incrementally improve it until we can’t do so any more. But before we get there, we need to figure out how to represent our data appropriately.

Representing a Cipher as a Piece of Data

One easy way to represent a cipher key is much like the example above: simply use a 26-character string of letters like “nopqrstuvwxyzabcdefghijklm” where we assume that the letter ‘a’ maps to the first character of the string, ‘b’ to the second, and so on. This representation will benefit us later when we want to make slight adjustments to a key: we can simply swap any two letters in the list, or do permutations of triples of letters.

Now that we have a key, we can design a function that encrypts a message. In Python, we use the nice string methods for character replacement:

import string

alphabet = "abcdefghijklmnopqrstuvwxyz"
def encrypt(msg, key):
   return msg.translate(string.maketrans(alphabet, key))

The translate method of a string is very special: to operate, it requires a translation table covering all 256 characters of the extended ASCII range. To alleviate the pain of setting this up for relatively simple translations, Python provides us with the “maketrans” function, which when given an input and output alphabet, constructs a translation table in which the i-th character of the first argument is translated to the i-th character of the second argument, and leaves everything else unchanged. Note that here we don’t include capital letters. The interested reader can see the source code for the minor modifications that fix capitalization; it’s not very interesting, so we omit it here.

For example, with the following key we can encrypt some test messages:

>>> key = "qwertyuiopasdfghjklzxcvbnm"
>>> encrypt("why hello there", key)
'vin itssg zitkt'

And a decryption function is quite similar:

def decrypt(msg, key):
   return msg.translate(string.maketrans(key, alphabet))

See that decrypting the encrypted message above works as expected:

>>> decrypt("vin itssg zitkt", key)
'why hello there'

Next, we need to be able to “twiddle” a key. Our ultimate algorithm will start with a random key, and improve it incrementally by changing two letters at a time. We can use the same “translate” function again to do so:

def keySwap(key, a, b):
   return key.translate(string.maketrans(a+b, b+a))
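
For example, here is a quick interactive check using the key from the encryption example above; in that key “k” encrypts to “a” and “x” encrypts to “b”, and the swap exchanges those two images:

>>> key = "qwertyuiopasdfghjklzxcvbnm"
>>> keySwap(key, 'a', 'b')
'qwertyuiopbsdfghjklzxcvanm'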

In other words, if the letter $ x$ is mapped to $ a$, and $ y$ is mapped to $ b$, then this function returns a key that maps $ x$ to $ b$ and $ y$ to $ a$. So now that we’ve got a representation for a key, let’s figure out how to “make a key better.”

Letter Trigrams

Our general strategy is as follows: start with a random key, and then come up with some sort of way to judge the key based on its decryption. From there, swap pairs of letters in the key to look at keys which are “close by.” If any swap is judged to be better than the current key, use that as the new key, and start the process over again. We stop looking for new keys after a certain number of steps, or we get a decryption with a certain level of accuracy.

This algorithm is a common paradigm in optimization. The usual name is “steepest descent” or “steepest ascent,” and the analogy is evident. Suppose we want to get to the top of the highest peak in Tibet. We can start from some random place in Tibet, and look around us. If we are standing next to a place that is higher than we are presently, move to that location and repeat. There are obviously some problems with this algorithm: first, we will always get to some peak, but we may not get to the highest peak. To alleviate this, we can run the algorithm from a large number of random starting locations, and compare all of the peaks we arrive at. Certainly, with enough random starting points, we are very likely to find the highest peak (eventually, we will randomly start on the goal itself, or at least very close to it!). Second, if we were to try this algorithm in Illinois, we might never find any hills at all! This would leave us blindly wandering around some corn field, and is clearly a waste of our time. Before we do any work, we should have a good idea that the space we’re searching through is more like Tibet than Illinois, and preferably we’d only have one major peak.

Now, our notion of “close by” keys needs to be made precise. Recalling our first post in the series on metrics on words, we want to investigate cleverly-chosen keys which are close to the given key with respect to the Levenshtein metric. In the analogy, our search space has far too many directions to check them all for an increase. Instead, we want to know ahead of time which directions are more likely to be higher, and check those directions only. We will see how to do this momentarily.

The difficult part, really, is determining the “value” of a given decryption. This is one of the central questions of this whole series: how do we tell if a string of characters is sensible language? If we know it’s sensible, we can segment it; we saw how to do that last time. But how do we tell whether a string of characters is sensible in the first place?

Of course, it’s well beyond the scope of this post to give an exact answer to that problem, but it turns out that a reasonable approximate answer is within our grasp. As we did in word segmentation by looking at sequences of words, let’s look here at sequences of letters.

Definition: A letter n-gram is a sequence of $ n$ letters, i.e. an element of $ \Sigma^n$.

Note that with a large corpus of internet text (as we discussed in word segmentation), we can compute the counts of triples of letters. Borrowing again from Norvig’s page, we have a list of letter trigrams and letter bigrams. All possible 2-grams and 3-grams occur in the files, and here is a sample of the most common and least common from both:

For bigrams:

in	134812613554
th	133210262170
er	119214789533
re	108669181717
he	106498528786
...
qy	6901470
zq	6170496
jx	5682177
qz	4293975
jq	2858953

And trigrams:

the	82103550112
ing	43727954927
and	43452082914
ion	39907843075
tio	32705432538
ent	31928292897
...
jwq	10340
jqy	8871
zqy	8474
jzq	7180
zgq	6254

Note that even though we aren’t discerning words themselves, a true decryption will definitely contain the common trigrams and bigrams, but if our key is wrong, there are likely (just by randomness) some uncommon trigrams and bigrams in the resulting decryption. Thus, we can take the list of all letter trigrams in a sequence, compute the probability of each trigram occurring, and take the product of all of them to get a score for a given decryption.

Implementation: Steepest Ascent, and Generating Neighbors

The steepest ascent algorithm is pretty much the same for any problem. In pseudo-python, it looks something like:

def steepestAscent(posn, evaluatePosn, generateNeighbors, numSteps):
   val = evaluatePosn(posn)
   neighbors = generateNeighbors(posn)

   for i in range(numSteps):
     next = neighbors.next()
     nextVal = evaluatePosn(next)

     if nextVal > val:
        val = nextVal
        posn = next
        neighbors = generateNeighbors(next)

   return posn

So we need a function which generates neighbors of a given position, and an evaluation function. Before we actually write the real implementation of our steepest ascent algorithm on decryption keys, let’s write the evaluation function, and then the function to generate neighboring keys.

First, to evaluate the quality of a decryption, we need a function that extracts all letter trigrams. We can do this in general for letter n-grams:

def letterNGrams(msg, n):
   return [msg[i:i+n] for i in range(len(msg) - (n-1))]
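
For instance, the letter trigrams of a short string:

>>> letterNGrams("hello", 3)
['hel', 'ell', 'llo']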

Now recall the class we designed in the post on word segmentation that loads the word-count file, and computes probabilities. As it turns out, we can reuse that code to load any file where each line has a word and a count. So with a slight modification and an include, we import the word segmentation algorithm from last time (this is included in our implementation on this blog’s Github page. See the file segment.py for the minor changes).

In other words, to load our trigram word file, we simply instantiate the object, and then use the same sort of logarithmic sum as we did for word segmentation:

trigramLetterProb = OneGramDist('count-3l.txt')
def trigramStringProb(msg):
   return sum(math.log10(trigramLetterProb(trigram))
      for trigram in letterNGrams(msg, 3))

So our “evaluatePosn” function above will simply be this “trigramStringProb” function. To give an example of this:

>>> trigramStringProb("hellotherefriend")
-42.62490211229232
>>> x = [encrypt("hellotherefriend",
                 shuffled(alphabet)) for i in range(20)]
>>> y = [(z,trigramStringProb(z)) for z in x]
>>> for (a,b) in y:
...     print("%s, %f" % (a,b))
...
zuaayfzuruoriueb, -75.711233
ejzzfgejljmlsjwk, -90.312349
ghnnctghshisuhoe, -64.815609
ikggrzikykfyqkld, -91.449079
akhhuqakfktfjkdo, -89.589869
gayynvgarahrpadl, -68.828649
pottgipoaovakosb, -68.253187
hozznjhowolwuobx, -81.541286
ihuusoihnhvnrhyl, -78.413089
dmqquldmbmcbnmyk, -81.434687
amvvtwamomkozmfr, -76.938938
znddfwznunjuhnie, -76.856587
rxddemrxkxuktxla, -82.547184
tjddsktjxjoxfjun, -83.725443
wxbbcdwxjxvjhxsy, -95.146985
hsvvonhsasgamspc, -66.135361
inaavdinoneomngl, -59.646548
tcjjpxtcbcqbkcov, -88.000009
burrpvbujuajgusk, -75.768519
fkmmbxfkvkqvyknl, -94.509938

And so we see that for incorrect decryptions, the score is orders of magnitude smaller: they naturally contain many uncommon letter trigrams.

To generate neighboring keys, we also use n-grams, but this time we work with bigrams. The idea is this: take the most uncommon bigram found in the attempted decryption of the message, and fix the key by replacing the bigram with a more common bigram.

We do this quite gradually, in that if we see, say, “xz,” we may note that “ez” is more common and just swap x and e in the key. This way we don’t just always try to replace all uncommon bigrams with “th” or “in.” In code:

bigramLetterProb = OneGramDist('count-2l.txt')
def neighboringKeys(key, decryptedMsg):
   bigrams = sorted(letterNGrams(decryptedMsg, 2),
                    key=bigramLetterProb)[:30]

   for c1, c2 in bigrams:
      for a in shuffled(alphabet):
         if c1 == c2 and bigramLetterProb(a+a) >
                         bigramLetterProb(c1+c2):
            yield keySwap(key, a, c1)
         else:
            if bigramLetterProb(a+c2) > bigramLetterProb(c1+c2):
               yield keySwap(key, a, c1)
            if bigramLetterProb(c1+a) > bigramLetterProb(c1+c2):
               yield keySwap(key, a, c2)

   while True:
      yield keySwap(key, random.choice(alphabet),
                         random.choice(alphabet))

First, we create the bigram letter probability lookup table, as with trigrams. To generate the neighboring keys, first we sort all bigrams in the decrypted message so that the least common come first, and then we take the first 30 of them. Then we shuffle the alphabet (so as not to bias the beginning of the alphabet) and look for improvements in the key. Note that this function is a generator. In other words, it doesn’t “return” a value in the sense that it halts execution. Instead, it “yields” a value until the caller asks for another (with the “next()” function), and then it returns to the computation it was doing until it reaches another yield. This way, we don’t have to generate a huge list of keys ahead of time, and risk the possibility that they are never used. Instead, we cook up new keys on demand, and only compute as many as are needed. Since we will often discover a better key within the first few iterations, this will undoubtedly save us a lot of unnecessary computation. Finally, after exhausting our list of the 30 least common bigrams, we make random swaps, hoping for an improvement.
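
If generators are unfamiliar, here is a minimal sketch of the mechanism, unrelated to the cipher code (in Python 3 one would write next(counter) instead of counter.next()):

def countUp():
   n = 0
   while True:
      yield n    # hand n to the caller and pause right here
      n += 1     # resume from this line the next time a value is requested

>>> counter = countUp()
>>> counter.next()
0
>>> counter.next()
1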

Notice that this function requires both the key and the decrypted message. This alters our steepest ascent algorithm superficially, because the generateNeighbors function requires two arguments. We also changed all of the variable names to be relevant to this problem.

def steepestAscent(msg, key, decryptionFitness, numSteps):
   decryption = decrypt(msg, key)
   value = decryptionFitness(decryption)
   neighbors = iter(neighboringKeys(key, decryption))

   for step in range(numSteps):
      nextKey = neighbors.next()
      nextDecryption = decrypt(msg, nextKey)
      nextValue = decryptionFitness(nextDecryption)

      if nextValue > value:
         key, decryption, value = nextKey, nextDecryption, nextValue
         neighbors = iter(neighboringKeys(key, decryption))
         print((decryption, key))

   return decryption

Note that we also print out the partial results, so the user can visually see the intermediate key choices. Now the very last step is to actually run the code on some random starting keys, and collect, segment, and display the results:

def shuffled(s):
   sList = list(s)
   random.shuffle(sList)
   return ''.join(sList)

def preprocessInputMessage(chars):
   return ''.join(re.findall('[a-z]+', chars.lower()))

def crackSubstitution(msg, numSteps = 5000, restarts = 30):
   msg = preprocessInputMessage(msg)
   startingKeys = [shuffled(alphabet) for i in range(restarts)]
   localMaxes = [steepestAscent(msg, key, trigramStringProb, numSteps)
                 for key in startingKeys]

   for x in localMaxes:
      print(segmentWithProb(x))

   prob, words = max(segmentWithProb(decryption) for decryption in localMaxes)
   return ' '.join(words)

We first preprocess the message to ensure everything is lowercase characters (we remove anything else), and then generate a bunch of random starting keys, perform the steepest ascent on those keys, and display the results, along with the most probable, as chosen by word segmentation. Let’s see how it performs in the real world.

The Germans and the Russians: Real Codes Decrypted

Let’s try running our code on a test message:

>>> msg = 'ujejvzrqfeygesvsoofsujwigfeestgufvvzgujjejcfwf\
qfevlgwvswpzsizfnasvvgeswnqfevrpfovfyswnqafigfvqegisogarv\
zgoflgljlgwvfxgfkvsckaxkvtfikjkoozjpnseafeestgufvvzgujje'
>>> crackSubstitution(msg)
[... lots of output ...]
'dorothy parker it is said once arrived at the door of an
apartment in which a glittering party was taking place at
precisely the same moment a beautiful but vacuous showgirl
arrived at the door'

In addition to being correct, most of the attempts were very close as well, giving the first few letters such decodings as “coroti” or “gorothy”. Let’s try a harder one. This message was sent by Baron August Schluga, a German spy in WWI (source):

>>> msg = 'NKDIFSERLJMIBFKFKDLVNQIBRHLCJUKFTFLKSTENYQNDQNT\
TEBTTENMQLJFSNOSUMMLQTLCTENCQNKREBTTBRHKLQTELCBQQBSFSKLTMLS\
SFAINLKBRRLUKTLCJUKFTFLKFKSUCCFRFNKRYXB'
>>> crackSubstitution(msg)
[... lots of output ...]
'english complaining over lack of munitions they regret
that the promised support of the french attack north of
arras is not possible on account of munition
insufficiency wa'

Here’s a code sent by Aldrich Ames, the most notorious CIA mole to have ever been caught. It was a portion of a message sent in 1992 that was found on his person when he was arrested:

>>> msg = 'cnlgvqvelhwttailehotweqvpcebtqfjnppedmfmlfcyfsqf\
spndhqfoeutnpptppctdqnifsqdtwhtnhhlfjolfsdhqfedhegnqtwvnqht\
nhhlfjwebbitspthdtxqqfoeutyfslfjedefdnifsqgnlngnpcttqedoedf\
gqfitlxni'
>>> crackSubstitution(msg)
...
'parch third weez bridge with s pile to mass info fr op you
 to us and to give asses spent about new de add rom ground
 to indicate what de add rom will be used next to give
 your o minion about caracas p eeting in october xab'

This is a bit more disheartening because it’s obviously close to correct, but not quite there. A human can quickly fix the errors, swapping p with m, and z with k. That being said, perhaps by slightly increasing the number of steps we could find our way to the right decryption.

Here’s another, from the same source above, sent by Confederate General J.E. Johnston during the Siege of Vicksburg, May 25, 1863 during the U.S. Civil War. It was intercepted and then deciphered by Lincoln’s three-man team of cryptanalysts:

>>> msg = 'AKMSSROOHDCRDSMCRVRORGDWTHDRPBGRDORWRUUNGFHPGYGQ\
WTRNTCGYGQPTRDLPTHAHOPKGQPHTGWMDCWTHKHROPTHHDHFYHDNMFIHCWTM\
PROYGQKEGKHBHBGTDOPGDXM'
>>> crackSubstitution(msg)
...
'x raff is send in fa division when it po in si will come to \
you which do you thing the x est route how and where is the \
enemy encambedwhatisyourkorepepohnstonja'

Again, we are close but not quite there. In fact, this time the decryption even fails to segment the last block, even when there are useful pieces inside it. This is a side-effect of our zealous segmentation model that punishes unknown words exponentially in their length.

Unfortunately, our program seems to fail more often than succeed. Indeed, the algorithm is not all that good. One might hope the following function is usually the identity on any given message, but it rarely is:

def testDecryption(msg):
   return crackSubstitution(encrypt(msg, shuffled(alphabet)))

In fact I have yet to see one example of this function returning sensible output for any input.

>>> testDecryption("the quick brown fox jumped over \
the lazy dog")
...
'the pladjusigbvicklymenizesthefrowniq'
>>> testDecryption("i love mathematics and computer \
programming")
...
'of typlasmplasourancutldispedetheallonh'
>>> testDecryption("the time has come the walrus said to \
speak of many things of ships and shoes and sealing wax of \
cabbages and kings of why the sea is boiling hot and whether \
pigs have wings")
...
'the tive has move the calfwssaidtosreakouvanythingsoushirs\
andshoesandsealingcaboumappagesandkingsouchytheseaispoiling\
hotandchethefrigshazecings'

That last one was close, but still rather far off. What’s more, for random small inputs the function seems to generate sensible output!

>>> testDecryption("slkfhlakjhahlaweirhurv")
[... some output ...]
'l some store test and he chi'

Disregarding the fun we could have decrypting random messages, we cry, “What gives?!” It seems that there’s some sort of measure of entropy that comes into play here, and messages with less entropy have a larger number of interpretations. Here entropy could mean the number of letters used in the message, the number of distinct letters used in the message, or some combination of both.

The last obvious annoyance is that the program is really slow. If we watch the intermediate computations, we note that it will sometimes approach the correct solution, and then stray away. And what’s more, the number of steps per run is fixed. Why couldn’t it be based on the amount of progress it has made so far, pushing those close solutions a bit further toward the correct key, perhaps by spending more time just on those successful decryptions?

Next time, we’ll spend our time trying to improve the steepest ascent algorithm. We’ll try to make it more quickly abandon crappy starting positions, and squeeze the most out of successful runs. We’ll further do some additional processing in our search for neighboring keys, and potentially involve letter 2-, 3-, 4-, and even 5-grams.

But all in all, this worked out pretty well. We often emphasize the deficiencies of our algorithms, but here we have surprisingly many benefits. For one, we actually decoded real-life messages! With this program in hand even as few as twenty years ago, we would have had a valuable tool for uncovering nefarious secrets encoded via substitution.

Even more, by virtue of our data-driven solution, we inherently bolster our algorithm with additional robustness. It’s stable, in the sense that minor data corruption (perhaps a typo, or nefarious message obfuscation) is handled well: our segmentation algorithm allows for typos, and a message with one or two typos still has a high letter trigram score in our probabilistic model. Hence we largely bypass this sort of message obfuscation: if the average human can make sense of the decrypted message, so can our program.

And as with word segmentation, it’s extensible: simply by swapping out the data sets and making minor alphabet changes, we can make our program handle encrypted messages in any language. This exponentially increases the usefulness of our approach, because data sets are cheap, while sophisticated algorithms are expensive (at least, good ones can be sophisticated).

So look forward to next time when we improve the accuracy and speed of the steepest-ascent algorithm.

Until then!

Word Segmentation, or Makingsenseofthis

A First Look at Google’s N-Gram Corpus

In this post we will focus on the problem of finding the appropriate word boundaries in strings like “homebuiltairplanes”, as is common in web URLs like www.homebuiltairplanes.com. This is an interesting problem because humans do it so easily, but there is no obvious programmatic solution. We will begin this article by addressing the complexity of this problem, continue by implementing a simple model using a subset of Google’s n-gram corpus, and finish by describing our future plans to enhance the model. As usual, all of the code and data used in this post is available from this blog’s Github page.

Word Segmentation

We just claimed word segmentation is a hard problem, but in fact the segmentation part is quite easy! We’ll give a quick overview of the segmentation algorithm which assumes that we can evaluate a segmentation for optimality.

First, we note that by an elementary combinatorial argument there are $ 2^{n-1}$ segmentations of a word with $ n$ letters. To see this, imagine writing a segmentation of “homebuiltairplanes” with vertical bars separating the letters, as in “home | built | ai | rpla | nes”. The maximum number of vertical bars we could place is one less than the number of letters, and every segmentation can be represented by describing which of the $ n-1$ gaps contain vertical bars and which do not. We can hence count up all segmentations by counting up the number of ways to place the bars. A computer scientist should immediately recognize that these “bars” can represent digits in a binary number, and hence all binary numbers with $ n-1$ digits correspond to valid segmentations, and these range from 0 to $ 2^{n-1} - 1$, giving $ 2^{n-1}$ total numbers.

One can also prove this by fixing the number $ k$ of segments the word is broken into and counting all the segmentations for each $ k$. We leave it as an exercise to the reader.
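
To make the correspondence with binary numbers concrete, here is a throwaway sketch (not part of the segmentation program) that enumerates every segmentation of a short, nonempty string by letting each True/False pattern decide which gaps get a bar:

from itertools import product

def allSegmentations(word):
   segmentations = []
   for bars in product([False, True], repeat=len(word) - 1):
      pieces, current = [], word[0]
      for letter, bar in zip(word[1:], bars):
         if bar:                  # a bar: close the current piece and start a new one
            pieces.append(current)
            current = letter
         else:                    # no bar: the letter joins the current piece
            current += letter
      pieces.append(current)
      segmentations.append(pieces)
   return segmentations

>>> allSegmentations("abc")
[['abc'], ['ab', 'c'], ['a', 'bc'], ['a', 'b', 'c']]
>>> len(allSegmentations("home"))   # 2^3
8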

The key fact to glean from that counting problem is that enumerating all possible segmentations is infeasible. The right way to approach this problem is to mimic a human’s analysis. This will result in a dynamic program, and readers who have followed along will recognize the approach from our post on the Levenshtein metric. In addition we have a Python primer which doubles as a primer on dynamic programming.

A human who looks at the word “homebuiltairplanes” performs a mental scanning of the word. First we check if “h” is a good word, and then move to “ho”, “hom”, and finally decide “home” is likely the right word. From there we split the word into “home” and “builtairplanes”, and segment the remainder appropriately. In addition, one does not consider any of the segmentations whose first word is “h”, nor any whose first word is “homebu,” and the reason is that these are not words (nor are they likely to mean anything even as non-dictionary tokens; we will discuss this nuance soon). Hence, all of the $ 2^{16}$ segmentations that make the first word “h” can be disregarded, as can the $ 2^{11}$ whose first word is “homebu”.

However, following these directions too strictly (and programs are quite strict in following directions) will get us into a number of traps. The mathematical mind will come up with many examples for which this reasoning is ambiguous or flat-out faulty. Here are two such suspects. The first is “lovesnails”. If we prematurely pick the first word as “love”, then we will be forced to choose the segmentation “love snails”, when it might truly be “loves nails”. Regardless of which segmentation is more likely, we simply need to consider them both in whatever judgement system we come up with. The second example is “bbcamerica”, as in www.bbcamerica.com. In particular, if we expect to find dictionary words in our segmentation, we’ll be quite disappointed, because no word begins with “bb”. Our solution then is to break free of the dictionary, and to allow potentially odd segmentations, being confident that our criterion for judging segmentation optimality will successfully discard the bad ones.

Before we liberate ourselves from the confines of Oxford’s vocabulary, let’s implement the procedure that performs the segmentation, abstracting out the optimality criterion. We note that (as usual with a dynamic program) we will need to memoize our recursion, and one should refer to our post on the Levenshtein metric and the Python primer on dynamic programming for an explanation of the “@memoize” decorator.

First, we implement the “splitPairs” function, which accepts a string $ s$ as input and returns a list containing all possible split pairs $ (u,v)$ where $ s = uv$. We achieve this by a simple list comprehension (gotta love list comprehensions!) combined with string slicing.

def splitPairs(word):
   return [(word[:i+1], word[i+1:]) for i in range(len(word))]

Indeed, “word[a:b]” computes a substring of “word” including the indices from $ a$ to $ b-1$, where blank entries for $ a$ and $ b$ denote the beginning and end of the string, respectively. For example, on the input string “hello”, this function computes:

>>> splitPairs("hello")
[('h', 'ello'), ('he', 'llo'), ('hel', 'lo'),
 ('hell', 'o'), ('hello', '')]

Note that the last entry in this list is crucial, because we may not want to segment the input word at all, and in the following we assume that “splitPairs” returns all of our possible choices of action. Next we define the “segment” function, which computes the optimal segmentation of a given word. In particular, we assume there is a global function called “wordSeqFitness” which reliably computes the fitness of a given sequence of words, with respect to whether or not it’s probably the correct segmentation.

@memoize   # cache results for repeated substrings, as described in our dynamic programming primer
def segment(word):
   if not word: return []
   allSegmentations = [[first] + segment(rest)
                       for (first, rest) in splitPairs(word)]
   return max(allSegmentations, key = wordSeqFitness)

In particular, we are working by induction on the length of a word. Assuming we know the optimal segmentations for all substrings not including the first letter, we can construct the best segmentation which includes the first letter. The first line is the base case (there is only one choice for the empty string), and the second line is the meat of the computation. Here, we look at all possible split pairs, including the one which considers the entire word as a good segmentation, and we find the maximum of those segmentations with respect to “wordSeqFitness”. By induction we know that “segment” returns the optimal segmentation on every call, since each “rest” variable contains a strictly smaller substring which does not include the first letter. Hence, we have covered all possible segmentations, and the algorithm is correct.

Again, note how simple this algorithm was. Call it Python’s syntactic sugar if you will, but the entire segmentation was a mere four lines of logic. In other words, the majority of our work will be in figuring out exactly how to judge the fitness of a segmented word. For that, we turn to Google.

Google’s N-Gram Corpus

The biggest obstacle in “word segmentation” turns out to be identifying what words are, or rather what strings are most likely to have meaning. If we stick to a naive definition of what a word is, i.e., anything found in the Oxford English Dictionary, we end up with a very weak program. It wouldn’t be able to tell that “bbc” should be a word, as in our above example, and further it wouldn’t be able to tell which words are more likely than others (the OED has some pretty obscure words in it, but no mention of which are more common).

The solution here is to create a probabilistic model of the English language. Enter Google. To date Google has provided two major data corpora: the first, released in 2006, contains the frequency counts of language data from over a trillion words from a random sample of the entire internet. They gathered all n-grams, sequences of $ n$ tokens (a “token” was defined arbitrarily by the researchers at Google), and reported the counts of those which occurred above a certain threshold. Unfortunately this corpus is over a hundred gigabytes of uncompressed data (24GB compressed), and consequently it can’t be found online (you have to order it on 6 DVDs from UPenn instead).

The second corpus is more specifically limited to language data from scanned books. Again organized as n-grams, this data set has some advantages and disadvantages. In particular, it was scanned directly from books, so optical character recognition was used to determine what image segments correspond to words, and what words they are. Due to deterioration or printing errors in the scanned books (for older years, these cover all books printed), this results in a lot of misinterpreted tokens. On the other hand, all of these sources are proofread for mistakes. Considering the banality of the web, we can probably assume this corpus has a higher data quality. Furthermore, tokens that appear commonly on the web (abbreviations like “lol” and “LotR” come to mind) are unlikely to occur in a book. For better or worse, if we use the book data set, we probably won’t have an easy time segmenting words which use such terms.

In the future we plan to obtain and process the book data set (unless we can find the web data set somewhere…), but for now we use a convenient alternative. Peter Norvig, Google’s Director of Research, modified and released a subset of Google’s web corpus. He made all of the entries case insensitive, and he removed any tokens not consisting entirely of alphabet letters. Our first investigation will use his 1-gram file, which contains the 333,333 most commonly used single words. Perhaps unsurprisingly, this covers more than 98% of all 1-grams found, and the entire corpus easily fits into memory at a mere 5MB. Even better, we will design our program so that it works with data in any form, so the data sets can be swapped out at will.

Each line in Norvig’s data file has the form $ word \left \langle tab \right \rangle n$. The first few lines are:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
is	4705743816
on	3750423199
that	3400031103
by	3350048871
this	3228469771
with	3183110675
i	3086225277
you	2996181025

While the last few are (quite uselessly):

goolld	12711
goolh	12711
goolgee	12711
googook	12711
googllr	12711
googlal	12711
googgoo	12711
googgol	12711
goofel	12711
gooek	12711
gooddg	12711
gooblle	12711
gollgo	12711
golgw	12711

The important aspect here is that the words have associated counts, and we can use them to probabilistically compare the fitness of two segmentations of a given word. In particular, we need to define a probabilistic model that evaluates the likelihood of a given sequence of words being drawn from a theoretical distribution of all sequences of words.

Naive Bayes, the Reckless Teenager of Probability Models

The simplest useful model we could create is one in which we simply take the probability of each word in the segmentation and multiply them together. This is just the product rule for probabilities, with the added assumption that each event (the occurrence of a word) is independent of the others. In common parlance, this is equivalent to saying that the probability of a sequence of words occurring is the probability that the first word occurs (anywhere in the sequence) AND the second word occurs AND the third word occurs, and so on. Obviously the independence assumption is very strong. For a good counterexample, the sequence “united states of america” is far more common than the sequence “states america of united”, even though in this model they have the same probability.
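
In symbols, the score this model assigns to a segmentation into words $ w_1 w_2 \dots w_n$ is $ P(w_1 w_2 \dots w_n) \approx P(w_1)P(w_2) \cdots P(w_n)$, where each $ P(w_i)$ is the probability of the single word $ w_i$ occurring on its own.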

The simplicity of this model makes it easy to implement, and hence it comes up often in applications, and so it has a special name: naive Bayes. In essence, a piece of data has a number of features (here, the features are the words in the sequence), and we make the assumption that the features occur independently of one another. The idea has been applied to a cornucopia of classification problems, but it usually serves as a stepping stone to more advanced models which don’t ignore the underlying structure of the data. It can also be an indicator of whether more advanced models are really needed, or a measure of problem complexity.

For word segmentation, however, it’s a sensible simplification. Indeed, it is quite unlikely that a sequence of words will show up in a different order (as the complexity of the words in the segmentation increases, it’s probably downright impossible), so our “united states of america” example doesn’t concern us. On the other hand, we still have the example “lovesnails”, which the naive Bayes model might not sensibly handle. For a more contrived example, consider the segmentations of “endear”. Clearly “end” and “ear” are common words, but they are very rarely written in that order. Much more common is the single word “endear”, and an optimal model would take this into account, as well as three, four, and five-word sequences in our segmentation.

For a general model not pertaining just to segmentation, we’d truly love it if we could take into account $ n$-grams for any $ n$, because indeed some (very long) texts include words which are likely to appear together, yet are far apart. Take, for instance, a text about sports in the UK. It is quite unlikely to see the word “soccer”, and more likely to see the word “wimbledon,” if the phrase “United Kingdom” shows up anywhere in the text (or, say, words like “Manchester”). There are some models which try to account for this (for instance, see the work on the “sequence memoizer” done at Columbia). However, here we expect relatively few tokens, so once we get there, we can limit our concerns to 5-grams at most. And, as most mathematicians and programmers agree, the best solution is often the simplest one.

Implementation

Our naive Bayes probability model will end up being a class in Python. Upon instantiating the class, we will read in the data file, organize the word counts, and construct a way to estimate the probability of a word occurring, given our smaller data set.

In particular, our class will inherit the functionality of a dictionary. The relevant line is

class OneGramDist(dict):

This allows us to treat any instance of the class as if it were a dictionary. We can further override the different methods of a dictionary, extending the functionality in any way we wish, but for this application we won’t need to do that.

The relevant methods we need to implement are an initialization, which reads the data in, and a “call” method (the method called when one treats the object as a function; this will be clear from the examples that follow), which returns the probability of a given word.

   def __init__(self):
      self.gramCount = 0
      for line in open('one-grams.txt'):
         (word, count) = line[:-1].split('\t')
         self[word] = int(count)
         self.gramCount += self[word]

   def __call__(self, word):
      if word in self:
         return float(self[word]) / self.gramCount
      else:
         return 1.0 / self.gramCount

In the init function, we take each line in the “one-grams.txt” file, which has the form discussed above, and extract the word and count from it. We then store this data into the “self” object (which, remember, is also a dictionary), and then tally up all of the counts in the variable “gramCount”. In the call function, we simply return the number of times the word was counted, divided by the total “gramCount”. If a word has not been seen before, we assume it has a count of 1, and return the appropriate probability. We also note that this decision will come back to bite us soon, and we will need to enhance our model to give a better guess on unknown probabilities.
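
As a quick interactive sanity check (assuming the “one-grams.txt” file is in the working directory; the variable name here is just for illustration):

>>> dist = OneGramDist()
>>> dist('the') > dist('qzxwv')   # a common word versus an unseen string
True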

To plug this in to our segmentation code from earlier, we instantiate the distribution and implement “wordSeqFitness()” as follows:

singleWordProb = OneGramDist()
def wordSeqFitness(words):
   return functools.reduce(lambda x,y: x+y,
     (math.log10(singleWordProb(w)) for w in words))

First, this requires us to import both the “math” and “functools” libraries. The “reduce” function is the same as the “fold” function (see our primer on Racket and functional programming), and it combines a list according to a given function. Here the function is an anonymous function, and the syntax is

lambda arg1, arg2, ..., argN: expression

Here our lambda is simply adding two numbers. The list we give simply computes the probabilities of each word, and takes their logarithms. This actually does compute what we want to compute, but in logarithmic coordinates. In other words, we use the simple observation that $ \log(ab) = \log(a) + \log(b)$. The reason for this is that each single word probability is on the order of $ 10^{-10}$, and so taking a product of around forty words will give a probability of near $ 10^{-400}$, which is smaller than the smallest floating point number Python allows. Python would report a probability of zero for such a product. In order to remedy this, we use this logarithm trick, and we leave it as an exercise to the reader to compute how ridiculously small the numbers we can represent in this coordinate system are.
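
To see the underflow problem concretely, here is a quick interactive check comparing the direct product of forty word-sized probabilities with the same computation in logarithmic coordinates:

>>> p = 1e-10                # a typical single-word probability
>>> p ** 40                  # the direct product underflows to zero
0.0
>>> 40 * math.log10(p)       # but the logarithm of the product is easily representable
-400.0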

Drawbacks, and Improvements

Now let us see how this model fares with a few easy inputs.

$ python -i segment.py
>>> segment("hellothere")
['hello', 'there']

So far so good.

>>> segment("himynameisjeremy")
['himynameisjeremy']

What gives?! Let’s fiddle with this a bit:

>>> segment("hi")
['hi']
>>> segment("himy")
['hi', 'my']
>>> segment("himyname")
['hi', 'my', 'name']
>>> segment("himynameis")
['himynameis']
>>> wordSeqFitness(['himynameis'])
-11.76946906492656
>>> wordSeqFitness(['hi', 'my', 'name', 'is'])
-11.88328049583244

There we have it. Even though we recognize that the four words above are common, their combined probability is lower than the probability of a single long word! In fact, this is likely to happen for any long string. To fix this, we need to incorporate information about the frequency of words based on their length. In particular, assume we have an unknown word $ w$. It is unlikely to be a word if it’s longer than, say, ten characters long. There are well-known distributions describing the frequency of actual words by their length, which roughly fit the curve:

$ f = aL^bc^L; a = 0.16, b = 2.33, c = 0.5$

Unfortunately this doesn’t help us. We want to know among all strings of a given length what proportion of them are words. With that, we can better model the probability of an unseen word actually being a word.

Luckily, we aren’t helpless nontechnical lambs. We are programmers! And we have a huge dictionary of words from Google’s n-gram corpus sitting right here! A quick python script (after loading in the dictionary using our existing code above) gives us the necessary numbers:

>>> for i in range(1,15):
...    words = [x for x in singleWordProb if len(x) == i]
...    print(len(words)/(26.0 ** i))
...
1.0
1.0
0.738336367774
0.0681436224222
0.00336097435179
0.000158748771704
6.23981380309e-06
2.13339205291e-07
6.52858936946e-09
1.84887277585e-10
4.6420710809e-12
1.11224101901e-13
2.54478475236e-15
5.68594239398e-17

Looking at the exponents, we see it’s roughly exponential with a base of $ 1/10$ for each length after 2, and we verify this by looking at a log plot in Mathematica:

The linearity of the picture tells us quite immediately that it’s exponential. And so we update the unknown word probability estimation as follows:

   def __call__(self, word):
      if word in self:
         return float(self[word]) / self.gramCount
      else:
         # penalize unseen strings exponentially in their length
         return 1.0 / (self.gramCount * 10**(len(word) - 2))

And indeed, our model now fares much better:

>>> segment("homebuiltairplanes")
['homebuilt', 'airplanes']
>>> segment("bbcamerica")
['bbc', 'america']
>>> segment("nevermind")
['nevermind']
>>> segment("icanhascheezburger")
['i', 'can', 'has', 'cheez', 'burger']
>>> segment("lmaorofllolwtfpwned")
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
>>> segment("wheninthecourseofhumanevents")
['when', 'in', 'the', 'course', 'of', 'human', 'events']
>>> segment("themostmercifulthingintheworldithinkistheinabilityofthehumanmindtocorrelateallitscontentsweliveonaplacidislandofignoranceinthemidstofblackseasofinfinityanditwasnotmeantthatweshouldvoyagefar")
['the', 'most', 'merciful', 'thing', 'in', 'the', 'world', 'i', 'think', 'is', 'the', 'inability', 'of', 'the', 'human', 'mind', 'to', 'correlate', 'all', 'its', 'contents', 'we', 'live', 'on', 'a', 'placid', 'island', 'of', 'ignorance', 'in', 'the', 'midst', 'of', 'black', 'seas', 'of', 'infinity', 'and', 'it', 'was', 'not', 'meant', 'that', 'we', 'should', 'voyage', 'far']
>>> s = 'Diffusepanbronchiolitisdpbisaninflammatorylungdiseaseofunknowncauseitisasevereprogressiveformofbronchiolitisaninflammatoryconditionofthebronchiolessmallairpassagesinthelungsthetermdiffusesignifiesthat'
>>> segment(s)
['diffuse', 'pan', 'bronchiolitis', 'dpb', 'is', 'an', 'inflammatory', 'lung', 'disease', 'of', 'unknown', 'cause', 'it', 'is', 'a', 'severe', 'progressive', 'form', 'of', 'bronchiolitis', 'an', 'inflammatory', 'condition', 'of', 'the', 'bronchioles', 'small', 'air', 'passages', 'in', 'the', 'lungs', 'the', 'term', 'diffuse', 'signifies', 'that']

Brilliant! The last bit is from the wikipedia page on diffuse panbronchiolitis, and it only mistakes the “pan” part of the admittedly obscure technical term. However, we are much more impressed by how our model embraces internet slang :). We can further verify intended behavior by deliberately misspelling a long word to see how the model fares:

>>> segment("antidisestablishmentarianism")
['antidisestablishmentarianism']
>>> segment("antidisestablishmentarianasm")
['anti', 'disestablishment', 'ariana', 'sm']

That second segmentation is certainly more likely than the original misspelled word (well, if we assume that no words are misspelled before segmentation).

Perhaps the most beautiful thing here is that this entire program is independent of the data used. If we wanted to instead write a program to segment, say, German words, all we’d need is a data file with the German counts (which Google also provides with its book corpus, along with Russian, Chinese, French, Spanish, and Hebrew). So we’ve written a much more useful program than we originally intended. Now compare this to the idea of a program which hard codes rules for a particular language, which was common practice until the 1980’s. Of course, it’s obvious now how ugly that method is, but apparently it’s still sometimes used to augment statistical methods for language-specific tasks.

Of course, we still have one flaw: the data is sloppy. For instance, the segmentation of “helloworld” is just “helloworld.” It turns out that this token appears on the internet in various forms, and commonly enough to outweigh the product of “hello” and “world” alone. Unfortunately, we can’t fix this by fiddling with the data we already have. Instead, we would need to extend the model to look at frequency counts of sequences of words (here, sequences of length 2). Google provides sequence counts of up to length 5, but they quickly grow far too large to fit in memory. One possible solution, which we postpone for a future post, would be to set up a database containing all 1-gram and 2-gram data, and therein bypass the need to store a big dictionary in memory. Indeed, then we could avoid truncating the 1-gram data as we did in this post.

The Next Steps

Next time, we will look at another application of our truncated corpus: cryptanalysis. In particular, we will make an effort to break substitution ciphers via a local search method. In the not so near future, we also plan to investigate some other aspects of information representation. One idea which seems promising is to model a document as a point in some high dimensional space (perhaps each dimension corresponds to a count of a particular 1-gram), and then use our familiar friends from geometry to compare document similarity, filter through information, and determine the topic of a document via clustering. The idea is called the vector space model for information retrieval. In addition, we can use the corpus along with our friendly Levenshtein metric to implement a spell-checker. Finally, we could try searching for phonetically similar words using a Levenshteinish metric on letter $ n$-grams (perhaps $ n$ is between 2 and 4).

Until next time!

P.S., in this model, “love snails” is chosen as the best segmentation over “loves nails.” Oh the interpretations…

Metrics on Words

We are about to begin a series where we analyze large corpora of English words. In particular, we will use a probabilistic analysis of Google’s ngrams to solve various tasks such as spelling correction, word segmentation, on-line typing prediction, and decoding substitution ciphers. This will hopefully take us on a wonderful journey through elementary probability, dynamic programming algorithms, and optimization.

As usual, the code implemented in this post is available from this blog’s Github page, and we encourage the reader to use the code to implement our suggested exercises. But before we get there, we should investigate some properties of our domain: the set of all finite strings of letters.

Words, Words, Words.

If we consider a fixed alphabet $ \Sigma$ (any set, really, but for our purposes a finite one), we may consider the set $ \Sigma^*$ of all finite strings of elements from $ \Sigma$, also called words. For example, given $ \Sigma = \left \{ u,v,w \right \}$, we have the word $ u \cdot w \cdot w \cdot v \cdot u \in \Sigma^*$. Also, we allow the empty word $ \varepsilon$ to be a string of length zero. The formal name for the “star” operation is the Kleene star. Most of our work here will be done over the English alphabet of letters $ \Sigma = \left \{ a, b, c, \dots , z \right \}$.

As usual, we are looking for some sort of underlying structure. Here the structure is that two words can be concatenated to make a larger string. In the parlance of abstract algebra, $ \Sigma^*$ is a monoid with respect to the concatenation operation. If we denote the operation by (pretending it is) multiplication, we write $ u \cdot v = uv$, and the monoid structure just means two things. First, the $ \cdot$ operation is associative, so that any three words $ r,s,t$ satisfy $ (r \cdot s) \cdot t = r \cdot (s \cdot t)$. Second, it has an identity element (here the empty word), so that $ \varepsilon w = w \varepsilon$ for all words $ w$. For computer scientists, these are just natural properties of functions like C’s strcat(), but in mathematics they define the structure of the space of all words. To be completely clear, these two properties (a set with an associative binary operation and an identity element) define a monoid.

We make a few abuses of notation here. In every monoid the operation is a pretend multiplication, so in general we will call it multiplication. We will also write strings (abusively, “products”) as $ a^4b^2$, which would formally be $ a \cdot a \cdot a \cdot a \cdot b \cdot b$. We lose nothing by the abuses, but gain brevity.

The Kleene starred monoid $ \Sigma^*$ has an additional property; it is the free monoid generated by $ \Sigma$. This won’t mean anything to the reader who isn’t familiar with universal properties, but it essentially tells us that any word $ w \in \Sigma^*$ is uniquely written as a product of letters in $ \Sigma$.

Now, the structure we’ve described here is not particularly rich. In fact, free objects are the algebraic objects which are usually “completely understood.” For our purposes the language of abstract algebra is just a mature setting for our discussion, but the concepts we introduce will give an extra perspective on the topic. In other words, as we don’t plan to use any monoids more complicated than the English alphabet free monoid described above, we have no interesting (general) theorems to apply to our activities.

Before we turn to a more concrete setting, we have one more definition. A monoid homomorphism between two monoids $ M_1, M_2$ is a function $ f : M_1 \to M_2$ which respects the multiplication operations and preserves the identity element. Rigorously, we have that all words $ u,v \in M_1$ satisfy $ f(uv) = f(u)f(v)$, where the operation on the left side of the equality (before the application of $ f$) is multiplication in $ M_1$, and the one on the right hand side is multiplication in $ M_2$.

One easy example of a monoid homomorphism from our English alphabet free monoid is the length homomorphism. Rigorously, the set of natural numbers $ \mathbb{N}$ is a monoid under addition, and the function $ \textup{length} : \Sigma^* \to \mathbb{N}$ which assigns to each word its length is a homomorphism of monoids. This is intuitively clear: the length of a concatenation of two words is the sum of the lengths of the pieces.
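
A one-line check in Python makes this concrete, with strings standing in for elements of $ \Sigma^*$ and + for concatenation:

>>> u, v = "home", "built"
>>> len(u + v) == len(u) + len(v)
True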

A more complex example which shows up in functional programming has to do with lists. Let $ X, Y$ be two classes of objects of some fixed types. Then we may consider $ X^*$ as the set of all lists of objects in $ X$. This is again a free monoid over $ X$ with the operation of list appending and the empty list as the identity. Note that $ X$ sits inside $ X^*$ in a natural way: each element of $ X$ can be considered a list of length one. With this understanding, we may be sloppy in calling the “product” of $ x,y \in X$ the list $ xy \in X^*$ (note, $ X$ itself need not have any operations).

Now for any fixed operation $ g : X \to Y$, we may form the map homomorphism $ \mu_g: X^* \to Y^*$ inductively as follows:

$ \mu_g(\varepsilon) = \varepsilon$
$ \mu_g(x_1 \dots x_n) = g(x_1) \mu_g(x_2 \dots x_n)$

This is precisely the map operation defined in our primer on functional programming. We encourage the reader to investigate how to phrase the other two functions (filter and fold) as monoid homomorphisms, or prove it cannot be done (thanks to Matt for pointing out this author’s mistake with regards to that).
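For concreteness, here is a small sketch of the homomorphism property of map in Python, using lists (the function mu and the sample data are ours):

# mu(g, -) is a monoid homomorphism from lists over X to lists over Y:
# mapping over a concatenation is the concatenation of the maps.
def mu(g, xs):
   return [g(x) for x in xs]

def square(n):
   return n * n

xs, ys = [1, 2, 3], [4, 5]
assert mu(square, xs + ys) == mu(square, xs) + mu(square, ys)
assert mu(square, []) == []   # the identity (empty list) maps to the identity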

Metrics, and String Comparisons

Since our goal is to do things like spelling correction, we are quite interested in strings of letters which are not actually words. We want to be able to tell someone that the word “beleive” is probably a misspelling of the word “believe.” So let us fix our alphabet $ \Sigma = \left \{ a, b, c, \dots , z \right \}$ and consider the free monoid $ \Sigma^*$. As we have noted, this is the set of all words one could type with the lowercase English alphabet, so it includes all of our egregious typos. It is a simplistic model, since we ignore punctuation, capitalization, and stylistic marks that convey meaning. But it is as good a place as any to start our investigation.

To mathematically describe what it means for a misspelled word to be “almost” the intended word, we need to bring in the concept of a metric. In other words, we want to view our set $ \Sigma^*$ as a metric space in which we can measure the distance between any two words (when viewed as a metric space, we will call them points). Then the set of all valid English words $ E \subset \Sigma^*$ is a subspace. To correct a misspelled word $ w \in \Sigma^*$, we can simply use the closest point in $ E$ with respect to the metric.

Of course, the hard part is describing the right metric. But before we get there, we must say what a metric is in general, so we know what properties to aim for when constructing a metric on words.

Definition: A metric $ d : X \times X \to \mathbb{R}$ is a function on a set $ X$ which has the following three properties for all $ x,y,z \in X$

  • $ d(x,y) \geq 0$, and $ d(x,y) = 0$ if and only if $ x = y$.
  • $ d(x,y) = d(y,x)$
  • $ d(x,y) + d(y,z) \geq d(x,z)$ (the triangle inequality)

A space $ X$ equipped with a fixed metric $ d$ is said to be a metric space.

There are plenty of interesting examples of metrics, and we refer the interested reader to Wikipedia, or to any introductory topology text (or the end of a real analysis text). We will focus on the Levenshtein metric.

If we think for a minute we can come up with a list of ways that people make typing mistakes. Sometimes we omit letters (as in diferent), sometimes we add too many letters (e.g., committment), and sometimes we substitute one letter for another (missussippi could be a phonetic error, or a slip of the finger on a qwerty keyboard). Furthermore, we can traverse from one word to another by a sequence of such operations (at worst, delete all letters and then insert the right letters). So it would make sense to take the distance between two words to be the smallest number of such transformations required to turn one word into another.

More rigorously, let $ u = u_1 \dots u_k$ be the unique way to write $ u$ as a product of letters, and let $ v = v_1 \dots v_j$ be the same for $ v$. An elementary edit of $ u$ is one of the following:

  • a deletion: the transformation $ u_1 \dots u_i \dots u_k \to u_1 \dots \widehat{u_i} \dots u_k$ for some $ 1 \leq i \leq k$, where the hat denotes omission of the letter in the $ i$-th spot.
  • an insertion: the transformation $ u_1 \dots u_k \to u_1 \dots u_i x u_{i+1} \dots u_k$ for some $ 0 \leq i \leq k$ (with $ i = 0$ placing $ x$ at the front) and some letter $ x \in \Sigma$.
  • a substitution: the transformation $ u_1 \dots u_i \dots u_k \to u_1 \dots u_{i-1}xu_{i+1} \dots u_k$ for some $ 1 \leq i \leq k$ and some letter $ x \in \Sigma$.

Then an edit from $ u$ to $ v$ is a sequence of elementary edits which begins with $ u= u_1 \dots u_k$ and ends in $ v= v_1 \dots v_j$. The length of an edit is the number of elementary edits in the sequence. Finally, we define the edit distance between $ u$ and $ v$, denoted $ d(u,v)$, as the length of the shortest edit from $ u$ to $ v$.

To verify this is a metric, we note that all edits have non-negative length, and the only edit of length zero is the edit which does nothing, so if $ d(x,y) = 0$ it follows that $ x = y$. Second, we note that edits are inherently symmetric: given an edit from $ x$ to $ y$, we may reverse the sequence, turning each deletion into an insertion, each insertion into a deletion, and each substitution into the reverse substitution, to obtain a valid edit from $ y$ to $ x$ of the same length. Clearly, the property of being the shortest edit is not altered by reversal.

Last, we must verify the triangle inequality. Let $ x,y,z$ be words; we want to show $ d(x,z) \leq d(x,y) + d(y,z)$. Take two shortest edits between $ x,y$ and $ y,z$, and note that their composition is a valid edit from $ x$ to $ z$. Following our definition, by “compose” we mean combine the two sequences of operations into one sequence in the obvious way. Since this is an edit, its length can be no smaller than the shortest edit from $ x$ to $ z$, proving the claim.

So $ d$ is in fact a metric, and historically it is called Levenshtein’s metric.

A Topological Aside

Before we get to implementing this metric, we have a few observations to make. First, we note that the shortest edit between two words is far from unique. In particular, the needed substitutions, insertions, and deletions often commute (i.e. the order of operations is irrelevant). Furthermore, instead of simply counting the number of operations required, we could assign each operation a cost, and call the total cost of an edit the sum of the costs of each elementary edit. This yields a large class of different metrics, and one could conceivably think of new operations (combinations of elementary operations) to assign lower costs. Indeed, we will do just that soon enough.

Second, and more interestingly, this metric provides quite a bit of structure on our space. It is a well known fact that every metric induces a topology. In other words, there is a topology generated by the open balls $ \left \{ x : d(x,y) < r \right \}$ for all possible radii $ r \in \mathbb{R}$ and all centers $ y$. We can also characterize the topology from another viewpoint: consider the infinite graph $ G$ where each vertex is a word in $ \Sigma^*$ and two words have a connecting edge if there exists an elementary edit between them. Then edit distance in $ \Sigma^*$ is just the length of a shortest path in $ G$, and so the spaces are isometric, and hence homeomorphic (they have identical topologies). Indeed, this is often generalized to the word metric on a group, which is beyond the scope of this post (indeed, we haven’t gotten anywhere close to group theory yet on this blog!).

For those of us unfamiliar with topology or graph theory, we can still imagine geometric notions that get to the intuitive heart of what “induced topology” means for words. For example, we can describe a circle of radius $ r$ centered at a word $ w$ quite easily: it is just the set of all words whose edit distance from $ w$ is exactly $ r$. As a concrete example, the circle of radius 1 centered at the word $ a$ is

$ \left \{ \varepsilon, b, c, \dots , z, aa, ab, ac, \dots , az, ba, ca, \dots , za \right \}$

In fact, any geometric construction that can be phrased entirely in terms of distance has an interpretation in this setting. We encourage the reader to think of more.
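For the reader who wants to poke at these circles, here is a brute-force sketch (ours, and by no means efficient) which enumerates the circle of radius 1 around a word by applying every elementary edit once:

import string

def circleOfRadiusOne(word):
   # Apply each deletion, substitution, and insertion once, and collect the results.
   neighbors = set()
   for i in range(len(word)):
      neighbors.add(word[:i] + word[i+1:])           # deletions
      for x in string.ascii_lowercase:
         neighbors.add(word[:i] + x + word[i+1:])    # substitutions
   for i in range(len(word) + 1):
      for x in string.ascii_lowercase:
         neighbors.add(word[:i] + x + word[i:])      # insertions
   neighbors.discard(word)   # substituting a letter for itself is not an edit of length 1
   return neighbors

Calling circleOfRadiusOne("a") recovers exactly the set displayed above.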

Python Implementation, and a Peek at Dynamic Programming

Of course, what use are these theoretical concerns to us if we can’t use them to write a spell-checker? To actually implement the damn thing, we need a nontrivial algorithm. So now let’s turn to Python.

Our first observation is that we don’t actually care what the edits are; we just care how many there are. Since the edits only operate on single characters, we can define the behavior recursively. Specifically, suppose we have two words $ u = u_1 \dots u_k$ and $ v = v_1 \dots v_j$. If $ u_k = v_j$, we can leave the last characters the same and inductively work with the remaining letters. If not, we find the shortest edit between $ u_1 \dots u_{k-1}$ and $ v_1 \dots v_{j}$, as if our last operation were a deletion of $ u_k$. Similarly, we can inductively find the shortest distance between $ u_1 \dots u_k$ and $ v_1 \dots v_{j-1}$, as if our last move were an insertion of $ v_j$ to the end of $ u$. Finally, we could find the shortest distance between $ u_1 \dots u_{k-1}$ and $ v_1 \dots v_{j-1}$, as if our last move were a substitution of $ u_k$ for $ v_j$. For the base case, if any word is empty, then the only possible edit is inserting/deleting all the letters in the other word.

Here is precisely that algorithm, written in Python:

def dist(word1, word2):
   # Base case: against the empty word, the only edit is to insert
   # or delete every letter of the other word.
   if not word1 or not word2:
      return max(len(word1), len(word2))
   # Matching last letters cost nothing; recurse on the prefixes.
   elif word1[-1] == word2[-1]:
      return dist(word1[:-1], word2[:-1])
   else:
      # Otherwise take the cheapest of a deletion, an insertion, and a
      # substitution, each contributing one elementary edit.
      return 1 + min(dist(word1[:-1], word2),
                     dist(word1, word2[:-1]),
                     dist(word1[:-1], word2[:-1]))

Here the [:-1] syntax indicates a slice of the first $ n-1$ characters of an $ n$ character string. Note again that as we don’t actually care what the operations are, we can simply assume we’re doing the correct transformation, and just add 1 to our recursive calls. For a proof of correctness, we refer the reader to Wikipedia (Sorry! It’s just a ton of case-checking). We also note that recursion in Python can be extremely slow for large inputs. There is of course a method of building up a cost matrix from scratch which would perform better, but we feel this code is more legible, and leave the performance tuning as an exercise to the reader. For more information on dynamic programming, see this blog’s primer on the subject.

The cautious programmer will note the above algorithm is terribly wasteful! For instance, suppose we’re investigating the distance between $ foo$ and $ bar$. Through our recursive calls, we’ll first investigate the distance between $ fo$ and $ bar$, during which we recursively investigate $ fo$ versus $ ba$. Once that’s finished, we go ahead and investigate the other branch, $ foo$ versus $ ba$, during which we look at $ fo$ versus $ ba$ once more, even though we already computed it in the first branch! What’s worse is that we have a third branch that computes $ fo$ versus $ ba$ again! Doing a bit of algorithm analysis, we realize that this algorithm is $ O(3^{\min(n,m)})$, where $ m, n$ are the lengths of the two compared words. Unacceptable!

To fix this, we need to keep track of prior computations. The technical term is memoized recursion, and essentially we want to save old computations in a lookup table for later reference. In mostly-Python:

cache = {}
def memoizedFunction(args):
   if args not in cache:
      cache[args] = doTheComputation(args)
   return cache[args]

To actually implement this, it turns out we don’t need to change the above code at all. Instead, we will use a decorator to modify the function as we wish. Here’s the code, which is essentially an extra layer of indirection applied to the above pseudocode.

def memoize(f):
   cache = {}

   def memoizedFunction(*args):
      if args not in cache:
         cache[args] = f(*args)
      return cache[args]

   memoizedFunction.cache = cache
   return memoizedFunction

Here the function memoize() will accept our distance function, and return a new function which encapsulates the memoization behavior. To use it, we simply write

def f(x):
   ...

equivalentButMemoizedFunction = memoize(f)

But luckily, Python provides a nice piece of syntactic sugar (the decorator syntax) so that we need not write this boilerplate for every function we wish to memoize. Instead, we may simply write

@memoize
def f(x):
   ...

And Python will rebind the name f to the memoized version, so that every subsequent call to f goes through the cache. Convenient! For further discussion, see our post on this technique in the program gallery.
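As a brief aside (not used anywhere in this post’s code), recent versions of Python ship a comparable decorator in the standard library:

from functools import lru_cache   # available in Python 3.2 and later

@lru_cache(maxsize=None)   # an unbounded cache, much like our memoize
def f(x):
   ...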

Applying this to our Levenshtein metric, we see an impressive speedup, and a quick analysis shows the algorithm takes $ O(nm)$, where $ n, m$ are the lengths of the two words being compared. Indeed, we are comparing (at worst) all possible prefixes of the two words, and for each of the $ n$ prefixes of one word, we compute a distance to all $ m$ prefixes of the other word. The memoization prevents us from doing any computation twice.

To this author, this approach is the most natural implementation, but there are other approaches worth investigating. In particular, Python limits the recursion depth (to a thousand calls by default). If we try to compare, say, two DNA sequences, this algorithm will quickly overflow the stack. There are a number of ways to fix this, the most appropriate of which would be tail call optimization (in this author’s humble opinion). Unfortunately, we’d need to tweak the algorithm a bit to put the recursive call in tail position, Python does not support tail call optimization anyway, and manually rewriting things in continuation-passing style is annoying, obfuscating, and quite ugly. If we decide in the future to do DNA sequence analysis, we will return to this problem.
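For reference, here is a sketch of the cost-matrix method mentioned earlier (a standard bottom-up construction, not code from this post’s repository). It fills a table whose $ (i,j)$ entry is the distance between the first $ i$ letters of one word and the first $ j$ letters of the other, and it involves no recursion at all, so the recursion depth limit is a non-issue.

def distMatrix(word1, word2):
   # table[i][j] holds the edit distance between word1[:i] and word2[:j]
   n, m = len(word1), len(word2)
   table = [[0] * (m + 1) for _ in range(n + 1)]

   for i in range(n + 1):
      table[i][0] = i      # delete all i letters
   for j in range(m + 1):
      table[0][j] = j      # insert all j letters

   for i in range(1, n + 1):
      for j in range(1, m + 1):
         substitution = 0 if word1[i-1] == word2[j-1] else 1
         table[i][j] = min(table[i-1][j] + 1,                  # deletion
                           table[i][j-1] + 1,                  # insertion
                           table[i-1][j-1] + substitution)     # substitution or match
   return table[n][m]

It should agree with dist on every pair of inputs, and it runs in the same $ O(nm)$ time with $ O(nm)$ space (which can be trimmed to two rows if memory matters).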

In the future, we plan to provide another Python primer, with a focus on dynamic algorithms. Other methods for solving this problem will arise there. Indeed, I’m teaching an introductory Python programming course next semester, so this will be a good refresher.

Transpositions, and Other Enhancements

One other significant kind of typo is a transposition. Oftentimes we type the correct letters in a word, but jumble the order of two adjacent letters. In a spell checker, we want the word $ thier$ to be closer to the word $ their$ than it is to the word $ cheer$, but with the Levenshtein metric the two pairs have equal distance (two substitutions each). We can enhance the metric by making transpositions have a cost of 1. Historically, this extended metric is called the Damerau-Levenshtein metric. Indeed, Damerau himself gave evidence that transpositions, along with the other three elementary edits, account for over 85% of human typing errors. Then again, that was back in the sixties, and typing has changed in many ways since then (not the least of which is a change in a typist’s vocabulary).

Adding transpositions to the algorithm above seems straightforward, but there are some nontrivial details to consider. For instance, we may first transpose two letters and then insert a new letter between them, as in the transformation from $ ta$ to $ act$. If we are not careful, we might prohibit such legal transformations in our algorithm. Here is an implementation, which again uses the memoization decorator.

@memoize
def dist2(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   elif word1[-1] == word2[-1]:
      return dist2(word1[:-1], word2[:-1])
   else:
      minDist = 1 + min(dist2(word1[:-1], word2),
                        dist2(word1, word2[:-1]),
                        dist2(word1[:-1], word2[:-1]))
      # transpositions
      if len(word1) > 1 and len(word2) > 1:
         if word1[-2] == word2[-1]:
            transposedWord1 = word1[:-2] + word1[-1] + word1[-2]
            minDist = min(minDist, dist2(transposedWord1[:-1], word2))
         if word2[-2] == word1[-1]:
            transposedWord2 = word2[:-2] + word2[-1] + word2[-2]
            minDist = min(minDist, dist2(word1, transposedWord2[:-1]))
   return minDist

Indeed, we must inspect both possible transpositions, and the symmetry of the example above shows the need for both branches. The proof that this extended metric is still a metric and the proof of algorithmic correctness are nearly identical to the plain Levenshtein metric.

So that was fun. Here are some other ideas we leave as exercises to the reader. First, if we allow ourselves to fix a keyboard layout (for many languages with Latin-based alphabets, the standard is qwerty with minor substitutions), we could factor that into our analysis of letter substitutions and incorrect insertions. For instance, the word $ ribies$ is just as close to $ rabies$ as it is to $ rubies$, but it is less likely the user meant to type the first word, since $ u$ is physically closer to $ i$ than $ a$ is. To implement this, we can modify the above algorithm to accept a look-up table of physical distances (approximations) between keys. Instead of adding 1 in the relevant branches, we can add a cost according to the look-up table. At the coarsest level, we could construct a graph with vertices representing letters, edges representing physical adjacencies, and use the shortest graph path in place of physical key distance.
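As a hint toward the first exercise, here is one possible sketch. The table keyDistances below is entirely hypothetical (both its name and its made-up values are ours), and the only change to the algorithm is the cost added in the substitution branch.

# A hypothetical look-up table of approximate physical distances between keys;
# only a couple of entries are shown, and the numbers are invented for illustration.
keyDistances = {('u', 'i'): 1.0, ('a', 'i'): 7.0}

def substitutionCost(x, y):
   if x == y:
      return 0
   # Fall back to a flat cost of 1 for pairs missing from the table.
   return keyDistances.get((x, y), keyDistances.get((y, x), 1))

@memoize
def weightedDist(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   return min(1 + weightedDist(word1[:-1], word2),        # deletion
              1 + weightedDist(word1, word2[:-1]),        # insertion
              substitutionCost(word1[-1], word2[-1])
                 + weightedDist(word1[:-1], word2[:-1]))  # substitution (or a free match)

With these invented numbers, weightedDist("ribies", "rubies") comes out smaller than weightedDist("ribies", "rabies"), matching the intuition above.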

We also note (from our mature vantage point) that this algorithm is not restricted to strings, but can be performed on any free monoid. This includes the example we mentioned earlier of lists. So we could generalize the algorithm to operate on any piece of data which has such an identity element and binary operation, and satisfies the freedom condition. My knowledge of Python is still somewhat limited, but the method for achieving this generalization goes by many names in many languages: in Java it’s interfaces, in C++ it’s templating, in Haskell it’s a typeclass. In Python, there is a fancy thing called duck-typing, and we leave this for our next Python primer.
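In fact, duck typing already does some of this work for us: the recursive dist above only relies on slicing, indexing, len, and equality, so it computes edit distance between tuples of arbitrary items (hashable items, if we also want to memoize it) with no changes at all. A quick illustration with made-up data:

# dist never assumes its arguments are strings, only that they support
# slicing, indexing, len(), and ==, which tuples do.
recipe1 = ('flour', 'sugar', 'eggs', 'butter')
recipe2 = ('flour', 'eggs', 'butter')

print(dist(recipe1, recipe2))   # 1: a single deletion of 'sugar'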

Next time, we’ll crack open some data files with actual English dictionaries in them, and see what we can do about solving interesting problems with them. Until then!