Dynamic Time Warping for Sequence Comparison

Problem: Write a program that compares two sequences of differing lengths for similarity.

Solution: (In Python)

import math

def dynamicTimeWarp(seqA, seqB, d = lambda x,y: abs(x-y)):
    # create the cost matrix
    numRows, numCols = len(seqA), len(seqB)
    cost = [[0 for _ in range(numCols)] for _ in range(numRows)]

    # initialize the first row and column
    cost[0][0] = d(seqA[0], seqB[0])
    for i in xrange(1, numRows):
        cost[i][0] = cost[i-1][0] + d(seqA[i], seqB[0])

    for j in xrange(1, numCols):
        cost[0][j] = cost[0][j-1] + d(seqA[0], seqB[j])

    # fill in the rest of the matrix
    for i in xrange(1, numRows):
        for j in xrange(1, numCols):
            choices = cost[i-1][j], cost[i][j-1], cost[i-1][j-1]
            cost[i][j] = min(choices) + d(seqA[i], seqB[j])

    for row in cost:
       for entry in row:
          print "%03d" % entry,
       print ""
    return cost[-1][-1]

DiscussionComparing sequences of numbers can be tricky business. The simplest way to do so is simply component-wise. However, this will often disregard more abstract features of a sequence that we intuitively understand as “shape.”

For example, consider the following two sequences.

0 0 0 3 6 13 25 22 7 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 5 12 24 23 8 3 1 0 0 0 0 0

They both have the same characteristics — a bump of height roughly 25 and length 8 — but comparing the two sequences entrywise would imply they are not similar. According to the standard Euclidean norm, they are 52 units apart. For motivation, according to the dynamic time warping function above, they are a mere 7 units apart. Indeed, if the two bumps consisted of the same numbers, the dynamic time warp distance between the entire sequences would be zero.

These kinds of sequences show up in many applications. For instance, in speech recognition software one often has many samples of the a single person speaking, but there is a difference in the instant during the sample at which the person begins speaking. Similarly, the rate at which the person speaks may be slightly different. In either event, we want the computed “distance” between two such samples to be small. That is, we’re looking for a measurement which is time-insensitive both in scaling and in position. Historically, the dynamic time warping solution was designed in 1978 to solve this problem.

The idea is similar to the Levenshtein metric we implemented for “edit distance” measurements in our Metrics on Words post. Instead of inserting or deleting sequence elements, we may optionally “pause” our usual elementwise comparison on one sequence while continuing on the other. The trick is in deciding whether to “pause,” and when to switch the “pausing” from one sequence to the other; this will require a dynamic program (as did the Levenshtein metric).

In order to compare two sequences, one needs to have some notion of a local comparison. That is, we need to be able to compare any entry from one sequence to any entry in the other. While there are many such options that depend on the data type used in the sequence, we will use the following simple numeric metric:

$ \displaystyle d(x,y) = |x-y|$

This is just the Euclidean metric in $ \mathbb{R}$. More generally, this assumes that two numbers are similar when they are close together (and this is in fact an important assumption; not all number systems are like this).

Now given two sequences $ a_i, b_j$, we can compare them by comparing their local distance for a specially chosen set of indices given by $ m_k$ for $ a_i$ and $ n_k$ for $ b_j$. That is, the dynamic time warping distance will end up being the quantity:

$ \displaystyle C(a_i, b_j) = \sum_{k=0}^{M} d(a_{m_k}, b_{n_k})$

Of course, we should constrain the indices $ m_k, n_k$ so that the result is reasonable. A good way to do that is to describe the conditions we want it to satisfy, and then figure out how to compute such indices. In particular, let us assume that $ a_i$ has length $ M$, $ b_j$ has length $ N$. Then we have the following definition.

Definition: A warping path for $ a_i, b_j$ is a pair of sequences $ (m_k, n_k)$, both of some length $ L$, satisfying the following conditions:

  1. $ 1 \leq m_k \leq M$ and $ 1 \leq n_k \leq N$ for all $ k$.
  2. The sequences have endpoints $ (m_1, n_1) = (1,1), (m_L, n_L) = (M, N)$
  3. The seqences $ m_k, n_k$ are monotonically increasing.
  4. $ (m_k – m_{k-1}, n_k – n_{k-1})$ must be one of $ (1,0), (0,1), (1,1)$.

The first condition is obvious, or else the $ m_k, n_k$ could not be indexing $ a_i, b_j$. The second condition ensures that we use all of the information in both sequences in our comparison. The third condition implies that we cannot “step backward” in time as we compare local sequence entries. We wonder offhand if anything interesting could be salvaged from the mess that would ensue if one left this condition out. And the fourth condition allows the index of one sequence to be stopped while the other continues. This condition creates the “time warping” effect, that some parts of the sequence can be squeezed or stretched in comparison with the other sequence.

We also note as an aside that the fourth condition listed implies the third.

Of course there are many valid warping paths for any two sequences, but we need to pick one which has our desired feature. That is, it minimizes the sum of the local comparisons (the formula for $ C(a_i, b_j)$ above). We denote this optimal value as $ DTW(a_i, b_j)$.

The fourth condition in the list above should give it away that to compute the optimal path requires a dynamic program. Specifically, the optimal path can be computed by solving the three sub-problems of finding the optimal warping path for

$ \displaystyle (a_{1 \dots M-1}, b_{1 \dots N}), (a_{1 \dots M}, b_{1 \dots N-1}), \textup{ and } (a_{1 \dots M-1}, b_{1 \dots N-1})$

The clueless reader should refer to this blog’s primer on Python and dynamic programming. In any event, the program implementing this dynamic program is given in the solution above.

A visualization of the dynamic time warp cost matrix for two sequences. The algorithm attempts to find the least expensive path from the bottom left to the top right, where the darker regions correspond to low local cost, and the lighter regions to high local cost. The arrows point in the forward direction along each sequence, showing the monotonicity requirement of an optimal warping path.

The applications of this technique certainly go beyond speech recognition. Dynamic time warping can essentially be used to compare any data which can be represented as one-dimensional sequences. This includes video, graphics, financial data, and plenty of others.

We may also play around with which metric is used in the algorithm. When the elements of the lists are themselves points in Euclidean space, you can swap out the standard Euclidean metric with metrics like the Manhattan metric, the maximum metric, the discrete metric, or your mother’s favorite $ L^p$ norm.

While we use a metric for elementwise comparisons in the algorithm above, the reader must note that the dynamic time warping distance is not a metricIn fact, it’s quite far from a metric. The casual reader can easily come up with an example of two non-identical sequences $ x, y$ for which $ DTW(x,y) = 0$, hence denying positive-definiteness. The more intrepid reader will come up with three sequences which give a counterexample to satisfy the triangle inequality (hint: using the discrete metric as the local comparison metric makes things easier).

In fact, the failure to satisfy positive-definiteness and the triangle inequality means that the dynamic time warping “distance” is not even a semimetric. To be completely pedantic, it would fit in as a symmetric premetric. Unfortunately, this means that we don’t get the benefits of geometry or topology to analyze dynamic time warping as a characteristic feature of the space of all numeric sequences. In any event, it’s still proven to be useful in applications, so it belongs here in the program gallery.

A Spoonful of Python (and Dynamic Programming)

This primer is a third look at Python, and is admittedly selective in which features we investigate (for instance, we don’t use classes, as in our second primer on random psychedelic images). We do assume some familiarity with the syntax and basic concepts of the language. For a first primer on Python, see A Dash of Python. We’ll investigate some of Python’s useful built-in types, including lists, tuples, and dictionaries, and we use them to computing Fibonacci numbers and “optimal” coin change.

Fibonacci Numbers (the Wrong Way)

There’s an interesting disconnect between the mathematical descriptions of things and a useful programmatic implementation. Take for instance, the Fibonacci numbers $ f_n$.

$ f_1 = f_2 = 1$
$ f_n = f_{n-1} + f_{n-2}$ for $ n > 2$

A direct Python implementation of this definition is essentially useless. Here it is:

def fib(n):
   if n <= 2:
      return 1
      return fib(n-1) + fib(n-2)

Recalling our first Python primer, we recognize that this is a very different kind of “for” loop. Indeed, this is a recursive loop, which achieves the looping idea by having a function call itself with different arguments. For more examples of this (albeit, in a different language), see our primer A Taste of Racket.

In one mindset that will do us good later in the post, we are simplifying the problem of computing large Fibonacci numbers by breaking it up into a number of sub-problems. The sub-problems require us to compute smaller and smaller Fibonacci numbers, and we build up the sub-problems in some way to arrive at a final solution. This is the essence of dynamic programming, and we’ll see a more complicated example shortly.

Unfortunately, nobody in their right mind would use this piece of code. Trying to compute even $ f_{100}$, this algorithm seems to never terminate! Some might throw their hands in the air and claim recursion is totally useless. Indeed, before we get to a working recursive solution (which this author considers more elegant), we will take the indignant view and look for a solution that is totally recursionless.


Python has two built-in types which are list-like. The first is called a tuple, and the second is called a list. They function almost exactly the same way, but the first is immutable (that means you can’t change it after it’s created). We covered plain ‘ol lists in our first Python primer, but here are a few syntactic reminders, and show off some new list methods.

myList = ['a', 2, "gamma", ["wtf another list!", 5]]
myList[-1] = "replacing the list by a string"
secondList = [2*i for i in range(10)]

As in the second line, we can use negative indices to access the elements in reverse order. In the third, we can access slices of the list using the colon notation. And in the last line, we have the generally awesome and useful list comprehensions. This follows more or less closely with the usual notation for constructing lists (of course, using Pythonic syntax). However, one must be careful, because each “for” in a list comprehension literally corresponds to an actual for loop. In other words, the following two bits of code are equivalent:

myList = [x*y*z for x in range(100) for y in range(100)
          for z in range(100)]
myList = []
for x in range(100):
   for y in range(100):
     for z in range(100):

On the other hand, tuples have almost precisely the same syntax, except they are constructed with parentheses, and tuple comprehensions have an extra bit of syntax. Here are the same examples as above, except with tuples.

myTuple = ('a', 2, "gamma", ("wtf another tuple!", 5))
myTuple[-1] = 2 # not allowed to assign!
myTuple.append(myTuple[2:4]) # not allowed to append!
secondTuple = tuple(2*i for i in range(10))

The middle two lines are invalid. Tuples support neither assignment nor appending, and any sort of alteration of a tuple is forbidden. All of the fuss comes from the fact that when you pass a list as a function argument, the function can change the contents of the list. If you’re on a large software team and you don’t look at what the function does (or use some function whose code you can’t view), then you might not not anticipate this. Converting a list to a tuple before passing it is a safety measure.

Indeed, the conversion is in the last line but we could have first create a list, and then converted it to a tuple with “tuple()”.

With lists in our pocket, we can come up with a way to compute Fibonacci numbers quickly: we just save the old computations in a list (and do away with all that recursion nonsense).

def fib(n):
   fibValues = [0,1]
   for i in range(2,n+1):
      fibValues.append(fibValues[i-1] + fibValues[i-2])
   return fibValues[n]

One difference here is that lists index staring at zero, so we have to go up to $ n+1$. In fact, some authors set $ f_0 = 0$, so this works out just fine. In fact, we can even compute such huge Fibonacci numbers as $ f_{100,000}$, and it only takes about four seconds. Compare this to the infinitude of time it took our naive algorithm to compute $ f_{100}$ and you’ll see why some forsake recursion so quickly.

Before we go on to a more general discussion and a more interesting problem, we note that trick to compute Fibonacci numbers with a single recursive call. We leave this as an exercise to the reader (hint: instead of the argument being a single number $ n$, pass a tuple containing two useful values). Of course, this method can just as well be done with a loop instead of recursion. The difference is that there is no growing list as we go along, so we say this algorithm takes a constant amount of space (or just constant space). And, for completeness, depending on the language certain kinds of recursion do “take up space” in a sense, and Python happens to be a language that does that. For more on this, see tail call optimization.

Dynamic Programming

The feat we just accomplished in computing Fibonacci numbers quickly does generalize to more interesting problems and much harder problems. So hard, in fact, that the method has its own name: dynamic programming. It’s hard to give a precise (and concise) definition for when dynamic programming applies. This author likes to think of it as “the method you need when it’s easy to phrase a problem using multiple branches of recursion, but it ends up taking forever since you compute the same old crap way too many times.” According to Wikipedia, this means we have “overlapping sub-problems” which are just slightly smaller than the goal. In the case of Fibonacci numbers, this was that we could compute $ f_n$ from $ f_{n-1}$ and $ f_{n-2}$, but their computations in turn had overlapping sub-computations (both use the value of $ f_{n-3}$).

We avoided the problem of computing stuff multiple times by storing our necessary values in a list, and then building the final solution from the bottom up. This turns out to work for a plethora of problems, and we presently turn to one with some real world applications.

Coin Change

Most readers will have made change for a certain amount of money using a fixed number of coins, and one can ask, “what is the smallest number of coins I can use to make exact change?” The method of picking the largest coins first (and only taking smaller coins when you need to) happens to give the optimal solution in many cases (U.S. coins are one example). However, with an unusual set of coins, this is not always true. For instance, the best way to make change for thirty cents if you have quarters, dimes, and pennies, would be to use three dimes, as opposed to a quarter and five pennies.

This example shows us that the so-called “greedy” solution to this problem doesn’t work. So let’s try a dynamic algorithm. As with Fibonacci numbers, we need to compute some sub-problems and build off of them to get an optimal solution for the whole problem.

In particular, denote our coin values $ v_1, \dots, v_k$ and the amount of change we need to make $ n$. Let us further force $ v_1 = 1$, so that there is guaranteed to be a solution. Our sub-problems are to find the minimal number of coins needed to make change if we only use the first $ i$ coin values, and we make change for $ j$ cents. We will eventually store these values in a two-dimensional array, and call it $ \textup{minCoins}[i,j]$. In other words, this is a two-dimensional dynamic program, because we have sub-problems that depend on two parameters.

The base cases are easy: how many coins do I need to make change for zero cents? Zero! And how many pennies do I need to make $ j$ cents? Exactly $ j$. Now, suppose we know $ \textup{minCoins}[i-x,j]$ and $ \textup{minCoins}[i,j-y]$ for all relevant $ x, y$. To determine $ \textup{minCoins}[i,j]$, we have two choices. First, we can use a coin of value $ v_i$, and add 1 to the array entry $ \textup{minCoins}[i, j – v_i]$, which contains the best solution if we have $ j – v_i$ cents. Second, we can ignore the fact that we have coins of value $ v_i$ available, and we can use the solution $ \textup{minCoins}[i-1,j]$, which gives us the same amount of money either.

In words, the algorithm says we can suppose our last coin is the newly introduced coin, and if that doesn’t improve on our best solution yet (above, $ \textup{minCoins}[i,j-v_i]$) then we don’t use the new coin, and stick with our best solution whose last coin is not the new one (above, $ \textup{minCoins}[i-1,j]$).

Let’s implement this idea:

def coinChange(centsNeeded, coinValues):
   minCoins = [[0 for j in range(centsNeeded + 1)]
               for i in range(len(coinValues))]
   minCoins[0] = range(centsNeeded + 1)

   for i in range(1,len(coinValues)):
      for j in range(0, centsNeeded + 1):
         if j < coinValues[i]:
            minCoins[i][j] = minCoins[i-1][j]
            minCoins[i][j] = min(minCoins[i-1][j],
             1 + minCoins[i][j-coinValues[i]])

   return minCoins[-1][-1]

The first two lines of the function initialize the array with zeros, and fill the first row with the relevant penny values. The nested for loop runs down each row, updating the coin values as described above. The only difference is that we check to see if the coin value is too large to allow us to choose it (j < coinValues[i], or equivalently j – coinValues < 0). In that case, we default to not using that coin, and there is no other way to go. Finally, at the end we return the last entry in the last row of the matrix.

We encourage the reader to run this code on lots of examples, and improve upon it. For instance, we still assume that the first entry in coinValues is 1.

Memoized Recursion

Essentially all of dynamic programming boils down to remembering the solutions to sub-problems. One might ask: do we really need all of this array indexing junk to do that? Isn’t there a way that hides all of those details so we can focus on the core algorithm? Of course, the answer to that question is yes, but it requires some more knowledge about Python’s built in data types.

In the fill justify problem above, we uniquely identified a sub-problem using two integers, and indexing the rows and columns of a two-dimensional array (a list of lists). However, the fact that it was a two-dimensional array is wholly irrelevant. We could use a one-dimensional list, as long as we had some way to uniquely identify a sub-problem via a key we have access to. The official name for such a look-up table is a hash table.

In most standard computer science courses, this is the point at which we might delve off into a long discussion about hash tables (hashing functions, collision strategies, resizing methods, etc.). However for the sake of brevity we recognize that most languages, including Python, have pretty good hash tables built in to them. The details are optimized for our benefit. In Python, they are called dictionaries, and they are just as easy to use as lists.

dict = {}
dict[1] = 1
dict[2] = 1
dict[3] = 2
dict[(1,2,3,4)] = "weird"
dict["odd"] = 876

if 2 in dict:
   print("2 is a key!")
   print("its value is" + str(dict[2]))

dict2 = {1:1, 2:1, 3:2} # key:value pairs

The curly-brace notation is reserved for dictionaries (almost), and it comes with a number of useful functions. As expected, “in” tests for key membership, but there are also a number of useful functions for extracting other kinds of information. We won’t need any of them here, however. Finally, the last line shows a quick way of creating new tables where you know the values ahead of time.

We can use a dictionary to save old computations in a function call, replacing the egregious extra computations by lightning fast dictionary look-ups. For our original Fibonacci problem, we can do this as follows:

fibTable = {1:1, 2:1}
def fib(n):
   if n <= 2:
      return 1
   if n in fibTable:
      return fibTable[n]
      fibTable[n] = fib(n-1) + fib(n-2)
      return fibTable[n]

We simply check to see if the key in question has already been computed, and if so, we just look the value up in the dictionary. And now we can compute large values of $ f_n$ quickly! (Unfortunately, due to Python’s limit on recursion depth, we still can’t go as large as the non-recursive solution, but that is a different issue altogether). However we can do better in terms of removing the gunk from our algorithm. Remember, we want to completely hide the fact that we’re using dictionaries. This brings us to a very neat Python feature called decorators.


A decorator is a way to hide pre- or post-processing to a function. For example, say we had a function which returns some text:

def getText():
   return "Here is some text!"

And we want to take this function and for whatever text it might return, we want to display it on a webpage, so we need to surround it with the HTML paragraph tags <p></p>. We might even have a hundred such functions which all return text, so we want a way to get all of them at once without typing the same thing over and over again. This problem is just begging for a decorator.

A decorator accepts a function $ f$ as input, and returns a function which potentially does something else, but perhaps uses $ f$ in an interesting way. Here is the paragraph example:

def paragraphize(inputFunction):
   def newFunction():
      return "<p>" + inputFunction() + "</p>"
   return newFunction

def getText():
   return "Here is some text!"

“paragraphize” is the decorator here, and this function must accept a function as input, and return a function as output. The output function can essentially do whatever it wants, and it will replace the input function. Indeed, the macro “@paragraphize” precedes whatever function we wish to decorate, and it replaces all calls to “getText()” by calls to “newFunction(getText)”. In other words, it is shorthand for the following statement

getText = paragraphize(getText)

Moreover, the function that paragraphize outputs must accept the same number of arguments as the input function accepts. This is to be expected, and the way we signify multiple arguments in Python is with the * notation. Here is an example:

def printArgsFirst(f):
   def newFunction(*args):
      return f(*args)

   return newFunction

def randomFunction(x, y, z):
   return x + y + z

The “*args” notation above represents multiple comma-separated values, and when we refer to “args” without the star, it is simply a tuple containing the argument values. If we include the *, as we do in the call to f(*args), it unpacks the values from the tuple and inserts them into the call as if we had called f(x,y,z), as opposed to f((x,y,z)), which calls f with a single argument (a tuple).

Applying this to our memoization problem, we arrive at the following decorator.

def memoize(f):
   cache = {}

   def memoizedFunction(*args):
      if args not in cache:
         cache[args] = f(*args)
      return cache[args]

   memoizedFunction.cache = cache
   return memoizedFunction

We have a saved cache, and update it whenever we encounter a new call to f(). The second to last line simply allows us to access the cache from other parts of the code by attaching it to memoizedFunction as an attribute (what attributes are requires another discussion about objects, but essentially we’re making the dictionary a piece of data that is part of memoizedFunction).

And now we come full circle, back to our original (easy to read) implementation of Fibonacci numbers, with one tiny modification:

def fib(n):
   if n <= 2:
      return 1
      return fib(n-1) + fib(n-2)

Again, it is unfortunate that we still cannot compute ridiculously huge values of $ f_n$, but one has to admit this is the cleanest way to write the algorithm. We encourage the reader re-solve the fill justify problem using this memoized recursion (and make it easy to read!).

For the adventurous readers, here is a list of interesting problems that are solved via dynamic programming. One common feature that requires it is the justify alignment in word processors (this author has a messy Java implementation of it: 300 lines not including tests! I admit I was a boyish, reckless programmer). All dynamic programs follow the same basic pattern, so with the tools we have now one should be able to go implement them all!

Until next time!

Metrics on Words

We are about to begin a series where we analyze large corpora of English words. In particular, we will use a probabilistic analysis of Google’s ngrams to solve various tasks such as spelling correction, word segmentation, on-line typing prediction, and decoding substitution ciphers. This will hopefully take us on a wonderful journey through elementary probability, dynamic programming algorithms, and optimization.

As usual, the code implemented in this post is available from this blog’s Github page, and we encourage the reader to use the code to implement our suggested exercises. But before we get there, we should investigate some properties of our domain: the set of all finite strings of letters.

Words, Words, Words.

If we consider a fixed alphabet $ \Sigma$ (any set, really, but for our purposes a finite one), we may consider the set $ \Sigma^*$ of all finite strings of elements from $ \Sigma$, also called words. For example, given $ \Sigma = \left \{ u,v,w \right \}$, we have the word $ u \cdot w \cdot w \cdot v \cdot u \in \Sigma^*$. Also, we allow the empty word $ \varepsilon$ to be a string of length zero. The formal name for the “star” operation is the Kleene star. Most of our work here will be done over the English alphabet of letters $ \Sigma = \left \{ a, b, c, \dots , z \right \}$.

As usual, we are looking for some sort of underlying structure. Here the structure is that two words can be concatenated to make a larger string. In the parlance of abstract algebra, $ \Sigma^*$ is a monoid with respect to the concatenation operation. If we denote the operation by (pretending it is) multiplication, we write $ u \cdot v = uv$, and the monoid structure just means two things. First, the $ \cdot$ operation is associative, so that any three words $ r,s,t$ satisfy $ (r \cdot s) \cdot t = r \cdot (s \cdot t)$. Second, it has an identity element (here the empty word), so that $ \varepsilon w = w \varepsilon$ for all words $ w$. For computer scientists, these are just natural properties of functions like C’s strcat(), but in mathematics they define the structure of the space of all words. To be completely clear, these two properties (a set with an associative binary operation and an identity element) define a monoid.

We make a few abuses of notation here. In every monoid the operation is a pretend multiplication, so in general we will call it multiplication. We will also write strings (abusively, “products”) as $ a^4b^2$, which would formally be $ a \cdot a \cdot a \cdot a \cdot b \cdot b$. We lose nothing by the abuses, but gain brevity.

The Kleene starred monoid $ \Sigma^*$ has an additional property; it is the free monoid generated by $ \Sigma$. This won’t mean anything to the reader who isn’t familiar with universal properties, but it essentially tells us that any word $ w \in \Sigma^*$ is uniquely written as a product of letters in $ \Sigma$.

Now, the structure we’ve described here is not particularly rich. In fact, free objects are the algebraic objects which are usually “completely understood.” For our purposes the language of abstract algebra is just a mature setting for our discussion, but the concepts we introduce will give an extra perspective on the topic. In other words, as we don’t plan to use any monoids more complicated than the English alphabet free monoid described above, we have no interesting (general) theorems to apply to our activities.

Before we turn to a more concrete setting, we have one more definition. A monoid homomorphism between two monoids $ M_1, M_2$ is a function $ f : M_1 \to M_2$ which respects the multiplication operations and preserves the identity element. Rigorously, we have that all words $ u,v \in M_1$ satisfy $ f(uv) = f(u)f(v)$, where the operation on the left side of the equality (before the application of $ f$) is multiplication in $ M_1$, and the one on the right hand side is multiplication in $ M_2$.

One easy example of a monoid homomorphism from our English alphabet free monoid is the length homomorphism. Rigorously, the set of natural numbers $ \mathbb{N}$ is a monoid under addition, and the function $ \textup{length} : \Sigma^* \to \mathbb{N}$ which assigns to each word its length is a homomorphism of monoids. This is intuitively clear: the length of a concatenation of two words is the sum of the lengths of the pieces.

A more complex example which shows up in functional programming has to do with lists. Let $ X, Y$ be two classes of objects of some fixed types, then we may consider $ X^*$ as the set of all lists of objects in $ X$. This is again a free monoid over $ X$ with the operation of list appending and the empty list as the identity. Note that $ X$ sits inside $ X^*$ in a natural way: each element of $ X$ can be considered a list of length one. With this understanding, we may be sloppy in calling the “product” of $ x,y \in X$ the list $ xy \in X^*$ (note, $ X$ itself need not have any operations).

Now for any fixed operation $ g : X \to Y$, we may form the map homomorphism $ \mu_g: X^* \to Y^*$ inductively as follows:

$ \mu_g(\varepsilon) = \varepsilon$
$ \mu_g(x_1 \dots x_n) = g(x_1) \mu_g(x_2 \dots x_n))$

This is precisely the map operation defined in our primer on functional programming. We encourage the reader to investigate how to phrase the other two functions (filter and fold) as monoid homomorphisms, or prove it cannot be done (thanks to Matt for pointing out this author’s mistake with regards to that).

Metrics, and String Comparisons

Since our goal is to do things like spelling correction, we are quite interested in strings of letters which are not actually words. We want to be able to tell someone that the word “beleive” is probably a misspelling of the word “believe.” So let us fix our alphabet $ \Sigma = \left \{ a, b, c, \dots , z \right \}$ and consider the free monoid $ \Sigma^*$. As we have noted, this is the set of all words one could type with the lowercase English alphabet, so it includes all of our egregious typos. It is a simplistic model, since we ignore punctuation, capitalization, and stylistic marks that convey meaning. But it is as good a place as any to start our investigation.

To mathematically describe what it means for a misspelled word to be “almost” the intended word, we need to bring in the concept of a metric. In other words, we want to view our set $ \Sigma^*$ as a metric space in which we can measure the distance between any two words (when viewed as a metric space, we will call them points). Then the set of all valid English words $ E \subset \Sigma^*$ is a subspace. To correct a misspelled word $ w \in \Sigma^*$, we can simply use the closest point in $ E$ with respect to the metric.

Of course, the hard part is describing the right metric. But before we get there, we must define a metric so we know what properties to aim for in constructing a metric on words.

Definition: A metric $ d : X \times X \to \mathbb{R}$ is a function on a set $ X$ which has the following three properties for all $ x,y,z \in X$

  • $ d(x,y) \geq 0$, and $ d(x,y) = 0$ if and only if $ x = y$.
  • $ d(x,y) = d(y,x)$
  • $ d(x,y) + d(y,z) \geq d(x,z)$ (the triangle inequality)

A space $ X$ equipped with a fixed metric $ d$ is said to be a metric space.

There are plenty of interesting examples of metrics, and we refer the interested reader to Wikipedia, or to any introductory topology text (or the end of a real analysis text). We will focus on the Levenshtein metric.

If we think for a minute we can come up with a list of ways that people make typing mistakes. Sometimes we omit letters (as in diferent), sometimes we add too many letters (e.g., committment), and sometimes we substitute one letter for another (missussippi could be a phonetic error, or a slip of the finger on a qwerty keyboard). Furthermore, we can traverse from one word to another by a sequence of such operations (at worst, delete all letters and then insert the right letters). So it would make sense to take the distance between two words to be the smallest number of such transformations required to turn one word into another.

More rigorously, let $ u = u_1 \dots u_k$ be the unique way to write $ u$ as a product of letters, and let $ v = v_1 \dots v_j$ be the same for $ v$. An elementary edit of $ u$ is one of the following:

  • a deletion: the transformation $ u_1 \dots u_i \dots u_k \to u_1 \dots \widehat{u_i} \dots u_k$ for some $ 1 \leq i \leq k$, where the hat omits omission in the $ i$-th spot.
  • an insertion: the transformation $ u_1 \dots u_k \to u_1 \dots u_i x \dots u_k $ for some $ 1 \leq i \leq k$, and some letter $ x \in \Sigma$.
  • a substitution: the transformation $ u_1 \dots u_i \dots u_k \to u_1 \dots u_{i-1}xu_{i+1} \dots u_k$ for some $ 1 \leq i \leq k-1$ and some letter $ x$.

Then an edit from $ u$ to $ v$ is a sequence of elementary edits which begins with $ u= u_1 \dots u_k$ and ends in $ v= v_1 \dots v_j$. The length of an edit is the number of elementary edits in the sequence. Finally, we define the edit distance between $ u$ and $ v$, denoted $ d(u,v)$, as the length of the shortest edit from $ u$ to $ v$.

To verify this is a metric, we note that all edits have non-negative length, and the only edit of length zero is the edit which does nothing, so if $ d(x,y) = 0$ it follows that $ x = y$. Second, we note that edits are symmetric inherently, in that if we have an edit from $ x$ to $ y$, we may simply reverse the sequence and we have a valid edit from $ y$ to $ x$. Clearly, the property of being the shortest edit is not altered by reversal.

Last, we must verify the triangle inequality. Let $ x,y,z$ be words; we want to show $ d(x,z) \leq d(x,y) + d(y,z)$. Take two shortest edits between $ x,y$ and $ y,z$, and note that their composition is a valid edit from $ x$ to $ z$. Following our definition, by “compose” we mean combine the two sequences of operations into one sequence in the obvious way. Since this is an edit, its length can be no smaller than the shortest edit from $ x$ to $ z$, proving the claim.

So $ d$ is in fact a metric, and historically it is called Levenshtein’s metric.

A Topological Aside

Before we get to implementing this metric, we have a few observations to make. First, we note that the shortest edit between two words is far from unique. In particular, the needed substitutions, insertions, and deletions often commute (i.e. the order of operations is irrelevant). Furthermore, instead of simply counting the number of operations required, we could assign each operation a cost, and call the total cost of an edit the sum of the costs of each elementary edit. This yields a large class of different metrics, and one could conceivably think of new operations (combinations of elementary operations) to assign lower costs. Indeed, we will do just that soon enough.

Second, and more interestingly, this metric provides quite a bit of structure on our space. It is a well known fact that every metric induces a topology. In other words, there is a topology generated by the open balls $ \left \{ x : d(x,y) < r \right \}$ for all possible radii $ r \in \mathbb{R}$ and all centers $ y$. We can also characterize the topology from another viewpoint: consider the infinite graph $ G$ where each vertex is a word in $ \Sigma^*$ and two words have a connecting edge if there exists an elementary edit between them. Then edit distance in $ \Sigma^*$ is just the length of a shortest path in $ G$, and so the spaces are isometric, and hence homeomorphic (they have identical topologies). Indeed, this is often generalized to the word metric on a group, which is beyond the scope of this post (indeed, we haven’t gotten anywhere close to group theory yet on this blog!).

For those of us unfamiliar with topology or graph theory, we can still imagine geometric notions that get to the intuitive heart of what “induced topology” means for words. For example, we can describe a circle of radius $ r$ centered at a word $ w$ quite easily: it is just the set of all words whose edit distance from $ w$ is exactly $ r$. As a concrete example, the circle of radius 1 centered at the word $ a$ is

$ \left \{ \varepsilon, b, c, \dots , z, aa, ab, ac, \dots , az, ba, ca, \dots , za \right \}$

In fact, any geometric construction that can be phrased entirely in terms of distance has an interpretation in this setting. We encourage the reader to think of more.

Python Implementation, and a Peek at Dynamic Programming

Of course, what use are these theoretical concerns to us if we can’t use it to write a spell-checker? To actually implement the damn thing, we need a nontrivial algorithm. So now let’s turn to Python.

Our first observation is that we don’t actually care what the edits are, we just care about the number of edits. Since the edits only operate on single characters, we can define the behavior recursively. Specifically, suppose we have two words $ u = u_1 \dots u_k$ and $ v_1 \dots v_j$. If $ u_k = v_j$, we can leave the last characters the same and inductively work with the remaining letters. If not, we find the shortest edit between $ u_1 \dots u_{k-1}$ and $ v_1 \dots v_{j}$, as if our last operation were a deletion of $ u_k$. Similarly, we can inductively find the shortest distance between $ u_1 \dots u_k$ and $ v_1 \dots v_{j-1}$, as if our last move were an insertion of $ v_j$ to the end of $ u$. Finally, we could find the shortest distance between $ u_1 \dots u_{k-1}$ and $ v_1 \dots v_{j-1}$, as if our last move were a substitution of $ u_k$ for $ v_j$. For the base case, if any word is empty, then the only possible edit is inserting/deleting all the letters in the other word.

Here is precisely that algorithm, written in Python:

def dist(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   elif word1[-1] == word2[-1]:
      return dist(word1[:-1], word2[:-1])
      return 1 + min(dist(word1[:-1], word2),
                     dist(word1, word2[:-1]),
                     dist(word1[:-1], word2[:-1]))

Here the [:-1] syntax indicates a slice of the first $ n-1$ characters of an $ n$ character string. Note again that as we don’t actually care what the operations are, we can simply assume we’re doing the correct transformation, and just add 1 to our recursive calls. For a proof of correctness, we refer the reader to Wikipedia (Sorry! It’s just a ton of case-checking). We also note that recursion in Python can be extremely slow for large inputs. There is of course a method of building up a cost matrix from scratch which would perform better, but we feel this code is more legible, and leave the performance tuning as an exercise to the reader. For more information on dynamic programming, see this blog’s primer on the subject.

The cautious programmer will note the above algorithm is terribly wasteful! For instance, suppose we’re investigating the distance between $ foo$ and $ bar$. Through our recursive calls, we’ll first investigate the distance between $ fo$ and $ bar$, during which we recursively investigate $ fo$ versus $ ba$. Once that’s finished, we go ahead and investigate the other branch, $ foo$ versus $ ba$, during which we look at $ fo$ versus $ ba$ once more, even though we already computed it in the first branch! What’s worse, is that we have a third branch that computes $ fo$ versus $ ba$ again! Doing a bit of algorithm analysis, we realize that this algorithm is $ O(3^{\min(n,m)})$, where $ m, n$ are the lengths of the two compared words. Unacceptable!

To fix this, we need to keep track of prior computations. The technical term is memoized recursion, and essentially we want to save old computations in a lookup table for later reference. In mostly-Python:

cache = {}
def memoizedFunction(args):
   if args not in cache:
      cache[args] = doTheComputation(args)
   return cache[args]

To actually implement this, it turns out we don’t need to change the above code at all. Instead, we will use a decorator to modify the function as we wish. Here’s the code, which is essentially an extra layer of indirection applied to the above pseudocode.

def memoize(f):
   cache = {}

   def memoizedFunction(*args):
      if args not in cache:
         cache[args] = f(*args)
      return cache[args]

   memoizedFunction.cache = cache
   return memoizedFunction

Here the function memoize() will accept our distance function, and return a new function which encapsulates the memo behavior. To use it, we simply use

def f(x):

equivalentButMemoizedFunction = memoize(f)

But luckily, Python gives a nice preprocessor macro to avoid writing this for every function we wish to memoize. Instead, we may simply write

def f(x):

And Python will make the appropriate replacements of calls to f with the appropriate calls to the memoized function. Convenient! For further discussion, see our post on this technique in the program gallery.

Applying this to our Levenshtein metric, we see an impressive speedup, and a quick analysis shows the algorithm takes $ O(nm)$, where $ n, m$ are the lengths of the two words being compared. Indeed, we are comparing (at worst) all possible prefixes of the two words, and for each of the $ n$ prefixes of one word, we compute a distance to all $ m$ prefixes of the other word. The memoization prevents us from doing any computation twice.

To this author, this approach is the most natural implementation, but there are other approaches worth investigating. In particular, Python limits the recursion depth to a few hundred. If we try to compare, say, two DNA sequences, this algorithm will quickly overflow. There are a number of ways to fix this, the most appropriate of which would be tail call optimization (in this author’s humble opinion). Unfortunately, we’d need to tweak the algorithm a bit to put the recursive call in tail position, Python does not support tail call optimization, and manually putting things in continuation-passing style is annoying, obfuscating, and quite ugly. If we decide in the future to do DNA sequence analysis, we will return to this problem.

In the future, we plan to provide another Python primer, with a focus on dynamic algorithms. Other methods for solving this problem will arise there. Indeed, I’m teaching an introductory Python programming course next semester, so this will be a good refresher.

Transpositions, and Other Enhancements

One other significant kind of typo is a transposition. Often times we type the correct letters in a word, but jumble the order of two words. In a spell checker, we want the word $ thier$ to be closer to the word $ their$ than it is to the word $ cheer$, but with the Levenshtein metric the two pairs have equal distance (two substitutions each). We can enhance the metric by making transpositions have a cost of 1. Historically, this extended metric is called the Damerau-Levenshtein metric. Indeed, Damerau himself gave evidence that transpositions, along with the other three elementary edits, account for over 85% of human typing errors. Then again, that was back in the sixties, and typing has changed in many ways since then (not the least of which is a change in a typist’s vocabulary).

Adding transpositions to the algorithm above seems straightforward, but there are some nontrivial details to consider. For instance, we may first transpose two letters and then insert a new letter between them, as in the transformation from $ ta$ to $ act$. If we are not careful, we might prohibit such legal transformations in our algorithm. Here is an implementation, which again uses the memoization decorator.

def dist2(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   elif word1[-1] == word2[-1]:
      return dist2(word1[:-1], word2[:-1])
      minDist = 1 + min(dist2(word1[:-1], word2),
                        dist2(word1, word2[:-1]),
                        dist2(word1[:-1], word2[:-1]))
      # transpositions
      if len(word1) &gt; 1 and len(word2) &gt; 1:
         if word1[-2] == word2[-1]:
            transposedWord1 = word1[:-2] + word1[-1] + word1[-2]
            minDist = min(minDist, dist2(transposedWord1[:-1], word2))
         if word2[-2] == word1[-1]:
            transposedWord2 = word2[:-2] + word2[-1] + word2[-2]
            minDist = min(minDist, dist2(word1, transposedWord2[:-1]))
   return minDist

Indeed, we must inspect both possible transpositions, and the symmetry of the example above shows the need for both branches. The proof that this extended metric is still a metric and the proof of algorithmic correctness are nearly identical to the plain Levenshtein metric.

So that was fun. Here are some other ideas we leave as exercises to the reader. First, if we allow ourselves to fix a keyboard layout (for many languages with Latin-based alphabets, the standard is qwerty with minor substitutions), we could factor that in to our analysis of letter substitutions and incorrect insertions. For instance, the word $ ribies$ is just as close to $ rabies$ as it is to $ rubies$, but it is less likely the user meant to type the first word, since $ u$ is physically closer to $ i$ than $ a$ is. To implement this, we can modify the above algorithm to accept a look-up table of physical distances (approximations) between keys. Instead of adding 1 in the relevant branches, we can add a cost according to the look-up table. At the coarsest, we could construct a graph with vertices representing letters, edges representing physical adjacencies, and use the shortest graph path in place of physical key distance.

We also note (from our mature vantage point) that this algorithm is not restricted to strings, but can be performed on any free monoid. This includes the example we mentioned earlier of lists. So we could generalize the algorithm to operate on any piece of data which has such an identity element and binary operation, and satisfies the freedom condition. My knowledge of Python is still somewhat limited, but the method for achieving this generalization comes in many names in many languages: in Java it’s interfaces, in C++ it’s templating, in Haskell it’s a typeclass. In Python, there is a fancy thing called duck-typing, and we leave this for our next Python primer.

Next time, we’ll crack open some data files with actual English dictionaries in them, and see what we can do about solving interesting problems with them. Until then!