Metrics on Words

We are about to begin a series where we analyze large corpora of English words. In particular, we will use a probabilistic analysis of Google’s ngrams to solve various tasks such as spelling correction, word segmentation, on-line typing prediction, and decoding substitution ciphers. This will hopefully take us on a wonderful journey through elementary probability, dynamic programming algorithms, and optimization.

As usual, the code implemented in this post is available from this blog’s Github page, and we encourage the reader to use the code to implement our suggested exercises. But before we get there, we should investigate some properties of our domain: the set of all finite strings of letters.

Words, Words, Words.

If we consider a fixed alphabet \Sigma (any set, really, but for our purposes a finite one), we may consider the set \Sigma^* of all finite strings of elements from \Sigma, also called words. For example, given \Sigma = \left \{ u,v,w \right \}, we have the word u \cdot w \cdot w \cdot v \cdot u \in \Sigma^*. Also, we allow the empty word \varepsilon to be a string of length zero. The formal name for the “star” operation is the Kleene star. Most of our work here will be done over the English alphabet of letters \Sigma = \left \{ a, b, c, \dots , z \right \}.

As usual, we are looking for some sort of underlying structure. Here the structure is that two words can be concatenated to make a larger string. In the parlance of abstract algebra, \Sigma^* is a monoid with respect to the concatenation operation. If we denote the operation by (pretending it is) multiplication, we write u \cdot v = uv, and the monoid structure just means two things. First, the \cdot operation is associative, so that any three words r,s,t satisfy (r \cdot s) \cdot t = r \cdot (s \cdot t). Second, it has an identity element (here the empty word), so that \varepsilon w = w \varepsilon for all words w. For computer scientists, these are just natural properties of functions like C’s strcat(), but in mathematics they define the structure of the space of all words. To be completely clear, these two properties (a set with an associative binary operation and an identity element) define a monoid.

We make a few abuses of notation here. In every monoid the operation is a pretend multiplication, so in general we will call it multiplication. We will also write strings (abusively, “products”) as a^4b^2, which would formally be a \cdot a \cdot a \cdot a \cdot b \cdot b. We lose nothing by the abuses, but gain brevity.

The Kleene starred monoid \Sigma^* has an additional property; it is the free monoid generated by \Sigma. This won’t mean anything to the reader who isn’t familiar with universal properties, but it essentially tells us that any word w \in \Sigma^* is uniquely written as a product of letters in \Sigma.

Now, the structure we’ve described here is not particularly rich. In fact, free objects are the algebraic objects which are usually “completely understood.” For our purposes the language of abstract algebra is just a mature setting for our discussion, but the concepts we introduce will give an extra perspective on the topic. In other words, as we don’t plan to use any monoids more complicated than the English alphabet free monoid described above, we have no interesting (general) theorems to apply to our activities.

Before we turn to a more concrete setting, we have one more definition. A monoid homomorphism between two monoids M_1, M_2 is a function f : M_1 \to M_2 which respects the multiplication operations and preserves the identity element. Rigorously, we have that all words u,v \in M_1 satisfy f(uv) = f(u)f(v), where the operation on the left side of the equality (before the application of f) is multiplication in M_1, and the one on the right hand side is multiplication in M_2.

One easy example of a monoid homomorphism from our English alphabet free monoid is the length homomorphism. Rigorously, the set of natural numbers \mathbb{N} is a monoid under addition, and the function \textup{length} : \Sigma^* \to \mathbb{N} which assigns to each word its length is a homomorphism of monoids. This is intuitively clear: the length of a concatenation of two words is the sum of the lengths of the pieces.

A more complex example which shows up in functional programming has to do with lists. Let X, Y be two classes of objects of some fixed types, then we may consider X^* as the set of all lists of objects in X. This is again a free monoid over X with the operation of list appending and the empty list as the identity. Note that X sits inside X^* in a natural way: each element of X can be considered a list of length one. With this understanding, we may be sloppy in calling the “product” of x,y \in X the list xy \in X^* (note, X itself need not have any operations).

Now for any fixed operation g : X \to Y, we may form the map homomorphism \mu_g: X^* \to Y^* inductively as follows:

\mu_g(\varepsilon) = \varepsilon
\mu_g(x_1 \dots x_n) = g(x_1) \mu_g(x_2 \dots x_n))

This is precisely the map operation defined in our primer on functional programming. We encourage the reader to investigate how to phrase the other two functions (filter and fold) as monoid homomorphisms, or prove it cannot be done (thanks to Matt for pointing out this author’s mistake with regards to that).

Metrics, and String Comparisons

Since our goal is to do things like spelling correction, we are quite interested in strings of letters which are not actually words. We want to be able to tell someone that the word “beleive” is probably a misspelling of the word “believe.” So let us fix our alphabet \Sigma = \left \{ a, b, c, \dots , z \right \} and consider the free monoid \Sigma^*. As we have noted, this is the set of all words one could type with the lowercase English alphabet, so it includes all of our egregious typos. It is a simplistic model, since we ignore punctuation, capitalization, and stylistic marks that convey meaning. But it is as good a place as any to start our investigation.

To mathematically describe what it means for a misspelled word to be “almost” the intended word, we need to bring in the concept of a metric. In other words, we want to view our set \Sigma^* as a metric space in which we can measure the distance between any two words (when viewed as a metric space, we will call them points). Then the set of all valid English words E \subset \Sigma^* is a subspace. To correct a misspelled word w \in \Sigma^*, we can simply use the closest point in E with respect to the metric.

Of course, the hard part is describing the right metric. But before we get there, we must define a metric so we know what properties to aim for in constructing a metric on words.

Definition: A metric d : X \times X \to \mathbb{R} is a function on a set X which has the following three properties for all x,y,z \in X

A space X equipped with a fixed metric d is said to be a metric space.

There are plenty of interesting examples of metrics, and we refer the interested reader to Wikipedia, or to any introductory topology text (or the end of a real analysis text). We will focus on the Levenshtein metric.

If we think for a minute we can come up with a list of ways that people make typing mistakes. Sometimes we omit letters (as in diferent), sometimes we add too many letters (e.g., committment), and sometimes we substitute one letter for another (missussippi could be a phonetic error, or a slip of the finger on a qwerty keyboard). Furthermore, we can traverse from one word to another by a sequence of such operations (at worst, delete all letters and then insert the right letters). So it would make sense to take the distance between two words to be the smallest number of such transformations required to turn one word into another.

More rigorously, let u = u_1 \dots u_k be the unique way to write u as a product of letters, and let v = v_1 \dots v_j be the same for v. An elementary edit of u is one of the following:

  • a deletion: the transformation u_1 \dots u_i \dots u_k \to u_1 \dots \widehat{u_i} \dots u_k for some 1 \leq i \leq k, where the hat omits omission in the i-th spot.
  • an insertion: the transformation u_1 \dots u_k \to u_1 \dots u_i x \dots u_k for some 1 \leq i \leq k, and some letter x \in \Sigma.
  • a substitution: the transformation u_1 \dots u_i \dots u_k \to u_1 \dots u_{i-1}xu_{i+1} \dots u_k for some 1 \leq i \leq k-1 and some letter x.

Then an edit from u to v is a sequence of elementary edits which begins with u= u_1 \dots u_k and ends in v= v_1 \dots v_j. The length of an edit is the number of elementary edits in the sequence. Finally, we define the edit distance between u and v, denoted d(u,v), as the length of the shortest edit from u to v.

To verify this is a metric, we note that all edits have non-negative length, and the only edit of length zero is the edit which does nothing, so if d(x,y) = 0 it follows that x = y. Second, we note that edits are symmetric inherently, in that if we have an edit from x to y, we may simply reverse the sequence and we have a valid edit from y to x. Clearly, the property of being the shortest edit is not altered by reversal.

Last, we must verify the triangle inequality. Let x,y,z be words; we want to show d(x,z) \leq d(x,y) + d(y,z). Take two shortest edits between x,y and y,z, and note that their composition is a valid edit from x to z. Following our definition, by “compose” we mean combine the two sequences of operations into one sequence in the obvious way. Since this is an edit, its length can be no smaller than the shortest edit from x to z, proving the claim.

So d is in fact a metric, and historically it is called Levenshtein’s metric.

A Topological Aside

Before we get to implementing this metric, we have a few observations to make. First, we note that the shortest edit between two words is far from unique. In particular, the needed substitutions, insertions, and deletions often commute (i.e. the order of operations is irrelevant). Furthermore, instead of simply counting the number of operations required, we could assign each operation a cost, and call the total cost of an edit the sum of the costs of each elementary edit. This yields a large class of different metrics, and one could conceivably think of new operations (combinations of elementary operations) to assign lower costs. Indeed, we will do just that soon enough.

Second, and more interestingly, this metric provides quite a bit of structure on our space. It is a well known fact that every metric induces a topology. In other words, there is a topology generated by the open balls \left \{ x : d(x,y) < r \right \} for all possible radii r \in \mathbb{R} and all centers y. We can also characterize the topology from another viewpoint: consider the infinite graph G where each vertex is a word in \Sigma^* and two words have a connecting edge if there exists an elementary edit between them. Then edit distance in \Sigma^* is just the length of a shortest path in G, and so the spaces are isometric, and hence homeomorphic (they have identical topologies). Indeed, this is often generalized to the word metric on a group, which is beyond the scope of this post (indeed, we haven’t gotten anywhere close to group theory yet on this blog!).

For those of us unfamiliar with topology or graph theory, we can still imagine geometric notions that get to the intuitive heart of what “induced topology” means for words. For example, we can describe a circle of radius r centered at a word w quite easily: it is just the set of all words whose edit distance from w is exactly r. As a concrete example, the circle of radius 1 centered at the word a is

\left \{ \varepsilon, b, c, \dots , z, aa, ab, ac, \dots , az, ba, ca, \dots , za \right \}

In fact, any geometric construction that can be phrased entirely in terms of distance has an interpretation in this setting. We encourage the reader to think of more.

Python Implementation, and a Peek at Dynamic Programming

Of course, what use are these theoretical concerns to us if we can’t use it to write a spell-checker? To actually implement the damn thing, we need a nontrivial algorithm. So now let’s turn to Python.

Our first observation is that we don’t actually care what the edits are, we just care about the number of edits. Since the edits only operate on single characters, we can define the behavior recursively. Specifically, suppose we have two words u = u_1 \dots u_k and v_1 \dots v_j. If u_k = v_j, we can leave the last characters the same and inductively work with the remaining letters. If not, we find the shortest edit between u_1 \dots u_{k-1} and v_1 \dots v_{j}, as if our last operation were a deletion of u_k. Similarly, we can inductively find the shortest distance between u_1 \dots u_k and v_1 \dots v_{j-1}, as if our last move were an insertion of v_j to the end of u. Finally, we could find the shortest distance between u_1 \dots u_{k-1} and v_1 \dots v_{j-1}, as if our last move were a substitution of u_k for v_j. For the base case, if any word is empty, then the only possible edit is inserting/deleting all the letters in the other word.

Here is precisely that algorithm, written in Python:

def dist(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   elif word1[-1] == word2[-1]:
      return dist(word1[:-1], word2[:-1])
   else:
      return 1 + min(dist(word1[:-1], word2),
                     dist(word1, word2[:-1]),
                     dist(word1[:-1], word2[:-1]))

Here the [:-1] syntax indicates a slice of the first n-1 characters of an n character string. Note again that as we don’t actually care what the operations are, we can simply assume we’re doing the correct transformation, and just add 1 to our recursive calls. For a proof of correctness, we refer the reader to Wikipedia (Sorry! It’s just a ton of case-checking). We also note that recursion in Python can be extremely slow for large inputs. There is of course a method of building up a cost matrix from scratch which would perform better, but we feel this code is more legible, and leave the performance tuning as an exercise to the reader. For more information on dynamic programming, see this blog’s primer on the subject.

The cautious programmer will note the above algorithm is terribly wasteful! For instance, suppose we’re investigating the distance between foo and bar. Through our recursive calls, we’ll first investigate the distance between fo and bar, during which we recursively investigate fo versus ba. Once that’s finished, we go ahead and investigate the other branch, foo versus ba, during which we look at fo versus ba once more, even though we already computed it in the first branch! What’s worse, is that we have a third branch that computes fo versus ba again! Doing a bit of algorithm analysis, we realize that this algorithm is O(3^{\min(n,m)}), where m, n are the lengths of the two compared words. Unacceptable!

To fix this, we need to keep track of prior computations. The technical term is memoized recursion, and essentially we want to save old computations in a lookup table for later reference. In mostly-Python:

cache = {}
def memoizedFunction(args):
   if args not in cache:
      cache[args] = doTheComputation(args)
   return cache[args]

To actually implement this, it turns out we don’t need to change the above code at all. Instead, we will use a decorator to modify the function as we wish. Here’s the code, which is essentially an extra layer of indirection applied to the above pseudocode.

def memoize(f):
   cache = {}

   def memoizedFunction(*args):
      if args not in cache:
         cache[args] = f(*args)
      return cache[args]

   memoizedFunction.cache = cache
   return memoizedFunction

Here the function memoize() will accept our distance function, and return a new function which encapsulates the memo behavior. To use it, we simply use

def f(x):
   ...

equivalentButMemoizedFunction = memoize(f)

But luckily, Python gives a nice preprocessor macro to avoid writing this for every function we wish to memoize. Instead, we may simply write

@memoize
def f(x):
   ...

And Python will make the appropriate replacements of calls to f with the appropriate calls to the memoized function. Convenient! For further discussion, see our post on this technique in the program gallery.

Applying this to our Levenshtein metric, we see an impressive speedup, and a quick analysis shows the algorithm takes O(nm), where n, m are the lengths of the two words being compared. Indeed, we are comparing (at worst) all possible prefixes of the two words, and for each of the n prefixes of one word, we compute a distance to all m prefixes of the other word. The memoization prevents us from doing any computation twice.

To this author, this approach is the most natural implementation, but there are other approaches worth investigating. In particular, Python limits the recursion depth to a few hundred. If we try to compare, say, two DNA sequences, this algorithm will quickly overflow. There are a number of ways to fix this, the most appropriate of which would be tail call optimization (in this author’s humble opinion). Unfortunately, we’d need to tweak the algorithm a bit to put the recursive call in tail position, Python does not support tail call optimization, and manually putting things in continuation-passing style is annoying, obfuscating, and quite ugly. If we decide in the future to do DNA sequence analysis, we will return to this problem.

In the future, we plan to provide another Python primer, with a focus on dynamic algorithms. Other methods for solving this problem will arise there. Indeed, I’m teaching an introductory Python programming course next semester, so this will be a good refresher.

Transpositions, and Other Enhancements

One other significant kind of typo is a transposition. Often times we type the correct letters in a word, but jumble the order of two words. In a spell checker, we want the word thier to be closer to the word their than it is to the word cheer, but with the Levenshtein metric the two pairs have equal distance (two substitutions each). We can enhance the metric by making transpositions have a cost of 1. Historically, this extended metric is called the Damerau-Levenshtein metric. Indeed, Damerau himself gave evidence that transpositions, along with the other three elementary edits, account for over 85% of human typing errors. Then again, that was back in the sixties, and typing has changed in many ways since then (not the least of which is a change in a typist’s vocabulary).

Adding transpositions to the algorithm above seems straightforward, but there are some nontrivial details to consider. For instance, we may first transpose two letters and then insert a new letter between them, as in the transformation from ta to act. If we are not careful, we might prohibit such legal transformations in our algorithm. Here is an implementation, which again uses the memoization decorator.

@memoize
def dist2(word1, word2):
   if not word1 or not word2:
      return max(len(word1), len(word2))
   elif word1[-1] == word2[-1]:
      return dist2(word1[:-1], word2[:-1])
   else:
      minDist = 1 + min(dist2(word1[:-1], word2),
                        dist2(word1, word2[:-1]),
                        dist2(word1[:-1], word2[:-1]))
      # transpositions
      if len(word1) > 1 and len(word2) > 1:
         if word1[-2] == word2[-1]:
            transposedWord1 = word1[:-2] + word1[-1] + word1[-2]
            minDist = min(minDist, dist2(transposedWord1[:-1], word2))
         if word2[-2] == word1[-1]:
            transposedWord2 = word2[:-2] + word2[-1] + word2[-2]
            minDist = min(minDist, dist2(word1, transposedWord2[:-1]))
   return minDist

Indeed, we must inspect both possible transpositions, and the symmetry of the example above shows the need for both branches. The proof that this extended metric is still a metric and the proof of algorithmic correctness are nearly identical to the plain Levenshtein metric.

So that was fun. Here are some other ideas we leave as exercises to the reader. First, if we allow ourselves to fix a keyboard layout (for many languages with Latin-based alphabets, the standard is qwerty with minor substitutions), we could factor that in to our analysis of letter substitutions and incorrect insertions. For instance, the word ribies is just as close to rabies as it is to rubies, but it is less likely the user meant to type the first word, since u is physically closer to i than a is. To implement this, we can modify the above algorithm to accept a look-up table of physical distances (approximations) between keys. Instead of adding 1 in the relevant branches, we can add a cost according to the look-up table. At the coarsest, we could construct a graph with vertices representing letters, edges representing physical adjacencies, and use the shortest graph path in place of physical key distance.

We also note (from our mature vantage point) that this algorithm is not restricted to strings, but can be performed on any free monoid. This includes the example we mentioned earlier of lists. So we could generalize the algorithm to operate on any piece of data which has such an identity element and binary operation, and satisfies the freedom condition. My knowledge of Python is still somewhat limited, but the method for achieving this generalization comes in many names in many languages: in Java it’s interfaces, in C++ it’s templating, in Haskell it’s a typeclass. In Python, there is a fancy thing called duck-typing, and we leave this for our next Python primer.

Next time, we’ll crack open some data files with actual English dictionaries in them, and see what we can do about solving interesting problems with them. Until then!

About these ads

11 thoughts on “Metrics on Words

  1. Interesting article. I’m not sure your functions would translate correctly to other languages however. In particular languages where character combinations can be used to express a single letter. For example, in german “oe” is equivalent to “ö”. So the length must also be the same. But if you take your identity you say that length(“o”) + length(“e”) = length( “oe” ), which in this case it clearly shouldn’t.

    I’m not sure this would cause any significant problems for the overall technique. It probably just adds complications.

    • Wonderful comment! You’re very right, and actually I speak German so I’m familiar with the rules you mention (additionally, the now-considered-archaic ß replaces ss). So one way around it would be some preprocessor function which replaces all instances of “oe” with “ö”, etc. Though I’m sure there are some obscure words (or at least, strings one might wish to type, whether or not they are standard words) which use “oe” in a nontrivially different way. One could make a conceivable metric out of it one way or another, but it does throw a stick in our assumptions about what kinds of typos are possible if we start changing what the user originally typed. I wonder offhand how much of a difference that makes on real-world data, and to what extent companies compensate for it in their applications.

      • You could introduce “composition typos” — to borrow Unicode’s term. I wonder however if it might not be better to “decompose” the strings for comparison. That is, every combined character is split into its parts, so “ö” becomes “oe” instead of the other way around. Consider an example where the user types just “o” instead of “ö”. One would expect the spell checker to give a pretty close distance between those characters: if comparing against “oe” it would indeed be just 1 away. Decomposition would also make it possible that dropping/adding accents just becomes part of the cost — rather than having a large table of related accented/combined characters.

      • Actually, in german oe is not the same as ö. I mean, you CAN replace ö with oe, for example, when you can’t enter ö. Some might replace ü it with u (like in Diane Kruger) and others replace ö with oe (like Chad Kroeger). Anyways, there is only one right way to write a german word, so writing “daß” for example, is just wrong, cause its “dass”, same goes for Umlauts. So with Chad Kroeger you could actually MEAN Chad Kroeger, and not Chad Kröger. The only problem there is is with people who can not enter Umlauts, but they are corrected then, so that’s OK. You might want to think about making it “cheaper” to transfer from oe to ö or sth..

  2. Excellent! I’ve been curious about this sort of thing for a while now, so I’m eager to see how you explore this area.

    As an interesting side note, the use of a finite set for the fixed alphabet is rather well-founded, notwithstanding the scores of languages with potentially unique character sets! In Turing’s 1936 paper, for a footnote he considers a metric on the space of symbols, and asserts that this space is conditionally compact — meaning that unless we claim to be able to distinguish between two symbols s,r with d(s,r)<epsilon for any epsilon, there are really only finitely many symbols!

  3. There is a different method to deal with mistakes in ‘ words’ . Apart from mistypings, there are a lot of word mistakes caused by less optimal knowledge of the language.
    You could try and find the thesis of Martin Reynaart from the university of Tilburg.
    He used a very different approach: every letter is a value, it is raise to the fifth power, then summed. This is a very quick lookup for letter-swaps. Using precalculated common mistakes (drop a letter, change l for 1, ph for f etc) getting alternatives is quick and easy. Adding levehnstein to this helps discarding too big edits.
    A note about probability of mistakes: it is very helpful to identify the natural word frequency in suggesting the most likely alternative.

    A note about the assumption of words, why not focus on any string, including spaces and interpunction? Then sentence boundaries incorrec twordsplitups could be detected as well.

    Might also want to have a look al http://www.valkuil.net, a Dutch initiative, maybe a bit like this.

    • That is an interesting question. Is someone just as likely to type the word “lefa” as they are “fale” when they mean to type the word “leaf”? What you’re suggesting is a sort of “weak” hashing method where two words have equal hashes if and only if they contain the same letters (another example of that: the letters could be assigned distinct primes, and the product of the letter values is the hash). I suppose I do sometimes type words completely jumbled when I’m rushing… I also wonder how much we as humans should allow ourselves to expect of a spell-checker. While checking bigram word frequencies is a great method (and actually, what I’m doing in the next post in this series for a related problem), if we want to check the spelling of a large (hundred page) document in real time, we do have to worry about performance. I think you’re right that either most times a human doesn’t know how to spell a word, or it’s a mistake that the human can correct far faster than the spell-checker. But it’s very difficult to make accommodations for people who don’t understand the language, because there will always be more and more ridiculous spellings of things. For instance, how can a spell checker guess what you mean when you type “ackwire”? Even word frequency tables won’t help there (at least, not the one we plan to use from Google), because AckWire is a product/company/username that probably has a nontrivial frequency!

  4. Great article! It’s not terribly important, but I’m not sure I understand the “left-fold homomorphism”. Is it required that the operation in the monoid Y has some relationship to the function g? I can’t see how in general the map is homomorphic.

    • You’re right, and thanks for the comment. There are a number of reasons why my construction is gibberish and it’s not very constructive to go through them all. Here’s a simple reason why it doesn’t work: monoid homomorphisms must preserve the identity. Another is that Y would need to be a monoid itself, which is not true of most data types. I’ll keep this in the back of my mind and if I come up with a more appropriate construction I’ll post it here.

  5. In the second paragraph, starting with:

    As usual, this blog’s Github page, and we …

    There is a mixup with the html, so the link doesn’t work, and some of the text got caught in the garbled tag.

    I had to look at it for a few moments because it didn’t make any sense before I realized the issue,

    Anyway, awesome blog. I am absolutely loving it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s