
We are about to begin a series where we analyze large corpora of English words. In particular, we will use a probabilistic analysis of Google’s ngrams to solve various tasks such as spelling correction, word segmentation, on-line typing prediction, and decoding substitution ciphers. This will hopefully take us on a wonderful journey through elementary probability, dynamic programming algorithms, and optimization.
As usual, the code implemented in this post is available from this blog’s Github page, and we encourage the reader to use the code to implement our suggested exercises. But before we get there, we should investigate some properties of our domain: the set of all finite strings of letters.
Words, Words, Words.
If we consider a fixed alphabet $\Sigma$ (any finite set of symbols, though for us it will be the lowercase English letters), then we may form the set $\Sigma^*$ of all finite strings of letters from $\Sigma$, including the empty string, which we denote by $\varepsilon$.
As usual, we are looking for some sort of underlying structure. Here the structure is that two words can be concatenated to make a larger string. In the parlance of abstract algebra, $\Sigma^*$ together with the operation of concatenation forms a monoid: concatenation is associative, and the empty string $\varepsilon$ serves as the identity element.
We make a few abuses of notation here. In every monoid the operation plays the role of a pretend multiplication, so in general we will call it multiplication. We will also write strings (abusively, “products”) as juxtapositions of letters, so a word $w$ of length $n$ is written $w = a_1 a_2 \cdots a_n$ with each $a_i \in \Sigma$.
The Kleene starred monoid $\Sigma^*$ is called the free monoid generated by $\Sigma$, because its elements satisfy no relations beyond those forced by associativity and the identity.
Now, the structure we’ve described here is not particularly rich. In fact, free objects are the algebraic objects which are usually “completely understood.” For our purposes the language of abstract algebra is just a mature setting for our discussion, but the concepts we introduce will give an extra perspective on the topic. In other words, as we don’t plan to use any monoids more complicated than the English alphabet free monoid described above, we have no interesting (general) theorems to apply to our activities.
Before we turn to a more concrete setting, we have one more definition. A monoid homomorphism between two monoids $M$ and $N$ is a function $f : M \to N$ which respects the two multiplications, in the sense that $f(xy) = f(x)f(y)$ for all $x, y \in M$, and which sends the identity of $M$ to the identity of $N$.
One easy example of a monoid homomorphism from our English alphabet free monoid is the length homomorphism. Rigorously, the set of natural numbers $\mathbb{N}$ is a monoid under addition (with identity 0), and the function $\ell : \Sigma^* \to \mathbb{N}$ sending a word to its number of letters is a homomorphism: the length of a concatenation is the sum of the lengths, and the empty string has length zero.
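As a quick sanity check in Python (just an illustration with made-up example strings, not code we will reuse), the homomorphism property of length is easy to observe:

# The length homomorphism: len respects concatenation and sends
# the empty string (the monoid identity) to 0.
u, v = "foo", "bar"
assert len(u + v) == len(u) + len(v)
assert len("") == 0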
A more complex example which shows up in functional programming has to do with lists. Let $A$ be any fixed set. Then the collection of all finite lists of elements of $A$ is a free monoid under list concatenation, with the empty list as the identity.
Now for any fixed function $g : A \to B$, there is an induced monoid homomorphism from lists over $A$ to lists over $B$ which applies $g$ to each entry of a list; it clearly respects concatenation and sends the empty list to the empty list.
This is precisely the map operation defined in our primer on functional programming. We encourage the reader to investigate how to phrase the other two functions (filter and fold) as monoid homomorphisms, or prove it cannot be done (thanks to Matt for pointing out this author’s mistake with regards to that).
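To make the claim about map concrete, here is a small Python sketch (the function g and the example lists are made up for illustration): mapping over a concatenation of two lists agrees with concatenating the mapped lists.

# map(g, -) as a monoid homomorphism on lists: it respects
# concatenation and sends the identity (the empty list) to itself.
g = lambda x: x * x
xs, ys = [1, 2, 3], [4, 5]
assert list(map(g, xs + ys)) == list(map(g, xs)) + list(map(g, ys))
assert list(map(g, [])) == []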
Metrics, and String Comparisons
Since our goal is to do things like spelling correction, we are quite interested in strings of letters which are not actually words. We want to be able to tell someone that the word “beleive” is probably a misspelling of the word “believe.” So let us fix our alphabet $\Sigma$ to be the lowercase English letters, and work in the free monoid $\Sigma^*$ of all finite strings of letters, words and non-words alike.
To mathematically describe what it means for a misspelled word to be “almost” the intended word, we need to bring in the concept of a metric. In other words, we want to view our set $\Sigma^*$ as a metric space, in which we can measure the distance between any two words.
Of course, the hard part is describing the right metric. But before we get there, we must say what a metric is in general, so we know what properties to aim for in constructing a metric on words.
Definition: A metric $d$ on a set $X$ is a function $d : X \times X \to \mathbb{R}$ such that, for all $x, y, z \in X$: $d(x, y) = d(y, x)$; $d(x, y) \geq 0$, and $d(x, y) = 0$ if and only if $x = y$; and $d(x, y) + d(y, z) \geq d(x, z)$ (the triangle inequality).

A space $X$ equipped with such a function $d$ is called a metric space.
There are plenty of interesting examples of metrics, and we refer the interested reader to Wikipedia, or to any introductory topology text (or the end of a real analysis text). We will focus on the Levenshtein metric.
If we think for a minute we can come up with a list of ways that people make typing mistakes. Sometimes we omit letters (as in diferent), sometimes we add too many letters (e.g., committment), and sometimes we substitute one letter for another (missussippi could be a phonetic error, or a slip of the finger on a qwerty keyboard). Furthermore, we can traverse from one word to another by a sequence of such operations (at worst, delete all letters and then insert the right letters). So it would make sense to take the distance between two words to be the smallest number of such transformations required to turn one word into another.
More rigorously, let $u = u_1 u_2 \cdots u_k$ be a word, with each $u_i \in \Sigma$. We single out three kinds of elementary transformations of $u$:

- a deletion: the transformation $u_1 \cdots u_i \cdots u_k \mapsto u_1 \cdots \widehat{u_i} \cdots u_k$ for some $1 \leq i \leq k$, where the hat denotes omission of the letter in the $i$-th spot.
- an insertion: the transformation $u_1 \cdots u_k \mapsto u_1 \cdots u_i \, x \, u_{i+1} \cdots u_k$ for some $0 \leq i \leq k$ and some letter $x \in \Sigma$.
- a substitution: the transformation $u_1 \cdots u_i \cdots u_k \mapsto u_1 \cdots u_{i-1} \, x \, u_{i+1} \cdots u_k$ for some $1 \leq i \leq k$ and some letter $x \in \Sigma$.
Then an edit from $u$ to $v$ is a finite sequence of elementary transformations which, applied in order, turn $u$ into $v$, and the length of an edit is the number of elementary transformations it uses. Finally, we define the distance $d(u, v)$ between two words $u, v$ to be the length of the shortest edit from $u$ to $v$.
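For a concrete example, the word “kitten” can be turned into “sitting” by a substitution (k becomes s), another substitution (e becomes i), and an insertion (a g at the end). This is an edit of length three, and one can check that no shorter edit exists, so the distance between these two words is 3.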
To verify this is a metric, we note that all edits have non-negative length, and the only edit of length zero is the edit which does nothing, so $d(u, v) \geq 0$ with equality if and only if $u = v$. Symmetry holds as well, because every elementary transformation can be undone by another elementary transformation (an insertion undoes a deletion and vice versa, and a substitution undoes a substitution); reversing a shortest edit from $u$ to $v$ gives an edit from $v$ to $u$ of the same length.
Last, we must verify the triangle inequality. Let $u, v, w$ be three words; we claim $d(u, w) \leq d(u, v) + d(v, w)$. Take a shortest edit from $u$ to $v$ followed by a shortest edit from $v$ to $w$; the composite is an edit from $u$ to $w$ of length $d(u, v) + d(v, w)$. So the shortest edit from $u$ to $w$ can be no longer than this, which is exactly the triangle inequality, and hence $d$ is a metric.
A Topological Aside
Before we get to implementing this metric, we have a few observations to make. First, we note that the shortest edit between two words is far from unique. In particular, the needed substitutions, insertions, and deletions often commute (i.e. the order of operations is irrelevant). Furthermore, instead of simply counting the number of operations required, we could assign each operation a cost, and call the total cost of an edit the sum of the costs of each elementary edit. This yields a large class of different metrics, and one could conceivably think of new operations (combinations of elementary operations) to assign lower costs. Indeed, we will do just that soon enough.
Second, and more interestingly, this metric provides quite a bit of structure on our space. It is a well known fact that every metric induces a topology. In other words, there is a topology generated by the open balls $B(w, r) = \{ v \in \Sigma^* : d(w, v) < r \}$, as $w$ ranges over all words and $r$ over all positive radii. Equivalently, since our distances only take whole-number values, we can think of the words as the vertices of a graph, with an edge between two words whenever they differ by a single elementary edit; the distance between two words is then the length of a shortest path in this graph.
For those of us unfamiliar with topology or graph theory, we can still imagine geometric notions that get to the intuitive heart of what “induced topology” means for words. For example, we can describe a circle of radius $r$ centered at a word $w$: it is simply the set of all words whose distance from $w$ is exactly $r$.
In fact, any geometric construction that can be phrased entirely in terms of distance has an interpretation in this setting. We encourage the reader to think of more.
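As a small computational illustration of this geometry, here is a Python sketch (the function name circle_of_radius_one is just for illustration) that enumerates the circle of radius 1 around a word: every string obtained from it by exactly one deletion, insertion, or substitution.

# All strings at distance exactly 1 from `word`: one deletion,
# one insertion, or one (non-trivial) substitution.
def circle_of_radius_one(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions     = [a + b[1:] for a, b in splits if b]
    insertions    = [a + c + b for a, b in splits for c in alphabet]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    return set(deletions + insertions + substitutions) - {word}

For example, circle_of_radius_one("cat") contains "at", "cot", and "cart", among many others, and every element of it is exactly one elementary edit away from "cat".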
Python Implementation, and a Peek at Dynamic Programming
Of course, what use are these theoretical concerns to us if we can’t use them to write a spell-checker? To actually implement the damn thing, we need a nontrivial algorithm. So now let’s turn to Python.
Our first observation is that we don’t actually care what the edits are, we just care about the number of edits. Since the edits only operate on single characters, we can define the behavior recursively. Specifically, suppose we have two words $u$ and $v$ and we compare their last characters. If the last characters agree, then $d(u, v)$ is just the distance between the two words with those last characters removed. If they disagree, then the final elementary transformation in a shortest edit must either delete the last character of $u$, insert the last character of $v$, or substitute one for the other, so $d(u, v)$ is one more than the minimum of the distances for the three corresponding smaller pairs. The base case is that the distance between a word and the empty string is the length of the word.
Here is precisely that algorithm, written in Python:
def dist(word1, word2):
    if not word1 or not word2:
        return max(len(word1), len(word2))
    elif word1[-1] == word2[-1]:
        return dist(word1[:-1], word2[:-1])
    else:
        return 1 + min(dist(word1[:-1], word2),
                       dist(word1, word2[:-1]),
                       dist(word1[:-1], word2[:-1]))
Here the [:-1] syntax indicates a slice consisting of all but the last character of the string, i.e. the first $n - 1$ characters of an $n$-character word.
The cautious programmer will note the above algorithm is terribly wasteful! For instance, suppose we’re investigating the distance between two long words: each call can spawn three further recursive calls, so the running time grows exponentially with the lengths of the words, and nearly all of that work goes toward recomputing the distance between the very same pairs of prefixes over and over again.
To fix this, we need to keep track of prior computations. The technical term is memoized recursion, and essentially we want to save old computations in a lookup table for later reference. In mostly-Python:
cache = {}
def memoizedFunction(args):
    if args not in cache:
        cache[args] = doTheComputation(args)
    return cache[args]
To actually implement this, it turns out we don’t need to change the above code at all. Instead, we will use a decorator to modify the function as we wish. Here’s the code, which is essentially an extra layer of indirection applied to the above pseudocode.
def memoize(f):
    cache = {}
    def memoizedFunction(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    memoizedFunction.cache = cache
    return memoizedFunction
Here the function memoize() will accept our distance function, and return a new function which encapsulates the memo behavior. To use it, we simply write
def f(x):
    ...

equivalentButMemoizedFunction = memoize(f)
But luckily, Python gives us some nice syntactic sugar, the decorator, so we can avoid writing this boilerplate for every function we wish to memoize. Instead, we may simply write
@memoize
def f(x):
    ...
And Python will replace f with the memoized version of itself, so that every call to f goes through the cache. Convenient! For further discussion, see our post on this technique in the program gallery.
Applying this to our Levenshtein metric, we see an impressive speedup: a quick analysis shows the memoized algorithm makes only $O(nm)$ recursive calls, where $n$ and $m$ are the lengths of the two words, since the distance between each pair of prefixes is computed at most once.
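Concretely, a minimal sketch of hooking the two together (the example words here are made up; one can either put the @memoize decorator on top of the definition of dist above, or rebind the name after the fact, which has the same effect):

# Rebinding the name is equivalent to applying the @memoize decorator:
# the recursive calls inside dist now go through the cache as well.
dist = memoize(dist)
print(dist("programming", "programmer"))  # 3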
To this author, this approach is the most natural implementation, but there are other approaches worth investigating. In particular, Python limits the recursion depth (to about a thousand frames by default). If we try to compare, say, two DNA sequences, this algorithm will quickly overflow. There are a number of ways to fix this, the most appropriate of which would be tail call optimization (in this author’s humble opinion). Unfortunately, we’d need to tweak the algorithm a bit to put the recursive call in tail position, Python does not support tail call optimization, and manually putting things in continuation-passing style is annoying, obfuscating, and quite ugly. If we decide in the future to do DNA sequence analysis, we will return to this problem.
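One of those ways is to build the table of prefix distances from the bottom up, with no recursion at all. Here is a quick sketch (this is the standard iterative dynamic programming formulation; the name distIterative is just for illustration):

def distIterative(word1, word2):
    m, n = len(word1), len(word2)
    # table[i][j] holds the distance between the first i letters of
    # word1 and the first j letters of word2.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = 1 + min(table[i - 1][j],      # deletion
                                      table[i][j - 1],      # insertion
                                      table[i - 1][j - 1])  # substitution
    return table[m][n]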
In the future, we plan to provide another Python primer, with a focus on dynamic algorithms. Other methods for solving this problem will arise there. Indeed, I’m teaching an introductory Python programming course next semester, so this will be a good refresher.
Transpositions, and Other Enhancements
One other significant kind of typo is a transposition. Oftentimes we type the correct letters in a word, but jumble the order of two adjacent letters. In a spell checker, we want the word “beleive” to register as being very close to “believe”: the plain Levenshtein metric puts them at distance 2 (two substitutions), even though morally only one mistake was made, a single transposition of adjacent letters.
Adding transpositions to the algorithm above seems straightforward, but there are some nontrivial details to consider. For instance, we may first transpose two letters and then insert a new letter between them, and we have to decide how such a combination of operations interacts with the rest of the recursion. With that caveat noted, here is one way to extend the code:
@memoize
def dist2(word1, word2):
    if not word1 or not word2:
        return max(len(word1), len(word2))
    elif word1[-1] == word2[-1]:
        return dist2(word1[:-1], word2[:-1])
    else:
        minDist = 1 + min(dist2(word1[:-1], word2),
                          dist2(word1, word2[:-1]),
                          dist2(word1[:-1], word2[:-1]))

        # transpositions
        if len(word1) > 1 and len(word2) > 1:
            if word1[-2] == word2[-1]:
                transposedWord1 = word1[:-2] + word1[-1] + word1[-2]
                minDist = min(minDist, dist2(transposedWord1[:-1], word2))
            if word2[-2] == word1[-1]:
                transposedWord2 = word2[:-2] + word2[-1] + word2[-2]
                minDist = min(minDist, dist2(word1, transposedWord2[:-1]))

        return minDist
Indeed, we must inspect both possible transpositions, since the jumbled pair may sit at the end of either word, and by symmetry both branches are needed. The proof that this extended metric is still a metric and the proof of algorithmic correctness are nearly identical to the plain Levenshtein case.
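For example (a quick check, assuming the memoized dist and dist2 above are both in scope), the extended metric recognizes “beleive” as a single slip:

print(dist("beleive", "believe"))    # 2: two substitutions
print(dist2("beleive", "believe"))   # 1: a single transposition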
So that was fun. Here are some other ideas we leave as exercises to the reader. First, if we allow ourselves to fix a keyboard layout (for many languages with Latin-based alphabets, the standard is qwerty with minor substitutions), we could factor that in to our analysis of letter substitutions and incorrect insertions. For instance, substituting a letter for one that sits next to it on the keyboard (an r typed in place of an e, say) is a far more plausible slip of the finger than substituting a letter from the other side of the keyboard, so the former could be assigned a smaller cost.
We also note (from our mature vantage point) that this algorithm is not restricted to strings, but can be performed on any free monoid. This includes the example we mentioned earlier of lists. So we could generalize the algorithm to operate on any piece of data which has such an identity element and binary operation, and satisfies the freedom condition. My knowledge of Python is still somewhat limited, but the method for achieving this generalization goes by many names in many languages: in Java it’s interfaces, in C++ it’s templating, in Haskell it’s a typeclass. In Python, there is a fancy thing called duck-typing, and we leave this for our next Python primer.
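In fact, duck-typing already does some of this work for us: the recursive dist above only uses len, indexing, slicing, comparison, and max, so it happily accepts other sequences, such as tuples of words (a tiny sketch with made-up example data):

# The same edit distance, computed on tuples of words rather than
# strings of letters; only one word-level substitution is needed.
print(dist(("the", "quick", "brown", "fox"),
           ("the", "quack", "brown", "fox")))  # 1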
Next time, we’ll crack open some data files with actual English dictionaries in them, and see what we can do about solving interesting problems with them. Until then!