# Hamming’s Code

## Or how to detect and correct errors

Last time we made a quick tour through the main theorems of Claude Shannon, which essentially solved the following two problems about communicating over a digital channel.

1. What is the best encoding for information when you are guaranteed that your communication channel is error free?
2. Are there any encoding schemes that can recover from random noise introduced during transmission?

The answers to these questions were purely mathematical theorems, of course. But the interesting shortcoming of Shannon’s accomplishment was that his solution for the noisy coding problem (2) was nonconstructive. The question remains: can we actually come up with efficiently computable encoding schemes? The answer is yes! Marcel Golay was the first to discover such a code in 1949 (just a year after Shannon’s landmark paper), and Golay’s construction was published on a single page! We’re not going to define Golay’s code in this post, but we will mention its interesting status in coding theory later. The next year Richard Hamming discovered another simpler and larger family of codes, and went on to do some of the major founding work in coding theory. For his efforts he won a Turing Award and played a major part in bringing about the modern digital age. So we’ll start with Hamming’s codes.

We will assume some basic linear algebra knowledge, as detailed our first linear algebra primer. We will also use some basic facts about polynomials and finite fields, though the lazy reader can just imagine everything as binary $\{ 0,1 \}$ and still grok the important stuff.

Richard Hamming, inventor of Hamming codes. [image source]

## What is a code?

The formal definition of a code is simple: a code $C$ is just a subset of $\{ 0,1 \}^n$ for some $n$. Elements of $C$ are called codewords.

This is deceptively simple, but here’s the intuition. Say we know we want to send messages of length $k$, so that our messages are in $\{ 0,1 \}^k$. Then we’re really viewing a code $C$ as the image of some encoding function $\textup{Enc}: \{ 0,1 \}^k \to \{ 0,1 \}^n$. We can define $C$ by just describing what the set is, or we can define it by describing the encoding function. Either way, we will make sure that $\textup{Enc}$ is an injective function, so that no two messages get sent to the same codeword. Then $|C| = 2^k$, and we can call $k = \log |C|$ the message length of $C$ even if we don’t have an explicit encoding function.

Moreover, while in this post we’ll always work with $\{ 0,1 \}$, the alphabet of your encoded messages could be an arbitrary set $\Sigma$. So then a code $C$ would be a subset of tuples in $\Sigma^n$, and we would call $q = |\Sigma|$.

So we have these parameters $n, k, q$, and we need one more. This is the minimum distance of a code, which we’ll denote by $d$. This is defined to be the minimum Hamming distance between all distinct pairs of codewords, where by Hamming distance I just mean the number of coordinates that two tuples differ in. Recalling the remarks we made last time about Shannon’s nonconstructive proof, when we decode an encoded message $y$ (possibly with noisy bits) we look for the (unencoded) message $x$ whose encoding $\textup{Enc}(x)$ is as close to $y$ as possible. This will only work in the worst case if all pairs of codewords are sufficiently far apart. Hence we track the minimum distance of a code.

So coding theorists turn this mess of parameters into notation.

Definition: A code $C$ is called an $(n, k, d)_q$-code if

• $C \subset \Sigma^n$ for some alphabet $\Sigma$,
• $k = \log |C|$,
• $C$ has minimum distance $d$, and
• the alphabet $\Sigma$ has size $q$.

The basic goals of coding theory are:

1. For which values of these four parameters do codes exist?
2. Fixing any three parameters, how can we optimize the other one?

In this post we’ll see how simple linear-algebraic constructions can give optima for one of these problems, optimizing $k$ for $d=3$, and we’ll state a characterization theorem for optimizing $k$ for a general $d$. Next time we’ll continue with a second construction that optimizes a different bound called the Singleton bound.

## Linear codes and the Hamming code

A code is called linear if it can be identified with a linear subspace of some finite-dimensional vector space. In this post all of our vector spaces will be $\{ 0,1 \}^n$, that is tuples of bits under addition mod 2. But you can do the same constructions with any finite scalar field $\mathbb{F}_q$ for a prime power $q$, i.e. have your vector space be $\mathbb{F}_q^n$. We’ll go back and forth between describing a binary code $q=2$ over $\{ 0,1 \}$ and a code in $\mathbb{F}_q^n$. So to say a code is linear means:

• The zero vector is a codeword.
• The sum of any two codewords is a codeword.
• Any scalar multiple of a codeword is a codeword.

Linear codes are the simplest kinds of codes, but already they give a rich variety of things to study. The benefit of linear codes is that you can describe them in a lot of different and useful ways besides just describing the encoding function. We’ll use two that we define here. The idea is simple: you can describe everything about a linear subspace by giving a basis for the space.

Definition: generator matrix of a $(n,k,d)_q$-code $C$ is a $k \times n$ matrix $G$ whose rows form a basis for $C$.

There are a lot of equivalent generator matrices for a linear code (we’ll come back to this later), but the main benefit is that having a generator matrix allows one to encode messages $x \in \{0,1 \}^k$ by left multiplication $xG$. Intuitively, we can think of the bits of $x$ as describing the coefficients of the chosen linear combination of the rows of $G$, which uniquely describes an element of the subspace. Note that because a $k$-dimensional subspace of $\{ 0,1 \}^n$ has $2^k$ elements, we’re not abusing notation by calling $k = \log |C|$ both the message length and the dimension.

For the second description of $C$, we’ll remind the reader that every linear subspace $C$ has a unique orthogonal complement $C^\perp$, which is the subspace of vectors that are orthogonal to vectors in $C$.

Definition: Let $H^T$ be a generator matrix for $C^\perp$. Then $H$ is called a parity check matrix.

Note $H$ has the basis for $C^\perp$ as columns. This means it has dimensions $n \times (n-k)$. Moreover, it has the property that $x \in C$ if and only if the left multiplication $xH = 0$. Having zero dot product with all columns of $H$ characterizes membership in $C$.

The benefit of having a parity check matrix is that you can do efficient error detection: just compute $yH$ on your received message $y$, and if it’s nonzero there was an error! What if there were so many errors, and just the right errors that $y$ coincided with a different codeword than it started? Then you’re screwed. In other words, the parity check matrix is only guarantee to detect errors if you have fewer errors than the minimum distance of your code.

So that raises an obvious question: if you give me the generator matrix of a linear code can I compute its minimum distance? It turns out that this problem is NP-hard in general. In fact, you can show that this is equivalent to finding the smallest linearly dependent set of rows of the parity check matrix, and it is easier to see why such a problem might be hard. But if you construct your codes cleverly enough you can compute their distance properties with ease.

Before we do that, one more definition and a simple proposition about linear codes. The Hamming weight of a vector $x$, denoted $wt(x)$, is the number of nonzero entries in $x$.

Proposition: The minimum distance of a linear code $C$ is the minimum Hamming weight over all nonzero vectors $x \in C$.

Proof. Consider a nonzero $x \in C$. On one hand, the zero vector is a codeword and $wt(x)$ is by definition the Hamming distance between $x$ and zero, so it is an upper bound on the minimum distance. In fact, it’s also a lower bound: if $x,y$ are two nonzero codewords, then $x-y$ is also a codeword and $wt(x-y)$ is the Hamming distance between $x$ and $y$.

$\square$

So now we can define our first code, the Hamming code. It will be a $(n, k, 3)_2$-code. The construction is quite simple. We have fixed $d=3, q=2$, and we will also fix $l = n-k$. One can think of this as fixing $n$ and maximizing $k$, but it will only work for $n$ of a special form.

We’ll construct the Hamming code by describing a parity-check matrix $H$. In fact, we’re going to see what conditions the minimum distance $d=3$ imposes on $H$, and find out those conditions are actually sufficient to get $d=3$. We’ll start with 2. If we want to ensure $d \geq 2$, then you need it to be the case that no nonzero vector of Hamming weight 1 is a code word. Indeed, if $e_i$ is a vector with all zeros except a one in position $i$, then $e_i H = h_i$ is the $i$-th row of $H$. We need $e_i H \neq 0$, so this imposes the condition that no row of $H$ can be zero. It’s easy to see that this is sufficient for $d \geq 2$.

Likewise for $d \geq 3$, given a vector $y = e_i + e_j$ for some positions $i \neq j$, then $yH = h_i + h_j$ may not be zero. But because our sums are mod 2, saying that $h_i + h_j \neq 0$ is the same as saying $h_i \neq h_j$. Again it’s an if and only if. So we have the two conditions.

• No row of $H$ may be zero.
• All rows of $H$ must be distinct.

That is, any parity check matrix with those two properties defines a distance 3 linear code. The only question that remains is how large can $n$  be if the vectors have length $n-k = l$? That’s just the number of distinct nonzero binary strings of length $l$, which is $2^l - 1$. Picking any way to arrange these strings as the rows of a matrix (say, in lexicographic order) gives you a good parity check matrix.

Theorem: For every $l > 0$, there is a $(2^l - 1, 2^l - l - 1, 3)_2$-code called the Hamming code.

Since the Hamming code has distance 3, we can always detect if at most a single error occurs. Moreover, we can correct a single error using the Hamming code. If $x \in C$ and $wt(e) = 1$ is an error bit in position $i$, then the incoming message would be $y = x + e$. Now compute $yH = xH + eH = 0 + eH = h_i$ and flip bit $i$ of $y$. That is, whichever row of $H$ you get tells you the index of the error, so you can flip the corresponding bit and correct it. If you order the rows lexicographically like we said, then $h_i = i$ as a binary number. Very slick.

Before we move on, we should note one interesting feature of linear codes.

Definition: A code is called systematic if it can be realized by an encoding function that appends some number $n-k$ “check bits” to the end of each message.

The interesting feature is that all linear codes are systematic. The reason is as follows. The generator matrix $G$ of a linear code has as rows a basis for the code as a linear subspace. We can perform Gaussian elimination on $G$ and get a new generator matrix that looks like $[I \mid A]$ where $I$ is the identity matrix of the appropriate size and $A$ is some junk. The point is that encoding using this generator matrix leaves the message unchanged, and adds a bunch of bits to the end that are determined by $A$. It’s a different encoding function on $\{ 0,1\}^k$, but it has the same image in $\{ 0,1 \}^n$, i.e. the code is unchanged. Gaussian elimination just performed a change of basis.

If you work out the parameters of the Hamming code, you’ll see that it is a systematic code which adds $\Theta(\log n)$ check bits to a message, and we’re able to correct a single error in this code. An obvious question is whether this is necessary? Could we get away with adding fewer check bits? The answer is no, and a simple “information theoretic” argument shows this. A single index out of $n$ requires $\log n$ bits to describe, and being able to correct a single error is like identifying a unique index. Without logarithmically many bits, you just don’t have enough information.

## The Hamming bound and perfect codes

One nice fact about Hamming codes is that they optimize a natural problem: the problem of maximizing $d$ given a fixed choice of $n$, $k$, and $q$. To get this let’s define $V_n(r)$ denote the volume of a ball of radius $r$ in the space $\mathbb{F}_2^n$. I.e., if you fix any string (doesn’t matter which) $x$, $V_n(r)$ is the size of the set $\{ y : d(x,y) \leq r \}$, where $d(x,y)$ is the hamming distance.

There is a theorem called the Hamming bound, which describes a limit to how much you can pack disjoint balls of radius $r$ inside $\mathbb{F}_2^n$.

Theorem: If an $(n,k,d)_2$-code exists, then

$\displaystyle 2^k V_n \left ( \left \lfloor \frac{d-1}{2} \right \rfloor \right ) \leq 2^n$

Proof. The proof is quite simple. To say a code $C$ has distance $d$ means that for every string $x \in C$ there is no other string $y$ within Hamming distance $d$ of $x$. In other words, the balls centered around both $x,y$ of radius $r = \lfloor (d-1)/2 \rfloor$ are disjoint. The extra difference of one is for odd $d$, e.g. when $d=3$ you need balls of radius 1 to guarantee no overlap. Now $|C| = 2^k$, so the total number of strings covered by all these balls is the left-hand side of the expression. But there are at most $2^n$ strings in $\mathbb{F}_2^n$, establishing the desired inequality.

$\square$

Now a code is called perfect if it actually meets the Hamming bound exactly. As you probably guessed, the Hamming codes are perfect codes. It’s not hard to prove this, and I’m leaving it as an exercise to the reader.

The obvious follow-up question is whether there are any other perfect codes. The answer is yes, some of which are nonlinear. But some of them are “trivial.” For example, when $d=1$ you can just use the identity encoding to get the code $C = \mathbb{F}_2^n$. You can also just have a code which consists of a single codeword. There are also some codes that encode by repeating the message multiple times. These are called “repetition codes,” and all three of these examples are called trivial (as a definition). Now there are some nontrivial and nonlinear perfect codes I won’t describe here, but here is the nice characterization theorem.

Theorem [van Lint ’71, Tietavainen ‘73]: Let $C$ be a nontrivial perfect $(n,d,k)_q$ code. Then the parameters must either be that of a Hamming code, or one of the two:

• A $(23, 12, 7)_2$-code
• A $(11, 6, 5)_3$-code

The last two examples are known as the binary and ternary Golay codes, respectively, which are also linear. In other words, every possible set of parameters for a perfect code can be realized as one of these three linear codes.

So this theorem was a big deal in coding theory. The Hamming and Golay codes were both discovered within a year of each other, in 1949 and 1950, but the nonexistence of other perfect linear codes was open for twenty more years. This wrapped up a very neat package.

Next time we’ll discuss the Singleton bound, which optimizes for a different quantity and is incomparable with perfect codes. We’ll define the Reed-Solomon and show they optimize this bound as well. These codes are particularly famous for being the error correcting codes used in DVDs. We’ll then discuss the algorithmic issues surrounding decoding, and more recent connections to complexity theory.

Until then!

Posts in this series:

# Information Distance — A Primer

This post assumes familiarity with our primer on Kolmogorov complexity. We recommend the uninformed reader begin there. We will do our best to keep consistent notation across both posts.

## Kolmogorov Complexity as a Metric

Over the past fifty years mathematicians have been piling up more and more theorems about Kolmogorov complexity, and for good reason. One of the main interpretations of the Kolmogorov complexity function $K$ is that for a given string $x$, $K(x)$ is the best theoretical compression of $x$ under any compression scheme. So a negative result about $K$ can provide useful bounds on how good a real-world compressor can be. It turns out that these properties also turn $K$ into a useful tool for machine learning. The idea is summarized as follows:

Let $x,y$ be binary strings, and as usual let’s fix some universal programming language $L$ in which to write all of our programs. Let $p(x,y)$ be the shortest program which computes both $y$ when given $x$ as an input, and $x$ given $y$. We would imagine that if $x,y$ are unrelated, then the length of the program $|p(x,y)|$ would be roughly $K(x) + K(y)$, simply by running the shortest program to output $x$ using no inputs, followed by the same thing for $y$. As usual there will be some additive constant term independent of both $x$ and $y$. We denote this by $c$ or $O(1)$ interchangeably.

We would further imagine that if $x,y$ are related (that is, if there is some information about $x$ contained in $y$ or vice versa), then the program $p(x,y)$ would utilize that information and hence be shorter than $K(x) + K(y)$. It turns out that there is an even better way to characterize $p$, and with a few modifications we can turn the length of $p$ into something similar to a metric on the set of all strings.

This metric has some strikingly attractive features. We will see that it is “universal” with respect to a certain class of distance functions (which is unfortunately not the class of all metrics). In particular, for any of these functions $f$, the length of $|p(x,y)|$ will be at worst a small amount larger than $f(x,y)$. In words, if $x$ and $y$ are similar according to any of these distance functions, then they will be similar according to $p$. Of course the devil is in the details, but this is the right idea to have in mind while we wade through the computations.

## An Aside on Metrics, and Properties of Kolmogorov Complexity

In recent posts on this blog we’ve covered a number of important examples of metrics and investigated how a metric creates structure in a space. But as powerful and rare as fruitful metrics are, we have barely scratched the surface of the vast amount of literature on the subject.

As usual with our computations in Kolmogorov complexity, all of our equalities will be true up to some kind of additive sloppiness. Most of the time it will be an additive constant $O(1)$ which is independent of anything else in the equation. We will usually omit the constant with that implicit understanding, and instead we will specify the times when it is an exact equality (or when the additive sloppiness is something other than a constant).

And so, unavoidably, the “metric” we define won’t be a true metric. It will only satisfy the metric properties (positive definite, symmetric, triangle inequality) up to a non-constant additive sloppiness. This will be part of the main theorem of this post.

Before we can reach the heart of the matter (and as a nice warm-up), we need to establish a few more properties of $K$. Recall that by $K(x|y)$ we mean the shortest program which computes $x$ when provided $y$ as an auxiliary input. We call this the conditional complexity of $x$ given $y$. Further, recall that $K(x,y)$ is the length of the shortest program which outputs both $x$ and $y$, and a way to distinguish between the two (if everything is in binary, the distinguishing part is nontrivial; should the reader be interested, this sort of conversation is made for comment threads). Finally, the comma notation works for auxiliary inputs as well: $K(x|y,z)$ is the length of the shortest program outputting $x$ when given $y,z$ and a way to distinguish them as input.

For example, the conditional Kolmogorov complexity $K(1^n | n) = c$ is constant: the length of the string $1^n$ provides all but a finite amount of information about it. On the other hand, if $x,y$ are random strings (their bits are generated independently and uniformly at random), then $K(y|x) = K(y)$; there is no information about $y$ contained in $x$.

Definition: Let $x$ be a (binary) string. We denote by $x^*$ the shortest program which computes $x$. That is, $K(x) = |x^*|$. If there are two shortest programs which compute $x$, then $x^*$ refers to the first in the standard enumeration of all programs.

As a quick aside, the “standard enumeration” is simple: treat a binary string as if it were a natural number written in base 2, and enumerate all strings in increasing order of their corresponding number. The choice of enumeration is irrelevant, though; all that matters is that it is consistent throughout our theory.

Proposition: Kolmogorov complexity has the following properties up to additive constants:

1. $K(x|y^*) = K(x|y,K(y))$
2. $K(x|y^*) \leq K(x|y)$, and $K(x|y) \leq K(x|y^*) + O(\log(K(y)))$
3. $K(x,y) = K(x) + K(y|x^*)$

The first item simply states that giving $y^*$ as input to a program is the same as giving $y$ and $K(y)$. This is not hard to prove. If $p$ is the shortest program computing $x$ from $y,K(y)$, then we can modify it slightly to work with $y^*$ instead. Just add to the beginning of $p$ the following instructions:

Compute K(y) as the length of the input y*
Simulate y* and record its output y

Since $y^*$ is a finite string and represents a terminating program, these two steps produce the values needed to run $p$. Moreover, the program description is constant in length, independent of $y^*$.

On the other hand, if $q$ is a program computing $x$ from $y^*$, we are tasked with finding $y^*$ given $y, K(y)$. The argument a standard but slightly more complicated technique in theoretical computer science called dovetailing. In particular, since we know the length of $y^*$, and there are only finitely many programs of the same length, we can get a list $p_1, p_2, \dots p_n$ of all programs of length $K(y)$. We then interleave the simulation of each of these programs; that is, we run the first step of all of the $p_i$, then the second, third, and so on. Once we find a program which halts and outputs $y$ (and we are guaranteed that one will do so) we can stop. In pseudocode, this is just the subroutine:

L = [all programs of length K(y) in lexicographic order]
i = 1
while True:
for program in L:
run step i of program
if program terminates and outputs y:
return program

The fact that this algorithm will terminate proves the claim.

The second item in the proposition has a similar proof, and we leave it as an exercise to the reader. (Hint: the logarithm in the second part of the statement comes from the hard-coding of a binary representation of the number $K(y)$)

The third item, that $K(x,y) = K(x) + K(y|x^*)$ has a much more difficult proof, and its consequences are far-reaching. We will use it often in our computations. The intrepid reader will see Theorem 3.9.1 in the text of Li & Vitanyi for a complete proof, but luckily one half of the proof is trivial. That is, the proof that $K(x,y) \leq K(x) + K(y|x^*) + c$ is similar to the argument we used above. Let $p,q$ be the shortest programs computing $x$ and $y$ given $x^*$, respectively. We can combine them into a program computing $x$ and $y$. First run $p$ to compute $x$ and compute the length of $p$. As we saw, these two pieces of data are equivalent to $x^*$, and so we can compute $y$ using $q$ as above, adding at most a finite amount program text to do so.

This property is so important it has a name.

Lemma: (Symmetry of information)

$\displaystyle K(x,y) = K(x) + K(y|x^*) = K(y) + K(x|y^*)$

This is true (and named appropriately) since there is symmetry in the quantity $K(x,y) = K(y,x)$. Note in particular that this doesn’t hold without the star: $K(x,y) = K(x) + K(y|x) + O(\log(K(x)))$. Those readers who completed the exercise above will know where the logarithm comes from.

## The Almost-Triangle Inequality

The first application of the symmetry of information is (surprisingly) a variant of the triangle inequality. Specifically, the function $f(x,y) = K(x|y^*)$ satisfies the metric inequalities up to an additive constant sloppiness.

$\displaystyle K(x|y^*) \leq K(x|z^*) + K(z|y^*) + c$

where $c$ does not depend on $x, y, z$. To prove this, see that

$\displaystyle K(x,z | y^*) = K(x,y,z) - K(y) \leq K(z) + K(x|z^*) + K(y|z^*) - K(y)$

The first equality is by the symmetry of information $K(x,y,z) = K(y) + K(x,z|y^*)$, and the second follows from the fact that $K(x,y,z) \leq K(z) + K(x|z^*) + K(y|z^*)$. This is the same argument we used to prove the $\leq$ case of the symmetry of information lemma.

Now we can rearrange the terms and use the symmetry of information twice, $K(z) + K(y|z^*) = K(y,z)$ and $K(y,z) - K(y) = K(z|y^*)$, to reach the final result.

This is interesting because it’s our first indication that Kolmogorov complexity can play a role in a metric. But there are some issues: $K(x|y)$ is in general not symmetric. We need to some up with a symmetric quantity to use instead. There are quite a few details to this process (see this paper if you want to know them all), but the result is quite nice.

Theorem: Let $E(x,y)$ be the length of the shortest program which computes $x$ given $y$ as input and $y$ given $x$. Then

$\displaystyle E(x,y) = \max (K(x|y), K(y|x)) + O(\log(M))$

where $M = \max(K(x|y), K(y|x))$.

That is, our intuitive idea of what the “information distance” from $x$ to $y$ should be coincides up to an additive logarithmic factor with the maximum of the conditional Kolmogorov complexities. If two strings are “close” with respect to $E$, then there is a lot of mutual information between them. In the same paper listed above, the researchers (Bennett et al.) prove that $E$ is a “metric” (up to additive constants) and so this gives a reasonable estimate for the true information distance in terms of conditional Kolmogorov complexities.

However, $E$ is not the final metric used in applications, but just an inspiration for other functions. This is where the story gets slightly more complicated.

## Normalized Information Distance(s)

At this point we realize that the information distance $E$ defined above is not as good as we’d like it to be. One of its major deficiencies is that it does not compute relative distances very well. That is, it doesn’t handle strings of varying size as well as it should.

For example, take $x$ to be a random string of length $n$ for arbitrary $n$.   The quantity $E(x, \varepsilon)$, where $\varepsilon$ is the empty string is just $K(x) + c$ (if the input is empty, compute $x$, otherwise output the empty string). But in a sense there is no information about $\varepsilon$ in any string. In other words, $\varepsilon$ is maximally dissimilar to all nonempty strings. But according to $E$, the empty string is variably dissimilar to other strings: it’s “less similar” to strings with higher Kolmogorov complexity. This is counter-intuitive, and hence undesirable.

Unfortunately the literature is littered with alternative distance functions, and the researchers involved spend little effort relating them to each other (this is part of the business of defining things “up to sloppiness”). We are about to define the principal example we will be concerned with, and we will discuss its relationship with its computationally-friendly cousins at the end.

The link between all of these examples is normalization. That is (again up to minor additive sloppiness we’ll make clear shortly) the distance functions take values in $[0,1]$, and a value of 0 means the strings are maximally similar, and a value of 1 implies maximal dissimilarity.

Definition: Let $\Sigma = \left \{ 0,1 \right \}^*$ be the set of binary strings. A normalized distance $f$ is a function $\Sigma \times \Sigma \to [0,1]$ which is symmetric and satisfies the following density condition for all $x \in \Sigma$ and all $0 \leq e \leq 1$:

$\displaystyle |\left \{ y : d(x,y) \leq e \right \}| < 2^{eK(x) + 1}$

That is, there is a restriction on the number of strings that are close to $x$. There is a sensible reason for such a convoluted condition: this is the Kolmogorov-complexity analogue of the Kraft inequality. One of the picky details we’ve blatantly left out in our discussion of Kolmogorov complexity is that the programs we’re allowed to write must collectively form a prefix-code. That is, no program is a proper prefix of another program. If the implications of this are unclear (or confusing), the reader may safely ignore it. It is purely a tool for theoretical analysis, and the full details are again in the text of Li & Vitanyi. We will come back to discuss other issues with this density condition later (in the mean time, think about why it’s potentially dubious), but now let us define our similarity metric.

Definition: The normalized information distance $d(x,y)$ is defined by

$\displaystyle d(x,y) = \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))}$

The reason we switched from $K(x|y)$ to $K(x|y^*)$ will become apparent in our calculations (we will make heavy use of the symmetry of information, which does not hold by a constant factor for $K(x|y)$).

Quickly note that this alleviates our empty string problem we had with the non-normalized metric. $d(x,\varepsilon) = K(x)/K(x) = 1$, so they are maximally dissimilar regardless of what $x$ is.

Theorem 1: (Metric Axioms) $d(x,y)$ satisfies the metric axioms up to additive $O(1/M)$ precision, where $M$ is the maximum of the Kolmogorov complexities of the strings involved in the (in)equality.

Theorem 2: (Universality) $d(x,y)$ is universal with respect to the class of computable normalized distance functions. That is, if $f$ is a normalized distance, then for all $x,y$ we have the following inequality:

$d(x,y) \leq f(x,y) + O(1/M)$

where again $M$ is the minimum of the Kolmogorov complexities of the strings involved.

We should note that in fact theorem 2 holds for even more general normalized distance functions, the so-called “upper semi-computable” functions. Skipping the rigorous definition, this just means that one can recursively approximate the true value by giving a consistently improved upper bound which converges to the actual value. It is not hard to see that $K$ is an upper semi-computable function, although it is unknown whether $d$ is (and many believe it is not).

The proof of the first theorem is straightforward but notationally dense.

Proof of Theorem 1 (Metric Axioms): The value $d(x,x) = K(x|x^*)/K(x) = O(1/K(x))$, since $K(x|x^*) = K(x|x,K(x))$ is trivially constant, and $d(x,y) \geq 0$ since Kolmogorov complexity is non-negative. Moreover, $d(x,y)$ is exactly symmetric, so the proof boils down to verifying the triangle inequality holds.

Let $x,y,z$ be strings. We gave a proof above that $K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$. We will modify this inequality to achieve our desired result, and there are two cases:

Case 1: $K(z) \leq \max(K(x), K(y))$. Take the maximum of each side of the two inequalities for $K(x|y^*), K(y|x^*)$ to get

$\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*) + K(z|y^*) , K(y|z^*) + K(z|x^*)) + O(1)$

We can further increase the right hand side by taking termwise maxima

$\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*), K(z|x^*)) + \max(K(y|z^*), K(z|y^*)) + O(1)$

Now divide through by $\max(K(x), K(y))$ to get

$\displaystyle \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))} \leq \frac{\max(K(x|z^*), K(z|x^*))}{\max(K(x), K(y))} + \frac{\max(K(y|x^*), K(z|y^*))}{\max(K(x), K(y))} + O(1/M)$

Finally, since $K(z)$ is smaller than the max of $K(x), K(y)$, we can replace the  $K(y)$ in the denominator of the first term of the right hand side by $K(z)$. This will only possibly increase the fraction, and for the same reason we can replace $K(x)$ by $K(z)$ in the second term. This achieves the triangle inequality up to $O(1/M)$, as desired.

Case 2: $K(z) = \max(K(x), K(y), K(z))$. Without loss of generality we may also assume $K(x) \geq K(y)$, for the other possibility has an identical argument. Now we can boil the inequality down to something simpler. We already know the denominators have to all be $K(z)$ in the right hand side, and $K(x)$ in the left. Moreover, we claim $K(z|x^*) \geq K(x|z^*)$. This is by the symmetry of information:

$\displaystyle K(x,z) = K(x|z^*) + K(z) = K(z|x^*) + K(x) \leq K(z|x^*) + K(z)$

Subtracting $K(z)$ establishes the claim, and similarly we have $K(z|y^*) \geq K(y|z^*)$. So the triangle inequality reduces to

$\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(z|x^*)}{K(z)} + \frac{K(z|y^*)}{K(z)} + O(1/K(z))$

Applying our original inequality again to get $K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$, we may divide through by $K(x)$ and there are two additional cases.

$\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + O(1)}{K(x)}$

If the right-hand side is less than or equal to 1, then adding a constant $c$ to the top and bottom of the fraction only increases the value of the fraction, and doesn’t violate the inequality. So we choose to add $K(z)-K(x)$ to the top and bottom of the right-hand side and again using the symmetry of information, we get exactly the required value.

If the right-hand side is greater than 1, then adding any constant to the top and bottom decreases the value of the fraction, but it still remains greater than 1. Since $K(x|y^*) \leq K(x)$ (a simple exercise), we see that the left-hand side is at most 1, and our same trick of adding $K(z) - K(x)$ works. $\square$

The proof of the universality theorem is considerably more elegant.

Proof of Theorem 2 (Universality): Let $f$ be any normalized distance function, and set $e = f(x,y)$. Suppose further that $K(x) \leq K(y)$.

Let us enumerate all strings $v$ such that $f(x,v) \leq e$. In particular, since $e = f(x,y)$, $y$ is included in this enumeration. By the density condition, the number of such strings is at most $2^{eK(x) + 1}$. The index of $y$ in this enumeration can be used as an effective description of $y$ when given $x$ as input. That is, there is a program which includes in its description the index of $y$ and outputs $y$ given $x$. Since the number of bits needed to describe the index of $y$ is at most $\log(2^{eK(x) + 1}) = eK(x) + 1$, we have

$\displaystyle K(y|x) \leq eK(x) + 1$

Again the symmetry of information lemma gives us $K(x|y^*) \leq K(y|x^*)$. And now

$\displaystyle d(x,y) = \frac{K(y|x^*)}{K(y)} \leq \frac{K(y|x) + O(1)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)}$

Since $K(x) \leq K(y)$, we can replace the denominator of the last expression with $K(x)$ (only increasing the fraction) to get $d(x,y) \leq e + O(1/K(x))$. But $e$ was just $f(x,y)$, so this completes the proof of this case.

In the case $K(y) < K(x)$, the proof is similar (enumerating the index of $x$ instead), and at the end we get

$\displaystyle d(x,y) = \frac{K(x|y^*)}{K(x)} \leq \frac{eK(y) + O(1)}{K(y)} = f(x,y) + O(1/K(y))$

The theorem is proved. $\square$

## Why Normalized Distance Functions?

The practical implications of the two theorems are immense. What we’re saying is that if we can represent some feature of string similarity by a normalized distance function, then that feature will be captured automatically by the normalized information distance $d$. The researchers who discovered normalized information distance (and proved its universality) argue that in fact upper semi-computable normalized distance functions encapsulate all real-world metrics we would ever care about! Of course, there is still the problem that Kolmogorov complexity is uncomputable, but we can certainly come up with reasonable approximations (we will see precisely this in our next post).

And these same researchers have shown that approximations to $d$ do represent a good deal of universality in practice. They’ve applied the same idea to fields as varied as genome clustering, language clustering, and music clustering. We will of course investigate the applications for ourselves on this blog, but their results seem to apply to data mining in any field.

But still this raises the obvious question (which goes unaddressed in any research article this author has read): does every metric have a sensible interpretation (or modification) as a normalized distance function? That awkward density condition seems particularly suspect, and is at the core of this author’s argument that the answer is no.

Consider the following example. Let $f$ be a normalized distance function, and fix $e = 1$. The density condition says that for any $x$ we want, the number of strings which are within distance 1 of $x$ is bounded by $2^{K(x) + 1}$. In particular, this quantity is finite, so there can only be finitely many strings which are within distance 1 of $x$. But there are infinitely many strings, so this is a contradiction!

Even if we rule out this (arguably trivial) case of $e=1$, we still run into problems. Let $e = 1 - \varepsilon$ for any sufficiently small $\varepsilon > 0$. Then fix $x = 0$ (the string consisting of the single bit 0). The number of strings which are within distance $e$ of $x$ is bounded by $2^{eK(x) + 1} < 2^{K(x) + 1}$ is again finite (and quite small, since $K(0)$ is about as simple as it gets). In other words, there are only a finite number of strings that are not maximally dissimilar to $0$. But one can easily come up with an infinite number of strings which share something in common with $0$: just use $0^n$ for any $n$ you please. It is ludicrous to say that every metric should call $0$ as dissimilar to $0^n$ as the empty string is to a random string of a thousand bits.

In general, this author doesn’t find it likely that one can take any arbitrary $f(x,y)$ which is both symmetric and has values in $[0,1]$ and modify it to satisfy the density condition. Indeed, this author has yet to see any example of a natural normalized similarity metric. There is one which is a modification of Hamming distance, but it is relatively awkward and involves the Kolmogorov complexity of the strings involved. If the reader has any ideas to the contrary, please share them in the comments.

So it appears that the class of normalized distance functions is not as large as we might wish, and in light of this the universality theorem is not as impressive. On the other hand, there is no denying the success of applying the normalized information distance to complex real-world problems. Something profound is going on, but from this author’s viewpoint more theoretical work is needed to establish why.

## Friendly Cousins of Normalized Information Distance

In practice we want to compute $K(x|y^*)$ in terms of quantities we can actually approximate. Due to the symmetry of information, we can rewrite the metric formula as

$\displaystyle d(x,y)=\frac{K(x,y) - \min(K(x), K(y))}{\max(K(x), K(y))}$

Indeed, since our main interpretation of $K(x)$ is as the size of the smallest “compressed version” of the string $x$, it would seem that we can approximate the function $K$ by using real-world compression algorithms. And for the $K(x,y)$ part, we recognize that (due to the need to specify a way to distinguish between the outputs $x,y$)

$K(x,y) \leq K(xy) + O(\log(\max(K(x), K(y))))$,

where $K(xy)$ is the Kolmogorov complexity of the concatenation of the two strings. So if we’re willing to forgive additive logarithmic sloppiness (technically, $O(\log(K(x))/K(x))$ sloppiness, which goes to zero asymptotocally), we can approximate normalized information distance as

$\displaystyle d(x,y) = \frac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}$

In the literature researchers will also simplify the metric by removing the “star” notation

$\displaystyle d(x,y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}$

Unfortunately these two things aren’t equivalent. As we saw in our “basic properties” of $K(x|y)$,

$K(x|y) \leq K(x|y^*) + O(\log(K(y)))$

Indeed, it is not the case that $K(x|y) = K(x|y^*)$. An easy counterexample is by trying to equate $K(K(x) | x) = K(K(x) | x^*)$. We have already proven that the right hand side is always constant, but the left hand side could not be. An exercise in Li & Vitanyi shows there is an infinite family of strings $x$ for which $K(K(x) | x) \geq \log(|x|)$.

And so these two metrics cannot be equal, but they are close. In fact, denoting the non-star version by $d_2$ and the regular version by $d_1$, we have $d_2(x,y) \leq d_1(x,y) + O(1)$. This changes the metric properties and the universality claim, because $O(1/K)$ precision is stronger than $O(1)$ precision. Indeed, the true constant is always less than 1 (e.g. when $K(y) > K(x)$ it is $K(y^*)/K(y)$), but this means the metric can potentially take values in the range $[0,2]$, which is edging further and further away from the notion of normalization we originally strove for.

Finally, the last example of a cousin metric is

$\displaystyle d_3(x,y) = \frac{K(x|y^*) + K(y|x^*)}{K(x,y)}$

We will leave it to the reader to verify this function again satisfies the metric inequalities (in the same way that the original normalized information distance does). On the other hand, it only satisfies universality up to a factor of 2. So while it still may give some nice results in practice (and it is easy to see how to approximate this), the first choice of normalized information distance was theoretically more precise.

## Applications

We’ve just waded through a veritable bog of theory, but we’ve seen some big surprises along the way. Next time we’ll put these theoretical claims to the test by seeing how well we can cluster and classify data using the normalized information distance (and introducing as little domain knowledge as possible). Until then!

# Determinism and Finite Automata – A Primer

This half of the theory of computing primer will cover the various finite automata, including deterministic, nondeterministic, and pushdown automata. We devote the second half [upcoming] entirely to Turing machines and the halting problem, but to facilitate the discussion of Turing machines we rely on the intuition and notation developed here.

## Defining Computation

The first step in studying the sorts of possible computations (and more interestingly, those things which cannot be computed) is to define exactly what we mean by a “computation.” At a high level, this is easy: a computation is simply a function. Given some input, produce the appropriate output.

Unfortunately this is much too general. For instance, we could define almost anything we want in terms of functions. Let $f$ be the function which accepts as input the date of California Super Lotto drawings, and returns the set of winning numbers for that date. Of course (assuming the winning numbers are chosen randomly), this is ridiculous and totally uninformative. We cannot learn anything interesting about computations if they are performed behind an opaque veil. Instead, we need to rigorously define the innards of our model for computation, and then see what kinds of things it can do.

Because our inputs may be long and complicated, our model should break an input into pieces and work on the pieces in sequence. This motivates the following two definitions:

Definition: An alphabet is a finite set, usually denoted $\Sigma$.

Definition: $\Sigma^*$ is the set of finite strings whose characters come from $\Sigma$. We include the empty string $\varepsilon$ as the string of zero length. This operation is called the Kleene star.

We will henceforth let inputs to our model be elements of $\Sigma^*$ for some fixed alphabet $\Sigma$, and operate on each character one at a time.

Since $\Sigma$ is finite, we may assign to each element a unique finite string of ones and zeros (to keep things simple, let each string have length equal to ceiling of $\log |\Sigma|)$. Then, we can simulate any alphabet by the appropriate subset of $\left \{0,1 \right \}^*$. So without loss of generality, we may always assume our input comes from $\left \{ 0,1 \right \}^*$, and thus we will henceforth have $\Sigma = \left \{ 0,1 \right \}$.

Now we need a clear objective for our model, and we will phrase it again in terms of $\Sigma^*$. Specifically, we pick a subset of $A \subset \Sigma^*$, and call it a language. Our computational model will accept an input, and output “accept” or “reject.” We say that our model recognizes $A$ if it accepts and only accepts those strings in $A$.

Notice that this is a bit different from our colloquial idea of “computation.” Specifically, all our model will do is determine presence of an element in a set which has a potentially complicated construction. While it at first seems that this does not include things like “adding two numbers,” it turns out that we may phrase such objectives in terms of acceptance and rejection. Let $A = \left \{ axbxc | a,b,c \in \mathbb{N}, a+b=c \right \}$, where $x$ is some fixed separator. If we fix $a$ and $b$, then we may run our computational model on successively higher values of $c$ (ordered by absolute value, or restricting everything to be positive) until finding acceptance.

Colloquially, we have defined “computation” as recognizing all inputs which satisfy a qestion. Moreover, we will see that very many complex problems can be recognized as recognition problems, including things that are logically impossible to compute within our model.

The reason we have been heretofore vague in naming our model is that we will actually define four different models of progressively stronger computational ability, and this framework will hold for all of them. So here our first try.

## Deterministic Finite Automata

This definition comes from the intuitive idea that a computation can be carried out via a set of states and transitions between those states. A very simple example is a light switch, which has the states ‘on’ and ‘off’, and which can accept as input ‘switch’ or ‘do nothing’. This model is rigorized here:

Definition: A deterministic finite automaton, abbreviated DFA, is a five-tuple $D = (S, \Sigma, \tau, s_0, F)$, where:

• $S$ is a set of states, with initial state $s_0$.
• $\Sigma$ is our working alphabet, and inputs come from $\Sigma^*$.
• $\tau$ is a transition function $S \times \Sigma \to S$, which maps the current state and next input character to the next appropriate state.
• $F \subset S$ is a set of final states. If the last input character lands us in any state $f \in F$, we accept.

Rigorously, let our input word $w \in \Sigma^*$ have length $n$ and characters indexed $w_k$. Call the sequence of states in the computation $s_k$, where $s_{k+1} = \tau(s_k, w_k)$. Then our DFA accepts $w$ if and only if $s_n \in F$. We will use this notation throughout our discussion of finite automata. We call the set of languages which a DFA can recognize the regular languages. We will soon see that this class is rather small.

Of course, we have some examples of DFAs. The simplest possible DFA has only one state, $s$, and the transition function $\tau(s,a) = s$ for all $a \in \Sigma$. Depending on whether $F$ is empty, this DFA will accept or reject all of $\Sigma^*$. We may visualize this DFA as a directed graph:

A trivial DFA, with one state.

Indeed, this visualization provides a convenient way for us to write out the transition function. We simply draw an edge originating at each state for each element of our alphabet. When two or more inputs have the same behavior, we label one edge with multiple input symbols, as above.

Here is a more interesting example: let $A$ be the set of binary strings which have an even number of zeros. We design a DFA to accept this as follows:

A DFA which accepts binary numbers with evenly many zeros.

The shaded state, $s_0$ is the only final state. It is obvious from tracing the diagram that this DFA accepts precisely the set of strings with evenly many zeros. Let’s try something harder:

Let $A = \left \{ 0^n1^n | n \in \mathbb{N} \right \}$. This is a simple enough language, so let us develop a DFA to recognize it. We can easily construct a DFA which recognizes $0^n1^n$ for a fixed $n$, and if we have two DFAs we can construct a DFA which recognizes their union quite easily (do so as an exercise!). However, due to the restriction that $S$ is finite, we cannot connect these pieces together indefinitely! While we might imagine some cleverly designed DFA exists to recognize $A$, we find it more likely that no such DFA exists.

Indeed, it has been proven that no such DFA exists. The key to the proof lies in an esoteric lemma called the Pumping Lemma. Being a notational beast of a lemma, we will not state it rigorously. Colloquially, the lemma says that if $A$ is a regular language, then any sufficiently large word $w \in A$ can be split up into three pieces $xyz$, such that the middle piece may be repeated arbitrarily many times, as $xyyyy \dots yz$, with the resulting word still being in $A$. Clearly this breaks the “balance” restraint of $A$ no matter where the splitting is done, and so $A$ cannot be regular.

Before we increase our model’s expressive power, let us make an “enhancement” to the DFA.

## Nondeterministic Finite Automata

Following the colloquial definition of nondeterminism, we can design our computational system to make state transitions “branch out.” This will be made clear by a slight modification of the DFA.

Definition: A nondeterministic finite automaton, abbreviated NFA, is a DFA with the following two modifications:

• Instead of having a single state $s_k$ at each step, we allow a set of possible states, which we denote $S_k \subset S$.
• Our initial state $s_0$ is replaced with an initial set of states $S_0 \subset S$.
• We include $\varepsilon$ in $\Sigma$, allowing for immediate transitions that require no input.
• The transition function $\tau$ now has signature $S \times \Sigma \to P(S)$, where $P(S)$ denotes the power set of $S$, the set of all subsets of $S$.
• At each step (with input character $w_k$), we map $\tau(-,w_k)$ over $S_k$ to get $S_{k+1}$, i.e. $S_{k+1} = \left \{ \tau(s, w_k) | s \in S_k \right \}$
• The NFA accepts if any state in $S_n$ is final.

In this way, our machine can get a character as input, and proceed to be in two separate states at the same time. If any of the branches of computation accept, then they all accept.

Extending our example from DFAs, here is an NFA which accepts the language of binary strings which have either an even number of zeros or an even number of ones (or both).

An NFA recognizing strings containing evenly many zeros or evenly many ones

Here, $S_0 = F = \left \{ s_0, s_2 \right \}$, and by tracing (a little more carefully this time), we can see that it accepts what we expect.

Now, with the added strength of nondeterminism, we might expect that NFAs can compute some things that DFAs cannot. Amazingly, this is not the case. We point the reader to a more detailed description, but trust that our quick explanation will give the necessary intuition to see the equivalence.

The key observation is that despite nondeterminism, there is still a finite number of possible states an NFA can be in. Specifically, there are at most $2^{|S|}$ possible subsets of $S$, so we can simply construct a DFA which has $2^{|S|}$ states, one corresponding to each subset of $S$, and build up a deterministic transition function that makes things work. This would be an appropriate exercise for the advanced reader, but the beginning student should see Sisper’s Introduction to the Theory of Computation, which contains a complete and thorough proof. (We hate to refer the reader to an expensive textbook, but we could only find sparse proofs of this equivalence elsewhere on the internet, and most assume a particular working set of notation. And Sisper’s book is much better than this terse post for a primer in Computing Theory. The interested reader who lacks mathematical maturity would find it very accessible.)

In other words, every NFA has a corresponding DFA which recognizes precisely the same language. On the other hand, every DFA is trivially an NFA. Thus, the class of languages which NFAs recognize is also the regular languages.

So we have discovered that regular languages, and hence the DFAs and NFAs which recognize them, are quite limited in expressive power. Now let’s truly beef up our model with some mathematical steroids.

## Pushdown Automata

Okay, so these steroids won’t be that strong. Ideally, we just want to add as little as possible to our model and still make it more expressive. This way we can better understand the nuances of computational power, thus strengthening our intuition. While “little” is a subjective notion, lots of “littler” things were tried before arriving at a model that was not equivalent in power to a DFA. We will refrain from explaining them here, except to say that DFAs are equivalent in power to the class of formal Regular Expressions (hence the name, regular languages), which most programmers are familiar with on an informal level.

Definition: A pushdown automaton, denoted PDA, is an NFA with two additional components: $\Gamma$ a set of stack symbols including $\varepsilon$, and a modified transition function $\tau$:

$\tau: S \times \Sigma \times \Gamma \to P(S) \times \Gamma^*$

Here, the left hand $\Gamma$ symbol represents the current top of the stack (which is always popped), and the right hand $\Gamma^*$ represents any modification to the stack during state transition. This modification is always a push, and can be an empty push, or a push of arbitrarily many symbols.

For brevity of a description of $\tau$, which would otherwise be infinite as there are infinitely many possible stacks, we allow $\tau$ to be a partial function, not defined for some inputs. Then a PDA automatically rejects any encountered undefined input. Alternatively, we could construct $\tau$ as a total function, if we add edges for each undefined input going to some terminal state which is inescapable and not final. For the sake of this post, we will accept a partial transition function.

Recalling our discussion of computational ability in our post on Conway’s Game of Life, we recognize that a PDA’s new power is in its ability to grow without bound. Specifically, the stack has no limit to its size, and we can remember important pieces of input which would otherwise be discarded.

This modification seems like it was made just for the $0^n1^n$ problem. Specifically, now we can just push the 0s we see on to the stack, and pop just 1’s until we see an empty stack. Here is a PDA which does just that (making heavy use of epsilon transitions):

A PDA which recognizes $0^n1^n$

Here $S_0 = \left \{ s_0 \right \}$, and $F = \left \{ s_2 \right \}$. For each state transition label, we include $w_k; a \to bc$ to mean, “upon receiving $w_k$ as input with $a$ on top of the stack, pop $a$ and replace it with $bc$, where $bc$ may be a single character or multiple characters or the empty character. The epsilon transitions allow us to move from $s_0$ to $s_1$ and from $s_1$ to $s_2$ seamlessly, but only when the stack is agreeable. As an exercise, modify the above diagram to instead recognize $0^{2n}1^n$.

Neato! It seems like we’ve got a huge boost in computational power now that we can store some information. Wrong. As great as a stack is, it’s not good enough. Here’s a language that a PDA cannot solve: $A = \left \{ 0^n1^n2^n | n \in \mathbb{N} \right \}$. This follows from an additional and more nuanced pumping lemma, and we will not dwell on it here. But look at that! This language is hardly any more complex than $0^n1^n$, and yet it stumps our PDA.

Time to up the steroid dosage. Surprisingly enough, it turns out that our solution will be equivalent to adding another stack! In doing so, we will achieve a level of computational expression so powerful that nothing we know of today can surpass it. We call this panacea of computational woes a Turing machine, and we will cover both it and the wild problems it cannot compute next time.

Until then!