Information Distance — A Primer

This post assumes familiarity with our primer on Kolmogorov complexity. We recommend the uninformed reader begin there. We will do our best to keep consistent notation across both posts.

Kolmogorov Complexity as a Metric

Over the past fifty years mathematicians have been piling up more and more theorems about Kolmogorov complexity, and for good reason. One of the main interpretations of the Kolmogorov complexity function K is that for a given string x, K(x) is the best theoretical compression of x under any compression scheme. So a negative result about K can provide useful bounds on how good a real-world compressor can be. It turns out that these properties also turn K into a useful tool for machine learning. The idea is summarized as follows:

Let x,y be binary strings, and as usual let’s fix some universal programming language L in which to write all of our programs. Let p(x,y) be the shortest program which computes both y when given x as an input, and x given y. We would imagine that if x,y are unrelated, then the length of the program |p(x,y)| would be roughly K(x) + K(y), simply by running the shortest program to output x using no inputs, followed by the same thing for y. As usual there will be some additive constant term independent of both x and y. We denote this by c or O(1) interchangeably.

We would further imagine that if x,y are related (that is, if there is some information about x contained in y or vice versa), then the program p(x,y) would utilize that information and hence be shorter than K(x) + K(y). It turns out that there is an even better way to characterize p, and with a few modifications we can turn the length of p into something similar to a metric on the set of all strings.

This metric has some strikingly attractive features. We will see that it is “universal” with respect to a certain class of distance functions (which is unfortunately not the class of all metrics). In particular, for any of these functions f, the length of |p(x,y)| will be at worst a small amount larger than f(x,y). In words, if x and y are similar according to any of these distance functions, then they will be similar according to p. Of course the devil is in the details, but this is the right idea to have in mind while we wade through the computations.

An Aside on Metrics, and Properties of Kolmogorov Complexity

In recent posts on this blog we’ve covered a number of important examples of metrics and investigated how a metric creates structure in a space. But as powerful and rare as fruitful metrics are, we have barely scratched the surface of the vast amount of literature on the subject.

As usual with our computations in Kolmogorov complexity, all of our equalities will be true up to some kind of additive sloppiness. Most of the time it will be an additive constant O(1) which is independent of anything else in the equation. We will usually omit the constant with that implicit understanding, and instead we will specify the times when it is an exact equality (or when the additive sloppiness is something other than a constant).

And so, unavoidably, the “metric” we define won’t be a true metric. It will only satisfy the metric properties (positive definite, symmetric, triangle inequality) up to a non-constant additive sloppiness. This will be part of the main theorem of this post.

Before we can reach the heart of the matter (and as a nice warm-up), we need to establish a few more properties of K. Recall that by K(x|y) we mean the shortest program which computes x when provided y as an auxiliary input. We call this the conditional complexity of x given y. Further, recall that K(x,y) is the length of the shortest program which outputs both x and y, and a way to distinguish between the two (if everything is in binary, the distinguishing part is nontrivial; should the reader be interested, this sort of conversation is made for comment threads). Finally, the comma notation works for auxiliary inputs as well: K(x|y,z) is the length of the shortest program outputting x when given y,z and a way to distinguish them as input.

For example, the conditional Kolmogorov complexity K(1^n | n) = c is constant: the length of the string 1^n provides all but a finite amount of information about it. On the other hand, if x,y are random strings (their bits are generated independently and uniformly at random), then K(y|x) = K(y); there is no information about y contained in x.

Definition: Let x be a (binary) string. We denote by x^* the shortest program which computes x. That is, K(x) = |x^*|. If there are two shortest programs which compute x, then x^* refers to the first in the standard enumeration of all programs.

As a quick aside, the “standard enumeration” is simple: treat a binary string as if it were a natural number written in base 2, and enumerate all strings in increasing order of their corresponding number. The choice of enumeration is irrelevant, though; all that matters is that it is consistent throughout our theory.

Proposition: Kolmogorov complexity has the following properties up to additive constants:

  1. K(x|y^*) = K(x|y,K(y))
  2. K(x|y^*) \leq K(x|y), and K(x|y) \leq K(x|y^*) + O(\log(K(y)))
  3. K(x,y) = K(x) + K(y|x^*)

The first item simply states that giving y^* as input to a program is the same as giving y and K(y). This is not hard to prove. If p is the shortest program computing x from y,K(y), then we can modify it slightly to work with y^* instead. Just add to the beginning of p the following instructions:

Compute K(y) as the length of the input y*
Simulate y* and record its output y

Since y^* is a finite string and represents a terminating program, these two steps produce the values needed to run p. Moreover, the program description is constant in length, independent of y^*.

On the other hand, if q is a program computing x from y^*, we are tasked with finding y^* given y, K(y). The argument a standard but slightly more complicated technique in theoretical computer science called dovetailing. In particular, since we know the length of y^*, and there are only finitely many programs of the same length, we can get a list p_1, p_2, \dots p_n of all programs of length K(y). We then interleave the simulation of each of these programs; that is, we run the first step of all of the p_i, then the second, third, and so on. Once we find a program which halts and outputs y (and we are guaranteed that one will do so) we can stop. In pseudocode, this is just the subroutine:

L = [all programs of length K(y) in lexicographic order]
i = 1
while True:
   for program in L:
      run step i of program
      if program terminates and outputs y:
         return program

The fact that this algorithm will terminate proves the claim.

The second item in the proposition has a similar proof, and we leave it as an exercise to the reader. (Hint: the logarithm in the second part of the statement comes from the hard-coding of a binary representation of the number K(y))

The third item, that K(x,y) = K(x) + K(y|x^*) has a much more difficult proof, and its consequences are far-reaching. We will use it often in our computations. The intrepid reader will see Theorem 3.9.1 in the text of Li & Vitanyi for a complete proof, but luckily one half of the proof is trivial. That is, the proof that K(x,y) \leq K(x) + K(y|x^*) + c is similar to the argument we used above. Let p,q be the shortest programs computing x and y given x^*, respectively. We can combine them into a program computing x and y. First run p to compute x and compute the length of p. As we saw, these two pieces of data are equivalent to x^*, and so we can compute y using q as above, adding at most a finite amount program text to do so.

This property is so important it has a name.

Lemma: (Symmetry of information)

\displaystyle K(x,y) = K(x) + K(y|x^*) = K(y) + K(x|y^*)

This is true (and named appropriately) since there is symmetry in the quantity K(x,y) = K(y,x). Note in particular that this doesn’t hold without the star: K(x,y) = K(x) + K(y|x) + O(\log(K(x))). Those readers who completed the exercise above will know where the logarithm comes from.

The Almost-Triangle Inequality

The first application of the symmetry of information is (surprisingly) a variant of the triangle inequality. Specifically, the function f(x,y) = K(x|y^*) satisfies the metric inequalities up to an additive constant sloppiness.

\displaystyle K(x|y^*) \leq K(x|z^*) + K(z|y^*) + c

where c does not depend on x, y, z. To prove this, see that

\displaystyle K(x,z | y^*) = K(x,y,z) - K(y) \leq K(z) + K(x|z^*) + K(y|z^*) - K(y)

The first equality is by the symmetry of information K(x,y,z) = K(y) + K(x,z|y^*), and the second follows from the fact that K(x,y,z) \leq K(z) + K(x|z^*) + K(y|z^*). This is the same argument we used to prove the \leq case of the symmetry of information lemma.

Now we can rearrange the terms and use the symmetry of information twice, K(z) + K(y|z^*) = K(y,z) and K(y,z) - K(y) = K(z|y^*), to reach the final result.

This is interesting because it’s our first indication that Kolmogorov complexity can play a role in a metric. But there are some issues: K(x|y) is in general not symmetric. We need to some up with a symmetric quantity to use instead. There are quite a few details to this process (see this paper if you want to know them all), but the result is quite nice.

Theorem: Let E(x,y) be the length of the shortest program which computes x given y as input and y given x. Then

\displaystyle E(x,y) = \max (K(x|y), K(y|x)) + O(\log(M))

where M = \max(K(x|y), K(y|x)).

That is, our intuitive idea of what the “information distance” from x to y should be coincides up to an additive logarithmic factor with the maximum of the conditional Kolmogorov complexities. If two strings are “close” with respect to E, then there is a lot of mutual information between them. In the same paper listed above, the researchers (Bennett et al.) prove that E is a “metric” (up to additive constants) and so this gives a reasonable estimate for the true information distance in terms of conditional Kolmogorov complexities.

However, E is not the final metric used in applications, but just an inspiration for other functions. This is where the story gets slightly more complicated.

Normalized Information Distance(s)

At this point we realize that the information distance E defined above is not as good as we’d like it to be. One of its major deficiencies is that it does not compute relative distances very well. That is, it doesn’t handle strings of varying size as well as it should.

For example, take x to be a random string of length n for arbitrary n.   The quantity E(x, \varepsilon), where \varepsilon is the empty string is just K(x) + c (if the input is empty, compute x, otherwise output the empty string). But in a sense there is no information about \varepsilon in any string. In other words, \varepsilon is maximally dissimilar to all nonempty strings. But according to E, the empty string is variably dissimilar to other strings: it’s “less similar” to strings with higher Kolmogorov complexity. This is counter-intuitive, and hence undesirable.

Unfortunately the literature is littered with alternative distance functions, and the researchers involved spend little effort relating them to each other (this is part of the business of defining things “up to sloppiness”). We are about to define the principal example we will be concerned with, and we will discuss its relationship with its computationally-friendly cousins at the end.

The link between all of these examples is normalization. That is (again up to minor additive sloppiness we’ll make clear shortly) the distance functions take values in [0,1], and a value of 0 means the strings are maximally similar, and a value of 1 implies maximal dissimilarity.

Definition: Let \Sigma = \left \{ 0,1 \right \}^* be the set of binary strings. A normalized distance f is a function \Sigma \times \Sigma \to [0,1] which is symmetric and satisfies the following density condition for all x \in \Sigma and all 0 \leq e \leq 1:

\displaystyle |\left \{ y : d(x,y) \leq e \right \}| < 2^{eK(x) + 1}

That is, there is a restriction on the number of strings that are close to x. There is a sensible reason for such a convoluted condition: this is the Kolmogorov-complexity analogue of the Kraft inequality. One of the picky details we’ve blatantly left out in our discussion of Kolmogorov complexity is that the programs we’re allowed to write must collectively form a prefix-code. That is, no program is a proper prefix of another program. If the implications of this are unclear (or confusing), the reader may safely ignore it. It is purely a tool for theoretical analysis, and the full details are again in the text of Li & Vitanyi. We will come back to discuss other issues with this density condition later (in the mean time, think about why it’s potentially dubious), but now let us define our similarity metric.

Definition: The normalized information distance d(x,y) is defined by

\displaystyle d(x,y) = \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))}

The reason we switched from K(x|y) to K(x|y^*) will become apparent in our calculations (we will make heavy use of the symmetry of information, which does not hold by a constant factor for K(x|y)).

Quickly note that this alleviates our empty string problem we had with the non-normalized metric. d(x,\varepsilon) = K(x)/K(x) = 1, so they are maximally dissimilar regardless of what x is.

We will prove two theorems about this function:

Theorem 1: (Metric Axioms) d(x,y) satisfies the metric axioms up to additive O(1/M) precision, where M is the maximum of the Kolmogorov complexities of the strings involved in the (in)equality.

Theorem 2: (Universality) d(x,y) is universal with respect to the class of computable normalized distance functions. That is, if f is a normalized distance, then for all x,y we have the following inequality:

d(x,y) \leq f(x,y) + O(1/M)

where again M is the minimum of the Kolmogorov complexities of the strings involved.

We should note that in fact theorem 2 holds for even more general normalized distance functions, the so-called “upper semi-computable” functions. Skipping the rigorous definition, this just means that one can recursively approximate the true value by giving a consistently improved upper bound which converges to the actual value. It is not hard to see that K is an upper semi-computable function, although it is unknown whether d is (and many believe it is not).

The proof of the first theorem is straightforward but notationally dense.

Proof of Theorem 1 (Metric Axioms): The value d(x,x) = K(x|x^*)/K(x) = O(1/K(x)), since K(x|x^*) = K(x|x,K(x)) is trivially constant, and d(x,y) \geq 0 since Kolmogorov complexity is non-negative. Moreover, d(x,y) is exactly symmetric, so the proof boils down to verifying the triangle inequality holds.

Let x,y,z be strings. We gave a proof above that K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1). We will modify this inequality to achieve our desired result, and there are two cases:

Case 1: K(z) \leq \max(K(x), K(y)). Take the maximum of each side of the two inequalities for K(x|y^*), K(y|x^*) to get

\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*) + K(z|y^*) , K(y|z^*) + K(z|x^*)) + O(1)

We can further increase the right hand side by taking termwise maxima

\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*), K(z|x^*)) + \max(K(y|z^*), K(z|y^*)) + O(1)

Now divide through by \max(K(x), K(y)) to get

\displaystyle \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))} \leq \frac{\max(K(x|z^*), K(z|x^*))}{\max(K(x), K(y))} + \frac{\max(K(y|x^*), K(z|y^*))}{\max(K(x), K(y))} + O(1/M)

Finally, since K(z) is smaller than the max of K(x), K(y), we can replace the  K(y) in the denominator of the first term of the right hand side by K(z). This will only possibly increase the fraction, and for the same reason we can replace K(x) by K(z) in the second term. This achieves the triangle inequality up to O(1/M), as desired.

Case 2: K(z) = \max(K(x), K(y), K(z)). Without loss of generality we may also assume K(x) \geq K(y), for the other possibility has an identical argument. Now we can boil the inequality down to something simpler. We already know the denominators have to all be K(z) in the right hand side, and K(x) in the left. Moreover, we claim K(z|x^*) \geq K(x|z^*). This is by the symmetry of information:

\displaystyle K(x,z) = K(x|z^*) + K(z) = K(z|x^*) + K(x) \leq K(z|x^*) + K(z)

Subtracting K(z) establishes the claim, and similarly we have K(z|y^*) \geq K(y|z^*). So the triangle inequality reduces to

\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(z|x^*)}{K(z)} + \frac{K(z|y^*)}{K(z)} + O(1/K(z))

Applying our original inequality again to get K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1), we may divide through by K(x) and there are two additional cases.

\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + O(1)}{K(x)}

If the right-hand side is less than or equal to 1, then adding a constant c to the top and bottom of the fraction only increases the value of the fraction, and doesn’t violate the inequality. So we choose to add K(z)-K(x) to the top and bottom of the right-hand side and again using the symmetry of information, we get exactly the required value.

If the right-hand side is greater than 1, then adding any constant to the top and bottom decreases the value of the fraction, but it still remains greater than 1. Since K(x|y^*) \leq K(x) (a simple exercise), we see that the left-hand side is at most 1, and our same trick of adding K(z) - K(x) works. \square

The proof of the universality theorem is considerably more elegant.

Proof of Theorem 2 (Universality): Let f be any normalized distance function, and set e = f(x,y). Suppose further that K(x) \leq K(y).

Let us enumerate all strings v such that f(x,v) \leq e. In particular, since e = f(x,y), y is included in this enumeration. By the density condition, the number of such strings is at most 2^{eK(x) + 1}. The index of y in this enumeration can be used as an effective description of y when given x as input. That is, there is a program which includes in its description the index of y and outputs y given x. Since the number of bits needed to describe the index of y is at most \log(2^{eK(x) + 1}) = eK(x) + 1, we have

\displaystyle K(y|x) \leq eK(x) + 1

Again the symmetry of information lemma gives us K(x|y^*) \leq K(y|x^*). And now

\displaystyle d(x,y) = \frac{K(y|x^*)}{K(y)} \leq \frac{K(y|x) + O(1)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)}

Since K(x) \leq K(y), we can replace the denominator of the last expression with K(x) (only increasing the fraction) to get d(x,y) \leq e + O(1/K(x)). But e was just f(x,y), so this completes the proof of this case.

In the case K(y) < K(x), the proof is similar (enumerating the index of x instead), and at the end we get

\displaystyle d(x,y) = \frac{K(x|y^*)}{K(x)} \leq \frac{eK(y) + O(1)}{K(y)} = f(x,y) + O(1/K(y))

The theorem is proved. \square

Why Normalized Distance Functions?

The practical implications of the two theorems are immense. What we’re saying is that if we can represent some feature of string similarity by a normalized distance function, then that feature will be captured automatically by the normalized information distance d. The researchers who discovered normalized information distance (and proved its universality) argue that in fact upper semi-computable normalized distance functions encapsulate all real-world metrics we would ever care about! Of course, there is still the problem that Kolmogorov complexity is uncomputable, but we can certainly come up with reasonable approximations (we will see precisely this in our next post).

And these same researchers have shown that approximations to d do represent a good deal of universality in practice. They’ve applied the same idea to fields as varied as genome clustering, language clustering, and music clustering. We will of course investigate the applications for ourselves on this blog, but their results seem to apply to data mining in any field.

But still this raises the obvious question (which goes unaddressed in any research article this author has read): does every metric have a sensible interpretation (or modification) as a normalized distance function? That awkward density condition seems particularly suspect, and is at the core of this author’s argument that the answer is no.

Consider the following example. Let f be a normalized distance function, and fix e = 1. The density condition says that for any x we want, the number of strings which are within distance 1 of x is bounded by 2^{K(x) + 1}. In particular, this quantity is finite, so there can only be finitely many strings which are within distance 1 of x. But there are infinitely many strings, so this is a contradiction!

Even if we rule out this (arguably trivial) case of e=1, we still run into problems. Let e = 1 - \varepsilon for any sufficiently small \varepsilon > 0. Then fix x = 0 (the string consisting of the single bit 0). The number of strings which are within distance e of x is bounded by 2^{eK(x) + 1} < 2^{K(x) + 1} is again finite (and quite small, since K(0) is about as simple as it gets). In other words, there are only a finite number of strings that are not maximally dissimilar to 0. But one can easily come up with an infinite number of strings which share something in common with 0: just use 0^n for any n you please. It is ludicrous to say that every metric should call 0 as dissimilar to 0^n as the empty string is to a random string of a thousand bits.

In general, this author doesn’t find it likely that one can take any arbitrary f(x,y) which is both symmetric and has values in [0,1] and modify it to satisfy the density condition. Indeed, this author has yet to see any example of a natural normalized similarity metric. There is one which is a modification of Hamming distance, but it is relatively awkward and involves the Kolmogorov complexity of the strings involved. If the reader has any ideas to the contrary, please share them in the comments.

So it appears that the class of normalized distance functions is not as large as we might wish, and in light of this the universality theorem is not as impressive. On the other hand, there is no denying the success of applying the normalized information distance to complex real-world problems. Something profound is going on, but from this author’s viewpoint more theoretical work is needed to establish why.

Friendly Cousins of Normalized Information Distance

In practice we want to compute K(x|y^*) in terms of quantities we can actually approximate. Due to the symmetry of information, we can rewrite the metric formula as

\displaystyle d(x,y)=\frac{K(x,y) - \min(K(x), K(y))}{\max(K(x), K(y))}

Indeed, since our main interpretation of K(x) is as the size of the smallest “compressed version” of the string x, it would seem that we can approximate the function K by using real-world compression algorithms. And for the K(x,y) part, we recognize that (due to the need to specify a way to distinguish between the outputs x,y)

K(x,y) \leq K(xy) + O(\log(\max(K(x), K(y)))),

where K(xy) is the Kolmogorov complexity of the concatenation of the two strings. So if we’re willing to forgive additive logarithmic sloppiness (technically, O(\log(K(x))/K(x)) sloppiness, which goes to zero asymptotocally), we can approximate normalized information distance as

\displaystyle d(x,y) = \frac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}

In the literature researchers will also simplify the metric by removing the “star” notation

\displaystyle d(x,y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}

Unfortunately these two things aren’t equivalent. As we saw in our “basic properties” of K(x|y),

K(x|y) \leq K(x|y^*) + O(\log(K(y)))

Indeed, it is not the case that K(x|y) = K(x|y^*). An easy counterexample is by trying to equate K(K(x) | x) = K(K(x) | x^*). We have already proven that the right hand side is always constant, but the left hand side could not be. An exercise in Li & Vitanyi shows there is an infinite family of strings x for which K(K(x) | x) \geq \log(|x|).

And so these two metrics cannot be equal, but they are close. In fact, denoting the non-star version by d_2 and the regular version by d_1, we have d_2(x,y) \leq d_1(x,y) + O(1). This changes the metric properties and the universality claim, because O(1/K) precision is stronger than O(1) precision. Indeed, the true constant is always less than 1 (e.g. when K(y) > K(x) it is K(y^*)/K(y)), but this means the metric can potentially take values in the range [0,2], which is edging further and further away from the notion of normalization we originally strove for.

Finally, the last example of a cousin metric is

\displaystyle d_3(x,y) = \frac{K(x|y^*) + K(y|x^*)}{K(x,y)}

We will leave it to the reader to verify this function again satisfies the metric inequalities (in the same way that the original normalized information distance does). On the other hand, it only satisfies universality up to a factor of 2. So while it still may give some nice results in practice (and it is easy to see how to approximate this), the first choice of normalized information distance was theoretically more precise.


We’ve just waded through a veritable bog of theory, but we’ve seen some big surprises along the way. Next time we’ll put these theoretical claims to the test by seeing how well we can cluster and classify data using the normalized information distance (and introducing as little domain knowledge as possible). Until then!

Kolmogorov Complexity – A Primer

The Complexity of Things

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).

The Problem with Randomness

What we would really love to do is be able to look at a string of binary digits and decide how “random” it is. On one hand, this is easy to do at a high level. A child would have no difficulty deciding which of the following two strings is more random:


And yet, by the immutable laws of probability, each string has an equal chance (2^{-50}) in being chosen at random from all sequences of 50 binary digits. So in a sense, the huge body of mathematics underlying probability has already failed us at this basic juncture because we cannot speak of how random one particular outcome of an experiment is. We need a new notion that overcomes this difficulty.

Definition: The Kolmogorov complexity of a string w, denoted K(w) is the length of the shortest program which outputs w given no input.

While this definition is not rigorous enough to be of any use (we will reformulate it later), we can easily see why the first of the two strings above is less random. We can write a very short Python program that outputs it:

print "01" * 25

On the other hand, it is not hard to see that the shortest program that produces the second string is

print "00011101001000101101001000101111010100000100111101"

The dutiful reader will cry out in protest. How do you know that’s the shortest program? Why restrict to Python? This whole discussion is so arbitrary!

Indeed, this is probably not the strictly shortest Python program that prints out the string. In the following we’ll work entirely in binary, so the picky reader should interpret this as a print command referring to a block of binary memory. We will reify these ideas in full rigor shortly, but first let us continue with this naivete to make the coming definitions a bit easier to parse.

If we abstract away the lengths of these strings, we see that the length of the first program is O(1) + \log(n), since we need \log(n) bits to represent the number n/2 in the string product. On the other hand, the second string has a program length of O(1) + n, as we require the entire output string as program text.

This intuition would lead us to define a sequence of length n to be random if it has Kolmogorov complexity at least n. One way to interpret this is: a string is “random” if the shortest program that outputs the string basically encodes the entire string in its source code.

We can extend this idea to talk about relative complexity. Specifically, we can speak of Python programs which accept input, and compute their output based on that input. For instance, the first of the two strings above has the program:

n = input()
print "01" * n/2

With respect to the input “50”, we see that the first string has constant complexity (indeed, this is also true of many numbers, such as 25). In other words, the string “50” contains a lot of information about the string we wish to generate (it’s length).

On the other hand, the same technique still won’t work for the second string. Even though it has length 50, that is not enough information to determine the contents of the string, which vary wildly. So the shortest program is still (probably) the one which prints out the string verbatim.

In the future, we plan to revisit the idea of relative complexity in the context of machine learning and classification; briefly, two items are similar if one has low complexity relative to the other. But for now, let us turn to the more precise definitions.

Actual Kolmogorov Complexity

We keep saying of the second string above that the shortest program is probably the one which prints the string verbatim. In fact, short of testing every single python program of shorter length, we will never know if this is true! Even if we did, our following definitions will make that discovery irrelevant. More generally, it’s an important fact that Kolmogorov complexity is an uncomputable function. That is, there is no Turing machine M which accepts as input a word w and produces its Kolmogorov complexity as output. In order to prove such miraculous things, we need to formalize the discussion in the context of Turing machines.

Let us fix a universal programming language L and speak of the Kolmogorov complexity with respect to L:

Definition: The Kolmogorov complexity of a string w with respect to L, denoted K_L(w) is the shortest program written in the language L which produces w as output. The conditional Kolmogorov complexity with respect to a string x, denoted K_L(w | x) (spoken w given x, as in probability theory), is the length of the shortest program which, when given x as input, outputs w.

Before we can prove that this definition is independent of L (for all intents and purposes), we need a small lemma, which we have essentially proved already:

Lemma: For any strings w,x and any language L, we have K_L(w | x) \leq |w| + c for some constant c independent of w, x, and K_L(w) \leq |w| + c' for some constant c' independent of w.

Proof. The program which trivially outputs the desired string has length |w| + c, for whatever constant number of letters c is required to dictate that a string be given as output. This is clearly independent of the string and any input. \square

It is not hard to see that this definition is invariant under a choice of language L up to a constant factor. In particular, let w be a string and fix two languages L, L'. As long as both languages are universal, in the sense that they can simulate a universal Turing machine, we can relate the Kolmogorov complexity of w with respect to both languages. Specifically, one can write an interpreter for L in the language of L' and vice versa. Readers should convince themselves that for any two reasonable programming languages, you can write a finite-length program in one language that interprets and executes programs written in the other language.

If we let p be the shortest program written in L that outputs w given x, and i be the interpreter for L written in L', then we can compute w given the input \left \langle p, x \right \rangle by way of the interpreter i. In other words, K_L(w | px) \leq |i|.

K_{L'}(w | x) \leq |p| + c + |i| = K_L(w | x) + c + |i| = K_L(w | x) + O(1)

Another easy way to convince oneself of this is to imagine one knows the language L'. Then the program would be something akin to:

input x
run the interpreter i on the program p with input x

Our inequalities above just describe the length of this program: we require the entire interpreter i, and the entire program p, to be part of the program text. After that, it’s just whatever fixed constant amount of code is required to initialize the interpreter on that input.

We call this result the invariance property of Kolmogorov complexity. And with this under our belt, we may speak of the Kolmogorov complexity of a string. We willingly ignore the additive constant difference which depends on the language chosen, and we may safely work exclusively in the context of some fixed universal Turing machine (or, say, Python, C, Racket, or Whitespace; pick your favorite). We will do this henceforth, denoting the Kolmogorov complexity of a string K(w).

Some Basic Facts

The two basic facts we will work with are the following:

  • There are strings of arbitrarily large Kolmogorov complexity.
  • Most strings have high complexity.

And we will use these facts to prove that K is uncomputable.

The first point is that for any n, there is a string w such that K(w) \geq n. To see this, we need only count the number of strings of smaller length. For those of us familiar with typical combinatorics, it’s just the Pigeonhole Principle applied to all strings of smaller length.

Specifically, there is at most one string of length zero, so there can only be one string with Kolmogorov complexity zero; i.e., there is only one program of length zero, and it can only have one output. Similarly, there are two strings of length one, so there are only two strings with Kolmogorov complexity equal to one. More generally, there are 2^i strings of length i, so there are at most \sum_{i = 0}^{n-1} 2^i strings of Kolmogorov complexity less than n. However, as we have seen before, this sum is 2^n - 1. So there are too many strings of length n and not enough of smaller length, implying that at least one string of length n has Kolmogorov complexity at least n.

The second point is simply viewing the above proof from a different angle: at least 1 string has complexity greater than n, more than half of the 2^n strings have complexity greater than n-1, more than three quarters of the strings have complexity greater than n-2, etc.

Rigorously, we will prove that the probability of picking a string with K(x) \leq n - c - 1 is smaller than 1/2^c. Letting S be the set of all such strings, we have an injection S \hookrightarrow the set of all strings of length less than n - c- 1, and there are 2^{n-c} - 1 such strings, so |S| \leq 2^{n-c} - 1, giving the inequality:

\displaystyle \frac{|S|}{2^n} \leq \frac{2^{n-c} - 1}{2^n} = \frac{1}{2^c} - \frac{1}{2^n} < \frac{1}{2^c}

In other words, the probability that a randomly chosen string w of length n has K(w) \geq n-c is at least 1 - 1/2^c.

The strings of high Kolmogorov complexity have special names.

Definition: We call a string w such that K(w) \geq |w|Kolmogorov random string, or an incompressible string.

This name makes sense for two reasons. First, as we’ve already mentioned, randomly chosen strings are almost always Kolmogorov random strings, so the name “random” is appropriate. Second, Kolmogorov complexity is essentially the ideal compression technique. In a sense, strings with high Kolmogorov complexity can’t be described in any shorter language; such a language would necessarily correspond to a program that decodes the language, and if the compression is small, so is the program that decompresses it (these claims are informal, and asymptotic).


We will now prove the main theorem of this primer, that Kolmogorov complexity is uncomputable.

Theorem: The Kolmogorov complexity function w \mapsto K(w) is uncomputable.

Proof. Suppose to the contrary that K is computable, and that M is a Turing machine which computes it. We will construct a new Turing machine M' which computes strings of high complexity, but M' will have a short description, giving a contradiction.

Specifically, M' iterates over the set of all binary strings in lexicographic order. For each such string w, it computes K(w), halting once it finds a w such that K(w) \geq |w| = n. Then we have the following inequality:

n \leq K(w) \leq \left | \left \langle M', n \right \rangle \right | + c

Here, the angle bracket notation represents a description of the tuple (M', n), and c is a constant depending only on M', which is fixed. The reason the inequality holds is just the invariance theorem: \left \langle M', n \right \rangle is a description of w in the language of Turing machines. In other words, the universal Turing machine which simulates M' on n will output w, so the Kolmogorov complexity of w is bounded by the length of this description (plus a constant).

Then the length of \left \langle M', n \right \rangle is at most \log(n) + \left | \left \langle M' \right \rangle \right | + c' for some constant c', and this is in turn \log(n) + c'' for some constant c'' (as the description of M' is constant). This gives the inequality

n \leq \log(n) + c''

But since \log(n) = o(n), we may pick sufficiently large n to achieve a contradiction. \square

We can interpret this philosophically: it is impossible to tell exactly how random something is. But perhaps more importantly, this is a genuinely different proof of uncomputability from our proof of the undecidability of the Halting problem. Up until now, the only way we could prove something is uncomputable is to reduce it to the Halting problem. Indeed, this lends itself nicely to many new kinds of undecidability-type theorems, as we will see below.

The reader may ask at this point, “Why bother talking about something we can’t even compute!?” Indeed, one might think that due to its uncomputability, Kolmogorov complexity can only provide insight into theoretical matters of no practical concern. In fact, there are many practical applications of Kolmogorov complexity, but in practice one usually gives a gross upper bound by use of the many industry-strength compression algorithms out there, such as Huffman codes. Our goal after this primer will be to convince you of its remarkable applicability, despite its uncomputability.

Consequences of Uncomputability and Incompressibility

One immediate consequence of the existence of incompressible strings is the following: it is impossible to write a perfect lossless compression algorithm. No matter how crafty one might be, there will always be strings that cannot be compressed.

But there are a number of other, more unexpected places that Kolmogorov complexity fills a niche. In computational complexity, for instance, one can give a lower bound on the amount of time it takes a single-tape Turing machine to simulate a two-tape Turing machine. Also in the realm of lower bounds, one can prove that an incompressible string w of length 2^n, which can be interpreted as a boolean function on n variables, requires \Omega(2^n/n) gates to express as a circuit.

It also appears outside of theoretical computer science. In, for instance, the study of entropy in dynamical systems (specifically, thermodynamics), one can make the following pictorial remark:

In fact, the author of this image, Scott Aaronson (whom we’ve seen before in our exploration of the Complexity Zoo) even proposes an empirical test of this fact: simulate a discretized coffee cup and try to compress the data at each step, graphing the resulting lengths to see the trends. This even sounds like a good project for this blog!

The applications don’t end there, though. Researchers have used the theory of Kolmogorov complexity to tackle problems in machine learning, clustering, and classification. Potentially, any subject which is concerned with relating pieces of information could benefit from a Kolmogorov-theoretic analysis.

Finally, we give a proof that the existence of Kolmogorov complexity provides an infinite number of mathematical statements which are unprovable.

Theorem: Fix a formalization of mathematics in which the following three conditions hold:

  • If a statement is provable then it is true.
  • Given a proof and a statement, it is decidable whether the proof proves the statement.
  • For every binary string x and integer k, we can construct a statement S(x,k) which is logically equivalent to “the Kolmogorov complexity of x is at least k“.

Then there is some constant t for which all statements S(x,k) with k > t are unprovable.

Proof. We construct an algorithm to find such proofs as follows:

On an input k,
Set m equal to 1.
   for all strings x of length at most m:
      for all strings P of length at most m:
         if P is a proof of S(x,k), output x and halt
   m = m+1

Suppose to the contrary that for all k there is an x for which the statement S(x,k) is provable. It is easy to see that the algorithm above will find such an x. On the other hand, for all such proofs P, we have the following inequality:

k \leq K(x) \leq \left | \left \langle M, k \right \rangle \right | + c = \log(k) + c'

Indeed, this algorithm is a description of x, so its length gives a bound on the complexity of x. Just as in the proof of the uncomputability of K, this inequality can only hold for finitely many k, a contradiction. \square

Note that this is a very different unprovability-type proof from Gödel’s Incompleteness Theorem. We can now construct with arbitrarily high probability arbitrarily many unprovable statements.

Look forward to our future posts on the applications of Kolmogorov complexity! We intend to explore some compression algorithms, and to use such algorithms to explore clustering problems, specifically for music recommendations.

Until then!

Other Complexity Classes

Not Just Time, But Space Too!

So far on this blog we’ve introduced models for computation, focused on Turing machines and given a short overview of the two most fundamental classes of problems: P and NP. While the most significant open question in the theory of computation is still whether P = NP, it turns out that there are hundreds (almost 500, in fact!) other “classes” of problems whose relationships are more or less unknown.

For instance, we can ask about problems that can be solved using polynomially-bounded space instead of time. This class would be called “PSPACE,” and it’s easy to see that \textup{P} \subset \textup{PSPACE}. As usual, there is a corresponding notion of problems solvable with polynomial-bounded space on a nondeterministic Turing machine, having the likely name NPSPACE. One might expect another huge open question: does PSPACE = NPSPACE? It turns out they are equal, and this just goes to show how sometimes nondeterminism doesn’t allow us to “do more” in one sense, but it does in another.

Another interesting question is the computational complexity of problems in the presence of oracles. Suppose we had a magical machine that could solve some difficult or impossible problem (say, the Halting Problem) in constant time. Then what sorts of problems can be solved in polynomial time? This notion gives rise to a whole new family of complexity classes that rely critically on using the oracle.

Above we give an example of some believed relationships between some basic complexity classes (I say basic, but even I haven’t ever heard of ZPP before). In a simplistic mindset, the goal of computing theory as a field of mathematics is to determine all of the relationships between these classes. In other words, we want to be able to answer all questions like, “Is computation with logarithmically-bounded space fundamentally different from computation with polynomially-bounded time?” Tacking this on to the fat list of open questions like P vs. NP, we admit that nobody knows the answer.

Unfortunately a primer on any more of these complexity classes would take us too far from the scope of this blog. We hardly skimmed the surface on P and NP, leaving quite a bit of low-hanging fruit to speculation (like, is there a problem which is not in NP? Or more subtly, is there a problem which is in NP, but not NP-complete? There is such a problem, but as far as this author knows, there is no known naturally occurring problem).

But fear not, interested reader! Even if we can’t reasonably commit to cataloging more complexity classes here, we do have an excellent reference: the Complexity Zoo. This is a more or less complete list of all known complexity classes, with pages and pages of descriptions and documented relationships with links to papers. Browsing the front-page list of classes, we see such intriguing and mysterious names as “Almost-PSPACE,” and “HeuristicP,” and “Quantum Refereed Games.”

The novice reader should start with the Petting Zoo, which expands on our meager introduction by describing, mostly in informal terms, the twenty most important complexity classes, and providing awesome inclusion graphs like the one below:

"Really Important Inclusions" among complexity classes.

The project was conceived and is organized by Scott Aaronson, a prominent researcher at MIT (with a neat blog of his own, though it’s pretty difficult material when he talks about his own research in quantum circuits). The Zoo also has a small side-exhibit focusing on the problems themselves, which is called the Complexity Garden. For instance he covers the quintessential NP-complete problem that we mentioned last time, namely N-Satisfiability.

Here on this blog we do have one more primer in the realm of Complexity Theory planned, but it’s more a question about data than Turing machines. As mentioned in our post on low-complexity art, we eventually mean to introduce the notion of Kolmogorov complexity, and prove a few of its basic and interesting properties.

Until then!

Random (Psychedelic) Art

And a Pinch of Python

Next semester I am a lab TA for an introductory programming course, and it’s taught in Python. My Python experience has a number of gaps in it, so we’ll have the opportunity for a few more Python primers, and small exercises to go along with it. This time, we’ll be investigating the basics of objects and classes, and have some fun with image construction using the Python Imaging Library. Disappointingly, the folks who maintain the PIL are slow to update it for any relatively recent version of Python (it’s been a few years since 3.x, honestly!), so this post requires one use Python 2.x (we’re using 2.7). As usual, the full source code for this post is available on this blog’s Github page, and we encourage the reader to follow along and create his own randomized pieces of art! Finally, we include a gallery of generated pictures at the end of this post. Enjoy!

How to Construct the Images

An image is a two-dimensional grid of pixels, and each pixel is a tiny dot of color displayed on the screen. In a computer, one represents each pixel as a triple of numbers (r,g,b), where r represents the red content, g the green content, and b the blue content. Each of these is a nonnegative integer between 0 and 255. Note that this gives us a total of 256^3 = 2^{24} distinct colors, which is nearly 17 million. Some estimates of how much color the eye can see range as high as 10 million (depending on the definition of color) but usually stick around 2.4 million, so it’s generally agreed that we don’t need more.

The general idea behind our random psychedelic art is that we will generate three randomized functions (f,g,h) each with domain and codomain [-1,1] \times [-1,1], and at each pixel (x,y) we will determine the color at that pixel by the triple (f(x,y), g(x,y), h(x,y)). This will require some translation between pixel coordinates, but we’ll get to that soon enough. As an example, if our colors are defined by the functions (\sin(\pi x), \cos(\pi xy), \sin(\pi y)), then the resulting image is:

We use the extra factor of \pi because without it the oscillation is just too slow, and the resulting picture is decidedly boring. Of course, the goal is to randomly generate such functions, so we should pick a few functions on [-1,1] and nest them appropriately. The first which come to mind are \sin(\pi \cdot -), \cos(\pi \cdot -), and simple multiplication. With these, we can create such convoluted functions like

\sin(\pi x \cos(\pi xy \sin(\pi (\cos (\pi xy)))))

We could randomly generate these functions two ways, but both require randomness, so let’s familiarize ourselves with the capabilities of Python’s random library.

Random Numbers

Pseudorandom number generators are a fascinating topic in number theory, and one of these days we plan to cover it on this blog. Until then, we will simply note the basics. First, contemporary computers can not generate random numbers. Everything on a computer is deterministic, meaning that if one completely determines a situation in a computer, the following action will always be the same. With the complexity of modern operating systems (and the aggravating nuances of individual systems), some might facetiously disagree.

For an entire computer the “determined situation” can be as drastic as choosing every single bit in memory and the hard drive. In a pseudorandom number generator the “determined situation” is a single number called a seed. This initializes the random number generator, which then proceeds to compute a sequence of bits via some complicated arithmetic. The point is that one may choose the seed, and choosing the same seed twice will result in the same sequence of “randomly” generated numbers. The default seed (which is what one uses when one is not testing for correctness) is usually some sort of time-stamp which is guaranteed to never repeat. Flaws in random number generator design (hubris, off-by-one errors, and even using time-stamps!) has allowed humans to take advantage of people who try to rely on random number generators. The interested reader will find a detailed account of how a group of software engineers wrote a program to cheat at online poker, simply by reverse-engineering the random number generator used to shuffle the deck.

In any event, Python makes generating random numbers quite easy:

import random

print(random.choice(["clubs", "hearts", "diamonds", "spades"]))

We import the random library, we seed it with the default seed, we print out a random number in (0,1), and then we randomly pick one element from a list. For a full list of the functions in Python’s random library, see the documentation. As it turns out, we will only need the choice() function.

Representing Mathematical Expressions

One neat way to represent a mathematical function is via…a function! In other words, just like Racket and Mathematica and a whole host of other languages, Python functions are first-class objects, meaning they can be passed around like variables. (Indeed, they are objects in another sense, but we will get to that later). Further, Python has support for anonymous functions, or lambda expressions, which work as follows:

>>> print((lambda x: x + 1)(4))

So one might conceivably randomly construct a mathematical expression by nesting lambdas:

import math

def makeExpr():
   if random.random() < 0.5:
      return lambda x: math.sin(math.pi * makeExpr()(x))
      return lambda x: x

Note that we need to import the math library, which has support for all of the necessary mathematical functions and constants. One could easily extend this to support two variables, cosines, etc., but there is one flaw with the approach: once we’ve constructed the function, we have no idea what it is. Here’s what happens:

>>> x = lambda y: y + 1
>>> str(x)
'<function <lambda> at 0xb782b144>'

There’s no way for Python to know the textual contents of a lambda expression at runtime!  In order to remedy this, we turn to classes.

The inquisitive reader may have noticed by now that lots of things in Python have “associated things,” which roughly correspond to what you can type after suffixing an expression with a dot. Lists have methods like “[1,2,3,4].append(5)”, dictionaries have associated lists of keys and values, and even numbers have some secretive methods:

>>> 45.7.is_integer()

In many languages like C, this would be rubbish. Many languages distinguish between primitive types and objects, and numbers usually fall into the former category. However, in Python everything is an object. This means the dot operator may be used after any type, and as we see above this includes literals.

A class, then, is just a more transparent way of creating an object with certain associated pieces of data (the fancy word is encapsulation). For instance, if I wanted to have a type that represents a dog, I might write the following Python program:

class Dog:
   age = 0
   name = ""

   def bark(self):
      print("Ruff ruff! (I'm %s)" %

Then to use the new Dog class, I could create it and set its attributes appropriately:

fido = Dog()
fido.age = 4 = "Fido"
fido.weight = 100

The details of the class construction requires a bit of explanation. First, we note that the indented block of code is arbitrary, and one need not “initialize” the member variables. Indeed, they simply pop into existence once they are referenced, as in the creation of the weight attribute. To make it more clear, Python provides a special function called “__init__()” (with two underscores on each side of “init”; heaven knows why they decided it should be so ugly), which is called upon the creation of a new object, in this case the expression “Dog()”. For instance, one could by default name their dogs “Fido” as follows:

class Dog:
   def __init__(self): = "Fido"

d = Dog()             # contains "Fido"

This brings up another point: all methods of a class that wish to access the attributes of the class require an additional argument. The first argument passed to any method is always the object which represents the owning instance of the object. In Java, this is usually hidden from view, but available by the keyword “this”. In Python, one must explicitly represent it, and it is standard to name the variable “self”.

If we wanted to give the user a choice when instantiating their dog, we could include an extra argument for the name like this:

class Dog:
   def __init__(self, name = 'Fido'): = name

d = Dog()                   # contains "Fido" 
e = Dog("Manfred")                   # contains "Manfred"

Here we made it so the “name” argument is not required, and if it is excluded we default to “Fido.”

To get back to representing mathematical functions, we might represent the identity function on x by the following class:

class X:
   def eval(self, x, y):
      return x

expr = X()
expr.eval(3,4)           # returns 3

That’s simple enough. But we still have the problem of not being able to print anything sensibly. Trying gives the following output:

>>> str(X)

In other words, all it does is print the name of the class, which is not enough if we want to have complicated nested expressions. It turns out that the “str” function is quite special. When one calls “str()” of something, Python first checks to see if the object being called has a method called “__str__()”, and if so, calls that. The awkward “__main__.X” is a default behavior. So if we soup up our class by adding a definition for “__str__()”, we can define the behavior of string conversion. For the X class this is simple enough:

class X:
   def eval(self, x, y):
      return x

   def __str__(self):
      return "x"

For nested functions we could recursively convert the argument, as in the following definition for a SinPi class:

class SinPi:
   def __str__(self):
      return "sin(pi*" + str(self.arg) + ")"

   def eval(self, x, y):
      return math.sin(math.pi * self.arg.eval(x,y))

Of course, this requires we set the “arg” attribute before calling these functions, and since we will only use these classes for random generation, we could include that sort of logic in the “__init__()” function.

To randomly construct expressions, we create the function “buildExpr”, which randomly picks to terminate or continue nesting things:

def buildExpr(prob = 0.99):
   if random.random() < prob:
      return random.choice([SinPi, CosPi, Times])(prob)
      return random.choice([X, Y])()

Here we have classes for cosine, sine, and multiplication, and the two variables. The reason for the interesting syntax (picking the class name from a list and then instantiating it, noting that these classes are objects even before instantiation and may be passed around as well!), is so that we can do the following trick, and avoid unnecessary recursion:

class SinPi:
   def __init__(self, prob):
      self.arg = buildExpr(prob * prob)


In words, each time we nest further, we exponentially decrease the probability that we will continue nesting in the future, and all the nesting logic is contained in the initialization of the object. We’re building an expression tree, and then when we evaluate an expression we have to walk down the tree and recursively evaluate the branches appropriately. Implementing the remaining classes is a quick exercise, and we remind the reader that the entire source code is available from this blog’s Github page. Printing out such expressions results in some nice long trees, but also some short ones:

>>> str(buildExpr())
>>> str(buildExpr())
>>> str(buildExpr())
>>> str(buildExpr())
>>> str(buildExpr())
>>> str(buildExpr())

This should work well for our goals. The rest is constructing the images.

Images in Python, and the Python Imaging Library

The Python imaging library is part of the standard Python installation, and so we can access the part we need by adding the following line to our header:

from PIL import Image

Now we can construct a new canvas, and start setting some pixels.

canvas ="L", (300,300))
canvas.putpixel((150,150), 255)"test.png", "PNG")

This gives us a nice black square with a single white pixel in the center. The “L” argument to says we’re working in grayscale, so that each pixel is a single 0-255 integer representing intensity. We can do this for three images, and merge them into a single color image using the following:

finalImage = Image.merge("RGB",
   (redCanvas, greenCanvas, blueCanvas))

Where we construct “redCanvas”, “greenCanvas”, and “blueCanvas” in the same way above, but with the appropriate intensities. The rest of the details in the Python code are left for the reader to explore, but we dare say it is just bookkeeping and converting between image coordinate representations. At the end of this post, we provide a gallery of the randomly generated images, and a text file containing the corresponding expression trees is packaged with the source code on this blog’s Github page.

Extending the Program With New Functions!

There is decidedly little mathematics in this project, but there are some things we can discuss. First, we note that there are many many many functions on the interval [-1,1] that we could include in our random trees. A few examples are: the average of two numbers in that range, the absolute value, certain exponentials, and reciprocals of interesting sequences of numbers. We leave it as an exercise to the reader to add new functions to our existing code, and to further describe which functions achieve coherent effects.

Indeed, the designs are all rather psychedelic, and the layers of color are completely unrelated. It would be an interesting venture to write a program which, given an image of something (pretend it’s a simple image containing some shapes), constructs expression trees that are consistent with the curves and lines in the image. This follows suit with our goal of constructing low-complexity pictures from a while back, and indeed, these pictures have rather low Kolmogorov complexity. This method is another framework in which to describe their complexity, in that smaller expression trees correspond to simpler pictures. We leave this for future work. Until then, enjoy these pictures!