This post assumes familiarity with our primer on Kolmogorov complexity. We recommend the uninformed reader begin there. We will do our best to keep consistent notation across both posts.
Kolmogorov Complexity as a Metric
Over the past fifty years mathematicians have been piling up more and more theorems about Kolmogorov complexity, and for good reason. One of the main interpretations of the Kolmogorov complexity function $K$ is that for a given string $x$, $K(x)$ is the length of the best theoretical compression of $x$ under any compression scheme. So a negative result about $K$ can provide useful bounds on how good a real-world compressor can be. It turns out that these properties also turn into a useful tool for machine learning. The idea is summarized as follows:
Let $x, y$ be binary strings, and as usual let's fix some universal programming language in which to write all of our programs. Let $p$ be the shortest program which computes both $y$ when given $x$ as an input, and $x$ when given $y$. We would imagine that if $x, y$ are unrelated, then the length of the program $p$ would be roughly $K(x) + K(y)$, simply by running the shortest program to output $x$ using no inputs, followed by the same thing for $y$. As usual there will be some additive constant term independent of both $x$ and $y$. We denote this by $c$ or $O(1)$ interchangeably.
We would further imagine that if $x, y$ are related (that is, if there is some information about $x$ contained in $y$ or vice versa), then the program $p$ would utilize that information and hence be shorter than $K(x) + K(y)$. It turns out that there is an even better way to characterize $p$, and with a few modifications we can turn the length of $p$ into something similar to a metric on the set of all strings.
This metric has some strikingly attractive features. We will see that it is "universal" with respect to a certain class of distance functions (which is unfortunately not the class of all metrics). In particular, for any of these functions $f$, the length of $p$ will be at worst a small amount larger than $f(x, y)$. In words, if $x$ and $y$ are similar according to any of these distance functions, then they will be similar according to the metric we build from $p$. Of course the devil is in the details, but this is the right idea to have in mind while we wade through the computations.
An Aside on Metrics, and Properties of Kolmogorov Complexity
In recent posts on this blog we’ve covered a number of important examples of metrics and investigated how a metric creates structure in a space. But as powerful and rare as fruitful metrics are, we have barely scratched the surface of the vast amount of literature on the subject.
As usual with our computations in Kolmogorov complexity, all of our equalities will be true up to some kind of additive sloppiness. Most of the time it will be an additive constant which is independent of anything else in the equation. We will usually omit the constant with that implicit understanding, and instead we will specify the times when it is an exact equality (or when the additive sloppiness is something other than a constant).
And so, unavoidably, the “metric” we define won’t be a true metric. It will only satisfy the metric properties (positive definite, symmetric, triangle inequality) up to a non-constant additive sloppiness. This will be part of the main theorem of this post.
Before we can reach the heart of the matter (and as a nice warm-up), we need to establish a few more properties of $K$. Recall that by $K(y \mid x)$ we mean the length of the shortest program which computes $y$ when provided $x$ as an auxiliary input. We call this the conditional complexity of $y$ given $x$. Further, recall that $K(x, y)$ is the length of the shortest program which outputs both $x$ and $y$, and a way to distinguish between the two (if everything is in binary, the distinguishing part is nontrivial; should the reader be interested, this sort of conversation is made for comment threads). Finally, the comma notation works for auxiliary inputs as well: $K(z \mid x, y)$ is the length of the shortest program outputting $z$ when given $x, y$ and a way to distinguish them as input.
For example, the conditional Kolmogorov complexity $K(1^n \mid n)$ is constant: the length of the string $1^n$ provides all but a finite amount of information about it. On the other hand, if $x, y$ are random strings (their bits are generated independently and uniformly at random), then $K(y \mid x) = K(y)$; there is no information about $y$ contained in $x$.
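In symbols (our restatement of the two examples above, with the additive-constant hedging written out explicitly):

$\displaystyle K(1^n \mid n) = O(1), \qquad K(y \mid x) = K(y) + O(1) \ \text{ for independent, uniformly random } x, y.$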
Definition: Let $x$ be a (binary) string. We denote by $x^*$ the shortest program which computes $x$. That is, $K(x) = |x^*|$. If there are two shortest programs which compute $x$, then $x^*$ refers to the first in the standard enumeration of all programs.
As a quick aside, the “standard enumeration” is simple: treat a binary string as if it were a natural number written in base 2, and enumerate all strings in increasing order of their corresponding number. The choice of enumeration is irrelevant, though; all that matters is that it is consistent throughout our theory.
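For concreteness, here is a sketch (ours, purely for illustration) of one standard convention: order strings by length, and lexicographically within each length. Any fixed convention of this sort works equally well for the theory.

from itertools import count, product

def standard_enumeration():
    # Yield all binary strings: the empty string, then the strings of
    # length 1 in lexicographic order, then length 2, and so on.
    yield ""
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# First few entries: '', '0', '1', '00', '01', '10', '11', '000', ...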
Proposition: Kolmogorov complexity has the following properties up to additive constants:
- $K(x \mid y^*) = K(x \mid y, K(y))$
- $K(x \mid y^*) \leq K(x \mid y)$, and $K(x \mid y) \leq K(x \mid y^*) + O(\log(K(y)))$
- $K(x, y) = K(x) + K(y \mid x^*)$
The first item simply states that giving $y^*$ as input to a program is the same as giving $y$ and $K(y)$. This is not hard to prove. If $p$ is the shortest program computing $x$ from $y, K(y)$, then we can modify it slightly to work with $y^*$ instead. Just add to the beginning of $p$ the following instructions:
Compute K(y) as the length of the input y*
Simulate y* and record its output y
Since $y^*$ is a finite string and represents a terminating program, these two steps produce the values needed to run $p$. Moreover, the program text we added is constant in length, independent of $y^*$.
On the other hand, if $q$ is a program computing $x$ from $y^*$, we are tasked with finding $y^*$ when given $y, K(y)$. The argument uses a standard but slightly more complicated technique in theoretical computer science called dovetailing. In particular, since we know the length $K(y)$ of $y^*$, and there are only finitely many programs of the same length, we can get a list of all programs of length $K(y)$. We then interleave the simulation of each of these programs; that is, we run the first step of all of them, then the second, third, and so on. Once we find a program which halts and outputs $y$ (and we are guaranteed that one will do so) we can stop. In pseudocode, this is just the subroutine:
L = [all programs of length K(y) in lexicographic order]
i = 1
while True:
    for program in L:
        run step i of program
        if program terminates and outputs y:
            return program
    i = i + 1
The fact that this algorithm will terminate proves the claim.
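As a runnable toy model of dovetailing (our own illustration, not part of the original argument), we can pretend that "programs" are Python generators which yield once per simulated step and return their output when they halt:

def dovetail(programs, target):
    # Interleave one step of each program per round; return the index of
    # the first program observed to halt with output equal to target.
    # Programs that never halt are simply stepped forever, which is fine
    # as long as some program is guaranteed to halt with the target.
    running = {i: make() for i, make in enumerate(programs)}
    while running:
        for i, gen in list(running.items()):
            try:
                next(gen)                   # run one step of program i
            except StopIteration as halt:   # program i has terminated
                del running[i]
                if halt.value == target:
                    return i
    raise ValueError("no program produced the target")

def loops_forever():
    while True:
        yield

def outputs_target():
    for _ in range(3):                      # pretend to take three steps
        yield
    return "y"

print(dovetail([loops_forever, outputs_target], "y"))   # prints 1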
The second item in the proposition has a similar proof, and we leave it as an exercise to the reader. (Hint: the logarithm in the second part of the statement comes from the hard-coding of a binary representation of the number $K(y)$.)
The third item, that $K(x, y) = K(x) + K(y \mid x^*)$, has a much more difficult proof, and its consequences are far-reaching. We will use it often in our computations. The intrepid reader will see Theorem 3.9.1 in the text of Li & Vitanyi for a complete proof, but luckily one half of the proof is trivial. That is, the proof that $K(x, y) \leq K(x) + K(y \mid x^*)$ is similar to the argument we used above. Let $x^*, q$ be the shortest programs computing $x$ (from scratch) and computing $y$ given $x^*$, respectively. We can combine them into a program computing $x$ and $y$. First run $x^*$ to compute $x$, and compute the length of $x^*$. As we saw, these two pieces of data are equivalent to having $x^*$ itself, and so we can compute $y$ using $q$ as above, adding at most a finite amount of program text to do so.
This property is so important it has a name.
Lemma: (Symmetry of information) $K(x, y) = K(x) + K(y \mid x^*) = K(y) + K(x \mid y^*)$
This is true (and named appropriately) since there is symmetry in the quantity $K(x, y) = K(y, x)$. Note in particular that this doesn't hold without the star: $K(x, y) = K(x) + K(y \mid x) + O(\log(K(x)))$. Those readers who completed the exercise above will know where the logarithm comes from.
The Almost-Triangle Inequality
The first application of the symmetry of information is (surprisingly) a variant of the triangle inequality. Specifically, the function $f(x, y) = K(x \mid y^*)$ satisfies the metric inequalities up to an additive constant sloppiness:

$\displaystyle K(x \mid y^*) \leq K(x \mid z^*) + K(z \mid y^*) + c$

where $c$ does not depend on $x, y, z$. To prove this, see that

$\displaystyle K(x, z \mid y^*) = K(x, y, z) - K(y) \leq K(z) + K(x \mid z^*) + K(y \mid z^*) - K(y)$
The first equality is by the symmetry of information, $K(x, y, z) = K(y) + K(x, z \mid y^*)$, and the second follows from the fact that $K(x, y, z) \leq K(z) + K(x \mid z^*) + K(y \mid z^*)$. This is the same argument we used to prove the $\leq$ case of the symmetry of information lemma.
Now we can rearrange the terms and use the symmetry of information twice, $K(z) + K(y \mid z^*) = K(y, z)$ and $K(y, z) - K(y) = K(z \mid y^*)$, together with the observation that $K(x \mid y^*) \leq K(x, z \mid y^*)$ up to an additive constant, to reach the final result.
This is interesting because it's our first indication that Kolmogorov complexity can play a role in a metric. But there are some issues: $K(x \mid y^*)$ is in general not symmetric. We need to come up with a symmetric quantity to use instead. There are quite a few details to this process (see this paper if you want to know them all), but the result is quite nice.
Theorem: Let $E(x, y)$ be the length of the shortest program which computes $x$ given $y$ as input and $y$ given $x$. Then

$\displaystyle E(x, y) = \max(K(x \mid y), K(y \mid x)) + O(\log(M))$

where $M = \max(K(x \mid y), K(y \mid x))$.
That is, our intuitive idea of what the "information distance" from $x$ to $y$ should be coincides up to an additive logarithmic factor with the maximum of the conditional Kolmogorov complexities. If two strings are "close" with respect to $E$, then there is a lot of mutual information between them. In the same paper listed above, the researchers (Bennett et al.) prove that $E$ is a "metric" (up to additive constants) and so this gives a reasonable estimate for the true information distance in terms of conditional Kolmogorov complexities.
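One direction of this theorem is worth sanity-checking on the spot (a quick observation of ours, not a quotation from the paper): the program witnessing $E(x, y)$ in particular computes $x$ from $y$ and $y$ from $x$, so

$\displaystyle K(x \mid y) \leq E(x, y) + O(1) \quad \text{and} \quad K(y \mid x) \leq E(x, y) + O(1),$

and hence $\max(K(x \mid y), K(y \mid x)) \leq E(x, y) + O(1)$. The content of the theorem is the reverse inequality.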
However, $E$ is not the final metric used in applications, but just an inspiration for other functions. This is where the story gets slightly more complicated.
At this point we realize that the information distance defined above is not as good as we’d like it to be. One of its major deficiencies is that it does not compute relative distances very well. That is, it doesn’t handle strings of varying size as well as it should.
For example, take $x$ to be a random string of length $n$ for arbitrary $n$. The quantity $E(x, \varepsilon)$, where $\varepsilon$ is the empty string, is just $K(x) + c$ (if the input is empty, compute $x$; otherwise output the empty string). But in a sense there is no information about $\varepsilon$ in any string. In other words, $\varepsilon$ is maximally dissimilar to all nonempty strings. But according to $E$, the empty string is variably dissimilar to other strings: it's "less similar" to strings with higher Kolmogorov complexity. This is counter-intuitive, and hence undesirable.
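To put rough numbers on the complaint (using the hedged approximation $K(x) \approx n$ for a random string $x$ of length $n$):

$\displaystyle E(x, \varepsilon) \approx K(x) \approx n,$

so under $E$ the empty string sits about twice as far from a random $2n$-bit string as from a random $n$-bit string, even though it is equally uninformative about both.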
Unfortunately the literature is littered with alternative distance functions, and the researchers involved spend little effort relating them to each other (this is part of the business of defining things “up to sloppiness”). We are about to define the principal example we will be concerned with, and we will discuss its relationship with its computationally-friendly cousins at the end.
The link between all of these examples is normalization. That is (again up to minor additive sloppiness we'll make clear shortly) the distance functions take values in $[0, 1]$, and a value of 0 means the strings are maximally similar, and a value of 1 implies maximal dissimilarity.
Definition: Let $\Sigma = \{ 0, 1 \}^*$ be the set of binary strings. A normalized distance $f$ is a function $f: \Sigma \times \Sigma \to [0, 1]$ which is symmetric and satisfies the following density condition for all $x \in \Sigma$ and all $0 \leq e \leq 1$:

$\displaystyle |\{ y : f(x, y) \leq e \}| < 2^{eK(x) + 1}$
That is, there is a restriction on the number of strings that are close to $x$. There is a sensible reason for such a convoluted condition: this is the Kolmogorov-complexity analogue of the Kraft inequality. One of the picky details we've blatantly left out in our discussion of Kolmogorov complexity is that the programs we're allowed to write must collectively form a prefix-code. That is, no program is a proper prefix of another program. If the implications of this are unclear (or confusing), the reader may safely ignore it. It is purely a tool for theoretical analysis, and the full details are again in the text of Li & Vitanyi. We will come back to discuss other issues with this density condition later (in the meantime, think about why it's potentially dubious), but now let us define our similarity metric.
Definition: The normalized information distance $d(x, y)$ is defined by

$\displaystyle d(x, y) = \frac{\max(K(x \mid y^*), K(y \mid x^*))}{\max(K(x), K(y))}$
The reason we switched from $K(x \mid y)$ to $K(x \mid y^*)$ will become apparent in our calculations (we will make heavy use of the symmetry of information, which does not hold up to an additive constant for $K(x \mid y)$).
Quickly note that this alleviates the empty string problem we had with the non-normalized metric: $d(x, \varepsilon) = K(x \mid \varepsilon^*)/K(x) \approx 1$, so the two are maximally dissimilar regardless of what $x$ is.
We will prove two theorems about this function:
Theorem 1: (Metric Axioms) $d(x, y)$ satisfies the metric axioms up to additive $O(1/M)$ precision, where $M$ is the maximum of the Kolmogorov complexities of the strings involved in the (in)equality.
Theorem 2: (Universality) $d(x, y)$ is universal with respect to the class of computable normalized distance functions. That is, if $f$ is a normalized distance, then for all $x, y$ we have the following inequality:

$\displaystyle d(x, y) \leq f(x, y) + O(1/m)$

where this time $m$ is the minimum of the Kolmogorov complexities of the strings involved.
We should note that in fact theorem 2 holds for even more general normalized distance functions, the so-called "upper semi-computable" functions. Skipping the rigorous definition, this just means that one can recursively approximate the true value by giving a consistently improved upper bound which converges to the actual value. It is not hard to see that $K(x \mid y)$ is an upper semi-computable function, although it is unknown whether $d(x, y)$ is (and many believe it is not).
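For the record, here is one standard way to make "upper semi-computable" precise (our paraphrase of the usual definition, not a quotation): $g$ is upper semi-computable if there is a total computable function $\phi(x, y, t)$, nonincreasing in $t$, with

$\displaystyle \lim_{t \to \infty} \phi(x, y, t) = g(x, y).$

For $g = K(x \mid y)$ one can take $\phi(x, y, t)$ to be the length of the shortest program discovered, within $t$ dovetailed simulation steps, to output $x$ when given $y$ (starting from the trivial "print $x$" program as the initial bound).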
The proof of the first theorem is straightforward but notationally dense.
Proof of Theorem 1 (Metric Axioms): The value $d(x, x) = 0$, since $K(x \mid x^*)$ is trivially constant, and $d(x, y) \geq 0$ since Kolmogorov complexity is non-negative. Moreover, $d(x, y)$ is exactly symmetric, so the proof boils down to verifying the triangle inequality holds.
Let $x, y, z$ be strings. We gave a proof above that $K(x \mid y^*) \leq K(x \mid z^*) + K(z \mid y^*) + c$. We will modify this inequality to achieve our desired result, and there are two cases:
Case 1: $K(z) \leq \max(K(x), K(y))$. Take the maximum of each side of the two inequalities for $K(x \mid y^*)$ and $K(y \mid x^*)$ to get

$\displaystyle \max(K(x \mid y^*), K(y \mid x^*)) \leq \max(K(x \mid z^*) + K(z \mid y^*), K(y \mid z^*) + K(z \mid x^*)) + c$
We can further increase the right hand side by taking termwise maxima:

$\displaystyle \max(K(x \mid y^*), K(y \mid x^*)) \leq \max(K(x \mid z^*), K(z \mid x^*)) + \max(K(z \mid y^*), K(y \mid z^*)) + c$
Now divide through by $\max(K(x), K(y))$ to get

$\displaystyle \frac{\max(K(x \mid y^*), K(y \mid x^*))}{\max(K(x), K(y))} \leq \frac{\max(K(x \mid z^*), K(z \mid x^*))}{\max(K(x), K(y))} + \frac{\max(K(z \mid y^*), K(y \mid z^*))}{\max(K(x), K(y))} + \frac{c}{\max(K(x), K(y))}$
Finally, since $K(z)$ is smaller than the max of $K(x), K(y)$, we can replace the $K(y)$ in the denominator of the first term of the right hand side by $K(z)$. This will only possibly increase the fraction, and for the same reason we can replace $K(x)$ by $K(z)$ in the second term. This achieves the triangle inequality $d(x, y) \leq d(x, z) + d(z, y)$ up to an additive $c / \max(K(x), K(y))$, as desired.
Case 2: $K(z) > \max(K(x), K(y))$. Without loss of generality we may also assume $K(x) \geq K(y)$, for the other possibility has an identical argument. Now we can boil the inequality down to something simpler. We already know the denominators all have to be $K(z)$ in the right hand side, and $K(x)$ in the left. Moreover, we claim $K(z \mid x^*) \geq K(x \mid z^*)$. This is by the symmetry of information:

$\displaystyle K(x) + K(z \mid x^*) = K(z) + K(x \mid z^*) \geq K(x) + K(x \mid z^*)$
Subtracting $K(x)$ establishes the claim, and similarly we have $K(z \mid y^*) \geq K(y \mid z^*)$. So (noting that $K(x) \geq K(y)$ also forces the numerator on the left to be $K(x \mid y^*)$) the triangle inequality reduces to

$\displaystyle \frac{K(x \mid y^*)}{K(x)} \leq \frac{K(z \mid x^*)}{K(z)} + \frac{K(z \mid y^*)}{K(z)}$
Applying our original inequality again to get $K(x \mid y^*) \leq K(x \mid z^*) + K(z \mid y^*) + c$, we may divide through by $K(x)$, and there are two additional cases.
If the right-hand side is less than or equal to 1, then adding a nonnegative quantity to the top and bottom of the fraction only increases the value of the fraction, and doesn't violate the inequality. So we choose to add $K(z) - K(x)$ to the top and bottom of the right-hand side and, again using the symmetry of information (which gives $K(x \mid z^*) + K(z) - K(x) = K(z \mid x^*)$), we get exactly the required value.
If the right-hand side is greater than 1, then adding any nonnegative quantity to the top and bottom decreases the value of the fraction, but it still remains greater than 1. Since $K(x \mid y^*) \leq K(x)$ (a simple exercise), we see that the left-hand side is at most 1, and our same trick of adding $K(z) - K(x)$ works.
The proof of the universality theorem is considerably more elegant.
Proof of Theorem 2 (Universality): Let $f$ be any normalized distance function, and set $e = f(x, y)$. Suppose further that $K(x) \leq K(y)$.
Let us enumerate all strings $v$ such that $f(x, v) \leq e$. In particular, since $e = f(x, y)$, $y$ is included in this enumeration. By the density condition, the number of such strings is at most $2^{eK(x) + 1}$. The index of $y$ in this enumeration can be used as an effective description of $y$ when given $x$ as input. That is, there is a program which includes in its description the index of $y$ and outputs $y$ given $x$. Since the number of bits needed to describe the index of $y$ is at most $eK(x) + 1$, we have

$\displaystyle K(y \mid x) \leq eK(x) + O(1)$
Again the symmetry of information lemma gives us $K(x \mid y^*) \leq K(y \mid x^*)$, since $K(x) \leq K(y)$. And now

$\displaystyle d(x, y) = \frac{K(y \mid x^*)}{K(y)} \leq \frac{K(y \mid x)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)}$
Since $K(x) \leq K(y)$, we can replace the denominator of the last expression with $K(x)$ (only increasing the fraction) to get

$\displaystyle d(x, y) \leq \frac{eK(x) + O(1)}{K(x)} = e + O\left(\frac{1}{K(x)}\right)$

But $e$ was just $f(x, y)$, so this completes the proof of this case.
In the case $K(y) \leq K(x)$, the proof is similar (enumerating the index of $x$ instead), and at the end we get

$\displaystyle d(x, y) \leq f(x, y) + O\left(\frac{1}{K(y)}\right)$
The theorem is proved.
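To make the encoding trick in this proof concrete, here is a toy sketch (ours; the names encode and decode and the finite universe are illustrative simplifications, whereas the actual proof enumerates all of $\Sigma$ lazily):

def encode(x, y, f, e, universe):
    # The "description" of y relative to x is its index in the enumeration
    # of all strings v with f(x, v) <= e. By the density condition this
    # index fits in at most e*K(x) + 1 bits.
    close = [v for v in universe if f(x, v) <= e]
    return close.index(y)

def decode(x, index, f, e, universe):
    # Given x (and knowledge of f, e, and the enumeration order), rebuild
    # the same list and read off the string at the stored index.
    close = [v for v in universe if f(x, v) <= e]
    return close[index]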
Why Normalized Distance Functions?
The practical implications of the two theorems are immense. What we're saying is that if we can represent some feature of string similarity by a normalized distance function, then that feature will be captured automatically by the normalized information distance $d$. The researchers who discovered normalized information distance (and proved its universality) argue that in fact upper semi-computable normalized distance functions encapsulate all real-world metrics we would ever care about! Of course, there is still the problem that Kolmogorov complexity is uncomputable, but we can certainly come up with reasonable approximations (we will see precisely this in our next post).
And these same researchers have shown that approximations to $d$ do represent a good deal of universality in practice. They've applied the same idea to fields as varied as genome clustering, language clustering, and music clustering. We will of course investigate the applications for ourselves on this blog, but their results seem to apply to data mining in any field.
But still this raises the obvious question (which goes unaddressed in any research article this author has read): does every metric have a sensible interpretation (or modification) as a normalized distance function? That awkward density condition seems particularly suspect, and is at the core of this author’s argument that the answer is no.
Consider the following example. Let $f$ be a normalized distance function, and fix $e = 1$. The density condition says that for any $x$ we want, the number of strings which are within distance 1 of $x$ is bounded by $2^{K(x) + 1}$. In particular, this quantity is finite, so there can only be finitely many strings which are within distance 1 of $x$. But there are infinitely many strings, and every one of them is within distance 1 of $x$ (the values of $f$ never exceed 1), so this is a contradiction!
Even if we rule out this (arguably trivial) case of $e = 1$, we still run into problems. Let $e = 1 - \delta$ for any sufficiently small $\delta > 0$. Then fix $x = 0$ (the string consisting of the single bit 0). The number of strings which are within distance $e$ of $x$ is bounded by $2^{eK(x) + 1}$, which is again finite (and quite small, since $0$ is about as simple as a string gets). In other words, there are only a finite number of strings that are not maximally dissimilar to $0$. But one can easily come up with an infinite number of strings which share something in common with $0$: just use $0^n$ for any $n$ you please. It is ludicrous to say that every metric should call $0^n$ as dissimilar to $0$ as the empty string is to a random string of a thousand bits.
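For a (purely hypothetical) sense of scale: if $K(0) \leq 10$, say, then for every $e < 1$ the density condition bounds the number of strings within distance $e$ of $0$ by

$\displaystyle 2^{eK(0) + 1} < 2^{10 + 1} = 2048;$

every other binary string is forced to be essentially maximally dissimilar to $0$.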
In general, this author doesn't find it likely that one can take any arbitrary function which is both symmetric and has values in $[0, 1]$ and modify it to satisfy the density condition. Indeed, this author has yet to see any example of a natural normalized similarity metric. There is one which is a modification of Hamming distance, but it is relatively awkward and involves the Kolmogorov complexity of the strings involved. If the reader has any ideas to the contrary, please share them in the comments.
So it appears that the class of normalized distance functions is not as large as we might wish, and in light of this the universality theorem is not as impressive. On the other hand, there is no denying the success of applying the normalized information distance to complex real-world problems. Something profound is going on, but from this author’s viewpoint more theoretical work is needed to establish why.
In practice we want to compute $d(x, y)$ in terms of quantities we can actually approximate. Due to the symmetry of information ($K(x \mid y^*) = K(x, y) - K(y)$ and $K(y \mid x^*) = K(x, y) - K(x)$), we can rewrite the metric formula as

$\displaystyle d(x, y) = \frac{K(x, y) - \min(K(x), K(y))}{\max(K(x), K(y))}$
Indeed, since our main interpretation of $K(x)$ is as the size of the smallest "compressed version" of the string $x$, it would seem that we can approximate the function $K$ by using real-world compression algorithms. And for the $K(x, y)$ part, we recognize that (due to the need to specify a way to distinguish between the outputs $x, y$)

$\displaystyle K(x, y) \leq K(xy) + O(\log(\min(|x|, |y|))),$

where $K(xy)$ is the Kolmogorov complexity of the concatenation of the two strings. So if we're willing to forgive additive logarithmic sloppiness (technically, $O(\log(\min(|x|, |y|)) / \max(K(x), K(y)))$ sloppiness, which goes to zero asymptotically), we can approximate the normalized information distance as

$\displaystyle d(x, y) \approx \frac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}$
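Here is a minimal sketch of this approximation (ours, using zlib as a stand-in compressor; the applications literature calls this style of estimate a normalized compression distance):

import os
import zlib

def C(s: bytes) -> int:
    # Stand-in for K(s): the length in bytes of the zlib-compressed string.
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Approximate d(x, y) = (K(xy) - min(K(x), K(y))) / max(K(x), K(y))
    # with compressed lengths in place of Kolmogorov complexities.
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy dog " * 19 + b"and naps "
c = os.urandom(900)  # incompressible random noise
print(ncd(a, b))  # small: the two strings share most of their structure
print(ncd(a, c))  # close to 1: essentially no shared structure

How well this tracks the true $d$ depends entirely on how well the chosen compressor approximates $K$, which is the practical question explored in the next post.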
In the literature researchers will also simplify the metric by removing the "star" notation:

$\displaystyle \frac{\max(K(x \mid y), K(y \mid x))}{\max(K(x), K(y))}$
Unfortunately these two things aren't equivalent. As we saw in our "basic properties" of $K(x \mid y)$,

$\displaystyle K(x \mid y^*) \leq K(x \mid y) \leq K(x \mid y^*) + O(\log(K(y)))$
Indeed, it is not the case that $K(x \mid y) = K(x \mid y^*)$ up to an additive constant. An easy counterexample comes from trying to equate $K(K(x) \mid x)$ with $K(K(x) \mid x^*)$. We have already proven that the right hand side is always constant, but the left hand side could not be. An exercise in Li & Vitanyi shows there is an infinite family of strings $x$ for which $K(K(x) \mid x)$ grows on the order of $\log(|x|)$.
And so these two metrics cannot be equal, but they are close. In fact, denoting the non-star version by $d'(x, y)$ and the regular version by $d(x, y)$, we have $|d(x, y) - d'(x, y)| = O(\log(\max(K(x), K(y))) / \max(K(x), K(y)))$. This changes the metric properties and the universality claim, because $O(1/\max(K(x), K(y)))$ precision is stronger than $O(\log(\max(K(x), K(y))) / \max(K(x), K(y)))$ precision. Indeed, the constant hidden in the sloppiness is always less than 1, but this means the metric can potentially take values slightly larger than 1, which is edging further and further away from the notion of normalization we originally strove for.
Finally, the last example of a cousin metric is

$\displaystyle \frac{K(x \mid y) + K(y \mid x)}{K(x, y)}$
We will leave it to the reader to verify this function again satisfies the metric inequalities (in the same way that the original normalized information distance does). On the other hand, it only satisfies universality up to a factor of 2. So while it still may give some nice results in practice (and it is easy to see how to approximate this), the first choice of normalized information distance was theoretically more precise.
Applications
We’ve just waded through a veritable bog of theory, but we’ve seen some big surprises along the way. Next time we’ll put these theoretical claims to the test by seeing how well we can cluster and classify data using the normalized information distance (and introducing as little domain knowledge as possible). Until then!