Information Distance — A Primer

This post assumes familiarity with our primer on Kolmogorov complexity. We recommend the uninformed reader begin there. We will do our best to keep consistent notation across both posts.

Kolmogorov Complexity as a Metric

Over the past fifty years mathematicians have been piling up more and more theorems about Kolmogorov complexity, and for good reason. One of the main interpretations of the Kolmogorov complexity function $ K$ is that for a given string $ x$, $ K(x)$ is the best theoretical compression of $ x$ under any compression scheme. So a negative result about $ K$ can provide useful bounds on how good a real-world compressor can be. It turns out that these properties also turn $ K$ into a useful tool for machine learning. The idea is summarized as follows:

Let $ x,y$ be binary strings, and as usual let’s fix some universal programming language $ L$ in which to write all of our programs. Let $ p(x,y)$ be the shortest program which computes both $ y$ when given $ x$ as an input, and $ x$ given $ y$. We would imagine that if $ x,y$ are unrelated, then the length of the program $ |p(x,y)|$ would be roughly $ K(x) + K(y)$, simply by running the shortest program to output $ x$ using no inputs, followed by the same thing for $ y$. As usual there will be some additive constant term independent of both $ x$ and $ y$. We denote this by $ c$ or $ O(1)$ interchangeably.

We would further imagine that if $ x,y$ are related (that is, if there is some information about $ x$ contained in $ y$ or vice versa), then the program $ p(x,y)$ would utilize that information and hence be shorter than $ K(x) + K(y)$. It turns out that there is an even better way to characterize $ p$, and with a few modifications we can turn the length of $ p$ into something similar to a metric on the set of all strings.

This metric has some strikingly attractive features. We will see that it is “universal” with respect to a certain class of distance functions (which is unfortunately not the class of all metrics). In particular, for any of these functions $ f$, the length $ |p(x,y)|$ will be at worst a small amount larger than $ f(x,y)$. In words, if $ x$ and $ y$ are similar according to any of these distance functions, then they will be similar according to $ p$. Of course the devil is in the details, but this is the right idea to have in mind while we wade through the computations.

An Aside on Metrics, and Properties of Kolmogorov Complexity

In recent posts on this blog we’ve covered a number of important examples of metrics and investigated how a metric creates structure in a space. But as powerful and rare as fruitful metrics are, we have barely scratched the surface of the vast amount of literature on the subject.

As usual with our computations in Kolmogorov complexity, all of our equalities will be true up to some kind of additive sloppiness. Most of the time it will be an additive constant $ O(1)$ which is independent of anything else in the equation. We will usually omit the constant with that implicit understanding, and instead we will specify the times when it is an exact equality (or when the additive sloppiness is something other than a constant).

And so, unavoidably, the “metric” we define won’t be a true metric. It will only satisfy the metric properties (positive definite, symmetric, triangle inequality) up to a non-constant additive sloppiness. This will be part of the main theorem of this post.

Before we can reach the heart of the matter (and as a nice warm-up), we need to establish a few more properties of $ K$. Recall that by $ K(x|y)$ we mean the shortest program which computes $ x$ when provided $ y$ as an auxiliary input. We call this the conditional complexity of $ x$ given $ y$. Further, recall that $ K(x,y)$ is the length of the shortest program which outputs both $ x$ and $ y$, and a way to distinguish between the two (if everything is in binary, the distinguishing part is nontrivial; should the reader be interested, this sort of conversation is made for comment threads). Finally, the comma notation works for auxiliary inputs as well: $ K(x|y,z)$ is the length of the shortest program outputting $ x$ when given $ y,z$ and a way to distinguish them as input.

For example, the conditional Kolmogorov complexity $ K(1^n | n) = c$ is constant: the length of the string $ 1^n$ provides all but a finite amount of information about it. On the other hand, if $ x,y$ are random strings (their bits are generated independently and uniformly at random), then $ K(y|x) = K(y)$; there is no information about $ y$ contained in $ x$.
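For a bit of extra intuition, thinking of programs informally as Python snippets (this is only a stand-in for our fixed language $ L$, not part of the theory), the following “program” computes $ 1^n$ from the input $ n$, and its length clearly does not grow with $ n$:

def ones(n):
    # A constant-length "program" computing the string 1^n from the input n.
    return "1" * n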

Definition: Let $ x$ be a (binary) string. We denote by $ x^*$ the shortest program which computes $ x$. That is, $ K(x) = |x^*|$. If there are two shortest programs which compute $ x$, then $ x^*$ refers to the first in the standard enumeration of all programs.

As a quick aside, the “standard enumeration” is simple: list all binary strings in order of increasing length, and lexicographically within each length (equivalently, treat the string $ s$ as the natural number whose binary representation is $ 1s$, which avoids collisions between strings like 0 and 00, and enumerate strings in increasing order of their corresponding numbers). The choice of enumeration is irrelevant, though; all that matters is that it is consistent throughout our theory.
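For concreteness, here is a hedged sketch of one such enumeration in Python; any other fixed, computable ordering would serve equally well.

from itertools import count, product

def standardEnumeration():
    # Yield every binary string exactly once: '', '0', '1', '00', '01', '10', '11', ...
    yield ""
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)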

Proposition: Kolmogorov complexity has the following properties up to additive constants:

  1. $ K(x|y^*) = K(x|y,K(y))$
  2. $ K(x|y^*) \leq K(x|y)$, and $ K(x|y) \leq K(x|y^*) + O(\log(K(y)))$
  3. $ K(x,y) = K(x) + K(y|x^*)$

The first item simply states that giving $ y^*$ as input to a program is the same as giving $ y$ and $ K(y)$. This is not hard to prove. If $ p$ is the shortest program computing $ x$ from $ y,K(y)$, then we can modify it slightly to work with $ y^*$ instead. Just add to the beginning of $ p$ the following instructions:

Compute K(y) as the length of the input y*
Simulate y* and record its output y

Since $ y^*$ is a finite string and represents a terminating program, these two steps produce the values needed to run $ p$. Moreover, the program description is constant in length, independent of $ y^*$.

On the other hand, if $ q$ is a program computing $ x$ from $ y^*$, we are tasked with finding $ y^*$ given $ y, K(y)$. The argument uses a standard but slightly more complicated technique in theoretical computer science called dovetailing. In particular, since we know the length of $ y^*$, and there are only finitely many programs of the same length, we can get a list $ p_1, p_2, \dots p_n$ of all programs of length $ K(y)$. We then interleave the simulation of each of these programs; that is, we run the first step of all of the $ p_i$, then the second, third, and so on. Once we find a program which halts and outputs $ y$ (and we are guaranteed that one will do so) we can stop. In pseudocode, this is just the subroutine:

L = [all programs of length K(y) in lexicographic order]
i = 1
while True:
   for program in L:
      run step i of program
      if program terminates and outputs y:
         return program
   i = i + 1

The fact that this algorithm will terminate proves the claim.
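To make the interleaving concrete, here is a hedged sketch in Python. Representing a “program” as a generator that yields once per simulated step and returns its output when it halts is purely a convenience for illustration, not part of the theory.

def dovetail(programs, target):
    # Interleave the simulations: run one more step of every still-running program
    # per round, and stop as soon as one halts with the target output.
    running = list(programs)
    while running:
        stillRunning = []
        for prog in running:
            try:
                next(prog)                    # simulate one more step
                stillRunning.append(prog)
            except StopIteration as halted:
                if halted.value == target:    # this program halted and output y
                    return prog
                # programs halting with the wrong output are simply dropped
        running = stillRunning
    return None

def makeProgram(steps, output):
    # A stand-in "program" that runs for `steps` steps and then outputs `output`.
    def run():
        for _ in range(steps):
            yield
        return output
    return run()

# Even though the last program runs for an enormous number of steps, the
# interleaving finds the second one (which outputs "101") after a few rounds.
winner = dovetail([makeProgram(5, "0"), makeProgram(3, "101"), makeProgram(10**9, "11")], "101")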

The second item in the proposition has a similar proof, and we leave it as an exercise to the reader. (Hint: the logarithm in the second part of the statement comes from the hard-coding of a binary representation of the number $ K(y)$)

The third item, that $ K(x,y) = K(x) + K(y|x^*)$ has a much more difficult proof, and its consequences are far-reaching. We will use it often in our computations. The intrepid reader will see Theorem 3.9.1 in the text of Li & Vitanyi for a complete proof, but luckily one half of the proof is trivial. That is, the proof that $ K(x,y) \leq K(x) + K(y|x^*) + c$ is similar to the argument we used above. Let $ p,q$ be the shortest programs computing $ x$ and computing $ y$ given $ x^*$, respectively. We can combine them into a program computing $ x$ and $ y$. First run $ p$ to compute $ x$ and compute the length of $ p$. As we saw, these two pieces of data are equivalent to $ x^*$, and so we can compute $ y$ using $ q$ as above, adding at most a finite amount of program text to do so.

This property is so important it has a name.

Lemma: (Symmetry of information)

$ \displaystyle K(x,y) = K(x) + K(y|x^*) = K(y) + K(x|y^*)$

This is true (and named appropriately) since there is symmetry in the quantity $ K(x,y) = K(y,x)$. Note in particular that this doesn’t hold without the star: $ K(x,y) = K(x) + K(y|x) + O(\log(K(x)))$. Those readers who completed the exercise above will know where the logarithm comes from.

The Almost-Triangle Inequality

The first application of the symmetry of information is (surprisingly) a variant of the triangle inequality. Specifically, the function $ f(x,y) = K(x|y^*)$ satisfies the metric inequalities up to an additive constant sloppiness.

$ \displaystyle K(x|y^*) \leq K(x|z^*) + K(z|y^*) + c$

where $ c$ does not depend on $ x, y, z$. To prove this, see that

$ \displaystyle K(x,z | y^*) = K(x,y,z) - K(y) \leq K(z) + K(x|z^*) + K(y|z^*) - K(y)$

The first equality is by the symmetry of information $ K(x,y,z) = K(y) + K(x,z|y^*)$, and the second follows from the fact that $ K(x,y,z) \leq K(z) + K(x|z^*) + K(y|z^*)$. This is the same argument we used to prove the $ \leq$ case of the symmetry of information lemma.

Now we can rearrange the terms and use the symmetry of information twice, $ K(z) + K(y|z^*) = K(y,z)$ and $ K(y,z) – K(y) = K(z|y^*)$, to reach the final result.

This is interesting because it’s our first indication that Kolmogorov complexity can play a role in a metric. But there are some issues: $ K(x|y)$ is in general not symmetric. We need to come up with a symmetric quantity to use instead. There are quite a few details to this process (see this paper if you want to know them all), but the result is quite nice.

Theorem: Let $ E(x,y)$ be the length of the shortest program which computes $ x$ given $ y$ as input and $ y$ given $ x$. Then

$ \displaystyle E(x,y) = \max (K(x|y), K(y|x)) + O(\log(M))$

where $ M = \max(K(x|y), K(y|x))$.

That is, our intuitive idea of what the “information distance” from $ x$ to $ y$ should be coincides up to an additive logarithmic factor with the maximum of the conditional Kolmogorov complexities. If two strings are “close” with respect to $ E$, then there is a lot of mutual information between them. In the same paper listed above, the researchers (Bennett et al.) prove that $ E$ is a “metric” (up to additive constants) and so this gives a reasonable estimate for the true information distance in terms of conditional Kolmogorov complexities.

However, $ E$ is not the final metric used in applications, but just an inspiration for other functions. This is where the story gets slightly more complicated.

Normalized Information Distance(s)

At this point we realize that the information distance $ E$ defined above is not as good as we’d like it to be. One of its major deficiencies is that it does not compute relative distances very well. That is, it doesn’t handle strings of varying size as well as it should.

For example, take $ x$ to be a random string of length $ n$ for arbitrary $ n$. The quantity $ E(x, \varepsilon)$, where $ \varepsilon$ is the empty string, is just $ K(x) + c$ (if the input is empty, compute $ x$; otherwise output the empty string). But in a sense there is no information about $ \varepsilon$ in any string. In other words, $ \varepsilon$ is maximally dissimilar to all nonempty strings. But according to $ E$, the empty string is variably dissimilar to other strings: it’s “less similar” to strings with higher Kolmogorov complexity. This is counter-intuitive, and hence undesirable.

Unfortunately the literature is littered with alternative distance functions, and the researchers involved spend little effort relating them to each other (this is part of the business of defining things “up to sloppiness”). We are about to define the principal example we will be concerned with, and we will discuss its relationship with its computationally-friendly cousins at the end.

The link between all of these examples is normalization. That is (again up to minor additive sloppiness we’ll make clear shortly) the distance functions take values in $ [0,1]$, and a value of 0 means the strings are maximally similar, and a value of 1 implies maximal dissimilarity.

Definition: Let $ \Sigma = \left \{ 0,1 \right \}^*$ be the set of binary strings. A normalized distance $ f$ is a function $ \Sigma \times \Sigma \to [0,1]$ which is symmetric and satisfies the following density condition for all $ x \in \Sigma$ and all $ 0 \leq e \leq 1$:

$ \displaystyle |\left \{ y : f(x,y) \leq e \right \}| < 2^{eK(x) + 1}$

That is, there is a restriction on the number of strings that are close to $ x$. There is a sensible reason for such a convoluted condition: this is the Kolmogorov-complexity analogue of the Kraft inequality. One of the picky details we’ve blatantly left out in our discussion of Kolmogorov complexity is that the programs we’re allowed to write must collectively form a prefix-code. That is, no program is a proper prefix of another program. If the implications of this are unclear (or confusing), the reader may safely ignore it. It is purely a tool for theoretical analysis, and the full details are again in the text of Li & Vitanyi. We will come back to discuss other issues with this density condition later (in the meantime, think about why it’s potentially dubious), but now let us define our similarity metric.

Definition: The normalized information distance $ d(x,y)$ is defined by

$ \displaystyle d(x,y) = \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))}$

The reason we switched from $ K(x|y)$ to $ K(x|y^*)$ will become apparent in our calculations (we will make heavy use of the symmetry of information, which holds only up to a logarithmic term, not an additive constant, when $ K(x|y)$ is used in place of $ K(x|y^*)$).

Quickly note that this alleviates our empty string problem we had with the non-normalized metric. $ d(x,\varepsilon) = K(x)/K(x) = 1$, so they are maximally dissimilar regardless of what $ x$ is.

We will prove two theorems about this function:

Theorem 1: (Metric Axioms) $ d(x,y)$ satisfies the metric axioms up to additive $ O(1/M)$ precision, where $ M$ is the maximum of the Kolmogorov complexities of the strings involved in the (in)equality.

Theorem 2: (Universality) $ d(x,y)$ is universal with respect to the class of computable normalized distance functions. That is, if $ f$ is a normalized distance, then for all $ x,y$ we have the following inequality:

$ d(x,y) \leq f(x,y) + O(1/M)$

where this time $ M$ is the minimum of the Kolmogorov complexities of the strings involved.

We should note that in fact theorem 2 holds for even more general normalized distance functions, the so-called “upper semi-computable” functions. Skipping the rigorous definition, this just means that one can recursively approximate the true value by giving a consistently improved upper bound which converges to the actual value. It is not hard to see that $ K$ is an upper semi-computable function, although it is unknown whether $ d$ is (and many believe it is not).

The proof of the first theorem is straightforward but notationally dense.

Proof of Theorem 1 (Metric Axioms): The value $ d(x,x) = K(x|x^*)/K(x) = O(1/K(x))$, since $ K(x|x^*) = K(x|x,K(x))$ is trivially constant, and $ d(x,y) \geq 0$ since Kolmogorov complexity is non-negative. Moreover, $ d(x,y)$ is exactly symmetric, so the proof boils down to verifying the triangle inequality holds.

Let $ x,y,z$ be strings. We gave a proof above that $ K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$. We will modify this inequality to achieve our desired result, and there are two cases:

Case 1: $ K(z) \leq \max(K(x), K(y))$. Take the maximum of each side of the two inequalities for $ K(x|y^*), K(y|x^*)$ to get

$ \displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*) + K(z|y^*) , K(y|z^*) + K(z|x^*)) + O(1)$

We can further increase the right hand side by taking termwise maxima

$ \displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*), K(z|x^*)) + \max(K(y|z^*), K(z|y^*)) + O(1)$

Now divide through by $ \max(K(x), K(y))$ to get

$ \displaystyle \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))} \leq \frac{\max(K(x|z^*), K(z|x^*))}{\max(K(x), K(y))} + \frac{\max(K(y|z^*), K(z|y^*))}{\max(K(x), K(y))} + O(1/M)$

Finally, since $ K(z)$ is smaller than the max of $ K(x), K(y)$, we can replace the  $ K(y)$ in the denominator of the first term of the right hand side by $ K(z)$. This will only possibly increase the fraction, and for the same reason we can replace $ K(x)$ by $ K(z)$ in the second term. This achieves the triangle inequality up to $ O(1/M)$, as desired.

Case 2: $ K(z) = \max(K(x), K(y), K(z))$. Without loss of generality we may also assume $ K(x) \geq K(y)$, for the other possibility has an identical argument. Now we can boil the inequality down to something simpler. We already know the denominators have to all be $ K(z)$ in the right hand side, and $ K(x)$ in the left. Moreover, we claim $ K(z|x^*) \geq K(x|z^*)$. This is by the symmetry of information:

$ \displaystyle K(x,z) = K(x|z^*) + K(z) = K(z|x^*) + K(x) \leq K(z|x^*) + K(z)$

Subtracting $ K(z)$ establishes the claim, and similarly we have $ K(z|y^*) \geq K(y|z^*)$. So the triangle inequality reduces to

$ \displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(z|x^*)}{K(z)} + \frac{K(z|y^*)}{K(z)} + O(1/K(z))$

Applying our original inequality again to get $ K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$, we may divide through by $ K(x)$ and there are two additional cases.

$ \displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + O(1)}{K(x)}$

If the right-hand side is less than or equal to 1, then adding a constant $ c$ to the top and bottom of the fraction only increases the value of the fraction, and doesn’t violate the inequality. So we choose to add $ K(z)-K(x)$ to the top and bottom of the right-hand side and again using the symmetry of information, we get exactly the required value.

If the right-hand side is greater than 1, then adding any constant to the top and bottom decreases the value of the fraction, but it still remains greater than 1. Since $ K(x|y^*) \leq K(x)$ (a simple exercise), we see that the left-hand side is at most 1, and our same trick of adding $ K(z) - K(x)$ works. $ \square$

The proof of the universality theorem is considerably more elegant.

Proof of Theorem 2 (Universality): Let $ f$ be any normalized distance function, and set $ e = f(x,y)$. Suppose further that $ K(x) \leq K(y)$.

Let us enumerate all strings $ v$ such that $ f(x,v) \leq e$. In particular, since $ e = f(x,y)$, $ y$ is included in this enumeration. By the density condition, the number of such strings is at most $ 2^{eK(x) + 1}$. The index of $ y$ in this enumeration can be used as an effective description of $ y$ when given $ x$ as input. That is, there is a program which includes in its description the index of $ y$ and outputs $ y$ given $ x$. Since the number of bits needed to describe the index of $ y$ is at most $ \log(2^{eK(x) + 1}) = eK(x) + 1$, we have

$ \displaystyle K(y|x) \leq eK(x) + 1$

Again the symmetry of information lemma gives us $ K(x|y^*) \leq K(y|x^*)$. And now

$ \displaystyle d(x,y) = \frac{K(y|x^*)}{K(y)} \leq \frac{K(y|x) + O(1)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)}$

Since $ K(x) \leq K(y)$, we can replace the denominator of the last expression with $ K(x)$ (only increasing the fraction) to get $ d(x,y) \leq e + O(1/K(x))$. But $ e$ was just $ f(x,y)$, so this completes the proof of this case.

In the case $ K(y) < K(x)$, the proof is similar (enumerating the index of $ x$ instead), and at the end we get

$ \displaystyle d(x,y) = \frac{K(x|y^*)}{K(x)} \leq \frac{eK(y) + O(1)}{K(y)} = f(x,y) + O(1/K(y))$

The theorem is proved. $ \square$

Why Normalized Distance Functions?

The practical implications of the two theorems are immense. What we’re saying is that if we can represent some feature of string similarity by a normalized distance function, then that feature will be captured automatically by the normalized information distance $ d$. The researchers who discovered normalized information distance (and proved its universality) argue that in fact upper semi-computable normalized distance functions encapsulate all real-world metrics we would ever care about! Of course, there is still the problem that Kolmogorov complexity is uncomputable, but we can certainly come up with reasonable approximations (we will see precisely this in our next post).

And these same researchers have shown that approximations to $ d$ do represent a good deal of universality in practice. They’ve applied the same idea to fields as varied as genome clustering, language clustering, and music clustering. We will of course investigate the applications for ourselves on this blog, but their results seem to apply to data mining in any field.

But still this raises the obvious question (which goes unaddressed in any research article this author has read): does every metric have a sensible interpretation (or modification) as a normalized distance function? That awkward density condition seems particularly suspect, and is at the core of this author’s argument that the answer is no.

Consider the following example. Let $ f$ be a normalized distance function, and fix $ e = 1$. The density condition says that for any $ x$ we want, the number of strings which are within distance 1 of $ x$ is bounded by $ 2^{K(x) + 1}$. In particular, this quantity is finite, so there can only be finitely many strings which are within distance 1 of $ x$. But there are infinitely many strings, so this is a contradiction!

Even if we rule out this (arguably trivial) case of $ e=1$, we still run into problems. Let $ e = 1 - \varepsilon$ for any sufficiently small $ \varepsilon > 0$. Then fix $ x = 0$ (the string consisting of the single bit 0). The number of strings which are within distance $ e$ of $ x$ is bounded by $ 2^{eK(x) + 1} < 2^{K(x) + 1}$, which is again finite (and quite small, since $ K(0)$ is about as simple as it gets). In other words, there are only a finite number of strings that are not maximally dissimilar to $ 0$. But one can easily come up with an infinite number of strings which share something in common with $ 0$: just use $ 0^n$ for any $ n$ you please. It is ludicrous to say that every metric should call $ 0$ as dissimilar to $ 0^n$ as the empty string is to a random string of a thousand bits.

In general, this author doesn’t find it likely that one can take any arbitrary $ f(x,y)$ which is both symmetric and has values in $ [0,1]$ and modify it to satisfy the density condition. Indeed, this author has yet to see any example of a natural normalized similarity metric. There is one which is a modification of Hamming distance, but it is relatively awkward and involves the Kolmogorov complexity of the strings involved. If the reader has any ideas to the contrary, please share them in the comments.

So it appears that the class of normalized distance functions is not as large as we might wish, and in light of this the universality theorem is not as impressive. On the other hand, there is no denying the success of applying the normalized information distance to complex real-world problems. Something profound is going on, but from this author’s viewpoint more theoretical work is needed to establish why.

Friendly Cousins of Normalized Information Distance

In practice we want to compute $ K(x|y^*)$ in terms of quantities we can actually approximate. Due to the symmetry of information, we can rewrite the metric formula as

$ \displaystyle d(x,y)=\frac{K(x,y) - \min(K(x), K(y))}{\max(K(x), K(y))}$

Indeed, since our main interpretation of $ K(x)$ is as the size of the smallest “compressed version” of the string $ x$, it would seem that we can approximate the function $ K$ by using real-world compression algorithms. And for the $ K(x,y)$ part, we recognize that (due to the need to specify a way to distinguish between the outputs $ x,y$)

$ K(x,y) \leq K(xy) + O(\log(\max(K(x), K(y))))$,

where $ K(xy)$ is the Kolmogorov complexity of the concatenation of the two strings. So if we’re willing to forgive additive logarithmic sloppiness (technically, $ O(\log(K(x))/K(x))$ sloppiness, which goes to zero asymptotically), we can approximate normalized information distance as

$ \displaystyle d(x,y) = \frac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}$
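As a hedged sketch of what this looks like in practice, we can stand in for $ K$ with an off-the-shelf compressor. Here we use Python’s zlib; the choice of compressor, the encoding, and the test strings are all assumptions of this example, and the resulting values are only rough approximations.

import zlib

def C(s):
    # Approximate K(s) by the length of the zlib-compressed encoding of s.
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x, y):
    # The compression-based approximation of normalized information distance above.
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("abab" * 60, "ab" * 45))            # related strings: relatively small value
print(ncd("abab" * 60, "qzjxkvwpmfyh" * 20))  # unrelated strings: closer to 1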

In the literature researchers will also simplify the metric by removing the “star” notation

$ \displaystyle d(x,y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}$

Unfortunately these two things aren’t equivalent. As we saw in our “basic properties” of $ K(x|y)$,

$ K(x|y) \leq K(x|y^*) + O(\log(K(y)))$

Indeed, it is not the case that $ K(x|y) = K(x|y^*)$. An easy counterexample comes from comparing $ K(K(x) | x)$ with $ K(K(x) | x^*)$. We have already proven that the right hand side is always constant, but the left hand side cannot be bounded by any constant. An exercise in Li & Vitanyi shows there is an infinite family of strings $ x$ for which $ K(K(x) | x) \geq \log(|x|)$.

And so these two metrics cannot be equal, but they are close. In fact, denoting the non-star version by $ d_2$ and the regular version by $ d_1$, we have $ d_2(x,y) \leq d_1(x,y) + O(1)$. This changes the metric properties and the universality claim, because $ O(1/K)$ precision is stronger than $ O(1)$ precision. Indeed, the true constant is always less than 1 (e.g. when $ K(y) > K(x)$ it is $ K(y^*)/K(y)$), but this means the metric can potentially take values in the range $ [0,2]$, which is edging further and further away from the notion of normalization we originally strove for.

Finally, the last example of a cousin metric is

$ \displaystyle d_3(x,y) = \frac{K(x|y^*) + K(y|x^*)}{K(x,y)}$

We will leave it to the reader to verify this function again satisfies the metric inequalities (in the same way that the original normalized information distance does). On the other hand, it only satisfies universality up to a factor of 2. So while it still may give some nice results in practice (and it is easy to see how to approximate this), the first choice of normalized information distance was theoretically more precise.


We’ve just waded through a veritable bog of theory, but we’ve seen some big surprises along the way. Next time we’ll put these theoretical claims to the test by seeing how well we can cluster and classify data using the normalized information distance (and introducing as little domain knowledge as possible). Until then!

Topological Spaces — A Primer

In our last primer we looked at a number of interesting examples of metric spaces, that is, spaces in which we can compute distance in a reasonable way. Our goal for this post is to relax this assumption. That is, we want to study the geometric structure of space without the ability to define distance. That is not to say that some notion of distance necessarily exists under the surface somewhere, but rather that we include a whole new class of spaces for which no notion of distance makes sense. Indeed, even when there is a reasonable notion of a metric, we’ll still want to blur the lines as to what kinds of things we consider “the same.”

The reader might wonder how we can say anything about space if we can’t compute distances between things. Indeed, how could it even really be “space” as we know it? The short answer is: the reader shouldn’t think of a topological space as a space in the classical sense. While we will draw pictures and say some very geometric things about topological spaces, the words we use are only inspired by their classical analogues. In fact the general topological space will be a much wilder beast, with properties ranging from absolute complacency to rampant hooliganism. Even so, topological spaces can spring out of every mathematical cranny. They bring at least a loose structure to all sorts of problems, and so studying them is of vast importance.

Just before we continue, we should give a short list of how topological spaces are applied to the real world. In particular, this author is preparing a series of posts dedicated to the topological study of data. That is, we want to study the loose structure of data potentially embedded in a very high-dimensional metric space. But in studying it from a topological perspective, we aim to eliminate the dependence on specific metrics and parameters (which can be awfully constricting, and even impertinent to the overall structure of the data). In addition, topology has been used to study graphics, image analysis and 3D modelling, networks, semantics, protein folding, solving systems of polynomial equations, and loads of topics in physics.

Recalling Metric Spaces, and Open Sets

Now we turn to generalizing metric spaces. The key property which we wish to generalize is that of open sets. For a metric space, and the reader should be thinking of the real line, the Euclidean plane, or three-dimensional Euclidean space, the open sets are easy to find. One can think of them as just “things without a boundary.” On the real line these look like open intervals $ (a, b)$ and unions of open intervals. In the plane, these look more like open balls, that is, the interior of a disk, and unions of such balls.

To characterize this more mathematically, we define an open ball centered at $ x$ with radius $ \varepsilon$ in the real plane to be the set

$ \displaystyle B(x, \varepsilon) = \left \{ y \in \mathbb{R}^2 | d(x,y) < \varepsilon \right \}$

where $ d$ is the usual Euclidean metric on points in the plane. Whenever someone says open ball, the reader should picture the following:

An open ball of radius r, centered at the point x. [Wolfram Mathworld]

Now of course this doesn’t categorize all of the open sets, since we would expect the union of two of these things to also be open. In fact, it is not hard to see that even if we take an infinite (or uncountable!) union of these open balls centered at any points with any radii, we would still get something that “has no boundary.”

In addition, it appears we can also take intersections. That is, the intersection of two open balls should be open. But we have to be a bit more careful here, because we can break our intuition quite easily. In the case of the real line, I can take an intersection of open intervals which is definitely not open. For example, take the set of intervals $ \left \{ (1-1/n, 1+1/n) : n \in \mathbb{N} \right \}$. If we look at the intersection over all of these intervals, it is not hard to see that

$ \displaystyle \bigcap_{n \in \mathbb{N}} (1- 1/n, 1+1/n) = \left \{ 1 \right \}$

Specifically, the number 1 is in the intersection since it is contained in all of the open intervals. But any number $ x > 1$ cannot be in the intersection because for some large enough $ n$ it must be that $ 1 + 1/n < x$ (just solve this equation for $ n$ as a real number, and then take the ceiling). The case is similar for $ x < 1$, so the intersection can only be the singleton set $ \left \{ 1 \right \}$. This is clearly not an open interval.

So we just found that our intuition for open sets breaks down if we allow for infinite intersections, but everything else seems to work out. Furthermore, the definition of an open ball relied on nothing about Euclidean space except that it has a metric. We’re starting to smell a good definition:

Definition: Let $ X$ be a metric space with metric $ d$. An open set in $ X$ is either:

  • A union of any collection of open balls $ B(x, \varepsilon)$ where $ x \in X$, or
  • A finite intersection of such open balls.

A set is closed if it is the complement of an open set.

In fact, this characterization of open sets is so good that we can redefine a bunch of properties of metric spaces just in terms of open sets. This is important because in a minute we will actually define a topological space by declaring which sets are open. Before we do that, let’s remain in the friendly world of metric spaces to investigate some of those redefinitions.

Neighborhoods, Sequences, and Continuous Functions

There is an essential switch in going from metric spaces to topological spaces that one must take, and it involves the concept of a neighborhood.

Definition: Let $ x \in X$ be a point in a metric space $ X$. A neighborhood of $ x$ is any open set $ U$ containing $ x$. More specifically, we can distinguish between an open neighborhood and a closed neighborhood, but without qualifiers we will always mean an open neighborhood.

In particular, the concept of a neighborhood will completely replace the idea of a metric. We will say things like, “for any neighborhood…” and “there exists a neighborhood…”, which will translate in the case of metric spaces to, “for any sufficiently close point…” and “there exists a sufficiently close point…” The main point for this discussion, however, is that if open sets were defined in some other way, the definition would still apply.

Perhaps the simplest example of such a definition is that of a sequence converging. Recall the classical definition in terms of metrics:

Definition: Let $ X$ be a metric space with metric $ d$, and let $ a_n$ be a sequence of elements in $ X$. We say $ a_n$ converges to $ a \in X$ if for any $ \varepsilon > 0$, there is some sufficiently large $ N$ so that the distance $ d(a_n, a) < \varepsilon$ whenever $ n > N$.

In other words, after the $ N$-th point in the sequence, the values will always stay within a tiny distance $ \varepsilon$ of $ a$, and we can pick that tiny distance to be arbitrarily small. So the sequence must converge to $ a$.

This naturally gives rise to a definition in terms of open neighborhoods of $ a$:

Definition: Let $ X, a_n, a$ be as in the previous definition. We say that $ a_n$ converges to $ a$ if for any open neighborhood $ U$ of $ a$, there is some sufficiently large $ N$ so that $ a_n \in U$ for all $ n > N$.

In particular, these two definitions are equivalent. Before we give the proof, the reader should be warned that pictures will make this proof obvious (but not rigorous), so we encourage the reader to follow along with a piece of paper. Open balls are drawn as circles despite the dimension of the space, and open neighborhoods are usually just drawn as “blobs” containing a certain point.

An open neighborhood V of a point p, and an open ball around p contained in V

To see the definitions are equivalent, suppose $ a_n$ converges as in the second definition. Then given an $ \varepsilon$, we can choose a particular choice of open neighborhood to satisfy the constraints of the first definition: just choose the open ball $ B(a, \varepsilon)$. This will translate in terms of the metric precisely to the first definition. Conversely if the first definition holds, all we need to show is that for any open neighborhood $ U$ of any point $ y$, we can always find an open ball $ B(y, \varepsilon)$ contained entirely in $ U$. We can apply this to pick that open ball around $ a$, and use the first definition to show that all of the $ a_n$ will be inside that open ball (and hence inside $ U$) forever onward.

The fact that we can always find such an open ball follows from the triangle inequality. If the open set $ U$ in question is a union of open balls, then the point $ y$ lies within some open ball $ B(x, r)$ contained in $ U$. The following picture should convince the reader that we can pick a ball around $ y$ contained in $ B(x, r)$.

Finding an open ball centered at y inside an open ball centered at x. (source: Wikibooks)

Specifically, pick the radius $ \varepsilon$ so that $ d(x,y) + \varepsilon < r$; then any point $ z$ inside the ball centered at $ y$ is also in the ball centered at $ x$. We can see this by simply drawing the triangle connecting these three points and applying the triangle inequality to show that $ d(x,z) < r$. A similar idea works if $ U$ is a finite intersection of open balls $ B_i$, where we just take the smallest ball around $ y$ of those we get by applying the above picture to each $ B_i$.

The other main definition we want to convert to the language of open sets is that of a continuous function. In particular, when we study metric spaces in pure mathematics, we are interested in the behavior of continuous functions between them (more so, even, than the properties of the spaces themselves). Indeed, when we study calculus in high school and university, this is all we care about: we want to look at minima and maxima of continuous functions, we want to study derivatives (the instantaneous rate of change) of a continuous function, and we want to prove theorems that hold for all continuous functions (such as the mean value theorem).

Identically, in topology, we are interested in the behavior of continuous functions on topological spaces. In fact, we will use special kinds of continuous functions to “declare” two spaces to be identical. We will see by the end of this post how this works, but first we need a definition of a continuous function in terms of open sets. As with neighborhoods, recall the classical definition:

Definition: A function $ f:X \to Y$ of metric spaces with metrics $ d_X, d_Y$ is called continuous if for all $ \varepsilon > 0$ there is a $ \delta > 0$ such that whenever $ d_X(x, y) < \delta$ the distance $ d_Y(f(x), f(y)) < \varepsilon$.

In words, whenever $ x,y$ are close in $ X$, it follows that $ f(x), f(y)$ are close in $ Y$.

Naturally, the corresponding definition in terms of open sets would be something along the lines of “for any open neighborhood $ V$ of $ f(x)$ in $ Y$, there is an open neighborhood $ U$ of $ x$ in $ X$ such that $ f(U)$ is contained in $ V$.” In fact, this is an equivalent definition (which the reader may verify), but there is a much simpler version that works better.

Definition: A function $ f:X \to Y$ is called continuous if the preimage of an open set under $ f$ is again an open set. That is, whenever $ V \subset Y$ is open, then $ f^{-1}(V)$ is open in $ X$.

The reason this is a better definition will become apparent later (in short: a general topology need not have “good” neighborhoods of a given point $ y$). But at least we can verify these three definitions all coincide for metric spaces. These dry computations are very similar to the one we gave for convergent sequences, so we leave it to those readers with a taste for blood. We will just simply mention that, for example, all polynomial functions are continuous with respect to this definition.
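As a small illustration of the preimage definition (a hedged sketch; the function and the open set are chosen arbitrarily), consider $ f(x) = x^2$ on the real line and the open set $ V = (1,4)$. Its preimage is $ (-2,-1) \cup (1,2)$, again an open set.

f = lambda x: x ** 2
inV = lambda y: 1 < y < 4          # membership test for the open set V = (1, 4)
inPreimage = lambda x: inV(f(x))   # membership test for the preimage of V under f

# The preimage is (-2, -1) union (1, 2); spot-check a few points.
for x in [-1.5, -0.5, 0.0, 1.2, 1.9, 2.5]:
    print(x, inPreimage(x))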

Topological Spaces, a World without Distance

We are now ready to define a general topological space.

Definition: Let $ X$ be any set. A topology on $ X$ is a family of subsets $ T$ of $ X$ for which the following three properties hold:

  • The empty set and the subset $ X$ are both in $ T$.
  • Any union of sets in $ T$ is again in $ T$.
  • Any finite intersection of sets in $ T$ is again in $ T$.

We call the pair $ (X,T)$ a topological space, and we call any set in $ T$ an open set of $ X$.

Definition: A set $ U$ in a topological space $ X$ is closed if its complement is open.
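As a quick sanity check of the definition (a hedged sketch; the three-point set and the family of subsets below are made up for illustration), one can mechanically verify the axioms for a finite candidate topology.

def isTopology(X, T):
    # T should be a collection of frozensets of elements of X.
    if frozenset() not in T or frozenset(X) not in T:
        return False
    # For a finite family, closure under pairwise unions and intersections implies
    # closure under arbitrary unions and finite intersections.
    return all((A | B) in T and (A & B) in T for A in T for B in T)

X = {1, 2, 3}
T = {frozenset(), frozenset({1}), frozenset({1, 2}), frozenset(X)}
print(isTopology(X, T))  # True: a small non-discrete topology on a three-point set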

As we have already seen, any metric space $ (X,d)$ is a topological space, where the open sets are exactly the ones we defined earlier: arbitrary unions (and finite intersections) of open balls centered at points of $ X$. We say the topology on $ X$ is induced by the metric $ d$. When $ X$ is either $ \mathbb{R}^n$ or $ \mathbb{C}^n$, we call the topology induced by the Euclidean metric the Euclidean topology on $ X$.

But these topological spaces are very well-behaved. We will work extensively with them in our applications, but there are a few classical examples that every student of the subject must know.

If we have any set $ X$, we may define a very silly topology on $ X$ by defining every subset of $ X$ to be open. This family of subsets trivially satisfies the requirements of a topology, and it is called the discrete topology. Perhaps the only interesting question we can ask about this topology is whether it is induced by some metric $ d$ on the underlying space. The avid reader of this blog should be able to answer this question quite easily.

The natural second example after the discrete topology is called the indiscrete topology. Here we simply define the topology as $ T = \left \{ \emptyset, X \right \}$. Again we see that this is a well-defined topology, and it’s duller than a conversation with the empty set.

As a third and slightly less trivial example, we point the reader to our proof gallery, where we define a topology on the integers and use it to prove that there are infinitely many primes.

Note that we can also define a topology on $ X$ by specifying a family of closed sets, as long as any intersection of closed sets is closed, and a finite union of closed sets is closed. This is because of the way unions, intersections, and complements interact. $ (\cup U_i)^{\textup{c}} = \cap U_i^{\textup{c}}$ and vice versa for intersections; proving this is a simple exercise in set theory.

Here is an extended (and vastly more interesting) example. Let $ X = \mathbb{R}^n$, and define a set $ U \subset X$ to be closed if it is the set of common roots of a collection of polynomials in $ n$ variables (which in our example below will be $ x$ and $ y$, but in general are often written $ x_1, \dots, x_n$). The set of roots is also called the zero locus of the collection of polynomials. This topology is called the Zariski topology on $ X$, and it is an extremely important topology in the field of algebraic geometry.

Before we verify that this is indeed a topology on $ X$, let us see a quick example. If $ X = \mathbb{R}^2$, the zero locus of the single polynomial $ y^2 - x^3 - x^2$ is the curve pictured below:

A nodal cubic curve (source Wikipedia).

The red curve is thus a closed set in the Zariski topology, and its complement is an open set. If we add in another polynomial (with a few exceptions) it is not hard to see that their common set of zeroes will either be the empty set or a finite set of points. Indeed, in the Zariski topology every finite set of points is closed. The intrepid reader can try to show that any finite set can be defined using exactly two polynomials (hint: you’ll get a better idea of how to do this in a moment, and without loss of generality, you can ensure one of the two is an interpolating polynomial).

To verify that the Zariski topology is indeed a topology, first note that the empty set and the entire set are closed: the constant polynomial $ 1$ has no roots, and the zero polynomial has all points as its roots. Now, the intersection of any two closed sets is just the zero locus of the union of the two collections $ \left \{ f_{\alpha} \right \} \cup \left \{ g_{\beta} \right \}$. By adding more constraints, we only keep the points which are solutions to both the $ f_{\alpha}$ and the $ g_{\beta}$ (despite the union symbol, this truly corresponds to an intersection of the closed sets). Moreover, it is clear that we can take arbitrary unions of families of polynomials and still get a single family of polynomials, which still defines a closed set; this handles arbitrary intersections of closed sets.

On the other hand, given two closed sets defined by the families of polynomials $ \left \{ f_{\alpha} \right \} , \left \{ g_{\beta} \right \}$, we can achieve their union by looking at the closed set defined by the set of polynomial products $ \left \{ f_{\alpha}g_{\beta} \right \}$ for all possible pairs $ \alpha, \beta$. To show this defines the union, take any point $ x \in \mathbb{R}^n$ which is in the union of the two closed sets. In other words, $ x$ is simultaneously a zero of all the $ f_{\alpha}$ or of all the $ g_{\beta}$. Since every polynomial in this new collection has both an $ f$ factor and a $ g$ factor, it follows that $ x$ is simultaneously a root of all of them. Conversely, suppose $ x$ is a simultaneous root of all of the $ f_{\alpha}g_{\beta}$. If it weren’t a common zero of all the $ f_{\alpha}$ and it weren’t a common zero of all the $ g_{\beta}$, then there would be some $ \alpha^*$ for which $ x$ is not a root of $ f_{\alpha^*}$ and similarly some $ \beta^*$ for which $ x$ is not a root of $ g_{\beta^*}$. But then $ x$ could not be a root of $ f_{\alpha^*}g_{\beta^*}$, contradicting that $ x$ is in the closed set to begin with. Thus we have verified that this actually defines the union of two closed sets. By induction, this shows that finite unions of closed sets are closed.
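Here is a hedged sketch of the same argument in miniature, with polynomials represented as plain Python functions and a handful of arbitrary sample points; the two membership tests agree at every sample.

# Closed set 1: the nodal cubic pictured above; closed set 2: a vertical line.
F = [lambda x, y: y**2 - x**3 - x**2]
G = [lambda x, y: x - 1]

# The union of the two zero loci should be the zero locus of all pairwise products.
products = [lambda x, y, f=f, g=g: f(x, y) * g(x, y) for f in F for g in G]

def inZeroLocus(polys, x, y):
    return all(p(x, y) == 0 for p in polys)

for (x, y) in [(0, 0), (-1, 0), (1, 5), (2, 2)]:
    inUnion = inZeroLocus(F, x, y) or inZeroLocus(G, x, y)
    print((x, y), inUnion, inZeroLocus(products, x, y))  # the two columns agree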

So the Zariski topology is in fact a valid topology on $ \mathbb{R}^n$, and it is not hard to see that if $ k$ is any field, then there is a well-defined Zariski topology on the set $ k^n$. In fact, studying this topology very closely yields a numerous amount of computational tools to solve problems like robot motion planning and automated theorem proving. We plan to investigate these topics in the future of this blog once we cover a little bit of ring theory, but for now the Zariski topology serves as a wonderful example of a useful topology.


One major aspect of mathematics is how to find the correct notion of calling two things “equivalent.” In the theory of metric spaces, the strongest possible such notion is called an isometry. That is, two metric spaces $ X, Y$ with metrics $ d_X, d_Y$ are called isometric if there exists a surjective function $ f: X \to Y$ which preserves distance (i.e. $ d_X(x,y) = d_Y(f(x), f(y))$ for all $ x,y \in X$, and the image $ f(X)$ is all of $ Y$). It is not hard to see that such functions are automatically both continuous and injective. The function $ f$ is called an isometry.

Now we can call two metric spaces “the same” if they are isometric. And they really are the same for all intents and purposes: the isometry $ f$ simply relabels the points of $ X$ as points of $ Y$, and maintains the appropriate distances. Indeed, isometry is such a strong notion of equivalence that isometries of Euclidean space are completely classified.

However, because we don’t have distances in a topological space, the next best thing is a notion of equivalence based on continuity. This gives rise to the following definition.

Definition: A function $ f: X \to Y$ between topological spaces is a homeomorphism if it is continuous, invertible, and its inverse $ f^{-1}$ is also continuous. In this case we call $ X$ and $ Y$ homeomorphic, and we write $ X \cong Y$.

In other words, we consider two topological spaces to be “the same” if one can be continuously transformed into the other in an invertible way. In still other words, a homeomorphism is a way to show that two topologies “agree” with each other. Indeed, since a topology is the only structure we have on our spaces, saying that two topologies agree is the strongest thing that can be said. (Of course, for many topological spaces we will impose other kinds of structure, but the moral still holds.)

As a first example, it is not hard to see that one can continuously transform a square into a circle (where these are considered subsets of the plane $ \mathbb{R}^2$ with the Euclidean topology):

Transform a circle into a square by projecting from the center of the circle (source: Quora).

To see how this is done, take any point $ x$ on the circle, and draw a ray from the center of the circle through $ x$. This line will intersect the square somewhere, and we can define $ f(x)$ to be that point of intersection. It is easy to see that a slight perturbation of the choice of $ x$ will only slightly change the image $ f(x)$, and that this mapping is invertible. This flavor of proof is standard in topology, because giving an argument in complete rigor (that is, defining an explicit homeomorphism) is extremely tedious and neither enlightening nor satisfying. And while there are a few holes in our explanation (for instance, what exactly is the topology of the square?), the argument is morally correct and conveys to the reader one aspect of what a homeomorphism can do.

On the other hand, in our first two examples of topological space, the discrete and indiscrete topologies, homeomorphisms are nonsensical. In fact, any two spaces with the discrete topology whose underlying sets have the same cardinality are homeomorphic, and the same goes for the indiscrete topology. This is simply because every function from a discrete space is continuous, and any function to an indiscrete space is continuous. In some sense, such topological spaces are considered pathological, because no topological tools can be used to glean any information about their structure.

As expected, the composition of two homeomorphisms is again a homeomorphism. From this it follows that homeomorphism is an equivalence relation, and so we can try to classify all topological spaces (or some interesting family of topological spaces) up to homeomorphism.

Of course, there are some very simple spaces that cannot be homeomorphic. For instance (again in the Euclidean topology), a circle is not homeomorphic to a line. While we will not prove this directly (that would require more tedious computations), there are good moral reasons why it is true. We will later identify a list of so-called topological invariants. These are properties of a topological space that are guaranteed to be preserved by homeomorphisms. In other words, if a space $ X$ has one of these properties and another space $ Y$ does not, then $ X$ and $ Y$ cannot be homeomorphic. A simple-minded topological invariant relevant to the question at hand is the existence of a “hole” in the space. Since the circle has a hole but the line does not, they cannot be homeomorphic. We will spend quite a lot of time developing more advanced topological invariants, but in the next primer we will list a few elementary and useful ones.

Of course there are many beautiful and fascinating topological spaces in higher dimensions. We will close this post with a description of a few of the most famous ones in dimension two (and, of course, we are ignoring what “dimension” rigorously means).

One nice space is the torus:

The torus (credit Wikipedia)

Otherwise known as the surface of a donut, a common mathematical joke is that a topologist cannot tell the difference between a donut and a coffee cup. Indeed, the two spaces are homeomorphic, so they are the same from a topologist’s point of view:

An explicit homeomorphism between a torus and a coffee cup (source Wikipedia).

This is a testament to the flexibility of homeomorphisms.

Another nice space is the Klein Bottle:

The Klein Bottle (source Wikipedia)

The Klein bottle is a fascinating object, because it does not “live” in three dimensions. Although it appears to intersect itself in the picture above, this is just an artifact of the visualization. It actually lives in four-dimensional space (which is impossible to visualize) and in this setting the space does not intersect itself. We say that the Klein Bottle can be embedded into $ \mathbb{R}^4$, but not $ \mathbb{R}^3$, and we will make this notion rigorous in the next primer. While this is not at all obvious, the torus and the Klein bottle are not homeomorphic.

The last space we will introduce is the real projective plane. This space, commonly denoted $ \mathbb{R}\textup{P}^2$, also does not embed in three-dimensional Euclidean space. Unlike the Klein Bottle, $ \mathbb{R}\textup{P}^2$ has no reasonable visualization, so a picture would be futile. Instead, we can think of it as a particular modification of a sphere: take a hollow sphere and “glue together” any pair of antipodal points (that is, points which are on the same line through the center of the sphere). This operation of “gluing,” although it may seem awkward, does define a perfectly good topological space (we will cover the details in the next primer). Of course, it is extremely hard to get a good idea of what it looks like, except to say that it is “kind of” like a sphere with some awkward twists in it. Again, $ \mathbb{R}\textup{P}^2$ is not homeomorphic to either of the torus or the Klein Bottle.

This only scratches the surface of commonly seen topological spaces (the Möbius strip comes to mind, for instance). While we don’t have nearly enough space or time on this blog to detail very many of them, next time we will investigate ways to take simple topological spaces and put them together to make more complex spaces. We will rigorize the notion of “gluing” spaces together, along with other common operations. We will also spend some time developing topological invariants which allow us to “count” the number of “holes” in a space. These invariants will become the sole focus of our applications of topology to data analysis.

Until then!

K-Nearest-Neighbors and Handwritten Digit Classification

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

The Axiom of Neighborliness: Let $ (X, d)$ be a metric space and let $ S \subset X$ be a finite set whose elements are classified by some function $ f : S \to \left \{ 1, 2, \dots, m \right \}$. We say that $ S$ satisfies the axiom of neighborliness if for every point $ x \in S$, if $ y \neq x$ is the closest point to $ x$, then $ f(x) = f(y)$. That is, $ y$ shares the same class as $ x$ if $ y$ is the nearest neighbor to $ x$.

For a more in-depth discussion of metrics, the reader should refer to this blog’s primer on the topic. For the purpose of this post and all foreseeable posts, $ X$ will always be $ \mathbb{R}^n$ for some $ n$, while the metric $ d$ will vary.

This axiom is actually a very strong assumption which is certainly not true of every data set. In particular, it highly depends on the problem setup. Having the wrong kinds or the wrong number of features, doing an improper conversion, or using the wrong metric can all invalidate the assumption even if the problem inherently has the needed structure. Luckily, for real-world applications we only need the data to adhere to the axiom of neighborliness in approximation (indeed, in practice the axiom is only verifiable in approximation). Of course, what we mean by “approximation” also depends on the problem and the user’s tolerance for error. Such is the nature of applied mathematics.
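As a sanity check, one could measure how nearly a labeled data set satisfies the axiom with a sketch like the following; the names data, labels, and distance are placeholders for whatever representation and metric one chooses.

def fractionNeighborly(data, labels, distance):
    # For each point, find its nearest *other* point and see whether the labels agree.
    agree = 0
    for i, x in enumerate(data):
        nearest = min((j for j in range(len(data)) if j != i),
                      key=lambda j: distance(x, data[j]))
        agree += (labels[i] == labels[nearest])
    return agree / len(data)   # 1.0 means the axiom holds exactly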

Once we understand the axiom, the machine learning “algorithm” is essentially obvious. For training, store a number of data points whose classes are known and fix a metric. To determine the class of an unknown data point, simply use the most common class of its nearest neighbors. As one may vary (as a global parameter) the number of neighbors one considers, this method is intuitively called k-nearest-neighbors.

The Most Basic Way to Learn: Copy Your Neighbors

Let’s iron out the details with a program and test it on some dummy data. Let’s construct a set of points in $ \mathbb{R}^2$ which manifestly satisfies the axiom of neighborliness. To do this, we’ll use Python’s random library to make a dataset sampled from two independent normal distributions.

import random

def gaussCluster(center, stdDev, count=50):
    return [(random.gauss(center[0], stdDev),
             random.gauss(center[1], stdDev)) for _ in range(count)]

def makeDummyData():
    return gaussCluster((-4,0), 1) + gaussCluster((4,0), 1)

The first function simply returns a cluster of points drawn from the specified normal distribution. For simplicity we use the same standard deviation for both coordinates, which are sampled independently. The second function simply combines two clusters into a data set.

To give the dummy data class “labels,” we’ll simply have a second list that we keep alongside the data. The index of a data point in the first list corresponds to the index of its class label in the second. There are likely more elegant ways to organize this, but it suffices for now.

Implementing a metric is similarly straightforward. For now, we’ll use the standard Euclidean metric. That is, we take the square root of the sum of the squared differences of the coordinates of the two given points.

import math

def euclideanDistance(x,y):
    return math.sqrt(sum([(a-b)**2 for (a,b) in zip(x,y)]))
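
A quick sanity check on the classic 3-4-5 right triangle:

euclideanDistance((0, 0), (3, 4))   # 5.0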

To actually implement the classifier, we create a function which itself returns a function.

import heapq

def makeKNNClassifier(data, labels, k, distance):
    def classify(x):
        closestPoints = heapq.nsmallest(k, enumerate(data),
                                        key=lambda y: distance(x, y[1]))
        closestLabels = [labels[i] for (i, pt) in closestPoints]
        return max(set(closestLabels), key=closestLabels.count)

    return classify

There are a few tricky things going on in this function that deserve discussion. First and foremost, we are defining a function within another function, and returning the created function. The important technical point here is that the created function retains all local variables that are in scope, even after the enclosing function returns. Specifically, you can call “makeKNNClassifier” multiple times with different arguments, and the returned functions won’t interfere with each other. The returned function is said to close over the values in its environment, and so this programming language feature is called a function closure, or just a closure, for short. It allows us, for instance, to keep the training data and the metric available to the classifier while hiding them from the rest of the program. From a high level, the returned classification function entirely represents the logic of the classifier, and so this view is justified.
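
For instance, here is a small illustration (assuming the functions above are in scope): two classifiers built from the same factory keep their own data and metric, and calling one has no effect on the other.

classifyA = makeKNNClassifier([(0, 0), (10, 10)], [1, 2], 1, euclideanDistance)
classifyB = makeKNNClassifier([(5, 5), (-5, -5)], [3, 4], 1, euclideanDistance)

print(classifyA((1, 1)))   # 1, from the first data set
print(classifyB((6, 6)))   # 3, from the second, unaffected by classifyA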

Second, we are using some relatively Pythonic constructions. The first line of “classify” uses heapq to pick the $ k$ smallest elements of the data list, but in addition we use “enumerate” to preserve the index of each returned element, and a custom key so that the judgement of “smallest” is determined by the custom distance function. Note that “y[1]” in the lambda refers to the point itself, not its saved index.

The second line simply extracts a list of the labels corresponding to each of the closest points returned by the call to “nsmallest.” Finally, the third line returns the most common of those labels, where a label’s weight (given by the “key” argument) is its frequency in the “closestLabels” list.

Using these functions is quite simple:

trainingPoints = makeDummyData() # has 50 points from each class
trainingLabels = [1] * 50 + [2] * 50  # an arbitrary choice of labeling

f = makeKNNClassifier(trainingPoints, trainingLabels, 8, euclideanDistance)
print(f((-3,0)))

The reader may fiddle around with this example as desired, but we will not pursue it further. As usual, all code used in this post is available on this blog’s Github page. Let’s move on to something more difficult.

Handwritten Digits

One of the most classic examples in the classification literature is in recognizing handwritten digits. This originally showed up (as the legend goes) in the context of the United States Postal Service for the purpose of automatically sorting mail by the zip code of the destination. Although this author has no quantitative proof, the successful implementation of a scheme would likely save an enormous amount of labor and money. According to the Postal Facts site, there are 31,509 postal offices in the U.S. and, assuming each one processes mail, there is at least one employee at each office who would spend some time sorting by zip code. Given that the USPS processes 23 million pieces of mail per hour, a conservative estimate puts each office spending two hours of labor per day on sorting mail by zip code (resulting in a very rapid pace of 146.52 pieces of mail sorted per minute per worker). At a lower bound of $18/hr this amounts to a cost of $1,134,324 per day, or over 400 million dollars per year. Put in perspective, in one year the amount of money saved equals the entire two-year tuition of Moraine Valley Community College for 68,000 students (twice the current enrollment).

In short, the problem of sorting mail (and of classifying handwritten digits) begs to be automated, and indeed it has been to some degree for about four decades. Let’s see how k-nearest-neighbors fares in this realm.

We obtain our data from the UCI machine learning repository, and with a few minor modifications, we present it on this blog’s Github page (along with the rest of the code used in this post). A single line of the data file represents a handwritten digit and its label. The digit is a 256-element vector obtained by flattening a 16×16 binary-valued image in row-major order; the label is an integer representing the number in the picture. The data file contains 1593 instances with about 160 instances per digit.

In other words, our metric space is $ \left \{ 0,1 \right \}^{256}$, and we choose the Euclidean metric for simplicity. With the line wrapping to better display the “image,” one line from the data file looks like:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6
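
Parsing a line of this format into a point and its label is straightforward. The following is a minimal sketch of a loader (the name “loadDigits” is ours, and the loader in this blog’s repository may differ in its details); it assumes each file line holds the 256 whitespace-separated bits followed by a comma and the label, as shown above.

def loadDigits(filename):
    data = []
    with open(filename, 'r') as infile:
        for line in infile:
            # split off the label after the final comma
            bits, label = line.rsplit(',', 1)
            point = tuple(int(b) for b in bits.split())
            data.append((point, int(label)))
    return data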

After reading in the data appropriately, we randomly split the data set into two pieces, train on one piece, and test on the other. The following function does this, returning the success rate of the classification algorithm on the testing piece.

import knn
import random

def column(A, j):
    return [row[j] for row in A]

def test(data, k):
    random.shuffle(data)   # randomize which instances land in the training and testing pieces
    pts, labels = column(data, 0), column(data, 1)

    trainingData = pts[:800]
    trainingLabels = labels[:800]
    testData = pts[800:]
    testLabels = labels[800:]

    f = knn.makeKNNClassifier(trainingData, trainingLabels,
                              k, knn.euclideanDistance)
    correct = 0
    total = len(testLabels)

    for (point, label) in zip(testData, testLabels):
        if f(point) == label:
            correct += 1

    return float(correct) / total

A run with $ k=1$ gives a surprisingly good 89% success rate. Varying $ k$, we see this is about as good as it gets without any modifications to the algorithm or the metric. Indeed, the graph below shows that the handwritten digits data set agrees with the axiom of neighborliness to a fair approximation.

A graph of classification accuracy against k for values of k between 1 and 50. The graph clearly shows a downward trend as k increases, but all values k < 10 are comparably good.
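
A sketch of the experiment behind such a graph is a one-line loop over $ k$ (assuming “data” holds the (point, label) pairs loaded from the file). Since each call to “test” reshuffles the data, one might average several runs per value of $ k$ for a smoother curve.

accuracies = [(k, test(data, k)) for k in range(1, 51)]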

Of course, there are many improvements we could make to this naive algorithm. But considering that it utilizes no domain knowledge and doesn’t manipulate the input data in any way, it’s not too shabby.

As a side note, it would be fun to get some tablet software and have it use this method to recognize numbers as one writes them. Alas, we have little time for these sorts of applications.

Advantages, Enhancements, and Problems

One reason k-nearest-neighbors is such a common and widely-known algorithm is its ease of implementation. Indeed, we implemented the core algorithm in a mere three lines of Python. On top of that, k-nearest-neighbors is pleasingly parallel, and inherently flexible. Unlike the Perceptron algorithm, which relies on linear separability, k-nearest-neighbors and the axiom of neighborliness allow for datasets with many different geometric structures. These lecture notes give a good example, as shown below, and the reader can surely conjure many more.

k-nearest-neighbors applied to a data set organized in concentric circles.

And of course, the flexibility is even greater by virtue of being able to use any metric for distance computations. One may, for instance, use the Manhattan metric if the points in question are locations in a city. Or if the data is sequential, one could use the dynamic time warping distance (which isn’t truly a metric, but is still useful). The possibilities are only limited by the discovery of new and useful metrics.
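
For instance, here is a sketch of the Manhattan (taxicab) metric, which slots directly into the classifier factory from earlier with no other changes:

def manhattanDistance(x, y):
    # sum of the absolute coordinate-wise differences
    return sum(abs(a - b) for (a, b) in zip(x, y))

g = makeKNNClassifier(trainingPoints, trainingLabels, 8, manhattanDistance)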

With such popularity, k-nearest-neighbors often comes with a number of modifications and enhancements. One enhancement is to heuristically remove certain points close to the decision boundary. This technique is called edited k-nearest-neighbors. Another is to weight certain features more heavily in the distance computations, which requires one to programmatically determine how much each feature contributes to classification. This is getting close to the realm of a decision tree, and so we’ll leave this as an exercise to the reader.

The next improvement has to do with runtime. Given $ n$ training points and $ d$ features (d for dimension), classifying a single point requires $ O(nd)$ time. This is particularly expensive, because most of the distance computations performed are between points that are far away from the query, and since $ k$ is usually small, those points won’t influence the classification.

One way to alleviate this is to store the data points in a data structure called a k-d tree. The k-d tree originated in computational geometry in the problem of point location. It partitions space into pieces based on the number of points in each resulting piece, and organizes the partitions into a tree. In other words, it will partition tightly where the points are dense, and loosely where the points are sparse. At each step of traversing the tree, one can check to see which sub-partition the unclassified point lies in, and descend appropriately. With certain guarantees, this reduces the computation to $ O(\log(n)d)$. Unfortunately, there are issues with high-dimensional spaces that are beyond the scope of this post. We plan to investigate k-d trees further in a future series on computational geometry.
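
To make the idea concrete, here is a heavily simplified sketch (the names are ours, and this is an illustration rather than a production data structure): build a tree by splitting on the median coordinate, cycling through the axes, and then descend into the partition containing the query point. Note that this greedy descent only finds an approximate nearest neighbor; an exact search must also backtrack into sibling subtrees that might contain closer points.

def buildKdTree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])                  # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                         # split at the median point
    return {'point': points[mid], 'axis': axis,
            'left': buildKdTree(points[:mid], depth + 1),
            'right': buildKdTree(points[mid + 1:], depth + 1)}

def greedyNearest(tree, query, distance):
    best, node = None, tree
    while node is not None:
        if best is None or distance(node['point'], query) < distance(best, query):
            best = node['point']
        axis = node['axis']
        node = node['left'] if query[axis] < node['point'][axis] else node['right']
    return best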

The last issue we consider is in data scaling. Specifically, one needs to be very careful when converting the real world data into numerical data. We can think of each of the features as a random variable, and we want all of these random variables to have comparable variation. The reason is simply because we’re using spheres. One can describe k-nearest-neighbors as finding the smallest (filled-in) sphere centered at the unlabeled point containing $ k$ labeled data points, and using the most common of those labels to classify the new point. Of course, one can talk about “spheres” in any metric space; it’s just the set of all points within some fixed distance from the center (and this definition doesn’t depend on the dimension of the space). The important point is that a sphere has uniform length along every axis. If the data is scaled improperly, then the geometry of the sphere won’t mirror the geometry of the data, and the algorithm will flounder.
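
A minimal sketch of such a rescaling (the name “rescale” is ours) standardizes each feature to mean zero and unit standard deviation before any distances are computed:

import math

def rescale(points):
    columns = list(zip(*points))
    means = [sum(col) / float(len(col)) for col in columns]
    stdDevs = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
               for (col, m) in zip(columns, means)]
    stdDevs = [s if s > 0 else 1.0 for s in stdDevs]   # guard against constant features
    return [tuple((x - m) / s for (x, m, s) in zip(p, means, stdDevs)) for p in points]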

So now we’ve seen a smattering of topics about k-nearest-neighbors. We’d love to continue the discussion of modifications in the comments. Next time we’ll explore decision trees, and work with another data set. Until then!

Metric Spaces — A Primer

The Blessing of Distance

We have often mentioned the idea of a “metric” on this blog, and we briefly described a formal definition for it. Colloquially, a metric is simply the mathematical notion of a distance function, with certain well-behaved properties. Since we’re now starting to cover a few more metrics (and things which are distinctly not metrics) in the context of machine learning algorithms, we find it pertinent to lay out the definition once again, discuss some implications, and explore a few basic examples.

The most important thing to take away from this discussion is that not all spaces have a notion of distance. For a space to have a metric is a strong property with far-reaching mathematical consequences. Essentially, metrics impose a topology on a space, which the reader can think of as the contortionist’s flavor of geometry. We’ll explore this idea after a few examples.

On the other hand, from a practical standpoint one can still do interesting things without a true metric. The downside is that work relying on (the various kinds of) non-metrics doesn’t benefit as greatly from existing mathematics. This can often spiral into empirical evaluation, where justifications and quantitative guarantees are not to be found.

Metrics and Metric Spaces

Given a set $ X$, we say $ X$ is a metric space if it comes equipped with a special function $ d(x,y)$ that can compute the distance between any two points $ x,y$ of $ X$. Specifically, $ d$ must satisfy the axioms of a metric.

Definition: A function $ d: X \times X \to \mathbb{R}$ is a metric if it satisfies the following three properties for any choice of elements $ x, y, z \in X$.

  • $ d(x,y) \geq 0$ (non-negativity), and $ d(x,y) = 0$ if and only if $ x=y$.
  • $ d(x,y) = d(y,x)$ (symmetry)
  • $ d(x,y) + d(y,z) \geq d(x,z)$ (triangle inequality)

Our goal now is to convince the reader that these three axioms are sensible for every notion of distance to satisfy. The first bullet claims that the distance between any two things can never be negative (hence called “non-negativity”), and that the distance between two things can only be zero if those two things are actually the same thing. The second bullet is a matter of perspective; the distance function reads, “the distance between $ x$ and $ y$,” and this shouldn’t change based on which element comes first in the sentence. This is the “symmetry” condition.


If one wants to prove a function is a metric, the third bullet is often the hardest property to establish. It’s called the triangle inequality, and in words it says that the lengths of edges of triangles make sense if you measure them with $ d$. Thinking of $ x, y, z$ as the vertices of a triangle, such as the one at left, we don’t want the length of one edge to be longer than the combined lengths of the other two edges. It’s a basic fact of Euclidean geometry that such a triangle cannot be drawn.


Walking from one vertex to another has an obviously shortest route: the straight line path

Pedantically, we notice that the third bullet above uses $ \geq$, which allows the case that $ d(x,y) + d(y,z) = d(x,z)$. It is not hard to see that this occurs (in Euclidean space, at least) when $ y$ lies on the line segment between $ x$ and $ z$. In this case it’s not truly a triangle, but it’s just convenient to pack it under the same name.

Aside from analyzing the abstract properties of a metric, the best way to understand this definition is to explore lots and lots of examples.

Of Norms and Graphs and Levenshtein, of Taxicabs and Kings

The simplest metric one could construct is called the discrete metric. It is defined by $ d(x,y) = 0$ if $ x = y$ and $ d(x,y) = 1$ otherwise. The symmetry and non-negativity conditions are trivially satisfied, and the triangle inequality is easy to prove. Suppose $ d(x,y) + d(y,z) < d(x,z) \leq 1$. Since each distance is either 0 or 1 and the left-hand side is strictly less than 1, both terms on the left must be zero, so $ x=y$ and $ y=z$; meanwhile $ d(x,z) > 0$ forces $ x \neq z$. The transitivity of equality, however, implies $ x=z$, a contradiction.

The discrete metric is completely useless for practical purposes, because all it can do is tell one that two things are equal or not equal. We don’t need a metric to do this in real life. On the other hand, mathematically this metric has a lot of uses. It serves as a conveniently pathological counterexample allowing one to gauge the plausibility of purported theorems in topology. These sorts of things usually only show up in the realm of point-set topology, which we haven’t breached yet on this blog, so we’ll leave it as a relevant link for now (hit the first link to page 41).
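
Even so, the axioms are easy to spot-check in code. The following sketch (the names are ours) verifies the three properties on a finite sample of points; passing the check proves nothing about the whole space, but failing it refutes a candidate metric outright.

import itertools

def discreteMetric(x, y):
    return 0 if x == y else 1

def spotCheckMetric(d, points, tolerance=1e-9):
    for (x, y) in itertools.product(points, repeat=2):
        assert d(x, y) >= 0
        assert (d(x, y) == 0) == (x == y)              # positive definiteness
        assert abs(d(x, y) - d(y, x)) <= tolerance     # symmetry
    for (x, y, z) in itertools.product(points, repeat=3):
        assert d(x, y) + d(y, z) >= d(x, z) - tolerance   # triangle inequality

spotCheckMetric(discreteMetric, ['a', 'b', 'c'])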

The most well known metric by far is the Euclidean metric. In $ n$ dimensions, this is just

$ \displaystyle d((x_1, \dots, x_n), (y_1, \dots, y_n)) = \sqrt{(y_1 - x_1)^2 + \dots + (y_n - x_n)^2}$

The non-negativity and symmetry of this metric follow from the fact that $ (a - b)^2 = (b - a)^2 \geq 0$ for all real numbers $ a,b$. The triangle inequality is a bit more difficult to prove, and without using a cop-out like Minkowski’s inequality, one would need to prove the Pythagorean theorem for $ \mathbb{R}^n$, which implies the Cauchy-Schwarz inequality, which in turn implies the triangle inequality. Instead, we will do this at the end of this primer for a general vector space and a general inner product. The special case of the usual Euclidean dot product (which induces the Euclidean metric as above) will follow trivially.

The next metric we will inspect is the taxicab metric, also known as the Manhattan metric for the way it mimics driving distances on a grid of streets.

The picture below illustrates this: the green line represents usual Euclidean distance between the two black dots, while the blue, red, and yellow lines all represent the same distance via the taxicab metric. In particular, the distance is the sum of the lengths of the individual line segments, and it’s easy to see that the choice of path is irrelevant.


The red, yellow, and blue lines are all paths of equal length from the bottom left to the top right.

To make this more rigorous mathematically, we will pick the simplest possible path (the red one) to see that the distance is simply the sum of the differences of the x- and y-coordinates in absolute value. This generalizes to the following formula for arbitrary dimension.

$ \displaystyle d((x_1, \dots, x_n), (y_1, \dots, y_n)) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n|$

For reasons the measure-theorist is familiar with, this metric is sometimes called the $ L_1$ metric. Much like the Euclidean metric, it also arises from a vector space (albeit not in the usual way). This function is non-negative and symmetric for the same reasons the Euclidean metric is. We will again defer the proof of the triangle inequality to the end of this post.

Next, we have the maximum metric, also known as the Chebyshev metric, which measures the distance it takes a king to travel from one point on a chessboard to another.


The picture to the left shows this in action. In particular, the distance required for a king to move from one square to another is exactly the maximum of the horizontal and vertical distances between the two squares.

If we generalize the chessboard and the king’s movement rules to an arbitrary number of dimensions, the distance becomes the maximum of $ |x_i - y_i|$ over each coordinate indexed by $ i$.

Rigorously written, the maximum metric is defined by

$ \displaystyle d(x,y) = \max_i (|x_i - y_i|)$

As usual, symmetry and non-negativity are completely obvious. The triangle inequality is not so hard here. If $ z = x + y$, then

$ \max_i |z_i| = \max_i |x_i + y_i| \leq \max_i (|x_i| + |y_i|) \leq \max_i |x_i| + \max_i |y_i|$.

This is the distance between $ z$ and 0, and the more general result follows by translating the points in question (and it is easy to see that translation preserves distance).
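
In code, the maximum metric is a one-liner (a sketch, with a name of our choosing):

def maximumDistance(x, y):
    # the largest coordinate-wise difference
    return max(abs(a - b) for (a, b) in zip(x, y))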

Next, we can construct a metric on any undirected, weighted (or unweighted) graph $ G$, and we naturally call it the graph metric. The space is the set of vertices in $ G$, and the distance between two vertices is the length of the shortest path between them, as per the weighting. In the case that $ G$ is unweighted, we can equivalently count the number of edges in the shortest path (shortest by edge count) or assume all edge weights are equal to 1. By virtue of the graph being undirected and the weights being non-negative, the symmetry and non-negativity conditions are trivially satisfied. The triangle inequality is (unusually) easy as well: if $ d(x,z) > d(x,y) + d(y,z)$ for some vertex $ y$, then concatenating a shortest path from $ x$ to $ y$ with a shortest path from $ y$ to $ z$ would produce a path from $ x$ to $ z$ shorter than the shortest one, a contradiction. For those readers familiar with group theory, this idea extends naturally to metrics on groups, using the Cayley graph.
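
For an unweighted graph stored as an adjacency list, the graph metric can be computed with a breadth-first search; a weighted graph would call for Dijkstra’s algorithm instead. Here is a minimal sketch (the names are ours):

from collections import deque

def graphDistance(adjacency, source, target):
    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            return distances[vertex]
        for neighbor in adjacency[vertex]:
            if neighbor not in distances:
                distances[neighbor] = distances[vertex] + 1
                queue.append(neighbor)
    return float('inf')   # disconnected vertices have no finite distance

graphDistance({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}, 'a', 'c')   # 2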

The last example is one we’ve explored at length on this blog, and that is the Levenshtein metric on words. Overly rigorously, it is a metric on a free monoid where we allow substitutions, insertions, and deletions. The reader is encouraged to read more about it in our post on metrics on words. An even simpler version of this is called the Hamming distance, where we only allow substitutions and the two words being compared must have the same length.
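
The Hamming distance, in particular, is a one-liner (a sketch with a hypothetical name):

def hammingDistance(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for (a, b) in zip(s, t) if a != b)

hammingDistance("10110", "10011")   # 2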

A Few Vague Words on Topology

As we have mentioned briefly before on this blog, a notion of distance allows one to consider any geometric construction that relies only on distance. The easiest examples are circles and ellipses, but one can also talk about convergent sequences, and other more analytic ideas.

But more importantly to mathematicians, metrics generate very tractable topologies. While we’re not suited to begin a full-blown discussion of topology in this primer, we will at least say that a topology is simply a definition of open sets, subject to appropriate properties. For a metric, open sets are usually built from open disks. That is, one might define the open sets to be all the unions and finite intersections of open disks, where an open disk is a set of the form $ \left \{ y : d(x,y) < C \right \}$ for some center $ x$, and some constant $ C$.

The structure induced by these open sets is very flexible. In particular, two topological spaces are said to be equivalent if there is an invertible function between them which preserves open sets in both directions. This allows for all sorts of uncanny stretching and bending, such as those used to turn a sphere inside-out. The formal word for such a function is a homeomorphism, and the two spaces are said to be homeomorphic. One would be right to think that without certain assumptions, topologies could be wild and crazy beyond our imagination. The important point for this post is that a topology coming from a metric space is particularly well-behaved (at least as far as topologies go), satisfying a number of helpful properties for its analysis.

While it might seem weird and arbitrary to talk of open sets as a “structure” of a space, it turns out to yield a very rich theory and surprising applications. We plan to explore some of these applications on this blog in the distant future.

Inner Product Spaces: a Reprise

In what follows we will give a detailed but elementary treatment of the triangle inequality for a general inner product space. One should note that there are even more general spaces that allow for metrics with the triangle inequality, but these usually involve measure theory or take the triangle inequality as an axiom. In this post, we want to see the triangle inequality occur as a result of the existence of an inner product. For a refresher on inner product spaces and the basic definitions, the reader should refer to our primer on the subject.

Before we continue, we should also note which inner products induce which metrics. For the Euclidean metric it is obviously the Euclidean inner product. For the taxicab metric one should refer to the second page of these notes.

Let $ V$ be an inner product space, and let $ v \in V$. We define the norm of $ v$ to be $ \| v \| = \sqrt{\left \langle v,v \right \rangle}$. This coincides with the usual Euclidean norm if we use the Euclidean inner product, and with the $ L_2$ norm if we use the appropriate integral inner product.

There are some trivial properties one would expect to be true of norms, such as non-negativity, $ \| v \| = 0$ if and only if $ v = 0$, and $ \| av \| = |a| \| v \|$ for scalars $ a$. We leave these as exercises to the reader.

As we noted in our primer on inner product spaces, two vectors $ v,w$ are said to be orthogonal if $ \left \langle v,w \right \rangle = 0$. From this we can prove the Pythagorean Theorem for an inner product space.

Theorem: If $ u,v$ are orthogonal vectors, then $ \| u + v \|^2 = \| u \|^2 + \| v \|^2$.

Proof. By definition, $ \| u + v \|^2 = \left \langle u+v, u+v \right \rangle$, and this expands by linearity of the inner product to

$ \displaystyle \|u \|^2 + \| v \|^2 + \left \langle u,v \right \rangle + \left \langle v,u \right \rangle$

As the two vectors are orthogonal, the last two terms are zero, giving the desired result. $ \square$

Now given a vector $ v$, we describe a useful way to decompose another vector $ u$ into two parts, where one is orthogonal to $ v$ and one is a scalar multiple of $ v$. A simple computation gives a unique result:

$ \displaystyle u = \frac{\left \langle u,v \right \rangle}{\| v \|^2}v + \left ( u - \frac{\left \langle u,v \right \rangle}{\|v \|^2}v \right )$

We call the first term the projection of $ u$ onto $ v$. The second term is then simply the remainder after subtracting off the projection. This construction helps us understand the relationship between two vectors, but it also helps us understand the relationship between the inner product and the norm, as in the following theorem.

Theorem: (The Cauchy-Schwarz Inequality). For all $ u,v \in V$,

$ | \left \langle u,v \right \rangle | \leq \|u \| \|v \|$.

Proof. If $ v=0$ then the inequality trivially holds, so suppose $ v \neq 0$. Consider the orthogonal decomposition of $ u$ along $ v$, where we denote the orthogonal part by $ w$.

$ \displaystyle u = \frac{\left \langle u,v \right \rangle}{\| v \|^2}v + w$

By the Pythagorean Theorem, we have

$ \displaystyle \| u \|^2 = \left \| \frac{\left \langle u,v \right \rangle}{\| v \|^2}v \right \|^2 + \| w \|^2$

Since $ \| a v \| = |a| \| v \|$ for any scalar $ a$, this is the same as

$ \displaystyle \| u \|^2 = \frac{|\left \langle u,v \right \rangle |^2}{\| v \|^2} + \|w \|^2$

Since norms are non-negative, we can omit the $ w$ part and get an inequality

$ \displaystyle \| u \|^2 \geq \frac{| \left \langle u,v \right \rangle |^2}{\| v \|^2}$.

Multiplying both sides by $ \| v \|^2$ and taking square roots gives the result. $ \square$
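
A quick numerical sketch in $ \mathbb{R}^3$ with the usual dot product illustrates both the decomposition and the inequality (this is our own illustration, not part of the proof):

def dot(u, v):
    return sum(a * b for (a, b) in zip(u, v))

def norm(v):
    return dot(v, v) ** 0.5

u, v = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
projection = tuple(dot(u, v) / dot(v, v) * b for b in v)
w = tuple(a - p for (a, p) in zip(u, projection))

print(abs(dot(w, v)))                        # essentially zero: w is orthogonal to v
print(abs(dot(u, v)) <= norm(u) * norm(v))   # True, as Cauchy-Schwarz demands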

And now we may finally get to the triangle inequality for norms, which says that $ \| u + v \| \leq \| u \| + \| v \|$. Before we prove this, note that we can bring this back to the world of metrics by defining a metric based on the norm as $ d(u,v) = \| u-v \|$, and the statement about the triangle inequality translates to what we expect it should. So concluding this primer we present the proof of the triangle inequality.

Theorem: For all $ u,v \in V, \| u+v \| \leq \| u \| + \| v \|$.

Proof. Expanding $ \| u + v \|^2$ using the properties of the inner product we get

$ \displaystyle \| u + v \|^2 = \| u \|^2 + \| v \|^2 + 2 \textup{Re}\left \langle u,v \right \rangle$

where “Re” stands for the real part of the (possibly complex-valued) inner product. As the real part is bounded by the complex absolute value, we introduce our first inequality as

$ \displaystyle \| u +v \|^2 \leq \| u \|^2 + \| v \|^2 + 2|\left \langle u,v \right \rangle|$

By the Cauchy-Schwarz inequality, the last term is bounded by the norms of $ u$ and $ v$, giving

$ \displaystyle \| u+v \|^2 \leq \| u \|^2 + \| v \|^2 + 2 \| u \| \| v \| = (\| u \| + \| v \|)^2$

And taking square roots gives the result. $ \square$