Conditional (Partitioned) Probability — A Primer

One of the main areas of difficulty in elementary probability, and one that requires the highest levels of scrutiny and rigor, is conditional probability. The ideas are simple enough: that we assign probabilities relative to the occurrence of some event. But shrewd applications of conditional probability (and in particular, efficient ways to compute conditional probability) are key to successful applications of this subject. This is the basis for Nate Silver‘s success, the logical flaws of many a political pundit, and the ability for a robot to tell where it is in an environment. As this author usually touts, the best way to avoid the pitfalls of such confusing subjects is to be mathematically rigorous. In doing so we will develop intuition for when conditional probability that experts show off as if it were trivial.

But before we can get to all of that, we will cover a few extra ideas from finite probability theory that were left out of the last post.

Our entire discussion will revolve around a finite probability space, as defined last time. Let’s briefly (and densely) recall some of the notation presented there. We will always denote our probability space by $\Omega$, and the corresponding probability mass function will be $f: \Omega \to [0,1]$. Recall that events are subsets $E \subset \Omega$, and the probability function $P$ accepts as inputs events $E$, and produces as output the sum of the probabilities of members of $E$. We abuse notation by saying $\textup{P}(x) = \textup{P}(\left \{ x \right \}) = f(x)$ and disregarding $f$ for the most part. We really think of $\textup{P}$ as an extension of $f$ to subsets of $\Omega$ instead of just single values of $\Omega$. Further recall that a random variable $X$ is a real-valued function function $\Omega \to \mathbb{R}$.

Partitions and Total Probability

A lot of reasoning in probability theory involves decomposing a complicated event into simpler events, or decomposing complicated random variables into simpler ones. Conditional probability is one way to do that, and conditional probability has very nice philosophical interpretations, but it fits into this more general scheme of “decomposing” events and variables into components.

The usual way to break up a set into pieces is via a partition. Recall the following set-theoretic definition.

Definition: partition of a set $X$ is a collection of subsets $X_i \in X$ so that every element $x \in X$ occurs in exactly one of the $X_i$.

Here are a few examples. We can partition the natural numbers $\mathbb{N}$ into even and odd numbers. We can partition the set of people in the world into subsets where each subset corresponds to a country and a person is placed in the subset corresponding to where they were born (an obvious simplification of the real world, but illustrates the point). The avid reader of this blog will remember how we used partitions to define quotient groups and quotient spaces. With a more applied flavor, finding a “good” partition is the ultimate goal of the clustering problem, and we saw a heuristic approach to this in our post on Lloyd’s algorithm.

You should think of a partition as a way to “cut up” a set into pieces. This colorful diagram is an example of a partition of a disc.

In fact, any time we have a canonical way to associate two things in a set, we can create a partition by putting all mutually associated things in the same piece of the partition. The rigorous name for this is an equivalence relation, but we won’t need that for the present discussion (partitions are the same thing as equivalence relations, just viewed in a different way).

Of course, the point is to apply this idea to probability spaces. Points (elements) in our probability space $\Omega$ are outcomes of some random experiment, and subsets $E \subset \Omega$ are events. So we can rephrase a partition for probability spaces as a choice of events $E_i \subset \Omega$ so that every outcome in $\Omega$ is part of exactly one event. Our first observation is quite a trivial one: the probabilities of the events in a partition sum to one. In symbols, if $E_1, \dots, E_m$ form our partition, then

$\displaystyle \sum_{i=1}^m \textup{P}(E_i) = 1$

Indeed, the definition of $\textup{P}$ is to sum over the probabilities of outcomes in an event. Since each outcome occurs exactly once among all the $E_i$, the above sum expands to

$\displaystyle \sum_{\omega \in \Omega} \textup{P}(\omega)$

Which by our axioms for a probability space is just one. We will give this observation the (non-standard) name the Lemma of Total Probability.

This was a nice warmup proof, but we can beef it up to make it more useful. If we have some other event $A$ which is not related to a partition in any way, we can break up $A$ with respect to the partition. Then, assuming this is simpler, we compute the probability that $A$ happens in terms of the probabilities of the pieces.

Theorem: Let $E_1, \dots , E_m$ be a partition of $\Omega$, and let $A$ be an arbitrary event. Then

$\displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(E_i \cap A)$

Proof. The proof is only marginally more complicated than that of the lemma of total probability. The probability of the event $A$ occurring is the sum of the probabilities of each of its outcomes occurring. Each outcome in $A$ occurs in exactly one of the $E_i$, and hence in exactly one of the sets $E_i \cap A$. If $E_i \cap A$ is empty, then its probability of occurring is zero (as per our definitions last time). So the sum on the right expands directly into the definition of $\textup{P}(A)$. $\square$

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E’s. That is, the E’s give us a partition of A.

A more useful way of thinking of this is that we can use the $E_i$ to define a partition of $A$ in a natural way. The subsets in the partition will just be the sets $E_i \cap A$, and we will throw out any of these that turn out to be empty. Then we can think of our “new” probability space being $A$, and the theorem is just a special case of the lemma of total probability. Interestingly enough, this special case is often called the Theorem of Total Probability.

The idea to think of the event $A$ as our “new” probability space is extremely useful. It shows its worth most prominently when we interpret the shift as, “gaining the information that $A$ has occurred.” Then the question becomes: given that $A$ occurs, what is the probability that some other event will occur? That is, we’re interested in the probability of some event $B$ relative to $A$. This is called the conditional probability of $B$ with respect to $A$, and is denoted $P(B | A)$ (read “the probability of B given A”).

To compute the conditional probability, simply scale $\textup{P}(A \cap B)$ by the assumed event $\textup{P}(A)$. That is,

$\displaystyle \textup{P}(B | A) = \frac{\textup{P}(A \cap B)}{\textup{P}(A)}$

Wikipedia provides a straightforward derivation of the formula, but the spirit of the proof is exactly what we said above. The denominator is our new sample space, and the numerator is the probability of outcomes that cause $B$ to occur which also cause $A$ to occur. Multiplying both sides of this formula by $\textup{P}(A)$, this identity can be used to arrive at another version of the theorem of total probability:

$\displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(A | E_i) \textup{P}(E_i)$

That is, if we know how to compute the probabilities of the $E_i$, and we know how likely $A$ is to occur in each of those scenarios, then we can compute the total probability of $A$ occurring independently of the $E_i$.

We can come up with loads of more or less trivial examples of the theorem of total probability on simple probability spaces. Say you play a craps-like game where you roll a die twice. If you get a one on the first roll, you lose, and otherwise you have to match your initial roll on the second to win. The probability you win can be analyzed with the theorem on total probability. We partition the sample space into events corresponding to the outcome of the first roll.

$\displaystyle \textup{P}(\textup{Win}) = \sum_{i=1}^6 \textup{P}(\textup{Win } | \textup{ 1st roll }= i) \textup{P}(\textup{1st roll } = i)$

The probability the first roll is $i$ is 1/6, and if the first roll is a 1 then the probability of winning after that is zero. In the other 5 cases the conditional probability is the same regardless of $i$: to match $i$ on the second roll has a 1/6 chance. So the probability of winning is

$\displaystyle 5 \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{5}{36}$

For the working mathematician, these kinds of examples are relatively low-tech, but it illustrates the main way conditional probability is used in practice. We have some process we want to analyze, and we break it up into steps and condition on the results of a given step. We will see in a moment a more complicated example of this.

Partitions via Random Variables

The most common kind of partition is created via a random variable with finitely many values (or countably many, but we haven’t breached infinite probability spaces yet). In this case, we can partition the sample space $\Omega$ based on the values of $X$. That is, for each value $x = X(\omega)$, we will have a subset of the partition $S_x$ be the set of all $\omega$ which map to $x$. In the parlance of functions, it is the preimage of a single value $x$;

$\displaystyle S_x = X^{-1}(x) = \left \{ \omega \in \Omega : X(\omega) = x\right \}$

And as the reader is probably expecting, we can use this to define a “relative” expected value of a random variable. Recall that if the image of $X$ is a finite set $x_1, \dots, x_n$, the expected value of $X$ is a sum

$\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$

Suppose $X,Y$ are two such random variables, then the conditional probability of $X$ relative to the event $Y=y$ is the quantity

$\displaystyle \textup{P}(X=x | Y=y) = \frac{\textup{P}(X=x \textup{ and } Y=y)}{\textup{P}(Y=y)}$

And the conditional expectation of $X$ relative to the event $Y = y$, denoted $\textup{E}(X | Y = y)$ is a similar sum

$\displaystyle \textup{E}(X|Y=y) = \sum_{i=1}^n x_i \textup{P}(X = x_i | Y = y)$

Indeed, just as we implicitly “defined” a new sample space when we were partitioning based on events, here we are defining a new random variable (with the odd notation $X | Y=y$) whose domain is the preimage $Y^{-1}(y)$. We can then ask what the probability of it assuming a value $x$ is, and moreover what its expected value is.

Of course there is an analogue to the theorem of total probability lurking here. We want to say something like the true expected value of $X$ is a sum of the conditional expectations over all possible values of $Y$. We have to remember, though, that different values of $y$ can occur with very different probabilities, and the expected values of $X | Y=y$ can change wildly between them. Just as a quick (and morbid) example, if $X$ is the number of people who die on a randomly chosen day, and $Y$ is the number of atomic bombs dropped on that day, it is clear that the probability of $Y$ being positive is quite small, and the expected value of $X = Y=y$ will be dramatically larger if $y$ is positive than if it’s zero. (A few quick calculations based on tragic historic events show it would roughly double, using contemporary non-violent death rate estimates.)

And so instead of simply summing the expectation, we need to take an expectation over the values of $Y$. Thinking again of $X | Y=y$ as a random variable based on values of $Y$, it makes sense mathematically to take expectation. To distinguish between the two types of expectation, we will subscript the variable being “expected,” as in $\textup{E}_X(X|Y)$. That is, we have the following theorem.

TheoremThe expected value of $X$ satisfies

$\textup{E}_X(X) = \textup{E}_Y(\textup{E}_X(X|Y))$

Proof. Expand the definitions of what these values mean, and use the definition of conditional probability $\textup{P}(A \cap B) = \textup{P}(A | B) \textup{P}(B)$. We leave the proof as a trivial exercise to the reader, but if one cannot bear it, see Wikipedia for a full proof. $\square$

Let’s wrap up this post with a non-trivial example of all of this theory in action.

A Nontrivial Example: the Galton-Watson Branching Process

We are interested (as was the eponymous Sir Francis Galton in the 1800′s) in the survival of surnames through generations of marriage and children. The main tool to study such a generational phenomenon is the Galton-Watson branching process. The idea is quite simple, but its analysis quickly blossoms into a rich and detailed theoretical puzzle and a more general practical tool. Just before we get too deep into things, we should note that these ideas (along with other types of branching processes) are used to analyze a whole host of problems in probability theory and computer science. A few the author has recently been working with are the evolution of random graphs and graph property testing.

The gist is as follows: say we live in a patriarchal society in which surnames are passed down on the male side. We can image a family tree being grown step by step in this way At the root there is a single male, and he has $k$ children, some of which are girls and some of which are boys. They all go on to have some number of children, but only the men pass on the family name to their children, and only their male children pass on the family name further. If we only record the family tree along the male lines, we can ask whether the tree will be finite; that is, whether the family name will die out.

To make this rigorous, let us define an infinite sequence of random variables $X_1 X_2, \dots$ which represent the number of children each person in the tree has, and suppose further that all of these variables are independent and uniformly distributed from $1, \dots, n$ for some fixed $n$. This may be an unrealistic assumption, but it makes the analysis a bit simpler. The number of children more likely follows a Poisson distribution where the mean is a parameter we would estimate from real-world data, but we haven’t spoken of Poisson distributions on this blog yet so we will leave it out.

We further imagine the tree growing step by step: at step $i$ the $i$-th individual in the tree has $X_i$ children and then dies. If the individual is a woman we by default set $X_i = 0$. We can recursively describe the size of the tree at each step by another random variable $Y_i$. Clearly $Y_0 = 1$, and the recursion is $Y_n = Y_{n-1} + X_i - 1$. In words, $Y_i$ represents the current living population with the given surname. We say the tree is finite (the family name dies off), if for some $i$ we get $Y_i = 0$. The first time at which this happens is when the family name dies off, but abstractly we can imagine the sequence of random variables continuing forever. This is sometimes called fictitious continuation.

At last, we assume that the probability of having a boy or girl is a split 1/2. Now we can start asking questions. What is the probability that the surname dies off? What is the expected size of the tree in terms of $n$?

For the first question we use the theorem of total probability. In particular, suppose the first person has two boys. Then the whole tree is finite precisely when both boys’ sub-trees are finite. Indeed, the two boys’ sub-trees are independent of one another, and so the probability of both being finite is the product of the probabilities of each being finite. That is, more generally

$\displaystyle \textup{P}(\textup{finite } | k \textup{ boys}) = \textup{P}(\textup{finite})^k \textup{P}(\textup{two boys})$

Setting $z = \textup{P}(\textup{the tree is finite})$, we can compute $z$ directly by conditioning on all possibilities of the first person’s children. Notice how we must condition twice here.

$\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \textup{P}(k \textup{ boys } | i \textup{ children}) \textup{P}(i \textup{ children}) z^k$

The probability of getting $k$ boys is the same as flipping $i$ coins and getting $k$ heads, which is just

$\displaystyle \textup{P}(k \textup{ boys } | i \textup{ children}) = \binom{i}{k}\frac{1}{2^i}$

So the equation is

$\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \binom{i}{k} \frac{1}{2^i} \cdot \frac{1}{n} z^k$

From here, we’ve reduced the problem down to picking the correct root of a polynomial. For example, when $n=4$, the polynomial equation to solve is

$\displaystyle 64z = 5 + 10z + 10z^2 + 5z^3 + z^4$

We have to be a bit careful, here though. Not all solutions to this equation are valid answers. For instance, the roots must be between 0 and 1 (inclusive), and if there are multiple then one must rule out the irrelevant roots by some additional argument. Moreover, we would need to use a calculus argument to prove there is always a solution between 0 and 1 in the first place. But after all that is done, we can estimate the correct root computationally (or solve for exactly when our polynomials have small degree). Here for $n=4$, the probability of being finite is about 0.094.

We leave the second question, on the expected size of the tree, for the reader to ponder. Next time we’ll devote an entire post to Bayes Theorem (a trivial consequence of the definition of conditional probability), and see how it helps us compute probabilities for use in programs.

Until then!

Seam Carving for Content-Aware Image Scaling

The Problem with Cropping

Every programmer or graphic designer with some web development experience can attest to the fact that finding good images that have an exactly specified size is a pain. Since the dimensions of the sought picture are usually inflexible, an uncomfortable compromise can come in the form of cropping a large image down to size or scaling the image to have appropriate dimensions.

Both of these solutions are undesirable. In the example below, the caterpillar looks distorted in the scaled versions (top right and bottom left), and in the cropped version (bottom right) it’s more difficult to tell that the caterpillar is on a leaf; we have lost the surrounding context.

In this post we’ll look at a nice heuristic method for rescaling images called seam-carving, which pays attention to the contents of the image as it resacles. In particular, it only removes or adds pixels to the image that the viewer is least-likely to notice. In all but the most extreme cases it will avoid the ugly artifacts introduced by cropping and scaling, and with a bit of additional scaffolding it becomes a very useful addition to a graphic designer’s repertoire. At first we will focus on scaling an image down, and then we will see that the same technique can be used to enlarge an image.

Before we begin, we should motivate the reader with some examples of its use.

It’s clear that the caterpillar is far less distorted in all versions, and even in the harshly rescaled version, parts of the green background are preserved. Although the leaf is warped a little, it is still present, and it’s not obvious that the image was manipulated.

Now that the reader’s appetite has been whet, let’s jump into the mathematics of it. This method was pioneered by Avidan and Shamir, and the impatient reader can jump straight to their paper (which contains many more examples). In this post we hope to fill in the background and show a working implementation.

Images as Functions

One common way to view an image is as an approximation to a function of two real variables. Suppose we have an $n \times m$-pixel image ($n$ rows and $m$ columns of pixels). For simplicity (during the next few paragraphs), we will also assume that the pixel values of an image are grayscale intensity values between 0 and 255. Then we can imagine the pixel values as known integer values of a function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$. That is, if we take two integers $0 \leq x < n$ and $0 \leq y < m$ then we know the value $f(x,y)$; it’s just the intensity value at the corresponding pixel. For values outside these ranges, we can impose arbitrary values for $I$ (we don’t care what’s happening outside the image).

Moreover, it makes sense to assume that $f$ is a well-behaved function in between the pixels (i.e. it is differentiable). And so we can make reasonable guessed as to the true derivative of $f$ by looking at the differences between adjacent pixels. There are many ways to get a good approximation of the derivative of an image function, but we should pause a moment to realize why this is important to nail down for the purpose of resizing images.

A good rule of thumb with images is that regions of an image which are most important to the viewer are those which contain drastic changes in intensity or color. For instance, consider this portrait of Albert Einstein.

Which parts of this image first catch the eye? The unkempt hair, the wrinkled eyes, the bushy mustache? Certainly not the misty background, or the subtle shadows on his chin.

Indeed, one could even claim that an image having a large derivative at a certain pixel corresponds to high information content there (of course this is not true of all images, but perhaps it’s reasonable to claim this for photographs). And if we want to scale an image down in size, we are interested in eliminating those regions which have the smallest information content. Of course we cannot avoid losing some information: the image after resizing is smaller than the original, and a reasonable algorithm should not add any new information. But we can minimize the damage by intelligently picking which parts to remove; our naive assumption is that a small derivative at a pixel implies a small amount of information.

Of course we can’t just remove “regions” of an image to change its proportions. We have to remove the same number of pixels in each row or column to reduce the corresponding dimension (width or height, resp.). Before we get to that, though, let’s write a program to compute the gradient. For this program and the rest of the post we will use the Processing programming language, and our demonstrations will use the Javascript cross-compiler processing.js. The nice thing about Processing is that if you know Java then you know processing. All the basic language features are the same, and it’s just got an extra few native types and libraries to make graphics rendering and image displaying easier. As usual, all of the code used in this blog post is available on this blog’s Google code page.

Let’s compute the gradient of this picture, and call the picture $I$:

A very nice picture whose gradient we can compute. It was taken by the artist Ria Czichotzki.

Since this is a color image, we will call it a function $I: \mathbb{R}^2 \to \mathbb{R}^3$, in the sense that the input is a plane coordinate $(x,y)$, and the output $I(x,y) = (r,g,b)$ is a triple of color intensity values. We will approximate the image’s partial derivative $\left \langle \partial I / \partial x, \partial I / \partial y \right \rangle$ at $(x,y)$ by inspecting values of $I$ in a neighborhood of the point:

$I(x-1,y), I(x+1, y), I(x,y-1), I(x,y+1)$.

For each pixel we call the value $|I(x+1,y) - I(x-1,y)| / 2$ the partial derivative in the $x$ direction, and $|I(x,y+1) - I(x,y-1)| / 2$ the partial in the $y$ direction. Note that the values $I(x,y)$ are vectors, so the norm signs here are really computing the distance between the two values of $I$.

There are two ways to see why this makes sense as an approximation. The first is analytic: by definition, the partial derivative $\partial I / \partial x$ is a limit:

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x,y)|}{h}$

It turns out that this limit is equivalent to

$\displaystyle \lim_{h \to 0} \frac{|I(x+h,y) - I(x-h,y)|}{2h}$

And the closer $h$ gets to zero the better the approximation of the limit is. Since the closest we can make $h$ is $h=1$ (we don’t know any other values of $I$ with nonzero $h$), we plug in the corresponding values for neighboring pixels. The partial $\partial I / \partial y$ is similar.

The second way to view it is geometric.

The slope of the blue secant line is not a bad approximation to the derivative at x, provided the resolution is fine enough.

The salient fact here is that a nicely-behaved curve at $x$ will have a derivative close to the secant line between the points $(x-1, f(x-1))$ and $(x+1, f(x+1))$. Indeed, this idea inspires the original definition of the derivative. The slope of the secant line is just $(f(x+1) - f(x-1)) / 2$. As we saw in our post on numerical integration, we can do much better than a linear guess (specifically, we can use do any order of polynomial interpolation we wish), but for the purposes of displaying the concept of seam-carving, a linear guess will suffice.

And so with this intuitive understanding of how to approximate the gradient, the algorithm to actually do it is a straightforward loop. Here we compute the horizontal gradient (that is, the derivative $\partial I / \partial x$).

PImage horizontalGradient(PImage img) {
color left, right;
int center;
PImage newImage = createImage(img.width, img.height, RGB);

for (int x = 0; x < img.width; x++) {
for (int y = 0; y < img.height; y++) {
center = x + y*img.width;

left = x == 0 ? img.pixels[center] : img.pixels[(x-1) + y*img.width];
right = x == img.width-1 ? img.pixels[center] : img.pixels[(x+1) + y*img.width];

newImage.pixels[center] = color(colorDistance(left, right));
}
}

return newImage;
}


The details are a bit nit-picky, but the idea is simple. If we’re inspecting a non-edge pixel, then we can use the formula directly and compute the values of the neighboring left and right pixels. Otherwise, the “left” pixel or the “right” pixel will be outside the bounds of the image, and so we replace it with the pixel we’re inspecting. Mathematically, we’d be computing the difference $|I(x, y) - I(x+1, y)|$ and $|I(x-1,y) - I(x, y)|$. Additionally, since we’ll later only be interested in the relative sizes of the gradient, we can ignore the factor of 1/2 in the formula we derived.

The parts of this code that are specific to Processing also deserve some attention. Specifically, we use the built-in types PImage and color, for representing images and colors, respectively. The “createImage” function creates an empty image of the specified size. And peculiarly, the pixels of a PImage are stored as a one-dimensional array. So as we’re iterating through the rows and columns, we must compute the correct location of the sought pixel in the pixel array (this is why we have a variable called “center”). Finally, as in Java, the ternary if notation is used to keep the syntax short, and those two lines simply check for the boundary conditions we stated above.

The last unexplained bit of the above code is the “colorDistance” function. As our image function $I(x,y)$ has triples of numbers as values, we need to compute the distance between two values via the standard distance formula. We have encapsulated this in a separate function. Note that because (in this section of the blog) we are displaying the results in an image, we have to convert to an integer at the end.

int colorDistance(color c1, color c2) {
float r = red(c1) - red(c2);
float g = green(c1) - green(c2);
float b = blue(c1) - blue(c2);
return (int)sqrt(r*r + g*g + b*b);
}


Let’s see this in action on the picture we introduced earlier.

The reader who is interested in comparing the two more closely may visit this interactive page. Note that we only compute the horizontal gradient, so certain locations in the image have a large derivative but are still dark in this image. For instance, the top of the door in the background and the wooden bars supporting the bottom of the chair are dark despite the vertical color variations.

The vertical gradient computation is entirely analogous, and is left as an exercise to the reader.

Since we want to inspect both vertical and horizontal gradients, we will call the total gradient matrix $G$ the matrix whose entries $g_{i,j}$ are the sums of the magnitudes of the horizontal and vertical gradients at $i,j$:

$\displaystyle g_{i,j} = \left | \frac{\partial I}{\partial x} (i,j) \right | + \left | \frac{\partial I}{\partial y} (i,j) \right |$

The function $e(x,y) = g_{x,y}$ is often called an energy function for $I$. We will mention now that there are other energy functions one can consider, and use this energy function for the remainder of this post.

Seams, and Dynamic Programming

Back to the problem of resizing, we want a way to remove only those regions of an image that have low total gradient across all of the pixels in the region removed. But of course when resizing an image we must maintain the rectangular shape, and so we have to add or remove the same number of pixels in each column or row.

For the purpose of scaling an image down in width (and the other cases are similar), we have a few options. We could find the pixel in each row with minimal total gradient and remove it. More conservatively, we could remove those columns with minimal gradient (as a sum of the total gradient of each pixel in the column). More brashly, we could just remove pixels of lowest gradient willy-nilly from the image, and slide the rows left.

If none of these ideas sound like they would work, it’s because they don’t. We encourage the unpersuaded reader to try out each possibility on a variety of images to see just how poorly they perform. But of these options, removing an entire column happens to distort the image less than the others. Indeed, the idea of a “seam” in an image is just a slight generalization of a column. Intuitively, a seam $s_i$ is a trail of pixels traversing the image from the bottom to the top, and at each step the pixel trail can veer to the right or left by at most one pixel.

Definition: Let $I$ be an $n \times m$ image with nonnegative integer coordinates indexed from zero. A vertical seam in $I$ is a list of coordinates $s_i = (x_i, y_i)$ with the following properties:

• $y_0 = 0$ is at the bottom of the image.
• $y_{n-1} = n-1$ is at the top of the image.
• $y_i$ is strictly increasing.
• $|x_i - x_{i+1}| \leq 1$ for all $0 \leq i < n-1$.

These conditions simply formalize what we mean by a seam. The first and second impose that the seam traverses from top to bottom. The third requires the seam to always “go up,” so that there is only one pixel in each row. The last requires the seam to be “connected” in the sense that it doesn’t veer too far at any given step.

Here are some examples of some vertical seams. One can easily define horizontal seams by swapping the placement of $x, y$ in the above list of conditions.

So the goal is now to remove the seams of lowest total gradient. Here the total gradient of a seam is just the sum of the energy values of the pixels in the seam.

Unfortunately there are many more seams to choose from than columns (or even individual pixels). It might seem difficult at first to find the seam with the minimal total gradient. Luckily, if we’re only interested in minima, we can use dynamic programming to compute the minimal seam ending at any given pixel in linear time.

We point the reader unfamiliar with dynamic programming to our Python primer on this topic. In this case, the sub-problem we’re working with is the minimal total gradient value of all seams from the bottom of the image to a fixed pixel. Let’s call this value $v(a,b)$. If we know $v(a,b)$ for all pixels below, say, row $i$, then we can compute the $v(i+1,b)$ for the entire row $i+1$ by taking pixel $(i+1,j)$, and adding its gradient value to the minimum of the values of possible predecessors in a seam, $v(i,j-1), v(i,j), v(i,j+1)$ (respecting the appropriate boundary conditions).

Once we’ve computed $v(a,b)$ for the entire matrix, we can look at the minimal value at the top of the image $\min_j v(n,j)$, and work backwards down the image to compute which seam gave us this minimum.

Let’s make this concrete and compute the function $v$ as a two-dimensional array called “seamFitness.”

void computeVerticalSeams() {
seamFitness = new float[img.width][img.height];
for (int i = 0; i < img.width; i++) {
}

for (int y = 1; y < img.height; y++) {
for (int x = 0; x < img.width; x++) {

if (x == 0) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x+1][y-1]);
} else if (x == img.width-1) {
seamFitness[x][y] += min(seamFitness[x][y-1], seamFitness[x-1][y-1]);
} else {
seamFitness[x][y] += min(seamFitness[x-1][y-1], seamFitness[x][y-1], seamFitness[x+1][y-1]);
}
}
}
}


We have two global variables at work here (global is bad, I know, but it’s Processing; it’s made for prototyping). The seamFitness array, and the gradientMagnitude array. We assume at the start of this function that the gradientMagnitude array is filled with sensible values.

Here we first initialize the zero’th row of the seamFitness array to have the same values as the gradient of the image. This is simply because a seam of length 1 has only one gradient value. Note here the coordinates are a bit backwards: the first coordinate represents the choice of a column, and the second represents the choice of a row. We can think of the coordinate axes of our image function having the origin in the bottom-left, the same as we might do mathematically.

Then we iterate over the rows in the matrix, and in each column we compute the fitness based on the fitness of the previous row. That’s it

To actually remove a seam, we need to create a new image of the right size, and shift the pixels to the right (or left) of the image into place. The details are technically important, but tedious to describe fully. So we leave the inspection of the code as an exercise to the reader. We provide the Processing code on this blog’s Google Code page, and show an example of its use below. Note each the image resizes every time the user clicks within the image.

Photograph by Raphael Goetter.

It’s interesting (and indeed the goal) to see how at first nothing is warped, and then the lines on the walls curve around the woman’s foot, and then finally the woman’s body is distorted before she gets smushed into a tiny box by the oppressive mouse.

As a quick side note, we attempted to provide an interactive version of this Processing program online in the same way we did for the gradient computation example. Processing is quite nice in that any Processing program (which doesn’t use any fancy Java libraries) can be cross-compiled to Javascript via the processing.js library. This is what we did for the gradient example. But in doing so for the (admittedly inefficient and memory-leaky) seam-carving program, it appeared to run an order of magnitude slower in the browser than locally. This was this author’s first time using Processing, so the reason for the drastic jump in runtime is unclear. If any readers are familiar with processing.js, a clarification would be very welcome in the comments.

Inserting Seams, Removing Objects, and Videos

In addition to removing seams to scale an image down, one can just as easily insert seams to make an image larger. To insert a seam, just double each pixel in the seam and push the rest of the pixels on the row to the right. The process is not hard, but it requires avoiding one pitfall: if we just add a single seam at a time, then the seam with minimum total energy will never change! So we’ll just add the same seam over and over again. Instead, if we want to add $k$ seams, one should compute the minimum $k$ seams and insert them all. If the desired resize is too large, then the programmer should pick an appropriate batch size and add seams in batches.

Another nice technique that comes from the seam-carving algorithm is to intelligently protect or destroy specific regions in the image. To do this requires a minor modification of the gradient computation, but the rest of the algorithm is identical. To protect a region, provide some way of user input specifying which pixels in the image are important, and give those pixels an artificially large gradient value (e.g., the maximum value of an integer). If the down-scaling is not too extreme, the seam computations will be guaranteed not to use any of those pixels, and inserted seams will never repeat those pixels. To remove a region, we just give the desired pixels an arbitrarily low gradient value. Then these pixels will be guaranteed to occur in the minimal seams, and will be removed from the picture.

The technique of seam-carving is a very nice tool, and as we just saw it can be extended to a variety of other techniques. In fact, seam-carving and its applications to object removal and image resizing are implemented in all of the recent versions of Photoshop. The techniques are used to adapt applications to environments with limited screen space, such as a mobile phone or tablet. Seam carving can even be adapted for use in videos. This involves an extension of the dynamic program to work across multiple frames, formally finding a minimal graph cut between two frames so that each piece of the cut is a seam in the corresponding frame. Of course there is a lot more detail to it (and the paper linked above uses this detail to improve the basic image-resizing algorithm), but that’s the rough idea.

We’ve done precious little on this blog with images, but we’d like to get more into graphics programming. There’s a wealth of linear algebra, computational geometry, and artificial intelligence hiding behind most of the computer games we like to play, and it would be fun to dive deeper into these topics. Of course, with every new post this author suggests ten new directions for this blog to go. It’s a curse and a blessing.

Until next time!

Probability Theory — A Primer

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to  future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each event happening. For example, a set $\left \{ 0,1 \right \}$ might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

Definition: A finite set $\Omega$ equipped with a function $f: \Omega \to [0,1]$ is a probability space if the function $f$ satisfies the property

$\displaystyle \sum_{\omega \in \Omega} f(\omega) = 1$

That is, the sum of all the values of $f$ must be 1.

Sometimes the set $\Omega$ is called the sample space, and the act of choosing an element of $\Omega$ according to the probabilities given by $f$ is called drawing an example. The function $f$ is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows. Because we don’t really care about the probability of a single outcome as much as we do the probability of an event.

Definition: An event $E \subset \Omega$ is a subset of a sample space.

For instance, suppose our probability space is $\Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $\left \{ 2, 4, 6 \right \}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function $f$ (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure $\textup{P}: 2^\Omega \to \mathbb{R}$, where $2^\Omega$ denotes the set of all subsets of $\Omega$. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. i.e. we define

$\displaystyle \textup{P}(E) = \sum_{e \in E} f(e)$

where by convention the empty sum has value zero. Note that the function $\textup{P}$ is often denoted $\textup{Pr}$.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the sum of the probabilities of all outcomes must sum to 1. More coherently: $\textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1$ by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities $p$ and $1-p$ for some $p$. Such a probability space is often called a Bernoulli trial.

Now that the function $\textup{P}$ is defined on all events, we can simplify our notation considerably. Because the probability mass function $f$ uniquely determines $\textup{P}$ and because $\textup{P}$ contains all information about $f$ in it ($\textup{P}(\left \{ \omega \right \}) = f(\omega)$), we may speak of $\textup{P}$ as the probability measure of $\Omega$, and leave $f$ out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of $\textup{P}$ is understood as above.

There are some other quick properties we can state or prove about probability measures: $\textup{P}(\left \{ \right \}) = 0$ by convention, if $E, F$ are disjoint then $\textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F)$, and if $E \subset F \subset \Omega$ then $\textup{P}(E) \leq \textup{P}(F)$. The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.

Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

$\displaystyle \Omega = \left \{ (1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6) \right \}$

Where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of $\Omega$ because all of the randomness inherent in the game comes from the generation of an output in $\Omega$ (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

Definition: random variable $X$ is a real-valued function on the sample space $\Omega \to \mathbb{R}$.

So for example the random variable for the sum of the two dice would be $X(a,b) = a+b$. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of all random variables $\textup{RV}(\Omega)$. It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if $X, Y$ are two random variables, so is $aX + bY$ for real numbers $a, b$. This function operates linearly, in the sense that its value is $(aX + bY)(\omega) = aX(\omega) + bY(\omega)$. We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, $XY$ is the product of two random variables (defined by $XY(\omega) = X(\omega)Y(\omega)$) and one can imagine such awkward constructions as $X/Y$ or $X^Y$. We will see in a bit why it these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event $E$, we can define a random variable which is 0 or 1 depending on whether the input is a member of $E$. That is,

Definition: An indicator random variable $1_E$ is defined by setting $1_E(\omega) = 1$ when $\omega \in E$ and 0 otherwise. A common abuse of notation for singleton sets is to denote $1_{\left \{ \omega \right \} }$ by $1_\omega$.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

Definition: Denote by $\left \{ X = k \right \}$ the set of outcomes $\omega \in \Omega$ for which $X(\omega) = k$. With the function notation, $\left \{ X = k \right \} = X^{-1}(k)$.

This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define $\left \{ X < 5 \right \}$ or $\left \{ X \textup{ is even} \right \}$ just as we would naively construct sets. It works in general for any subset of $S \subset \mathbb{R}$. The notation is $\left \{ X \in S \right \} = X^{-1}(S)$, and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure $\textup{P}$. That is, we want to write things like $\textup{P}(X \textup{ is even})$ and read it in our head “the probability that $X$ is even”.

This is made rigorous by simply setting

$\displaystyle \textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\omega)$

In words, it is just the sum of the probabilities that individual outcomes will have a value under $X$ that lands in $S$. We will also use for $\textup{P}(\left \{ X \in S \right \} \cap \left \{ Y \in T \right \})$ the shorthand notation $\textup{P}(X \in S, Y \in T)$ or $\textup{P}(X \in S \textup{ and } Y \in T)$.

Often times $\left \{ X \in S \right \}$ will be smaller than $\Omega$ itself, even if $S$ is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), let $X$ be the profit function. Then $\textup{P}(X > 0)$ is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable $\textup{im}(X)$ is a finite subset of real numbers. In other words, the set of all events of the form $\left \{ X = x_i \right \}$ where $x_i \in \textup{im}(X)$ form a partition of $\Omega$. As such, we get the following immediate identity:

$\displaystyle 1 = \sum_{x_i \in \textup{im} (X)} P(X = x_i)$

The set of such events is called the probability distribution of the random variable $X$.

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events $E,F \subset \Omega$ are independent if the probability of both $E, F$ occurring is the product of the probabilities of each event occurring. That is, $\textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F)$. There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from $E,F$ being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, $X,Y$ are independent random variables if

$\displaystyle \textup{P}(X = x, Y = y) = \textup{P}(X=x)\textup{P}(Y=y)$

for all $x,y \in \mathbb{R}$.

Expectation

We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory.

Definition: Let $X$ be a random variable on a finite probability space $\Omega$. The expected value of $X$, denoted $\textup{E}(X)$, is the quantity

$\displaystyle \textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\omega)$

Note that if we label the image of $X$ by $x_1, \dots, x_n$ then this is equivalent to

$\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$

The most important fact about expectation is that it is a linear functional on random variables. That is,

Theorem: If $X,Y$ are random variables on a finite probability space and $a,b \in \mathbb{R}$, then

$\displaystyle \textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)$

Proof. The only real step in the proof is to note that for each possible pair of values $x, y$ in the images of $X,Y$ resp., the events $E_{x,y} = \left \{ X = x, Y=y \right \}$ form a partition of the sample space $\Omega$. That is, because $aX + bY$ has a constant value on $E_{x,y}$, the second definition of expected value gives

$\displaystyle \textup{E}(aX + bY) = \sum_{x \in \textup{im} (X)} \sum_{y \in \textup{im} (Y)} (ax + by) \textup{P}(X = x, Y = y)$

and a little bit of algebraic elbow grease reduces this expression to $a\textup{E}(X) + b\textup{E}(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $\sum_{y \in \textup{im}(Y)} \textup{P}(X = x, Y = y)$ is identical to $\textup{P}(X = x)$. $\square$

If we additionally know that $X,Y$ are independent random variables, then the same technique used above allows one to say something about the expectation of the product $\textup{E}(XY)$ (again by definition, $XY(\omega) = X(\omega)Y(\omega)$). In this case $\textup{E}(XY) = \textup{E}(X)\textup{E}(Y)$. We leave the proof as an exercise to the reader.

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $\textup{E}(X) \in \textup{im}(X)$. For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

$\displaystyle 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}$

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable $X$ into simpler (usually indicator) random variables, the computation of $\textup{E}(X)$ becomes trivial.

tournament  $T$ is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers $(v_1, \dots, v_n)$, where we visit vertex $v_i$ at stage $i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $(v_i, v_{i+1})$ is an edge in $T$ for all $i$.

Now if we construct a tournament on $n$ vertices by choosing the direction of each edges independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, $X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $\textup{E}(X)$.

To compute this, simply note that we can break $X = \sum_p X_p$, where $p$ ranges over all possible lists of the vertices. Then $\textup{E}(X) = \sum_p \textup{E}(X_p)$, and it suffices to compute the number of possible paths and the expected value of any given path. It isn’t hard to see the number of paths is $n!$ as this is the number of possible lists of $n$ items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

$\textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})$

which by independence is

$\displaystyle \prod_{i = 1}^n \textup{P}(i^\textup{th} \textup{ edge is chosen}) = \frac{1}{2^{n-1}}$

That is, the expected number of Hamiltonian paths is $n!2^{-(n-1)}$.

Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable $X$ are throughout the real line.

Definition: The variance of a random variable $X$ is the quantity $\textup{E}((X - \textup{E}(X))^2)$.

That is, $\textup{E}(X)$ is a number, and so $X - \textup{E}(X)$ is the random variable defined by $(X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X)$. It is the expectation of the square of the deviation of $X$ from its expected value.

One often denotes the variance by $\textup{Var}(X)$ or $\sigma^2$. The square is for silly reasons: the standard deviation, denoted $\sigma$ and equivalent to $\sqrt{\textup{Var}(X)}$ has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall our directly from the definition. We encourage the reader to prove a few:

• $\textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2$.
• $\textup{Var}(aX) = a^2\textup{Var}(X)$.
• When $X,Y$ are independent then variance is additive: $\textup{Var}(X+Y) = \textup{Var}(X) + \textup{Var}(Y)$.
• Variance is invariant under constant additives: $\textup{Var}(X+c) = \textup{Var}(X)$.

In addition, the quantity $\textup{Var}(aX + bY)$ is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is covariance.

Definition: Let $X,Y$ be random variables. The covariance of $X$ and $Y$, denoted $\textup{Cov}(X,Y)$, is the quantity $\textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y)))$.

Note the similarities between the variance definition and this one: if $X=Y$ then the two quantities coincide. That is, $\textup{Cov}(X,X) = \textup{Var}(X)$.

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

Theorem: Let $X,Y$ be random variables with variances $\sigma_X^2, \sigma_Y^2$. Then their covariance is at most the product of the standard deviations in magnitude:

$|\textup{Cov}(X,Y)| \leq \sigma_X \sigma_Y$

Proof. Take any two non-constant random variables $X$ and $Y$ (we will replace these later with $X - \textup{E}(X), Y - \textup{E}(Y)$). Construct a new random variable $(tX + Y)^2$ where $t$ is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, $\textup{E}((tX + Y)^2)$. Expanding this and using linearity gives

$\displaystyle f(t) = t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0$

This is a quadratic function of a single variable $t$ which is nonnegative. From elementary algebra this means the discriminant is at most zero. i.e.

$\displaystyle 4 \textup{E}(XY)^2 - 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0$

and so dividing by 4 and replacing $X,Y$ with $X - \textup{E}(X), Y - \textup{E}(Y)$, resp., gives

$\textup{Cov}(X,Y)^2 \leq \sigma_X^2 \sigma_Y^2$

and the result follows. $\square$

Note that equality holds in the discriminant formula precisely when $Y = -tX$ (the discriminant is zero), and after the replacement this translates to $Y - \textup{E}(Y) = -t(X - \textup{E}(X))$ for some fixed value of $t$. In other words, for some real numbers $a,b$ we have $Y = aX + b$.

This has important consequences even in English: the covariance is maximized when $Y$ is a linear function of $X$, and otherwise is bounded from above and below. By dividing both sides of the inequality by $\sigma_X \sigma_Y$ we get the following definition:

Definition: The Pearson correlation coefficient of two random variables $X,Y$ is defined by

$\displaystyle r= \frac{\textup{Cov}(X,Y)}{\sigma_X \sigma_Y}$

If $r$ is close to 1, we call $X$ and $Y$ positively correlated. If $r$ is close to -1 we call them negatively correlated, and if $r$ is close to zero we call them uncorrelated.

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

The  horizontal axis plots a sample of values of the random variable $X$ and the vertical plots a sample of $Y$. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of $Y$ given values of $X$ (same goes for -1, but the predictor has negative slope).

But this strays a bit far from our original point: we still want to find a formula for $\textup{Var}(aX + bY)$. Expanding the definition, it is not hard to see that this amounts to the following proposition:

Proposition: The variance operator satisfies

$\displaystyle \textup{Var}(aX+bY) = a^2\textup{Var}(X) + b^2\textup{Var}(Y) + 2ab \textup{Cov}(X,Y)$

And using induction we get a general formula:

$\displaystyle \textup{Var} \left ( \sum_{i=1}^n a_i X_i \right ) = \sum_{i=1}^n \sum_{j = 1}^n a_i a_j \textup{Cov}(X_i,X_j)$

Note that in the general sum, we get a bunch of terms $\textup{Cov}(X_i,X_i) = \textup{Var}(X_i)$.

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

Definition: The covariance matrix of a collection of random variables $X_1, \dots, X_n$ is the matrix whose $(i,j)$ entry is $\textup{Cov}(X_i,X_j)$.

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and nonnegative, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables $Y_j$ (which are linear combinations of the original variables $X_i$) such that the covariance of distinct pairs $Y_j, Y_k$ are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called principal component analysis.

Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!

Information Distance — A Primer

This post assumes familiarity with our primer on Kolmogorov complexity. We recommend the uninformed reader begin there. We will do our best to keep consistent notation across both posts.

Kolmogorov Complexity as a Metric

Over the past fifty years mathematicians have been piling up more and more theorems about Kolmogorov complexity, and for good reason. One of the main interpretations of the Kolmogorov complexity function $K$ is that for a given string $x$, $K(x)$ is the best theoretical compression of $x$ under any compression scheme. So a negative result about $K$ can provide useful bounds on how good a real-world compressor can be. It turns out that these properties also turn $K$ into a useful tool for machine learning. The idea is summarized as follows:

Let $x,y$ be binary strings, and as usual let’s fix some universal programming language $L$ in which to write all of our programs. Let $p(x,y)$ be the shortest program which computes both $y$ when given $x$ as an input, and $x$ given $y$. We would imagine that if $x,y$ are unrelated, then the length of the program $|p(x,y)|$ would be roughly $K(x) + K(y)$, simply by running the shortest program to output $x$ using no inputs, followed by the same thing for $y$. As usual there will be some additive constant term independent of both $x$ and $y$. We denote this by $c$ or $O(1)$ interchangeably.

We would further imagine that if $x,y$ are related (that is, if there is some information about $x$ contained in $y$ or vice versa), then the program $p(x,y)$ would utilize that information and hence be shorter than $K(x) + K(y)$. It turns out that there is an even better way to characterize $p$, and with a few modifications we can turn the length of $p$ into something similar to a metric on the set of all strings.

This metric has some strikingly attractive features. We will see that it is “universal” with respect to a certain class of distance functions (which is unfortunately not the class of all metrics). In particular, for any of these functions $f$, the length of $|p(x,y)|$ will be at worst a small amount larger than $f(x,y)$. In words, if $x$ and $y$ are similar according to any of these distance functions, then they will be similar according to $p$. Of course the devil is in the details, but this is the right idea to have in mind while we wade through the computations.

An Aside on Metrics, and Properties of Kolmogorov Complexity

In recent posts on this blog we’ve covered a number of important examples of metrics and investigated how a metric creates structure in a space. But as powerful and rare as fruitful metrics are, we have barely scratched the surface of the vast amount of literature on the subject.

As usual with our computations in Kolmogorov complexity, all of our equalities will be true up to some kind of additive sloppiness. Most of the time it will be an additive constant $O(1)$ which is independent of anything else in the equation. We will usually omit the constant with that implicit understanding, and instead we will specify the times when it is an exact equality (or when the additive sloppiness is something other than a constant).

And so, unavoidably, the “metric” we define won’t be a true metric. It will only satisfy the metric properties (positive definite, symmetric, triangle inequality) up to a non-constant additive sloppiness. This will be part of the main theorem of this post.

Before we can reach the heart of the matter (and as a nice warm-up), we need to establish a few more properties of $K$. Recall that by $K(x|y)$ we mean the shortest program which computes $x$ when provided $y$ as an auxiliary input. We call this the conditional complexity of $x$ given $y$. Further, recall that $K(x,y)$ is the length of the shortest program which outputs both $x$ and $y$, and a way to distinguish between the two (if everything is in binary, the distinguishing part is nontrivial; should the reader be interested, this sort of conversation is made for comment threads). Finally, the comma notation works for auxiliary inputs as well: $K(x|y,z)$ is the length of the shortest program outputting $x$ when given $y,z$ and a way to distinguish them as input.

For example, the conditional Kolmogorov complexity $K(1^n | n) = c$ is constant: the length of the string $1^n$ provides all but a finite amount of information about it. On the other hand, if $x,y$ are random strings (their bits are generated independently and uniformly at random), then $K(y|x) = K(y)$; there is no information about $y$ contained in $x$.

Definition: Let $x$ be a (binary) string. We denote by $x^*$ the shortest program which computes $x$. That is, $K(x) = |x^*|$. If there are two shortest programs which compute $x$, then $x^*$ refers to the first in the standard enumeration of all programs.

As a quick aside, the “standard enumeration” is simple: treat a binary string as if it were a natural number written in base 2, and enumerate all strings in increasing order of their corresponding number. The choice of enumeration is irrelevant, though; all that matters is that it is consistent throughout our theory.

Proposition: Kolmogorov complexity has the following properties up to additive constants:

1. $K(x|y^*) = K(x|y,K(y))$
2. $K(x|y^*) \leq K(x|y)$, and $K(x|y) \leq K(x|y^*) + O(\log(K(y)))$
3. $K(x,y) = K(x) + K(y|x^*)$

The first item simply states that giving $y^*$ as input to a program is the same as giving $y$ and $K(y)$. This is not hard to prove. If $p$ is the shortest program computing $x$ from $y,K(y)$, then we can modify it slightly to work with $y^*$ instead. Just add to the beginning of $p$ the following instructions:

Compute K(y) as the length of the input y*
Simulate y* and record its output y

Since $y^*$ is a finite string and represents a terminating program, these two steps produce the values needed to run $p$. Moreover, the program description is constant in length, independent of $y^*$.

On the other hand, if $q$ is a program computing $x$ from $y^*$, we are tasked with finding $y^*$ given $y, K(y)$. The argument a standard but slightly more complicated technique in theoretical computer science called dovetailing. In particular, since we know the length of $y^*$, and there are only finitely many programs of the same length, we can get a list $p_1, p_2, \dots p_n$ of all programs of length $K(y)$. We then interleave the simulation of each of these programs; that is, we run the first step of all of the $p_i$, then the second, third, and so on. Once we find a program which halts and outputs $y$ (and we are guaranteed that one will do so) we can stop. In pseudocode, this is just the subroutine:

L = [all programs of length K(y) in lexicographic order]
i = 1
while True:
for program in L:
run step i of program
if program terminates and outputs y:
return program

The fact that this algorithm will terminate proves the claim.

The second item in the proposition has a similar proof, and we leave it as an exercise to the reader. (Hint: the logarithm in the second part of the statement comes from the hard-coding of a binary representation of the number $K(y)$)

The third item, that $K(x,y) = K(x) + K(y|x^*)$ has a much more difficult proof, and its consequences are far-reaching. We will use it often in our computations. The intrepid reader will see Theorem 3.9.1 in the text of Li & Vitanyi for a complete proof, but luckily one half of the proof is trivial. That is, the proof that $K(x,y) \leq K(x) + K(y|x^*) + c$ is similar to the argument we used above. Let $p,q$ be the shortest programs computing $x$ and $y$ given $x^*$, respectively. We can combine them into a program computing $x$ and $y$. First run $p$ to compute $x$ and compute the length of $p$. As we saw, these two pieces of data are equivalent to $x^*$, and so we can compute $y$ using $q$ as above, adding at most a finite amount program text to do so.

This property is so important it has a name.

Lemma: (Symmetry of information)

$\displaystyle K(x,y) = K(x) + K(y|x^*) = K(y) + K(x|y^*)$

This is true (and named appropriately) since there is symmetry in the quantity $K(x,y) = K(y,x)$. Note in particular that this doesn’t hold without the star: $K(x,y) = K(x) + K(y|x) + O(\log(K(x)))$. Those readers who completed the exercise above will know where the logarithm comes from.

The Almost-Triangle Inequality

The first application of the symmetry of information is (surprisingly) a variant of the triangle inequality. Specifically, the function $f(x,y) = K(x|y^*)$ satisfies the metric inequalities up to an additive constant sloppiness.

$\displaystyle K(x|y^*) \leq K(x|z^*) + K(z|y^*) + c$

where $c$ does not depend on $x, y, z$. To prove this, see that

$\displaystyle K(x,z | y^*) = K(x,y,z) - K(y) \leq K(z) + K(x|z^*) + K(y|z^*) - K(y)$

The first equality is by the symmetry of information $K(x,y,z) = K(y) + K(x,z|y^*)$, and the second follows from the fact that $K(x,y,z) \leq K(z) + K(x|z^*) + K(y|z^*)$. This is the same argument we used to prove the $\leq$ case of the symmetry of information lemma.

Now we can rearrange the terms and use the symmetry of information twice, $K(z) + K(y|z^*) = K(y,z)$ and $K(y,z) - K(y) = K(z|y^*)$, to reach the final result.

This is interesting because it’s our first indication that Kolmogorov complexity can play a role in a metric. But there are some issues: $K(x|y)$ is in general not symmetric. We need to some up with a symmetric quantity to use instead. There are quite a few details to this process (see this paper if you want to know them all), but the result is quite nice.

Theorem: Let $E(x,y)$ be the length of the shortest program which computes $x$ given $y$ as input and $y$ given $x$. Then

$\displaystyle E(x,y) = \max (K(x|y), K(y|x)) + O(\log(M))$

where $M = \max(K(x|y), K(y|x))$.

That is, our intuitive idea of what the “information distance” from $x$ to $y$ should be coincides up to an additive logarithmic factor with the maximum of the conditional Kolmogorov complexities. If two strings are “close” with respect to $E$, then there is a lot of mutual information between them. In the same paper listed above, the researchers (Bennett et al.) prove that $E$ is a “metric” (up to additive constants) and so this gives a reasonable estimate for the true information distance in terms of conditional Kolmogorov complexities.

However, $E$ is not the final metric used in applications, but just an inspiration for other functions. This is where the story gets slightly more complicated.

Normalized Information Distance(s)

At this point we realize that the information distance $E$ defined above is not as good as we’d like it to be. One of its major deficiencies is that it does not compute relative distances very well. That is, it doesn’t handle strings of varying size as well as it should.

For example, take $x$ to be a random string of length $n$ for arbitrary $n$.   The quantity $E(x, \varepsilon)$, where $\varepsilon$ is the empty string is just $K(x) + c$ (if the input is empty, compute $x$, otherwise output the empty string). But in a sense there is no information about $\varepsilon$ in any string. In other words, $\varepsilon$ is maximally dissimilar to all nonempty strings. But according to $E$, the empty string is variably dissimilar to other strings: it’s “less similar” to strings with higher Kolmogorov complexity. This is counter-intuitive, and hence undesirable.

Unfortunately the literature is littered with alternative distance functions, and the researchers involved spend little effort relating them to each other (this is part of the business of defining things “up to sloppiness”). We are about to define the principal example we will be concerned with, and we will discuss its relationship with its computationally-friendly cousins at the end.

The link between all of these examples is normalization. That is (again up to minor additive sloppiness we’ll make clear shortly) the distance functions take values in $[0,1]$, and a value of 0 means the strings are maximally similar, and a value of 1 implies maximal dissimilarity.

Definition: Let $\Sigma = \left \{ 0,1 \right \}^*$ be the set of binary strings. A normalized distance $f$ is a function $\Sigma \times \Sigma \to [0,1]$ which is symmetric and satisfies the following density condition for all $x \in \Sigma$ and all $0 \leq e \leq 1$:

$\displaystyle |\left \{ y : d(x,y) \leq e \right \}| < 2^{eK(x) + 1}$

That is, there is a restriction on the number of strings that are close to $x$. There is a sensible reason for such a convoluted condition: this is the Kolmogorov-complexity analogue of the Kraft inequality. One of the picky details we’ve blatantly left out in our discussion of Kolmogorov complexity is that the programs we’re allowed to write must collectively form a prefix-code. That is, no program is a proper prefix of another program. If the implications of this are unclear (or confusing), the reader may safely ignore it. It is purely a tool for theoretical analysis, and the full details are again in the text of Li & Vitanyi. We will come back to discuss other issues with this density condition later (in the mean time, think about why it’s potentially dubious), but now let us define our similarity metric.

Definition: The normalized information distance $d(x,y)$ is defined by

$\displaystyle d(x,y) = \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))}$

The reason we switched from $K(x|y)$ to $K(x|y^*)$ will become apparent in our calculations (we will make heavy use of the symmetry of information, which does not hold by a constant factor for $K(x|y)$).

Quickly note that this alleviates our empty string problem we had with the non-normalized metric. $d(x,\varepsilon) = K(x)/K(x) = 1$, so they are maximally dissimilar regardless of what $x$ is.

Theorem 1: (Metric Axioms) $d(x,y)$ satisfies the metric axioms up to additive $O(1/M)$ precision, where $M$ is the maximum of the Kolmogorov complexities of the strings involved in the (in)equality.

Theorem 2: (Universality) $d(x,y)$ is universal with respect to the class of computable normalized distance functions. That is, if $f$ is a normalized distance, then for all $x,y$ we have the following inequality:

$d(x,y) \leq f(x,y) + O(1/M)$

where again $M$ is the minimum of the Kolmogorov complexities of the strings involved.

We should note that in fact theorem 2 holds for even more general normalized distance functions, the so-called “upper semi-computable” functions. Skipping the rigorous definition, this just means that one can recursively approximate the true value by giving a consistently improved upper bound which converges to the actual value. It is not hard to see that $K$ is an upper semi-computable function, although it is unknown whether $d$ is (and many believe it is not).

The proof of the first theorem is straightforward but notationally dense.

Proof of Theorem 1 (Metric Axioms): The value $d(x,x) = K(x|x^*)/K(x) = O(1/K(x))$, since $K(x|x^*) = K(x|x,K(x))$ is trivially constant, and $d(x,y) \geq 0$ since Kolmogorov complexity is non-negative. Moreover, $d(x,y)$ is exactly symmetric, so the proof boils down to verifying the triangle inequality holds.

Let $x,y,z$ be strings. We gave a proof above that $K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$. We will modify this inequality to achieve our desired result, and there are two cases:

Case 1: $K(z) \leq \max(K(x), K(y))$. Take the maximum of each side of the two inequalities for $K(x|y^*), K(y|x^*)$ to get

$\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*) + K(z|y^*) , K(y|z^*) + K(z|x^*)) + O(1)$

We can further increase the right hand side by taking termwise maxima

$\displaystyle \max(K(x|y^*), K(y|x^*)) \leq \max(K(x|z^*), K(z|x^*)) + \max(K(y|z^*), K(z|y^*)) + O(1)$

Now divide through by $\max(K(x), K(y))$ to get

$\displaystyle \frac{\max(K(x|y^*), K(y|x^*))}{\max(K(x), K(y))} \leq \frac{\max(K(x|z^*), K(z|x^*))}{\max(K(x), K(y))} + \frac{\max(K(y|x^*), K(z|y^*))}{\max(K(x), K(y))} + O(1/M)$

Finally, since $K(z)$ is smaller than the max of $K(x), K(y)$, we can replace the  $K(y)$ in the denominator of the first term of the right hand side by $K(z)$. This will only possibly increase the fraction, and for the same reason we can replace $K(x)$ by $K(z)$ in the second term. This achieves the triangle inequality up to $O(1/M)$, as desired.

Case 2: $K(z) = \max(K(x), K(y), K(z))$. Without loss of generality we may also assume $K(x) \geq K(y)$, for the other possibility has an identical argument. Now we can boil the inequality down to something simpler. We already know the denominators have to all be $K(z)$ in the right hand side, and $K(x)$ in the left. Moreover, we claim $K(z|x^*) \geq K(x|z^*)$. This is by the symmetry of information:

$\displaystyle K(x,z) = K(x|z^*) + K(z) = K(z|x^*) + K(x) \leq K(z|x^*) + K(z)$

Subtracting $K(z)$ establishes the claim, and similarly we have $K(z|y^*) \geq K(y|z^*)$. So the triangle inequality reduces to

$\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(z|x^*)}{K(z)} + \frac{K(z|y^*)}{K(z)} + O(1/K(z))$

Applying our original inequality again to get $K(x|y^*) \leq K(x|z^*) + K(z|y^*) + O(1)$, we may divide through by $K(x)$ and there are two additional cases.

$\displaystyle \frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + O(1)}{K(x)}$

If the right-hand side is less than or equal to 1, then adding a constant $c$ to the top and bottom of the fraction only increases the value of the fraction, and doesn’t violate the inequality. So we choose to add $K(z)-K(x)$ to the top and bottom of the right-hand side and again using the symmetry of information, we get exactly the required value.

If the right-hand side is greater than 1, then adding any constant to the top and bottom decreases the value of the fraction, but it still remains greater than 1. Since $K(x|y^*) \leq K(x)$ (a simple exercise), we see that the left-hand side is at most 1, and our same trick of adding $K(z) - K(x)$ works. $\square$

The proof of the universality theorem is considerably more elegant.

Proof of Theorem 2 (Universality): Let $f$ be any normalized distance function, and set $e = f(x,y)$. Suppose further that $K(x) \leq K(y)$.

Let us enumerate all strings $v$ such that $f(x,v) \leq e$. In particular, since $e = f(x,y)$, $y$ is included in this enumeration. By the density condition, the number of such strings is at most $2^{eK(x) + 1}$. The index of $y$ in this enumeration can be used as an effective description of $y$ when given $x$ as input. That is, there is a program which includes in its description the index of $y$ and outputs $y$ given $x$. Since the number of bits needed to describe the index of $y$ is at most $\log(2^{eK(x) + 1}) = eK(x) + 1$, we have

$\displaystyle K(y|x) \leq eK(x) + 1$

Again the symmetry of information lemma gives us $K(x|y^*) \leq K(y|x^*)$. And now

$\displaystyle d(x,y) = \frac{K(y|x^*)}{K(y)} \leq \frac{K(y|x) + O(1)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)}$

Since $K(x) \leq K(y)$, we can replace the denominator of the last expression with $K(x)$ (only increasing the fraction) to get $d(x,y) \leq e + O(1/K(x))$. But $e$ was just $f(x,y)$, so this completes the proof of this case.

In the case $K(y) < K(x)$, the proof is similar (enumerating the index of $x$ instead), and at the end we get

$\displaystyle d(x,y) = \frac{K(x|y^*)}{K(x)} \leq \frac{eK(y) + O(1)}{K(y)} = f(x,y) + O(1/K(y))$

The theorem is proved. $\square$

Why Normalized Distance Functions?

The practical implications of the two theorems are immense. What we’re saying is that if we can represent some feature of string similarity by a normalized distance function, then that feature will be captured automatically by the normalized information distance $d$. The researchers who discovered normalized information distance (and proved its universality) argue that in fact upper semi-computable normalized distance functions encapsulate all real-world metrics we would ever care about! Of course, there is still the problem that Kolmogorov complexity is uncomputable, but we can certainly come up with reasonable approximations (we will see precisely this in our next post).

And these same researchers have shown that approximations to $d$ do represent a good deal of universality in practice. They’ve applied the same idea to fields as varied as genome clustering, language clustering, and music clustering. We will of course investigate the applications for ourselves on this blog, but their results seem to apply to data mining in any field.

But still this raises the obvious question (which goes unaddressed in any research article this author has read): does every metric have a sensible interpretation (or modification) as a normalized distance function? That awkward density condition seems particularly suspect, and is at the core of this author’s argument that the answer is no.

Consider the following example. Let $f$ be a normalized distance function, and fix $e = 1$. The density condition says that for any $x$ we want, the number of strings which are within distance 1 of $x$ is bounded by $2^{K(x) + 1}$. In particular, this quantity is finite, so there can only be finitely many strings which are within distance 1 of $x$. But there are infinitely many strings, so this is a contradiction!

Even if we rule out this (arguably trivial) case of $e=1$, we still run into problems. Let $e = 1 - \varepsilon$ for any sufficiently small $\varepsilon > 0$. Then fix $x = 0$ (the string consisting of the single bit 0). The number of strings which are within distance $e$ of $x$ is bounded by $2^{eK(x) + 1} < 2^{K(x) + 1}$ is again finite (and quite small, since $K(0)$ is about as simple as it gets). In other words, there are only a finite number of strings that are not maximally dissimilar to $0$. But one can easily come up with an infinite number of strings which share something in common with $0$: just use $0^n$ for any $n$ you please. It is ludicrous to say that every metric should call $0$ as dissimilar to $0^n$ as the empty string is to a random string of a thousand bits.

In general, this author doesn’t find it likely that one can take any arbitrary $f(x,y)$ which is both symmetric and has values in $[0,1]$ and modify it to satisfy the density condition. Indeed, this author has yet to see any example of a natural normalized similarity metric. There is one which is a modification of Hamming distance, but it is relatively awkward and involves the Kolmogorov complexity of the strings involved. If the reader has any ideas to the contrary, please share them in the comments.

So it appears that the class of normalized distance functions is not as large as we might wish, and in light of this the universality theorem is not as impressive. On the other hand, there is no denying the success of applying the normalized information distance to complex real-world problems. Something profound is going on, but from this author’s viewpoint more theoretical work is needed to establish why.

Friendly Cousins of Normalized Information Distance

In practice we want to compute $K(x|y^*)$ in terms of quantities we can actually approximate. Due to the symmetry of information, we can rewrite the metric formula as

$\displaystyle d(x,y)=\frac{K(x,y) - \min(K(x), K(y))}{\max(K(x), K(y))}$

Indeed, since our main interpretation of $K(x)$ is as the size of the smallest “compressed version” of the string $x$, it would seem that we can approximate the function $K$ by using real-world compression algorithms. And for the $K(x,y)$ part, we recognize that (due to the need to specify a way to distinguish between the outputs $x,y$)

$K(x,y) \leq K(xy) + O(\log(\max(K(x), K(y))))$,

where $K(xy)$ is the Kolmogorov complexity of the concatenation of the two strings. So if we’re willing to forgive additive logarithmic sloppiness (technically, $O(\log(K(x))/K(x))$ sloppiness, which goes to zero asymptotocally), we can approximate normalized information distance as

$\displaystyle d(x,y) = \frac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}$

In the literature researchers will also simplify the metric by removing the “star” notation

$\displaystyle d(x,y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}$

Unfortunately these two things aren’t equivalent. As we saw in our “basic properties” of $K(x|y)$,

$K(x|y) \leq K(x|y^*) + O(\log(K(y)))$

Indeed, it is not the case that $K(x|y) = K(x|y^*)$. An easy counterexample is by trying to equate $K(K(x) | x) = K(K(x) | x^*)$. We have already proven that the right hand side is always constant, but the left hand side could not be. An exercise in Li & Vitanyi shows there is an infinite family of strings $x$ for which $K(K(x) | x) \geq \log(|x|)$.

And so these two metrics cannot be equal, but they are close. In fact, denoting the non-star version by $d_2$ and the regular version by $d_1$, we have $d_2(x,y) \leq d_1(x,y) + O(1)$. This changes the metric properties and the universality claim, because $O(1/K)$ precision is stronger than $O(1)$ precision. Indeed, the true constant is always less than 1 (e.g. when $K(y) > K(x)$ it is $K(y^*)/K(y)$), but this means the metric can potentially take values in the range $[0,2]$, which is edging further and further away from the notion of normalization we originally strove for.

Finally, the last example of a cousin metric is

$\displaystyle d_3(x,y) = \frac{K(x|y^*) + K(y|x^*)}{K(x,y)}$

We will leave it to the reader to verify this function again satisfies the metric inequalities (in the same way that the original normalized information distance does). On the other hand, it only satisfies universality up to a factor of 2. So while it still may give some nice results in practice (and it is easy to see how to approximate this), the first choice of normalized information distance was theoretically more precise.

Applications

We’ve just waded through a veritable bog of theory, but we’ve seen some big surprises along the way. Next time we’ll put these theoretical claims to the test by seeing how well we can cluster and classify data using the normalized information distance (and introducing as little domain knowledge as possible). Until then!