Featured Posts

The Fast Fourier Transform,
and a Noisy Signal
Eigenfaces
(Facial Recognition)
Elliptic Curves
and Cryptography
bezier-dog-image-small circle-wedge-sphere baby_programmers
Bezier Curves and Picasso Computing Homology Probably Approximately
Correct – A Formal Theory
of Learning
About these ads

Making Hybrid Images

monalisa

The Mona Lisa

Leonardo da Vinci’s Mona Lisa is one of the most famous paintings of all time. And there has always been a discussion around her enigmatic smile. He used a trademark Renaissance technique called sfumato, which involves many thin layers of glaze mixed with subtle pigments. The striking result is that when you look directly at Mona Lisa’s smile, it seems to disappear. But when you look at the background your peripherals see a smiling face.

One could spend decades studying the works of these masters from various perspectives, but if we want to hone in on the disappearing nature of that smile, mathematics can provide valuable insights. Indeed, though he may not have known the relationship between his work and da Vinci’s, hundreds of years later Salvador Dali did the artist’s equivalent of mathematically isolating the problem with his painting, “Gala Contemplating the Mediterranean Sea.”

gala-dali

Gala Contemplating the Mediterranean Sea (Salvador Dali, 1976)

Here you see a woman in the foreground, but step back quite far from the picture and there is a (more or less) clear image of Abraham Lincoln. Here the question of gaze is the blaring focus of the work. Now of course Dali and da Vinci weren’t scribbling down equations and computing integrals; their artistic expression was much less well-defined. But we the artistically challenged have tools of our own: mathematics, science, and programming.

In 2006 Aude Oliva, Antonio Torralba, and Philippe. G. Schyns used those tools to merge the distance of Dali and the faded smiles of da Vinci into one cohesive idea. In their 2006 paper they presented the notion of a “hybrid image,” presented below.

monalisas

The Mona Lisas of Science

If you look closely, you’ll see three women, each of which looks the teensiest bit strange, like they might be trying to suppress a smile, but none of them are smiling. Blur your eyes or step back a few meters, and they clearly look happy. The effect is quite dramatic. At the risk of being overly dramatic, these three women are literally modern day versions of Mona Lisa, the “Mona Lisas of Science,” if you will.

Another, perhaps more famous version of their technique, since it was more widely publicized, is their “Marilyn Einstein,” which up close is Albert Einstein and from far away is Marilyn Monroe.

marilyn-einstein

Marilyn Einstein

This one gets to the heart of the question of what the eye sees at close range versus long range. And it turns out that you can address this question (and create brilliant works of art like the ones above) with some basic Fourier analysis.

Intuitive Fourier analysis (and references)

The basic idea of Fourier analysis is the idea that smooth functions are hard to understand, and realization of how great it would be if we could decompose them into simpler pieces. Decomposing complex things into simpler parts is one of the main tools in all of mathematics, and Fourier analysis is one of the clearest examples of its application.

In particular, the things we care about are functions f(x) with specific properties I won’t detail here like “smoothness” and “finiteness.” And the building blocks are the complex exponential functions

\displaystyle e^{2 \pi i kx}

where k can be any integer. If you have done some linear algebra (and ignore this if you haven’t), then I can summarize the idea succinctly by saying the complex exponentials form an orthonormal basis for the vector space of square-integrable functions.

Back in colloquial language, what the Fourier theorem says is that any function of the kind we care about can be broken down into (perhaps infinitely many) pieces of this form called Fourier coefficients (I’m abusing the word “coefficient” here). The way it’s breaking down is also pleasingly simple: it’s a linear combination. Informally that means you’re just adding up all the complex exponentials with specific weights for each one. Mathematically, the conversion from the function to its Fourier coefficients is called the Fourier transform, and the set of all Fourier coefficients together is called the Fourier spectrum. So if you want to learn about your function f, or more importantly modify it in some way, you can inspect and modify its spectrum instead. The reason this is useful is that Fourier coefficients have very natural interpretations in sound and images, as we’ll see for the latter.

We wrote f(x) and the complex exponential as a function of one real variable, but you can do the same thing for two variables (or a hundred!). And, if you’re willing to do some abusing and ignore the complexness of complex numbers, then you can visualize “complex exponentials in two variables” as images of stripes whose orientation and thickness correspond to two parameters (i.e., the k in the offset equation becomes two coefficients). The video below shows how such complex exponentials can be used to build up an image of striking detail. The left frame shows which complex exponential is currently being added, and the right frame shows the layers all put together. I think the result is quite beautiful.

This just goes to show how powerful da Vinci’s idea of fine layering is: it’s as powerful as possible because it can create any image! 

Now for digital images like the one above, everything is finite. So rather than have an infinitely precise function and a corresponding infinite set of Fourier coefficients, you get a finite list of sampled values (pixels) and a corresponding grid of Fourier coefficients. But the important and beautiful theorem is, and I want to emphasize how groundbreakingly important this is:

If you give me an image (or any function!) I can compute the decomposition very efficiently.

And the same theorem lets you go the other way: if you give me the decomposition, I can compute the original function’s samples quite easily. The algorithm to do this is called the Fast Fourier transform, and if any piece of mathematics or computer science has a legitimate claim to changing the world, it’s the Fast Fourier transform. It’s hard to pinpoint specific applications, because the transform is so ubiquitous across science and engineering, but we definitely would not have cell phones, satellites, internet, or electronics anywhere near as small as we do without the Fourier transform and the ability to compute it quickly.

Constructing hybrid images is one particularly nice example of manipulating the Fourier spectrum of two images, and then combining them back into a single image. That’s what we’ll do now.

As a side note, by the nature of brevity, the discussion above is a big disservice to the mathematics involved. I summarized and abused in ways that mathematicians would object to. If you want to see a much better treatment of the material, this blog has a long series of posts developing Fourier transforms and their discrete analogues from scratch. See our four primers, which lead into the main content posts where we implement the Fast Fourier transform in Python and use it to apply digital watermarks to an image. Note that in those posts, as in this one, all of the materials and code used are posted on this blog’s Github page.

High and low frequencies

For images, interpreting ranges of Fourier coefficients is easy to do. You can imagine the coefficients lying on a grid in the plane like so:

sherlock-spectrum

Each dot in this grid corresponds to how “intense” the Fourier coefficient is. That is, it’s the magnitude of the (complex) coefficient of the corresponding complex exponential. Now the points that are closer to the origin correspond informally to the broad, smooth changes in the image. These are called “low frequency” coefficients. And points that are further away correspond to sharp changes and edges, and are likewise called “high frequency” components. So the if you wanted to “hybridize” two images, you’d pick ones with complementary intensities in these regions. That’s why Einstein (with all his wiry hair and wrinkles) and Monroe (with smooth features) are such good candidates. That’s also why, when we layered the Fourier components one by one in the video from earlier, we see the fuzzy shapes emerge before the fine details.

Moreover, we can “extract” the high frequency Fourier components by simply removing the low frequency ones. It’s a bit more complicated than that, since you want the transition from “something” to “nothing” to be smooth in sone sense. A proper discussion of this would go into sampling and the Nyquist frequency, but that’s beyond the scope of this post. Rather, we’ll just define a family of “filtering functions” without motivation and observe that they work well.

Definition: The Gaussian filter function with variance \sigma and center (a, b) is the function

\displaystyle g(x,y) = e^{-\frac{(x - a)^2 + (y - b)^2}{2 \sigma^2}}

It looks like this

image credit Wikipedia

image credit Wikipedia

In particular, at zero the function is 1 and it gradually drops to zero as you get farther away. The parameter \sigma controls the rate at which it vanishes, and in the picture above the center is set to (0,0).

Now what we’ll do is take our image, compute its spectrum, and multiply coordinatewise with a certain Gaussian function. If we’re trying to get rid of high-frequency components (called a “low-pass filter” because it lets the low frequencies through), we can just multiply the Fourier coefficients directly by the filter values g(x,y), and if we’re doing a “high-pass filter” we multiply by 1 - g(x,y).

Before we get to the code, here’s an example of a low-pass filter. First, take this image of Marilyn Monroe

marilyn

Now compute its Fourier transform

dft

Apply the low-pass filter

filtered-dft

And reverse the Fourier transform to get an image

low-passed-marilyn

In fact, this is a common operation in programs like photoshop for blurring an image (it’s called a Gaussian blur for obvious reasons). Here’s the python code to do this. You can download it along with all of the other resources used in making this post on this blog’s Github page.

import numpy
from numpy.fft import fft2, ifft2, fftshift, ifftshift
from scipy import misc
from scipy import ndimage
import math

def makeGaussianFilter(numRows, numCols, sigma, highPass=True):
   centerI = int(numRows/2) + 1 if numRows % 2 == 1 else int(numRows/2)
   centerJ = int(numCols/2) + 1 if numCols % 2 == 1 else int(numCols/2)

   def gaussian(i,j):
      coefficient = math.exp(-1.0 * ((i - centerI)**2 + (j - centerJ)**2) / (2 * sigma**2))
      return 1 - coefficient if highPass else coefficient

   return numpy.array([[gaussian(i,j) for j in range(numCols)] for i in range(numRows)])

def filterDFT(imageMatrix, filterMatrix):
   shiftedDFT = fftshift(fft2(imageMatrix))
   filteredDFT = shiftedDFT * filterMatrix
   return ifft2(ifftshift(filteredDFT))

def lowPass(imageMatrix, sigma):
   n,m = imageMatrix.shape
   return filterDFT(imageMatrix, makeGaussianFilter(n, m, sigma, highPass=False))

def highPass(imageMatrix, sigma):
   n,m = imageMatrix.shape
   return filterDFT(imageMatrix, makeGaussianFilter(n, m, sigma, highPass=True))

if __name__ == "__main__":
   marilyn = ndimage.imread("marilyn.png", flatten=True)
   lowPassedMarilyn = lowPass(marilyn, 20)
   misc.imsave("low-passed-marilyn.png", numpy.real(lowPassedMarilyn))

The first function samples the values from a Gaussian function with the specified parameters, discretizing the function and storing the values in a matrix. Then the filterDFT function applies the filter by doing coordinatewise multiplication (note these are all numpy arrays). We can do the same thing with a high-pass filter, producing the edgy image below

high-passed-marilyn

And if we compute the average of these two images, we basically get back to the original.

sum-of-marilyns

So the only difference between this and a hybrid image is that you take the low-passed part of one image and the high-passed part of another. Then the art is in balancing the parameters so as to make the averaged image look right. Indeed, with the following picture of Einstein and the above shot of Monroe, we can get a pretty good recreation of the Oliva-Torralba-Schyns piece. I think with more tinkering it could be even better (I did barely any centering/aligning/resizing to the original images).

Albert Einstein, Marilyn Monroe, and their hybridization.

Albert Einstein, Marilyn Monroe, and their hybridization.

And here’s the code for it

def hybridImage(highFreqImg, lowFreqImg, sigmaHigh, sigmaLow):
   highPassed = highPass(highFreqImg, sigmaHigh)
   lowPassed = lowPass(lowFreqImg, sigmaLow)

   return highPassed + lowPassed

Interestingly enough, doing it in reverse doesn’t give quite as pleasing results, but it still technically works. So there’s something particularly important that the high-passed image does have a lot of high-frequency components, and vice versa for the low pass.

backwards

You can see some of the other hybrid images Oliva et al constructed over at their web gallery.

Next Steps

How can we take this idea further? There are a few avenues I can think of. The most obvious one would be to see how this extends to video. Could one come up with generic parameters so that when two videos are hybridized (frame by frame, using this technique) it is only easy to see one at close distance? Or else, could we apply a three-dimensional transform to a video and modify that in some principled way? I think one would not likely find anything astounding, but who knows?

Second would be to look at the many other transforms we have at our disposal. How does manipulating the spectra of these transforms affect the original image, and can you make images that are hybridized in senses other than this one?

And finally, can we bring this idea down in dimension to work with one-dimensional signals? In particular, can we hybridize music? It could usher in a new generation of mashup songs that sound different depending on whether you wear earmuffs :)

Until next time!

Occam’s Razor and PAC-learning

So far our discussion of learning theory has been seeing the definition of PAC-learningtinkering with it, and seeing simple examples of learnable concept classes. We’ve said that our real interest is in proving big theorems about what big classes of problems can and can’t be learned. One major tool for doing this with PAC is the concept of VC-dimension, but to set the stage we’re going to prove a simpler theorem that gives a nice picture of PAC-learning when your hypothesis class is small. In short, the theorem we’ll prove says that if you have a finite set of hypotheses to work with, and you can always find a hypothesis that’s consistent with the data you’ve seen, then you can learn efficiently. It’s obvious, but we want to quantify exactly how much data you need to ensure low error. This will also give us some concrete mathematical justification for philosophical claims about simplicity, and the theorems won’t change much when we generalize to VC-dimension in a future post.

The Chernoff bound

One tool we will need in this post, which shows up all across learning theory, is the Chernoff-Hoeffding bound. We covered this famous inequality in detail previously on this blog, but the part of that post we need is the following theorem that says, informally, that if you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation. Here’s the slightly simplified version we’ll use:

Theorem: Let X_1, \dots, X_m be independent random variables whose values are in the range [0,1]. Call \mu_i = \mathbf{E}[X_i], X = \sum_i X_i, and \mu = \mathbf{E}[X] = \sum_i \mu_i. Then for all t > 0,

\displaystyle \Pr(|X-\mu| > t) \leq 2e^{-2t^2 / m}

One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed. This is important because in PAC we need guarantees that hold for any distribution generating data. Indeed, in our case the random variables above will be individual examples drawn from the distribution generating the data. We’ll be estimating the probability that our hypothesis has error deviating more than \varepsilon, and we’ll want to bound this by \delta, as in the definition of PAC-learning. Since the amount of deviation (error) and the number of samples (m) both occur in the exponent, the trick is in balancing the two values to get what we want.

Realizability and finite hypothesis classes

Let’s recall the PAC model once more. We have a distribution D generating labeled examples (x, c(x)), where c is an unknown function coming from some concept class C. Our algorithm can draw a polynomial number of these examples, and it must produce a hypothesis h from some hypothesis class H (which may or may not contain c). The guarantee we need is that, for any \delta, \varepsilon > 0, the algorithm produces a hypothesis whose error on D is at most \varepsilon, and this event happens with probability at least 1-\delta. All of these probabilities are taken over the randomness in the algorithm’s choices and the distribution D, and it has to work no matter what the distribution D is.

Let’s introduce some simplifications. First, we’ll assume that the hypothesis and concept classes H and C are finite. Second, we’ll assume that C \subset H, so that you can actually hope to find a hypothesis of zero error. This is called realizability. Later we’ll relax these first two assumptions, but they make the analysis a bit cleaner. Finally, we’ll assume that we have an algorithm which, when given labeled examples, can find in polynomial time a hypothesis h \in H that is consistent with every example.

These assumptions give a trivial learning algorithm: draw a bunch of examples and output any consistent hypothesis. The question is, how many examples do we need to guarantee that the hypothesis we find has the prescribed generalization error? It will certainly grow with 1 / \varepsilon, but we need to ensure it will only grow polynomially fast in this parameter. Indeed, realizability is such a strong assumption that we can prove a polynomial bound using even more basic probability theory than the Chernoff bound.

Theorem: A algorithm that efficiently finds a consistent hypothesis will PAC-learn any finite concept class provided it has at least m samples, where

\displaystyle m \geq \frac{1}{\varepsilon} \left ( \log |H| + \log \left ( \frac{1}{\delta} \right ) \right )

Proof. All we need to do is bound the probability that a bad hypothesis (one with error more than \varepsilon) is consistent with the given data. Now fix D, c, \delta, \varepsilon, and draw m examples and let h be any hypothesis that is consistent with the drawn examples. Suppose that the bad thing happens, that \Pr_D(h(x) \neq c(x)) > \varepsilon.

Because the examples are all drawn independently from D, the chance that all m examples are consistent with h is

\displaystyle (1 - \Pr_{x \sim D}(h(x) \neq c(x)))^m < (1 - \varepsilon)^m

What we’re saying here is, the probability that a specific bad hypothesis is actually consistent with your drawn examples is exponentially small in the error tolerance. So if we apply the union bound, the probability that some hypothesis you could produce is bad is at most (1 - \varepsilon)^m S, where S is the number of hypotheses the algorithm might produce.

A crude upper bound on the number of hypotheses you could produce is just the total number of hypotheses, |H|. Even cruder, let’s use the inequality (1 - x) < e^{-x} to give the bound

\displaystyle (1 - \varepsilon)^m |H| < e^{-\varepsilon m} |H|

Now we want to make sure that this probability, the probability of choosing a high-error (yet consistent) hypothesis, is at most \delta. So we can set the above quantity less than \delta and solve for m:

\displaystyle e^{-\varepsilon m} |H| \leq \delta

Taking logs and solving for m gives the desired bound.

\square

An obvious objection is: what if you aren’t working with a hypothesis class where you can guarantee that you’ll find a consistent hypothesis? Well, in that case we’ll need to inspect the definition of PAC again and reevaluate our measures of error. It turns out we’ll get a similar theorem as above, but with the stipulation that we’re only achieving error within epsilon of the error of the best available hypothesis.

But before we go on, this theorem has some deep philosophical interpretations. In particular, suppose that, before drawing your data, you could choose to work with one of two finite hypothesis classes H_1, H_2, with |H_1| > |H_2|. If you can find a consistent hypothesis no matter which hypothesis class you use, then this theorem says that your generalization guarantees are much stronger if you start with the smaller hypothesis class.

In other words, all else being equal, the smaller set of hypotheses is better. For this reason, the theorem is sometimes called the “Occam’s Razor” theorem. We’ll see a generalization of this theorem in the next section.

Unrealizability and an extra epsilon

Now suppose that $H$ doesn’t contain any hypotheses with error less than \varepsilon. What can we hope to do in this case? One thing is that we can hope to find a hypothesis whose error is within \varepsilon of the minimal error of any hypothesis in H. Moreover, we might not have any consistent hypotheses for some data samples! So rather than require an algorithm to produce an h \in H that is perfectly consistent with the data, we just need it to produce a hypothesis that has minimal empirical error, in the sense that it is as close to consistent as the best hypothesis of h on the data you happened to draw. It seems like such a strategy would find you a hypothesis that’s close to the best one in H, but we need to prove it and determine how many samples we need to draw to succeed.

So let’s make some definitions to codify this. For a given hypothesis, call \textup{err}(h) the true error of h on the distribution D. Our assumption is that there may be no hypotheses in H with \textup{err}(h) = 0. Next we’ll call the empirical error \hat{\textup{err}}(h).

Definition: We say a concept class C is agnostically learnable using the hypothesis class H if for all c \in C and all distributions D (and all \varepsilon, \delta > 0), there is a learning algorithm A which produces a hypothesis h that with probability at least 1 - \delta satisfies

\displaystyle \text{err}(h) \leq \min_{h' \in H} \text{err}(h') + \varepsilon

and everything runs in the same sort of polynomial time as for vanilla PAC-learning. This is called the agnostic setting or the unrealizable setting, in the sense that we may not be able to find a hypothesis with perfect empirical error.

We seek to prove that all concept classes are agnostically learnable with a finite hypothesis class, provided you have an algorithm that can minimize empirical error. But actually we’ll prove something stronger.

Theorem: Let H be a finite hypothesis class and m the number of samples drawn. Then for any \delta > 0, with probability 1-\delta the following holds:

\displaystyle \forall h \in H, \hat{\text{err}}(h) \leq \text{err}(h) + \sqrt{\frac{\log |H| + \log(2 / \delta)}{2m}}

In other words, we can precisely quantify how the empirical error converges to the true error as the number of samples grows. But this holds for all hypotheses in H, so this provides a uniform bound of the difference between true and empirical error for the entire hypothesis class.

Proving this requires the Chernoff bound. Fix a single hypothesis h \in H. If you draw an example x, call Z the random variable which is 1 when h(x) \neq c(x), and 0 otherwise. So if you draw m samples and call the i-th variable Z_i, the empirical error of the hypothesis is \frac{1}{m}\sum_i Z_i. Moreover, the actual error is the expectation of this random variable since \mathbf{E}[1/m \sum_i Z_i] = Z.

So what we’re asking is the probability that the empirical error deviates from the true error by a lot. Let’s call “a lot” some parameter \varepsilon/2 > 0 (the reason for dividing by two will become clear in the corollary to the theorem). Then plugging things into the Chernoff-Hoeffding bound gives a bound on the probability of the “bad event,” that the empirical error deviates too much.

\displaystyle \Pr[|\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2e^{-\frac{\varepsilon^2m}{2}}

Now to get a bound on the probability that some hypothesis is bad, we apply the union bound and use the fact that |H| is finite to get

\displaystyle \Pr[|\hat{\text{err}}(h) - \text{err}(h)| > \varepsilon / 2] < 2|H|e^{-\frac{\varepsilon^2m}{2}}

Now say we want to bound this probability by \delta. We set 2|H|e^{-\varepsilon^2m/2} \leq \delta, solve for m, and get

\displaystyle m \geq \frac{2}{\varepsilon^2}\left ( \log |H| + \log \frac{2}{\delta} \right )

This gives us a concrete quantification of the tradeoff between m, \varepsilon, \delta, and |H|. Indeed, if we pick m to be this large, then solving for \varepsilon / 2 gives the exact inequality from the theorem.

\square

Now we know that if we pick enough samples (polynomially many in all the parameters), and our algorithm can find a hypothesis h of minimal empirical error, then we get the following corollary:

Corollary: For any \varepsilon, \delta > 0, the algorithm that draws m \geq \frac{2}{\varepsilon^2}(\log |H| + \log(2/ \delta)) examples and finds any hypothesis of minimal empirical error will, with probability at least 1-\delta, produce a hypothesis that is within \varepsilon of the best hypothesis in H.

Proof. By the previous theorem, with the desired probability, for all h \in H we have |\hat{\text{err}}(h) - \text{err}(h)| < \varepsilon/2. Call g = \min_{h' \in H} \text{err}(h'). Then because the empirical error of h is also minimal, we have |\hat{\text{err}}(g) - \text{err}(h)| < \varepsilon / 2. And using the previous theorem again and the triangle inequality, we get |\text{err}(g) - \text{err}(h)| < 2 \varepsilon / 2 = \varepsilon. In words, the true error of the algorithm’s hypothesis is close to the error of the best hypothesis, as desired.

\square

Next time

Both of these theorems tell us something about the generalization guarantees for learning with hypothesis classes of a certain size. But this isn’t exactly the most reasonable measure of the “complexity” of a family of hypotheses. For example, one could have a hypothesis class with a billion intervals on \mathbb{R} (say you’re trying to learn intervals, or thresholds, or something easy), and the guarantees we proved in this post are nowhere near optimal.

So the question is: say you have a potentially infinite class of hypotheses, but the hypotheses are all “simple” in some way. First, what is the right notion of simplicity? And second, how can you get guarantees based on that analogous to these? We’ll discuss this next time when we define the VC-dimension.

Until then!

A Rook Game

Problem: Two players take turns moving a rook on an 8×8 chessboard. The rook is only allowed to move south or west (but not both in a single turn), and may move any number of squares in the chosen direction on a turn. The loser is the player who first cannot move the rook. What is the optimal play for any starting position?

rook-board

Solution: Take advantage of the symmetry of the board. If the rook is not on the diagonal, the optimal strategy is to move it to the diagonal. Then when the other player moves it off, your next move is to move it back to the diagonal. If your opponent starts their turn with the rook always on the diagonal, then you will never lose, and by the symmetry of the board you can always move the rook back to the diagonal. This provides an optimal algorithm for either player. In particular, if the rook starts on a square that is not on the diagonal, then player 1 can guarantee a win, and otherwise player 2 can.

Symmetry is one of the most powerful tools in all of mathematics, and this is a simple albeit illustrative example of its usage.

When Greedy Algorithms are Perfect: the Matroid

Greedy algorithms are by far one of the easiest and most well-understood algorithmic techniques. There is a wealth of variations, but at its core the greedy algorithm optimizes something using the natural rule, “pick what looks best” at any step. So a greedy routing algorithm would say to a routing problem: “You want to visit all these locations with minimum travel time? Let’s start by going to the closest one. And from there to the next closest one. And so on.”

Because greedy algorithms are so simple, researchers have naturally made a big effort to understand their performance. Under what conditions will they actually solve the problem we’re trying to solve, or at least get close? In a previous post we gave some easy-to-state conditions under which greedy gives a good approximation, but the obvious question remains: can we characterize when greedy algorithms give an optimal solution to a problem?

The answer is yes, and the framework that enables us to do this is called a matroid. That is, if we can phrase the problem we’re trying to solve as a matroid, then the greedy algorithm is guaranteed to be optimal. Let’s start with an example when greedy is provably optimal: the minimum spanning tree problem. Throughout the article we’ll assume the reader is familiar with the very basics of linear algebra and graph theory (though we’ll remind ourselves what a minimum spanning tree is shortly). For a refresher, this blog has primers on both subjects. But first, some history.

History

Matroids were first introduced by Hassler Whitney in 1935, and independently discovered a little later by B.L. van der Waerden (a big name in combinatorics). They were both interested in devising a general description of “independence,” the properties of which are strikingly similar when specified in linear algebra and graph theory. Since then the study of matroids has blossomed into a large and beautiful theory, one part of which is the characterization of the greedy algorithm: greedy is optimal on a problem if and only if the problem can be represented as a matroid. Mathematicians have also characterized which matroids can be modeled as spanning trees of graphs (we will see this momentarily). As such, matroids have become a standard topic in the theory and practice of algorithms.

Minimum Spanning Trees

It is often natural in an undirected graph G = (V,E) to find a connected subset of edges that touch every vertex. As an example, if you’re working on a power network you might want to identify a “backbone” of the network so that you can use the backbone to cheaply travel from any node to any other node. Similarly, in a routing network (like the internet) it costs a lot of money to lay down cable, it’s in the interest of the internet service providers to design analogous backbones into their infrastructure.

A minimal subset of edges in a backbone like this is guaranteed to form a tree. This is simply because if you have a cycle in your subgraph then removing any edge on that cycle doesn’t break connectivity or the fact that you can get from any vertex to any other (and trees are the maximal subgraphs without cycles). As such, these “backbones” are called spanning trees. “Span” here means that you can get from any vertex to any other vertex, and it suggests the connection to linear algebra that we’ll describe later, and it’s a simple property of a tree that there is a unique path between any two vertices in the tree.

An example of a spanning tree

An example of a spanning tree

When your edges e \in E have nonnegative weights w_e \in \mathbb{R}^{\geq 0}, we can further ask to find a minimum cost spanning tree. The cost of a spanning tree T is just the sum of its edges, and it’s important enough of a definition to offset.

Definition: A minimum spanning tree T of a weighted graph G (with weights w_e \geq 0 for e \in E) is a spanning tree which minimizes the quantity

w(T) = \sum_{e \in T} w_e

There are a lot of algorithms to find minimal spanning trees, but one that will lead us to matroids is Kruskal’s algorithm. It’s quite simple. We’ll maintain a forest F in G, which is just a subgraph consisting of a bunch of trees that may or may not be connected. At the beginning F is just all the vertices with no edges. And then at each step we add to F the edge e whose weight is smallest and also does not introduce any cycles into F. If the input graph G is connected then this will always produce a minimal spanning tree.

Theorem: Kruskal’s algorithm produces a minimal spanning tree of a connected graph.

Proof. Call F_t the forest produced at step t of the algorithm. Then F_0 is the set of all vertices of G and F_{n-1} is the final forest output by Kruskal’s (as a quick exercise, prove all spanning trees on n vertices have n-1 edges, so we will stop after n-1 rounds). It’s clear that F_{n-1} is a tree because the algorithm guarantees no F_i will have a cycle. And any tree with n-1 edges is necessarily a spanning tree, because if some vertex were left out then there would be n-1 edges on a subgraph of n-1 vertices, necessarily causing a cycle somewhere in that subgraph.

Now we’ll prove that F_{n-1} has minimal cost. We’ll prove this in a similar manner to the general proof for matroids. Indeed, say you had a tree T whose cost is strictly less than that of F_{n-1} (we can also suppose that T is minimal, but this is not necessary). Pick the minimal weight edge e \in T that is not in F_{n-1}. Adding e to F_{n-1} introduces a unique cycle C in F_{n-1}. This cycle has some strange properties. First, e has the highest cost of any edge on C. For otherwise, Kruskal’s algorithm would have chosen it before the heavier weight edges. Second, there is another edge in C that’s not in T (because T was a tree it can’t have the entire cycle). Call such an edge e'. Now we can remove e' from F_{n-1} and add e. This can only increase the total cost of F_{n-1}, but this transformation produces a tree with one more edge in common with T than before. This contradicts that T had strictly lower weight than F_{n-1}, because repeating the process we described would eventually transform F_{n-1} into T exactly, while only increasing the total cost.

\square

Just to recap, we defined sets of edges to be “good” if they did not contain a cycle, and a spanning tree is a maximal set of edges with this property. In this scenario, the greedy algorithm performed optimally at finding a spanning tree with minimal total cost.

Columns of Matrices

Now let’s consider a different kind of problem. Say I give you a matrix like this one:

\displaystyle A = \begin{pmatrix} 2 & 0 & 1 & -1 & 0 \\ 0 & -4 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 7 \end{pmatrix}

In the standard interpretation of linear algebra, this matrix represents a linear function f from one vector space V to another W, with the basis (v_1, \dots, v_5) of V being represented by columns and the basis (w_1, w_2, w_3) of W being represented by the rows. Column j tells you how to write f(v_j) as a linear combination of the w_i, and in so doing uniquely defines f.

Now one thing we want to calculate is the rank of this matrix. That is, what is the dimension of the image of V under f? By linear algebraic arguments we know that this is equivalent to asking “how many linearly independent columns of A can we find”? An interesting consequence is that if you have two sets of columns that are both linearly independent and maximally so (adding any other column to either set would necessarily introduce a dependence in that set), then these two sets have the same size. This is part of why the rank of a matrix is well-defined.

If we were to give the columns of A costs, then we could ask about finding the minimal-cost maximally-independent column set. It sounds like a mouthful, but it’s exactly the same idea as with spanning trees: we want a set of vectors that spans the whole column space of A, but contains no “cycles” (linearly dependent combinations), and we want the cheapest such set.

So we have two kinds of “independence systems” that seem to be related. One interesting question we can ask is whether these kinds of independence systems are “the same” in a reasonable way. Hardcore readers of this blog may see the connection quite quickly. For any graph G = (V,E), there is a natural linear map from E to V, so that a linear dependence among the columns (edges) corresponds to a cycle in G. This map is called the incidence matrix by combinatorialists and the first boundary map by topologists.

The map is easy to construct: for each edge e = (v_i,v_j) you add a column with a 1 in the j-th row and a -1 in the i-th row. Then taking a sum of edges gives you zero if and only if the edges form a cycle. So we can think of a set of edges as “independent” if they don’t contain a cycle. It’s a little bit less general than independence over \mathbb{R}, but you can make it exactly the same kind of independence if you change your field from real numbers to \mathbb{Z}/2\mathbb{Z}. We won’t do this because it will detract from our end goal (to analyze greedy algorithms in realistic settings), but for further reading this survey of Oxley assumes that perspective.

So with the recognition of how similar these notions of independence are, we are ready to define matroids.

The Matroid

So far we’ve seen two kinds of independence: “sets of edges with no cycles” (also called forests) and “sets of linearly independent vectors.” Both of these share two trivial properties: there are always nonempty independent sets, and every subset of an independent set is independent. We will call any family of subsets with this property an independence system.

Definition: Let X be a finite set. An independence system over X is a family \mathscr{I} of subsets of X with the following two properties.

  1. \mathscr{I} is nonempty.
  2. If I \in \mathscr{I}, then so is every subset of I.

This is too general to characterize greedy algorithms, so we need one more property shared by our examples. There are a few things we do, but here’s one nice property that turns out to be enough.

Definition: A matroid M = (X, \mathscr{I}) is a set X and an independence system \mathscr{I} over X with the following property:

If A, B are in \mathscr{I} with |A| = |B| + 1, then there is an element x \in A \setminus B such that B \cup \{ a \} \in \mathscr{I}.

In other words, this property says if I have an independent set that is not maximally independent, I can grow the set by adding some suitably-chosen element from a larger independent set. We’ll call this the extension property. For a warmup exercise, let’s prove that the extension property is equivalent to the following (assuming the other properties of a matroid):

For every subset Y \subset X, all maximal independent sets contained in Y have equal size.

Proof. For one direction, if you have two maximal sets A, B \subset Y \subset X that are not the same size (say A is bigger), then you can take any subset of A whose size is exactly |B| + 1, and use the extension property to make B larger, a contradiction. For the other direction, say that I know all maximal independent sets of any Y \subset X have the same size, and you give me A, B \subset X. I need to find an a \in A \setminus B that I can add to B and keep it independent. What I do is take the subset Y = A \cup B. Now the sizes of A, B don’t change, but B can’t be maximal inside Y because it’s smaller than A (A might not be maximal either, but it’s still independent). And the only way to extend B is by adding something from A, as desired.

\square

So we can use the extension property and the cardinality property interchangeably when talking about matroids. Continuing to connect matroid language to linear algebra and graph theory, the maximal independent sets of a matroid are called bases, the size of any basis is the rank of the matroid, and the minimal dependent sets are called circuits. In fact, you can characterize matroids in terms of the properties of their circuits, which are dual to the properties of bases (and hence all independent sets) in a very concrete sense.

But while you could spend all day characterizing the many kinds of matroids and comatroids out there, we are still faced with the task of seeing how the greedy algorithm performs on a matroid. That is, suppose that your matroid M = (X, \mathscr{I}) has a nonnegative real number w(x) associated with each x \in X. And suppose we had a black-box function to determine if a given set S \subset X is independent. Then the greedy algorithm maintains a set B, and at every step adds a minimum weight element that maintains the independence of B. If we measure the cost of a subset by the sum of the weights of its elements, then the question is whether the greedy algorithm finds a minimum weight basis of the matroid.

The answer is even better than yes. In fact, the answer is that the greedy algorithm performs perfectly if and only if the problem is a matroid! More rigorously,

Theorem: Suppose that M = (X, \mathscr{I}) is an independence system, and that we have a black-box algorithm to determine whether a given set is independent. Define the greedy algorithm to iteratively adds the cheapest element of X that maintains independence. Then the greedy algorithm produces a maximally independent set S of minimal cost for every nonnegative cost function on X, if and only if M is a matroid.

It’s clear that the algorithm will produce a set that is maximally independent. The only question is whether what it produces has minimum weight among all maximally independent sets. We’ll break the theorem into the two directions of the “if and only if”:

Part 1: If M is a matroid, then greedy works perfectly no matter the cost function.
Part 2: If greedy works perfectly for every cost function, then M is a matroid.

Proof of Part 1.

Call the cost function w : X \to \mathbb{R}^{\geq 0}, and suppose that the greedy algorithm picks elements B = \{ x_1, x_2, \dots, x_r \} (in that order). It’s easy to see that w(x_1) \leq w(x_2) \leq \dots \leq w(x_r). Now if you give me any list of r independent elements y_1, y_2, \dots, y_r \in X that has w(y_1) \leq \dots \leq w(y_r), I claim that w(x_i) \leq w(y_i) for all i. This proves what we want, because if there were a basis of size r with smaller weight, sorting its elements by weight would give a list contradicting this claim.

To prove the claim, suppose to the contrary that it were false, and for some k we have w(x_k) > w(y_k). Moreover, pick the smallest k for which this is true. Note k > 1, and so we can look at the special sets S = \{ x_1, \dots, x_{k-1} \} and T = \{ y_1, \dots, y_k \}. Now |T| = |S|+1, so by the matroid property there is some j between 1 and r so that S \cup \{ y_j \} is an independent set (and y_j is not in S). But then w(y_j) \leq w(y_k) < w(x_k), and so the greedy algorithm would have picked y_j before it picks x_k (and the strict inequality means they’re different elements). This contradicts how the greedy algorithm runs, and hence proves the claim.

Proof of Part 2.

We’ll prove this contrapositively as follows. Suppose we have our independence system and it doesn’t satisfy the last matroid condition. Then we’ll construct a special weight function that causes the greedy algorithm to fail. So let A,B be independent sets with |A| = |B| + 1, but for every a \in A \setminus B adding a to B never gives you an independent set.

Now what we’ll do is define our weight function so that the greedy algorithm picks the elements we want in the order we want (roughly). In particular, we’ll assign all elements of A \cap B a tiny weight we’ll call w_1. For elements of B - A we’ll use w_2, and for A - B we’ll use w_3, with w_4 for everything else. In a more compact notation:

CodeCogsEqn

We need two things for this weight function to screw up the greedy algorithm. The first is that w_1 < w_2 < w_3 < w_4, so that greedy picks the elements in the order we want. Note that this means it’ll first pick all of A \cap B, and then all of B - A, and by assumption it won’t be able to pick anything from A - B, but since B is assumed to be non-maximal, we have to pick at least one element from X - (A \cup B) and pay w_4 for it.

So the second thing we want is that the cost of doing greedy is worse than picking any maximally independent set that contains A (and we know that there has to be some maximal independent set containing A). In other words, if we call m the size of a maximally independent set, we want

\displaystyle |A \cap B| w_1 + |B-A|w_2 + (m - |B|)w_4 > |A \cap B|w_1 + |A-B|w_3 + (m-|A|)w_4

This can be rearranged (using the fact that |A| = |B|+1) to

\displaystyle w_4 > |A-B|w_3 - |B-A|w_2

The point here is that the greedy picks too many elements of weight w_4, since if we were to start by taking all of A (instead of all of B), then we could get by with one fewer. That might not be optimal, but it’s better than greedy and that’s enough for the proof.

So we just need to make w_4 large enough to make this inequality hold, while still maintaining w_2 < w_3. There are probably many ways to do this, and here’s one. Pick some 0 < \varepsilon < 1, and set

settings

It’s trivial that w_1 < w_2 and w_3 < w_4. For the rest we need some observations. First, the fact that |A-B| = |B-A| + 1 implies that w_2 < w_3. Second, both |A-B| and |B-A| are nonempty, since otherwise the second property of independence systems would contradict our assumption that augmenting B with elements of A breaks independence. Using this, we can divide by these quantities to get

\displaystyle w_4 = 2 > 1 = \frac{|A-B|(1 + \varepsilon)}{|A-B|} - \frac{|B-A|\varepsilon}{|B-A|}

This proves the claim and finishes the proof.

\square

As a side note, we proved everything here with respect to minimizing the sum of the weights, but one can prove an identical theorem for maximization. The only part that’s really different is picking the clever weight function in part 2. In fact, you can convert between the two by defining a new weight function that subtracts the old weights from some fixed number N that is larger than any of the original weights. So these two problems really are the same thing.

This is pretty amazing! So if you can prove your problem is a matroid then you have an awesome algorithm automatically. And if you run the greedy algorithm for fun and it seems like it works all the time, then that may be hinting that your problem is a matroid. This is one of the best situations one could possibly hope for.

But as usual, there are a few caveats to consider. They are both related to efficiency. The first is the black box algorithm for determining if a set is independent. In a problem like minimum spanning tree or finding independent columns of a matrix, there are polynomial time algorithms for determining independence. These two can both be done, for example, with Gaussian elimination. But there’s nothing to stop our favorite matroid from requiring an exponential amount of time to check if a set is independent. This makes greedy all but useless, since we need to check for independence many times in every round.

Another, perhaps subtler, issue is that the size of the ground set X might be exponentially larger than the rank of the matroid. In other words, at every step our greedy algorithm needs to find a new element to add to the set it’s building up. But there could be such a huge ocean of candidates, all but a few of which break independence. In practice an algorithm might be working with X implicitly, so we could still hope to solve the problem if we had enough knowledge to speed up the search for a new element.

There are still other concerns. For example, a naive approach to implementing greedy takes quadratic time, since you may have to look through every element of X to find the minimum-cost guy to add. What if you just have to have faster runtime than O(n^2)? You can still be interested in finding more efficient algorithms that still perform perfectly, and to the best of my knowledge there’s nothing that says that greedy is the only exact algorithm for your favorite matroid. And then there are models where you don’t have direct/random access to the input, and lots of other ways that you can improve on greedy. But those stories are for another time.

Until then!