A Proofless Introduction to Information Theory

There are two basic problems in information theory that are very easy to explain. Two people, Alice and Bob, want to communicate over a digital channel over some long period of time, and they know the probability that certain messages will be sent ahead of time. For example, English language sentences are more likely than gibberish, and “Hi” is much more likely than “asphyxiation.” The problems are:

  1. Say communication is very expensive. Then the problem is to come up with an encoding scheme for the messages which minimizes the expected length of an encoded message and guarantees the ability to unambiguously decode a message. This is called the noiseless coding problem.
  2. Say communication is not expensive, but error prone. In particular, each bit i of your message is erroneously flipped with some known probably p, and all the errors are independent. Then the question is, how can one encode their messages to as to guarantee (with high probability) the ability to decode any sent message? This is called the noisy coding problem.

There are actually many models of “communication with noise” that generalize (2), such as models based on Markov chains. We are not going to cover them here.

Here is a simple example for the noiseless problem. Say you are just sending binary digits as your messages, and you know that the string “00000000” (eight zeros) occurs half the time, and all other eight-bit strings occur equally likely in the other half. It would make sense, then, to encode the “eight zeros” string as a 0, and prefix all other strings with a 1 to distinguish them from zero. You would save on average 7 \cdot 1/2 + (-1) \cdot 1/2 = 3 bits in every message.

One amazing thing about these two problems is that they were posed and solved in the same paper by Claude Shannon in 1948. One byproduct of his work was the notion of entropy, which in this context measures the “information content” of a message, or the expected “compressibility” of a single bit under the best encoding. For the extremely dedicated reader of this blog, note this differs from Kolmogorov complexity in that we’re not analyzing the compressibility of a string by itself, but rather when compared to a distribution. So really we should think of (the domain of) the distribution as being compressed, not the string.

Claude Shannon. Image credit: Wikipedia

Entropy and noiseless encoding

Before we can state Shannon’s theorems we have to define entropy.

Definition: Suppose D is a distribution on a finite set X, and I’ll use D(x) to denote the probability of drawing x from D. The entropy of D, denoted H(D) is defined as

H(D) = \sum_{x \in X} D(x) \log \frac{1}{D(x)}

It is strange to think about this sum in abstract, so let’s suppose D is a biased coin flip with bias 0 \leq p \leq 1 of landing heads. Then we can plot the entropy as follows

Image source: Wikipedia

Image source: Wikipedia

The horizontal axis is the bias p, and the vertical axis is the value of H(D), which with some algebra is - p \log p - (1-p) \log (1-p). From the graph above we can see that the entropy is maximized when p=1/2 and minimized at p=0, 1. You can verify all of this with calculus, and you can prove that the uniform distribution maximizes entropy in general as well.

So what is this saying? A high entropy measures how incompressible something is, and low entropy gives us lots of compressibility. Indeed, if our message consisted of the results of 10 such coin flips, and p was close to 1, we could be able to compress a lot by encoding strings with lots of 1’s using few bits. On the other hand, if p=1/2 we couldn’t get any compression at all. All strings would be equally likely.

Shannon’s famous theorem shows that the entropy of the distribution is actually all that matters. Some quick notation: \{ 0,1 \}^* is the set of all binary strings.

Theorem (Noiseless Coding Theorem) [Shannon 1948]: For every finite set X and distribution D over X, there are encoding and decoding functions \textup{Enc}: X \to \{0,1 \}^*, \textup{Dec}: \{ 0,1 \}^* \to X such that

  1. The encoding/decoding actually works, i.e. \textup{Dec}(\textup{Enc}(x)) = x for all x.
  2. The expected length of an encoded message is between H(D) and H(D) + 1.

Moreover, no encoding scheme can do better.

Item 2 and the last sentence are the magical parts. In other words, if you know your distribution over messages, you precisely know how long to expect your messages to be. And you know that you can’t hope to do any better!

As the title of this post says, we aren’t going to give a proof here. Wikipedia has a proof if you’re really interested in the details.

Noisy Coding

The noisy coding problem is more interesting because in a certain sense (that was not solved by Shannon) it is still being studied today in the field of coding theory. The interpretation of the noisy coding problem is that you want to be able to recover from white noise errors introduced during transmission. The concept is called error correction. To restate what we said earlier, we want to recover from error with probability asymptotically close to 1, where the probability is over the errors.

It should be intuitively clear that you can’t do so without your encoding “blowing up” the length of the messages. Indeed, if your encoding does not blow up the message length then a single error will confound you since many valid messages would differ by only a single bit. So the question is does such an encoding exist, and if so how much do we need to blow up the message length? Shannon’s second theorem answers both questions.

Theorem (Noisy Coding Theorem) [Shannon 1948]: For any constant noise rate p < 1/2, there is an encoding scheme \textup{Enc} : \{ 0,1 \}^k \to \{0,1\}^{ck}, \textup{Dec} : \{ 0,1 \}^{ck} \to \{ 0,1\}^k with the following property. If x is the message sent by Alice, and y is the message received by Bob (i.e. \textup{Enc}(x) with random noise), then \Pr[\textup{Dec}(y) = x] \to 1 as a function of n=ck. In addition, if we denote by H(p) the entropy of the distribution of an error on a single bit, then choosing any c > \frac{1}{1-H(p)} guarantees the existence of such an encoding scheme, and no scheme exists for any smaller c.

This theorem formalizes a “yes” answer to the noisy coding problem, but moreover it characterizes the blowup needed for such a scheme to exist. The deep fact is that it only depends on the noise rate.

A word about the proof: it’s probabilistic. That is, Shannon proved such an encoding scheme exists by picking \textup{Enc} to be a random function (!). Then \textup{Dec}(y) finds (nonconstructively) the string x such that the number of bits different between \textup{Enc}(x) and y is minimized. This “number of bits that differ” measure is called the Hamming distance. Then he showed using relatively standard probability tools that this scheme has the needed properties with high probability, the implication being that some scheme has to exist for such a probability to even be positive. The sharp threshold for c takes a bit more work. If you want the details, check out the first few lectures of Madhu Sudan’s MIT class.

The non-algorithmic nature of his solution is what opened the door to more research. The question has surpassed, “Are there any encodings that work?” to the more interesting, “What is the algorithmic cost of constructing such an encoding?” It became a question of complexity, not computability. Moreover, the guarantees people wanted were strengthened to worst case guarantees. In other words, if I can guarantee at most 12 errors, is there an encoding scheme that will allow me to always recover the original message, and not just with high probability? One can imagine that if your message contains nuclear codes or your bank balance, you’d definitely want to have 100% recovery ability.

Indeed, two years later Richard Hamming spawned the theory of error correcting codes and defined codes that can always correct a single error. This theory has expanded and grown over the last sixty years, and these days the algorithmic problems of coding theory have deep connections to most areas of computer science, including learning theory, cryptography, and quantum computing.

We’ll cover Hamming’s basic codes next time, and then move on to Reed-Solomon codes and others. Until then!

The Giant Component and Explosive Percolation

Last time we left off with a tantalizing conjecture: a random graph with edge probability p = 5/n is almost surely a connected graph. We arrived at that conjecture from some ad-hoc data analysis, so let’s go back and treat it with some more rigorous mathematical techniques. As we do, we’ll discover some very interesting “threshold theorems” that essentially say a random graph will either certainly have a property, or it will certainly not have it.

phase-transition-n-grows

The phase transition we empirically observed from last time.

Big components

Recalling the basic definition: an Erdős-Rényi (ER) random graph with n vertices and edge probability p is a probability distribution over all graphs on n vertices. Generatively, you draw from an ER distribution by flipping a p-biased coin for each pair of vertices, and adding the edge if you flip heads. We call the random event of drawing a graph from this distribution a “random graph” even though it’s not a graph, and we denote an ER random graph by G(n,p). When p = 1/2, the distribution G(n,1/2) is the uniform distribution over all graphs on n vertices.

Now let’s get to some theorems. The main tools we’ll use are called the first and second moment method. Let’s illustrate them by example.

The first moment method

Say we want to know what values of p are likely to produce graphs with isolated vertices (vertices with no neighbors), and which are not. Of course, the value of p will depend on n \to \infty in general, but we can already see by example that if p = 1/2 then the probability of a fixed vertex being isolated is 2^{-n} \to 0. We can use the union bound (sum this value over all vertices) to show that the probability of any vertex being isolated is at most n2^{-n} which also tends to zero very quickly. This is not the first moment method, I’m just making the point that all of our results will be interpreted asymptotically as n \to \infty.

So now we can ask: what is the expected number of isolated vertices? If I call X the random variable that counts the expected number of isolated vertices, then I’m asking about \mathbb{E}[X]. Really what I’m doing is interpreting X as a random variable depending on n, p(n), and asking about the evolution of \mathbb{E}[X] as n \to \infty.

Now the first moment method states, somewhat obviously, that if the expectation tends to zero then the value of X itself also tends to zero. Indeed, this follows from Markov’s inequality, which states that the probability that X \geq a is bounded by \mathbb{E}[X]/a. In symbols,

\displaystyle \Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a}.

In our case X is counting something (it’s integer valued), so asking whether X > 0 is equivalent to asking whether X \geq 1. The upper bound on the probability of X being strictly positive is then just \mathbb{E}[X].

So let’s find out when the expected number of isolated vertices goes to zero. We’ll use the wondrous linearity of expectation to split X into a sum of counts for each vertex. That is, if X_i is 1 when vertex i is isolated and 0 otherwise (this is called an indicator variable), then X = \sum_{i=1}^n X_i and linearity of expectation gives

\displaystyle \mathbb{E}[X] = \mathbb{E}[\sum_{i=1}^n X_i] = \sum_{i=1}^n \mathbb{E}[X_i]

Now the expectation of an indicator random variable is just the probability that the event occurs (it’s trivial to check). It’s easy to compute the probability that a vertex is isolated: it’s (1-p)^n. So the sum above works out to be n(1-p)^n. It should really be n(1-p)^{n-1} but the extra factor of (1-p) doesn’t change anything. The question is what’s the “smallest” way to set p as a function of n in order to make the above thing go to zero? Using the fact that (1-x) < e^{-x} for all x > 0, we get

n(1-p)^n < ne^{-pn}

And setting p = (\log n) / n simplifies the right hand side to ne^{- \log n} = n / n = 1. This is almost what we want, so let’s set p to be anything that grows asymptotically faster than (\log n) / n. The notation for this is \omega((\log n) / n). Then using some slick asymptotic notation we can prove that the RHS of the inequality above goes to zero, and so the LHS must as well. Back to the big picture: we just showed that the expectation of X (the expected number of isolated vertices) goes to zero, and so by the first moment method the value of X (the actual number of isolated vertices) has to go to zero with probability tending to 1.

Some quick interpretations: when p = (\log n) / n each vertex has \log n neighbors in expectation. Moreover, having no isolated vertices is just a little bit short of the entire graph being connected (our ultimate goal is to figure out exactly when this happens). But already we can see that our conjecture from the beginning is probably false: we aren’t able to use this same method to show that when p = c/n for some constant c rules out isolated vertices as n \to \infty. We just got lucky in our data analysis that 5 is about the natural log of 100 (which is 4.6).

The second moment method

Now what about the other side of the coin? If p is asymptotically less than (\log n) / n do we necessarily get isolated vertices? That would really put our conjecture to rest. In this case the answer is yes, but it might not be in general. Let’s discuss.

We said that in general if \mathbb{E}[X] \to 0 then the value of X has to go to zero too (that’s the first moment method). The flip side of this is: if \mathbb{E}[X] \to \infty does necessarily the value of X also tend to infinity? The answer is not always yes. Here is a gruesome example I originally heard from a book: say X is the number of people that will die in the next decade due to an asteroid hitting the earth. The probability that the event happens is quite small, but if it does happen then the number of people that will die is quite large. It is perfectly reasonable for this to drag up the expectation (as the world population grows every decade), but at least we hope a growing population doesn’t by itself increase the value of X.

Mathematics is on our side here. We’re asking under what conditions on \mathbb{E}[X] does the following implication hold: \mathbb{E}[X] \to \infty implies \Pr[X > 0] \to 1.

With the first moment method we used Markov’s inequality (a statement about expectation, also called the first moment). With the second moment method we’ll use a statement about the second moment (variances), and the most common is Chebyshev’s inequality. Chebyshev’s inequality states that the probability X deviates from its expectation by more than c is bounded by \textup{Var}[X] / c^2. In symbols, for all c > 0 we have

\displaystyle \Pr[|X - \mathbb{E}[X]| \geq c] \leq \frac{\textup{Var}[X]}{c^2}

Now the opposite of X > 0, written in terms of deviation from expectation, is |X - \mathbb{E}[X]| \geq \mathbb{E}[X]. In words, in order for any number a to be zero, it has to have a distance of at least b from any number b. It’s such a stupidly simple statement it’s almost confusing. So then we’re saying that

\displaystyle \Pr[X = 0] \leq \frac{\textup{Var}[X]}{\mathbb{E}[X]^2}.

In order to make this probability go to zero, it’s enough to have \textup{Var}[X] = o(\mathbb{E}[X]^2). Again, the little-o means “grows asymptotically slower than.” So the numerator of the fraction on the RHS will grow asymptotically slower than the denominator, meaning the whole fraction tends to zero. This condition and its implication are together called the “second moment method.”

Great! So we just need to compute \textup{Var}[X] and check what conditions on p make it fit the theorem. Recall that \textup{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, and we want to upper bound this in terms of \mathbb{E}[X]^2. Let’s compute \mathbb{E}[X]^2 first.

\displaystyle \mathbb{E}[X]^2 = n^2(1-p)^{2n}

Now the variance.

\displaystyle \textup{Var}[X] = \mathbb{E}[X^2] - n^2(1-p)^{2n}

Expanding X as a sum of indicator variables X_i for each vertex, we can split the square into a sum over pairs. Note that X_i^2 = X_i since they are 0-1 valued indicator variables, and X_iX_j is the indicator variable for both events happening simultaneously.

\displaystyle \begin{aligned} \mathbb{E}[X^2] &= \mathbb{E}[\sum_{i,j} X_{i,j}] \\ &=\mathbb{E} \left [ \sum_i X_i^2 + \sum_{i \neq j} X_iX_j \right ] \\ &= \sum_i \mathbb{E}[X_i^2] + \sum_{i \neq j} \mathbb{E}[X_iX_j] \end{aligned}

By what we said about indicators, the last line is just

\displaystyle \sum_i \Pr[i \textup{ is isolated}] + \sum_{i \neq j} \Pr[i,j \textup{ are both isolated}]

And we can compute each of these pieces quite easily. They are (asymptotically ignoring some constants):

\displaystyle n(1-p)^n + n^2(1-p)(1-p)^{2n-4}

Now combining the two terms together (subtracting off the square of the expectation),

\displaystyle \begin{aligned} \textup{Var}[X] &\leq n(1-p)^n + n^2(1-p)^{-3}(1-p)^{2n} - n^2(1-p)^{2n} \\ &= n(1-p)^n + n^2(1-p)^{2n} \left ( (1-p)^{-3} - 1 \right ) \end{aligned}

Now we divide by \mathbb{E}[X]^2 to get n^{-1}(1-p)^{-n} + (1-p)^{-3} - 1. Since we’re trying to see if p = (\log n) / n is a sharp threshold, the natural choice is to let p = o((\log n) / n). Indeed, using the 1-x < e^{-x} upper bound and plugging in the little-o bounds the whole quantity by

\displaystyle \frac{1}{n}e^{o(\log n)} + o(n^{1/n}) - 1 = o(1)

i.e., the whole thing tends to zero, as desired.

Other thresholds

So we just showed that the property of having no isolated vertices in a random graph has a sharp threshold at p = (\log n) / n. Meaning at any larger probability the graph is almost surely devoid of isolated vertices, and at any lower probability the graph almost surely has some isolated vertices.

This might seem like a miracle theorem, but there turns out to be similar theorems for lots of properties. Most of them you can also prove using basically the same method we’ve been using here. I’ll list some below. Also note they are all sharp, two-sided thresholds in the same way that the isolated vertex boundary is.

  • The existence of a component of size \omega(\log (n)) has a threshold of 1/n.
  • p = c/n for any c > 0 is a threshold for the existence of a giant component of linear size \Theta(n). Moreover, above this threshold no other components will have size \omega(\log n).
  • In addition to (\log n) / n being a threshold for having no isolated vertices, it is also a threshold for connectivity.
  • p = (\log n + \log \log n + c(n)) / n is a sharp threshold for the existence of Hamiltonian cycles in the following sense: if c(n) = \omega(1) then there will be a Hamilton cycle almost surely, if c(n) \to -\infty there will be no Hamiltonian cycle almost surely, and if c(n) \to c the probability of a Hamiltonian cycle is e^{-e^{-c}}. This was proved by Kolmos and Szemeredi in 1983. Moreover, there is an efficient algorithm to find Hamiltonian cycles in these random graphs when they exist with high probability.

Explosive Percolation

So now we know that as the probability of an edge increases, at some point the graph will spontaneously become connected; at some time that is roughly \log(n) before, the so-called “giant component” will emerge and quickly engulf the entire graph.

Here’s a different perspective on this situation originally set forth by Achlioptas, D’Souza, and Spencer in 2009. It has since become called an “Achlioptas process.”

The idea is that you are watching a random graph grow. Rather than think about random graphs as having a probability above or below some threshold, you can think of it as the number of edges growing (so the thresholds will all be multiplied by n). Then you can imagine that you start with an empty graph, and at every time step someone is adding a new random edge to your graph. Fine, eventually you’ll get so many edges that a giant component emerges and you can measure when that happens.

But now imagine that instead of being given a single random new edge, you are given a choice. Say God presents you with two random edges, and you must pick which to add to your graph. Obviously you will eventually still get a giant component, but the question is how long can you prevent it from occurring? That is, how far back can we push the threshold for connectedness by cleverly selecting the new edge?

What Achlioptas and company conjectured was that you can push it back (some), but that when you push it back as far as it can go, the threshold becomes discontinuous. That is, they believed there was a constant \delta \geq 1/2 such that the size of the largest component jumps from o(n) to \delta n in o(n) steps.

This turned out to be false, and Riordan and Warnke proved it. Nevertheless, the idea has been interpreted in an interesting light. People have claimed it is a useful model of disaster in the following sense. If you imagine that an edge between two vertices is a “crisis” relating two entities. Then in every step God presents you with two crises and you only have the resources to fix one. The idea is that when the entire graph is connected, you have this one big disaster where all the problems are interacting with each other. The percolation process describes how long you can “survive” while avoiding the big disaster.

There are critiques of this interpretation, though, mainly about how simplistic it is. In particular, an Achlioptas process models a crisis as an exogenous force when in reality problems are usually endogenous. You don’t expect a meteor to hit the Earth, but you do expect humans to have an impact on the environment. Also, not everybody in the network is trying to avoid errors. Some companies thrive in economic downturns by managing your toxic assets, for example. So one could reasonably argue that Achlioptas processes aren’t complex enough to model the realistic types of disasters we face.

Either way, I find it fantastic that something like a random graph (which for decades was securely in pure combinatorics away from applications) is spurring such discussion.

Next time, we’ll take one more dive into the theory of Erdős-Rényi random graphs to prove a very “meta” theorem about sharp thresholds. Then we’ll turn our attention to other models of random graphs, hopefully more realistic ones :)

Until then!

When Greedy Algorithms are Perfect: the Matroid

Greedy algorithms are by far one of the easiest and most well-understood algorithmic techniques. There is a wealth of variations, but at its core the greedy algorithm optimizes something using the natural rule, “pick what looks best” at any step. So a greedy routing algorithm would say to a routing problem: “You want to visit all these locations with minimum travel time? Let’s start by going to the closest one. And from there to the next closest one. And so on.”

Because greedy algorithms are so simple, researchers have naturally made a big effort to understand their performance. Under what conditions will they actually solve the problem we’re trying to solve, or at least get close? In a previous post we gave some easy-to-state conditions under which greedy gives a good approximation, but the obvious question remains: can we characterize when greedy algorithms give an optimal solution to a problem?

The answer is yes, and the framework that enables us to do this is called a matroid. That is, if we can phrase the problem we’re trying to solve as a matroid, then the greedy algorithm is guaranteed to be optimal. Let’s start with an example when greedy is provably optimal: the minimum spanning tree problem. Throughout the article we’ll assume the reader is familiar with the very basics of linear algebra and graph theory (though we’ll remind ourselves what a minimum spanning tree is shortly). For a refresher, this blog has primers on both subjects. But first, some history.

History

Matroids were first introduced by Hassler Whitney in 1935, and independently discovered a little later by B.L. van der Waerden (a big name in combinatorics). They were both interested in devising a general description of “independence,” the properties of which are strikingly similar when specified in linear algebra and graph theory. Since then the study of matroids has blossomed into a large and beautiful theory, one part of which is the characterization of the greedy algorithm: greedy is optimal on a problem if and only if the problem can be represented as a matroid. Mathematicians have also characterized which matroids can be modeled as spanning trees of graphs (we will see this momentarily). As such, matroids have become a standard topic in the theory and practice of algorithms.

Minimum Spanning Trees

It is often natural in an undirected graph G = (V,E) to find a connected subset of edges that touch every vertex. As an example, if you’re working on a power network you might want to identify a “backbone” of the network so that you can use the backbone to cheaply travel from any node to any other node. Similarly, in a routing network (like the internet) it costs a lot of money to lay down cable, it’s in the interest of the internet service providers to design analogous backbones into their infrastructure.

A minimal subset of edges in a backbone like this is guaranteed to form a tree. This is simply because if you have a cycle in your subgraph then removing any edge on that cycle doesn’t break connectivity or the fact that you can get from any vertex to any other (and trees are the maximal subgraphs without cycles). As such, these “backbones” are called spanning trees. “Span” here means that you can get from any vertex to any other vertex, and it suggests the connection to linear algebra that we’ll describe later, and it’s a simple property of a tree that there is a unique path between any two vertices in the tree.

An example of a spanning tree

An example of a spanning tree

When your edges e \in E have nonnegative weights w_e \in \mathbb{R}^{\geq 0}, we can further ask to find a minimum cost spanning tree. The cost of a spanning tree T is just the sum of its edges, and it’s important enough of a definition to offset.

Definition: A minimum spanning tree T of a weighted graph G (with weights w_e \geq 0 for e \in E) is a spanning tree which minimizes the quantity

w(T) = \sum_{e \in T} w_e

There are a lot of algorithms to find minimal spanning trees, but one that will lead us to matroids is Kruskal’s algorithm. It’s quite simple. We’ll maintain a forest F in G, which is just a subgraph consisting of a bunch of trees that may or may not be connected. At the beginning F is just all the vertices with no edges. And then at each step we add to F the edge e whose weight is smallest and also does not introduce any cycles into F. If the input graph G is connected then this will always produce a minimal spanning tree.

Theorem: Kruskal’s algorithm produces a minimal spanning tree of a connected graph.

Proof. Call F_t the forest produced at step t of the algorithm. Then F_0 is the set of all vertices of G and F_{n-1} is the final forest output by Kruskal’s (as a quick exercise, prove all spanning trees on n vertices have n-1 edges, so we will stop after n-1 rounds). It’s clear that F_{n-1} is a tree because the algorithm guarantees no F_i will have a cycle. And any tree with n-1 edges is necessarily a spanning tree, because if some vertex were left out then there would be n-1 edges on a subgraph of n-1 vertices, necessarily causing a cycle somewhere in that subgraph.

Now we’ll prove that F_{n-1} has minimal cost. We’ll prove this in a similar manner to the general proof for matroids. Indeed, say you had a tree T whose cost is strictly less than that of F_{n-1} (we can also suppose that T is minimal, but this is not necessary). Pick the minimal weight edge e \in T that is not in F_{n-1}. Adding e to F_{n-1} introduces a unique cycle C in F_{n-1}. This cycle has some strange properties. First, e has the highest cost of any edge on C. For otherwise, Kruskal’s algorithm would have chosen it before the heavier weight edges. Second, there is another edge in C that’s not in T (because T was a tree it can’t have the entire cycle). Call such an edge e'. Now we can remove e' from F_{n-1} and add e. This can only increase the total cost of F_{n-1}, but this transformation produces a tree with one more edge in common with T than before. This contradicts that T had strictly lower weight than F_{n-1}, because repeating the process we described would eventually transform F_{n-1} into T exactly, while only increasing the total cost.

\square

Just to recap, we defined sets of edges to be “good” if they did not contain a cycle, and a spanning tree is a maximal set of edges with this property. In this scenario, the greedy algorithm performed optimally at finding a spanning tree with minimal total cost.

Columns of Matrices

Now let’s consider a different kind of problem. Say I give you a matrix like this one:

\displaystyle A = \begin{pmatrix} 2 & 0 & 1 & -1 & 0 \\ 0 & -4 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 7 \end{pmatrix}

In the standard interpretation of linear algebra, this matrix represents a linear function f from one vector space V to another W, with the basis (v_1, \dots, v_5) of V being represented by columns and the basis (w_1, w_2, w_3) of W being represented by the rows. Column j tells you how to write f(v_j) as a linear combination of the w_i, and in so doing uniquely defines f.

Now one thing we want to calculate is the rank of this matrix. That is, what is the dimension of the image of V under f? By linear algebraic arguments we know that this is equivalent to asking “how many linearly independent columns of A can we find”? An interesting consequence is that if you have two sets of columns that are both linearly independent and maximally so (adding any other column to either set would necessarily introduce a dependence in that set), then these two sets have the same size. This is part of why the rank of a matrix is well-defined.

If we were to give the columns of A costs, then we could ask about finding the minimal-cost maximally-independent column set. It sounds like a mouthful, but it’s exactly the same idea as with spanning trees: we want a set of vectors that spans the whole column space of A, but contains no “cycles” (linearly dependent combinations), and we want the cheapest such set.

So we have two kinds of “independence systems” that seem to be related. One interesting question we can ask is whether these kinds of independence systems are “the same” in a reasonable way. Hardcore readers of this blog may see the connection quite quickly. For any graph G = (V,E), there is a natural linear map from E to V, so that a linear dependence among the columns (edges) corresponds to a cycle in G. This map is called the incidence matrix by combinatorialists and the first boundary map by topologists.

The map is easy to construct: for each edge e = (v_i,v_j) you add a column with a 1 in the j-th row and a -1 in the i-th row. Then taking a sum of edges gives you zero if and only if the edges form a cycle. So we can think of a set of edges as “independent” if they don’t contain a cycle. It’s a little bit less general than independence over \mathbb{R}, but you can make it exactly the same kind of independence if you change your field from real numbers to \mathbb{Z}/2\mathbb{Z}. We won’t do this because it will detract from our end goal (to analyze greedy algorithms in realistic settings), but for further reading this survey of Oxley assumes that perspective.

So with the recognition of how similar these notions of independence are, we are ready to define matroids.

The Matroid

So far we’ve seen two kinds of independence: “sets of edges with no cycles” (also called forests) and “sets of linearly independent vectors.” Both of these share two trivial properties: there are always nonempty independent sets, and every subset of an independent set is independent. We will call any family of subsets with this property an independence system.

Definition: Let X be a finite set. An independence system over X is a family \mathscr{I} of subsets of X with the following two properties.

  1. \mathscr{I} is nonempty.
  2. If I \in \mathscr{I}, then so is every subset of I.

This is too general to characterize greedy algorithms, so we need one more property shared by our examples. There are a few things we do, but here’s one nice property that turns out to be enough.

Definition: A matroid M = (X, \mathscr{I}) is a set X and an independence system \mathscr{I} over X with the following property:

If A, B are in \mathscr{I} with |A| = |B| + 1, then there is an element x \in A \setminus B such that B \cup \{ a \} \in \mathscr{I}.

In other words, this property says if I have an independent set that is not maximally independent, I can grow the set by adding some suitably-chosen element from a larger independent set. We’ll call this the extension property. For a warmup exercise, let’s prove that the extension property is equivalent to the following (assuming the other properties of a matroid):

For every subset Y \subset X, all maximal independent sets contained in Y have equal size.

Proof. For one direction, if you have two maximal sets A, B \subset Y \subset X that are not the same size (say A is bigger), then you can take any subset of A whose size is exactly |B| + 1, and use the extension property to make B larger, a contradiction. For the other direction, say that I know all maximal independent sets of any Y \subset X have the same size, and you give me A, B \subset X. I need to find an a \in A \setminus B that I can add to B and keep it independent. What I do is take the subset Y = A \cup B. Now the sizes of A, B don’t change, but B can’t be maximal inside Y because it’s smaller than A (A might not be maximal either, but it’s still independent). And the only way to extend B is by adding something from A, as desired.

\square

So we can use the extension property and the cardinality property interchangeably when talking about matroids. Continuing to connect matroid language to linear algebra and graph theory, the maximal independent sets of a matroid are called bases, the size of any basis is the rank of the matroid, and the minimal dependent sets are called circuits. In fact, you can characterize matroids in terms of the properties of their circuits, which are dual to the properties of bases (and hence all independent sets) in a very concrete sense.

But while you could spend all day characterizing the many kinds of matroids and comatroids out there, we are still faced with the task of seeing how the greedy algorithm performs on a matroid. That is, suppose that your matroid M = (X, \mathscr{I}) has a nonnegative real number w(x) associated with each x \in X. And suppose we had a black-box function to determine if a given set S \subset X is independent. Then the greedy algorithm maintains a set B, and at every step adds a minimum weight element that maintains the independence of B. If we measure the cost of a subset by the sum of the weights of its elements, then the question is whether the greedy algorithm finds a minimum weight basis of the matroid.

The answer is even better than yes. In fact, the answer is that the greedy algorithm performs perfectly if and only if the problem is a matroid! More rigorously,

Theorem: Suppose that M = (X, \mathscr{I}) is an independence system, and that we have a black-box algorithm to determine whether a given set is independent. Define the greedy algorithm to iteratively adds the cheapest element of X that maintains independence. Then the greedy algorithm produces a maximally independent set S of minimal cost for every nonnegative cost function on X, if and only if M is a matroid.

It’s clear that the algorithm will produce a set that is maximally independent. The only question is whether what it produces has minimum weight among all maximally independent sets. We’ll break the theorem into the two directions of the “if and only if”:

Part 1: If M is a matroid, then greedy works perfectly no matter the cost function.
Part 2: If greedy works perfectly for every cost function, then M is a matroid.

Proof of Part 1.

Call the cost function w : X \to \mathbb{R}^{\geq 0}, and suppose that the greedy algorithm picks elements B = \{ x_1, x_2, \dots, x_r \} (in that order). It’s easy to see that w(x_1) \leq w(x_2) \leq \dots \leq w(x_r). Now if you give me any list of r independent elements y_1, y_2, \dots, y_r \in X that has w(y_1) \leq \dots \leq w(y_r), I claim that w(x_i) \leq w(y_i) for all i. This proves what we want, because if there were a basis of size r with smaller weight, sorting its elements by weight would give a list contradicting this claim.

To prove the claim, suppose to the contrary that it were false, and for some k we have w(x_k) > w(y_k). Moreover, pick the smallest k for which this is true. Note k > 1, and so we can look at the special sets S = \{ x_1, \dots, x_{k-1} \} and T = \{ y_1, \dots, y_k \}. Now |T| = |S|+1, so by the matroid property there is some j between 1 and r so that S \cup \{ y_j \} is an independent set (and y_j is not in S). But then w(y_j) \leq w(y_k) < w(x_k), and so the greedy algorithm would have picked y_j before it picks x_k (and the strict inequality means they’re different elements). This contradicts how the greedy algorithm runs, and hence proves the claim.

Proof of Part 2.

We’ll prove this contrapositively as follows. Suppose we have our independence system and it doesn’t satisfy the last matroid condition. Then we’ll construct a special weight function that causes the greedy algorithm to fail. So let A,B be independent sets with |A| = |B| + 1, but for every a \in A \setminus B adding a to B never gives you an independent set.

Now what we’ll do is define our weight function so that the greedy algorithm picks the elements we want in the order we want (roughly). In particular, we’ll assign all elements of A \cap B a tiny weight we’ll call w_1. For elements of B - A we’ll use w_2, and for A - B we’ll use w_3, with w_4 for everything else. In a more compact notation:

CodeCogsEqn

We need two things for this weight function to screw up the greedy algorithm. The first is that w_1 < w_2 < w_3 < w_4, so that greedy picks the elements in the order we want. Note that this means it’ll first pick all of A \cap B, and then all of B - A, and by assumption it won’t be able to pick anything from A - B, but since B is assumed to be non-maximal, we have to pick at least one element from X - (A \cup B) and pay w_4 for it.

So the second thing we want is that the cost of doing greedy is worse than picking any maximally independent set that contains A (and we know that there has to be some maximal independent set containing A). In other words, if we call m the size of a maximally independent set, we want

\displaystyle |A \cap B| w_1 + |B-A|w_2 + (m - |B|)w_4 > |A \cap B|w_1 + |A-B|w_3 + (m-|A|)w_4

This can be rearranged (using the fact that |A| = |B|+1) to

\displaystyle w_4 > |A-B|w_3 - |B-A|w_2

The point here is that the greedy picks too many elements of weight w_4, since if we were to start by taking all of A (instead of all of B), then we could get by with one fewer. That might not be optimal, but it’s better than greedy and that’s enough for the proof.

So we just need to make w_4 large enough to make this inequality hold, while still maintaining w_2 < w_3. There are probably many ways to do this, and here’s one. Pick some 0 < \varepsilon < 1, and set

settings

It’s trivial that w_1 < w_2 and w_3 < w_4. For the rest we need some observations. First, the fact that |A-B| = |B-A| + 1 implies that w_2 < w_3. Second, both |A-B| and |B-A| are nonempty, since otherwise the second property of independence systems would contradict our assumption that augmenting B with elements of A breaks independence. Using this, we can divide by these quantities to get

\displaystyle w_4 = 2 > 1 = \frac{|A-B|(1 + \varepsilon)}{|A-B|} - \frac{|B-A|\varepsilon}{|B-A|}

This proves the claim and finishes the proof.

\square

As a side note, we proved everything here with respect to minimizing the sum of the weights, but one can prove an identical theorem for maximization. The only part that’s really different is picking the clever weight function in part 2. In fact, you can convert between the two by defining a new weight function that subtracts the old weights from some fixed number N that is larger than any of the original weights. So these two problems really are the same thing.

This is pretty amazing! So if you can prove your problem is a matroid then you have an awesome algorithm automatically. And if you run the greedy algorithm for fun and it seems like it works all the time, then that may be hinting that your problem is a matroid. This is one of the best situations one could possibly hope for.

But as usual, there are a few caveats to consider. They are both related to efficiency. The first is the black box algorithm for determining if a set is independent. In a problem like minimum spanning tree or finding independent columns of a matrix, there are polynomial time algorithms for determining independence. These two can both be done, for example, with Gaussian elimination. But there’s nothing to stop our favorite matroid from requiring an exponential amount of time to check if a set is independent. This makes greedy all but useless, since we need to check for independence many times in every round.

Another, perhaps subtler, issue is that the size of the ground set X might be exponentially larger than the rank of the matroid. In other words, at every step our greedy algorithm needs to find a new element to add to the set it’s building up. But there could be such a huge ocean of candidates, all but a few of which break independence. In practice an algorithm might be working with X implicitly, so we could still hope to solve the problem if we had enough knowledge to speed up the search for a new element.

There are still other concerns. For example, a naive approach to implementing greedy takes quadratic time, since you may have to look through every element of X to find the minimum-cost guy to add. What if you just have to have faster runtime than O(n^2)? You can still be interested in finding more efficient algorithms that still perform perfectly, and to the best of my knowledge there’s nothing that says that greedy is the only exact algorithm for your favorite matroid. And then there are models where you don’t have direct/random access to the input, and lots of other ways that you can improve on greedy. But those stories are for another time.

Until then!

When Greedy Algorithms are Good Enough: Submodularity and the (1 – 1/e)-Approximation

Greedy algorithms are among the simplest and most intuitive algorithms known to humans. Their name essentially gives their description: do the thing that looks best right now, and repeat until nothing looks good anymore or you’re forced to stop. Some of the best situations in computer science are also when greedy algorithms are optimal or near-optimal. There is a beautiful theory of this situation, known as the theory of matroids. We haven’t covered matroids on this blog (at some point we will), but in this post we will focus on the next best thing: when the greedy algorithm guarantees a reasonably good approximation to the optimal solution.

This situation isn’t hard to formalize, and we’ll make it as abstract as possible. Say you have a set of objects X, and you’re looking to find the “best” subset S \subset X. Here “best” is just measured by a fixed (known, efficiently computable) objective function f : 2^X \to \mathbb{R}. That is, f accepts as input subsets of X and outputs numbers so that better subsets have larger numbers. Then the goal is to find a subset maximizing X.

In this generality the problem is clearly impossible. You’d have to check all subsets to be sure you didn’t miss the best one. So what conditions do we need on either X or f or both that makes this problem tractable? There are plenty you could try, but one very rich property is submodularity.

The Submodularity Condition

I think the simplest way to explain submodularity is in terms of coverage. Say you’re starting a new radio show and you have to choose which radio stations to broadcast from to reach the largest number of listeners. For simplicity say each radio station has one tower it broadcasts from, and you have a good estimate of the number of listeners you would reach if you broadcast from a given tower. For more simplicity, say it costs the same to broadcast from each tower, and your budget restricts you to a maximum of ten stations to broadcast from. So the question is: how do you pick towers to maximize your overall reach?

The hidden condition here is that some towers overlap in which listeners they reach. So if you broadcast from two towers in the same city, a listener who has access to both will just pick one or the other. In other words, there’s a diminished benefit to picking two overlapping towers if you already have chosen one.

In our version of the problem, picking both of these towers has some small amount of "overkill."

In our version of the problem, picking both of these towers has some small amount of “overkill.”

This “diminishing returns” condition is a general idea you can impose on any function that takes in subsets of a given set and produces numbers. If X is a set then for what seems like a strange reason we denote the set of all subsets of X by 2^X. So we can state this condition more formally,

Definition: Let X be a finite set. A function f: 2^X \to \mathbb{R} is called submodular if for all subsets S \subset T \subset X and all x \in X \setminus T,

\displaystyle f(S \cup \{ x \}) - f(S) \geq f(T \cup \{ x \}) - f(T)

In other words, if f measures “benefit,” then the marginal benefit of adding x to S is at least as high as the marginal benefit of adding it to T. Since S \subset T and x are all arbitrary, this is as general as one could possibly make it.

Before we start doing things with submodular functions, let’s explore some basic properties. The first is an equivalent definition of submodularity

Proposition: f is submodular if and only if for all A, B \subset X, it holds that

\displaystyle f(A \cap B) + f(A \cup B) \leq f(A) + f(B).

Proof. If we assume f has the condition from this proposition, then we can set A=T, B=S \cup \{ x \}, and the formula just works out. Conversely, if we have the condition from the definition, then using the fact that A \cap B \subset B we can inductively apply the inequality to each element of A \setminus B to get

\displaystyle f(A \cup B) - f(B) \leq f(A) - f(A \cap B)

\square

Next, we can tweak and combine submodular functions to get more submodular functions. In particular, non-negative linear combinations of sub-modular functions are submodular. In other words, if f_1, \dots, f_k are submodular on the same set X, and \alpha_1, \dots, \alpha_k are all non-negative reals, then \alpha_1 f_1 + \dots + \alpha_k f_k is also a submodular function on X. It’s an easy exercise in applying the definition to see why this is true. This is important because when we’re designing objectives to maximize, we can design them by making some simple submodular pieces, and then picking an appropriate combination of those pieces.

The second property we need to impose on a submodular function is monotonicity. That is, as your sets get more elements added to them, their value under f only goes up. In other words, f is monotone when S \subset T then f(S) \leq f(T). An interesting property of functions that are both submodular and monotone is that the truncation of such a function is also submodular and monotone. In other words, \textup{min}(f(S), c) is still submodular when f is monotone submodular and c is a constant.

Submodularity and Monotonicity Give 1 – 1/e

The wonderful thing about submodular functions is that we have a lot of great algorithmic guarantees for working with them. We’ll prove right now that the coverage problem (while it might be hard to solve in general) can be approximated pretty well by the greedy algorithm.

Here’s the algorithmic setup. I give you a finite set X and an efficient black-box to evaluate f(S) for any subset S \subset X you want. I promise you that f is monotone and submodular. Now I give you an integer k between 1 and the size of X, and your task is to quickly find a set S of size k for which f(S) is maximal among all subsets of size k. That is, you design an algorithm that will work for any k, X, f and runs in polynomial time in the sizes of X, k.

In general this problem is NP-hard, meaning you’re not going to find a solution that works in the worst case (if you do, don’t call me; just claim your million dollar prize). So how well can we approximate the optimal value for f(S) by a different set of size k? The beauty is that, if your function is monotone and submodular, you can guarantee to get within 63% of the optimum. The hope (and reality) is that in practice it will often perform much better, but still this is pretty good! More formally,

Theorem: Let f be a monotone, submodular, non-negative function on X. The greedy algorithm, which starts with S as the empty set and at every step picks an element x which maximizes the marginal benefit f(S \cup \{ x \}) - f(S), provides a set S that achieves a (1- 1/e)-approximation of the optimum.

We’ll prove this in just a little bit more generality, and the generality is quite useful. If we call S_1, S_2, \dots, S_l the sets chosen by the greedy algorithm (where now we might run the greedy algorithm for l > k steps), then for all l, k, we have

\displaystyle f(S_l) \geq \left ( 1 - e^{-l/k} \right ) \max_{T: |T| \leq k} f(T)

This allows us to run the algorithm for more than k steps to get a better approximation by sets of larger size, and quantify how much better the guarantee on that approximation would be. It’s like an algorithmic way of hedging your risk. So let’s prove it.

Proof. Let’s set up some notation first. Fix your l and k, call S_i the set chosen by the greedy algorithm at step i, and call S^* the optimal subset of size k. Further call \textup{OPT} the value of the best set f(S^*). Call x_1^*, \dots, x_k^* the elements of S^* (the order is irrelevant). Now for every i < l monotonicity gives us f(S^*) \leq f(S^* \cup S_i). We can unravel this into a sum of marginal gains of adding single elements. The first step is

\displaystyle f(S^* \cup S_i) = f(S^* \cup S_i) - f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i) + f(\{ x_1^*, \dots, x_{k-1}^* \} \cup S_i)

The second step removes x_{k-1}^*, from the last term, the third removes x_{k-2}^*, and so on until we have removed all of S^* and get this sum

\displaystyle f(S^* \cup S_i) = f(S_i) + \sum_{j=1}^k \left ( f(S_i \cup \{ x_1^*, \dots, x_j^* \}) - f(S_i \cup \{ x_1^*, \dots, x_{j-1}^* \} ) \right )

Now, applying submodularity, we can change all of these marginal benefits of “adding one more S^* element to S_i already with some S^* stuff” to “adding one more S^* element to just S_i.” In symbols, the equation above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} f(S_i \cup \{ x \}) - f(S_i)

and because S_{i+1} is greedily chosen to maximize the benefit of adding a single element, so the above is at most

\displaystyle f(S_i) + \sum_{x \in S^*} f(S_{i+1}) - f(S_i) = f(S_i) + k(f(S_{i+1}) - f(S_i))

Chaining all of these together, we have f(S^*) - f(S_i) \leq k(f(S_{i+1}) - f(S_i)). If we call a_{i} = f(S^*) - f(S_i), then this inequality can be rewritten as a_{i+1} \leq (1 - 1/k) a_{i}. Now by induction we can relate a_l \leq (1 - 1/k)^l a_0. Now use the fact that a_0 \leq f(S^*) and the common inequality 1-x \leq e^{-x} to get

\displaystyle a_l = f(S^*) - f(S_l) \leq e^{-l/k} f(S^*)

And rearranging gives f(S_l) \geq (1 - e^{-l/k}) f(S^*).

\square

Setting l=k gives the approximation bound we promised. But note that allowing the greedy algorithm to run longer can give much stronger guarantees, though it requires you to sacrifice the cardinality constraint. 1 - 1/e is about 63%, but doubling the size of S gives about an 86% approximation guarantee. This is great for people in the real world, because you can quantify the gains you’d get by relaxing the constraints imposed on you (which are rarely set in stone).

So this is really great! We have quantifiable guarantees on a stupidly simple algorithm, and the setting is super general. And so if you have your problem and you manage to prove your function is submodular (this is often the hardest part), then you are likely to get this nice guarantee.

Extensions and Variations

This result on monotone submodular functions is just one part of a vast literature on finding approximation algorithms for submodular functions in various settings. In closing this post we’ll survey some of the highlights and provide references.

What we did in this post was maximize a monotone submodular function subject to a cardinality constraint |S| \leq k. There are three basic variations we could do: we could drop constraints and see whether we can still get guarantees, we could look at minimization instead of maximization, and we could modify the kinds of constraints we impose on the solution.

There are a ton of different kinds of constraints, and we’ll discuss two. The first is where you need to get a certain value f(S) \geq q, and you want to find the smallest set that achieves this value. Laurence Wolsey (who proved a lot of these theorems) showed in 1982 that a slight variant of the greedy algorithm can achieve a set whose size is a multiplicative factor of 1 + \log (\max_x f(\{ x \})) worse than the optimum.

The second kind of constraint is a generalization of a cardinality constraint called a knapsack constraint. This means that each item x \in X has a cost, and you have a finite budget with which to spend on elements you add to S. One might expect this natural extension of the greedy algorithm to work: pick the element which maximizes the ratio of increasing the value of f to the cost (within your available budget). Unfortunately this algorithm can perform arbitrarily poorly, but there are two fun caveats. The first is that if you do both this augmented greedy algorithm and the greedy algorithm that ignores costs, then at least one of these can’t do too poorly. Specifically, one of them has to get at least a 30% approximation. This was shown by Leskovec et al in 2007. The second is that if you’re willing to spend more time in your greedy step by choosing the best subset of size 3, then you can get back to the 1-1/e approximation. This was shown by Sviridenko in 2004.

Now we could try dropping the monotonicity constraint. In this setting cardinality constraints are also superfluous, because it could be that the very large sets have low values. Now it turns out that if f has no other restrictions (in particular, if it’s allowed to be negative), then even telling whether there’s a set S with f(S) > 0 is NP-hard, but the optimum could be arbitrarily large and positive when it exists. But if you require that f is non-negative, then you can get a 1/3-approximation, if you’re willing to add randomness you can get 2/5 in expectation, and with more subtle constraints you can get up to a 1/2 approximation. Anything better is NP-hard. Fiege, Mirrokni, and Vondrak have a nice FOCS paper on this.

Next, we could remove the monotonicity property and try to minimize the value of f(S). It turns out that this problem always has an efficient solution, but the only algorithm I have heard of to solve it involves a very sophisticated technique called the ellipsoid algorithm. This is heavily related to linear programming and convex optimization, something which I hope to cover in more detail on this blog.

Finally, there are many interesting variations in the algorithmic procedure. For example, one could require that the elements are provided in some order (the streaming setting), and you have to pick at each step whether to put the element in your set or not. Alternatively, the objective functions might not be known ahead of time and you have to try to pick elements to jointly maximize them as they are revealed. These two settings have connections to bandit learning problems, which we’ve covered before on this blog. See this survey of Krause and Golovin for more on the connections, which also contains the main proof used in this post.

Indeed, despite the fact that many of the big results were proved in the 80’s, the analysis of submodular functions is still a big research topic. There was even a paper posted just the other day on the arXiv about it’s relation to ad serving! And wouldn’t you know, they proved a (1-1/e)-approximation for their setting. There’s just something about 1-1/e.

Until next time!