The Čech Complex and the Vietoris-Rips Complex

It’s about time we got back to computational topology. Previously in this series we endured a lightning tour of the fundamental group and homology, then we saw how to compute the homology of a simplicial complex using linear algebra.

What we really want to do is talk about the inherent shape of data. Homology allows us to compute some qualitative features of a given shape, i.e., find and count the number of connected components of a given shape, or the number of “2-dimensional holes” it has. This is great, but data doesn’t come in a form suitable for computing homology. Though they may have originated from some underlying process that follows nice rules, data points are just floating around in space with no obvious connection between them.

Here is a cool example of Thom Yorke, the lead singer of the band Radiohead, whose face was scanned with a laser scanner for their music video “House of Cards.”


Radiohead’s Thom Yorke in the music video for House of Cards (click the image to watch the video).

Given a point cloud such as the one above, our long term goal (we’re just getting started in this post) is to algorithmically discover what the characteristic topological features are in the data. Since homology is pretty coarse, we might detect the fact that the point cloud above looks like a hollow sphere with some holes in it corresponding to nostrils, ears, and the like. The hope is that if the data set isn’t too corrupted by noise, then it’s a good approximation to the underlying space it is sampled from. By computing the topological features of a point cloud we can understand the process that generated it, and Science can proceed.

But it’s not always as simple as Thom Yorke’s face. It turns out the producers of the music video had to actually degrade the data to get what you see above, because their lasers were too precise and didn’t look artistic enough! But you can imagine that if your laser is mounted on a car on a bumpy road, or tracking some object in the sky, or your data comes from acoustic waves traveling through earth, you’re bound to get noise. Or more realistically, if your data comes from thousands of stock market prices then the process generating the data is super mysterious. It changes over time, it may not follow any discernible pattern (though speculators may hope it does), and you can’t hope to visualize the entire dataset in any useful way.

But with persistent homology, so the claim goes, you’d get a good qualitative understanding of the dataset. Your results would be resistant to noise inherent in the data. It also wouldn’t be sensitive to the details of your data cleaning process. And with a dash of ingenuity, you can come up with a reasonable mathematical model of the underlying generative process. You could use that model to design algorithms, make big bucks, discover new drugs, recognize pictures of cats, or whatever tickles your fancy.

But our first problem is to resolve the input data type error. We want to use homology to describe data, but our data is a point cloud and homology operates on simplicial complexes. In this post we’ll see two ways one can do this, and see how they’re related.

The Čech complex

Let’s start with the Čech complex. Given a point set X in some metric space and a number \varepsilon > 0, the Čech complex C_\varepsilon is the simplicial complex whose simplices are formed as follows. For each subset S \subset X of points, form an (\varepsilon/2)-ball around each point in S, and include S as a simplex (of dimension |S| - 1) if there is a common point contained in all of the balls in S. This structure obviously satisfies the definition of a simplicial complex: any sub-subset S' \subset S of a simplex S will also be a simplex. Here is an example of the (\varepsilon/2)-balls.


An example of a point cloud (left) and a corresponding choice of (epsilon/2)-balls. To get the Čech complex, we add a (k-1)-simplex any time we see a subset of k points whose balls have a common intersection. [Image credit: Robert Ghrist]

Let me superscript the Čech complex to illustrate the pieces. Specifically, we’ll let C_\varepsilon^{j} denote all the simplices of dimension up to j. In particular, C_\varepsilon^1 is a graph where an edge is placed between x,y if d(x,y) < \varepsilon, and C_{\varepsilon}^2 places triangles (2-simplices) on triples of points whose balls have a three-way intersection.
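
To make this concrete, here’s a minimal sketch (with function names of my own invention, not from any library) of computing C_\varepsilon^2 straight from the definition. It uses the fact that a handful of balls of radius r have a common point exactly when the smallest ball enclosing their centers has radius at most r; for three points that smallest ball is either one having a side as its diameter or the circumscribed ball of the triangle.

import numpy as np
from itertools import combinations
from numpy.linalg import norm

def minEnclosingRadius3(p1, p2, p3):
   # radius of the smallest ball containing three points; valid in any dimension,
   # since three points always lie in a common plane
   pts = [np.asarray(p, dtype=float) for p in (p1, p2, p3)]
   for (a, b) in combinations(range(3), 2):
      c = 3 - a - b
      center, radius = (pts[a] + pts[b]) / 2, norm(pts[a] - pts[b]) / 2
      if norm(pts[c] - center) <= radius + 1e-12:
         return radius   # the ball with this side as diameter contains the third point
   # otherwise the smallest enclosing ball is the circumscribed ball of the triangle
   u, v = pts[1] - pts[0], pts[2] - pts[0]
   area = 0.5 * np.sqrt(max(norm(u)**2 * norm(v)**2 - np.dot(u, v)**2, 0.0))
   sideA, sideB, sideC = norm(pts[1] - pts[2]), norm(pts[0] - pts[2]), norm(pts[0] - pts[1])
   return sideA * sideB * sideC / (4 * area)

def cechTwoSkeleton(points, epsilon):
   # edges whenever two (epsilon/2)-balls overlap, triangles whenever three do
   pts = [np.asarray(p, dtype=float) for p in points]
   edges = [(i, j) for (i, j) in combinations(range(len(pts)), 2)
            if norm(pts[i] - pts[j]) < epsilon]
   triangles = [(i, j, k) for (i, j, k) in combinations(range(len(pts)), 3)
                if minEnclosingRadius3(pts[i], pts[j], pts[k]) < epsilon / 2]
   return edges, triangles

Already the triangle scan does a cubic amount of work, which foreshadows the complexity complaints below.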

A topologist will have a minor protest here: the simplicial complex is supposed to resemble the structure inherent in the underlying points, but how do we know that this abstract simplicial complex (which is really hard to visualize!) resembles the topological space we used to make it? That is, X was sitting in some metric space, and the union of these epsilon-balls forms some topological space X(\varepsilon) that is close in structure to X. But is the Čech complex C_\varepsilon close to X(\varepsilon)? Do they have the same topological structure? It’s not a trivial theorem to prove, but it turns out to be true.

The Nerve Theorem: The homotopy types of X(\varepsilon) and C_\varepsilon are the same.

We won’t remind the readers about homotopy theory, but suffice it to say that when two topological spaces have the same homotopy type, then homology can’t distinguish them. In other words, if homotopy type is too coarse a discriminator for our dataset, then persistent homology will fail us for sure.

So this theorem is a good sanity check. If we want to learn about our point cloud, we can pick an \varepsilon and study the topology of the corresponding Čech complex C_\varepsilon. The reason this is called the “Nerve Theorem” is because one can generalize it to an arbitrary family of convex sets. Given some family F of convex sets, the nerve is the complex obtained by adding simplices for mutually overlapping subfamilies in the same way. The nerve theorem is actually more general: it says that with sufficient conditions on the family F being “nice,” the resulting nerve has the same topological structure as the union of the sets in F.

The problem is that Čech complexes are tough to compute. To tell whether there are any 10-simplices (without additional knowledge) you have to inspect all subsets of 11 points, since a 10-simplex has 11 vertices. In general, computing the entire complex requires time exponential in the size of X, which is extremely inefficient. So we need a different kind of complex, or at least a different representation to compensate.

The Vietoris-Rips complex

The Vietoris-Rips complex is essentially the same as the Čech complex, except instead of adding a d-simplex when there is a common point of intersection of all the (\varepsilon/2)-balls, we just do so when all the balls have pairwise intersections. We’ll denote the Vietoris-Rips complex with parameter \varepsilon as VR_{\varepsilon}.

Here is an example to illustrate: if you give me three points that are the vertices of an equilateral triangle of side length 1, and I draw a ball of radius slightly more than 1/2 around each point (anything less than 1/\sqrt{3} works), then the balls will have all three pairwise intersections but no common point of intersection.


Three balls which intersect pairwise, but have no point of triple intersection. With appropriate parameters, the Cech and V-R complexes are different.

So in this example the Vietoris-Rips complex contains a 2-simplex (a filled-in triangle), while the Čech complex is just a graph (a hollow triangle).
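
You can check the triangle example numerically with the minEnclosingRadius3 sketch from earlier (again, a hypothetical helper of mine, not a library function):

from math import sqrt

p1, p2, p3 = (0, 0), (1, 0), (0.5, sqrt(3) / 2)   # equilateral triangle with side length 1
print(minEnclosingRadius3(p1, p2, p3))            # prints roughly 0.5774, i.e. 1/sqrt(3)
# any ball radius strictly between 1/2 and 1/sqrt(3) gives all three pairwise
# overlaps (the side length is 1) but no point common to all three balls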

One obvious question is: do we still get the benefits of the nerve theorem with Vietoris-Rips complexes? The answer is no, obviously, because the Vietoris-Rips complex and Čech complex in this triangle example have totally different topology! But everything’s not lost. What we can do instead is compare Vietoris-Rips and Čech complexes with related parameters.

Theorem: For all \varepsilon > 0, the following inclusions hold

\displaystyle C_{\varepsilon} \subset VR_{\varepsilon} \subset C_{2\varepsilon}

Both inclusions are easy to check from the definitions: a common point of intersection certainly gives pairwise intersections, and if all pairwise distances among the points of a simplex are less than \varepsilon, then any one of those points lies in every ball of radius \varepsilon around them (and C_{2\varepsilon} is built from balls of radius \varepsilon). So if the Čech complexes for both \varepsilon and 2\varepsilon are good approximations of the underlying data, then so is the Vietoris-Rips complex sandwiched between them. In fact, you can make this chain of inclusions slightly tighter, and if you’re interested you can see Theorem 2.5 in this recent paper of de Silva and Ghrist.

Now your first objection should be that computing a Vietoris-Rips complex still requires exponential time, because you have to scan all subsets for the possibility that they form a simplex. It’s true, but one nice thing about the Vietoris-Rips complex is that it can be represented implicitly as a graph. You just include an edge between two points if their corresponding balls overlap. Once we want to compute the actual simplices in the complex we have to scan for cliques in the graph, so that sucks. But it turns out that computing the graph is the first step in other more efficient methods for computing (or approximating) the VR complex.
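
To make the “scan for cliques” step concrete, here’s a minimal sketch (assuming the graph is given as a list of index pairs; the function name is mine) that lists the 2-simplices of the Vietoris-Rips complex, i.e. the triangles of the graph:

def vrTwoSimplices(edges):
   # the VR complex is the clique complex of its 1-skeleton: three points span a
   # 2-simplex exactly when all three edges between them are present in the graph
   neighbors = {}
   for (i, j) in edges:
      neighbors.setdefault(i, set()).add(j)
      neighbors.setdefault(j, set()).add(i)
   return [(i, j, k) for (i, j) in edges
           for k in neighbors[i] & neighbors[j] if k > i and k > j]

Higher-dimensional simplices are just bigger cliques, and listing all of them is where the exponential blowup reappears.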

Let’s go ahead and write a (trivial) program that computes the graph representation of the Vietoris-Rips complex of a given data set.

import numpy
from itertools import combinations
from numpy.linalg import norm

def naiveVR(points, epsilon):
   # an edge goes between two points exactly when their (epsilon/2)-balls
   # overlap, i.e. when the points are within epsilon of each other
   points = [numpy.array(x) for x in points]
   vrComplex = [(x, y) for (x, y) in combinations(points, 2) if norm(x - y) < epsilon]
   return vrComplex

Let’s try running it on a modestly large example: the first frame of the Radiohead music video. It’s got about 12,000 points in \mathbb{R}^4 (x,y,z,intensity), and sadly it takes about twenty minutes. There are a couple of ways to make it more efficient. One is to use specially-crafted data structures for computing threshold queries (i.e., find all points within \varepsilon of this point). But those are only useful for small thresholds, and we’re interested in sweeping over a range of thresholds. Another is to invoke approximations of the data structure which give rise to “approximate” Vietoris-Rips complexes.
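
Just to illustrate the first idea, here’s a hedged sketch using scipy’s k-d tree (assuming scipy is available). It produces essentially the same edges as naiveVR, by index rather than by coordinates, and only for one threshold at a time, which is exactly the caveat above.

import numpy as np
from scipy.spatial import cKDTree

def kdTreeVR(points, epsilon):
   # the k-d tree prunes most of the pairwise distance checks for a fixed threshold
   tree = cKDTree(np.array(points, dtype=float))
   return sorted(tree.query_pairs(epsilon))   # index pairs within distance epsilon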

Other stuff

In a future post we’ll implement a method for speeding up the computation of the Vietoris-Rips complex, since this is the primary bottleneck for topological data analysis. But for now we have the conceptual idea of how Čech complexes and Vietoris-Rips complexes can be used to turn point clouds into simplicial complexes in reasonable ways.

Before we close we should mention that there are other ways to do this. I’ve chosen the algebraic flavor of topological data analysis due to my familiarity with algebra and the work based on this approach. The other approaches have a more geometric flavor, and are based on the Delaunay triangulation, a hallmark of computational geometry algorithms. The two approaches I’ve heard of are called the alpha complex and the flow complex. The downside of these approaches is that, because they are based on the Delaunay triangulation, they have poor scaling in the dimension of the data. Because handling high-dimensional data is crucial, many researchers have been spending their time figuring out how to speed up approximations of the V-R complex. See these slides of Afra Zomorodian for an example.

Until next time!

What does it mean for an algorithm to be fair?

In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.

Warning: algorithms can facilitate illegal discrimination!

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.


On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.

There are many facets to the algorithm fairness problem, and it has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above-linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem of stupid people not listening to smart people; it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” into more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” has fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results with it. This means they spend a ton of time tinkering with their models to make them domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability are the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

Methods of Proof — Diagonalization

A while back we featured a post about why learning mathematics can be hard for programmers, and I claimed a major issue was not understanding the basic methods of proof (the lingua franca between intuition and rigorous mathematics). I boiled these down to the “basic four,” direct implication, contrapositive, contradiction, and induction. But in mathematics there is an ever growing supply of proof methods. There are books written about the “probabilistic method,” and I recently went to a lecture where the “linear algebra method” was displayed. There has been recent talk of a “quantum method” for proving theorems unrelated to quantum mechanics, and many more.

So in continuing our series of methods of proof, we’ll move up to some of the more advanced methods of proof. And in keeping with the spirit of the series, we’ll spend most of our time discussing the structural form of the proofs. This time, diagonalization.

Diagonalization

Perhaps one of the most famous methods of proof after the basic four is proof by diagonalization. Why do they call it diagonalization? Because the idea behind diagonalization is to write out a table that describes how a collection of objects behaves, and then to manipulate the “diagonal” of that table to get a new object that you can prove isn’t in the table.

The simplest and most famous example of this is the proof that there is no bijection between the natural numbers and the real numbers. We defined injections, surjections, and bijections in two earlier posts in this series, but for new readers a bijection is just a one-to-one correspondence between two collections of things. For example, one can construct a bijection between all positive integers and all even positive integers by mapping n to 2n. If there is a bijection between two (perhaps infinite) sets, then we say they have the same size or cardinality. And so to say there is no bijection between the natural numbers and the real numbers is to say that one of these two sets (the real numbers) is somehow “larger” than the other, despite both being infinite in size. It’s deep, it used to be very controversial, and it made the method of diagonalization famous. Let’s see how it works.

Theorem: There is no bijection from the natural numbers \mathbb{N} to the real numbers \mathbb{R}.

Proof. Suppose to the contrary (i.e., we’re about to do proof by contradiction) that there is a bijection f: \mathbb{N} \to \mathbb{R}. That is, you give me a positive integer k and I will spit out f(k), with the property that different k give different f(k), and every real number is hit by some natural number k (this is just what it means to be a bijection).

First let me just do some setup. I claim that all we need to do is show that there is no bijection between \mathbb{N} and the real numbers between 0 and 1. In particular, I claim there is a bijection from (0,1) to all real numbers, so if there were a bijection from \mathbb{N} \to (0,1) then we could compose the two and get a bijection from \mathbb{N} to \mathbb{R}. To build a bijection from (0,1) \to \mathbb{R}, note that x \mapsto 1/x maps (0,1) bijectively onto (1, \infty), and x \mapsto 1 - 1/x maps (0,1) bijectively onto (-\infty, 0). With a little bit of extra work (read, slightly messy details) you can stitch these together. Here’s a sketch: split (0,1) into the three pieces (0, 1/3), [1/3, 2/3], and (2/3, 1); send the first piece to (1, \infty) by x \mapsto 1/(3x), send the middle piece to [0,1] by x \mapsto 3x - 1, and send the last piece to (-\infty, 0) by x \mapsto 1 - 1/(3(x - 2/3)). The three images are disjoint and their union is all of \mathbb{R}, so together these define a bijection (0,1) \to \mathbb{R}.

Okay, setup is done. We just have to show there is no bijection between (0,1) and the natural numbers.

The reason I did all that setup is so that I can use the fact that every real number in (0,1) has an infinite binary expansion whose only nonzero digits are after the point. And so I’ll write down the expansion of f(1) as a row in a table (an infinite row), and below it I’ll write down the expansion of f(2), below that f(3), and so on, with the points lined up. The table looks like this.

f(1) = 0 . d  d  d  d  ...
f(2) = 0 . d  d  d  d  ...
f(3) = 0 . d  d  d  d  ...
           ...

The d’s above are either 0 or 1. I need to be a bit more detailed in my table, so I’ll index the digits of f(1) by b_{1,1}, b_{1,2}, b_{1,3}, \dots, the digits of f(2) by b_{2,1}, b_{2,2}, b_{2,3}, \dots, and so on. This makes the table look like this

f(1) = 0 . b_{1,1}  b_{1,2}  b_{1,3}  b_{1,4}  ...
f(2) = 0 . b_{2,1}  b_{2,2}  b_{2,3}  b_{2,4}  ...
f(3) = 0 . b_{3,1}  b_{3,2}  b_{3,3}  b_{3,4}  ...
                ...

It’s a bit harder to read, but trust me the notation is helpful.

Now by the assumption that f is a bijection, I’m assuming that every real number shows up as a number in this table, and no real number shows up twice. So if I could construct a number that I can prove is not in the table, I will arrive at a contradiction: the table couldn’t have had all real numbers to begin with! And that will prove there is no bijection between the natural numbers and the real numbers.

Here’s how I’ll come up with such a number N (this is the diagonalization part). It starts with 0., and its first digit after the point is 1-b_{1,1}. That is, we flip the bit b_{1,1} to get the first digit of N. The second digit is 1-b_{2,2}, the third is 1-b_{3,3}, and so on. In general, digit i is 1-b_{i,i}.

Now we show that N isn’t in the table. If it were, then it would have to be N = f(m) for some m, i.e. be the m-th row in the table. Moreover, by the way we built the table, the m-th digit of N would be b_{m,m}. But we defined N so that its m-th digit is actually 1-b_{m,m}. This is very embarrassing for N (it’s a contradiction!). So N isn’t in the table. (A small technicality: some reals have two binary expansions, like 0.0111\dots = 0.1000\dots; one standard fix is to run the same argument with decimal expansions and never use the digits 0 or 9 when building N.)

\square

It’s the kind of proof that blows your mind the first time you see it, because it says that there is more than one kind of infinity. Not something you think about every day, right?
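
If it helps to see the diagonal flip in code, here’s a toy sketch: it takes a finite top-left corner of such a table and produces a prefix of digits that disagrees with row m in position m. (The real argument needs the whole infinite table, of course; this is purely illustrative.)

def diagonalFlip(table):
   # table[m][i] is the i-th binary digit of f(m+1); the output disagrees
   # with the m-th row in its m-th digit, for every m
   return [1 - table[m][m] for m in range(len(table))]

print(diagonalFlip([[0, 1, 1],
                    [1, 1, 0],
                    [0, 0, 1]]))   # prints [1, 0, 0]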

The Halting Problem

The second example we’ll show of a proof by diagonalization is the Halting Theorem, proved originally by Alan Turing, which says that there are some problems that computers can’t solve, even if given unbounded space and time to perform their computations. The formal mathematical model is called a Turing machine, but for simplicity you can think of “Turing machines” and “algorithms described in words” as the same thing. Or if you want it can be “programs written in programming language X.” So we’ll use the three words “Turing machine,” “algorithm,” and “program” interchangeably.

The proof works by actually defining a problem and proving it can’t be solved. The problem is called the halting problem, and it is the problem of deciding: given a program P and an input x to that program, will P ever stop running when given x as input? What I mean by “decide” is that any program that claims to solve the halting problem is itself required to halt for every possible input with the correct answer. A “halting problem solver” can’t loop infinitely!

So first we’ll give the standard proof that the halting problem can’t be solved, and then we’ll inspect the form of the proof more closely to see why it’s considered a diagonalization argument.

Theorem: The halting problem cannot be solved by Turing machines.

Proof. Suppose to the contrary that T is a program that solves the halting problem. We’ll use T as a black box to come up with a new program I’ll call meta-T, defined in pseudo-python as follows.

def metaT(P):
   run T on (P,P)
   if T says that P halts:
      loop infinitely
   else:
      halt and output "success!"

In words, meta-T accepts as input the source code of a program P, and then uses T to tell if P halts (when given its own source code as input). Based on the result, it behaves the opposite of P; if P halts then meta-T loops infinitely and vice versa. It’s a little meta, right?

Now let’s do something crazy: let’s run meta-T on itself! That is, run

metaT(metaT)

So meta. The question is what is the output of this call? The meta-T program uses T to determine whether meta-T halts when given itself as input. So let’s say that the answer to this question is “yes, it does halt.” Then by the definition of meta-T, the program proceeds to loop forever. But this is a problem, because it means that metaT(metaT) (which is the original thing we ran) actually does not halt, contradicting T‘s answer! Likewise, if T says that metaT(metaT) should loop infinitely, that will cause meta-T to halt, a contradiction. So T cannot be correct, and the halting problem can’t be solved.

\square

This theorem is deep because it says that you can’t possibly write a program which can always detect bugs in other programs. Infinite loops are just one special kind of bug.

But let’s take a closer look and see why this is a proof by diagonalization. The first thing we need to convince ourselves is that the set of all programs is countable (that is, there is a bijection from \mathbb{N} to the set of all programs). This shouldn’t be so hard to see: you can list all programs in lexicographic order, since the set of all strings is countable, and then throw out any that are not syntactically valid programs. Likewise, the set of all inputs, really just all strings, is countable.

The second thing we need to convince ourselves of is that a problem corresponds to an infinite binary string. To do this, we’ll restrict our attention to problems with yes/no answers, that is where the goal of the program is to output a single bit corresponding to yes or no for a given input. Then if we list all possible inputs in increasing lexicographic order, a problem can be represented by the infinite list of bits that are the correct outputs to each input.

For example, if the problem is to determine whether a given binary input string corresponds to an even number, the representation might look like this:

010101010101010101...

Of course this all depends on the details of how one encodes inputs, but the point is that if you wanted to you could nail all this down precisely. More importantly for us we can represent the halting problem as an infinite table of bits. If the columns of the table are all programs (in lex order), and the rows of the table correspond to inputs (in lex order), then the table would have at entry (x,P) a 1 if P(x) halts and a 0 otherwise.


          P_1        P_2        P_3     ...
x_1    b_{1,1}    b_{1,2}    b_{1,3}    ...
x_2    b_{2,1}    b_{2,2}    b_{2,3}    ...
x_3    b_{3,1}    b_{3,2}    b_{3,3}    ...
  ...

Here b_{i,j} is 1 if P_j(x_i) halts and 0 otherwise. The table encodes the answers to the halting problem for all possible inputs.

Now we assume for the sake of contradiction that some program solves the halting problem, i.e. that every entry of the table is computable. Now we’ll construct the answers output by meta-T by flipping each bit of the diagonal of the table. The point is that meta-T, being a program, corresponds to some column of the table, and its source code also shows up as some row, because every program’s source code is an input string. Then we argue that the entry of the table for (\textup{meta-}T, \textup{meta-}T) contradicts meta-T’s definition, and we’re done!

So these are two of the most high-profile uses of the method of diagonalization. It’s a great tool for your proving repertoire.

Until next time!

The Many Faces of Set Cover

A while back Peter Norvig posted a wonderful pair of articles about regex golf. The idea behind regex golf is to come up with the shortest possible regular expression that matches one given list of strings, but not the other.

“Regex Golf,” by Randall Munroe.

In the first article, Norvig runs a basic algorithm to recreate and improve the results from the comic, and in the second he beefs it up with some improved search heuristics. My favorite part about this topic is that regex golf can be phrased in terms of a problem called set cover. I noticed this when reading the comic, and was delighted to see Norvig use that as the basis of his algorithm.

The set cover problem shows up in other places, too. If you have a database of items labeled by users, and you want to find the smallest set of labels to display that covers every item in the database, you’re doing set cover. I hear there are applications in biochemistry and biology but haven’t seen them myself.

If you know what a set is (just think of the “set” or “hash set” type from your favorite programming language), then set cover has a simple definition.

Definition (The Set Cover Problem): You are given a finite set U called a “universe” and sets S_1, \dots, S_n each of which is a subset of U. You choose some of the S_i to ensure that every x \in U is in one of your chosen sets, and you want to minimize the number of S_i you picked.

It’s called a “cover” because the sets you pick “cover” every element of U. Let’s do a simple example. Let U = \{ 1,2,3,4,5 \} and

\displaystyle S_1 = \{ 1,3,4 \}, S_2 = \{ 2,3,5 \}, S_3 = \{ 1,4,5 \}, S_4 = \{ 2,4 \}

Then the smallest possible number of sets you can pick is 2, and you can achieve this by picking both S_1, S_2 or both S_2, S_3. The connection to regex golf is that you pick U to be the set of strings you want to match, and you pick a set of regexes that match some of the strings in U but none of the strings you want to avoid matching (I’ll call them V). If w is such a regex, then you can form the set S_w of strings that w matches. Then if you find a small set cover with the strings w_1, \dots, w_t, then you can “or” them together to get a single regex w_1 \mid w_2 \mid \dots \mid w_t that matches all of U but none of V.
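
Here’s a minimal sketch of that translation using Python’s re module (the function name is mine, and I’m assuming you already have a pool of candidate regexes, which is where Norvig’s posts spend their effort):

import re

def regexGolfInstance(matchStrings, avoidStrings, candidateRegexes):
   # build one set-cover set per candidate regex: the strings in U it matches;
   # candidates that match anything in V are thrown out entirely
   universe = set(matchStrings)
   sets = {}
   for w in candidateRegexes:
      pattern = re.compile(w)
      if any(pattern.search(s) for s in avoidStrings):
         continue
      sets[w] = {s for s in matchStrings if pattern.search(s)}
   return universe, sets

A set cover w_1, \dots, w_t of this instance then becomes the single regex w_1 \mid w_2 \mid \dots \mid w_t described above.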

Set cover is what’s called NP-hard, and one implication is that we shouldn’t hope to find an efficient algorithm that will always give you the shortest regex for every regex golf problem. But despite this, there are approximation algorithms for set cover. What I mean by this is that there is a regex-golf algorithm A that outputs a subset of the regexes matching all of U, and the number of regexes it outputs is such-and-such close to the minimum possible number. We’ll make “such-and-such” more formal later in the post.

What made me sad was that Norvig didn’t go any deeper than saying, “We can try to approximate set cover, and the greedy algorithm is pretty good.” It’s true, but the ideas are richer than that! Set cover is a simple example to showcase interesting techniques from theoretical computer science. And perhaps ironically, in Norvig’s second post a header promised the article would discuss the theory of set cover, but I didn’t see any of what I think of as theory. Instead he partially analyzes the structure of the regex golf instances he cares about. This is useful, but not really theoretical in any way unless he can say something universal about those instances.

I don’t mean to bash Norvig. His articles were great! And in-depth theory was way beyond scope. So this post is just my opportunity to fill in some theory gaps. We’ll do three things:

  1. Show formally that set cover is NP-hard.
  2. Prove the approximation guarantee of the greedy algorithm.
  3. Show another (very different) approximation algorithm based on linear programming.

Along the way I’ll argue that by knowing (or at least seeing) the details of these proofs, one can get a better sense of what features to look for in the set cover instance you’re trying to solve. We’ll also see how set cover depicts the broader themes of theoretical computer science.

NP-hardness

The first thing we should do is show that set cover is NP-hard. Intuitively what this means is that we can take some hard problem P and encode instances of P inside set cover problems. This idea is called a reduction, because solving problem P will “reduce” to solving set cover, and the method we use to encode instances of P as set cover problems will have a small amount of overhead. This is one way to say that set cover is “at least as hard as” P.

The hard problem we’ll reduce to set cover is called 3-satisfiability (3-SAT). In 3-SAT, the input is a formula whose variables are either true or false, and the formula is expressed as an AND of a bunch of clauses, each of which is an OR of three variables (or their negations). This is called 3-CNF form. A simple example:

\displaystyle (x \vee y \vee \neg z) \wedge (\neg x \vee w \vee y) \wedge (z \vee x \vee \neg w)

The goal of the algorithm is to decide whether there is an assignment to the variables which makes the formula true. 3-SAT is one of the most fundamental problems we believe to be hard and, roughly speaking, by reducing it to set cover we include set cover in a class called NP-complete, and if any one of these problems can be solved efficiently, then they all can (this is the famous P versus NP problem, and an efficient algorithm would imply P equals NP).

So a reduction would consist of the following: you give me a formula \varphi in 3-CNF form, and I have to produce (in a way that depends on \varphi!) a universe U and a choice of subsets S_i \subset U in such a way that

\varphi has a true assignment of variables if and only if the corresponding set cover problem has a cover using k sets.

In other words, I’m going to design a function f from 3-SAT instances to set cover instances, such that x is satisfiable if and only if f(x) has a set cover with k sets.

Why do I say it only for k sets? Well, if you can always answer this question then I claim you can find the minimum size of a set cover needed by doing a binary search for the smallest value of k, as sketched below. So finding the minimum size of a set cover reduces to the problem of telling if there’s a set cover of size k.
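
Here’s the binary search spelled out, with hasCoverOfSize standing in for a hypothetical yes/no procedure, and assuming some cover exists at all:

def minCoverSize(universe, sets, hasCoverOfSize):
   # hasCoverOfSize(universe, sets, k) answers "is there a cover using at most k sets?"
   # which is monotone in k, so binary search finds the smallest yes
   lo, hi = 1, len(sets)
   while lo < hi:
      mid = (lo + hi) // 2
      if hasCoverOfSize(universe, sets, mid):
         hi = mid
      else:
         lo = mid + 1
   return lo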

Now let’s do the reduction from 3-SAT to set cover.

If you give me \varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m where each C_i is a clause and the variables are denoted x_1, \dots, x_n, then I will choose my universe U to be the set of all the clauses and all the indices of the variables (these are all just formal symbols), i.e.

\displaystyle U = \{ C_1, C_2, \dots, C_m, 1, 2, \dots, n \}

The first part of U will ensure I make all the clauses true, and the last part will ensure I don’t pick a variable to be both true and false at the same time.

To show how this works I have to pick my subsets. For each variable x_i, I’ll make two sets, one called S_{x_i} and one called S_{\neg x_i}. They will both contain i in addition to the clauses which they make true when the corresponding literal is true (by literal I just mean the variable or its negation). For example, if C_j uses the literal \neg x_7, then S_{\neg x_7} will contain C_j but S_{x_7} will not. Finally, I’ll set k = n, the number of variables.
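
Here’s a hedged sketch of the construction in code, with a clause encoded as a list of nonzero integers (my own encoding: the literal j means x_j and -j means \neg x_j):

def threeSatToSetCover(numVariables, clauses):
   # universe: one element per clause plus one element per variable index
   universe = ({('clause', c) for c in range(len(clauses))}
               | {('index', i) for i in range(1, numVariables + 1)})
   sets = {}
   for i in range(1, numVariables + 1):
      for literal in (i, -i):
         sets[literal] = ({('index', i)}
                          | {('clause', c) for c, clause in enumerate(clauses)
                             if literal in clause})
   return universe, sets, numVariables   # k = n, the number of variables

# the example formula from above, with x, y, z, w as variables 1, 2, 3, 4
universe, sets, k = threeSatToSetCover(4, [[1, 2, -3], [-1, 4, 2], [3, 1, -4]])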

Now to prove this reduction works I have to prove two things: first, if my starting formula has a satisfying assignment, I have to show the set cover problem has a cover of size k. Indeed, take the sets S_{y} for all literals y that are set to true in a satisfying assignment. There are exactly n true literals, since for each variable exactly one of x_i, \neg x_i is true, so there are exactly n sets. These sets cover all of U: every clause is satisfied by some true literal (or else the formula isn’t true), so the clause elements are covered, and for each index i one of S_{x_i}, S_{\neg x_i} is among the chosen sets, so the index elements are covered too.

The reverse direction is similar: if I have a set cover of size n, I need to use it to come up with a satisfying truth assignment for the original formula. But indeed, the chosen sets can’t include both an S_{x_i} and its negation set S_{\neg x_i}, because there are n of the elements \{1, 2, \dots, n \} \subset U, and each i is only in the two sets S_{x_i}, S_{\neg x_i}. So just to cover all n indices I already need n different sets, one per index, and the cover must consist of exactly one of S_{x_i}, S_{\neg x_i} for each i. And finally, since the cover also hits all the clauses, setting to true the literals corresponding to the chosen sets gives exactly a satisfying assignment.

Whew! So set cover is NP-hard because I encoded this logic problem 3-SAT within its rules. If we think 3-SAT is hard (and we do) then set cover must also be hard. So if we can’t hope to solve it exactly we should try to approximate the best solution.

The greedy approach

The method that Norvig uses in attacking the meta-regex golf problem is the greedy algorithm. The greedy algorithm is exactly what you’d expect: you maintain a list L of the subsets you’ve picked so far, and at each step you pick the set S_i that maximizes the number of new elements of U that aren’t already covered by the sets in L. In python pseudocode:

def greedySetCover(universe, sets):
   chosenSets = []                  # a list, since Python sets aren't hashable
   leftToCover = set(universe)
   unchosenSets = list(sets)

   covered = lambda s: leftToCover & s

   while leftToCover:               # loop until every element is covered
      if len(chosenSets) == len(sets):
         raise Exception("No set cover possible")

      nextSet = max(unchosenSets, key=lambda s: len(covered(s)))
      unchosenSets.remove(nextSet)
      chosenSets.append(nextSet)
      leftToCover -= nextSet

   return chosenSets
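
For instance, on the toy example from earlier (ties in the max go to whichever set comes first in the input list):

greedySetCover({1, 2, 3, 4, 5}, [{1, 3, 4}, {2, 3, 5}, {1, 4, 5}, {2, 4}])
# returns [{1, 3, 4}, {2, 3, 5}], a cover of the minimum possible size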

This is what theory has to say about the greedy algorithm:

Theorem: If it is possible to cover U by the sets in F = \{ S_1, \dots, S_n \}, then the greedy algorithm always produces a cover that at worst has size O(\log(|U|)) \textup{OPT}, where \textup{OPT} is the size of the smallest cover. Moreover, assuming P is not NP, this is asymptotically the best any polynomial-time algorithm can do.

One simple fact we need from calculus is that the following sum is asymptotically the same as \log(n):

\displaystyle H(n) = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} = \log(n) + O(1)

Proof. [adapted from Wan] Let’s say the greedy algorithm picks sets T_1, T_2, \dots, T_k in that order. We’ll set up a little value system for the elements of U. Specifically, the value of each T_i is 1, and in step i we evenly distribute this unit value across all newly covered elements of T_i. So for T_1 each covered element gets value 1/|T_1|, and if T_2 covers four new elements, each gets a value of 1/4. One can think of this “value” as a price, or energy, or unit mass, or whatever. It’s just an accounting system (albeit a clever one) we use to make some inequalities clear later.

In general call the value v_x of element x \in U the value assigned to x at the step where it’s first covered. In particular, the number of sets chosen by the greedy algorithm k is just \sum_{x \in U} v_x. We’re just bunching back together the unit value we distributed for each step of the algorithm.

Now we want to compare the sets chosen by greedy to the optimal choice. Call a smallest set cover C_{\textup{OPT}}. Let’s stare at the following inequality.

\displaystyle \sum_{x \in U} v_x \leq \sum_{S \in C_{\textup{OPT}}} \sum_{x \in S} v_x

It’s true because each x counts for a v_x at most once in the left hand side, and in the right hand side the sets in C_{\textup{OPT}} must hit each x at least once but may hit some x more than once. Also remember the left hand side is equal to k.

Now we want to show that the inner sum on the right hand side, \sum_{x \in S} v_x, is at most H(|S|). This will in fact prove the entire theorem: because each set S has size at most |U|, the inequality above will turn into

\displaystyle k \leq \sum_{S \in C_{\textup{OPT}}} H(|S|) \leq |C_{\textup{OPT}}| H(|U|)

And so k \leq \textup{OPT} \cdot O(\log(|U|)), which is the statement of the theorem.

So we want to show that \sum_{x \in S} v_x \leq H(|S|). For each j define \delta_j(S) to be the number of elements of S not covered by T_1 \cup \dots \cup T_j. Notice that \delta_{j-1}(S) - \delta_{j}(S) is the number of elements of S that are covered for the first time in step j. If we call t_S the smallest integer j for which \delta_j(S) = 0, then counting up the differences through step t_S, we get

\displaystyle \sum_{x \in S} v_x = \sum_{i=1}^{t_S} (\delta_{i-1}(S) - \delta_i(S)) \cdot \frac{1}{|T_i - (T_1 \cup \dots \cup T_{i-1})|}

The rightmost term is just the value assigned to each of the relevant elements at step i. Moreover, because T_i covers at least as many new elements as S would (by definition of the greedy algorithm), the fraction above is at most 1/\delta_{i-1}(S). The end is near. For brevity I’ll drop the (S) from \delta_j(S).

\displaystyle \begin{aligned} \sum_{x \in S} v_x & \leq \sum_{i=1}^{t_S} (\delta_{i-1} - \delta_i) \frac{1}{\delta_{i-1}} \\ & \leq \sum_{i=1}^{t_S} \left( \frac{1}{1 + \delta_i} + \frac{1}{2+\delta_i} + \dots + \frac{1}{\delta_{i-1}} \right) \\ & = \sum_{i=1}^{t_S} \left( H(\delta_{i-1}) - H(\delta_i) \right) \\ &= H(\delta_0) - H(\delta_{t_S}) = H(|S|) \end{aligned}

And that proves the claim.

\square

I have three postscripts to this proof:

  1. This is basically the exact worst-case approximation that the greedy algorithm achieves. In fact, Petr Slavik proved in 1996 that the greedy algorithm gives you a cover of size exactly (\log |U| - \log \log |U| + O(1)) \textup{OPT} in the worst case.
  2. This is also the best approximation that any set cover algorithm can achieve, provided that P is not NP. This result was basically known in 1994, but it wasn’t until 2013 and the use of some very sophisticated tools that the best possible bound was found with the smallest assumptions.
  3. In the proof we used that |S| \leq |U| to bound things, but if we knew that our sets S_i (i.e. subsets matched by a regex) had sizes bounded by, say, B, the same proof would show that the approximation factor is \log(B) instead of \log(|U|). However, in order for that to be useful you need B to be a constant, or at least to grow more slowly than any polynomial in |U|, since e.g. \log(n^{0.1}) = 0.1 \log n. In fact, taking a second look at Norvig’s meta regex golf problem, some of his instances had this property! Which means the greedy algorithm gives a much better approximation ratio for certain meta regex golf problems than it does for the worst case general problem. This is one instance where knowing the proof of a theorem helps us understand how to specialize it to our interests.

Norvig’s frequency table for president meta-regex golf. The left side counts the size of each set (defined by a regex)

The linear programming approach

So we just said that you can’t possibly do better than the greedy algorithm for approximating set cover. There must be nothing left to say, job well done, right? Wrong! Our second analysis, based on linear programming, shows that instances with special features can have better approximation results.

In particular, if we’re guaranteed that each element x \in U occurs in at most B of the sets S_i, then the linear programming approach will give a B-approximation, i.e. a cover whose size is at worst larger than OPT by a multiplicative factor of B. In the case that B is constant, we can beat our earlier greedy algorithm.

The technique is now a classic one in optimization, called LP-relaxation (LP stands for linear programming). The idea is simple. Most optimization problems can be written as integer linear programs, that is, you have n variables x_1, \dots, x_n \in \{ 0, 1 \} and you want to maximize (or minimize) a linear function of the x_i subject to some linear constraints. The thing you’re trying to optimize is called the objective. While in general solving integer linear programs is NP-hard, we can relax the “integer” requirement to 0 \leq x_i \leq 1, or something similar. The resulting linear program, called the relaxed program, can be solved efficiently using the simplex algorithm or another more complicated method.

The output of solving the relaxed program is an assignment of real numbers for the x_i that optimizes the objective function. A key fact is that the solution to the relaxed linear program will be at least as good as the solution to the original integer program, because the optimal solution to the integer program is a valid candidate for the optimal solution to the linear program. Then the idea is that if we use some clever scheme to round the x_i to integers, we can measure how much this degrades the objective and prove that it doesn’t degrade too much when compared to the optimum of the relaxed program, which means it doesn’t degrade too much when compared to the optimum of the integer program as well.

If this sounds wishy washy and vague don’t worry, we’re about to make it super concrete for set cover.

We’ll make a binary variable x_i for each set S_i in the input, and x_i = 1 if and only if we include it in our proposed cover. Then the objective function we want to minimize is \sum_{i=1}^n x_i. If we call our elements X = \{ e_1, \dots, e_m \}, then we need to write down a linear constraint that says each element e_j is hit by at least one set in the proposed cover. These constraints have to depend on the sets S_i, but that’s not a problem. One good constraint for element e_j is

\displaystyle \sum_{i : e_j \in S_i} x_i \geq 1

In words, the only way that an e_j will not be covered is if all the sets containing it have their x_i = 0. And we need one of these constraints for each j. Putting it together, the integer linear program is

\displaystyle \begin{aligned} \textup{minimize } & \sum_{i=1}^n x_i \\ \textup{subject to } & \sum_{i : e_j \in S_i} x_i \geq 1 \textup{ for each } j = 1, \dots, m \\ & x_i \in \{ 0, 1 \} \textup{ for each } i = 1, \dots, n \end{aligned}

The integer program for set cover.

Once we understand this formulation of set cover, the relaxation is trivial. We just replace the integrality constraint x_i \in \{ 0, 1 \} with the inequalities 0 \leq x_i \leq 1.

\displaystyle \begin{aligned} \textup{minimize } & \sum_{i=1}^n x_i \\ \textup{subject to } & \sum_{i : e_j \in S_i} x_i \geq 1 \textup{ for each } j = 1, \dots, m \\ & 0 \leq x_i \leq 1 \textup{ for each } i = 1, \dots, n \end{aligned}

The relaxed linear program for set cover.

For a given candidate assignment x to the x_i, call Z(x) the objective value (in this case \sum_i x_i). Now we can be more concrete about the guarantees of this relaxation method. Let \textup{OPT}_{\textup{IP}} be the optimal value of the integer program and x_{\textup{IP}} a corresponding assignment to x_i achieving the optimum. Likewise let \textup{OPT}_{\textup{LP}}, x_{\textup{LP}} be the optimal things for the linear relaxation. We will prove:

Theorem: There is a deterministic algorithm that rounds x_{\textup{LP}} to integer values x so that the objective value Z(x) \leq B \textup{OPT}_{\textup{IP}}, where B is the maximum number of sets that any element e_j occurs in. So this gives a B-approximation of set cover.

Proof. Let B be as described in the theorem, and call y = x_{\textup{LP}} to make the indexing notation easier. The rounding algorithm is to set x_i = 1 if y_i \geq 1/B and zero otherwise.

To prove the theorem we need to show two things hold about this new candidate solution x:

  1. The choice of all S_i for which x_i = 1 covers every element.
  2. The number of sets chosen (i.e. Z(x)) is at most B times more than \textup{OPT}_{\textup{LP}}.

Since \textup{OPT}_{\textup{LP}} \leq \textup{OPT}_{\textup{IP}}, if we can prove number 2 we get Z(x) \leq B \textup{OPT}_{\textup{LP}} \leq B \textup{OPT}_{\textup{IP}}, which is the theorem.

So let’s prove 1. Fix any j and we’ll show that element e_j is covered by some set in the rounded solution. Call B_j the number of times element e_j occurs in the input sets. By definition B_j \leq B, so 1/B_j \geq 1/B. Recall y was the optimal solution to the relaxed linear program, and so it must be the case that the linear constraint for each e_j is satisfied: \sum_{i : e_j \in S_i} y_i \geq 1. We know that there are B_j terms and they sum to at least 1, so not all terms can be smaller than 1/B_j (otherwise they’d sum to something less than 1). In other words, some variable y_i in the sum is at least 1/B_j \geq 1/B, and so the corresponding x_i is set to 1 in the rounded solution, giving a chosen set S_i that contains e_j. This finishes the proof of 1.

Now let’s prove 2. For each i with x_i = 1, we know the corresponding variable y_i \geq 1/B, and in particular 1 \leq B y_i. Now we can simply bound the sum.

\displaystyle \begin{aligned} Z(x) = \sum_i x_i &\leq \sum_i x_i (B y_i) \\ &\leq B \sum_{i} y_i \\ &= B \cdot \textup{OPT}_{\textup{LP}} \end{aligned}

The second inequality is true because some of the x_i are zero, but we can ignore them when we upper bound and just include all the y_i. This proves part 2 and the theorem.

\square
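
To tie the whole relax-and-round pipeline together, here’s a hedged sketch using scipy’s linprog (assuming scipy is available; the function name and the tiny numerical tolerance are mine):

from scipy.optimize import linprog

def lpRoundSetCover(universe, sets):
   # solve the relaxed LP, then keep every set whose LP variable is at least 1/B,
   # where B is the maximum number of sets any single element appears in
   universe, sets = list(universe), list(sets)
   n = len(sets)
   # linprog minimizes c @ x subject to A_ub @ x <= b_ub, so each covering
   # constraint sum_{i : e in S_i} x_i >= 1 is negated
   A_ub = [[-1.0 if e in s else 0.0 for s in sets] for e in universe]
   b_ub = [-1.0] * len(universe)
   c = [1.0] * n
   y = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs").x
   B = max(sum(1 for s in sets if e in s) for e in universe)
   return [sets[i] for i in range(n) if y[i] >= 1.0 / B - 1e-9]

On the toy example from the beginning of the post, B = 3 (the element 4 shows up in three of the four sets), so the guarantee is weaker than what greedy achieved there, but part 1 of the proof still promises a valid cover.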

I’ve got some more postscripts to this proof:

  1. The proof works equally well when the sets are weighted, i.e. your cost for picking a set is not 1 for every set but depends on some arbitrarily given constants w_i \geq 0.
  2. We gave a deterministic algorithm rounding y to x, but one can get the same result (with high probability) using a randomized algorithm. The idea is to flip a coin with bias y_i roughly \log(n) times and set x_i = 1 if and only if the coin lands heads at least once. The guarantee is no better than what we proved, but for some other problems randomness can help you get approximations where we don’t know of any deterministic algorithms to get the same guarantees. I can’t think of any off the top of my head, but I’m pretty sure they’re out there.
  3. For step 1 we showed that at least one term in the inequality for e_j would be rounded up to 1, and this guaranteed we covered all the elements. A natural question is: why not also round up at most one term of each of these inequalities? It might be that in the worst case you don’t get a better guarantee, but it would be a quick extra heuristic you could use to post-process a rounded solution.
  4. Solving linear programs is slow. There are faster methods based on so-called “primal-dual” methods that use information about the dual of the linear program to construct a solution to the problem. Goemans and Williamson have a nice self-contained chapter on their website about this with a ton of applications.

Additional Reading

Williamson and Shmoys have a large textbook called The Design of Approximation Algorithms. One problem is that this field is like a big heap of unrelated techniques, so it’s not like the book will build up some neat theoretical foundation that works for every problem. Rather, it’s messy and there are lots of details, but there are definitely diamonds in the rough, such as the problem of (and algorithms for) coloring 3-colorable graphs with “approximately 3” colors, and the infamous unique games conjecture.

I wrote a post a while back giving conditions under which the greedy algorithm gives a constant-factor approximation. This is much better than the worst-case \log(|U|)-approximation we saw in this post. Moreover, I also wrote a post about matroids, which give a characterization of problems where the greedy algorithm is actually optimal.

Set cover is one of the main tools that IBM’s AntiVirus software uses to detect viruses. Similarly to the regex golf problem, they find a set of strings that occur in the source code of some viruses but not (usually) in good programs. Then they look for a small set of strings that covers all the viruses, and their virus scan just has to search binaries for those strings. Hopefully the size of your set cover is really small compared to the number of viruses you want to protect against. I can’t find a reference that details this, but that is understandable because it is proprietary software.

Until next time!