# The Many Faces of Set Cover

A while back Peter Norvig posted a wonderful pair of articles about regex golf. The idea behind regex golf is to come up with the shortest possible regular expression that matches one given list of strings, but not the other.

“Regex Golf,” by Randall Munroe.

In the first article, Norvig runs a basic algorithm to recreate and improve the results from the comic, and in the second he beefs it up with some improved search heuristics. My favorite part about this topic is that regex golf can be phrased in terms of a problem called set cover. I noticed this when reading the comic, and was delighted to see Norvig use that as the basis of his algorithm.

The set cover problem shows up in other places, too. If you have a database of items labeled by users, and you want to find the smallest set of labels to display that covers every item in the database, you’re doing set cover. I hear there are applications in biochemistry and biology but haven’t seen them myself.

If you know what a set is (just think of the “set” or “hash set” type from your favorite programming language), then set cover has a simple definition.

Definition (The Set Cover Problem): You are given a finite set $U$ called a “universe” and sets $S_1, \dots, S_n$ each of which is a subset of $U$. You choose some of the $S_i$ to ensure that every $x \in U$ is in one of your chosen sets, and you want to minimize the number of $S_i$ you picked.

It’s called a “cover” because the sets you pick “cover” every element of $U$. Let’s do a simple. Let $U = \{ 1,2,3,4,5 \}$ and

$\displaystyle S_1 = \{ 1,3,4 \}, S_2 = \{ 2,3,5 \}, S_3 = \{ 1,4,5 \}, S_4 = \{ 2,4 \}$

Then the smallest possible number of sets you can pick is 2, and you can achieve this by picking both $S_1, S_2$ or both $S_2, S_3$. The connection to regex golf is that you pick $U$ to be the set of strings you want to match, and you pick a set of regexes that match some of the strings in $U$ but none of the strings you want to avoid matching (I’ll call them $V$). If $w$ is such a regex, then you can form the set $S_w$ of strings that $w$ matches. Then if you find a small set cover with the strings $w_1, \dots, w_t$, then you can “or” them together to get a single regex $w_1 \mid w_2 \mid \dots \mid w_t$ that matches all of $U$ but none of $V$.

Set cover is what’s called NP-hard, and one implication is that we shouldn’t hope to find an efficient algorithm that will always give you the shortest regex for every regex golf problem. But despite this, there are approximation algorithms for set cover. What I mean by this is that there is a regex-golf algorithm $A$ that outputs a subset of the regexes matching all of $U$, and the number of regexes it outputs is such-and-such close to the minimum possible number. We’ll make “such-and-such” more formal later in the post.

What made me sad was that Norvig didn’t go any deeper than saying, “We can try to approximate set cover, and the greedy algorithm is pretty good.” It’s true, but the ideas are richer than that! Set cover is a simple example to showcase interesting techniques from theoretical computer science. And perhaps ironically, in Norvig’s second post a header promised the article would discuss the theory of set cover, but I didn’t see any of what I think of as theory. Instead he partially analyzes the structure of the regex golf instances he cares about. This is useful, but not really theoretical in any way unless he can say something universal about those instances.

I don’t mean to bash Norvig. His articles were great! And in-depth theory was way beyond scope. So this post is just my opportunity to fill in some theory gaps. We’ll do three things:

1. Show formally that set cover is NP-hard.
2. Prove the approximation guarantee of the greedy algorithm.
3. Show another (very different) approximation algorithm based on linear programming.

Along the way I’ll argue that by knowing (or at least seeing) the details of these proofs, one can get a better sense of what features to look for in the set cover instance you’re trying to solve. We’ll also see how set cover depicts the broader themes of theoretical computer science.

## NP-hardness

The first thing we should do is show that set cover is NP-hard. Intuitively what this means is that we can take some hard problem $P$ and encode instances of $P$ inside set cover problems. This idea is called a reduction, because solving problem $P$ will “reduce” to solving set cover, and the method we use to encode instance of $P$ as set cover problems will have a small amount of overhead. This is one way to say that set cover is “at least as hard as” $P$.

The hard problem we’ll reduce to set cover is called 3-satisfiability (3-SAT). In 3-SAT, the input is a formula whose variables are either true or false, and the formula is expressed as an OR of a bunch of clauses, each of which is an AND of three variables (or their negations). This is called 3-CNF form. A simple example:

$\displaystyle (x \vee y \vee \neg z) \wedge (\neg x \vee w \vee y) \wedge (z \vee x \vee \neg w)$

The goal of the algorithm is to decide whether there is an assignment to the variables which makes the formula true. 3-SAT is one of the most fundamental problems we believe to be hard and, roughly speaking, by reducing it to set cover we include set cover in a class called NP-complete, and if any one of these problems can be solved efficiently, then they all can (this is the famous P versus NP problem, and an efficient algorithm would imply P equals NP).

So a reduction would consist of the following: you give me a formula $\varphi$ in 3-CNF form, and I have to produce (in a way that depends on $\varphi$!) a universe $U$ and a choice of subsets $S_i \subset U$ in such a way that

$\varphi$ has a true assignment of variables if and only if the corresponding set cover problem has a cover using $k$ sets.

In other words, I’m going to design a function $f$ from 3-SAT instances to set cover instances, such that $x$ is satisfiable if and only if $f(x)$ has a set cover with $k$ sets.

Why do I say it only for $k$ sets? Well, if you can always answer this question then I claim you can find the minimum size of a set cover needed by doing a binary search for the smallest value of $k$. So finding the minimum size of a set cover reduces to the problem of telling if theres a set cover of size $k$.

Now let’s do the reduction from 3-SAT to set cover.

If you give me $\varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m$ where each $C_i$ is a clause and the variables are denoted $x_1, \dots, x_n$, then I will choose as my universe $U$ to be the set of all the clauses and indices of the variables (these are all just formal symbols). i.e.

$\displaystyle U = \{ C_1, C_2, \dots, C_m, 1, 2, \dots, n \}$

The first part of $U$ will ensure I make all the clauses true, and the last part will ensure I don’t pick a variable to be both true and false at the same time.

To show how this works I have to pick my subsets. For each variable $x_i$, I’ll make two sets, one called $S_{x_i}$ and one called $S_{\neg x_i}$. They will both contain $i$ in addition to the clauses which they make true when the corresponding literal is true (by literal I just mean the variable or its negation). For example, if $C_j$ uses the literal $\neg x_7$, then $S_{\neg x_7}$ will contain $C_j$ but $S_{x_7}$ will not. Finally, I’ll set $k = n$, the number of variables.

Now to prove this reduction works I have to prove two things: if my starting formula has a satisfying assignment I have to show the set cover problem has a cover of size $k$. Indeed, take the sets $S_{y}$ for all literals $y$ that are set to true in a satisfying assignment. There can be at most $n$ true literals since half are true and half are false, so there will be at most $n$ sets, and these sets clearly cover all of $U$ because every literal has to be satisfied by some literal or else the formula isn’t true.

The reverse direction is similar: if I have a set cover of size $n$, I need to use it to come up with a satisfying truth assignment for the original formula. But indeed, the sets that get chosen can’t include both a $S_{x_i}$ and its negation set $S_{\neg x_i}$, because there are $n$ of the elements $\{1, 2, \dots, n \} \subset U$, and each $i$ is only in the two $S_{x_i}, S_{\neg x_i}$. Just by counting if I cover all the indices $i$, I already account for $n$ sets! And finally, since I have covered all the clauses, the literals corresponding to the sets I chose give exactly a satisfying assignment.

Whew! So set cover is NP-hard because I encoded this logic problem 3-SAT within its rules. If we think 3-SAT is hard (and we do) then set cover must also be hard. So if we can’t hope to solve it exactly we should try to approximate the best solution.

## The greedy approach

The method that Norvig uses in attacking the meta-regex golf problem is the greedy algorithm. The greedy algorithm is exactly what you’d expect: you maintain a list $L$ of the subsets you’ve picked so far, and at each step you pick the set $S_i$ that maximizes the number of new elements of $U$ that aren’t already covered by the sets in $L$. In python pseudocode:

def greedySetCover(universe, sets):
chosenSets = set()
leftToCover = universe.copy()
unchosenSets = sets

covered = lambda s: leftToCover & s

while universe != 0:
if len(chosenSets) == len(sets):
raise Exception("No set cover possible")

nextSet = max(unchosenSets, key=lambda s: len(covered(s)))
unchosenSets.remove(nextSet)
leftToCover -= nextSet

return chosenSets


This is what theory has to say about the greedy algorithm:

Theorem: If it is possible to cover $U$ by the sets in $F = \{ S_1, \dots, S_n \}$, then the greedy algorithm always produces a cover that at worst has size $O(\log(n)) \textup{OPT}$, where $\textup{OPT}$ is the size of the smallest cover. Moreover, this is asymptotically the best any algorithm can do.

One simple fact we need from calculus is that the following sum is asymptotically the same as $\log(n)$:

$\displaystyle H(n) = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} = \log(n) + O(1)$

Proof. [adapted from Wan] Let’s say the greedy algorithm picks sets $T_1, T_2, \dots, T_k$ in that order. We’ll set up a little value system for the elements of $U$. Specifically, the value of each $T_i$ is 1, and in step $i$ we evenly distribute this unit value across all newly covered elements of $T_i$. So for $T_1$ each covered element gets value $1/|T_1|$, and if $T_2$ covers four new elements, each gets a value of 1/4. One can think of this “value” as a price, or energy, or unit mass, or whatever. It’s just an accounting system (albeit a clever one) we use to make some inequalities clear later.

In general call the value $v_x$ of element $x \in U$ the value assigned to $x$ at the step where it’s first covered. In particular, the number of sets chosen by the greedy algorithm $k$ is just $\sum_{x \in U} v_x$. We’re just bunching back together the unit value we distributed for each step of the algorithm.

Now we want to compare the sets chosen by greedy to the optimal choice. Call a smallest set cover $C_{\textup{OPT}}$. Let’s stare at the following inequality.

$\displaystyle \sum_{x \in U} v_x \leq \sum_{S \in C_{\textup{OPT}}} \sum_{x \in S} v_x$

It’s true because each $x$ counts for a $v_x$ at most once in the left hand side, and in the right hand side the sets in $C_{\textup{OPT}}$ must hit each $x$ at least once but may hit some $x$ more than once. Also remember the left hand side is equal to $k$.

Now we want to show that the inner sum on the right hand side, $\sum_{x \in S} v_x$, is at most $H(|S|)$. This will in fact prove the entire theorem: because each set $S_i$ has size at most $n$, the inequality above will turn into

$\displaystyle k \leq |C_{\textup{OPT}}| H(|S|) \leq |C_{\textup{OPT}}| H(n)$

And so $k \leq \textup{OPT} \cdot O(\log(n))$, which is the statement of the theorem.

So we want to show that $\sum_{x \in S} v_x \leq H(|S|)$. For each $j$ define $\delta_j(S)$ to be the number of elements in $S$ not covered in $T_1, \cup \dots \cup T_j$. Notice that $\delta_{j-1}(S) - \delta_{j}(S)$ is the number of elements of $S$ that are covered for the first time in step $j$. If we call $t_S$ the smallest integer $j$ for which $\delta_j(S) = 0$, we can count up the differences up to step $t_S$, we get

$\sum_{x \in S} v_x = \sum_{i=1}^{t_S} (\delta_{i-1}(S) - \delta_i(S)) \cdot \frac{1}{T_i - (T_1 \cup \dots \cup T_{i-1})}$

The rightmost term is just the cost assigned to the relevant elements at step $i$. Moreover, because $T_i$ covers more new elements than $S$ (by definition of the greedy algorithm), the fraction above is at most $1/\delta_{i-1}(S)$. The end is near. For brevity I’ll drop the $(S)$ from $\delta_j(S)$.

\displaystyle \begin{aligned} \sum_{x \in S} v_x & \leq \sum_{i=1}^{t_S} (\delta_{i-1} - \delta_i) \frac{1}{\delta_{i-1}} \\ & \leq \sum_{i=1}^{t_S} (\frac{1}{1 + \delta_i} + \frac{1}{2+\delta_i} \dots + \frac{1}{\delta_{i-1}}) \\ & = \sum_{i=1}^{t_S} H(\delta_{i-1}) - H(\delta_i) \\ &= H(\delta_0) - H(\delta_{t_S}) = H(|S|) \end{aligned}

And that proves the claim.

$\square$

I have three postscripts to this proof:

1. This is basically the exact worst-case approximation that the greedy algorithm achieves. In fact, Petr Slavik proved in 1996 that the greedy gives you a set of size exactly $(\log n - \log \log n + O(1)) \textup{OPT}$ in the worst case.
2. This is also the best approximation that any set cover algorithm can achieve, provided that P is not NP. This result was basically known in 1994, but it wasn’t until 2013 and the use of some very sophisticated tools that the best possible bound was found with the smallest assumptions.
3. In the proof we used that $|S| \leq n$ to bound things, but if we knew that our sets $S_i$ (i.e. subsets matched by a regex) had sizes bounded by, say, $B$, the same proof would show that the approximation factor is $\log(B)$ instead of $\log n$. However, in order for that to be useful you need $B$ to be a constant, or at least to grow more slowly than any polynomial in $n$, since e.g. $\log(n^{0.1}) = 0.1 \log n$. In fact, taking a second look at Norvig’s meta regex golf problem, some of his instances had this property! Which means the greedy algorithm gives a much better approximation ratio for certain meta regex golf problems than it does for the worst case general problem. This is one instance where knowing the proof of a theorem helps us understand how to specialize it to our interests.

Norvig’s frequency table for president meta-regex golf. The left side counts the size of each set (defined by a regex)

## The linear programming approach

So we just said that you can’t possibly do better than the greedy algorithm for approximating set cover. There must be nothing left to say, job well done, right? Wrong! Our second analysis, based on linear programming, shows that instances with special features can have better approximation results.

In particular, if we’re guaranteed that each element $x \in U$ occurs in at most $B$ of the sets $S_i$, then the linear programming approach will give a $B$-approximation, i.e. a cover whose size is at worst larger than OPT by a multiplicative factor of $B$. In the case that $B$ is constant, we can beat our earlier greedy algorithm.

The technique is now a classic one in optimization, called LP-relaxation (LP stands for linear programming). The idea is simple. Most optimization problems can be written as integer linear programs, that is there you have $n$ variables $x_1, \dots, x_n \in \{ 0, 1 \}$ and you want to maximize (or minimize) a linear function of the $x_i$ subject to some linear constraints. The thing you’re trying to optimize is called the objective. While in general solving integer linear programs is NP-hard, we can relax the “integer” requirement to $0 \leq x_i \leq 1$, or something similar. The resulting linear program, called the relaxed program, can be solved efficiently using the simplex algorithm or another more complicated method.

The output of solving the relaxed program is an assignment of real numbers for the $x_i$ that optimizes the objective function. A key fact is that the solution to the relaxed linear program will be at least as good as the solution to the original integer program, because the optimal solution to the integer program is a valid candidate for the optimal solution to the linear program. Then the idea is that if we use some clever scheme to round the $x_i$ to integers, we can measure how much this degrades the objective and prove that it doesn’t degrade too much when compared to the optimum of the relaxed program, which means it doesn’t degrade too much when compared to the optimum of the integer program as well.

If this sounds wishy washy and vague don’t worry, we’re about to make it super concrete for set cover.

We’ll make a binary variable $x_i$ for each set $S_i$ in the input, and $x_i = 1$ if and only if we include it in our proposed cover. Then the objective function we want to minimize is $\sum_{i=1}^n x_i$. If we call our elements $X = \{ e_1, \dots, e_m \}$, then we need to write down a linear constraint that says each element $e_j$ is hit by at least one set in the proposed cover. These constraints have to depend on the sets $S_i$, but that’s not a problem. One good constraint for element $e_j$ is

$\displaystyle \sum_{i : e_j \in S_i} x_i \geq 1$

In words, the only way that an $e_j$ will not be covered is if all the sets containing it have their $x_i = 0$. And we need one of these constraints for each $j$. Putting it together, the integer linear program is

The integer program for set cover.

Once we understand this formulation of set cover, the relaxation is trivial. We just replace the last constraint with inequalities.

For a given candidate assignment $x$ to the $x_i$, call $Z(x)$ the objective value (in this case $\sum_i x_i$). Now we can be more concrete about the guarantees of this relaxation method. Let $\textup{OPT}_{\textup{IP}}$ be the optimal value of the integer program and $x_{\textup{IP}}$ a corresponding assignment to $x_i$ achieving the optimum. Likewise let $\textup{OPT}_{\textup{LP}}, x_{\textup{LP}}$ be the optimal things for the linear relaxation. We will prove:

Theorem: There is a deterministic algorithm that rounds $x_{\textup{LP}}$ to integer values $x$ so that the objective value $Z(x) \leq B \textup{OPT}_{\textup{IP}}$, where $B$ is the maximum number of sets that any element $e_j$ occurs in. So this gives a $B$-approximation of set cover.

Proof. Let $B$ be as described in the theorem, and call $y = x_{\textup{LP}}$ to make the indexing notation easier. The rounding algorithm is to set $x_i = 1$ if $y_i \geq 1/B$ and zero otherwise.

To prove the theorem we need to show two things hold about this new candidate solution $x$:

1. The choice of all $S_i$ for which $x_i = 1$ covers every element.
2. The number of sets chosen (i.e. $Z(x)$) is at most $B$ times more than $\textup{OPT}_{\textup{LP}}$.

Since $\textup{OPT}_{\textup{LP}} \leq \textup{OPT}_{\textup{IP}}$, so if we can prove number 2 we get $Z(x) \leq B \textup{OPT}_{\textup{LP}} \leq B \textup{OPT}_{\textup{IP}}$, which is the theorem.

So let’s prove 1. Fix any $j$ and we’ll show that element $e_j$ is covered by some set in the rounded solution. Call $B_j$ the number of times element $e_j$ occurs in the input sets. By definition $B_j \leq B$, so $1/B_j \geq 1/B$. Recall $y$ was the optimal solution to the relaxed linear program, and so it must be the case that the linear constraint for each $e_j$ is satisfied: $\sum_{i : e_j \in S_i} x_i \geq 1$. We know that there are $B_j$ terms and they sums to at least 1, so not all terms can be smaller than $1/B_j$ (otherwise they’d sum to something less than 1). In other words, some variable $x_i$ in the sum is at least $1/B_j \geq 1/B$, and so $x_i$ is set to 1 in the rounded solution, corresponding to a set $S_i$ that contains $e_j$. This finishes the proof of 1.

Now let’s prove 2. For each $j$, we know that for each $x_i = 1$, the corresponding variable $y_i \geq 1/B$. In particular $1 \leq y_i B$. Now we can simply bound the sum.

\displaystyle \begin{aligned} Z(x) = \sum_i x_i &\leq \sum_i x_i (B y_i) \\ &\leq B \sum_{i} y_i \\ &= B \cdot \textup{OPT}_{\textup{LP}} \end{aligned}

The second inequality is true because some of the $x_i$ are zero, but we can ignore them when we upper bound and just include all the $y_i$. This proves part 2 and the theorem.

$\square$

1. The proof works equally well when the sets are weighted, i.e. your cost for picking a set is not 1 for every set but depends on some arbitrarily given constants $w_i \geq 0$.
2. We gave a deterministic algorithm rounding $y$ to $x$, but one can get the same result (with high probability) using a randomized algorithm. The idea is to flip a coin with bias $y_i$ roughly $\log(n)$ times and set $x_i = 1$ if and only if the coin lands heads at least once. The guarantee is no better than what we proved, but for some other problems randomness can help you get approximations where we don’t know of any deterministic algorithms to get the same guarantees. I can’t think of any off the top of my head, but I’m pretty sure they’re out there.
3. For step 1 we showed that at least one term in the inequality for $e_j$ would be rounded up to 1, and this guaranteed we covered all the elements. A natural question is: why not also round up at most one term of each of these inequalities? It might be that in the worst case you don’t get a better guarantee, but it would be a quick extra heuristic you could use to post-process a rounded solution.
4. Solving linear programs is slow. There are faster methods based on so-called “primal-dual” methods that use information about the dual of the linear program to construct a solution to the problem. Goemans and Williamson have a nice self-contained chapter on their website about this with a ton of applications.

Williamson and Shmoys have a large textbook called The Design of Approximation Algorithms. One problem is that this field is like a big heap of unrelated techniques, so it’s not like the book will build up some neat theoretical foundation that works for every problem. Rather, it’s messy and there are lots of details, but there are definitely diamonds in the rough, such as the problem of (and algorithms for) coloring 3-colorable graphs with “approximately 3″ colors, and the infamous unique games conjecture.

I wrote a post a while back giving conditions which, if a problem satisfies those conditions, the greedy algorithm will give a constant-factor approximation. This is much better than the worst case $\log(n)$-approximation we saw in this post. Moreover, I also wrote a post about matroids, which is a characterization of problems where the greedy algorithm is actually optimal.

Set cover is one of the main tools that IBM’s AntiVirus software uses to detect viruses. Similarly to the regex golf problem, they find a set of strings that occurs source code in some viruses but not (usually) in good programs. Then they look for a small set of strings that covers all the viruses, and their virus scan just has to search binaries for those strings. Hopefully the size of your set cover is really small compared to the number of viruses you want to protect against. I can’t find a reference that details this, but that is understandable because it is proprietary software.

Until next time!

# Finding the majority element of a stream

Problem: Given a massive data stream of $n$ values in $\{ 1, 2, \dots, m \}$ and the guarantee that one value occurs more than $n/2$ times in the stream, determine exactly which value does so.

Solution: (in Python)

def majority(stream):
held = next(stream)
counter = 1

for item in stream:
if item == held:
counter += 1
elif counter == 0:
held = item
counter = 1
else:
counter -= 1

return held


Discussion: Let’s prove correctness. Say that $s$ is the unknown value that occurs more than $n/2$ times. The idea of the algorithm is that if you could pair up elements of your stream so that distinct values are paired up, and then you “kill” these pairs, then $s$ will always survive. The way this algorithm pairs up the values is by holding onto the most recent value that has no pair (implicitly, by keeping a count how many copies of that value you saw). Then when you come across a new element, you decrement the counter and implicitly account for one new pair.

Let’s analyze the complexity of the algorithm. Clearly the algorithm only uses a single pass through the data. Next, if the stream has size $n$, then this algorithm uses $O(\log(n) + \log(m))$ space. Indeed, if the stream entirely consists of a single value (say, a stream of all 1’s) then the counter will be $n$ at the end, which takes $\log(n)$ bits to store. On the other hand, if there are $m$ possible values then storing the largest requires $\log(m)$ bits.

Finally, the guarantee that one value occurs more than $n/2$ times is necessary. If it is not the case the algorithm could output anything (including the most infrequent element!). And moreover, if we don’t have this guarantee then every algorithm that solves the problem must use at least $\Omega(n)$ space in the worst case. In particular, say that $m=n$, and the first $n/2$ items are all distinct and the last $n/2$ items are all the same one, the majority value $s$. If you do not know $s$ in advance, then you must keep at least one bit of information to know which symbols occurred in the first half of the stream because any of them could be $s$. So the guarantee allows us to bypass that barrier.

This algorithm can be generalized to detect $k$ items with frequency above some threshold $n/(k+1)$ using space $O(k \log n)$. The idea is to keep $k$ counters instead of one, adding new elements when any counter is zero. When you see an element not being tracked by your $k$ counters (which are all positive), you decrement all the counters by 1. This is like a $k$-to-one matching rather than a pairing.

# Making Hybrid Images

The Mona Lisa

Leonardo da Vinci’s Mona Lisa is one of the most famous paintings of all time. And there has always been a discussion around her enigmatic smile. He used a trademark Renaissance technique called sfumato, which involves many thin layers of glaze mixed with subtle pigments. The striking result is that when you look directly at Mona Lisa’s smile, it seems to disappear. But when you look at the background your peripherals see a smiling face.

One could spend decades studying the works of these masters from various perspectives, but if we want to hone in on the disappearing nature of that smile, mathematics can provide valuable insights. Indeed, though he may not have known the relationship between his work and da Vinci’s, hundreds of years later Salvador Dali did the artist’s equivalent of mathematically isolating the problem with his painting, “Gala Contemplating the Mediterranean Sea.”

Gala Contemplating the Mediterranean Sea (Salvador Dali, 1976)

Here you see a woman in the foreground, but step back quite far from the picture and there is a (more or less) clear image of Abraham Lincoln. Here the question of gaze is the blaring focus of the work. Now of course Dali and da Vinci weren’t scribbling down equations and computing integrals; their artistic expression was much less well-defined. But we the artistically challenged have tools of our own: mathematics, science, and programming.

In 2006 Aude Oliva, Antonio Torralba, and Philippe. G. Schyns used those tools to merge the distance of Dali and the faded smiles of da Vinci into one cohesive idea. In their 2006 paper they presented the notion of a “hybrid image,” presented below.

The Mona Lisas of Science

If you look closely, you’ll see three women, each of which looks the teensiest bit strange, like they might be trying to suppress a smile, but none of them are smiling. Blur your eyes or step back a few meters, and they clearly look happy. The effect is quite dramatic. At the risk of being overly dramatic, these three women are literally modern day versions of Mona Lisa, the “Mona Lisas of Science,” if you will.

Another, perhaps more famous version of their technique, since it was more widely publicized, is their “Marilyn Einstein,” which up close is Albert Einstein and from far away is Marilyn Monroe.

Marilyn Einstein

This one gets to the heart of the question of what the eye sees at close range versus long range. And it turns out that you can address this question (and create brilliant works of art like the ones above) with some basic Fourier analysis.

## Intuitive Fourier analysis (and references)

The basic idea of Fourier analysis is the idea that smooth functions are hard to understand, and realization of how great it would be if we could decompose them into simpler pieces. Decomposing complex things into simpler parts is one of the main tools in all of mathematics, and Fourier analysis is one of the clearest examples of its application.

In particular, the things we care about are functions $f(x)$ with specific properties I won’t detail here like “smoothness” and “finiteness.” And the building blocks are the complex exponential functions

$\displaystyle e^{2 \pi i kx}$

where $k$ can be any integer. If you have done some linear algebra (and ignore this if you haven’t), then I can summarize the idea succinctly by saying the complex exponentials form an orthonormal basis for the vector space of square-integrable functions.

Back in colloquial language, what the Fourier theorem says is that any function of the kind we care about can be broken down into (perhaps infinitely many) pieces of this form called Fourier coefficients (I’m abusing the word “coefficient” here). The way it’s breaking down is also pleasingly simple: it’s a linear combination. Informally that means you’re just adding up all the complex exponentials with specific weights for each one. Mathematically, the conversion from the function to its Fourier coefficients is called the Fourier transform, and the set of all Fourier coefficients together is called the Fourier spectrum. So if you want to learn about your function $f$, or more importantly modify it in some way, you can inspect and modify its spectrum instead. The reason this is useful is that Fourier coefficients have very natural interpretations in sound and images, as we’ll see for the latter.

We wrote $f(x)$ and the complex exponential as a function of one real variable, but you can do the same thing for two variables (or a hundred!). And, if you’re willing to do some abusing and ignore the complexness of complex numbers, then you can visualize “complex exponentials in two variables” as images of stripes whose orientation and thickness correspond to two parameters (i.e., the $k$ in the offset equation becomes two coefficients). The video below shows how such complex exponentials can be used to build up an image of striking detail. The left frame shows which complex exponential is currently being added, and the right frame shows the layers all put together. I think the result is quite beautiful.

This just goes to show how powerful da Vinci’s idea of fine layering is: it’s as powerful as possible because it can create any image!

Now for digital images like the one above, everything is finite. So rather than have an infinitely precise function and a corresponding infinite set of Fourier coefficients, you get a finite list of sampled values (pixels) and a corresponding grid of Fourier coefficients. But the important and beautiful theorem is, and I want to emphasize how groundbreakingly important this is:

If you give me an image (or any function!) I can compute the decomposition very efficiently.

And the same theorem lets you go the other way: if you give me the decomposition, I can compute the original function’s samples quite easily. The algorithm to do this is called the Fast Fourier transform, and if any piece of mathematics or computer science has a legitimate claim to changing the world, it’s the Fast Fourier transform. It’s hard to pinpoint specific applications, because the transform is so ubiquitous across science and engineering, but we definitely would not have cell phones, satellites, internet, or electronics anywhere near as small as we do without the Fourier transform and the ability to compute it quickly.

Constructing hybrid images is one particularly nice example of manipulating the Fourier spectrum of two images, and then combining them back into a single image. That’s what we’ll do now.

As a side note, by the nature of brevity, the discussion above is a big disservice to the mathematics involved. I summarized and abused in ways that mathematicians would object to. If you want to see a much better treatment of the material, this blog has a long series of posts developing Fourier transforms and their discrete analogues from scratch. See our four primers, which lead into the main content posts where we implement the Fast Fourier transform in Python and use it to apply digital watermarks to an image. Note that in those posts, as in this one, all of the materials and code used are posted on this blog’s Github page.

## High and low frequencies

For images, interpreting ranges of Fourier coefficients is easy to do. You can imagine the coefficients lying on a grid in the plane like so:

Each dot in this grid corresponds to how “intense” the Fourier coefficient is. That is, it’s the magnitude of the (complex) coefficient of the corresponding complex exponential. Now the points that are closer to the origin correspond informally to the broad, smooth changes in the image. These are called “low frequency” coefficients. And points that are further away correspond to sharp changes and edges, and are likewise called “high frequency” components. So the if you wanted to “hybridize” two images, you’d pick ones with complementary intensities in these regions. That’s why Einstein (with all his wiry hair and wrinkles) and Monroe (with smooth features) are such good candidates. That’s also why, when we layered the Fourier components one by one in the video from earlier, we see the fuzzy shapes emerge before the fine details.

Moreover, we can “extract” the high frequency Fourier components by simply removing the low frequency ones. It’s a bit more complicated than that, since you want the transition from “something” to “nothing” to be smooth in sone sense. A proper discussion of this would go into sampling and the Nyquist frequency, but that’s beyond the scope of this post. Rather, we’ll just define a family of “filtering functions” without motivation and observe that they work well.

Definition: The Gaussian filter function with variance $\sigma$ and center $(a, b)$ is the function

$\displaystyle g(x,y) = e^{-\frac{(x - a)^2 + (y - b)^2}{2 \sigma^2}}$

It looks like this

image credit Wikipedia

In particular, at zero the function is 1 and it gradually drops to zero as you get farther away. The parameter $\sigma$ controls the rate at which it vanishes, and in the picture above the center is set to $(0,0)$.

Now what we’ll do is take our image, compute its spectrum, and multiply coordinatewise with a certain Gaussian function. If we’re trying to get rid of high-frequency components (called a “low-pass filter” because it lets the low frequencies through), we can just multiply the Fourier coefficients directly by the filter values $g(x,y)$, and if we’re doing a “high-pass filter” we multiply by $1 - g(x,y)$.

Before we get to the code, here’s an example of a low-pass filter. First, take this image of Marilyn Monroe

Now compute its Fourier transform

Apply the low-pass filter

And reverse the Fourier transform to get an image

In fact, this is a common operation in programs like photoshop for blurring an image (it’s called a Gaussian blur for obvious reasons). Here’s the python code to do this. You can download it along with all of the other resources used in making this post on this blog’s Github page.

import numpy
from numpy.fft import fft2, ifft2, fftshift, ifftshift
from scipy import misc
from scipy import ndimage
import math

def makeGaussianFilter(numRows, numCols, sigma, highPass=True):
centerI = int(numRows/2) + 1 if numRows % 2 == 1 else int(numRows/2)
centerJ = int(numCols/2) + 1 if numCols % 2 == 1 else int(numCols/2)

def gaussian(i,j):
coefficient = math.exp(-1.0 * ((i - centerI)**2 + (j - centerJ)**2) / (2 * sigma**2))
return 1 - coefficient if highPass else coefficient

return numpy.array([[gaussian(i,j) for j in range(numCols)] for i in range(numRows)])

def filterDFT(imageMatrix, filterMatrix):
shiftedDFT = fftshift(fft2(imageMatrix))
filteredDFT = shiftedDFT * filterMatrix
return ifft2(ifftshift(filteredDFT))

def lowPass(imageMatrix, sigma):
n,m = imageMatrix.shape
return filterDFT(imageMatrix, makeGaussianFilter(n, m, sigma, highPass=False))

def highPass(imageMatrix, sigma):
n,m = imageMatrix.shape
return filterDFT(imageMatrix, makeGaussianFilter(n, m, sigma, highPass=True))

if __name__ == "__main__":
lowPassedMarilyn = lowPass(marilyn, 20)
misc.imsave("low-passed-marilyn.png", numpy.real(lowPassedMarilyn))


The first function samples the values from a Gaussian function with the specified parameters, discretizing the function and storing the values in a matrix. Then the filterDFT function applies the filter by doing coordinatewise multiplication (note these are all numpy arrays). We can do the same thing with a high-pass filter, producing the edgy image below

And if we compute the average of these two images, we basically get back to the original.

So the only difference between this and a hybrid image is that you take the low-passed part of one image and the high-passed part of another. Then the art is in balancing the parameters so as to make the averaged image look right. Indeed, with the following picture of Einstein and the above shot of Monroe, we can get a pretty good recreation of the Oliva-Torralba-Schyns piece. I think with more tinkering it could be even better (I did barely any centering/aligning/resizing to the original images).

Albert Einstein, Marilyn Monroe, and their hybridization.

And here’s the code for it

def hybridImage(highFreqImg, lowFreqImg, sigmaHigh, sigmaLow):
highPassed = highPass(highFreqImg, sigmaHigh)
lowPassed = lowPass(lowFreqImg, sigmaLow)

return highPassed + lowPassed


Interestingly enough, doing it in reverse doesn’t give quite as pleasing results, but it still technically works. So there’s something particularly important that the high-passed image does have a lot of high-frequency components, and vice versa for the low pass.

You can see some of the other hybrid images Oliva et al constructed over at their web gallery.

## Next Steps

How can we take this idea further? There are a few avenues I can think of. The most obvious one would be to see how this extends to video. Could one come up with generic parameters so that when two videos are hybridized (frame by frame, using this technique) it is only easy to see one at close distance? Or else, could we apply a three-dimensional transform to a video and modify that in some principled way? I think one would not likely find anything astounding, but who knows?

Second would be to look at the many other transforms we have at our disposal. How does manipulating the spectra of these transforms affect the original image, and can you make images that are hybridized in senses other than this one?

And finally, can we bring this idea down in dimension to work with one-dimensional signals? In particular, can we hybridize music? It could usher in a new generation of mashup songs that sound different depending on whether you wear earmuffs :)

Until next time!

# Linear Programming and the Most Affordable Healthy Diet — Part 1

Optimization is by far one of the richest ways to apply computer science and mathematics to the real world. Everybody is looking to optimize something: companies want to maximize profits, factories want to maximize efficiency, investors want to minimize risk, the list just goes on and on. The mathematical tools for optimization are also some of the richest mathematical techniques. They form the cornerstone of an entire industry known as operations research, and advances in this field literally change the world.

The mathematical field is called combinatorial optimization, and the name comes from the goal of finding optimal solutions more efficiently than an exhaustive search through every possibility. This post will introduce the most central problem in all of combinatorial optimization, known as the linear program. Even better, we know how to efficiently solve linear programs, so in future posts we’ll write a program that computes the most affordable diet while meeting the recommended health standard.

## Generalizing a Specific Linear Program

Most optimization problems have two parts: an objective function, the thing we want to maximize or minimize, and constraints, rules we must abide by to ensure we get a valid solution. As a simple example you may want to minimize the amount of time you spend doing your taxes (objective function), but you certainly can’t spend a negative amount of time on them (a constraint).

The following more complicated example is the centerpiece of this post. Most people want to minimize the amount of money spent on food. At the same time, one needs to maintain a certain level of nutrition. For males ages 19-30, the United States National Institute for Health recommends 3.7 liters of water per day, 1,000 milligrams of calcium per day, 90 milligrams of vitamin C per day, etc.

We can set up this nutrition problem mathematically, just using a few toy variables. Say we had the option to buy some combination of oranges, milk, and broccoli. Some rough estimates [1] give the following content/costs of these foods. For 0.272 USD you can get 100 grams of orange, containing a total of 53.2mg of calcium, 40mg of vitamin C, and 87g of water. For 0.100 USD you can get 100 grams of whole milk, containing 276mg of calcium, 0mg of vitamin C, and 87g of water. Finally, for 0.381 USD you can get 100 grams of broccoli containing 47mg of calcium, 89.2mg of vitamin C, and 91g of water. Here’s a table summarizing this information:

Nutritional content and prices for 100g of three foods

Food         calcium(mg)     vitamin C(mg)      water(g)   price(USD/100g)
Broccoli     47              89.2               91         0.381
Whole milk   276             0                  87         0.100
Oranges      40              53.2               87         0.272

Some observations: broccoli is more expensive but gets the most of all three nutrients, whole milk doesn’t have any vitamin C but gets a ton of calcium for really cheap, and oranges are a somewhere in between. So you could probably tinker with the quantities and figure out what the cheapest healthy diet is. The problem is what happens when we incorporate hundreds or thousands of food items and tens of nutrient recommendations. This simple example is just to help us build up a nice formality.

So let’s continue doing that. If we denote by $b$ the number of 100g units of broccoli we decide to buy, and $m$ the amount of milk and $r$ the amount of oranges, then we can write the daily cost of food as

$\displaystyle \text{cost}(b,m,r) = 0.381 b + 0.1 m + 0.272 r$

In the interest of being compact (and again, building toward the general linear programming formulation) we can extract the price information into a single cost vector $c = (0.381, 0.1, 0.272)$, and likewise write our variables as a vector $x = (b,m,r)$. We’re implicitly fixing an ordering on the variables that is maintained throughout the problem, but the choice of ordering doesn’t matter. Now the cost function is just the inner product (dot product) of the cost vector and the variable vector $\left \langle c,x \right \rangle$. For some reason lots of people like to write this as $c^Tx$, where $c^T$ denotes the transpose of a matrix, and we imagine that $c$ and $x$ are matrices of size $3 \times 1$. I’ll stick to using the inner product bracket notation.

Now for each type of food we get a specific amount of each nutrient, and the sum of those nutrients needs to be bigger than the minimum recommendation. For example, we want at least 1,000 mg of calcium per day, so we require that $1000 \leq 47b + 276m + 40r$. Likewise, we can write out a table of the constraints by looking at the columns of our table above.

$\displaystyle \begin{matrix} 91b & + & 87m & + & 87r & \geq & 3700 & \text{(water)}\\ 47b & + & 276m & + & 40r & \geq & 1000 & \text{(calcium)} \\ 89.2b & + & 0m & + & 53.2r & \geq & 90 & \text{(vitamin C)} \end{matrix}$

In the same way that we extracted the cost data into a vector to separate it from the variables, we can extract all of the nutrient data into a matrix $A$, and the recommended minimums into a vector $v$. Traditionally the letter $b$ is used for the minimums vector, but for now we’re using $b$ for broccoli.

$A = \begin{pmatrix} 91 & 87 & 87 \\ 47 & 276 & 40 \\ 89.2 & 0 & 53.2 \end{pmatrix}$

$v = \begin{pmatrix} 3700 \\ 1000 \\ 90 \end{pmatrix}$

And now the constraint is that $Ax \geq v$, where the $\geq$ means “greater than or equal to in every coordinate.” So now we can write down the more general form of the problem for our specific matrices and vectors. That is, our problem is to minimize $\left \langle c,x \right \rangle$ subject to the constraint that $Ax \geq v$. This is often written in offset form to contrast it with variations we’ll see in a bit:

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to the constraint } Ax \geq v$

In general there’s no reason you can’t have a “negative” amount of one variable. In this problem you can’t buy negative broccoli, so we’ll add the constraints to ensure the variables are nonnegative. So our final form is

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to } Ax \geq v \\ \text{and } x \geq 0$

In general, if you have an $m \times n$ matrix $A$, a “minimums” vector $v \in \mathbb{R}^m$, and a cost vector $c \in \mathbb{R}^n$, the problem of finding the vector $x$ that minimizes the cost function while meeting the constraints is called a linear programming problem or simply a linear program.

To satiate the reader’s burning curiosity, the solution for our calcium/vitamin C problem is roughly $x = (1.01, 41.47, 0)$. That is, you should have about 100g of broccoli and 4.2kg of milk (like 4 liters), and skip the oranges entirely. The daily cost is about 4.53 USD. If this seems awkwardly large, it’s because there are cheaper ways to get water than milk.

100g of broccoli (image source: 100-grams.blogspot.com)

## Duality

Now that we’ve seen the general form a linear program and a cute example, we can ask the real meaty question: is there an efficient algorithm that solves arbitrary linear programs? Despite how widely applicable these problems seem, the answer is yes!

But before we can describe the algorithm we need to know more about linear programs. For example, say you have some vector $x$ which satisfies your constraints. How can you tell if it’s optimal? Without such a test we’d have no way to know when to terminate our algorithm. Another problem is that we’ve phrased the problem in terms of minimization, but what about problems where we want to maximize things? Can we use the same algorithm that finds minima to find maxima as well?

Both of these problems are neatly answered by the theory of duality. In mathematics in general, the best way to understand what people mean by “duality” is that one mathematical object uniquely determines two different perspectives, each useful in its own way. And typically a duality theorem provides one with an efficient way to transform one perspective into the other, and relate the information you get from both perspectives. A theory of duality is considered beautiful because it gives you truly deep insight into the mathematical object you care about.

In linear programming duality is between maximization and minimization. In particular, every maximization problem has a unique “dual” minimization problem, and vice versa. The really interesting thing is that the variables you’re trying to optimize in one form correspond to the contraints in the other form! Here’s how one might discover such a beautiful correspondence. We’ll use a made up example with small numbers to make things easy.

So you have this optimization problem

$\displaystyle \begin{matrix} \text{minimize} & 4x_1+3x_2+9x_3 & \\ \text{subject to} & x_1+x_2+x_3 & \geq 6 \\ & 2x_1+x_3 & \geq 2 \\ & x_2+x_3 & \geq 1 & \\ & x_1,x_2,x_3 & \geq 0 \end{matrix}$

Just for giggles let’s write out what $A$ and $c$ are.

$\displaystyle A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, c = (4,3,9), v = (6,2,1)$

Say you want to come up with a lower bound on the optimal solution to your problem. That is, you want to know that you can’t make $4x_1 + 3x_2 + 9x_3$ smaller than some number $m$. The constraints can help us derive such lower bounds. In particular, every variable has to be nonnegative, so we know that $4x_1 + 3x_2 + 9x_3 \geq x_1 + x_2 + x_3 \geq 6$, and so 6 is a lower bound on our optimum. Likewise,

\displaystyle \begin{aligned}4x_1+3x_2+9x_3 & \geq 4x_1+4x_3+3x_2+3x_3 \\ &=2(2x_1 + x_3)+3(x_2+x_3) \\ & \geq 2 \cdot 2 + 3 \cdot 1 \\ &=7\end{aligned}

and that’s an even better lower bound than 6. We could try to write this approach down in general: find some numbers $y_1, y_2, y_3$ that we’ll use for each constraint to form

$\displaystyle y_1(\text{constraint 1}) + y_2(\text{constraint 2}) + y_3(\text{constraint 3})$

To make it a valid lower bound we need to ensure that the coefficients of each of the $x_i$ are smaller than the coefficients in the objective function (i.e. that the coefficient of $x_1$ ends up less than 4). And to make it the best lower bound possible we want to maximize what the right-hand-size of the inequality would be: $y_1 6 + y_2 2 + y_3 1$. If you write out these equations and the constraints you get our “lower bound” problem written as

$\displaystyle \begin{matrix} \text{maximize} & 6y_1 + 2y_2 + y_3 & \\ \text{subject to} & y_1 + 2y_2 & \leq 4 \\ & y_1 + y_3 & \leq 3 \\ & y_1+y_2 + y_3 & \leq 9 \\ & y_1,y_2,y_3 & \geq 0 \end{matrix}$

And wouldn’t you know, the matrix providing the constraints is $A^T$, and the vectors $c$ and $v$ switched places.

$\displaystyle A^T = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

This is no coincidence. All linear programs can be transformed in this way, and it would be a useful exercise for the reader to turn the above maximization problem back into a minimization problem by the same technique (computing linear combinations of the constraints to make upper bounds). You’ll be surprised to find that you get back to the original minimization problem! This is part of what makes it “duality,” because the dual of the dual is the original thing again. Often, when we fix the “original” problem, we call it the primal form to distinguish it from the dual form. Usually the primal problem is the one that is easy to interpret.

(Note: because we’re done with broccoli for now, we’re going to use $b$ to denote the constraint vector that used to be $v$.)

Now say you’re given the data of a linear program for minimization, that is the vectors $c, b$ and matrix $A$ for the problem, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b; x \geq 0$.” We can make a general definition: the dual linear program is the maximization problem “maximize $\left \langle b, y \right \rangle$ subject to $A^T y \leq c, y \geq 0$.” Here $y$ is the new set of variables and the superscript T denotes the transpose of the matrix. The constraint for the dual is often written $y^T A \leq c^T$, again identifying vectors with a single-column matrices, but I find the swamp of transposes pointless and annoying (why do things need to be columns?).

Now we can actually prove that the objective function for the dual provides a bound on the objective function for the original problem. It’s obvious from the work we’ve done, which is why it’s called the weak duality theorem.

Weak Duality Theorem: Let $c, A, b$ be the data of a linear program in the primal form (the minimization problem) whose objective function is $\left \langle c, x \right \rangle$. Recall that the objective function of the dual (maximization) problem is $\left \langle b, y \right \rangle$. If $x,y$ are feasible solutions (satisfy the constraints of their respective problems), then

$\left \langle b, y \right \rangle \leq \left \langle c, x \right \rangle$

In other words, the maximum of the dual is a lower bound on the minimum of the primal problem and vice versa. Moreover, any feasible solution for one provides a bound on the other.

Proof. The proof is pleasingly simple. Just inspect the quantity $\left \langle A^T y, x \right \rangle = \left \langle y, Ax \right \rangle$. The constraints from the definitions of the primal and dual give us that

$\left \langle y, b \right \rangle \leq \left \langle y, Ax \right \rangle = \left \langle A^Ty, x \right \rangle \leq \left \langle c,x \right \rangle$

The inequalities follow from the linear algebra fact that if the $u$ in $\left \langle u,v \right \rangle$ is nonnegative, then you can only increase the size of the product by increasing the components of $v$. This is why we need the nonnegativity constraints.

In fact, the world is much more pleasing. There is a theorem that says the two optimums are equal!

Strong Duality Theorem: If there are any solutions $x,y$ to the primal (minimization) problem and the dual (maximization) problem, respectively, then the two problems also have optimal solutions $x^*, y^*$, and two candidate solutions $x^*, y^*$ are optimal if and only if they produce equal objective values $\left \langle c, x^* \right \rangle = \left \langle y^*, b \right \rangle$.

The proof of this theorem is a bit more convoluted than the weak duality theorem, and the key technique is a lemma of Farkas and its variations. See the second half of these notes for a full proof. The nice thing is that this theorem gives us a way to tell if an algorithm to solve linear programs is done: maintain a pair of feasible solutions to the primal and dual problems, improve them by some rule, and stop when the two solutions give equal objective values. The hard part, then, is finding a principled and guaranteed way to improve a given pair of solutions.

On the other hand, you can also prove the strong duality theorem by inventing an algorithm that provably terminates. We’ll see such an algorithm, known as the simplex algorithm in the next post. Sneak peek: it’s a lot like Gaussian elimination. Then we’ll use the algorithm (or an equivalent industry-strength version) to solve a much bigger nutrition problem.

In fact, you can do a bit better than the strong duality theorem, in terms of coming up with a stopping condition for a linear programming algorithm. You can observe that an optimal solution implies further constraints on the relationship between the primal and the dual problems. In particular, this is called the complementary slackness conditions, and they essentially say that if an optimal solution to the primal has a positive variable then the corresponding constraint in the dual problem must be tight (is an equality) to get an optimal solution to the dual. The contrapositive says that if some constraint is slack, or a strict inequality, then either the corresponding variable is zero or else the solution is not optimal. More formally,

Theorem (Complementary Slackness Conditions): Let $A, c, b$ be the data of the primal form of a linear program, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b, x \geq 0$.” Then $x^*, y^*$ are optimal solutions to the primal and dual problems if any only if all of the following conditions hold.

• $x^*, y^*$ are both feasible for their respective problems.
• Whenever $x^*_i > 0$ the corresponding constraint $A^T_i y^* = c_i$ is an equality.
• Whenever $y^*_j > 0$ the corresponding constraint $A_j x^* = b_j$ is an equality.

Here we denote by $M_i$ the $i$-th row of the matrix $M$ and $v_i$ to denote the $i$-th entry of a vector. Another way to write the condition using vectors instead of English is

$\left \langle x^*, A^T y^* - c \right \rangle = 0$
$\left \langle y^*, Ax^* - b \right \rangle$

The proof follows from the duality theorems, and just involves pushing around some vector algebra. See section 6.2 of these notes.

One can interpret complementary slackness in linear programs in a lot of different ways. For us, it will simply be a termination condition for an algorithm: one can efficiently check all of these conditions for the nonzero variables and stop if they’re all satisfied or if we find a variable that violates a slackness condition. Indeed, in more mature optimization analyses, the slackness condition that is more egregiously violated can provide evidence for where a candidate solution can best be improved. For a more intricate and detailed story about how to interpret the complementary slackness conditions, see Section 4 of these notes by Joel Sobel.

Finally, before we close we should note there are geometric ways to think about linear programming. I have my preferred visualization in my head, but I have yet to find a suitable animation on the web that replicates it. Here’s one example in two dimensions. The set of constraints define a convex geometric region in the plane

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

Now the optimization function $f(x) = \left \langle c,x \right \rangle$ is also a linear function, and if you fix some output value $y = f(x)$ this defines a line in the plane. As $y$ changes, the line moves along its normal vector (that is, all these fixed lines are parallel). Now to geometrically optimize the target function, we can imagine starting with the line $f(x) = 0$, and sliding it along its normal vector in the direction that keeps it in the feasible region. We can keep sliding it in this direction, and the maximum of the function is just the last instant that this line intersects the feasible region. If none of the constraints are parallel to the family of lines defined by $f$, then this is guaranteed to occur at a vertex of the feasible region. Otherwise, there will be a family of optima lying anywhere on the line segment of last intersection.

In higher dimensions, the only change is that the lines become affine subspaces of dimension $n-1$. That means in three dimensions you’re sliding planes, in four dimensions you’re sliding 3-dimensional hyperplanes, etc. The facts about the last intersection being a vertex or a “line segment” still hold. So as we’ll see next time, successful algorithms for linear programming in practice take advantage of this observation by efficiently traversing the vertices of this convex region. We’ll see this in much more detail in the next post.

Until then!