So in continuing our series of methods of proof, we’ll move up to some of the more advanced methods of proof. And in keeping with the spirit of the series, we’ll spend most of our time discussing the structural form of the proofs. This time, diagonalization.

Perhaps one of the most famous methods of proof after the basic four is proof by diagonalization. Why do they call it diagonalization? Because the idea behind diagonalization is to write out a table that describes how a collection of objects behaves, and then to manipulate the “diagonal” of that table to get a new object that you can prove isn’t in the table.

The simplest and most famous example of this is the proof that there is no bijection between the natural numbers and the real numbers. We defined injections, surjections, and bijections in two earlier posts in this series, but for new readers a bijection is just a one-to-one mapping between two collections of things. For example, one can construct a bijection between all positive integers and all *even* positive integers by mapping $n$ to $2n$. If there is a bijection between two (perhaps infinite) sets, then we say they have the same size or *cardinality*. And so to say there is no bijection between the natural numbers and the real numbers is to say that one of these two sets (the real numbers) is somehow “larger” than the other, despite both being infinite in size. It’s deep, it used to be very controversial, and it made the method of diagonalization famous. Let’s see how it works.
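As a quick illustration of that doubling map, here is a finite sanity check (not a proof, of course — no finite check can establish a bijection between infinite sets): no two inputs collide, and every even number in range is hit.

```python
# The bijection n -> 2n between the positive integers and the even positive
# integers, checked on a finite prefix: injective (no collisions) and
# surjective onto the evens in range.
def f(n):
    return 2 * n

images = [f(n) for n in range(1, 101)]
assert len(set(images)) == len(images)       # injective on this prefix
assert set(images) == set(range(2, 201, 2))  # hits every even number up to 200
```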

**Theorem:** There is no bijection from the natural numbers $\mathbb{N}$ to the real numbers $\mathbb{R}$.

*Proof.* Suppose to the contrary (i.e., we’re about to do proof by contradiction) that there is a bijection $f : \mathbb{N} \to \mathbb{R}$. That is, you give me a positive integer $n$ and I will spit out $f(n)$, with the property that different $n$ give different $f(n)$, and every real number is hit by some natural number (this is just what it means to be a one-to-one mapping).

First let me just do some setup. I claim that all we need to do is show that there is no bijection between $\mathbb{N}$ and the real numbers *between 0 and 1*. In particular, I claim there is a bijection from $(0,1)$ to all real numbers, so if there is a bijection from $\mathbb{N}$ to $(0,1)$ then we could combine the two bijections. To show there is a bijection from $(0,1)$ to the reals, I can first make a bijection from the open interval $(0,1)$ to the interval $(0, \infty)$ by mapping $x$ to $x/(1-x)$. With a little bit of extra work (read, messy details) you can extend this to all real numbers. Here’s a sketch: make a bijection from $(0,1)$ to $(0,2)$ by doubling; then make a bijection from $(0,2)$ to all real numbers by using the $(0,1)$ part to cover the negative reals and the $(1,2)$ part to cover the positive reals by subtracting 1 (stretching each piece out to a half-line with a map like the one above). Almost! To be super rigorous you also have to argue that the missing number 1 doesn’t change the cardinality, or else write down a more complicated bijection; still, the idea should be clear.
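Here is a sketch of those messy details in code, using my own reconstruction of the piecewise maps (the text doesn’t pin down exact formulas): send the $(0,1)$ part of $(0,2)$ to the negative reals via $x \mapsto (x-1)/x$, and the $(1,2)$ part to the positive reals by subtracting 1 and applying $x \mapsto x/(1-x)$. Note that 0 is never hit, mirroring the “missing number 1” caveat above.

```python
# A sketch (my reconstruction, not the post's exact formulas) of a piecewise
# bijection from (0,2) minus {1} onto the nonzero reals.
def g(x):
    assert 0 < x < 2 and x != 1
    if x < 1:
        return (x - 1) / x      # (0,1) -> (-inf, 0)
    t = x - 1                   # shift (1,2) down to (0,1)
    return t / (1 - t)          # (0,1) -> (0, inf)

def gInverse(y):
    if y < 0:
        return 1 / (1 - y)      # inverts (x - 1)/x
    return y / (1 + y) + 1      # inverts t/(1 - t), then shifts back up

# spot-check invertibility on both pieces
for y in [-100.0, -2.5, -0.1, 0.3, 7.0, 1e6]:
    assert abs(g(gInverse(y)) - y) < 1e-6 * max(1.0, abs(y))
```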

Okay, setup is done. We just have to show there is no bijection between $(0,1)$ and the natural numbers.

The reason I did all that setup is so that I can use the fact that every real number in $(0,1)$ has an infinite binary decimal expansion whose only nonzero digits are after the decimal point. And so I’ll write down the expansion of $f(1)$ as a row in a table (an infinite row), and below it I’ll write down the expansion of $f(2)$, below that $f(3)$, and so on, and the decimal points will line up. The table looks like this:

```
f(1) = 0 . b  b  b  b  b ...
f(2) = 0 . b  b  b  b  b ...
f(3) = 0 . b  b  b  b  b ...
  ...
```

The $b$’s above are either 0 or 1. I need to be a bit more detailed in my table, so I’ll index the digits of $f(1)$ by $b_{11}, b_{12}, b_{13}, \ldots$, the digits of $f(2)$ by $b_{21}, b_{22}, b_{23}, \ldots$, and so on. This makes the table look like this:

```
f(1) = 0 . b11  b12  b13  b14 ...
f(2) = 0 . b21  b22  b23  b24 ...
f(3) = 0 . b31  b32  b33  b34 ...
  ...
```

It’s a bit harder to read, but trust me the notation is helpful.

Now by the assumption that $f$ is a bijection, I’m assuming that *every* real number shows up as a number in this table, and no real number shows up twice. So if I could construct a number $x$ that I can prove is not in the table, I will arrive at a contradiction: the table couldn’t have had all real numbers to begin with! And that will prove there is no bijection between the natural numbers and the real numbers.

Here’s how I’ll come up with such a number $x$ (this is the diagonalization part). It starts with 0., and its first digit after the decimal is $1 - b_{11}$. That is, we flip the bit $b_{11}$ to get the first digit of $x$. The second digit is $1 - b_{22}$, the third is $1 - b_{33}$, and so on. In general, digit $i$ is $1 - b_{ii}$.

Now we show that $x$ isn’t in the table. If it were, then it would have to be $f(n)$ for some $n$, i.e. be the $n$-th row in the table. Moreover, by the way we built the table, the $n$-th digit of $x$ would be $b_{nn}$. But we *defined* $x$ so that its $n$-th digit was actually $1 - b_{nn}$. This is very embarrassing for $x$ (it’s a contradiction!). So $x$ isn’t in the table.
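The argument above can be illustrated on any finite table of digits — the diagonal flip mechanically produces a row that disagrees with every row of the table:

```python
# A finite illustration of diagonalization: flipping the diagonal of a 0-1
# table produces a row that differs from row i in (at least) position i,
# so it cannot equal any row of the table. The entries are arbitrary.
table = [
    [0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
]

diagonalFlip = [1 - table[i][i] for i in range(len(table))]

for i, row in enumerate(table):
    assert diagonalFlip[i] != row[i]   # x disagrees with f(i) at digit i
```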

It’s the kind of proof that blows your mind the first time you see it, because it says that *there is more than one kind of infinity*. Not something you think about every day, right?

The second example we’ll show of a proof by diagonalization is the Halting Theorem, proved originally by Alan Turing, which says that there are some problems that computers can’t solve, even if given unbounded space and time to perform their computations. The formal mathematical model is called a Turing machine, but for simplicity you can think of “Turing machines” and “algorithms described in words” as the same thing. Or if you want it can be “programs written in programming language X.” So we’ll use the three words “Turing machine,” “algorithm,” and “program” interchangeably.

The proof works by actually defining a problem and proving it can’t be solved. The problem is called *the halting problem*, and it is the problem of deciding: given a program $P$ and an input $x$ to that program, will $P$ ever stop running when given $x$ as input? What I mean by “decide” is that any program that claims to solve the halting problem is itself required to halt for every possible input, with the correct answer. A “halting problem solver” can’t loop infinitely!

So first we’ll give the standard proof that the halting problem can’t be solved, and then we’ll inspect the form of the proof more closely to see why it’s considered a diagonalization argument.

**Theorem:** The halting problem cannot be solved by Turing machines.

*Proof.* Suppose to the contrary that $T$ is a program that solves the halting problem. We’ll use $T$ as a black box to come up with a new program I’ll call meta-$T$, defined in pseudo-python as follows.

```python
def metaT(P):
    run T on (P, P)
    if T says that P halts:
        loop infinitely
    else:
        halt and output "success!"
```

In words, meta-$T$ accepts as input the source code of a program $P$, and then uses $T$ to tell if $P$ halts (when given its own source code as input). Based on the result, it behaves the *opposite* of $P$; if $P$ halts then meta-$T$ loops infinitely, and vice versa. It’s a little meta, right?

Now let’s do something crazy: let’s run meta-$T$ on itself! That is, run

`metaT(metaT)`

So meta. The question is: what is the output of this call? The meta-$T$ program uses $T$ to determine whether meta-$T$ halts when given itself as input. So let’s say that the answer to this question is “yes, it does halt.” Then by the definition of meta-$T$, the program proceeds to loop forever. But this is a problem, because it means that `metaT(metaT)` (which is the original thing we ran) actually does not halt, contradicting $T$’s answer! Likewise, if $T$ says that `metaT(metaT)` should loop infinitely, that will cause meta-$T$ to halt, a contradiction. So $T$ cannot be correct, and the halting problem can’t be solved.

This theorem is deep because it says that you can’t possibly write a program which can always detect bugs in other programs. Infinite loops are just one special kind of bug.

But let’s take a closer look and see why this is a proof by diagonalization. The first thing we need to convince ourselves of is that the set of all programs is countable (that is, there is a bijection from $\mathbb{N}$ to the set of all programs). This shouldn’t be so hard to see: you can list all programs in lexicographic order, since the set of all strings is countable, and then throw out any that are not syntactically valid programs. Likewise, the set of all inputs, really just all strings, is countable.
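The listing itself can be sketched concretely. To make the enumeration actually work you list strings in shortlex order (by length first, then lexicographically within each length); the validity check `isValid` below is a hypothetical stand-in for a real parser:

```python
from itertools import count, product

# Sketch of the countability claim: enumerate all binary strings in shortlex
# order, filtering by a syntactic validity check. `isValid` is made up for
# illustration; for real programs it would be a parser.
def allStrings(alphabet="01"):
    for n in count(0):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def isValid(s):
    return s.endswith("0")   # hypothetical validity predicate

gen = allStrings()
firstTen = [next(gen) for _ in range(10)]
```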

The second thing we need to convince ourselves of is that a *problem* corresponds to an infinite binary string. To do this, we’ll restrict our attention to problems with yes/no answers, that is where the goal of the program is to output a single bit corresponding to yes or no for a given input. Then if we list all possible inputs in increasing lexicographic order, a problem can be represented by the infinite list of bits that are the correct outputs to each input.

For example, if the problem is to determine whether a given binary input string corresponds to an even number, the representation might look like this:

`010101010101010101...`
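One way (among many) to produce that bit string, assuming for illustration that the inputs are listed as the numbers 1, 2, 3, … — the exact string depends on how inputs are encoded and ordered:

```python
# The "is this number even?" problem as an infinite string of answer bits,
# one bit per input, inputs taken in increasing order starting from 1.
def answerBit(n):
    return 1 if n % 2 == 0 else 0

prefix = "".join(str(answerBit(n)) for n in range(1, 19))
assert prefix == "010101010101010101"
```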

Of course this all depends on the details of how one encodes inputs, but the point is that if you wanted to you could nail all this down precisely. More importantly for us, we can represent the halting problem as an infinite *table* of bits. If the columns of the table are all programs $P_1, P_2, \ldots$ (in lex order), and the rows of the table correspond to inputs $x_1, x_2, \ldots$ (in lex order), then the table would have at entry $(i, j)$ a 1 if $P_j$ halts on input $x_i$ and a 0 otherwise.

The table encodes the answers to the halting problem for all possible programs and inputs.

Now we assume for contradiction’s sake that some program $T$ solves the halting problem, i.e. that every entry of the table is computable. Now we’ll construct the answers output by meta-$T$ by flipping each bit of the diagonal of the table. The point is that meta-$T$ corresponds to some *column* of the table, because meta-$T$ is itself a program, and some input string is interpreted as its source code. Then we argue that the diagonal entry of the table for meta-$T$ contradicts its definition, and we’re done!

So these are two of the most high-profile uses of the method of diagonalization. It’s a great tool for your proving repertoire.

Until next time!


An algorithm can “solve” a classification task using labeled examples drawn from some distribution if it can achieve accuracy that is arbitrarily close to perfect on the distribution, and it can meet this goal with arbitrarily high probability, where its runtime and the number of examples needed scale efficiently with all the parameters (accuracy, confidence, size of an example). Moreover, the algorithm needs to succeed no matter what distribution generates the examples.

You can think of this as a game between the algorithm designer and an adversary. First, the learning problem is fixed and everyone involved knows what the task is. Then the algorithm designer has to pick an algorithm. Then the adversary, *knowing the chosen algorithm,* chooses a nasty distribution $D$ over examples that are fed to the learning algorithm. The algorithm designer “wins” if the algorithm produces a hypothesis with low error on $D$ when given samples from $D$. And our goal is to prove that the algorithm designer can pick a single algorithm that is extremely likely to win no matter what distribution the adversary picks.

We’ll momentarily restate this with a more precise definition, because in this post we will compare it to a slightly different model, which is called the *weak PAC-learning* model. It’s essentially the same as PAC, except it only requires the algorithm to have accuracy that is *slightly better than random guessing*. That is, the algorithm will output a classification function which will correctly label a random example with probability at least $1/2 + \eta$ for some small, but fixed, $\eta > 0$. The quantity $\eta$ (the Greek “eta”) is called the *edge*, as in “the edge over random guessing.” We call an algorithm that produces such a hypothesis a *weak learner*, and in contrast we’ll call a successful algorithm in the usual PAC model a *strong learner*.

The amazing fact is that **strong learning and weak learning are equivalent!** Of course a weak learner is not the same thing as a strong learner. What we mean by “equivalent” is that:

A problem can be weak-learned if and only if it can be strong-learned.

So they are *computationally* the same. One direction of this equivalence is trivial: if you have a strong learner for a classification task then it’s automatically a weak learner for the same task. The reverse is much harder, and this is the crux: there is an algorithm for transforming a weak learner into a strong learner! Informally, we “boost” the weak learning algorithm by feeding it examples from carefully constructed distributions, and then take a majority vote. This “reduction” from strong to weak learning is where all the magic happens.

In this post we’ll get into the depths of this boosting technique. We’ll review the model of PAC-learning, define what it means to be a weak learner, “organically” come up with the AdaBoost algorithm from some intuitive principles, prove that AdaBoost reduces error on the training data, and then run it on data. It turns out that despite the origin of boosting being a purely theoretical question, boosting algorithms have had a wide impact on practical machine learning as well.

As usual, all of the code and data used in this post is available on this blog’s Github page.

Before we get into the details, here’s a bit of history and context. PAC learning was introduced by Leslie Valiant in 1984, laying the foundation for a flurry of innovation. In 1988 Michael Kearns posed the question of whether one can “boost” a weak learner to a strong learner. Two years later Rob Schapire published his landmark paper “The Strength of Weak Learnability” closing the theoretical question by providing the first “boosting” algorithm. Schapire and Yoav Freund worked together for the next few years to produce a simpler and more versatile algorithm called AdaBoost, and for this they won the Gödel Prize, one of the highest honors in theoretical computer science. AdaBoost is also the standard boosting algorithm used in practice, though there are enough variants to warrant a book on the subject.

I’m going to define and prove that AdaBoost works in this post, and implement it and test it on some data. But first I want to give some high level discussion of the technique, and afterward the goal is to make that wispy intuition rigorous.

The central technique of AdaBoost has been discovered and rediscovered in computer science, and recently it was recognized abstractly in its own right. It is called the **Multiplicative Weights Update Algorithm** (MWUA), and it has applications in everything from learning theory to combinatorial optimization and game theory. The idea is to

- Maintain a nonnegative weight for the elements of some set,
- Draw a random element proportionally to the weights,
- Do something with the chosen element, and based on the outcome of the “something…”
- Update the weights and repeat.

The “something” is usually a black box algorithm like “solve this simple optimization problem.” The output of the “something” is interpreted as a reward or penalty, and the weights are updated according to the severity of the penalty (the details of how this is done differ depending on the goal). In this light one can interpret MWUA as minimizing regret with respect to the best alternative element one could have chosen in hindsight. In fact, this was precisely the technique we used to attack the adversarial bandit learning problem (the Exp3 algorithm is a multiplicative weight scheme). See this lengthy technical survey of Arora and Kale for a research-level discussion of the algorithm and its applications.
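The four steps above can be sketched in a few lines of Python. This is a minimal multiplicative-weights sketch, not Exp3 itself; the experts and loss function here are made up for illustration, and experts that suffer a loss have their weight multiplied by $(1 - \eta)$:

```python
import random

# Minimal MWUA sketch: keep weights over "experts", sample one proportionally
# to weight, observe losses, and shrink the weights of losing experts.
def mwua(numExperts, lossesPerRound, rounds, eta=0.1, seed=0):
    rng = random.Random(seed)
    weights = [1.0] * numExperts
    for t in range(rounds):
        # draw an expert proportionally to the current weights
        r, chosen = rng.uniform(0, sum(weights)), 0
        while r > weights[chosen] and chosen < numExperts - 1:
            r -= weights[chosen]
            chosen += 1
        # the "do something" step would use `chosen` here; then every expert
        # is penalized multiplicatively according to its loss this round
        losses = lossesPerRound(t)
        weights = [w * (1 - eta * l) for w, l in zip(weights, losses)]
    return weights

# expert 0 never suffers a loss, so its weight comes to dominate
final = mwua(3, lambda t: [0, 1, 1], rounds=50)
```

After 50 rounds the lossless expert keeps weight 1.0 while the others decay like $0.9^{50} \approx 0.005$, so almost all the probability mass concentrates on the best expert in hindsight — which is exactly the regret-minimization story above.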

Now let’s remind ourselves of the formal definition of PAC. If you’ve read the previous post on the PAC model, this next section will be redundant.

In PAC-learning you are trying to give labels to data from some set $X$. There is a distribution $D$ producing data from $X$, and it’s used for everything: to provide the data the algorithm uses to learn, to measure your accuracy, and every other time you might get samples from $X$. You as the algorithm designer don’t know what $D$ is, and a successful learning algorithm has to work *no matter what* $D$ *is*. There’s some unknown function $c$ called the *target concept*, which assigns a $\pm 1$ label to each data point in $X$. The target is the function we’re trying to “learn.” When the algorithm draws an example $x$ from $D$, it’s allowed to query the label $c(x)$, and use all of the labels it’s seen to come up with some *hypothesis* $h$ that is used for new examples the algorithm may not have seen before. The problem is “solved” if $h$ has low error on all of $D$.

To give a concrete example let’s do spam emails. Say that $X$ is the set of all emails, and $D$ is the distribution over emails that get sent to my personal inbox. A PAC-learning algorithm would take all my emails, along with my classification of which are spam and which are not spam (plus and minus 1). The algorithm would produce a hypothesis $h$ that can be used to label new emails, and if the algorithm is truly a PAC-learner, then our guarantee is that with high probability (over the randomness in which emails I receive) the algorithm will produce an $h$ that has low error on the entire distribution of emails that get sent to me (relative to my personal spam labeling function).

Of course there are practical issues with this model. I don’t have a consistent function for calling things spam, the distribution of emails I get and my labeling function can change over time, and emails don’t come according to a distribution with independent random draws. But that’s the theoretical model, and we can hope that algorithms we devise for this model happen to work well in practice.

Here’s the formal definition of the error of a hypothesis $h(x)$ produced by the learning algorithm:

$$\textup{err}_{c,D}(h) = \Pr_{x \sim D}(h(x) \neq c(x))$$

It’s read “the error of $h$ with respect to the concept $c$ we’re trying to learn and the distribution $D$ is the probability over $x$ drawn from $D$ that the hypothesis produces the wrong label.” We can now define PAC-learning formally, introducing the parameters $\delta$ for “probably” and $\varepsilon$ for “approximately.” Let me say it informally first:
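The error definition can be exercised with an empirical estimate: sample from the distribution and count disagreements between the hypothesis and the target concept. The particular $c$, $h$, and distribution below are hypothetical, chosen only to illustrate the definition ($h$ is wrong exactly on $[0.5, 0.6)$, so its true error is 0.1):

```python
import random

# Empirical estimate of err_{c,D}(h): the fraction of sampled points on
# which the hypothesis disagrees with the target concept.
def empiricalError(h, c, draw, samples=10000):
    points = [draw() for _ in range(samples)]
    return sum(1 for x in points if h(x) != c(x)) / samples

rng = random.Random(0)
c = lambda x: 1 if x >= 0.5 else -1   # target concept (made up)
h = lambda x: 1 if x >= 0.6 else -1   # hypothesis, wrong on [0.5, 0.6)
err = empiricalError(h, c, rng.random)  # should be close to 0.1
```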

An algorithm PAC-learns if, for any $\varepsilon, \delta > 0$ and any distribution $D$, with probability at least $1 - \delta$ the hypothesis produced by the algorithm has error at most $\varepsilon$.

To flesh out the other things hiding in that sentence, here’s the full definition.

**Definition (PAC):** An algorithm $A$ is said to PAC-learn the concept class $\mathsf{H}$ over the set $X$ if, for any distribution $D$ over $X$, for any $0 < \varepsilon, \delta < 1/2$, and for any target concept $c \in \mathsf{H}$, the probability that $A$ produces a hypothesis $h$ of error at most $\varepsilon$ is at least $1 - \delta$. In symbols, $\Pr_D(\textup{err}_{c,D}(h) \leq \varepsilon) > 1 - \delta$. Moreover, $A$ must run in time polynomial in $1/\varepsilon, 1/\delta$, and $n$, where $n$ is the size of an element $x \in X$.

The reason we need a class of concepts (instead of just one target concept) is that otherwise we could just have a constant algorithm that outputs the correct labeling function. Indeed, when we get a problem we ask whether there *exists* an algorithm that can solve it. I.e., a problem is “PAC-learnable” if there is some algorithm that learns it as described above. With just one target concept there can exist an algorithm to solve the problem by hard-coding a description of the concept in the source code. So we need to have some “class of possible answers” that the algorithm is searching through so that the algorithm actually has a job to do.

We call an algorithm that gets this guarantee a *strong learner*. A *weak learner* has the same definition, except that we replace $\textup{err}_{c,D}(h) \leq \varepsilon$ by the weak error bound: for *some fixed* $0 < \eta < 1/2$, the error is $\textup{err}_{c,D}(h) \leq 1/2 - \eta$. So we don’t require the algorithm to achieve *any* desired accuracy; it just has to get some accuracy slightly better than random guessing, which we don’t get to choose. As we will see, the value of $\eta$ influences the convergence of the boosting algorithm. One important thing to note is that $\eta$ is a constant independent of $n$, the size of an example, and $m$, the number of examples. In particular, we need to avoid the “degenerate” possibility that $\eta(n) \to 0$, in which case as our learning problem scales the quality of the weak learner degrades toward 1/2. We want it to be *bounded* away from 1/2.

So just to clarify all the parameters floating around: $\delta$ will always be the “probably” part of PAC, $\varepsilon$ is the error bound (the “approximately” part) for strong learners, and $1/2 - \eta$ is the error bound for weak learners.

Now before we prove that you can “boost” a weak learner to a strong learner, we should have some idea of what a weak learner is. Informally, it’s just a ‘rule of thumb’ that you can somehow guarantee does a little bit better than random guessing.

In practice, however, people sort of just make things up and they work. It’s kind of funny, but until recently nobody had really studied what makes a “good weak learner.” They just use an example like the one we’re about to show, and as long as they get a good error rate they don’t care if it has any mathematical guarantees. Likewise, they don’t expect the final “boosted” algorithm to do arbitrarily well, they just want low error rates.

The weak learner we’ll use in this post produces “decision stumps.” If you know what a decision tree is, then a decision stump is trivial: it’s a decision tree where the whole tree is just one node. If you don’t know what a decision tree is, a decision stump is a classification rule of the form:

Pick some feature $i$ and some value $b$ of that feature, and output label $+1$ if the input example has value $b$ for feature $i$, and output label $-1$ otherwise.

Concretely, a decision stump might mark an email spam if it contains the word “viagra.” Or it might deny a loan applicant a loan if their credit score is less than some number.

Our weak learner produces a decision stump by simply looking through all the features and all the values of the features until it finds a decision stump that has the best error rate. It’s brute force, baby! Actually we’ll do something a little bit different. We’ll make our data numeric and look for a threshold of the feature value to split positive labels from negative labels. Here’s the Python code we’ll use in this post for boosting. This code was part of a collaboration with my two colleagues Adam Lelkes and Ben Fish. As usual, all of the code used in this post is available on Github.

First we make a class for a decision stump. The attributes represent a feature, a threshold value for that feature, and a choice of labels for the two cases. The classify function shows how simple the hypothesis is.

```python
class Stump:
    def __init__(self):
        self.gtLabel = None
        self.ltLabel = None
        self.splitThreshold = None
        self.splitFeature = None

    def classify(self, point):
        if point[self.splitFeature] >= self.splitThreshold:
            return self.gtLabel
        else:
            return self.ltLabel

    def __call__(self, point):
        return self.classify(point)
```

Then for a fixed feature index we’ll define a function that computes the best threshold value for that index.

```python
def minLabelErrorOfHypothesisAndNegation(data, h):
    posData, negData = ([(x, y) for (x, y) in data if h(x) == 1],
                        [(x, y) for (x, y) in data if h(x) == -1])
    posError = sum(y == -1 for (x, y) in posData) + sum(y == 1 for (x, y) in negData)
    negError = sum(y == 1 for (x, y) in posData) + sum(y == -1 for (x, y) in negData)
    return min(posError, negError) / len(data)


def bestThreshold(data, index, errorFunction):
    '''Compute best threshold for a given feature. Returns (threshold, error)'''
    thresholds = [point[index] for (point, label) in data]

    def makeThreshold(t):
        return lambda x: 1 if x[index] >= t else -1

    errors = [(threshold, errorFunction(data, makeThreshold(threshold)))
              for threshold in thresholds]
    return min(errors, key=lambda p: p[1])
```
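Not from the post, but as a quick sanity check, here are those same two functions run on a tiny hand-made dataset (restated so the snippet runs on its own). Feature 0 separates the labels perfectly at threshold 3, and that is the threshold the search finds:

```python
# Restating the two helpers for a self-contained check.
def minLabelErrorOfHypothesisAndNegation(data, h):
    posData, negData = ([(x, y) for (x, y) in data if h(x) == 1],
                        [(x, y) for (x, y) in data if h(x) == -1])
    posError = sum(y == -1 for (x, y) in posData) + sum(y == 1 for (x, y) in negData)
    negError = sum(y == 1 for (x, y) in posData) + sum(y == -1 for (x, y) in negData)
    return min(posError, negError) / len(data)

def bestThreshold(data, index, errorFunction):
    thresholds = [point[index] for (point, label) in data]
    def makeThreshold(t):
        return lambda x: 1 if x[index] >= t else -1
    errors = [(t, errorFunction(data, makeThreshold(t))) for t in thresholds]
    return min(errors, key=lambda p: p[1])

# a tiny 1-D dataset where feature 0 cleanly splits the labels at value 3
sample = [([1], -1), ([2], -1), ([3], 1), ([4], 1)]
threshold, err = bestThreshold(sample, 0, minLabelErrorOfHypothesisAndNegation)
```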

Here we allow the user to provide a generic error function that the weak learner tries to minimize, but in our case it will just be `minLabelErrorOfHypothesisAndNegation`. In words, our threshold function will label an example as $+1$ if the chosen feature has value greater than the threshold, and $-1$ otherwise. But we might want to do the opposite, labeling $-1$ above the threshold and $+1$ below. The `bestThreshold` function doesn’t care, it just wants to know which threshold value is the best. Then we compute what the right hypothesis is in the next function.

```python
def buildDecisionStump(drawExample, errorFunction=defaultError):
    # find the index of the best feature to split on, and the best threshold for
    # that index. A labeled example is a pair (example, label) and drawExample()
    # accepts no arguments and returns a labeled example.
    data = [drawExample() for _ in range(500)]

    bestThresholds = [(i,) + bestThreshold(data, i, errorFunction)
                      for i in range(len(data[0][0]))]
    feature, thresh, _ = min(bestThresholds, key=lambda p: p[2])

    stump = Stump()
    stump.splitFeature = feature
    stump.splitThreshold = thresh
    stump.gtLabel = majorityVote([x for x in data if x[0][feature] >= thresh])
    stump.ltLabel = majorityVote([x for x in data if x[0][feature] < thresh])
    return stump
```

It’s a little bit inefficient, but no matter. To illustrate the PAC framework we emphasize that the weak learner needs nothing except the ability to draw from a distribution. It does so, and then it computes the best threshold and creates a new stump reflecting that. The `majorityVote` function just picks the most common label of the examples in the list. Note that drawing 500 samples is arbitrary, and in general we might increase it to increase the success probability of finding a good hypothesis. In fact, when proving PAC-learning theorems the number of samples drawn often depends on the accuracy and confidence parameters $\varepsilon, \delta$. We omit them here for simplicity.

So suppose we have a weak learner $A$ for a concept class $\mathsf{H}$, and for any concept $c$ from $\mathsf{H}$ it can produce with probability at least $1 - \delta$ a hypothesis with error at most $1/2 - \eta$. How can we modify this algorithm to get a strong learner? Here is an idea: we can maintain a large number of separate instances of the weak learner $A$, run them on our dataset, and then combine their hypotheses with a majority vote. In code this might look like the following python snippet. For now examples are binary vectors and the labels are $\pm 1$, so the sign of a real number will be its label.

```python
def boost(learner, data, rounds=100):
    m = len(data)
    learners = [learner(random.sample(data, m // rounds)) for _ in range(rounds)]

    def hypothesis(example):
        return sign(sum(1 / rounds * h(example) for h in learners))

    return hypothesis
```

This is a bit too simplistic: what if the majority of the weak learners are wrong? In fact, with an overly naive mindset one might imagine a scenario in which the different instances of $A$ have high disagreement, so is the prediction going to depend on which random subset the learner happens to get? We can do better: instead of taking a majority vote we can take a *weighted* majority vote. That is, give the weak learner a random subset of your data, and then test its hypothesis on the data to get a good estimate of its error. Then you can use this error to say whether the hypothesis is any good, and give good hypotheses high weight and bad hypotheses low weight (proportionally to the error). Then the “boosted” hypothesis would take a *weighted* majority vote of all your hypotheses on an example. This might look like the following.

```python
# data is a list of (example, label) pairs
def error(hypothesis, data):
    return sum(1 for x, y in data if hypothesis(x) != y) / len(data)


def boost(learner, data, rounds=100):
    m = len(data)
    weights = [0] * rounds
    learners = [None] * rounds

    for t in range(rounds):
        learners[t] = learner(random.sample(data, m // rounds))
        weights[t] = 1 - error(learners[t], data)

    def hypothesis(example):
        return sign(sum(weight * h(example)
                        for (h, weight) in zip(learners, weights)))

    return hypothesis
```

This might be better, but we can do something even cleverer. Rather than use the estimated error just to say something about the hypothesis, we can identify the mislabeled examples in a round and somehow *encourage* $A$ to do better at classifying those examples in later rounds. This turns out to be the key insight, and it’s why the algorithm is called AdaBoost (Ada stands for “adaptive”). We’re adaptively modifying the distribution over the training data we feed to $A$ based on which data $A$ learns “easily” and which it does not. So as the boosting algorithm runs, the distribution given to $A$ has more and more probability weight on the examples that $A$ misclassified. And, this is the key, $A$ has the guarantee that it will weak learn *no matter what the distribution over the data is*. Of course, its error is also measured relative to the adaptively chosen distribution, and the crux of the argument will be relating this error to the error on the original distribution we’re trying to strong learn.

To implement this idea in mathematics, we will start with a fixed sample $X = \{x_1, \ldots, x_m\}$ drawn from $D$ and assign a weight $\mu_i \geq 0$ to each $x_i$. Call $c(x)$ the true label of an example. Initially, set each $\mu_i$ to be 1. Since our dataset can have repetitions, normalizing the $\mu_i$ to a probability distribution gives an estimate of $D$. Now we’ll pick some “update” parameter $\zeta > 1$ (this is intentionally vague). Then we’ll repeat the following procedure for some number of rounds $t = 1, \ldots, T$.

- Renormalize the $\mu_i$ to a probability distribution.
- Train the weak learner $A$, and provide it with a simulated distribution $D'$ that draws examples $x_i$ according to their weights $\mu_i$. The weak learner outputs a hypothesis $h_t$.
- For every example $x_i$ mislabeled by $h_t$, update $\mu_i$ by replacing it with $\zeta \mu_i$.
- For every correctly labeled example $x_i$, replace $\mu_i$ with $\mu_i / \zeta$.

At the end our final hypothesis will be a weighted majority vote of all the $h_t$, where the weights depend on the amount of error in each round. Note that when the weak learner misclassifies an example we *increase* the weight of that example, which means we’re increasing the likelihood it will be drawn in future rounds. In particular, in order to maintain good accuracy the weak learner will eventually have to produce a hypothesis that fixes its mistakes in previous rounds. Likewise, when examples are correctly classified, we reduce their weights. So examples that are “easy” to learn are given lower emphasis. And that’s it. That’s the prize-winning idea. It’s elegant, powerful, and easy to understand. The rest is working out the values of all the parameters and proving it does what it’s supposed to.
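To make the update rule concrete, here is one round of the reweighting above in Python, with a made-up update parameter $\zeta = 2$ and a made-up pattern of mistakes:

```python
# One round of the reweighting scheme: the mislabeled example's weight is
# multiplied by zeta, correct ones are divided by zeta, and the weights are
# renormalized to a probability distribution. Numbers are illustrative.
zeta = 2.0
weights = [1.0, 1.0, 1.0, 1.0]
correct = [True, True, False, True]   # the weak learner missed example 2

weights = [w / zeta if ok else w * zeta for w, ok in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]
# the single mistake now carries most of the probability mass: 4/7 of it
```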

Let’s jump straight into a Python program that performs boosting.

First we pick a data representation. Examples are pairs $(x, c(x))$ whose type is the tuple `(object, int)`. Our labels will be $\pm 1$ valued. Since our algorithm is entirely black-box, we don’t need to assume anything about how the examples are represented. Our dataset is just a list of labeled examples, and the weights are floats. So our boosting function prototype looks like this:

```python
# boost: [(object, int)], learner, int -> (object -> int)
# boost the given weak learner into a strong learner
def boost(examples, weakLearner, rounds):
    ...
```

And a weak learner, as we saw for decision stumps, has the following function prototype.

```python
# weakLearner: (() -> (list, label)) -> (list -> label)
# accept as input a function that draws labeled examples from a distribution,
# and output a hypothesis list -> label
def weakLearner(draw):
    ...
    return hypothesis
```

Assuming we have a weak learner, we can fill in the rest of the boosting algorithm with some mysterious details. First, a helper function to compute the weighted error of a hypothesis on some examples. It also returns the correctness of the hypothesis on each example, which we’ll use later.

```python
# compute the weighted error of a given hypothesis on a distribution
# return all of the hypothesis results and the error
def weightedLabelError(h, examples, weights):
    hypothesisResults = [h(x) * y for (x, y) in examples]  # +1 if correct, else -1
    return hypothesisResults, sum(w for (z, w) in zip(hypothesisResults, weights) if z < 0)
```

Next we have the main boosting algorithm. Here `draw` is a function that accepts as input a list of floats that sum to 1 and picks an index proportionally to the weight of the entry at that index.

```python
def boost(examples, weakLearner, rounds):
    distr = normalize([1.] * len(examples))
    hypotheses = [None] * rounds
    alpha = [0] * rounds

    for t in range(rounds):
        def drawExample():
            return examples[draw(distr)]

        hypotheses[t] = weakLearner(drawExample)
        hypothesisResults, error = weightedLabelError(hypotheses[t], examples, distr)
        alpha[t] = 0.5 * math.log((1 - error) / (.0001 + error))
        distr = normalize([d * math.exp(-alpha[t] * h)
                           for (d, h) in zip(distr, hypothesisResults)])
        print("Round %d, error %.3f" % (t, error))

    def finalHypothesis(x):
        return sign(sum(a * h(x) for (a, h) in zip(alpha, hypotheses)))

    return finalHypothesis
```

The code is almost clear. For each round we run the weak learner on our hand-crafted distribution. We compute the error of the resulting hypothesis on that distribution, and then we update the distribution in this mysterious way depending on some alphas and logs and exponentials. In particular, we use the expression , the product of the true label and predicted label, as computed in `weightedLabelError`

. As the comment says, this will either be or depending on whether the predicted label is correct or incorrect, respectively. The choice of those strange logarithms and exponentials are the result of some optimization: they allow us to minimize training error as quickly as possible (we’ll see this in the proof to follow). The rest of this section will prove that this works when the weak learner is correct. One small caveat: in the proof we will assume the error of the hypothesis is not zero (because a weak learner is not supposed to return a perfect hypothesis!), but in practice we want to avoid dividing by zero so we add the small 0.0001 to avoid that. As a quick self-check: why wouldn’t we just stop in the middle and output that “perfect” hypothesis? (What distribution is it “perfect” over? It might not be the original distribution!)

If we wanted to define the algorithm in pseudocode (which helps for the proof) we would write it this way. Given rounds, start with being the uniform distribution over labeled input examples , where has label . Say there are input examples.

- For each :
- Let be the weak learning algorithm run on .
- Let be the error of on .
- Let .
- Update each entry of by the rule , where is chosen to normalize to a distribution.

- Output as the final hypothesis the sign of , i.e. .

Now let’s prove this works. That is, we’ll prove the error on the input dataset (the training set) decreases exponentially quickly in the number of rounds. Then we’ll run it on an example and save generalization error for the next post. Over many years this algorithm and tweaked so that the proof is very straightforward.

**Theorem:** If AdaBoost is given a weak learner and stopped on round , and the edge over random choice satisfies , then the training error of the AdaBoost is at most .

*Proof. * Let be the number of examples given to the boosting algorithm. First, we derive a closed-form expression for in terms of the normalization constants . Expanding the recurrence relation gives

Because the starting distribution is uniform, and combining the products into a sum of the exponents, this simplifies to

Next, we show that the training error is bounded by the product of the normalization terms . This part has always seemed strange to me, that the training error of boosting depends on the factors you need to normalize a distribution. But it’s just a different perspective on the multiplicative weights scheme. If we didn’t explicitly normalize the distribution at each step, we’d get nonnegative weights (which we could convert to a distribution just for the sampling step) and the training error would depend on the product of the weight updates in each step. Anyway let’s prove it.

The training error is defined to be . This can be written with an indicator function as follows:

Because the sign of determines its prediction, the product is negative when is incorrect. Now we can do a strange thing, we’re going to upper bound the indicator function (which is either zero or one) by . This works because if predicts correctly then the indicator function is zero while the exponential is greater than zero. On the other hand if is incorrect the exponential is greater than one because when . So we get

and rearranging the formula for from the first part gives

Since the forms a distribution, it sums to 1 and we can factor the out. So the training error is just bounded by the .

The last step is to bound the product of the normalization factors. It’s enough to show that . The normalization constant is just defined as the sum of the numerator of the terms in step D. i.e.

We can split this up into the correct and incorrect terms (that contribute to or in the exponent) to get

But by definition the sum of the incorrect part of is and for the correct part. So we get

Finally, since this is an upper bound we want to pick so as to minimize this expression. With a little calculus you can see the we chose in the algorithm pseudocode achieves the minimum, and this simplifies to . Plug in to get and use the calculus fact that to get as desired.

This is fine and dandy, it says that if you have a true weak learner then the training error of AdaBoost vanishes exponentially fast in the number of boosting rounds. But what about generalization error? What we really care about is whether the hypothesis produced by boosting has low error on the original distribution as a whole, not just the training sample we started with.

One might expect that if you run boosting for more and more rounds, then it will eventually overfit the training data and its generalization accuracy will degrade. However, in practice this is not the case! The longer you boost, even if you get down to zero training error, the *better *generalization tends to be. For a long time this was sort of a mystery, and we’ll resolve the mystery in the sequel to this post. For now, we’ll close by showing a run of AdaBoost on some real world data.

The “adult” dataset is a standard dataset taken from the 1994 US census. It tracks a number of demographic and employment features (including gender, age, employment sector, etc.) and the goal is to predict whether an individual makes over $50k per year. Here are the first few lines from the training set.

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K 50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K 38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K 53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K 28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K 37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K

We perform some preprocessing of the data, so that the categorical examples turn into binary features. You can see the full details in the github repository for this post; here are the first few post-processed lines (my newlines added).

>>> from data import adult >>> train, test = adult.load() >>> train[:3] [((39, 1, 0, 0, 0, 0, 0, 1, 0, 0, 13, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2174, 0, 40, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), -1), ((50, 1, 0, 1, 0, 0, 0, 0, 0, 0, 13, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), -1), ((38, 1, 1, 0, 0, 0, 0, 0, 0, 0, 9, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 40, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), -1)]

Now we can run boosting on the training data, and compute its error on the test data.

>>> from boosting import boost >>> from data import adult >>> from decisionstump import buildDecisionStump >>> train, test = adult.load() >>> weakLearner = buildDecisionStump >>> rounds = 20 >>> h = boost(train, weakLearner, rounds) Round 0, error 0.199 Round 1, error 0.231 Round 2, error 0.308 Round 3, error 0.380 Round 4, error 0.392 Round 5, error 0.451 Round 6, error 0.436 Round 7, error 0.459 Round 8, error 0.452 Round 9, error 0.432 Round 10, error 0.444 Round 11, error 0.447 Round 12, error 0.450 Round 13, error 0.454 Round 14, error 0.505 Round 15, error 0.476 Round 16, error 0.484 Round 17, error 0.500 Round 18, error 0.493 Round 19, error 0.473 >>> error(h, train) 0.153343 >>> error(h, test) 0.151711

This isn’t too shabby. I’ve tried running boosting for more rounds (a hundred) and the error doesn’t seem to improve by much. This implies that finding the best decision stump is not a weak learner (or at least it fails for this dataset), and we can see that indeed the training errors across rounds roughly tend to 1/2.

Though we have not compared our results above to any baseline, AdaBoost seems to work pretty well. This is kind of a meta point about theoretical computer science research. One spends years trying to devise algorithms that work in theory (and finding conditions under which we can get good algorithms in theory), but when it comes to practice we can’t do anything but hope the algorithms will work well. It’s kind of amazing that something like Boosting works in practice. It’s not clear to me that weak learners should exist at all, even for a given real world problem. But the results speak for themselves.

Next time we’ll get a bit deeper into the theory of boosting. We’ll derive the notion of a “margin” that quantifies the confidence of boosting in its prediction. Then we’ll describe (and maybe prove) a theorem that says if the “minimum margin” of AdaBoost on the training data is large, then the generalization error of AdaBoost on the entire distribution is small. The notion of a margin is actually quite a deep one, and it shows up in another famous machine learning technique called the Support Vector Machine. In fact, it’s part of some recent research I’ve been working on as well. More on that in the future.

If you’re dying to learn more about Boosting, but don’t want to wait for me, check out the book Boosting: Foundations and Algorithms, by Freund and Schapire.

Until next time!

]]>

So Math ∩ Programming was actually born on June 12th 2011, but since my schedule is about to get super busy (more on that below) I’m writing my birthday post a month early.

First things first: I’m conducting a survey of my readers! If you want to help shape the future of Math ∩ Programming, please go take the survey. The results will be private, but I may at some point release some statistics about responses to some of the questions.

So my life is busy this summer. Here’s a few of the things I’ll be doing.

- I’m giving a poster at the Network Science 2015 workshop at Snowbird, Utah.
- I’m attending the huge ACM FCRC 2015 conference extravaganza. There will be 12 conferences in one convention center! I’d have reason to go to three or four of them, but they’re all timewise overlapping so I’ll probably approach it like a buffet and try a little piece of everything.
- I’m going to present a little paper at the ICML Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT ML, the best acronym) in France, and we’re working on beefing it up into a more substantial paper.
- I recently submitted my application for Hungarian citizenship, which I have by birthright but even after being submitted requires a lot of paperwork in a language I don’t speak very well.
- I’m rushing to hit some May and June paper deadlines, since…
**I’m getting married in June!**So exciting!

With regards to my blog, I have a big fat list of promises I have not delivered on. Including,

- Finishing the series on linear programming.
- Getting to any interesting quantum algorithms.
- Machine-learny things like support vector machines, boosting, VC-dimension, community detection methods, etc.
- Topology-related stuff and persistent homology.
- Following the big blog map I drew last year.

The reason is because I’ve been spending some time on other blog-related projects I plan to announce in the coming months, as well as spending more and more time on a growing list of research projects, and of course on wedding planning. So sadly this will be a very slow summer for M∩P.

And finally, starting in the Fall I will be on the academic job market. I’ll likely post again when September comes around, but my anticipated graduation date is May 2016. So that’s going to be quite the roller coaster I’m sure.

]]>

In the first article, Norvig runs a basic algorithm to recreate and improve the results from the comic, and in the second he beefs it up with some improved search heuristics. My favorite part about this topic is that regex golf can be phrased in terms of a problem called *set cover.* I noticed this when reading the comic, and was delighted to see Norvig use that as the basis of his algorithm.

The set cover problem shows up in other places, too. If you have a database of items labeled by users, and you want to find the smallest set of labels to display that covers every item in the database, you’re doing set cover. I hear there are applications in biochemistry and biology but haven’t seen them myself.

If you know what a set is (just think of the “set” or “hash set” type from your favorite programming language), then set cover has a simple definition.

**Definition (The Set Cover Problem): **You are given a finite set called a “universe” and sets each of which is a subset of . You choose some of the to ensure that every is in one of your chosen sets, and you want to minimize the number of you picked.

It’s called a “cover” because the sets you pick “cover” every element of . Let’s do a simple. Let and

Then the smallest possible number of sets you can pick is 2, and you can achieve this by picking both or both . The connection to regex golf is that you pick to be the set of strings you want to match, and you pick a set of regexes that match *some* of the strings in but *none* of the strings you want to avoid matching (I’ll call them ). If is such a regex, then you can form the set of strings that matches. Then if you find a small set cover with the strings , then you can “or” them together to get a single regex that matches all of but none of .

Set cover is what’s called **NP-hard,** and one implication is that we shouldn’t hope to find an efficient algorithm that will always give you the shortest regex for every regex golf problem. But despite this, there are *approximation algorithms* for set cover. What I mean by this is that there is a regex-golf algorithm that outputs a subset of the regexes matching all of , and the number of regexes it outputs is such-and-such close to the minimum possible number. We’ll make “such-and-such” more formal later in the post.

What made me sad was that Norvig didn’t go any deeper than saying, “We can try to approximate set cover, and the greedy algorithm is pretty good.” It’s true, but the ideas are richer than that! Set cover is a simple example to **showcase interesting techniques** from theoretical computer science. And perhaps ironically, in Norvig’s second post a header promised the article would discuss the theory of set cover, but I didn’t see any of what I think of as theory. Instead he partially analyzes the structure of the regex golf instances he cares about. This is useful, but not really theoretical in any way unless he can say something universal about those instances.

I don’t mean to bash Norvig. His articles were great! And in-depth theory was way beyond scope. So this post is just my opportunity to fill in some theory gaps. We’ll do three things:

- Show formally that set cover is NP-hard.
- Prove the approximation guarantee of the greedy algorithm.
- Show another (very different) approximation algorithm based on linear programming.

Along the way I’ll argue that by knowing (or at least seeing) the details of these proofs, one can get a better sense of what features to look for in the set cover instance you’re trying to solve. We’ll also see how set cover depicts the broader themes of theoretical computer science.

The first thing we should do is show that set cover is NP-hard. Intuitively what this means is that we can take some hard problem and **encode instances of** **inside set cover problems. **This idea is called a reduction, because solving problem will “reduce” to solving set cover, and the method we use to encode instance of as set cover problems will have a small amount of overhead. This is one way to say that set cover is “at least as hard as” .

The hard problem we’ll reduce to set cover is called **3-satisfiability (3-SAT). **In 3-SAT, the input is a formula whose variables are either true or false, and the formula is expressed as an OR of a bunch of clauses, each of which is an AND of three variables (or their negations). This is called 3-CNF form. A simple example:

The goal of the algorithm is to decide whether there is an assignment to the variables which makes the formula true. 3-SAT is one of the most fundamental problems we believe to be hard and, roughly speaking, by reducing it to set cover we include set cover in a class called NP-complete, and if any *one* of these problems can be solved efficiently, then they all can (this is the famous P versus NP problem, and an efficient algorithm would imply P equals NP).

So a reduction would consist of the following: you give me a formula in 3-CNF form, and I have to produce (in a way that depends on !) a universe and a choice of subsets in such a way that

has a true assignment of variables **if and only if **the corresponding set cover problem has a cover using sets.

In other words, I’m going to design a function from 3-SAT instances to set cover instances, such that is satisfiable if and only if has a set cover with sets.

Why do I say it only for sets? Well, if you can always answer this question then I claim you can find the minimum size of a set cover needed by doing a binary search for the smallest value of . So finding the minimum size of a set cover reduces to the problem of telling if theres a set cover of size .

Now let’s do the reduction from 3-SAT to set cover.

If you give me where each is a clause and the variables are denoted , then I will choose as my universe to be the set of all the clauses and indices of the variables (these are all just formal symbols). i.e.

The first part of will ensure I make all the clauses true, and the last part will ensure I don’t pick a variable to be both true and false at the same time.

To show how this works I have to pick my subsets. For each variable , I’ll make two sets, one called and one called . They will both contain in addition to the clauses which they make true when the corresponding literal is true (by literal I just mean the variable or its negation). For example, if uses the literal , then will contain but will not. Finally, I’ll set , the number of variables.

Now to prove this reduction works I have to prove two things: if my starting formula has a satisfying assignment I have to show the set cover problem has a cover of size . Indeed, take the sets for all literals that are set to true in a satisfying assignment. There can be at most true literals since half are true and half are false, so there will be at most sets, and these sets clearly cover all of because every literal has to be satisfied by some literal or else the formula isn’t true.

The reverse direction is similar: if I have a set cover of size , I need to use it to come up with a satisfying truth assignment for the original formula. But indeed, the sets that get chosen can’t include *both *a *and* its negation set , because there are of the elements , and each is only in the two . Just by counting if I cover all the indices , I already account for sets! And finally, since I have covered all the clauses, the literals corresponding to the sets I chose give exactly a satisfying assignment.

Whew! So set cover is NP-hard because I encoded this logic problem 3-SAT within its rules. If we think 3-SAT is hard (and we do) then set cover must also be hard. So if we can’t hope to solve it exactly we should try to approximate the best solution.

The method that Norvig uses in attacking the meta-regex golf problem is the greedy algorithm. The greedy algorithm is exactly what you’d expect: you maintain a list of the subsets you’ve picked so far, and at each step you pick the set that maximizes the number of new elements of that aren’t already covered by the sets in . In python pseudocode:

def greedySetCover(universe, sets): chosenSets = set() leftToCover = universe.copy() unchosenSets = sets covered = lambda s: leftToCover & s while universe != 0: if len(chosenSets) == len(sets): raise Exception("No set cover possible") nextSet = max(unchosenSets, key=lambda s: len(covered(s))) unchosenSets.remove(nextSet) chosenSets.add(nextSet) leftToCover -= nextSet return chosenSets

This is what theory has to say about the greedy algorithm:

**Theorem:** If it is possible to cover by the sets in , then the greedy algorithm always produces a cover that at worst has size , where is the size of the smallest cover. Moreover, this is asymptotically the best any algorithm can do.

One simple fact we need from calculus is that the following sum is asymptotically the same as :

*Proof. *[adapted from Wan] Let’s say the greedy algorithm picks sets in that order. We’ll set up a little value system for the elements of . Specifically, the value of each is 1, and in step we evenly distribute this unit value across all *newly covered* elements of . So for each covered element gets value , and if covers four new elements, each gets a value of 1/4. One can think of this “value” as a price, or energy, or unit mass, or whatever. It’s just an accounting system (albeit a clever one) we use to make some inequalities clear later.

In general call the value of element the value assigned to at the step where it’s first covered. In particular, the number of sets chosen by the greedy algorithm is just . We’re just bunching back together the unit value we distributed for each step of the algorithm.

Now we want to compare the sets chosen by greedy to the optimal choice. Call a smallest set cover . Let’s stare at the following inequality.

It’s true because each counts for a at most once in the left hand side, and in the right hand side the sets in must hit each at least once but may hit some more than once. Also remember the left hand side is equal to .

Now we want to show that the inner sum on the right hand side, , is at most . This will in fact prove the entire theorem: because each set has size at most , the inequality above will turn into

And so , which is the statement of the theorem.

So we want to show that . For each define to be the number of elements in not covered in . Notice that is the number of elements of that are covered for the first time in step . If we call the smallest integer for which , we can count up the differences up to step , we get

The rightmost term is just the cost assigned to the relevant elements at step . Moreover, because covers more new elements than (by definition of the greedy algorithm), the fraction above is at most . The end is near. For brevity I’ll drop the from .

And that proves the claim.

I have three postscripts to this proof:

- This is basically the
*exact*worst-case approximation that the greedy algorithm achieves. In fact, Petr Slavik proved in 1996 that the greedy gives you a set of size exactly in the worst case. - This is also the best approximation that
*any set cover algorithm*can achieve, provided that P is not NP. This result was basically known in 1994, but it wasn’t until 2013 and the use of some very sophisticated tools that the best possible bound was found with the smallest assumptions. - In the proof we used that to bound things, but if we knew that our sets (i.e. subsets matched by a regex) had sizes bounded by, say, , the same proof would show that the approximation factor is instead of . However, in order for that to be useful you need to be a constant, or at least to grow more slowly than any polynomial in , since e.g. . In fact, taking a second look at Norvig’s meta regex golf problem,
**some of his instances had this property!**Which means the greedy algorithm gives a much better approximation ratio for certain meta regex golf problems than it does for the worst case general problem. This is one instance where knowing the proof of a theorem helps us understand how to specialize it to our interests.

So we just said that you can’t possibly do better than the greedy algorithm for approximating set cover. There must be nothing left to say, job well done, right? Wrong! Our second analysis, based on linear programming, shows that instances with special features can have better approximation results.

In particular, if we’re guaranteed that each element occurs in at most of the sets , then the linear programming approach will give a -approximation, i.e. a cover whose size is at worst larger than OPT by a multiplicative factor of . In the case that is constant, we can beat our earlier greedy algorithm.

The technique is now a classic one in optimization, called LP-relaxation (LP stands for linear programming). The idea is simple. Most optimization problems can be written as *integer linear programs*, that is there you have variables and you want to maximize (or minimize) a linear function of the subject to some linear constraints. The thing you’re trying to optimize is called the *objective. *While in general solving integer linear programs is NP-hard, we can relax the “integer” requirement to , or something similar. The resulting linear program, called the *relaxed program*, can be solved efficiently using the simplex algorithm or another more complicated method.

The output of solving the relaxed program is an assignment of real numbers for the that optimizes the objective function. A key fact is that the solution to the relaxed linear program will be *at least as good* as the solution to the original integer program, because the optimal solution to the integer program is a valid candidate for the optimal solution to the linear program. Then the idea is that if we use some clever scheme to round the to integers, we can measure how much this degrades the objective and prove that it doesn’t degrade too much when compared to the optimum of the relaxed program, which means it doesn’t degrade too much when compared to the optimum of the integer program as well.

If this sounds wishy washy and vague don’t worry, we’re about to make it super concrete for set cover.

We’ll make a binary variable for each set in the input, and if and only if we include it in our proposed cover. Then the objective function we want to minimize is . If we call our elements , then we need to write down a linear constraint that says each element is hit by at least one set in the proposed cover. These constraints have to depend on the sets , but that’s not a problem. One good constraint for element is

In words, the only way that an will *not *be covered is if all the sets containing it have their . And we need one of these constraints for each . Putting it together, the integer linear program is

Once we understand this formulation of set cover, the relaxation is trivial. We just replace the last constraint with inequalities.

For a given candidate assignment to the , call the objective value (in this case ). Now we can be more concrete about the guarantees of this relaxation method. Let be the optimal value of the integer program and a corresponding assignment to achieving the optimum. Likewise let be the optimal things for the linear relaxation. We will prove:

**Theorem:** There is a deterministic algorithm that rounds to integer values so that the objective value , where is the maximum number of sets that any element occurs in. So this gives a -approximation of set cover.

*Proof. *Let be as described in the theorem, and call to make the indexing notation easier. The rounding algorithm is to set if and zero otherwise.

To prove the theorem we need to show two things hold about this new candidate solution :

- The choice of all for which covers every element.
- The number of sets chosen (i.e. ) is at most times more than .

Since , so if we can prove number 2 we get , which is the theorem.

So let’s prove 1. Fix any and we’ll show that element is covered by some set in the rounded solution. Call the number of times element occurs in the input sets. By definition , so . Recall was the optimal solution to the relaxed linear program, and so it must be the case that the linear constraint for each is satisfied: . We know that there are terms and they sums to at least 1, so not all terms can be smaller than (otherwise they’d sum to something less than 1). In other words, some variable in the sum is at least , and so is set to 1 in the rounded solution, corresponding to a set that contains . This finishes the proof of 1.

Now let’s prove 2. For each , we know that for each , the corresponding variable . In particular . Now we can simply bound the sum.

The second inequality is true because some of the are zero, but we can ignore them when we upper bound and just include all the . This proves part 2 and the theorem.

I’ve got some more postscripts to this proof:

- The proof works equally well when the sets are
*weighted*, i.e. your cost for picking a set is not 1 for every set but depends on some arbitrarily given constants . - We gave a deterministic algorithm rounding to , but one can get the same result (with high probability) using a randomized algorithm. The idea is to flip a coin with bias roughly times and set if and only if the coin lands heads at least once. The guarantee is no better than what we proved, but for some other problems randomness can help you get approximations where we don’t know of any deterministic algorithms to get the same guarantees. I can’t think of any off the top of my head, but I’m pretty sure they’re out there.
- For step 1 we showed that at least one term in the inequality for would be rounded up to 1, and this guaranteed we covered all the elements. A natural question is: why not also round up
*at most one*term of each of these inequalities? It might be that in the worst case you don’t get a better guarantee, but it would be a quick extra heuristic you could use to post-process a rounded solution. - Solving linear programs is slow. There are faster methods based on so-called “primal-dual” methods that use information about the dual of the linear program to construct a solution to the problem. Goemans and Williamson have a nice self-contained chapter on their website about this with a ton of applications.

Williamson and Shmoys have a large textbook called The Design of Approximation Algorithms. One problem is that this field is like a big heap of unrelated techniques, so it’s not like the book will build up some neat theoretical foundation that works for every problem. Rather, it’s messy and there are lots of details, but there are definitely diamonds in the rough, such as the problem of (and algorithms for) coloring 3-colorable graphs with “approximately 3″ colors, and the infamous unique games conjecture.

I wrote a post a while back giving conditions which, if a problem satisfies those conditions, the greedy algorithm will give a constant-factor approximation. This is much better than the worst case -approximation we saw in this post. Moreover, I also wrote a post about matroids, which is a characterization of problems where the greedy algorithm is actually optimal.

Set cover is one of the main tools that IBM’s AntiVirus software uses to detect viruses. Similarly to the regex golf problem, they find a set of strings that occurs source code in some viruses but not (usually) in good programs. Then they look for a small set of strings that covers all the viruses, and their virus scan just has to search binaries for those strings. Hopefully the size of your set cover is really small compared to the number of viruses you want to protect against. I can’t find a reference that details this, but that is understandable because it is proprietary software.

Until next time!


> Markov chain Monte Carlo (MCMC) is a technique for estimating by simulation the expectation of a statistic in a complex model. Successive random selections form a Markov chain, the stationary distribution of which is the target distribution. It is particularly useful for the evaluation of posterior distributions in complex Bayesian models. In the Metropolis–Hastings algorithm, items are selected from an arbitrary “proposal” distribution and are retained or not according to an acceptance rule. The Gibbs sampler is a special case in which the proposal distributions are conditional distributions of single components of a vector parameter. Various special cases and applications are considered.

I can only vaguely understand what the author is saying here (and really only because I know ahead of time what MCMC is). There are certainly references to more advanced things than what I’m going to cover in this post. But it seems very difficult to find an explanation of Markov Chain Monte Carlo *without* all the superfluous jargon. The “bullshit” here is the implicit claim of an author that such jargon is needed. Maybe it is needed to explain advanced applications (like attempts to do “inference in Bayesian networks”), but it is certainly not needed to define or analyze the basic ideas.

So to counter, here’s my own explanation of Markov Chain Monte Carlo, inspired by the treatment of John Hopcroft and Ravi Kannan.

Markov Chain Monte Carlo is a technique to solve the problem of *sampling from a complicated distribution*. Let me explain by the following imaginary scenario. Say I have a magic box which can estimate probabilities of baby names very well. I can give it a string like “Malcolm” and it will tell me the exact probability $p(\text{Malcolm})$ that you will choose this name for your next child. So there’s a distribution $p$ over all names, it’s very specific to your preferences, and for the sake of argument say this distribution is fixed and you don’t get to tamper with it.

Now comes the problem: I want to *efficiently draw* a name from this distribution $p$. This is the problem that Markov Chain Monte Carlo aims to solve. Why is it a problem? Because I have no idea what process you use to pick a name, so I can’t simulate that process myself. Here’s another method you could try: generate a name $x$ uniformly at random, ask the machine for $p(x)$, and then flip a biased coin with probability $p(x)$ and use $x$ if the coin lands heads. The problem with this is that there are exponentially many names! The variable $n$ here is the number of bits needed to write down a name $x$. So either the probabilities $p(x)$ will be exponentially small and I’ll be flipping for a very long time to get a single name, or else there will only be a few names with nonzero probability and it will take me exponentially many draws to find them. Inefficiency is the death of me.
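The rejection-sampling idea above can be sketched in a few lines of Python (the tiny name distribution here is made up purely for illustration; the real scenario has exponentially many names):

```python
import random

# A toy stand-in for the magic box: a fixed distribution over three names.
# (These names and probabilities are invented for illustration.)
p = {"Malcolm": 0.5, "Ada": 0.3, "Grace": 0.2}
names = list(p)

def rejection_sample():
    """Pick a name uniformly, then keep it with probability p(x)."""
    while True:
        x = random.choice(names)      # uniform guess among all names
        if random.random() < p[x]:    # biased coin that lands heads with probability p(x)
            return x

counts = {name: 0 for name in names}
for _ in range(10000):
    counts[rejection_sample()] += 1
```

With three names this terminates quickly, but with $2^n$ names the acceptance probabilities are typically exponentially small and the `while` loop spins for an exponentially long time.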

So this is a serious problem! Let’s restate it formally just to be clear.

**Definition (The sampling problem):** Let $p$ be a distribution over a finite set $X$. You are given black-box access to the probability distribution function $p(x)$, which outputs the probability of drawing $x \in X$ according to $p$. Design an efficient randomized algorithm $A$ which outputs an element of $X$ so that the probability of outputting $x$ is approximately $p(x)$. More generally, output a sample of elements from $X$ drawn according to $p$.

Assume that $A$ has access to only fair random coins, though this allows one to efficiently simulate flipping a biased coin of any desired probability.
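The parenthetical claim about fair coins can be made concrete: comparing fair bits against the binary expansion of the desired bias $p$ simulates a $p$-biased coin using only two fair flips in expectation. A sketch (the function name is mine, not from the source):

```python
import random

def biased_coin(p):
    """Return True with probability p, using only fair coin flips.

    Compare a stream of fair bits against the binary expansion of p;
    the first disagreement decides the outcome, after 2 flips on average.
    """
    while True:
        fair_bit = random.getrandbits(1)  # one fair coin flip
        p *= 2
        p_bit = int(p >= 1)               # next bit of p's binary expansion
        p -= p_bit
        if fair_bit != p_bit:
            return p_bit > fair_bit       # heads exactly when p's bit is the bigger one
```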

Notice that with such an algorithm we’d be able to do things like estimate the expected value of some random variable $f : X \to \mathbb{R}$. We could take a large sample via the solution to the sampling problem, and then compute the average value of $f$ on that sample. This is what a Monte Carlo method does when sampling is easy. In fact, the Markov Chain solution to the sampling problem will allow us to do the sampling *and* the estimation of $\mathbb{E}[f]$ in one fell swoop if you want.

But the core problem is really a sampling problem, and “Markov Chain Monte Carlo” would be more accurately called the “Markov Chain Sampling Method.” So let’s see why a Markov Chain could possibly help us.

“Markov chain” is essentially a fancy term for a random walk on a graph.

You give me a directed graph $G = (V, E)$, and for each edge $e = (u, v)$ you give me a number $p_{u,v} \in [0, 1]$. In order to make a random walk make sense, the $p_{u,v}$ need to satisfy the following constraint:

For any vertex $u \in V$, the values $p_{u,v}$ on all outgoing edges $(u, v)$ must sum to 1, i.e. form a probability distribution.

If this is satisfied then we can take a random walk on $G$ according to the probabilities as follows: start at some vertex $v_0$. Then pick an outgoing edge at random according to the probabilities on the outgoing edges, and follow it to $v_1$. Repeat if possible.

I say “if possible” because an arbitrary graph will not necessarily have any outgoing edges from a given vertex. We’ll need to impose some additional conditions on the graph in order to apply random walks to Markov Chain Monte Carlo, but in any case the idea of randomly walking is well-defined, and we call the whole object a *Markov chain.*
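A random walk like this takes only a few lines of code. Here is a sketch on a made-up three-vertex chain (the vertices and probabilities are invented; any table whose rows sum to 1 works):

```python
import random

# transitions[u][v] = probability of stepping from u to v.
# Each vertex's outgoing probabilities sum to 1, as required.
transitions = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.25, "b": 0.25, "c": 0.5},
    "c": {"a": 1.0},
}

def step(u):
    """Pick an outgoing edge of u at random according to its probabilities."""
    r = random.random()
    for v, prob in transitions[u].items():
        r -= prob
        if r < 0:
            return v
    return v  # guard against floating-point round-off

def walk(start, steps):
    """Run the random walk for the given number of steps; return where it ends."""
    u = start
    for _ in range(steps):
        u = step(u)
    return u
```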

Here is an example where the vertices in the graph correspond to emotional states.

In statistics land, they take the “state” interpretation of a random walk very seriously. They call the edge probabilities “state-to-state transitions.”

The main theorem we need to do anything useful with Markov chains is the stationary distribution theorem (sometimes called the “Fundamental Theorem of Markov Chains,” and for good reason). What it says intuitively is that for a very long random walk, the probability that you end at some vertex $v$ is independent of where you started! All of these probabilities taken together is called the *stationary distribution* of the random walk, and it is uniquely determined by the Markov chain.

However, for the reasons we stated above (“if possible”), the stationary distribution theorem is not true of every Markov chain. The main property we need is that the graph is *strongly connected.* Recall that a directed graph is called connected if, when you ignore direction, there is a path from every vertex to every other vertex. It is called *strongly connected* if you still get paths everywhere when considering direction. If we additionally require the stupid edge-case-catcher that no edge can have zero probability, then strong connectivity (of one component of a graph) is equivalent to the following property:

For every vertex $v$, an infinite random walk started at $v$ will return to $v$ with probability 1.

In fact it will return infinitely often. This property is called the *persistence* of the state $v$ by statisticians. I dislike this term because it appears to describe a property of a vertex, when to me it describes a property of the connected component containing that vertex. In any case, since in Markov Chain Monte Carlo we’ll be picking the graph to walk on (spoiler!) we will ensure the graph is strongly connected by design.

Finally, in order to describe the stationary distribution in a more familiar manner (using linear algebra), we will write the transition probabilities as a matrix $A$ where entry $A_{j,i} = p_{i,j}$ if there is an edge $(i, j) \in E$ and zero otherwise. Here the rows and columns correspond to vertices of $G$, and each *column* $i$ forms the probability distribution of going from state $i$ to some other state in one step of the random walk. Note $A$ is the transpose of the weighted adjacency matrix of the directed weighted graph $G$ where the weights are the transition probabilities (the reason I do it this way is because matrix-vector multiplication will have the matrix on the left instead of the right; see below).

This matrix allows me to describe things nicely using the language of linear algebra. In particular if you give me a basis vector $e_i$ interpreted as “the random walk is currently at vertex $i$,” then $Ae_i$ gives a vector whose $j$-th coordinate is the probability that the random walk would be at vertex $j$ after one more step in the random walk. Likewise, if you give me a probability distribution $q$ over the vertices, then $Aq$ gives a probability vector interpreted as follows:

If a random walk is in state $i$ with probability $q_i$, then the $j$-th entry of $Aq$ is the probability that after one more step in the random walk you get to vertex $j$.

Interpreted this way, the stationary distribution is a probability distribution $\pi$ such that $A\pi = \pi$, in other words $\pi$ is an eigenvector of $A$ with eigenvalue 1.

A quick side note for avid readers of this blog: this analysis of a random walk is exactly what we did back in the early days of this blog when we studied the PageRank algorithm for ranking webpages. There we called the matrix $A$ “a web matrix,” noted it was column stochastic (as it is here), and appealed to a special case of the Perron-Frobenius theorem to show that there is a unique maximal eigenvalue equal to one (with a one-dimensional eigenspace) whose eigenvector we used as a sort of “stationary distribution” and the final ranking of web pages. There we described an algorithm to actually find that eigenvector by iterated multiplication by $A$. The following theorem is essentially a variant of this algorithm but works under weaker conditions; for the web matrix we added additional “fake” edges that give the needed stronger conditions.

**Theorem:** Let $G$ be a strongly connected graph with associated edge probabilities $p_{u,v}$ forming a Markov chain. For a probability vector $x_0$, define $x_{t+1} = Ax_t$ for all $t \geq 0$, and let $v_t$ be the long-term average $v_t = \frac{1}{t}\sum_{s=1}^{t} x_s$. Then:

- There is a unique probability vector $\pi$ with $A\pi = \pi$.
- For all $x_0$, the limit $\lim_{t \to \infty} v_t = \pi$.

*Proof.* Since $v_t$ is a probability vector we just want to show that $\|Av_t - v_t\| \to 0$ as $t \to \infty$. Indeed, we can expand this quantity as

$\displaystyle Av_t - v_t = \frac{1}{t}\left(x_2 + x_3 + \dots + x_{t+1}\right) - \frac{1}{t}\left(x_1 + x_2 + \dots + x_t\right) = \frac{1}{t}(x_{t+1} - x_1)$

But $x_1$ and $x_{t+1}$ are unit vectors (their entries sum to 1), so their difference is at most 2, meaning $\|Av_t - v_t\| \leq \frac{2}{t} \to 0$. Now it’s clear that this limit does not depend on $x_0$. For uniqueness we will cop out and appeal to the Perron-Frobenius theorem that says any matrix of this form has a unique such (normalized) eigenvector.
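The theorem is easy to watch in action: iterate $x_{t+1} = Ax_t$ on a small column-stochastic matrix, average the iterates, and check that the average is (nearly) fixed by $A$. The 3-state matrix below is invented for illustration:

```python
# A[j][i] = probability of stepping from state i to state j; each column sums to 1.
A = [
    [0.5, 0.25, 1.0],
    [0.5, 0.25, 0.0],
    [0.0, 0.5,  0.0],
]

def apply(M, x):
    """Matrix-vector product Mx."""
    return [sum(M[j][i] * x[i] for i in range(len(x))) for j in range(len(M))]

x = [1.0, 0.0, 0.0]          # start the walk at state 0 with certainty
total = [0.0, 0.0, 0.0]
t = 2000
for _ in range(t):
    x = apply(A, x)
    total = [s + xi for s, xi in zip(total, x)]
v = [s / t for s in total]    # long-term average of x_1, ..., x_t

# The average is approximately stationary: the residual is at most 2/t.
residual = max(abs(a - b) for a, b in zip(apply(A, v), v))
```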

One additional remark is that, in addition to computing the stationary distribution by actually computing this average or using an eigensolver, one can analytically solve for it as the inverse of a particular matrix. Define $B = A - I_n$, where $I_n$ is the identity matrix. Let $C$ be $B$ with a row of ones appended to the bottom and the topmost row removed. Then one can show (quite opaquely) that the last column of $C^{-1}$ is $\pi$. We leave this as an exercise to the reader, because I’m pretty sure nobody uses this method in practice.

One final remark is about why we need to take an average over all our $x_t$ in the theorem above. There is an extra technical condition one can add to strong connectivity, called *aperiodicity*, which allows one to beef up the theorem so that $x_t$ itself converges to the stationary distribution. Rigorously, aperiodicity is the property that, regardless of where you start your random walk, after some sufficiently large number of steps the random walk has a positive probability of being at every vertex at every subsequent step. As an example of a graph where aperiodicity fails: an undirected cycle on an even number of vertices. In that case there will only be a positive probability of being at certain vertices every *other* step, and averaging those two long term sequences gives the actual stationary distribution.

One way to guarantee that your Markov chain is aperiodic is to ensure there is a positive probability of staying at any vertex. I.e., that your graph has a self-loop. This is what we’ll do in the next section.

Recall that the problem we’re trying to solve is to draw from a distribution $p$ over a finite set $X$ with probability function $p(x)$. The MCMC method is to construct a Markov chain whose stationary distribution is exactly $p$, even when you just have black-box access to evaluating $p(x)$. That is, you (implicitly) pick a graph $G$ and (implicitly) choose transition probabilities for the edges to make the stationary distribution $p$. Then you take a long enough random walk on $G$ and output the $x$ corresponding to whatever state you land on.

The easy part is coming up with a graph that has the right stationary distribution (in fact, “most” graphs will work). The hard part is to come up with a graph where you can prove that the convergence of a random walk to the stationary distribution is fast in comparison to the size of $X$. Such a proof is beyond the scope of this post, but the “right” choice of a graph is not hard to understand.

The one we’ll pick for this post is called the **Metropolis-Hastings** algorithm. The input is your black-box access to $p(x)$, and the output is a set of rules that implicitly define a random walk on a graph whose vertex set is $X$.

It works as follows: you pick some way to put $X$ on a lattice, so that each state corresponds to some vector in $\{0, 1, \dots, n\}^d$. Then you add (two-way directed) edges to all neighboring lattice points. For $d = 2$ it would look like this:

And for $d = 3$ it would look like this:

You have to be careful here to ensure the vertices you choose for $X$ are not disconnected, but in many applications $X$ is naturally already a lattice.

Now we have to describe the transition probabilities. Let $r$ be the maximum degree of a vertex in this lattice ($r = 2d$). Suppose we’re at vertex $i$ and we want to know where to go next. We do the following:

- Pick a neighbor $j$ with probability $1/r$ (there is some chance to stay at $i$).
- If you picked neighbor $j$ and $p(j) \geq p(i)$ then deterministically go to $j$.
- Otherwise, $p(j) < p(i)$, and you go to $j$ with probability $p(j)/p(i)$.
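Those steps translate directly into code. Here is a sketch (the names `metropolis_step`, `neighbors`, and `p` are mine; `neighbors` stands for the lattice adjacency and `p` for the black-box probability function), followed by a quick check on a 4-point path graph with a made-up target distribution:

```python
import random

def metropolis_step(i, neighbors, p, r):
    """One Metropolis-Hastings step from vertex i.

    neighbors(i) lists the lattice neighbors of i, p is the black-box
    probability function, and r is the maximum degree of the lattice.
    """
    nbrs = neighbors(i)
    k = random.randrange(r)
    if k >= len(nbrs):
        return i                 # leftover 1/r slots: stay at i
    j = nbrs[k]                  # each neighbor chosen with probability 1/r
    if p(j) >= p(i):
        return j                 # deterministically move to a likelier state
    if random.random() < p(j) / p(i):
        return j                 # move to a less likely state with probability p(j)/p(i)
    return i

# Sanity check: walk on the path 0-1-2-3 with the target distribution below.
target = [0.1, 0.2, 0.3, 0.4]
neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < 4]
state = 0
counts = [0, 0, 0, 0]
for _ in range(50000):
    state = metropolis_step(state, neighbors, lambda i: target[i], r=2)
    counts[state] += 1
# The visit frequencies counts[i] / 50000 approach target[i].
```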

We can state the probability weight $p_{i,j}$ on edge $(i, j)$ more compactly as

$\displaystyle p_{i,j} = \frac{1}{r}\min\left(1, \frac{p(j)}{p(i)}\right) \text{ for } j \neq i, \qquad p_{i,i} = 1 - \sum_{(i,j) \in E(G),\, j \neq i} p_{i,j}$

It is easy to check that this is indeed a probability distribution for each vertex $i$. So we just have to show that $p$ is the stationary distribution for this random walk.

Here’s a fact to do that: if a probability distribution $v$ with entries $v(x)$ for each $x \in X$ has the property that $v(x)p_{x,y} = v(y)p_{y,x}$ for all $x, y \in X$, then $v$ is the stationary distribution. To prove it, fix $x$ and take the sum of both sides of that equation over all $y$. The result is exactly the equation $v(x) = \sum_y v(y)p_{y,x}$, which is the same as $v = Av$. Since the stationary distribution is the unique vector satisfying this equation, $v$ has to be it.

Doing this with our chosen $p$ is easy, since $p(i)p_{i,j}$ and $p(j)p_{j,i}$ are both equal to $\frac{1}{r}\min(p(i), p(j))$ by applying a tiny bit of algebra to the definition. So we’re done! One can just randomly walk according to these probabilities and get a sample.

The last thing I want to say about MCMC is to show that you can estimate the expected value of a function $f : X \to \mathbb{R}$ simultaneously while random-walking through your Metropolis-Hastings graph (or any graph whose stationary distribution is $p$). By definition the expected value of $f$ is $\sum_{x \in X} f(x)p(x)$.

Now what we can do is compute the average value of $f(x)$ just among those states we’ve visited during our random walk. With a little bit of extra work you can show that this quantity will converge to the true expected value of $f$ at about the same time that the random walk converges to the stationary distribution. (Here the “about” means we’re off by a constant factor depending on $f$.) In order to prove this you need some extra tools I’m too lazy to write about in this post, but the point is that it works.
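As a tiny illustration of estimating $\mathbb{E}[f]$ along the walk: the lazy walk on a 3-cycle below has the uniform distribution as its stationary distribution, so the running average of $f$ over visited states should approach $(f(0) + f(1) + f(2))/3$ (the chain and the function $f$ are made up for illustration):

```python
import random

def f(x):
    return x * x   # an arbitrary function to average; here E[f] = (0 + 1 + 4) / 3

state = 0
total = 0.0
steps = 30000
for _ in range(steps):
    # lazy random walk on a 3-cycle: stay put, or move to either neighbor
    state = (state + random.choice([-1, 0, 1])) % 3
    total += f(state)
estimate = total / steps   # running average of f over the visited states
```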

The reason I did not start by describing MCMC in terms of estimating the expected value of a function is because the core problem is a sampling problem. Moreover, there are many applications of MCMC that need nothing more than a sample. For example, MCMC can be used to estimate the volume of an arbitrary (maybe high dimensional) convex set. See these lecture notes of Alistair Sinclair for more.

If demand is high enough, I could implement the Metropolis-Hastings algorithm in code (it wouldn’t be industry-strength, but perhaps illuminating? I’m not so sure…).

Until next time!
