A Quasipolynomial Time Algorithm for Graph Isomorphism: The Details

Update 2015-11-21: Ken Regan and Dick Lipton posted an article with some more details, and a high level overview of how the techniques fit into the larger picture of CS theory.

Update 2015-11-16: Laci has posted the talk on his website. It’s an hour and a half long, and I encourage you to watch it if you have the time :)

Laszlo Babai has claimed an astounding theorem, that the Graph Isomorphism problem can be solved in quasipolynomial time. On Tuesday I was at Babai’s talk on this topic (he has yet to release a preprint), and I’ve compiled my notes here. As in Babai’s talk, familiarity with basic group theory and graph theory is assumed, and if you’re a casual (i.e., math-phobic) reader looking to understand what the fuss is all about, this is probably not the right post for you. This post is research level theoretical computer science. We’re here for the juicy, glorious details.

Note: this blog post will receive periodic updates as my understanding of the details improve.

Laci during his lecture.

Laci during his lecture. Photo taken by me.

Standing room only at Laci's talk.

Standing room only at Laci’s talk. My advisor in the bottom right, my coauthor mid-left with the thumbs. Various famous researchers spottable elsewhere.

Background on Graph Isomorphism

I’ll start by giving a bit of background into why Graph Isomorphism (hereafter, GI) is such a famous problem, and why this result is important. If you’re familiar with graph isomorphism and the basics of complexity theory, skip to the next section where I get into the details.

GI is the following problem: given two graphs G = (V_G, E_G), H = (V_H, E_H), determine whether the graphs are isomorphic, that is, whether there is a bijection f : V_G \to V_H such that u,v are connected in G if and only if f(u), f(v) are connected in H. Informally, GI asks whether it’s easy to tell from two drawings of a graph whether the drawings actually represent the same graph. If you’re wondering why this problem might be hard, check out the following picture of the same graph drawn in three different ways.


Indeed, a priori the worst-case scenario is that one would have to try all n! ways to rearrange the nodes of the first graph and see if one rearrangement achieves the second graph. The best case scenario is that one can solve this problem efficiently, that is, with an algorithm whose worst-case runtime on graphs with n nodes and m edges is polynomial in n and m (this would show that GI is in the class P). However, nobody knows whether there is a polynomial time algorithm for GI, and it’s been a big open question in CS theory for over forty years. This is the direction that Babai is making progress toward, showing that there are efficient algorithms. He didn’t get a polynomial time algorithm, but he got something quite close, which is why everyone is atwitter.

It turns out that telling whether two graphs are isomorphic has practical value in some applications. I hear rumor that chemists use it to search through databases of chemicals for one with certain properties (one way to think of a chemical compound is as a graph). I also hear that some people use graph isomorphism to compare files, do optical character recognition, and analyze social networks, but it seems highly probable to me that GI is not the central workhorse (or even a main workhorse) in these fields. Nevertheless, the common understanding is that pretty much anybody who needs to solve GI on a practical level can do so efficiently. The heuristics work well. Even in Babai’s own words, getting better worst-case algorithms for GI is purely a theoretical enterprise.

So if GI isn’t vastly important for real life problems, why are TCS researchers so excited about it?

Well it’s known that GI is in the class NP, meaning if two graphs are isomorphic you can give me a short proof that I can verify in polynomial time (the proof is just a description of the function f : V_G \to V_H). And if you’ll recall that inside NP there is this class called NP-complete, which are the “hardest” problems in NP. Now most problems in NP that we care about are also NP-complete, but it turns out GI is not known to be NP-complete either. Now, for all we know P = NP and then the question about GI is moot, but in the scenario that most people believe P and NP are different, so it leaves open the question of where GI lies: does it have efficient algorithms or not?

So we have a problem which is in NP, it’s not known to be in P, and it’s not known to be NP-complete. One obvious objection is that it might be neither. In fact, there’s a famous theorem of Ladner that says if P is different from NP, then there must be problems in NP, not in P, and not NP-complete. Such problems are called “NP-intermediate.” It’s perfectly reasonable that GI is one of these problems. But there’s a bit of a catch.

See, Ladner’s theorem doesn’t provide a natural problem which is NP intermediate; what Ladner did in his theorem was assume P is not NP, and then use that assumption to invent a new problem that he could prove is NP intermediate. If you come up with a problem whose only purpose is to prove a theorem, then the problem is deemed unnatural. In fact, there is no known “natural” NP-intermediate problem (assuming P is not NP). The pattern in CS theory is actually that if we find a problem that might be NP-intermediate, someone later finds an efficient algorithm for it or proves it’s NP-complete. There is a small and dwindling list of such problems. I say dwindling because not so long ago the problem of telling whether an integer is prime was in this list. The symptoms are that one first solves the problem on many large classes of special cases (this is true of GI) or one gets a nice quasipolynomial-time algorithm (Babai’s claimed new result), and then finally it falls into P. In fact, there is even stronger evidence against it being NP-complete: if GI were NP-complete, the polynomial hierarchy would collapse. To the layperson, the polynomial hierarchy is abstruse complexity theoretic technical hoo-hah, but suffice it to say that most experts believe the hierarchy does not collapse, so this counts as evidence.

So indeed, it could be that GI will become the first ever problem which is NP-intermediate (assuming P is not NP), but from historical patterns it seems more likely that it will fall into P. So people are excited because it’s tantalizing: everyone believes it should be in P, but nobody can prove it. It’s right at the edge of the current state of knowledge about the theoretical capabilities and limits of computation.

This is the point at which I will start assuming some level of mathematical maturity.

The Main Result

The specific claim about graph isomorphism being made is the following:

Theorem: There is a deterministic algorithm for GI which runs in time 2^{O(\log^c(n))} for some constant c.

This is an improvement over the best previously known algorithm which had runtime 2^{\sqrt{n \log n}}. Note the \sqrt{n} in the exponent has been eliminated, which is a huge difference. Quantities which are exponential in some power of a logarithm are called “quasipolynomial.”

But the main result is actually a quasipolynomial time algorithm for a different, more general problem called string automorphism. In this context, given a set Xstring is a function from X to some finite alphabet (really it is a coloring of X, but we are going to use colorings in the usual sense later so we have to use a new name here). If the set X is given a linear ordering then strings on X really correspond to strings of length |X| over the alphabet. We will call strings x,y \in X.

Now given a set X and a group G acting on X, there is a natural action of G on strings over X, denoted x^\sigma, by permuting the indices x^{\sigma}(i) = x(\sigma(i)). So you can ask the natural question: given two strings x,y and a representation of a group G acting on X by a set of generating permutations of G, is there a \sigma \in G with x^\sigma = y? This problem is called the string isomorphism problem, and it’s clearly in NP.

Now if you call \textup{ISO}_G(x,y) the set of all permutations in G that map x to y, and you call \textup{Aut}_G(x) = \textup{ISO}_G(x,x), then the actual theorem Babai claims to have proved is the following.

Theorem: Given a generating set for a group G of permutations of a set X and a string x, there is a quasipolynomial time algorithm which produces a generating set of the subgroup \textup{Aut}_G(x) of G, i.e. the string automorphisms of x that lie in in G.

It is not completely obvious that GI reduces to the automorphism problem, but I will prove it does in the next section. Furthermore, the overview of Babai’s proof of the theorem follows an outline laid out by Eugene Luks in 1982, which involves a divide-and-conquer method for splitting the analysis of \textup{Aut}_G(x) into simpler and simpler subgroups as they are found.

Luks’s program

Eugene Luks was the first person to incorporate “serious group theory” (Babai’s words) into the study of graph isomorphism. Why would group theory help in a question about graphs? Let me explain with a lemma.

Lemma: GI is polynomial-time reducible to the problem of computing, given a graph X, a list of generators for the automorphism group of G, denoted \textup{Aut}(X).

Proof. Without loss of generality suppose X_1, X_2 are connected graphs. If we want to decide whether X_1, X_2 are isomorphic, we may form the disjoint union X = X_1 \cup X_2. It is easy to see that X_1 and X_2 are isomorphic if and only if some \sigma \in \textup{Aut}(X) swaps X_1 and X_2. Indeed, if any automorphism with this property exists, every generating set of \textup{Aut}(G) must contain one.


Similarly, the string isomorphism problem reduces to the problem of computing a generating set for \textup{Aut}_G(x) using a similar reduction to the one above. As a side note, while \textup{ISO}_G(x,y) can be exponentially large as a set, it is either the empty set, or a coset of \textup{Aut}_G(x) by any element of \textup{ISO}_G(x,y). So there are group-theoretic connections between the automorphism group of a string and the isomorphisms between two strings.

But more importantly, computing the automorphism group of a graph reduces to computing the automorphism subgroup of a particular group for a given string in the following way. Given a graph X on a vertex set V = \{ 1, 2, \dots, v \} write X as a binary string on the set of unordered pairs Z = \binom{V}{2} by mapping (i,j) \to 1 if and only if i and j are connected by an edge. The alphabet size is 2. Then \textup{Aut}(X) (automorphisms of the graph) induces an action on strings as a subgroup G_X of \textup{Aut}(Z) (automorphisms of strings). These induced automorphisms are exactly those which preserve proper encodings of a graph. Moreover, any string automorphism in G_X is an automorphism of X and vice versa. Note that since Z is larger than V by a factor of v^2, the subgroup G_X is much smaller than all of \textup{Aut}(Z).

Moreover, \textup{Aut}(X) sits inside the full symmetry group \textup{Sym}(V) of V, the vertex set of the starting graph, and \textup{Sym}(V) also induces an action G_V on Z. The inclusion is

\displaystyle \textup{Aut}(X) \subset \textup{Sym}(V)


\displaystyle G_X = \textup{Aut}(\textup{Enc}(X)) \subset G_V \subset \textup{Aut}(Z)


Call G = G_V the induced subgroup of permutations of strings-as-graphs. Now we just have some subgroup of permutations G of \textup{Aut}(Z), and we want to find a generating set for \textup{Aut}_G(x) (where x happens to be the encoding of a graph). That is exactly the string automorphism problem. Reduction complete.

Now the basic idea to compute \textup{Aut}_G(x) is to start from the assumption that \textup{Aut}_G(x) = G. We know it’s a subgroup, so it could be all of G; in terms of GI if this assumption were true it would mean the starting graph was the complete graph, but for string automorphism in general G can be whatever. Then we try to refute this belief by finding additional structure in \textup{Aut}_G(x), either by breaking it up into smaller pieces (say, orbits) or by constructing automorphisms in it. That additional structure allows us to break up G in a way that \textup{Aut}_G(x) is a subgroup of the product of the corresponding factors of G.

The analogy that Babai used, which goes back to graphs, is the following. If you have a graph X and you want to know its automorphisms, one thing you can do is to partition the vertices by degree. You know that an automorphism has to preserve the degree of an individual vertex, so in particular you can break up the assumption that \textup{Aut}(X) = \textup{Sym}(V) into the fact that \textup{Aut}(X) must be a subgroup of the product of the symmetry groups of the pieces of the partition; then you recurse. In this way you’ve hugely reduced the total number of automorphisms you need to consider. When the degrees get small enough you can brute-force search for automorphisms (and there is some brute-force searching in putting the pieces back together). But of course this approach fails miserably in the first step you start with a regular graph, so one would need to look for other kinds of structure.

One example is an equitable partition, which is a partition of vertices by degree relative to vertices in other blocks of the partition. So a vertex which has degree 3 but two degree 2 neighbors would be in a different block than a vertex with degree 3 and only 1 neighbor of degree 2. Finding these equitable partitions (which can be done in polynomial time) is one of the central tools used to attack GI. As an example of why it can be very helpful: in many regimes a Erdos-Renyi random graph has asymptotically almost surely a coarsest equitable partition which consists entirely of singletons. This is despite the fact that the degree sequences themselves are tightly constrained with high probability. This means that, if you’re given two Erdos-Renyi random graphs and you want to know whether they’re isomorphic, you can just separately compute the coarsest equitable partition for each one and see if the singleton blocks match up. That is your isomorphism.

Even still, there are many worst case graphs that resist being broken up by an equitable partition. A hard example is known as the Johnson graph, which we’ll return to later.

For strings the sorts of structures to look for are even more refined than equitable partitions of graphs, because the automorphism group of a graph can be partitioned into orbits which preserve the block structure of an equitable partition. But it still turns out that Johnson graphs admit parts of the automorphism group that can’t be broken up much by orbits.

The point is that when some useful substructure is found, it will “make progress” toward the result by breaking the problem into many pieces (say, n^{\log n} pieces) where each piece has size 9/10 the size of the original. So you get a recursion in the amount of time needed which looks like f(n) \leq n^{\log n} f(9n/10). If you call q(n) = n^{\log n} the quasipolynomial factor, then solving the recurrence gives f(n) \leq q(n)^{O(\log n)} which only adds an extra log factor in the exponent. So you keep making progress until the problem size is polylogarithmic, and then you brute force it and put the pieces back together in quasipolynomial time.

Two main lemmas, which are theorems in their own right

This is where the details start to get difficult, in part because Babai jumped back and forth between thinking of the object as a graph and as a string. The point of this in the lecture was to illustrate both where the existing techniques for solving GI (which were in terms of finding canonical graph substructures in graphs) break down.

The central graph-theoretic picture is that of “individualizing” a vertex by breaking it off from an existing equitable partition, which then breaks the equitable partition structure so you need to do some more (polytime) work to further refine it into an equitable partition again. But the point is that you can take all the vertices in a block, pick all possible ways to individualize them by breaking them into smaller blocks. If you traverse these possibilities in a canonical order, you will eventually get down to a partition of singletons, which is your “canonical labeling” of the graph. And if you do this process with two different graphs and you get to different canonical labelings, you had to have started with non-isomorphic graphs.

The problem is that when you get to a coarsest equitable partition, you may end up with blocks of size \sqrt{n}, meaning you have an exponential number of individualizations to check. This is the case with Johnson graphs, and in fact if you have a Johnson graph J(m,t) which has \binom{m}{t} vertices and you individualize fewer than m/10t if them, then you will only get down to blocks of size polynomially smaller than \binom{m}{t}, which is too big if you want to brute force check all individualizations of a block.

The main combinatorial lemma that Babai proves to avoid this problem is that the Johnson graphs are the only obstacle to finding efficient partitions.

Theorem (Babai 15): If X is a regular graph on m vertices, then by individualizing a polylog number of vertices we can find one of the three following things:

  1. A canonical coloring with each color class having at most 90% of all the nodes.
  2. A canonical equipartition of some subset of the vertices that has at least 90% of the nodes (i.e. a big color class from (1)).
  3. A canonically embedded Johnson graph on at least 90% of the nodes.

[Edit: I think that what Babai means by a “canonical coloring” is an equitable partition of the nodes (not to be confused with an equipartition), but I am not entirely sure. I have changed the language to reflect more clearly what was in the talk as opposed to what I think I understood from doing additional reading.]

The first two are apparently the “easy” cases in the sense that they allow for simple recursion that has already been known before Babai’s work. The hard part is establishing the last case (and this is the theorem whose proof sketch he has deferred for two more weeks). But once you have such a Johnson graph your life is much better, because (for a reason I did not understand) you can recurse on a problem of size roughly the square root of the starting size.

In discussing Johnson graphs, Babai said they were a source of “unspeakable misery” for people who want to solve GI quickly. At the same time, it is a “curse and a blessing,” as once you’ve found a Johnson graph embedded in your problem you can recurse to much smaller instances. This routine to find one of these three things is called the “split-or-Johnson” routine.

The analogue for strings (I believe this is true, but I’m a bit fuzzy on this part) is to find a “canonical” k-ary relational structure (where k is polylog in size) with some additional condition on the size of alternating subgroups of the automorphism group of the k-ary relational structure. Then you can “individualize” the points in the base of this relational structure and find analogous partitions and embedded Johnson schemes (a special kind of combinatorial design).

One important fact to note is that the split-or-Johnson routine breaks down at \log^3(n) size, and Babai has counterexamples that say his result is tight, so getting GI in P would have to bypass this barrier with a different idea.

The second crucial lemma has to do with giant homomorphisms, and this is the method by which Babai constructs automorphisms that bound \textup{Aut}_G(x) from below. As opposed to the split-or-Johnson lemma, which finds structure to bound the group from above by breaking it into simpler pieces. Fair warning: one thing I don’t yet understand is how these two routines interact in the final algorithm. My guess is they are essentially run in parallel (or alternate), but that guess is as good as wild.

Definition: A homomorphism \varphi: G \to S_m is called giant if the image of G is either the alternating group A_n or else all of S_m. I.e. \varphi is surjective, or almost so. Let \textup{Stab}_G(x) denote the stabilizer subgroup of x \in G. Then x is called “affected” by \varphi if \varphi|_{\textup{Stab}_g(x)} is not giant.

The central tool in Babai’s algorithm is the dichotomy between points that are affected and those that are not. The ability to decide this property in quasipolynomial time is what makes the divide and conquer work.

There is one deep theorem he uses that relates affected points to giant homomorphisms:

Theorem (Unaffected Stabilizer Theorem): Let \varphi: G \to S_m be a giant homomorphism and U \subset G the set of unaffected elements. Let G_{(U)} be the pointwise stabilizer of U, and suppose that m > \textup{max}(8, 2 + \log_2 n). Then the restriction \varphi : G_{(U)} \to S_m is still giant.

Babai claimed this was a nontrivial theorem, not because the proof is particularly difficult, but because it depends on the classification of finite simple groups. He claimed it was a relatively straightforward corollary, but it appears that this does not factor into the actual GI algorithm constructively, but only as an assumption that a certain loop invariant will hold.

To recall, we started with this assumption that \textup{Aut}_G(x) was the entire symmetry group we started with, which is in particular an assumption that the inclusion \varphi \to S_m is giant. Now you want to refute this hypothesis, but you can’t look at all of S_m because even the underlying set m has too many subsets to check. But what you can do is pick a test set A \subset [m] where |A| is polylogarithmic in size, and test whether the restriction of \varphi to the test set is giant in \textup{Sym}(A). If it is giant, we call A full.

Theorem (Babai 15): One can test the property of a polylogarithmic size test set being full in quasipolynomial time in m.

Babai claimed it should be surprising that fullness is a quasipolynomial time testable property, but more surprising is that regardless of whether it’s full or not, we can construct an explicit certificate of fullness or non-fullness. In the latter case, one can come up with an explicit subgroup which contains the image of the projection of the automorphism group onto the symmetry group of the test set. In addition, given two test sets A,B, one can efficiently compare the action between the two different test sets. And finding these non-full test sets is what allows one to construct the k-ary relations. So the output of this lower bound technique informs the upper bound technique of how to proceed.

The other outcome is that A could be full, and coming up with a certificate of fullness is harder. The algorithm sketched below claims to do it, and it involves finding enough “independent” automorphisms to certify that the projection is giant.

Now once you try all possible test sets, which gives \binom{m}{k}^2 many certificates (a quasipolynomial number), one has to aggregate them into a full automorphism of x, which Babai assured us was a group theoretic exercise.

The algorithm to test fullness (and construct a certificate) he called the Local Certificates Algorithm. It was sketched as follows: you are given as input a set A and a group G_A \subset G being the setwise stabilizer of A under \psi_A : G_A \to \textup{Sym}(A). Now let W be the group elements affected by \psi_A. You can be sure that at least one point is affected. Now you stabilize on W and get a refined subgroup of G_A, which you can use to compute newly affected elements, growing W in each step. By the unaffected stabilizer theorem, this preserves gianthood. Furthermore, in each step you get layers of W, and all of the stabilizers respect the structure of the previous layers. Babai described this as adding on “layers of a beard.”

The termination of this is either when W stops growing, in which case the projection is giant and W is our certificate of fullness (i.e. we get a rich family of automorphisms that are actually in our target automorphism group), or else we discover the projected ceases to be giant and W is our certificate of non-fullness. Indeed, the subgroup generated by these layers is a subgroup of \textup{Aut}_G(x), and the subgroup generated by the elements of a non-fullness certificate contain the automorphism group.

Not enough details?

This was supposed to be just a high-level sketch of the algorithm, and Babai is giving two more talks elaborating on the details. Unfortunately, I won’t be able to make it to his second talk in which he’ll discuss some of the core group theoretic ideas that go into the algorithm. I will, however, make it to his third talk in which he will sketch the proof of the split-or-Johnson routine. That is in two weeks from the time of this writing, and I will update this post with any additional insights then.

Babai has not yet released a preprint, and when I asked him he said “soon, soon.” Until then :)

This blog post is based on my personal notes from Laszlo Babai’s lecture at the University of Chicago on November 10, 2015. At the time of this writing, Babai’s work has not been peer reviewed, and my understanding of his lectures has large gaps and may be faulty. Do not put your life in danger based on information in this post.

Serial Dictatorships and House Allocation

I was recently an invited speaker in a series of STEM talks at Moraine Valley Community College. My talk was called “What can algorithms tell us about life, love, and happiness?” and it’s on Youtube now so you can go watch it. The central theme of the talk was the lens of computation, that algorithms and theoretical computer science can provide new and novel explanations for the natural phenomena we observe in the world.

One of the main stories I told in the talk is about stable marriages and the deferred acceptance algorithm, which we covered previously on this blog. However, one of the examples of the applications I gave was to kidney exchanges and school allocation. I said in the talk that it’s a variant of the stable marriages, but it’s not clear exactly how the two are related. This post will fill that gap and showcase some of the unity in the field of mechanism design.

Mechanism design, which is sometimes called market design, has a grand vision. There is a population of players with individual incentives, and given some central goal the designer wants to come up with a game where the self-interest of the players will lead them to efficiently achieve the designer’s goals. That’s what we’re going to do today with a class of problems called allocation problems.

As usual, all of the code we used in this post is available in a repository on this blog’s Github page.

Allocating houses with dictators

In stable marriages we had n men and n women and we wanted to pair them off one to one in a way that there were no mutual incentives to cheat. Let’s modify this scenario so that only one side has preferences and the other does not. The analogy here is that we have n people and n houses, but what do we want to guarantee? It doesn’t make sense to say that people will cheat on each other, but it does make sense to ask that there’s no way for people to swap houses and have everyone be at least as happy as before. Let’s formalize this.

Let A be a set of people (agents) and H be a set of houses, and n = |A| = |H|. A matching is a one-to-one map from A \to H. Each agent is assumed to have a strict preference over houses, and if we’re given two houses h_1, h_2 and a \in A prefers h_1 over h_2, we express that by saying h_1 >_a h_2. If we want to include the possibility that h_1 = h_2, we would say h_1 \geq_a h_2. I.e., either they’re the same house, or a strictly prefers h_1 more.

Definition: A matching M: A \to H is called pareto-optimal if there is no other matching M with both of the following properties:

  • Every agent is at least as happy in N as in M, i.e. for every a \in A, N(a) \geq_a M(a).
  • Some agent is strictly happier in N, i.e. there exists an a \in A with N(a) >_a M(a).

We say a matching N “pareto-dominates” another matching M if these two properties hold. As a side note, if you like abstract algebra you might notice that you can take matchings and form them into a lattice where the comparison is pareto-domination. If you go deep into the theory of lattices, you can use some nice fixed-point theorems to (non-constructively) prove the existences of optimal allocations in this context and for stable marriages. See this paper if you’re interested. Of course, we will give efficient algorithms to achieve our goals, which is how I prefer to live life.

The mechanism we’ll use to find such an optimal matching is extremely simple, and it’s called the serial dictatorship.

First you pick an arbitrary ordering of the agents and all houses are marked “available.” Then the first agent in the ordering picks their top choice, and you remove their choice from the available houses. Continue in this way down the list until you get to the end, and the outcome is guaranteed to be pareto-optimal.

Theorem: Serial dictatorship always produces a pareto-optimal matching.

Proof. Let M be the output of the algorithm. Suppose the theorem is false, that there is some N that pareto-dominates M. Let a be the first agent in the chosen ordering who gets a strictly better house in N than in M. Whatever house a gets, call it N(a), it has to be a house that was unavailable at the time in the algorithm when a got to pick (otherwise a would have picked N(a) during the algorithm!). This means that a took the house chosen by some agent b \in A whose turn to pick comes before a. But by assumption, a was the first agent to get a strictly better house, so b has to end up with a worse house. This contradicts that every agent is at least as happy in N than in M, so N cannot pareto-dominate M.


It’s easy enough to implement this in Python. Each agent will be represented by its list of preferences, each object will be an integer, and the matching will be a dictionary. The only thing we need to do is pick a way to order the agents, and we’ll just pick a random ordering. As usual, all of the code used in this post is available on this blog’s github page.

# serialDictatorship: [[int]], [int] -> {int: int}
# construct a pareto-optimal allocation of objects to agents.
def serialDictatorship(agents, objects, seed=None):
   if seed is not None:

   agentPreferences = agents[:]
   allocation = dict()
   availableHouses = set(objects)

   for agentIndex, preference in enumerate(agentPreferences):
      allocation[agentIndex] = max(availableHouses, key=preference.index)

   return allocation

And a test

agents = [['d','a','c','b'], # 4th in my chosen seed
          ['a','d','c','b'], # 3rd
          ['a','d','b','c'], # 2nd
          ['d','a','c','b']] # 1st
objects = ['a','b','c','d']
allocation = serialDictatorship(agents, objects, seed=1)
test({0: 'b', 1: 'c', 2: 'd', 3: 'a'}, allocation)

This algorithm is so simple it’s almost hard to believe. But it get’s better, because under some reasonable conditions, it’s the only algorithm that solves this problem.

Theorem [Svensson 98]: Serial dictatorship is the only algorithm that produces a pareto-optimal matching and also has the following three properties:

  • Strategy-proof: no agent can improve their outcomes by lying about their preferences at the beginning.
  • Neutral: the outcome of the algorithm is unchanged if you permute the items (i.e., does not depend on the index of the item in some list)
  • Non-bossy: No agent can change the outcome of the algorithm without also changing the object they receive.

And if we drop any one of these conditions there are other mechanisms that satisfy the rest. This theorem was proved in this paper by Lars-Gunnar Svensson in 1998, and it’s not particularly long or complicated. The proof of the main theorem is about a page. It would be a great exercise in reading mathematics to go through the proof and summarize the main idea (you could even leave a comment with your answer!).

Allocation with existing ownership

Now we switch to a slightly different problem. There are still n houses and n agents, but now every agent already “owns” a house. The question becomes: can they improve their situation by trading houses? It shouldn’t be immediately obvious whether this is possible, because a trade can happen in a “cycle” like the following:


Here A prefers the house of B, and B prefers the house of C, and C prefers the house of A, so they’d all benefit from doing a three-way cyclic trade. You can easily imagine the generalization to larger cycles.

This model was studied by Shapley and Scarf in 1974 (the same Shapley who did the deferred acceptance algorithm for stable marriages). Just as you’d expect, our goal is to find an optimal (re)-allocation of houses to agents in which there is no cycle the stands to improve. That is, there is no subset of agents that can jointly improve their standing. In formalizing this we call an “optimal” matching a core matching. Again A is a set of agents, and H is a set of houses.

Definition: A matching M: A \to H is called a core matching if there is no subset B \subset A and no matching N: A \to H with the following properties:

  • For every b \in B, N(b) is owned by some other agent in B (trading only happens within B).
  • Every agent b in B is at least as happy as before, i.e. N(b) \geq_b M(b) for all b.
  • Some agent in B strictly improves, i.e. for some b, N(b) >_b M(b).

We also call an algorithm individually rational if it ensures that every agent gets a house that is at least as good as their starting house. It should be clear that an algorithm which produces a core matching is individually rational, because for any agent a we can set B = \{a\}, i.e. force a to consider not trading at all, and being a core matching says that’s not better for a. Likewise, core matchings are also pareto-optimal by setting B = A.

It might seem like the idea of a “core” solution to an allocation problem is more general, and you’re right. You can define it for a very general setting of cooperative games and prove the existence of core matchings in that setting. See Wikipedia for more. As is our prerogative, we’ll achieve the same thing by constructing core matchings with an algorithm.

Indeed, the following theorem is due to Shapley & Scarf.

Theorem [Shapley-Scarf 74]: There is a core matching for every choice of preferences. Moreover, one can be found by an efficient algorithm.

Proof. The mechanism we’ll define is called the top trading cycles algorithm. We operate in rounds, and the first round goes as follows.

Form a directed graph with nodes in A \cup H. That is there is one node for each agent and one node for each house. Then we start by having each agent “point” to its most preferred house, and each house “points” to its original owner. That is, we add in directed edges from agents to their top pick, and houses to their owners. For example, say there are five agents A = \{ a, b, c, d, e, f \} and houses H = \{ 1,2,3,4,5 \} with a owning 1, and b owning 2, etc. but their favorite picks goes backwards, so that a prefers house 5 most, and b prefers 4 most, c prefers 3 (which c also owns), etc. Then the “pointing picture” in the first round looks like this.


The claim about such a graph is that there is always some directed cycle. In the example above, there are three. And moreover, we claim that no two cycles can share an edge. It’s easy to see there has to be a cycle: you can start at any agent and just follow the single outgoing edge until you find yourself repeating some vertices. By the fact that there is only one edge going out of any vertex, it follows that no two cycles could share an edge (or else in the last edge they share, there’d have to be a fork, i.e. two outgoing edges).

In the example above, you can start from A and follow the only edge and you get the cycle A -> 5 -> E -> 1 -> A. Similarly, starting at 4 would give you 4 -> D -> 2 -> B -> 4.

The point is that when you remove a cycle, you can have the agents in that cycle do the trade indicated by the cycle and remove the entire cycle from the graph. The consequence of this is that you have some agents who were pointing to houses that are removed, and so these agents revise their outgoing edge to point at their next most preferred available house. You can then continue removing cycles in this way until all the agents have been assigned a house.

The proof that this is a core matching is analogous to the proof that serial dictatorships were pareto-optimal. If there were some subset B and some other matching N under which B does better, then one of these agents has to be the first to be removed in a cycle during the algorithm’s run. But that agent got the best possible pick of a house, so by involving with B that agent necessarily gets a worse outcome.


This algorithm is commonly called the Top Trading Cycles algorithm, because it splits the set of agents and houses into a disjoint union of cycles, each of which is the best trade possible for every agent involved.

Implementing the Top Trading Cycles algorithm in code requires us to be able to find cycles in graphs, but that isn’t so hard. I implemented a simple data structure for a graph with helper functions that are specific to our kind of graph (i.e., every vertex has outdegree 1, so the algorithm to find cycles is simpler than something like Tarjan’s algorithm). You can see the data structure on this post’s github repository in the file graph.py. An example of using it:

>>> G = Graph([1,'a',2,'b',3,'c',4,'d',5,'e',6,'f'])
>>> G.addEdges([(1,'a'), ('a',2), (2,'b'), ('b',3), (3,'c'), ('c',1),
            (4,'d'), ('d',5), (5,'e'), ('e',4), (6,'f'), ('f',6)])
>>> G['d']
>>> G['d'].outgoingEdges
{('d', 5)}
>>> G['d'].anyNext() # return the target of any outgoing edge from 'd'
>>> G.delete('e')
>>> G[4].incomingEdges

Next we implement a function to find a cycle, and a function to extract the agents from a cycle. For latter we can assume the cycle is just represented by any agent on the cycle (again, because our graphs always have outdegree exactly 1).

# anyCycle: graph -> vertex
# find any vertex involved in a cycle
def anyCycle(G):
   visited = set()
   v = G.anyVertex()

   while v not in visited:
      v = v.anyNext()

   return v

# getAgents: graph, vertex -> set(vertex)
# get the set of agents on a cycle starting at the given vertex
def getAgents(G, cycle, agents):
   # make sure starting vertex is a house
   if cycle.vertexId in agents:
      cycle = cycle.anyNext()

   startingHouse = cycle
   currentVertex = startingHouse.anyNext()
   theAgents = set()

   while currentVertex not in theAgents:
      currentVertex = currentVertex.anyNext()
      currentVertex = currentVertex.anyNext()

   return theAgents

Finally, implementing the algorithm is just bookkeeping. After setting up the initial graph, the core of the routine is

def topTradingCycles(agents, houses, agentPreferences, initialOwnership):
   # form the initial graph


   allocation = dict()
   while len(G.vertices) &> 0:
      cycle = anyCycle(G)
      cycleAgents = getAgents(G, cycle, agents)

      # assign agents in the cycle their choice of house
      for a in cycleAgents:
         h = a.anyNext().vertexId
         allocation[a.vertexId] = h

      for a in agents:
         if a in G.vertices and G[a].outdegree() == 0:
            # update preferences

            G.addEdge(a, preferredHouse(a))

   return allocation

This mutates the graph in each round by deleting any cycle that was found, and adding new edges when the top choice of some agent is removed. Finally, to fill in the ellipses we just need to say how we represent the preferences. The input agentPreferences is a dictionary mapping agents to a list of all houses in order of preference. So again we can just represent the “top available pick” by an index and update that index when agents lose their top pick.

# maps agent to an index of the list agentPreferences[agent]
currentPreferenceIndex = dict((a,0) for a in agents)
preferredHouse = lambda a: agentPreferences[a][currentPreferenceIndex[a]]

Then to update we just have to replace the currentPreferenceIndex for each disappointed agent by its next best option.

      for a in agents:
         if a in G.vertices and G[a].outdegree() == 0:
            while preferredHouse(a) not in G.vertices:
               currentPreferenceIndex[a] += 1
            G.addEdge(a, preferredHouse(a))

And that’s it! We included a small suite of test cases which you can run if you want to play around with it more.

One final nice thing about this algorithm is that it almost generalizes the serial dictatorship algorithm. What you do is rather than have each house point to its original owner, you just have all houses point to the first agent in the pre-specified ordering. Then a cycle will always have length 2, the first agent gets their preferred house, and in the next round the houses now point to the second agent in the ordering, and so on.

Kidney exchange

We still need one more ingredient to see the bridge from allocation problems to kidney exchanges. The setting is like this: say Manuel needs a kidney transplant, and he’s lucky enough that his sister-in-law Anastasia wants to donate her kidney to Manuel. However, it turns out that Anastasia doesn’t the same right blood/antibody type for a donation, and so even though she has a kidney to give, they can’t give it to Manuel. Now one might say “just sell your kidney and use the money to buy a kidney with the right type!” Turns out that’s illegal; at some point we as a society decided that it’s immoral to sell organs. But it is legal to exchange a kidney for a kidney. So if Manuel and Anastasia can find a pair of people both of whom happen to have the right blood types, they can arrange for a swap.

But finding two people both of whom have the right blood types is unlikely, and we can actually do far better! We can turn this into a housing allocation problem as follows. Anyone with a kidney to donate is a “house,” and anyone who needs a kidney is an “agent.” And to start off with, we say that each agent “owns” the kidney of their willing donor. And the preferences of each agent are determined by which kidney donors have the right blood type (with ties split, say, by geographical distance). Then when you do the top trading cycles algorithm you find these chains where Anastasia, instead of donating to Manuel, donates to another person who has the right blood type. On the other end of the cycle, Manuel receives a kidney from someone with the right blood type.

The big twist is that not everyone who needs a kidney knows someone willing to donate. So there are agents who are “new” to the market and don’t already own a house. Moreover, maybe you have someone who is willing to donate a kidney but isn’t asking for anything in return.

Because of this the algorithm changes slightly. You can no longer guarantee the existence of a cycle (though you can still guarantee that no two cycles will share an edge). But as new people are added to the graph, cycles will eventually form and you can make the trades. There are a few extra details if you want to ensure that everyone is being honest (if you’re thinking about it like a market in the economic sense, where people could be lying about their preferences).

The resulting mechanism is called You Request My House I Get Your Turn (YRMHIGYT). In short, the idea is that you pick an order on the agents, say for kidney exchanges it’s the order in which the patients are diagnosed. And you have them add edges to the graph in that order. At each step you look for a cycle, and when one appears you remove it as usual. The twist, and the source of the name, is that when someone who has no house requests a house which is already owned, the agent who owns the house gets to jump forward in the queue. This turns out to make everything “fair” (in that everyone is guaranteed to get a house at least as good as the one they own) and one can prove analogous optimality theorems to the ones we did for serial dictatorship.

This mechanism was implemented by Alvin Roth in the US hospital system, and by some measure it has saved many lives. If you want to hear more about the process and how successful the kidney exchange program is, you can listen to this Freakonomics podcast episode where they interviewed Al Roth and some of the patients who benefited from this new allocation market.

It would be an excellent exercise to go deeper into the guts of the kidney exchange program (see this paper by Alvin Roth et al.), and implement the matching system in code. At the very least, implementing the YRMHIGYT mechanism is only a minor modification of our existing Top Trading Cycles code.

Until next time!

One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

Statistical parity

Here is one way to define fairness.

Your population is a set X and there is some known subset S \subset X that is a “protected” subset of the population. For discussion we’ll say X is people and S is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution D over X which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so D takes that into account. Generally we impose no restrictions on D, and the definition of fairness will have to work no matter what D is.

Now suppose we have a (possibly randomized) classifier h:X \to \{-1,1\} giving labels to X. When given a person x as input h(x)=1 if x gets a loan and -1 otherwise. The bias, or statistical imparity, of h on S with respect to X,D is the following quantity. In words, it is the difference between the probability that a random individual drawn from S is labeled 1 and the probability that a random individual from the complement S^C is labeled 1.

\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]

The probability is taken both over the distribution D and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis h:X \to \{-1,1\} is said to have statistical parity on D with respect to S up to bias \varepsilon if |\textup{bias}_h(X,S,D)| < \varepsilon.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point x \in X is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):   
   protectedClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] == protectedValue]   
   elseClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] != protectedValue]

   if len(protectedClass) == 0 or len(elseClass) == 0:
      raise Exception("One of the classes is empty!")
      protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
      elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

   return elseProb - protectedProb

Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
   return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)

Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":
   from data import adult
   train, test = adult.load(separatePointsAndLabels=True)

   # [(test name, (index, value))]
   tests = [('gender', (1,0)), 
            ('private employment', (2,1)), 
            ('asian race', (33,1)),
            ('divorced', (12, 1))]

   for (name, (index, value)) in tests:
      print("'%s' bias in training data: %.4f" %
         (name, labelBias(train[0], train[1], index, value)))

(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582

Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
>>> signedBias(train[0], indian, 1, 0)

So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least $50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set T in place of S^C, or looking at discrepancies among k different sub-populations or with m different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class S. I mean two things by this. First, you could have some historical data you want to train a classifier h on, and usually you’ll be given training labels for the data that tell you whether h(x) should be 1 or -1. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against S then the training labels are not trustworthy. As a consequence, achieving statistical parity for S necessarily reduces the accuracy of h. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals h behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

  1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
  2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!

The Boosting Margin, or Why Boosting Doesn’t Overfit

There’s a well-understood phenomenon in machine learning called overfitting. The idea is best shown by a graph:


Let me explain. The vertical axis represents the error of a hypothesis. The horizontal axis represents the complexity of the hypothesis. The blue curve represents the error of a machine learning algorithm’s output on its training data, and the red curve represents the generalization of that hypothesis to the real world. The overfitting phenomenon is marker in the middle of the graph, before which the training error and generalization error both go down, but after which the training error continues to fall while the generalization error rises.

The explanation is a sort of numerical version of Occam’s Razor that says more complex hypotheses can model a fixed data set better and better, but at some point a simpler hypothesis better models the underlying phenomenon that generates the data. To optimize a particular learning algorithm, one wants to set parameters of their model to hit the minimum of the red curve.

This is where things get juicy. Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t overfit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty.

Schapire, Freund, Bartlett, and Lee proposed a theoretical explanation for this based on the notion of a margin, and the goal of this post is to go through the details of their theorem and proof. Remember that the standard AdaBoost algorithm produces a set of weak hypotheses h_i(x) and a corresponding weight \alpha_i \in [-1,1] for each round i=1, \dots, T. The classifier at the end is a weighted majority vote of all the weak learners (roughly: weak learners with high error on “hard” data points get less weight).

Definition: The signed confidence of a labeled example (x,y) is the weighted sum:

\displaystyle \textup{conf}(x) = \sum_{i=1}^T \alpha_i h_i(x)

The margin of (x,y) is the quantity \textup{margin}(x,y) = y \textup{conf}(x). The notation implicitly depends on the outputs of the AdaBoost algorithm via “conf.”

We use the product of the label and the confidence for the observation that y \cdot \textup{conf}(x) \leq 0 if and only if the classifier is incorrect. The theorem we’ll prove in this post is

Theorem: With high probability over a random choice of training data, for any 0 < \theta < 1 generalization error of boosting is bounded from above by

\displaystyle \Pr_{\textup{train}}[\textup{margin}(x) \leq \theta] + O \left ( \frac{1}{\theta} (\textup{typical error terms}) \right )

In words, the generalization error of the boosting hypothesis is bounded by the distribution of margins observed on the training data. To state and prove the theorem more generally we have to return to the details of PAC-learning. Here and in the rest of this post, \Pr_D denotes \Pr_{x \sim D}, the probability over a random example drawn from the distribution D, and \Pr_S denotes the probability over a random (training) set of examples drawn from D.

Theorem: Let S be a set of m random examples chosen from the distribution D generating the data. Assume the weak learner corresponds to a finite hypothesis space H of size |H|, and let \delta > 0. Then with probability at least 1 - \delta (over the choice of S), every weighted-majority vote function f satisfies the following generalization bound for every \theta > 0.

\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{\log m \log |H|}{\theta^2} + \log(1/\delta)} \right )

In other words, this phenomenon is a fact about voting schemes, not boosting in particular. From now on, a “majority vote” function f(x) will mean to take the sign of a sum of the form \sum_{i=1}^N a_i h_i(x), where a_i \geq 0 and \sum_i a_i = 1. This is the “convex hull” of the set of weak learners H. If H is infinite (in our proof it will be finite, but we’ll state a generalization afterward), then only finitely many of the a_i in the sum may be nonzero.

To prove the theorem, we’ll start by defining a class of functions corresponding to “unweighted majority votes with duplicates:”

Definition: Let C_N be the set of functions f(x) of the form \frac{1}{N} \sum_{i=1}^N h_i(x) where h_i \in H and the h_i may contain duplicates (some of the h_i may be equal to some other of the h_j).

Now every majority vote function f can be written as a weighted sum of h_i with weights a_i (I’m using a instead of \alpha to distinguish arbitrary weights from those weights arising from Boosting). So any such f(x) defines a natural distribution over H where you draw function h_i with probability a_i. I’ll call this distribution A_f. If we draw from this distribution N times and take an unweighted sum, we’ll get a function g(x) \in C_N. Call the random process (distribution) generating functions in this way Q_f. In diagram form, the logic goes

f \to weights a_i \to distribution over H \to function in C_N by drawing N times according to H.

The main fact about the relationship between f and Q_f is that each is completely determined by the other. Obviously Q_f is determined by f because we defined it that way, but f is also completely determined by Q_f as follows:

\displaystyle f(x) = \mathbb{E}_{g \sim Q_f}[g(x)]

Proving the equality is an exercise for the reader.

Proof of Theorem. First we’ll split the probability \Pr_D[y f(x) \leq 0] into two pieces, and then bound each piece.

First a probability reminder. If we have two events A and B (in what’s below, this will be yg(x) \leq \theta/2 and yf(x) \leq 0, we can split up \Pr[A] into \Pr[A \textup{ and } B] + \Pr[A \textup{ and } \overline{B}] (where \overline{B} is the opposite of B). This is called the law of total probability. Moreover, because \Pr[A \textup{ and } B] = \Pr[A | B] \Pr[B] and because these quantities are all at most 1, it’s true that \Pr[A \textup{ and } B] \leq \Pr[A \mid B] (the conditional probability) and that \Pr[A \textup{ and } B] \leq \Pr[B].

Back to the proof. Notice that for any g(x) \in C_N and any \theta > 0, we can write \Pr_D[y f(x) \leq 0] as a sum:

\displaystyle \Pr_D[y f(x) \leq 0] =\\ \Pr_D[yg(x) \leq \theta/2 \textup{ and } y f(x) \leq 0] + \Pr_D[yg(x) > \theta/2 \textup{ and } y f(x) \leq 0]

Now I’ll loosen the first term by removing the second event (that only makes the whole probability bigger) and loosen the second term by relaxing it to a conditional:

\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_D[y g(x) \leq \theta / 2] + \Pr_D[yg(x) > \theta/2 \mid yf(x) \leq 0]

Now because the inequality is true for every g(x) \in C_N, it’s also true if we take an expectation of the RHS over any distribution we choose. We’ll choose the distribution Q_f to get

\displaystyle \Pr_D[yf(x) \leq 0] \leq T_1 + T_2

And T_1 (term 1) is

\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]

And T_2 is

\displaystyle \Pr_{x \sim D, g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0] = \mathbb{E}_D[\Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]]

We can rewrite the probabilities using expectations because (1) the variables being drawn in the distributions are independent, and (2) the probability of an event is the expectation of the indicator function of the event.

Now we’ll bound the terms T_1, T_2 separately. We’ll start with T_2.

Fix (x,y) and look at the quantity inside the expectation of T_2.

\displaystyle \Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]

This should intuitively be very small for the following reason. We’re sampling g according to a distribution whose expectation is f, and we know that yf(x) \leq 0. Of course yg(x) is unlikely to be large.

Mathematically we can prove this by transforming the thing inside the probability to a form suitable for the Chernoff bound. Saying yg(x) > \theta / 2 is the same as saying |yg(x) - \mathbb{E}[yg(x)]| > \theta /2, i.e. that some random variable which is a sum of independent random variables (the h_i) deviates from its expectation by at least \theta/2. Since the y‘s are all \pm 1 and constant inside the expectation, they can be removed from the absolute value to get

\displaystyle \leq \Pr_{g \sim Q_f}[g(x) - \mathbb{E}[g(x)] > \theta/2]

The Chernoff bound allows us to bound this by an exponential in the number of random variables in the sum, i.e. N. It turns out the bound is e^{-N \theta^2 / 8}.

Now recall T_1

\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]

For T_1, we don’t want to bound it absolutely like we did for T_2, because there is nothing stopping the classifier f from being a bad classifier and having lots of error. Rather, we want to bound it in terms of the probability that yf(x) \leq \theta. We’ll do this in two steps. In step 1, we’ll go from \Pr_D of the g‘s to \Pr_S of the g‘s.

Step 1: For any fixed g, \theta, if we take a sample S of size m, then consider the event in which the sample probability deviates from the true distribution by some value \varepsilon_N, i.e. the event

\displaystyle \Pr_D[yg(x) \leq \theta /2] > \Pr_{S, x \sim S}[yg(x) \leq \theta/2] + \varepsilon_N

The claim is this happens with probability at most e^{-2m\varepsilon_N^2}. This is again the Chernoff bound in disguise, because the expected value of \Pr_S is \Pr_D, and the probability over S is an average of random variables (it’s a slightly different form of the Chernoff bound; see this post for more). From now on we’ll drop the x \sim S when writing \Pr_S.

The bound above holds true for any fixed g,\theta, but we want a bound over all g and \theta. To do that we use the union bound. Note that there are only (N+1) possible choices for a nonnegative \theta because g(x) is a sum of N values each of which is either \pm1. And there are only |C_N| \leq |H|^N possibilities for g(x). So the union bound says the above event will occur with probability at most (N+1)|H|^N e^{-2m\varepsilon_N^2}.

If we want the event to occur with probability at most \delta_N, we can judiciously pick

\displaystyle \varepsilon_N = \sqrt{(1/2m) \log ((N+1)|H|^N / \delta_N)}

And since the bound holds in general, we can take expectation with respect to Q_f and nothing changes. This means that for any \delta_N, our chosen \varepsilon_N ensures that the following is true with probability at least 1-\delta_N:

\displaystyle \Pr_{D, g \sim Q_f}[yg(x) \leq \theta/2] \leq \Pr_{S, g \sim Q_f}[yg(x) \leq \theta/2] + \varepsilon_N

Now for step 2, we bound the probability that yg(x) \leq \theta/2 on a sample to the probability that yf(x) \leq \theta on a sample.

Step 2: The first claim is that

\displaystyle \Pr_{S, g \sim Q_f}[yg(x) \leq \theta / 2] \leq \Pr_{S} [yf(x) \leq \theta] + \mathbb{E}_{S}[\Pr_{g \sim Q_f}[yg(x) \leq \theta/2 \mid yf(x) \geq \theta]]

What we did was break up the LHS into two “and”s, when yf(x) > \theta and yf(x) \leq \theta (this was still an equality). Then we loosened the first term to \Pr_{S}[yf(x) \leq \theta] since that is only more likely than both yg(x) \leq \theta/2 and yf(x) \leq \theta. Then we loosened the second term again using the fact that a probability of an “and” is bounded by the conditional probability.

Now we have the probability of yg(x) \leq \theta / 2 bounded by the probability that yf(x) \leq 0 plus some stuff. We just need to bound the “plus some stuff” absolutely and then we’ll be done. The argument is the same as our previous use of the Chernoff bound: we assume yf(x) \geq \theta, and yet yg(x) \leq \theta / 2. So the deviation of yg(x) from its expectation is large, and the probability that happens is exponentially small in the amount of deviation. The bound you get is

\displaystyle \Pr_{g \sim Q}[yg(x) \leq \theta/2 \mid yf(x) > \theta] \leq e^{-N\theta^2 / 8}.

And again we use the union bound to ensure the failure of this bound for any N will be very small. Specifically, if we want the total failure probability to be at most \delta, then we need to pick some \delta_j‘s so that \delta = \sum_{j=0}^{\infty} \delta_j. Choosing \delta_N = \frac{\delta}{N(N+1)} works.

Putting everything together, we get that with probability at least 1-\delta for every \theta and every N, this bound on the failure probability of f(x):

\displaystyle \Pr_{x \sim D}[yf(x) \leq 0] \leq \Pr_{S, x \sim S}[yf(x) \leq \theta] + 2e^{-N \theta^2 / 8} + \sqrt{\frac{1}{2m} \log \left ( \frac{N(N+1)^2 |H|^N}{\delta} \right )}.

This claim is true for every N, so we can pick N that minimizes it. Doing a little bit of behind-the-scenes calculus that is left as an exercise to the reader, a tight choice of N is (4/ \theta)^2 \log(m/ \log |H|). And this gives the statement of the theorem.


We proved this for finite hypothesis classes, and if you know what VC-dimension is, you’ll know that it’s a central tool for reasoning about the complexity of infinite hypothesis classes. An analogous theorem can be proved in terms of the VC dimension. In that case, calling d the VC-dimension of the weak learner’s output hypothesis class, the bound is

\displaystyle \Pr_D[yf(x) \leq 0] \leq \Pr_S[yf(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)} \right )

How can we interpret these bounds with so many parameters floating around? That’s where asymptotic notation comes in handy. If we fix \theta \leq 1/2 and \delta = 0.01, then the big-O part of the theorem simplifies to \sqrt{(\log |H| \cdot \log m) / m}, which is easier to think about since (\log m)/m goes to zero very fast.

Now the theorem we just proved was about any weighted majority function. The question still remains: why is AdaBoost good? That follows from another theorem, which we’ll state and leave as an exercise (it essentially follows by unwrapping the definition of the AdaBoost algorithm from last time).

Theorem: Suppose that during AdaBoost the weak learners produce hypotheses with training errors \varepsilon_1, \dots , \varepsilon_T. Then for any \theta,

\displaystyle \Pr_{(x,y) \sim S} [yf(x) \leq \theta] \leq 2^T \prod_{t=1}^T \sqrt{\varepsilon_t^{(1-\theta)} (1-\varepsilon_t)^{(1+\theta)}}

Let’s interpret this for some concrete numbers. Say that \theta = 0 and \varepsilon_t is any fixed value less than 1/2. In this case the term inside product becomes \sqrt{\varepsilon (1-\varepsilon)} < 1/2 and the whole bound tends exponentially quickly to zero in the number of rounds T. On the other hand, if we raise \theta to about 1/3, then in order to maintain the LHS tending to zero we would need \varepsilon < \frac{1}{4} ( 3 - \sqrt{5} ) which is about 20% error.

If you’re interested in learning more about Boosting, there is an excellent book by Freund and Schapire (the inventors of boosting) called Boosting: Foundations and Algorithms. There they include a tighter analysis based on the idea of Rademacher complexity. The bound I presented in this post is nice because the proof doesn’t require any machinery past basic probability, but if you want to reach the cutting edge of knowledge about boosting you need to invest in the technical stuff.

Until next time!