# The Inequality

Math and computer science are full of inequalities, but there is one that shows up more often in my work than any other. Of course, I’m talking about

$\displaystyle 1+x \leq e^{x}$

This is The Inequality. I’ve been told on many occasions that the entire field of machine learning reduces to The Inequality combined with the Chernoff bound (which is proved using The Inequality).

Why does it show up so often in machine learning? Mostly because in analyzing an algorithm you want to bound the probability that some bad event happens. Bad events are usually phrased similarly to

$\displaystyle \prod_{i=1}^m (1-p_i)$

And applying The Inequality we can bound this from above by

$\displaystyle\prod_{i=1}^m (1-p_i) \leq \prod_{i=1}^m e^{-p_i} = e^{-\sum_{i=1}^m p_i}$

The point is that usually $m$ is the size of your dataset, which you get to choose, and by picking larger $m$ you make the probability of the bad event vanish exponentially quickly in $m$. (Here $p_i$ is unrelated to how I am about to use $p_i$ as weights).

Of course, The Inequality has much deeper implications than bounds for the efficiency and correctness of machine learning algorithms. To convince you of the depth of this simple statement, let’s see its use in an elegant proof of the arithmetic geometric inequality.

Theorem: (The arithmetic-mean geometric-mean inequality, general version): For all non-negative real numbers $a_1, \dots, a_n$ and all positive $p_1, \dots, p_n$ such that $p_1 + \dots + p_n = 1$, the following inequality holds:

$\displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq p_1 a_1 + \dots + p_n a_n$

Note that when all the $p_i = 1/n$ this is the standard AM-GM inequality.

Proof. This proof is due to George Polya (in Hungarian, Pólya György).

We start by modifying The Inequality $1+x \leq e^x$ by a shift of variables $x \mapsto x-1$, so that the inequality now reads $x \leq e^{x-1}$. We can apply this to each $a_i$ giving $a_i \leq e^{a_i - 1}$, and in fact,

$\displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq e^{\sum_{i=1}^n p_ia_i - p_i} = e^{\left ( \sum_{i=1}^n p_ia_i \right ) - 1}$

Now we have something quite curious: if we call $A$ the sum $p_1a_1 + \dots + p_na_n$, the above shows that $a_1^{p_1} \cdots a_n^{p_n} \leq e^{A-1}$. Moreover, again because $A \leq e^{A-1}$ that shows that the right hand side of the inequality we’re trying to prove is also bounded by $e^{A-1}$. So we know that both sides of our desired inequality (and in particular, the max) is bounded from above by $e^{A-1}$. This seems like a conundrum until we introduce the following beautiful idea: normalize by the thing you think should be the larger of the two sides of the inequality.

Define new variables $b_i = a_i / A$ and notice that $\sum_i p_i b_i = 1$ just by unraveling the definition. Call this sum $B = \sum_i p_i b_i$. Now we know that

$b_1^{p_1} \cdots b_n^{p_n} = \left ( \frac{a_1}{A} \right )^{p_1} \cdots \left ( \frac{a_n}{A} \right )^{p_n} \leq e^{B - 1} = e^0 = 1$

Now we unpack the pieces, and multiply through by $A^{p_1}A^{p_2} \cdots A^{p_n} = A$, the result is exactly the AM-GM inequality.

$\square$

Even deeper, there is only one case when The Inequality is tight, i.e. when $1+x = e^x$, and that is $x=0$. This allows us to use the proof above to come to a full characterization of the case of equality in the proof above. Indeed, the crucial step was that $(a_i / A) = e^{A-1}$, which is only true when $(a_i / A) = 1$, i.e. when $a_i = A$. Spending a few seconds thinking about this gives the characterization of equality if and only if $a_1 = a_2 = \dots = a_n = A$.

So this is excellent: the arithmetic-geometric inequality is a deep theorem with applications all over mathematics and statistics. Adding another layer of indirection for impressiveness, one can use the AM-GM inequality to prove the Cauchy-Schwarz inequality rather directly. Sadly, the Wikipedia page for the Cauchy-Schwarz inequality hardly does it justice as far as the massive number of applications. For example, many novel techniques in geometry and number theory are proved directly from C-S. More, in fact, than I can hope to learn.

Of course, no article about The Inequality could be complete without a proof of The Inequality.

Theorem: For all $x \in \mathbb{R}$, $1+x \leq e^x$.

Proof. The proof starts by proving a simpler theorem, named after Bernoulli, that $1+nx \leq (1+x)^n$ for every $x [-1, \infty)$ and every $n \in \mathbb{N}$. This is relatively straightforward by induction. The base case is trivial, and

$\displaystyle (1+x)^{n+1} = (1+x)(1+x)^n \geq (1+x)(1+nx) = 1 + (n+1)x + nx^2$

And because $nx^2 \geq 0$, we get Bernoulli’s inequality.

Now for any $z \geq 0$ we can set $x = z/n$, and get $(1+z) = (1+nx) \leq (1+\frac{z}{n})^n$ for every $n$. Note that Bernoulli’s inequality is preserved for larger and larger $n$ because $x \geq 0$. So taking limits of both sides as $n \to \infty$ we get the definition of $e^z$ on the right hand side of the inequality. We can prove a symmetrical inequality for $-x$ when $x < 0$, and this proves the theorem.

$\square$

What other insights can we glean about The Inequality? For one, it’s a truncated version of the Taylor series approximation

$\displaystyle e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$

Indeed, the Taylor remainder theorem tells us that the first two terms approximate $e^x$ around zero with error depending on some constant times $e^x x^2 \geq 0$. In other words, $1+x$ is a lower bound on $e^x$ around zero. It is perhaps miraculous that this extends to a lower bound everywhere, until you realize that exponentials grow extremely quickly and lines do not.

One might wonder whether we can improve our approximation with higher order approximations. Indeed we can, but we have to be a bit careful. In particular, $1+x+x^2/2 \leq e^x$ is only true for nonnegative $x$ (because the remainder theorem now applies to $x^3$, but if we restrict to odd terms we win: $1+x+x^2/2 + x^3/6 \leq e^x$ is true for all $x$.

What is really surprising about The Inequality is that, at least in the applications I work with, we rarely see higher order approximations used. For most applications, The difference between an error term which is quadratic and one which is cubic or quartic is often not worth the extra work in analyzing the result. You get the same theorem: that something vanishes exponentially quickly.

# A Quasipolynomial Time Algorithm for Graph Isomorphism: The Details

Update 2015-11-21: Ken Regan and Dick Lipton posted an article with some more details, and a high level overview of how the techniques fit into the larger picture of CS theory.

Update 2015-11-16: Laci has posted the talk on his website. It’s an hour and a half long, and I encourage you to watch it if you have the time :)

Laszlo Babai has claimed an astounding theorem, that the Graph Isomorphism problem can be solved in quasipolynomial time. On Tuesday I was at Babai’s talk on this topic (he has yet to release a preprint), and I’ve compiled my notes here. As in Babai’s talk, familiarity with basic group theory and graph theory is assumed, and if you’re a casual (i.e., math-phobic) reader looking to understand what the fuss is all about, this is probably not the right post for you. This post is research level theoretical computer science. We’re here for the juicy, glorious details.

Note: this blog post will receive periodic updates as my understanding of the details improve.

Laci during his lecture. Photo taken by me.

Standing room only at Laci’s talk. My advisor in the bottom right, my coauthor mid-left with the thumbs. Various famous researchers spottable elsewhere.

## Background on Graph Isomorphism

I’ll start by giving a bit of background into why Graph Isomorphism (hereafter, GI) is such a famous problem, and why this result is important. If you’re familiar with graph isomorphism and the basics of complexity theory, skip to the next section where I get into the details.

GI is the following problem: given two graphs $G = (V_G, E_G), H = (V_H, E_H)$, determine whether the graphs are isomorphic, that is, whether there is a bijection $f : V_G \to V_H$ such that $u,v$ are connected in $G$ if and only if $f(u), f(v)$ are connected in $H$. Informally, GI asks whether it’s easy to tell from two drawings of a graph whether the drawings actually represent the same graph. If you’re wondering why this problem might be hard, check out the following picture of the same graph drawn in three different ways.

Indeed, a priori the worst-case scenario is that one would have to try all $n!$ ways to rearrange the nodes of the first graph and see if one rearrangement achieves the second graph. The best case scenario is that one can solve this problem efficiently, that is, with an algorithm whose worst-case runtime on graphs with $n$ nodes and $m$ edges is polynomial in $n$ and $m$ (this would show that GI is in the class P). However, nobody knows whether there is a polynomial time algorithm for GI, and it’s been a big open question in CS theory for over forty years. This is the direction that Babai is making progress toward, showing that there are efficient algorithms. He didn’t get a polynomial time algorithm, but he got something quite close, which is why everyone is atwitter.

It turns out that telling whether two graphs are isomorphic has practical value in some applications. I hear rumor that chemists use it to search through databases of chemicals for one with certain properties (one way to think of a chemical compound is as a graph). I also hear that some people use graph isomorphism to compare files, do optical character recognition, and analyze social networks, but it seems highly probable to me that GI is not the central workhorse (or even a main workhorse) in these fields. Nevertheless, the common understanding is that pretty much anybody who needs to solve GI on a practical level can do so efficiently. The heuristics work well. Even in Babai’s own words, getting better worst-case algorithms for GI is purely a theoretical enterprise.

So if GI isn’t vastly important for real life problems, why are TCS researchers so excited about it?

Well it’s known that GI is in the class NP, meaning if two graphs are isomorphic you can give me a short proof that I can verify in polynomial time (the proof is just a description of the function $f : V_G \to V_H$). And if you’ll recall that inside NP there is this class called NP-complete, which are the “hardest” problems in NP. Now most problems in NP that we care about are also NP-complete, but it turns out GI is not known to be NP-complete either. Now, for all we know P = NP and then the question about GI is moot, but in the scenario that most people believe P and NP are different, so it leaves open the question of where GI lies: does it have efficient algorithms or not?

So we have a problem which is in NP, it’s not known to be in P, and it’s not known to be NP-complete. One obvious objection is that it might be neither. In fact, there’s a famous theorem of Ladner that says if P is different from NP, then there must be problems in NP, not in P, and not NP-complete. Such problems are called “NP-intermediate.” It’s perfectly reasonable that GI is one of these problems. But there’s a bit of a catch.

See, Ladner’s theorem doesn’t provide a natural problem which is NP intermediate; what Ladner did in his theorem was assume P is not NP, and then use that assumption to invent a new problem that he could prove is NP intermediate. If you come up with a problem whose only purpose is to prove a theorem, then the problem is deemed unnatural. In fact, there is no known “natural” NP-intermediate problem (assuming P is not NP). The pattern in CS theory is actually that if we find a problem that might be NP-intermediate, someone later finds an efficient algorithm for it or proves it’s NP-complete. There is a small and dwindling list of such problems. I say dwindling because not so long ago the problem of telling whether an integer is prime was in this list. The symptoms are that one first solves the problem on many large classes of special cases (this is true of GI) or one gets a nice quasipolynomial-time algorithm (Babai’s claimed new result), and then finally it falls into P. In fact, there is even stronger evidence against it being NP-complete: if GI were NP-complete, the polynomial hierarchy would collapse. To the layperson, the polynomial hierarchy is abstruse complexity theoretic technical hoo-hah, but suffice it to say that most experts believe the hierarchy does not collapse, so this counts as evidence.

So indeed, it could be that GI will become the first ever problem which is NP-intermediate (assuming P is not NP), but from historical patterns it seems more likely that it will fall into P. So people are excited because it’s tantalizing: everyone believes it should be in P, but nobody can prove it. It’s right at the edge of the current state of knowledge about the theoretical capabilities and limits of computation.

This is the point at which I will start assuming some level of mathematical maturity.

## The Main Result

Theorem: There is a deterministic algorithm for GI which runs in time $2^{O(\log^c(n))}$ for some constant $c$.

This is an improvement over the best previously known algorithm which had runtime $2^{\sqrt{n \log n}}$. Note the $\sqrt{n}$ in the exponent has been eliminated, which is a huge difference. Quantities which are exponential in some power of a logarithm are called “quasipolynomial.”

But the main result is actually a quasipolynomial time algorithm for a different, more general problem called string automorphism. In this context, given a set $X$string is a function from $X$ to some finite alphabet (really it is a coloring of $X$, but we are going to use colorings in the usual sense later so we have to use a new name here). If the set $X$ is given a linear ordering then strings on $X$ really correspond to strings of length $|X|$ over the alphabet. We will call strings $x,y \in X$.

Now given a set $X$ and a group $G$ acting on $X$, there is a natural action of $G$ on strings over $X$, denoted $x^\sigma$, by permuting the indices $x^{\sigma}(i) = x(\sigma(i))$. So you can ask the natural question: given two strings $x,y$ and a representation of a group $G$ acting on $X$ by a set of generating permutations of $G$, is there a $\sigma \in G$ with $x^\sigma = y$? This problem is called the string isomorphism problem, and it’s clearly in NP.

Now if you call $\textup{ISO}_G(x,y)$ the set of all permutations in $G$ that map $x$ to $y$, and you call $\textup{Aut}_G(x) = \textup{ISO}_G(x,x)$, then the actual theorem Babai claims to have proved is the following.

Theorem: Given a generating set for a group $G$ of permutations of a set $X$ and a string $x$, there is a quasipolynomial time algorithm which produces a generating set of the subgroup $\textup{Aut}_G(x)$ of $G$, i.e. the string automorphisms of $x$ that lie in in $G$.

It is not completely obvious that GI reduces to the automorphism problem, but I will prove it does in the next section. Furthermore, the overview of Babai’s proof of the theorem follows an outline laid out by Eugene Luks in 1982, which involves a divide-and-conquer method for splitting the analysis of $\textup{Aut}_G(x)$ into simpler and simpler subgroups as they are found.

## Luks’s program

Eugene Luks was the first person to incorporate “serious group theory” (Babai’s words) into the study of graph isomorphism. Why would group theory help in a question about graphs? Let me explain with a lemma.

Lemma: GI is polynomial-time reducible to the problem of computing, given a graph $X$, a list of generators for the automorphism group of $G$, denoted $\textup{Aut}(X)$.

Proof. Without loss of generality suppose $X_1, X_2$ are connected graphs. If we want to decide whether $X_1, X_2$ are isomorphic, we may form the disjoint union $X = X_1 \cup X_2$. It is easy to see that $X_1$ and $X_2$ are isomorphic if and only if some $\sigma \in \textup{Aut}(X)$ swaps $X_1$ and $X_2$. Indeed, if any automorphism with this property exists, every generating set of $\textup{Aut}(G)$ must contain one.

$\square$

Similarly, the string isomorphism problem reduces to the problem of computing a generating set for $\textup{Aut}_G(x)$ using a similar reduction to the one above. As a side note, while $\textup{ISO}_G(x,y)$ can be exponentially large as a set, it is either the empty set, or a coset of $\textup{Aut}_G(x)$ by any element of $\textup{ISO}_G(x,y)$. So there are group-theoretic connections between the automorphism group of a string and the isomorphisms between two strings.

But more importantly, computing the automorphism group of a graph reduces to computing the automorphism subgroup of a particular group for a given string in the following way. Given a graph $X$ on a vertex set $V = \{ 1, 2, \dots, v \}$ write $X$ as a binary string on the set of unordered pairs $Z = \binom{V}{2}$ by mapping $(i,j) \to 1$ if and only if $i$ and $j$ are connected by an edge. The alphabet size is 2. Then $\textup{Aut}(X)$ (automorphisms of the graph) induces an action on strings as a subgroup $G_X$ of $\textup{Aut}(Z)$ (automorphisms of strings). These induced automorphisms are exactly those which preserve proper encodings of a graph. Moreover, any string automorphism in $G_X$ is an automorphism of $X$ and vice versa. Note that since $Z$ is larger than $V$ by a factor of $v^2$, the subgroup $G_X$ is much smaller than all of $\textup{Aut}(Z)$.

Moreover, $\textup{Aut}(X)$ sits inside the full symmetry group $\textup{Sym}(V)$ of $V$, the vertex set of the starting graph, and $\textup{Sym}(V)$ also induces an action $G_V$ on $Z$. The inclusion is

$\displaystyle \textup{Aut}(X) \subset \textup{Sym}(V)$

induces

$\displaystyle G_X = \textup{Aut}(\textup{Enc}(X)) \subset G_V \subset \textup{Aut}(Z)$

I.e.,

Call $G = G_V$ the induced subgroup of permutations of strings-as-graphs. Now we just have some subgroup of permutations $G$ of $\textup{Aut}(Z)$, and we want to find a generating set for $\textup{Aut}_G(x)$ (where $x$ happens to be the encoding of a graph). That is exactly the string automorphism problem. Reduction complete.

Now the basic idea to compute $\textup{Aut}_G(x)$ is to start from the assumption that $\textup{Aut}_G(x) = G$. We know it’s a subgroup, so it could be all of $G$; in terms of GI if this assumption were true it would mean the starting graph was the complete graph, but for string automorphism in general $G$ can be whatever. Then we try to refute this belief by finding additional structure in $\textup{Aut}_G(x)$, either by breaking it up into smaller pieces (say, orbits) or by constructing automorphisms in it. That additional structure allows us to break up $G$ in a way that $\textup{Aut}_G(x)$ is a subgroup of the product of the corresponding factors of $G$.

The analogy that Babai used, which goes back to graphs, is the following. If you have a graph $X$ and you want to know its automorphisms, one thing you can do is to partition the vertices by degree. You know that an automorphism has to preserve the degree of an individual vertex, so in particular you can break up the assumption that $\textup{Aut}(X) = \textup{Sym}(V)$ into the fact that $\textup{Aut}(X)$ must be a subgroup of the product of the symmetry groups of the pieces of the partition; then you recurse. In this way you’ve hugely reduced the total number of automorphisms you need to consider. When the degrees get small enough you can brute-force search for automorphisms (and there is some brute-force searching in putting the pieces back together). But of course this approach fails miserably in the first step you start with a regular graph, so one would need to look for other kinds of structure.

One example is an equitable partition, which is a partition of vertices by degree relative to vertices in other blocks of the partition. So a vertex which has degree 3 but two degree 2 neighbors would be in a different block than a vertex with degree 3 and only 1 neighbor of degree 2. Finding these equitable partitions (which can be done in polynomial time) is one of the central tools used to attack GI. As an example of why it can be very helpful: in many regimes a Erdos-Renyi random graph has asymptotically almost surely a coarsest equitable partition which consists entirely of singletons. This is despite the fact that the degree sequences themselves are tightly constrained with high probability. This means that, if you’re given two Erdos-Renyi random graphs and you want to know whether they’re isomorphic, you can just separately compute the coarsest equitable partition for each one and see if the singleton blocks match up. That is your isomorphism.

Even still, there are many worst case graphs that resist being broken up by an equitable partition. A hard example is known as the Johnson graph, which we’ll return to later.

For strings the sorts of structures to look for are even more refined than equitable partitions of graphs, because the automorphism group of a graph can be partitioned into orbits which preserve the block structure of an equitable partition. But it still turns out that Johnson graphs admit parts of the automorphism group that can’t be broken up much by orbits.

The point is that when some useful substructure is found, it will “make progress” toward the result by breaking the problem into many pieces (say, $n^{\log n}$ pieces) where each piece has size $9/10$ the size of the original. So you get a recursion in the amount of time needed which looks like $f(n) \leq n^{\log n} f(9n/10)$. If you call $q(n) = n^{\log n}$ the quasipolynomial factor, then solving the recurrence gives $f(n) \leq q(n)^{O(\log n)}$ which only adds an extra log factor in the exponent. So you keep making progress until the problem size is polylogarithmic, and then you brute force it and put the pieces back together in quasipolynomial time.

## Two main lemmas, which are theorems in their own right

This is where the details start to get difficult, in part because Babai jumped back and forth between thinking of the object as a graph and as a string. The point of this in the lecture was to illustrate both where the existing techniques for solving GI (which were in terms of finding canonical graph substructures in graphs) break down.

The central graph-theoretic picture is that of “individualizing” a vertex by breaking it off from an existing equitable partition, which then breaks the equitable partition structure so you need to do some more (polytime) work to further refine it into an equitable partition again. But the point is that you can take all the vertices in a block, pick all possible ways to individualize them by breaking them into smaller blocks. If you traverse these possibilities in a canonical order, you will eventually get down to a partition of singletons, which is your “canonical labeling” of the graph. And if you do this process with two different graphs and you get to different canonical labelings, you had to have started with non-isomorphic graphs.

The problem is that when you get to a coarsest equitable partition, you may end up with blocks of size $\sqrt{n}$, meaning you have an exponential number of individualizations to check. This is the case with Johnson graphs, and in fact if you have a Johnson graph $J(m,t)$ which has $\binom{m}{t}$ vertices and you individualize fewer than $m/10t$ if them, then you will only get down to blocks of size polynomially smaller than $\binom{m}{t}$, which is too big if you want to brute force check all individualizations of a block.

The main combinatorial lemma that Babai proves to avoid this problem is that the Johnson graphs are the only obstacle to finding efficient partitions.

Theorem (Babai 15): If $X$ is a regular graph on $m$ vertices, then by individualizing a polylog number of vertices we can find one of the three following things:

1. A canonical coloring with each color class having at most 90% of all the nodes.
2. A canonical equipartition of some subset of the vertices that has at least 90% of the nodes (i.e. a big color class from (1)).
3. A canonically embedded Johnson graph on at least 90% of the nodes.

[Edit: I think that what Babai means by a “canonical coloring” is an equitable partition of the nodes (not to be confused with an equipartition), but I am not entirely sure. I have changed the language to reflect more clearly what was in the talk as opposed to what I think I understood from doing additional reading.]

The first two are apparently the “easy” cases in the sense that they allow for simple recursion that has already been known before Babai’s work. The hard part is establishing the last case (and this is the theorem whose proof sketch he has deferred for two more weeks). But once you have such a Johnson graph your life is much better, because (for a reason I did not understand) you can recurse on a problem of size roughly the square root of the starting size.

In discussing Johnson graphs, Babai said they were a source of “unspeakable misery” for people who want to solve GI quickly. At the same time, it is a “curse and a blessing,” as once you’ve found a Johnson graph embedded in your problem you can recurse to much smaller instances. This routine to find one of these three things is called the “split-or-Johnson” routine.

The analogue for strings (I believe this is true, but I’m a bit fuzzy on this part) is to find a “canonical” $k$-ary relational structure (where $k$ is polylog in size) with some additional condition on the size of alternating subgroups of the automorphism group of the $k$-ary relational structure. Then you can “individualize” the points in the base of this relational structure and find analogous partitions and embedded Johnson schemes (a special kind of combinatorial design).

One important fact to note is that the split-or-Johnson routine breaks down at $\log^3(n)$ size, and Babai has counterexamples that say his result is tight, so getting GI in P would have to bypass this barrier with a different idea.

The second crucial lemma has to do with giant homomorphisms, and this is the method by which Babai constructs automorphisms that bound $\textup{Aut}_G(x)$ from below. As opposed to the split-or-Johnson lemma, which finds structure to bound the group from above by breaking it into simpler pieces. Fair warning: one thing I don’t yet understand is how these two routines interact in the final algorithm. My guess is they are essentially run in parallel (or alternate), but that guess is as good as wild.

Definition: A homomorphism $\varphi: G \to S_m$ is called giant if the image of $G$ is either the alternating group $A_n$ or else all of $S_m$. I.e. $\varphi$ is surjective, or almost so. Let $\textup{Stab}_G(x)$ denote the stabilizer subgroup of $x \in G$. Then $x$ is called “affected” by $\varphi$ if $\varphi|_{\textup{Stab}_g(x)}$ is not giant.

The central tool in Babai’s algorithm is the dichotomy between points that are affected and those that are not. The ability to decide this property in quasipolynomial time is what makes the divide and conquer work.

There is one deep theorem he uses that relates affected points to giant homomorphisms:

Theorem (Unaffected Stabilizer Theorem): Let $\varphi: G \to S_m$ be a giant homomorphism and $U \subset G$ the set of unaffected elements. Let $G_{(U)}$ be the pointwise stabilizer of $U$, and suppose that $m > \textup{max}(8, 2 + \log_2 n)$. Then the restriction $\varphi : G_{(U)} \to S_m$ is still giant.

Babai claimed this was a nontrivial theorem, not because the proof is particularly difficult, but because it depends on the classification of finite simple groups. He claimed it was a relatively straightforward corollary, but it appears that this does not factor into the actual GI algorithm constructively, but only as an assumption that a certain loop invariant will hold.

To recall, we started with this assumption that $\textup{Aut}_G(x)$ was the entire symmetry group we started with, which is in particular an assumption that the inclusion $\varphi \to S_m$ is giant. Now you want to refute this hypothesis, but you can’t look at all of $S_m$ because even the underlying set $m$ has too many subsets to check. But what you can do is pick a test set $A \subset [m]$ where $|A|$ is polylogarithmic in size, and test whether the restriction of $\varphi$ to the test set is giant in $\textup{Sym}(A)$. If it is giant, we call $A$ full.

Theorem (Babai 15): One can test the property of a polylogarithmic size test set being full in quasipolynomial time in m.

Babai claimed it should be surprising that fullness is a quasipolynomial time testable property, but more surprising is that regardless of whether it’s full or not, we can construct an explicit certificate of fullness or non-fullness. In the latter case, one can come up with an explicit subgroup which contains the image of the projection of the automorphism group onto the symmetry group of the test set. In addition, given two test sets $A,B$, one can efficiently compare the action between the two different test sets. And finding these non-full test sets is what allows one to construct the $k$-ary relations. So the output of this lower bound technique informs the upper bound technique of how to proceed.

The other outcome is that $A$ could be full, and coming up with a certificate of fullness is harder. The algorithm sketched below claims to do it, and it involves finding enough “independent” automorphisms to certify that the projection is giant.

Now once you try all possible test sets, which gives $\binom{m}{k}^2$ many certificates (a quasipolynomial number), one has to aggregate them into a full automorphism of $x$, which Babai assured us was a group theoretic exercise.

The algorithm to test fullness (and construct a certificate) he called the Local Certificates Algorithm. It was sketched as follows: you are given as input a set $A$ and a group $G_A \subset G$ being the setwise stabilizer of $A$ under $\psi_A : G_A \to \textup{Sym}(A)$. Now let $W$ be the group elements affected by $\psi_A$. You can be sure that at least one point is affected. Now you stabilize on $W$ and get a refined subgroup of $G_A$, which you can use to compute newly affected elements, growing $W$ in each step. By the unaffected stabilizer theorem, this preserves gianthood. Furthermore, in each step you get layers of $W$, and all of the stabilizers respect the structure of the previous layers. Babai described this as adding on “layers of a beard.”

The termination of this is either when $W$ stops growing, in which case the projection is giant and $W$ is our certificate of fullness (i.e. we get a rich family of automorphisms that are actually in our target automorphism group), or else we discover the projected ceases to be giant and $W$ is our certificate of non-fullness. Indeed, the subgroup generated by these layers is a subgroup of $\textup{Aut}_G(x)$, and the subgroup generated by the elements of a non-fullness certificate contain the automorphism group.

## Not enough details?

This was supposed to be just a high-level sketch of the algorithm, and Babai is giving two more talks elaborating on the details. Unfortunately, I won’t be able to make it to his second talk in which he’ll discuss some of the core group theoretic ideas that go into the algorithm. I will, however, make it to his third talk in which he will sketch the proof of the split-or-Johnson routine. That is in two weeks from the time of this writing, and I will update this post with any additional insights then.

Babai has not yet released a preprint, and when I asked him he said “soon, soon.” Until then :)

This blog post is based on my personal notes from Laszlo Babai’s lecture at the University of Chicago on November 10, 2015. At the time of this writing, Babai’s work has not been peer reviewed, and my understanding of his lectures has large gaps and may be faulty. Do not put your life in danger based on information in this post.

# Serial Dictatorships and House Allocation

I was recently an invited speaker in a series of STEM talks at Moraine Valley Community College. My talk was called “What can algorithms tell us about life, love, and happiness?” and it’s on Youtube now so you can go watch it. The central theme of the talk was the lens of computation, that algorithms and theoretical computer science can provide new and novel explanations for the natural phenomena we observe in the world.

One of the main stories I told in the talk is about stable marriages and the deferred acceptance algorithm, which we covered previously on this blog. However, one of the examples of the applications I gave was to kidney exchanges and school allocation. I said in the talk that it’s a variant of the stable marriages, but it’s not clear exactly how the two are related. This post will fill that gap and showcase some of the unity in the field of mechanism design.

Mechanism design, which is sometimes called market design, has a grand vision. There is a population of players with individual incentives, and given some central goal the designer wants to come up with a game where the self-interest of the players will lead them to efficiently achieve the designer’s goals. That’s what we’re going to do today with a class of problems called allocation problems.

As usual, all of the code we used in this post is available in a repository on this blog’s Github page.

## Allocating houses with dictators

In stable marriages we had $n$ men and $n$ women and we wanted to pair them off one to one in a way that there were no mutual incentives to cheat. Let’s modify this scenario so that only one side has preferences and the other does not. The analogy here is that we have $n$ people and $n$ houses, but what do we want to guarantee? It doesn’t make sense to say that people will cheat on each other, but it does make sense to ask that there’s no way for people to swap houses and have everyone be at least as happy as before. Let’s formalize this.

Let $A$ be a set of people (agents) and $H$ be a set of houses, and $n = |A| = |H|$. A matching is a one-to-one map from $A \to H$. Each agent is assumed to have a strict preference over houses, and if we’re given two houses $h_1, h_2$ and $a \in A$ prefers $h_1$ over $h_2$, we express that by saying $h_1 >_a h_2$. If we want to include the possibility that $h_1 = h_2$, we would say $h_1 \geq_a h_2$. I.e., either they’re the same house, or $a$ strictly prefers $h_1$ more.

Definition: A matching $M: A \to H$ is called pareto-optimal if there is no other matching $M$ with both of the following properties:

• Every agent is at least as happy in $N$ as in $M$, i.e. for every $a \in A$, $N(a) \geq_a M(a)$.
• Some agent is strictly happier in $N$, i.e. there exists an $a \in A$ with $N(a) >_a M(a)$.

We say a matching $N$ “pareto-dominates” another matching $M$ if these two properties hold. As a side note, if you like abstract algebra you might notice that you can take matchings and form them into a lattice where the comparison is pareto-domination. If you go deep into the theory of lattices, you can use some nice fixed-point theorems to (non-constructively) prove the existences of optimal allocations in this context and for stable marriages. See this paper if you’re interested. Of course, we will give efficient algorithms to achieve our goals, which is how I prefer to live life.

The mechanism we’ll use to find such an optimal matching is extremely simple, and it’s called the serial dictatorship.

First you pick an arbitrary ordering of the agents and all houses are marked “available.” Then the first agent in the ordering picks their top choice, and you remove their choice from the available houses. Continue in this way down the list until you get to the end, and the outcome is guaranteed to be pareto-optimal.

Theorem: Serial dictatorship always produces a pareto-optimal matching.

Proof. Let $M$ be the output of the algorithm. Suppose the theorem is false, that there is some $N$ that pareto-dominates $M$. Let $a$ be the first agent in the chosen ordering who gets a strictly better house in $N$ than in $M$. Whatever house $a$ gets, call it $N(a)$, it has to be a house that was unavailable at the time in the algorithm when $a$ got to pick (otherwise $a$ would have picked $N(a)$ during the algorithm!). This means that $a$ took the house chosen by some agent $b \in A$ whose turn to pick comes before $a$. But by assumption, $a$ was the first agent to get a strictly better house, so $b$ has to end up with a worse house. This contradicts that every agent is at least as happy in $N$ than in $M$, so $N$ cannot pareto-dominate $M$.

$\square$

It’s easy enough to implement this in Python. Each agent will be represented by its list of preferences, each object will be an integer, and the matching will be a dictionary. The only thing we need to do is pick a way to order the agents, and we’ll just pick a random ordering. As usual, all of the code used in this post is available on this blog’s github page.

# serialDictatorship: [[int]], [int] -> {int: int}
# construct a pareto-optimal allocation of objects to agents.
def serialDictatorship(agents, objects, seed=None):
if seed is not None:
random.seed(seed)

agentPreferences = agents[:]
random.shuffle(agentPreferences)
allocation = dict()
availableHouses = set(objects)

for agentIndex, preference in enumerate(agentPreferences):
allocation[agentIndex] = max(availableHouses, key=preference.index)
availableHouses.remove(allocation[agentIndex])

return allocation


And a test

agents = [['d','a','c','b'], # 4th in my chosen seed
['a','d','c','b'], # 3rd
['a','d','b','c'], # 2nd
['d','a','c','b']] # 1st
objects = ['a','b','c','d']
allocation = serialDictatorship(agents, objects, seed=1)
test({0: 'b', 1: 'c', 2: 'd', 3: 'a'}, allocation)


This algorithm is so simple it’s almost hard to believe. But it get’s better, because under some reasonable conditions, it’s the only algorithm that solves this problem.

Theorem [Svensson 98]: Serial dictatorship is the only algorithm that produces a pareto-optimal matching and also has the following three properties:

• Strategy-proof: no agent can improve their outcomes by lying about their preferences at the beginning.
• Neutral: the outcome of the algorithm is unchanged if you permute the items (i.e., does not depend on the index of the item in some list)
• Non-bossy: No agent can change the outcome of the algorithm without also changing the object they receive.

And if we drop any one of these conditions there are other mechanisms that satisfy the rest. This theorem was proved in this paper by Lars-Gunnar Svensson in 1998, and it’s not particularly long or complicated. The proof of the main theorem is about a page. It would be a great exercise in reading mathematics to go through the proof and summarize the main idea (you could even leave a comment with your answer!).

## Allocation with existing ownership

Now we switch to a slightly different problem. There are still $n$ houses and $n$ agents, but now every agent already “owns” a house. The question becomes: can they improve their situation by trading houses? It shouldn’t be immediately obvious whether this is possible, because a trade can happen in a “cycle” like the following:

Here A prefers the house of B, and B prefers the house of C, and C prefers the house of A, so they’d all benefit from doing a three-way cyclic trade. You can easily imagine the generalization to larger cycles.

This model was studied by Shapley and Scarf in 1974 (the same Shapley who did the deferred acceptance algorithm for stable marriages). Just as you’d expect, our goal is to find an optimal (re)-allocation of houses to agents in which there is no cycle the stands to improve. That is, there is no subset of agents that can jointly improve their standing. In formalizing this we call an “optimal” matching a core matching. Again $A$ is a set of agents, and $H$ is a set of houses.

Definition: A matching $M: A \to H$ is called a core matching if there is no subset $B \subset A$ and no matching $N: A \to H$ with the following properties:

• For every $b \in B$, $N(b)$ is owned by some other agent in $B$ (trading only happens within $B$).
• Every agent $b$ in $B$ is at least as happy as before, i.e. $N(b) \geq_b M(b)$ for all $b$.
• Some agent in $B$ strictly improves, i.e. for some $b, N(b) >_b M(b)$.

We also call an algorithm individually rational if it ensures that every agent gets a house that is at least as good as their starting house. It should be clear that an algorithm which produces a core matching is individually rational, because for any agent $a$ we can set $B = \{a\}$, i.e. force $a$ to consider not trading at all, and being a core matching says that’s not better for $a$. Likewise, core matchings are also pareto-optimal by setting $B = A$.

It might seem like the idea of a “core” solution to an allocation problem is more general, and you’re right. You can define it for a very general setting of cooperative games and prove the existence of core matchings in that setting. See Wikipedia for more. As is our prerogative, we’ll achieve the same thing by constructing core matchings with an algorithm.

Indeed, the following theorem is due to Shapley & Scarf.

Theorem [Shapley-Scarf 74]: There is a core matching for every choice of preferences. Moreover, one can be found by an efficient algorithm.

Proof. The mechanism we’ll define is called the top trading cycles algorithm. We operate in rounds, and the first round goes as follows.

Form a directed graph with nodes in $A \cup H$. That is there is one node for each agent and one node for each house. Then we start by having each agent “point” to its most preferred house, and each house “points” to its original owner. That is, we add in directed edges from agents to their top pick, and houses to their owners. For example, say there are five agents $A = \{ a, b, c, d, e, f \}$ and houses $H = \{ 1,2,3,4,5 \}$ with $a$ owning $1$, and $b$ owning $2$, etc. but their favorite picks goes backwards, so that $a$ prefers house $5$ most, and $b$ prefers $4$ most, $c$ prefers $3$ (which $c$ also owns), etc. Then the “pointing picture” in the first round looks like this.

The claim about such a graph is that there is always some directed cycle. In the example above, there are three. And moreover, we claim that no two cycles can share an edge. It’s easy to see there has to be a cycle: you can start at any agent and just follow the single outgoing edge until you find yourself repeating some vertices. By the fact that there is only one edge going out of any vertex, it follows that no two cycles could share an edge (or else in the last edge they share, there’d have to be a fork, i.e. two outgoing edges).

In the example above, you can start from A and follow the only edge and you get the cycle A -> 5 -> E -> 1 -> A. Similarly, starting at 4 would give you 4 -> D -> 2 -> B -> 4.

The point is that when you remove a cycle, you can have the agents in that cycle do the trade indicated by the cycle and remove the entire cycle from the graph. The consequence of this is that you have some agents who were pointing to houses that are removed, and so these agents revise their outgoing edge to point at their next most preferred available house. You can then continue removing cycles in this way until all the agents have been assigned a house.

The proof that this is a core matching is analogous to the proof that serial dictatorships were pareto-optimal. If there were some subset $B$ and some other matching $N$ under which $B$ does better, then one of these agents has to be the first to be removed in a cycle during the algorithm’s run. But that agent got the best possible pick of a house, so by involving with $B$ that agent necessarily gets a worse outcome.

$\square$

This algorithm is commonly called the Top Trading Cycles algorithm, because it splits the set of agents and houses into a disjoint union of cycles, each of which is the best trade possible for every agent involved.

Implementing the Top Trading Cycles algorithm in code requires us to be able to find cycles in graphs, but that isn’t so hard. I implemented a simple data structure for a graph with helper functions that are specific to our kind of graph (i.e., every vertex has outdegree 1, so the algorithm to find cycles is simpler than something like Tarjan’s algorithm). You can see the data structure on this post’s github repository in the file graph.py. An example of using it:

>>> G = Graph([1,'a',2,'b',3,'c',4,'d',5,'e',6,'f'])
>>> G.addEdges([(1,'a'), ('a',2), (2,'b'), ('b',3), (3,'c'), ('c',1),
(4,'d'), ('d',5), (5,'e'), ('e',4), (6,'f'), ('f',6)])
>>> G['d']
Vertex('d')
>>> G['d'].outgoingEdges
{('d', 5)}
>>> G['d'].anyNext() # return the target of any outgoing edge from 'd'
Vertex(5)
>>> G.delete('e')
>>> G[4].incomingEdges
set()


Next we implement a function to find a cycle, and a function to extract the agents from a cycle. For latter we can assume the cycle is just represented by any agent on the cycle (again, because our graphs always have outdegree exactly 1).

# anyCycle: graph -> vertex
# find any vertex involved in a cycle
def anyCycle(G):
visited = set()
v = G.anyVertex()

while v not in visited:
v = v.anyNext()

return v

# getAgents: graph, vertex -> set(vertex)
# get the set of agents on a cycle starting at the given vertex
def getAgents(G, cycle, agents):
# make sure starting vertex is a house
if cycle.vertexId in agents:
cycle = cycle.anyNext()

startingHouse = cycle
currentVertex = startingHouse.anyNext()
theAgents = set()

while currentVertex not in theAgents:
currentVertex = currentVertex.anyNext()
currentVertex = currentVertex.anyNext()

return theAgents


Finally, implementing the algorithm is just bookkeeping. After setting up the initial graph, the core of the routine is

def topTradingCycles(agents, houses, agentPreferences, initialOwnership):
# form the initial graph

...

allocation = dict()
while len(G.vertices) &> 0:
cycle = anyCycle(G)
cycleAgents = getAgents(G, cycle, agents)

# assign agents in the cycle their choice of house
for a in cycleAgents:
h = a.anyNext().vertexId
allocation[a.vertexId] = h
G.delete(a)
G.delete(h)

for a in agents:
if a in G.vertices and G[a].outdegree() == 0:
# update preferences
...

return allocation


This mutates the graph in each round by deleting any cycle that was found, and adding new edges when the top choice of some agent is removed. Finally, to fill in the ellipses we just need to say how we represent the preferences. The input agentPreferences is a dictionary mapping agents to a list of all houses in order of preference. So again we can just represent the “top available pick” by an index and update that index when agents lose their top pick.

# maps agent to an index of the list agentPreferences[agent]
currentPreferenceIndex = dict((a,0) for a in agents)
preferredHouse = lambda a: agentPreferences[a][currentPreferenceIndex[a]]


Then to update we just have to replace the currentPreferenceIndex for each disappointed agent by its next best option.

      for a in agents:
if a in G.vertices and G[a].outdegree() == 0:
while preferredHouse(a) not in G.vertices:
currentPreferenceIndex[a] += 1


And that’s it! We included a small suite of test cases which you can run if you want to play around with it more.

One final nice thing about this algorithm is that it almost generalizes the serial dictatorship algorithm. What you do is rather than have each house point to its original owner, you just have all houses point to the first agent in the pre-specified ordering. Then a cycle will always have length 2, the first agent gets their preferred house, and in the next round the houses now point to the second agent in the ordering, and so on.

## Kidney exchange

We still need one more ingredient to see the bridge from allocation problems to kidney exchanges. The setting is like this: say Manuel needs a kidney transplant, and he’s lucky enough that his sister-in-law Anastasia wants to donate her kidney to Manuel. However, it turns out that Anastasia doesn’t the same right blood/antibody type for a donation, and so even though she has a kidney to give, they can’t give it to Manuel. Now one might say “just sell your kidney and use the money to buy a kidney with the right type!” Turns out that’s illegal; at some point we as a society decided that it’s immoral to sell organs. But it is legal to exchange a kidney for a kidney. So if Manuel and Anastasia can find a pair of people both of whom happen to have the right blood types, they can arrange for a swap.

But finding two people both of whom have the right blood types is unlikely, and we can actually do far better! We can turn this into a housing allocation problem as follows. Anyone with a kidney to donate is a “house,” and anyone who needs a kidney is an “agent.” And to start off with, we say that each agent “owns” the kidney of their willing donor. And the preferences of each agent are determined by which kidney donors have the right blood type (with ties split, say, by geographical distance). Then when you do the top trading cycles algorithm you find these chains where Anastasia, instead of donating to Manuel, donates to another person who has the right blood type. On the other end of the cycle, Manuel receives a kidney from someone with the right blood type.

The big twist is that not everyone who needs a kidney knows someone willing to donate. So there are agents who are “new” to the market and don’t already own a house. Moreover, maybe you have someone who is willing to donate a kidney but isn’t asking for anything in return.

Because of this the algorithm changes slightly. You can no longer guarantee the existence of a cycle (though you can still guarantee that no two cycles will share an edge). But as new people are added to the graph, cycles will eventually form and you can make the trades. There are a few extra details if you want to ensure that everyone is being honest (if you’re thinking about it like a market in the economic sense, where people could be lying about their preferences).

The resulting mechanism is called You Request My House I Get Your Turn (YRMHIGYT). In short, the idea is that you pick an order on the agents, say for kidney exchanges it’s the order in which the patients are diagnosed. And you have them add edges to the graph in that order. At each step you look for a cycle, and when one appears you remove it as usual. The twist, and the source of the name, is that when someone who has no house requests a house which is already owned, the agent who owns the house gets to jump forward in the queue. This turns out to make everything “fair” (in that everyone is guaranteed to get a house at least as good as the one they own) and one can prove analogous optimality theorems to the ones we did for serial dictatorship.

This mechanism was implemented by Alvin Roth in the US hospital system, and by some measure it has saved many lives. If you want to hear more about the process and how successful the kidney exchange program is, you can listen to this Freakonomics podcast episode where they interviewed Al Roth and some of the patients who benefited from this new allocation market.

It would be an excellent exercise to go deeper into the guts of the kidney exchange program (see this paper by Alvin Roth et al.), and implement the matching system in code. At the very least, implementing the YRMHIGYT mechanism is only a minor modification of our existing Top Trading Cycles code.

Until next time!

# One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

## Statistical parity

Here is one way to define fairness.

Your population is a set $X$ and there is some known subset $S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $X$ is people and $S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $D$ over $X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $D$ takes that into account. Generally we impose no restrictions on $D$, and the definition of fairness will have to work no matter what $D$ is.

Now suppose we have a (possibly randomized) classifier $h:X \to \{-1,1\}$ giving labels to $X$. When given a person $x$ as input $h(x)=1$ if $x$ gets a loan and $-1$ otherwise. The bias, or statistical imparity, of $h$ on $S$ with respect to $X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $S$ is labeled 1 and the probability that a random individual from the complement $S^C$ is labeled 1.

$\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $h:X \to \{-1,1\}$ is said to have statistical parity on $D$ with respect to $S$ up to bias $\varepsilon$ if $|\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):
protectedClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] == protectedValue]
elseClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] != protectedValue]

if len(protectedClass) == 0 or len(elseClass) == 0:
raise Exception("One of the classes is empty!")
else:
protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

return elseProb - protectedProb


Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)


Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":

# [(test name, (index, value))]
tests = [('gender', (1,0)),
('private employment', (2,1)),
('asian race', (33,1)),
('divorced', (12, 1))]

for (name, (index, value)) in tests:
print("'%s' bias in training data: %.4f" %
(name, labelBias(train[0], train[1], index, value)))


(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582


Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
>>> signedBias(train[0], indian, 1, 0)
0.0030631816119030884


So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least \$50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $T$ in place of $S^C$, or looking at discrepancies among $k$ different sub-populations or with $m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $S$. I mean two things by this. First, you could have some historical data you want to train a classifier $h$ on, and usually you’ll be given training labels for the data that tell you whether $h(x)$ should be $1$ or $-1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $S$ necessarily reduces the accuracy of $h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

## Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!