# Hashing to Estimate the Size of a Stream

Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory.

Solution (in Python):

import random

def randomHash(modulus):
    a, b = random.randint(0, modulus-1), random.randint(0, modulus-1)

    def f(x):
        return (a*x + b) % modulus

    return f

def average(L):
    return sum(L) / len(L)

def numDistinctElements(stream, numParallelHashes=10):
    modulus = 2**20
    hashes = [randomHash(modulus) for _ in range(numParallelHashes)]
    minima = [modulus] * numParallelHashes
    currentEstimate = 0

    for x in stream:
        hashValues = [h(x) for h in hashes]
        for i, newValue in enumerate(hashValues):
            if newValue < minima[i]:
                minima[i] = newValue

        currentEstimate = modulus / average(minima)

        yield currentEstimate


Discussion: The technique used here is random hashing. The central idea is the same as the general principle presented in our recent post on hashing for load balancing: if you have an algorithm that works under the assumption that the data is uniformly random, then the same algorithm will work (up to a good approximation) if you process the data through a randomly chosen hash function.

So if we assume the data in the stream consists of $N$ uniformly random real numbers between zero and one, what we would do is the following. Maintain a single number $x_{\textup{min}}$ representing the minimum element in the list, and update it every time we encounter a smaller number in the stream. A simple probability calculation or an argument by symmetry shows that the expected value of the minimum is $1/(N+1)$. So your estimate would be $1/x_{\textup{min}} - 1$, or simply $1/x_{\textup{min}}$ (the extra $-1$ does not change much, as we’ll see). One can spend some time thinking about the variance of this estimate (indeed, our earlier post is great guidance for how such a calculation would work), but since the data is not truly random we need to do more work. If the elements are actually integers between zero and $k$, then this estimate can be scaled by $k$ and everything basically works out the same.
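For the record, the expectation computation is one line: the minimum exceeds $t$ exactly when all $N$ samples do, so

$\displaystyle \mathbb{E}[x_{\textup{min}}] = \int_0^1 \Pr[x_{\textup{min}} > t] \, dt = \int_0^1 (1-t)^N \, dt = \frac{1}{N+1}.$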

Processing the data through a hash function $h$ chosen randomly from a 2-universal family (and we proved in the aforementioned post that this modulus thing is 2-universal) makes the outputs “essentially random” enough to have the above technique work with some small loss in accuracy. And to reduce variance, you can process the stream in parallel with many random hash functions. This rough sketch results in the code above. Indeed, before I state a formal theorem, let’s see the above code in action. First on truly random data:

S = [random.randint(1, 2**20) for _ in range(10000)]

for k in range(10, 301, 10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))

# output
18299.75567190227
7940.7497160166595
12034.154552410098
12387.19432959244
15205.56844547564
8409.913113220158
8057.99978043693
9987.627098464103
10313.862295081966
9084.872639057356
10952.745228373375
10360.569781803211
11022.469475216301
9741.250165892501
11474.896038520465
10538.452261306533
10068.793492995934
10100.266495424627
9780.532155130093
8806.382800033594
10354.11482578643
10001.59202254498
10623.87031408308
9400.404915767062
10710.246772348424
10210.087633885101
9943.64709187974
10459.610972568578
10159.60175069326
9213.120899718839


As you can see, the output is never off by more than a factor of 2. Now with “adversarial data.”

S = range(10000) # [random.randint(1,2**20) for _ in range(10000)]

for k in range(10, 301, 10):
    for est in numDistinctElements(S, k):
        pass
    print(abs(est))

# output

12192.744186046511
15935.80547112462
10167.188106011634
12977.425742574258
6454.364151175674
7405.197740112994
11247.367453263867
4261.854392115023
8453.228233608026
7706.717624577393
7582.891328643745
5152.918628936483
1996.9365093316926
8319.20208545846
3259.0787592465967
6812.252720480753
4975.796789951151
8456.258064516129
8851.10133724288
7317.348220516398
10527.871485943775
3999.76974425661
3696.2999065091117
8308.843106180666
6740.999794281012
8468.603733730935
5728.532232608959
5822.072220349402
6382.349459544548
8734.008940222673


The estimates here are off by a factor of up to 5, and this estimate seems to get better as the number of hash functions used increases. The formal theorem is this:

Theorem: Let $S$ be the set of distinct items in the stream, let $n = |S|$, and let $m$ be the modulus with $m > 100n$. Then with probability at least 2/3 the estimate $m / x_{\textup{min}}$ is between $n/6$ and $6n$.

We omit the proof (see below for references and better methods). As a quick analysis: since we’re only storing a constant number of integers at any given step, the algorithm uses $O(\log m) = O(\log n)$ space, and each update takes time polynomial in $\log m$ (since we have to compute a multiplication and a remainder modulo $m$).

This method is just the first ripple in a lake of research on this topic. The general area is called “streaming algorithms,” or “sublinear algorithms.” This particular problem, called cardinality estimation, is related to a family of problems called estimating frequency moments. The literature gets pretty involved in the various tradeoffs between space requirements and processing time per stream element.

As far as estimating cardinality goes, the first major results were due to Flajolet and Martin in 1983, where they provided a slightly more involved version of the above algorithm, which uses logarithmic space.

Later revisions to the algorithm (2003) got the space requirement down to $O(\log \log n)$, which is exponentially better than our solution. And further tweaks and analysis improved the variance bounds to something like a multiplicative factor of $\sqrt{m}$. This is called the HyperLogLog algorithm, and it has been tested in practice at Google.

Finally, a theoretically optimal algorithm (achieving an arbitrarily good estimate with logarithmic space) was presented and analyzed by Kane et al in 2010.

# Load Balancing and the Power of Hashing

Here’s a bit of folklore I often hear (and retell) that’s somewhere between a joke and deep wisdom: if you’re doing a software interview that involves some algorithms problem that seems hard, your best bet is to use hash tables.

More succinctly put: Google loves hash tables.

As someone with a passion for math and theoretical CS, I find this kind of silly and reductionist. But if you actually work with terabytes of data that can’t fit on a single machine, it also makes sense.

But to understand why hash tables are so applicable, you should have at least a fuzzy understanding of the math that goes into it, which is surprisingly unrelated to the actual act of hashing. Instead it’s the guarantees that a “random enough” hash provides that makes it so useful. The basic intuition is that if you have an algorithm that works well assuming the input data is completely random, then you can probably get a good guarantee by preprocessing the input by hashing.

In this post I’ll explain the details, and show the application to an important problem that one often faces in dealing with huge amounts of data: how to allocate resources efficiently (load balancing). As usual, all of the code used in the making of this post is available on Github.

Next week, I’ll follow this post up with another application of hashing to estimating the number of distinct items in a set that’s too large to store in memory.

## Families of Hash Functions

To emphasize which specific properties of hash functions are important for a given application, we start by introducing an abstraction: a hash function is just some computable function that accepts strings as input and produces numbers between 1 and $n$ as output. We call the set of allowed inputs $U$ (for “Universe”). A family of hash functions is just a set of possible hash functions to choose from. We’ll use a scripty $\mathscr{H}$ for our family, and so every hash function $h$ in $\mathscr{H}$ is a function $h : U \to \{ 1, \dots, n \}$.

You can use a single hash function $h$ to maintain an unordered set of objects in a computer. The reason this is a problem that needs solving is that, if you were to store items sequentially in a list, then determining whether a specific item is already in the list potentially requires checking every item in the list (or doing something fancier). In any event, without hashing you have to spend some non-negligible amount of time searching. With hashing, you can choose the location of an element $x \in U$ based on the value of its hash $h(x)$. If you pick your hash function well, then you’ll have very few collisions and can deal with them efficiently. The relevant section on Wikipedia has more about the various techniques to deal with collisions in hash tables specifically, but we want to move beyond that in this post.

So what’s the use of having many hash functions? The answer is that you can pick a hash function randomly from a “good” family of hash functions. While this doesn’t seem so magical, it has the informal property that it makes arbitrary data “random enough,” so that an algorithm which you designed to work with truly random data will also work with the hashes of arbitrary data. Moreover, even if an adversary knows $\mathscr{H}$ and knows that you’re picking a hash function at random, there’s no way for the adversary to manufacture problems by feeding bad data. With overwhelming probability the worst-case scenario will not occur. Our first example of this is in load balancing.

## Load balancing and 2-universality

You can imagine load balancing in two ways, concretely and mathematically. In the concrete version you have a public-facing server that accepts requests from users, and forwards them to a back-end server which processes them and sends a response to the user. When you have a billion users and a million servers, you want to forward the requests in such a way that no server gets too many requests, or else the users will experience delays. Moreover, you’re worried that the League of Tanzanian Hackers is trying to take down your website by sending you requests in a carefully chosen order so as to screw up your load balancing algorithm.

The mathematical version of this problem usually goes with the metaphor of balls and bins. You have some collection of $m$ balls and $n$ bins in which to put the balls, and you want to put the balls into the bins. But there’s a twist: an adversary is throwing balls at you, and you have to put them into the bins before the next ball comes, so you don’t have time to remember (or count) how many balls are in each bin already. You only have time to do a small bit of mental arithmetic, sending ball $i$ to bin $f(i)$ where $f$ is some simple function. Moreover, whatever rule you pick for distributing the balls in the bins, the adversary knows it and will throw balls at you in the worst order possible.


There is one obvious approach: why not just pick a uniformly random bin for each ball? The problem here is that we need the choice to be persistent. That is, if the adversary throws the same ball at us a second time, we need to put it in the same bin as the first time, and it doesn’t count toward the overall load. This is where the ball/bin metaphor breaks down. In the request/server picture, there is data specific to each user stored on the back-end server between requests (a session), and you need to make sure that data is not lost for some reasonable period of time. And if we were to save a uniform random choice after each request, we’d need to store a number for every request, which is too much. In short, we need the mapping to be persistent, but we also want it to be “like random” in effect.

So what do you do? The idea is to take a “good” family of hash functions $\mathscr{H}$, pick one $h \in \mathscr{H}$ uniformly at random for the whole game, and when you get a request/ball $x \in U$ send it to server/bin $h(x)$. Note that in this case, the adversary knows your universal family $\mathscr{H}$ ahead of time, and it knows your algorithm of committing to some single randomly chosen $h \in \mathscr{H}$, but the adversary does not know which particular $h$ you chose.

The property of a family of hash functions that makes this strategy work is called 2-universality.

Definition: A family of functions $\mathscr{H}$ from some universe $U$ to $\{ 1, \dots, n \}$ is called 2-universal if, for every two distinct $x, y \in U$, the probability over the random choice of a hash function $h$ from $\mathscr{H}$ that $h(x) = h(y)$ is at most $1/n$. In notation,

$\displaystyle \Pr_{h \in \mathscr{H}}[h(x) = h(y)] \leq \frac{1}{n}$

I’ll give an example of such a family shortly, but let’s apply this to our load balancing problem. Our load-balancing algorithm would fail if, with even some modest probability, there is some server that receives many more than its fair share ($m/n$) of the $m$ requests. If $\mathscr{H}$ is 2-universal, then we can compute an upper bound on the expected load of a given server, say server 1. Specifically, pick any element $x$ which hashes to 1 under our randomly chosen $h$. Then we can compute an upper bound on the expected number of other elements that hash to 1. In this computation we’ll only use the fact that expectation splits over sums, and the definition of 2-universality. Call $\mathbf{1}_{h(y) = 1}$ the random variable which is zero when $h(y) \neq 1$ and one when $h(y) = 1$, and call $X = \sum_{y} \mathbf{1}_{h(y) = 1}$, where $y$ ranges over the $m$ requests. In words, $X$ simply represents the number of requests that hash to 1. Then
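$\displaystyle \mathbb{E}[X] = \sum_{y} \mathbb{E}[\mathbf{1}_{h(y) = 1}] = 1 + \sum_{y \neq x} \Pr[h(y) = h(x)] \leq 1 + \frac{m-1}{n}.$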

So in expectation, server 1 gets about its fair share $m/n$ of the requests. And clearly this doesn’t depend on the output hash being 1; it works for any server. There are two obvious questions.

1. How do we measure the risk that, despite the expectation we computed above, some server is overloaded?
2. If it seems like (1) is on track to happen, what can you do?

For 1 we’re asking to compute, for a given deviation $t$, the probability that $X - \mathbb{E}[X] > t$. This makes more sense if we jump to multiplicative factors, since it’s usually okay for a server to bear twice or three times its usual load, but not something like $\sqrt{n}$ times its usual load. (Industry experts, please correct me if I’m wrong! I’m far from an expert on the practical details of load balancing.)

So we want to know the probability that $X - \mathbb{E}[X] > t \cdot \mathbb{E}[X]$ for some small number $t$, and we want this to get small quickly as $t$ grows. This is where the Chebyshev inequality becomes useful. For those who don’t want to click the link, for our situation Chebyshev’s inequality is the statement that, for any random variable $X$,

$\displaystyle \Pr[|X - \mathbb{E}[X]| > t\mathbb{E}[X]] \leq \frac{\textup{Var}[X]}{t^2 \mathbb{E}^2[X]}.$

So all we need to do is compute the variance of the load of a server. It’s a bit of a hairy calculation to write down, but rest assured it doesn’t use anything fancier than the linearity of expectation and 2-universality. Let’s dive in. We start by writing the definition of variance as an expectation, and then we split $X$ up into its parts, expand the product and group the parts.

$\displaystyle \textup{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

The easy part is $(\mathbb{E}[X])^2$: it’s just $(1 + (m-1)/n)^2$. The hard part is $\mathbb{E}[X^2]$, so let’s compute that.
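$\displaystyle \mathbb{E}[X^2] = \mathbb{E}\left[ \left( \sum_{y} \mathbf{1}_{h(y) = 1} \right)^2 \right] = \mathbb{E}[X] + \sum_{y \neq z} \Pr[h(y) = h(z) = 1]$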

In order to continue (and get a reasonable bound) we need an additional property of our hash family which is not immediately spelled out by 2-universality. Specifically, we need that for every $h$ and $i$, $\Pr_x[h(x) = i] = O(\frac{1}{n})$. In other words, each hash function should evenly split the inputs across servers.

The reason this helps is because we can split $\Pr[h(y) = h(z) = 1]$ into $\Pr[h(y) = h(z) \mid h(z) = 1] \cdot \Pr[h(z) = 1]$. Using 2-universality to bound the first factor and the evenness property above to bound the second, each term is at most $O(1/n^2)$, and since there are $\binom{m}{2}$ total terms in the double sum above, the whole thing is at most $O(m/n + m^2 / n^2) = O(m^2 / n^2)$. Note that in our big-O analysis we’re assuming $m$ is much bigger than $n$.

Sweeping some of the details inside the big-O, this means that our variance is $O(m^2/n^2)$, and so our bound on the deviation of $X$ from its expectation by a multiplicative factor of $t$ is at most $O(1/t^2)$.

Now we computed a bound on the probability that a single server is overloaded, but if we want to extend that to the worst-case server, the typical probability technique is to take the union bound over all servers. This means we just add up all the individual bounds and ignore how they relate. So the probability that some server has a load more than a multiplicative factor of $t$ above its expectation is at most $O(n/t^2)$. This is only less than one when $t = \Omega(\sqrt{n})$, so all we can say with this analysis is that (with some small constant probability) no server will have a load worse than $\sqrt{n}$ times more than the expected load.

So we have this analysis that seems not so good. If we have a million servers then the worst load on one server could potentially be a thousand times higher than the expected load. This doesn’t scale, and the problem could be in any (or all) of three places:

1. Our analysis is weak, and we should use tighter bounds because the true max load is actually much smaller.
2. Our hash families don’t have strong enough properties, and we should beef those up to get tighter bounds.
3. The whole algorithm sucks and needs to be improved.

It turns out all three are true. One heuristic solution is easy and avoids all math. Have some second server (which does not process requests) count hash collisions. When some server exceeds a factor of $t$ more than the expected load, send a message to the load balancer to randomly pick a new hash function from $\mathscr{H}$ and for any requests that don’t have existing sessions (this is included in the request data), use the new hash function. Once the old sessions expire, switch any new incoming requests from those IPs over to the new hash function.
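A minimal sketch of this heuristic (with hypothetical names, collapsing the counting server and the load balancer into a single object, and ignoring sessions entirely) might look like the following; the only interface it assumes is a hash family with a draw() method, like the one we build below.

class RedrawingBalancer(object):
    def __init__(self, hashFamily, numBins, threshold=3):
        self.hashFamily = hashFamily
        self.numBins = numBins
        self.threshold = threshold  # allowed multiple of the expected load
        self.h = hashFamily.draw()
        self.loads = [0] * numBins
        self.numRequests = 0

    def assign(self, request):
        binIndex = self.h(request)
        self.loads[binIndex] += 1
        self.numRequests += 1

        # the "second server counting collisions": once we've seen a decent number
        # of requests, if some bin carries more than threshold times its fair
        # share, redraw the hash function for future requests
        expectedLoad = self.numRequests / self.numBins
        if (self.numRequests > self.numBins and
                max(self.loads) > self.threshold * expectedLoad):
            self.h = self.hashFamily.draw()

        return binIndex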

But there are much better solutions out there. Unfortunately their analyses are too long for a blog post (they fill multiple research papers). Fortunately their descriptions and guarantees are easy to describe, and they’re easy to program. The basic idea goes by the name “the power of two choices,” which we explored on this blog in a completely different context of random graphs.

In more detail, the idea is that you start by picking two random hash functions $h_1, h_2 \in \mathscr{H}$, and when you get a new request, you compute both hashes, inspect the load of the two servers indexed by those hashes, and send the request to the server with the smaller load.

This has the disadvantage of requiring bidirectional talk between the load balancer and the server, rather than obliviously forwarding requests. But the advantage is an exponential decrease in the worst-case maximum load. In particular, the following theorem holds for the case where the hashes are fully random.

Theorem: Suppose one places $m$ balls into $n$ bins in order according to the following procedure: for each ball pick two uniformly random and independent integers $1 \leq i,j \leq n$, and place the ball into whichever of bins $i, j$ currently contains fewer balls. If there are ties pick the bin with the smaller index. Then with high probability the largest bin has no more than $\Theta(m/n) + O(\log \log (n))$ balls.

This theorem appears to have been proved in a few different forms, with the best analysis being by Berenbrink et al. You can improve the constant on the $\log \log n$ by computing more than 2 hashes. How does this relate to a good family of hash functions, which is not quite fully random? Let’s explore the answer by implementing the algorithm in python.

## An example of universal hash functions, and the load balancing algorithm

In order to implement the load balancer, we need to have some good hash functions under our belt. We’ll go with the simplest example of a hash function that’s easy to prove nice properties for. Specifically each hash in our family just performs some arithmetic modulo a random prime.

Definition: Pick any prime $p > m$, and for any $1 \leq a < p$ and $0 \leq b < p$ define $h_{a,b}(x) = (ax + b \mod p) \mod m$. Let $\mathscr{H} = \{ h_{a,b} \mid 0 \leq b < p, 1 \leq a < p \}$.

This family of hash functions is 2-universal.

Theorem: For every $x \neq y \in \{0, \dots, p-1\}$,

$\Pr_{h \in \mathscr{H}}[h(x) = h(y)] \leq 1/m$

Proof. To say that $h(x) = h(y)$ is to say that $ax+b = ay+b + i \cdot m \mod p$ for some integer $i$. I.e., the two remainders of $ax+b$ and $ay+b$ are equivalent mod $m$. The $b$‘s cancel and we can solve for $a$

$a = im (x-y)^{-1} \mod p$

Since $a \neq 0$, there are $p-1$ possible choices for $a$. Moreover, there is no point in picking $i$ bigger than $p/m$ since we’re working modulo $p$. So there are at most $(p-1)/m$ possible values for the right-hand side of the above equation. So if we choose $a$ uniformly at random (remember, $x-y$ is fixed ahead of time, so the only choices are $a$ and $i$), then there is a $(p-1)/m$ out of $p-1$ chance that the equality holds for some $i$, which is at most $1/m$. (To be exact you should account for taking a floor of $(p-1)/m$ when $m$ does not evenly divide $p-1$, but it only decreases the overall probability.)

$\square$

If $m$ and $p$ were equal then this would be even more trivial: it’s just the fact that there is a unique line passing through any two distinct points. While that’s obviously true from standard geometry, it is also true when you work with arithmetic modulo a prime. In fact, it works using arithmetic over any field.

Implementing these hash functions is easier than shooting fish in a barrel.

import random

def draw(p, m):
    a = random.randint(1, p-1)
    b = random.randint(0, p-1)

    return lambda x: ((a*x + b) % p) % m


To encapsulate the process a little bit we implemented a UniversalHashFamily class which computes a random probable prime to use as the modulus and stores $m$. The interested reader can see the Github repository for more.
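The class itself is short. Here is a sketch of what it might look like (the repository version differs in its details, e.g., in how it generates a probable prime; the Miller-Rabin test below is one standard way to do it):

import random

def probablyPrime(n, attempts=20):
    # Miller-Rabin test; wrongly reports "prime" with probability at most 4^(-attempts)
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False

    # write n - 1 as d * 2^r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1

    for _ in range(attempts):
        a = random.randint(2, n - 2)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

class UniversalHashFamily(object):
    def __init__(self, numBins, primeBounds):
        self.numBins = numBins
        # use a random probable prime in the given range as the modulus
        self.prime = random.choice([p for p in range(primeBounds[0], primeBounds[1])
                                    if probablyPrime(p)])

    def draw(self):
        a = random.randint(1, self.prime - 1)
        b = random.randint(0, self.prime - 1)
        return lambda x: ((a * x + b) % self.prime) % self.numBins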

If we try to run this and feed in a large range of inputs, we can see how the outputs are distributed. In this example $m$ is a hundred thousand and $n$ is a hundred (it’s not two terabytes, but give me some slack it’s a demo and I’ve only got my desktop!). So the expected bin size for any 2-universal family is just about 1,000.

>>> m = 100000
>>> n = 100
>>> H = UniversalHashFamily(numBins=n, primeBounds=[n, 2*n])
>>> results = []
>>> for simulation in range(100):
...    bins = [0] * n
...    h = H.draw()
...    for i in range(m):
...       bins[h(i)] += 1
...    results.append(max(bins))
...
>>> max(bins) # a single run
1228
>>> min(bins)
613
>>> max(results) # the max bin size over all runs
1228
>>> min(results)
1227


Indeed, the max is very close to the expected value.

But this example is misleading, because the point of this was that some adversary would try to screw us over by picking a worst-case input. If the adversary knew exactly which $h$ was chosen (which it doesn’t) then the worst case input would be the set of all inputs that have the given hash output value. Let’s see it happen live.

>>> h = H.draw()
>>> badInputs = [i for i in range(m) if h(i) == 9]
>>> len(badInputs)
1227
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1227, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


When we hash just these 1227 bad inputs into the 100 bins with the same $h$, they all land in bin 9: the expected size of a bin is about 12, but this is 100 times worse (linearly worse in $n$). But if we instead pick a random $h$ after the bad inputs are chosen, the result is much better.

>>> testInputs(n,m,badInputs) # randomly picks a hash
[19, 20, 20, 19, 18, 18, 17, 16, 16, 16, 16, 17, 18, 18, 19, 20, 20, 19, 18, 17, 17, 16, 16, 16, 16, 17, 18, 18, 19, 20, 20, 19, 18, 17, 17, 16, 16, 16, 16, 8, 8, 9, 9, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 9, 8, 8, 8, 8, 8, 8, 8, 9, 9, 10]


However, if you re-ran this test many times, you’d eventually get unlucky and draw the hash function for which this actually is the worst input, and get a single huge bin. Other times you can get a bad hash in which two or three bins have all the inputs.

An interesting question is, what is really the worst-case input for this algorithm? I suspect it’s characterized by some choice of hash output values, taking all inputs for the chosen outputs. If this is the case, then there’s a tradeoff between the number of inputs you pick and how egregious the worst bin is. As an exercise to the reader, empirically estimate this tradeoff and find the best worst-case input for the adversary. Also, for your choice of parameters, estimate by simulation the probability that the max bin is three times larger than the expected value.

Now that we’ve played around with the basic hashing algorithm and made a family of 2-universal hashes, let’s see the power of two choices. Recall, this algorithm picks two random hash functions and sends an input to the bin with the smallest size. This obviously generalizes to $k$ choices, although the theoretical guarantee only improves by a constant factor, so let’s implement the more generic version.

class ChoiceHashFamily(object):
    def __init__(self, hashFamily, queryBinSize, numChoices=2):
        self.queryBinSize = queryBinSize
        self.hashFamily = hashFamily
        self.numChoices = numChoices

    def draw(self):
        hashes = [self.hashFamily.draw()
                  for _ in range(self.numChoices)]

        def h(x):
            indices = [f(x) for f in hashes]
            counts = [self.queryBinSize(i) for i in indices]
            count, index = min([(c,i) for (c,i) in zip(counts,indices)])
            return index

        return h
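The family needs the queryBinSize callback to inspect current loads, which means whoever runs the simulation has to maintain the bin counts. A sketch of how the test below might be wired up (a hypothetical harness; the one in the Github repository may differ), reusing $H$, $n$, and the bad inputs from before:

bins = [0] * n
chooser = ChoiceHashFamily(H, queryBinSize=lambda i: bins[i], numChoices=2).draw()

for x in badInputs:
    bins[chooser(x)] += 1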


And if we test this with the bad inputs (as used previously, all the inputs that hash to 9), as a typical output we get

>>> bins
[15, 16, 15, 15, 16, 14, 16, 14, 16, 15, 16, 15, 15, 15, 17, 14, 16, 14, 16, 16, 15, 16, 15, 16, 15, 15, 17, 15, 16, 15, 15, 15, 15, 16, 15, 14, 16, 14, 16, 15, 15, 15, 14, 16, 15, 15, 15, 14, 17, 14, 15, 15, 14, 16, 13, 15, 14, 15, 15, 15, 14, 15, 13, 16, 14, 16, 15, 15, 15, 16, 15, 15, 13, 16, 14, 15, 15, 16, 14, 15, 15, 15, 11, 13, 11, 12, 13, 14, 13, 11, 11, 12, 14, 14, 13, 10, 16, 12, 14, 10]


And a typical list of bin maxima is

>>> results
[16, 16, 16, 18, 17, 365, 18, 16, 16, 365, 18, 17, 17, 17, 17, 16, 16, 17, 18, 16, 17, 18, 17, 16, 17, 17, 18, 16, 18, 17, 17, 17, 17, 18, 18, 17, 17, 16, 17, 365, 17, 18, 16, 16, 18, 17, 16, 18, 365, 16, 17, 17, 16, 16, 18, 17, 17, 17, 17, 17, 18, 16, 18, 16, 16, 18, 17, 17, 365, 16, 17, 17, 17, 17, 16, 17, 16, 17, 16, 16, 17, 17, 16, 365, 18, 16, 17, 17, 17, 17, 17, 18, 17, 17, 16, 18, 18, 17, 17, 17]


Those big bumps (the 365’s) are the runs where we picked an unlucky hash function, and they are scarily large, although this bad event would become proportionally less likely as you scale up. But in the good case the load is clearly more even than in the previous example, and the max load would shrink as you pick among a larger set of randomly chosen hashes (though, as mentioned above, the theoretical guarantee only improves by a constant factor).

Couple this with the technique of switching hash functions when you start to observe a large deviation, and you have yourself an elegant solution.

In addition to load balancing, hashing has a ton of applications. Remember, the main cue that you may want to use hashing is when you have an algorithm that works well when the input data is random. This comes up in streaming and sublinear algorithms, in data structure design and analysis, and in many other places. We’ll be covering those applications in future posts on this blog.

Until then!

# Serial Dictatorships and House Allocation

I was recently an invited speaker in a series of STEM talks at Moraine Valley Community College. My talk was called “What can algorithms tell us about life, love, and happiness?” and it’s on Youtube now so you can go watch it. The central theme of the talk was the lens of computation, that algorithms and theoretical computer science can provide new and novel explanations for the natural phenomena we observe in the world.

One of the main stories I told in the talk is about stable marriages and the deferred acceptance algorithm, which we covered previously on this blog. However, one of the examples of the applications I gave was to kidney exchanges and school allocation. I said in the talk that it’s a variant of the stable marriages, but it’s not clear exactly how the two are related. This post will fill that gap and showcase some of the unity in the field of mechanism design.

Mechanism design, which is sometimes called market design, has a grand vision. There is a population of players with individual incentives, and given some central goal the designer wants to come up with a game where the self-interest of the players will lead them to efficiently achieve the designer’s goals. That’s what we’re going to do today with a class of problems called allocation problems.

As usual, all of the code we used in this post is available in a repository on this blog’s Github page.

## Allocating houses with dictators

In stable marriages we had $n$ men and $n$ women and we wanted to pair them off one to one in a way that there were no mutual incentives to cheat. Let’s modify this scenario so that only one side has preferences and the other does not. The analogy here is that we have $n$ people and $n$ houses, but what do we want to guarantee? It doesn’t make sense to say that people will cheat on each other, but it does make sense to ask that there’s no way for people to swap houses and have everyone be at least as happy as before. Let’s formalize this.

Let $A$ be a set of people (agents) and $H$ be a set of houses, and $n = |A| = |H|$. A matching is a one-to-one map from $A \to H$. Each agent is assumed to have a strict preference over houses, and if we’re given two houses $h_1, h_2$ and $a \in A$ prefers $h_1$ over $h_2$, we express that by saying $h_1 >_a h_2$. If we want to include the possibility that $h_1 = h_2$, we would say $h_1 \geq_a h_2$. I.e., either they’re the same house, or $a$ strictly prefers $h_1$.

Definition: A matching $M: A \to H$ is called pareto-optimal if there is no other matching $N$ with both of the following properties:

• Every agent is at least as happy in $N$ as in $M$, i.e. for every $a \in A$, $N(a) \geq_a M(a)$.
• Some agent is strictly happier in $N$, i.e. there exists an $a \in A$ with $N(a) >_a M(a)$.

We say a matching $N$ “pareto-dominates” another matching $M$ if these two properties hold. As a side note, if you like abstract algebra you might notice that you can take matchings and form them into a lattice where the comparison is pareto-domination. If you go deep into the theory of lattices, you can use some nice fixed-point theorems to (non-constructively) prove the existence of optimal allocations in this context and for stable marriages. See this paper if you’re interested. Of course, we will give efficient algorithms to achieve our goals, which is how I prefer to live life.

The mechanism we’ll use to find such an optimal matching is extremely simple, and it’s called the serial dictatorship.

First you pick an arbitrary ordering of the agents and all houses are marked “available.” Then the first agent in the ordering picks their top choice, and you remove their choice from the available houses. Continue in this way down the list until you get to the end, and the outcome is guaranteed to be pareto-optimal.

Theorem: Serial dictatorship always produces a pareto-optimal matching.

Proof. Let $M$ be the output of the algorithm. Suppose the theorem is false, that there is some $N$ that pareto-dominates $M$. Let $a$ be the first agent in the chosen ordering who gets a strictly better house in $N$ than in $M$. Whatever house $a$ gets, call it $N(a)$, it has to be a house that was already unavailable at the time in the algorithm when $a$ got to pick (otherwise $a$ would have picked $N(a)$ during the algorithm!). This means that $a$ took the house chosen by some agent $b \in A$ whose turn to pick comes before $a$’s. But by assumption, $a$ was the first agent to get a strictly better house, so $b$ cannot get a better house in $N$; and since $N$ is one-to-one, $b$ cannot keep $M(b) = N(a)$ either, so $b$ ends up with a strictly worse house. This contradicts that every agent is at least as happy in $N$ as in $M$, so $N$ cannot pareto-dominate $M$.

$\square$

It’s easy enough to implement this in Python. Each agent will be represented by its list of preferences, each object will be an integer, and the matching will be a dictionary. The only thing we need to do is pick a way to order the agents, and we’ll just pick a random ordering. As usual, all of the code used in this post is available on this blog’s github page.

import random

# serialDictatorship: [[int]], [int] -> {int: int}
# construct a pareto-optimal allocation of objects to agents.
def serialDictatorship(agents, objects, seed=None):
    if seed is not None:
        random.seed(seed)

    agentPreferences = agents[:]
    random.shuffle(agentPreferences)
    allocation = dict()
    availableHouses = set(objects)

    for agentIndex, preference in enumerate(agentPreferences):
        allocation[agentIndex] = max(availableHouses, key=preference.index)
        availableHouses.remove(allocation[agentIndex])

    return allocation


And a test

agents = [['d','a','c','b'], # 4th in my chosen seed
          ['a','d','c','b'], # 3rd
          ['a','d','b','c'], # 2nd
          ['d','a','c','b']] # 1st
objects = ['a','b','c','d']
allocation = serialDictatorship(agents, objects, seed=1)
test({0: 'b', 1: 'c', 2: 'd', 3: 'a'}, allocation)


This algorithm is so simple it’s almost hard to believe. But it gets better, because under some reasonable conditions, it’s the only algorithm that solves this problem.

Theorem [Svensson 98]: Serial dictatorship is the only algorithm that produces a pareto-optimal matching and also has the following three properties:

• Strategy-proof: no agent can improve their outcomes by lying about their preferences at the beginning.
• Neutral: the outcome of the algorithm is unchanged if you permute the items (i.e., does not depend on the index of the item in some list)
• Non-bossy: No agent can change the outcome of the algorithm without also changing the object they receive.

And if we drop any one of these conditions there are other mechanisms that satisfy the rest. This theorem was proved in this paper by Lars-Gunnar Svensson in 1998, and it’s not particularly long or complicated. The proof of the main theorem is about a page. It would be a great exercise in reading mathematics to go through the proof and summarize the main idea (you could even leave a comment with your answer!).

## Allocation with existing ownership

Now we switch to a slightly different problem. There are still $n$ houses and $n$ agents, but now every agent already “owns” a house. The question becomes: can they improve their situation by trading houses? It shouldn’t be immediately obvious whether this is possible, because a trade can happen in a “cycle” like the following:

Here A prefers the house of B, and B prefers the house of C, and C prefers the house of A, so they’d all benefit from doing a three-way cyclic trade. You can easily imagine the generalization to larger cycles.

This model was studied by Shapley and Scarf in 1974 (the same Shapley who did the deferred acceptance algorithm for stable marriages). Just as you’d expect, our goal is to find an optimal (re)-allocation of houses to agents in which there is no cycle that stands to improve. That is, there is no subset of agents that can jointly improve their standing. In formalizing this we call an “optimal” matching a core matching. Again $A$ is a set of agents, and $H$ is a set of houses.

Definition: A matching $M: A \to H$ is called a core matching if there is no subset $B \subset A$ and no matching $N: A \to H$ with the following properties:

• For every $b \in B$, $N(b)$ is owned by some other agent in $B$ (trading only happens within $B$).
• Every agent $b$ in $B$ is at least as happy as before, i.e. $N(b) \geq_b M(b)$ for all $b$.
• Some agent in $B$ strictly improves, i.e. for some $b, N(b) >_b M(b)$.

We also call an algorithm individually rational if it ensures that every agent gets a house that is at least as good as their starting house. It should be clear that an algorithm which produces a core matching is individually rational, because for any agent $a$ we can set $B = \{a\}$, i.e. force $a$ to consider not trading at all, and being a core matching says that’s not better for $a$. Likewise, core matchings are also pareto-optimal by setting $B = A$.

It might seem like the idea of a “core” solution to an allocation problem is more general, and you’re right. You can define it for a very general setting of cooperative games and prove the existence of core matchings in that setting. See Wikipedia for more. As is our prerogative, we’ll achieve the same thing by constructing core matchings with an algorithm.

Indeed, the following theorem is due to Shapley & Scarf.

Theorem [Shapley-Scarf 74]: There is a core matching for every choice of preferences. Moreover, one can be found by an efficient algorithm.

Proof. The mechanism we’ll define is called the top trading cycles algorithm. We operate in rounds, and the first round goes as follows.

Form a directed graph with nodes in $A \cup H$. That is, there is one node for each agent and one node for each house. Then we start by having each agent “point” to its most preferred house, and each house “point” to its original owner. That is, we add in directed edges from agents to their top picks, and from houses to their owners. For example, say there are five agents $A = \{ a, b, c, d, e \}$ and houses $H = \{ 1,2,3,4,5 \}$ with $a$ owning $1$, $b$ owning $2$, etc., but their favorite picks go backwards, so that $a$ prefers house $5$ most, $b$ prefers $4$ most, $c$ prefers $3$ (which $c$ also owns), etc. Then the “pointing picture” in the first round consists of exactly these agent-to-favorite-house and house-to-owner edges.

The claim about such a graph is that there is always some directed cycle. In the example above, there are three. And moreover, we claim that no two cycles can share an edge. It’s easy to see there has to be a cycle: you can start at any agent and just follow the single outgoing edge until you find yourself repeating some vertices. By the fact that there is only one edge going out of any vertex, it follows that no two cycles could share an edge (or else in the last edge they share, there’d have to be a fork, i.e. two outgoing edges).

In the example above, you can start from A and follow the only edge and you get the cycle A -> 5 -> E -> 1 -> A. Similarly, starting at 4 would give you 4 -> D -> 2 -> B -> 4.

The point is that when you remove a cycle, you can have the agents in that cycle do the trade indicated by the cycle and remove the entire cycle from the graph. The consequence of this is that you have some agents who were pointing to houses that are removed, and so these agents revise their outgoing edge to point at their next most preferred available house. You can then continue removing cycles in this way until all the agents have been assigned a house.

The proof that this is a core matching is analogous to the proof that serial dictatorships are pareto-optimal. If there were some subset $B$ and some other matching $N$ under which $B$ does better, consider the agent in $B$ who is the first to be removed in a cycle during the algorithm’s run. At that moment every house owned by a member of $B$ is still present, and that agent got its top pick among all remaining houses, so trading only within $B$ cannot give it anything better; repeating this argument down the removal order shows that no agent in $B$ can strictly improve.

$\square$

This algorithm is commonly called the Top Trading Cycles algorithm, because it splits the set of agents and houses into a disjoint union of cycles, each of which is the best trade possible for every agent involved.

Implementing the Top Trading Cycles algorithm in code requires us to be able to find cycles in graphs, but that isn’t so hard. I implemented a simple data structure for a graph with helper functions that are specific to our kind of graph (i.e., every vertex has outdegree 1, so the algorithm to find cycles is simpler than something like Tarjan’s algorithm). You can see the data structure on this post’s github repository in the file graph.py. An example of using it:

>>> G = Graph([1,'a',2,'b',3,'c',4,'d',5,'e',6,'f'])
>>> G.addEdges([(1,'a'), ('a',2), (2,'b'), ('b',3), (3,'c'), ('c',1),
...             (4,'d'), ('d',5), (5,'e'), ('e',4), (6,'f'), ('f',6)])
>>> G['d']
Vertex('d')
>>> G['d'].outgoingEdges
{('d', 5)}
>>> G['d'].anyNext() # return the target of any outgoing edge from 'd'
Vertex(5)
>>> G.delete('e')
>>> G[4].incomingEdges
set()


Next we implement a function to find a cycle, and a function to extract the agents from a cycle. For the latter we can assume the cycle is just represented by any agent on the cycle (again, because our graphs always have outdegree exactly 1).

# anyCycle: graph -> vertex
# find any vertex involved in a cycle
def anyCycle(G):
    visited = set()
    v = G.anyVertex()

    while v not in visited:
        visited.add(v)
        v = v.anyNext()

    return v

# getAgents: graph, vertex, set -> set(vertex)
# get the set of agents on a cycle starting at the given vertex
def getAgents(G, cycle, agents):
    # make sure the starting vertex is a house
    if cycle.vertexId in agents:
        cycle = cycle.anyNext()

    startingHouse = cycle
    currentVertex = startingHouse.anyNext()
    theAgents = set()

    while currentVertex not in theAgents:
        theAgents.add(currentVertex)
        currentVertex = currentVertex.anyNext()  # step to the house this agent wants
        currentVertex = currentVertex.anyNext()  # step to that house's owner, the next agent

    return theAgents


Finally, implementing the algorithm is just bookkeeping. After setting up the initial graph, the core of the routine is

def topTradingCycles(agents, houses, agentPreferences, initialOwnership):
    # form the initial graph

    ...

    allocation = dict()
    while len(G.vertices) > 0:
        cycle = anyCycle(G)
        cycleAgents = getAgents(G, cycle, agents)

        # assign agents in the cycle their choice of house
        for a in cycleAgents:
            h = a.anyNext().vertexId
            allocation[a.vertexId] = h
            G.delete(a)
            G.delete(h)

        for a in agents:
            if a in G.vertices and G[a].outdegree() == 0:
                # update preferences
                ...

    return allocation


This mutates the graph in each round by deleting any cycle that was found, and adding new edges when the top choice of some agent is removed. Finally, to fill in the ellipses we just need to say how we represent the preferences. The input agentPreferences is a dictionary mapping agents to a list of all houses in order of preference. So again we can just represent the “top available pick” by an index and update that index when agents lose their top pick.

# maps agent to an index of the list agentPreferences[agent]
currentPreferenceIndex = dict((a,0) for a in agents)
preferredHouse = lambda a: agentPreferences[a][currentPreferenceIndex[a]]


Then to update we just have to replace the currentPreferenceIndex for each disappointed agent by its next best option.

for a in agents:
    if a in G.vertices and G[a].outdegree() == 0:
        while preferredHouse(a) not in G.vertices:
            currentPreferenceIndex[a] += 1


And that’s it! We included a small suite of test cases which you can run if you want to play around with it more.
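As a sanity check, here is what one such test might look like on the backwards-preferences example from the proof. (This guesses at the conventions: it assumes initialOwnership maps each house to the agent who owns it, and that the returned allocation maps agent ids to house ids, as in the skeleton above.)

agents = ['a', 'b', 'c', 'd', 'e']
houses = [1, 2, 3, 4, 5]

# agent a owns house 1, b owns 2, and so on, but the favorite picks run backwards
agentPreferences = {'a': [5, 4, 3, 2, 1],
                    'b': [4, 5, 3, 2, 1],
                    'c': [3, 4, 5, 2, 1],
                    'd': [2, 3, 4, 5, 1],
                    'e': [1, 2, 3, 4, 5]}
initialOwnership = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

allocation = topTradingCycles(agents, houses, agentPreferences, initialOwnership)

# all three cycles are removed in the first round and every agent gets their top pick:
# {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}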

One final nice thing about this algorithm is that it almost generalizes the serial dictatorship algorithm. What you do is rather than have each house point to its original owner, you just have all houses point to the first agent in the pre-specified ordering. Then a cycle will always have length 2, the first agent gets their preferred house, and in the next round the houses now point to the second agent in the ordering, and so on.

## Kidney exchange

We still need one more ingredient to see the bridge from allocation problems to kidney exchanges. The setting is like this: say Manuel needs a kidney transplant, and he’s lucky enough that his sister-in-law Anastasia wants to donate her kidney to Manuel. However, it turns out that Anastasia doesn’t have the right blood/antibody type for a donation, and so even though she has a kidney to give, they can’t give it to Manuel. Now one might say “just sell your kidney and use the money to buy a kidney with the right type!” Turns out that’s illegal; at some point we as a society decided that it’s immoral to sell organs. But it is legal to exchange a kidney for a kidney. So if Manuel and Anastasia can find a pair of people both of whom happen to have the right blood types, they can arrange for a swap.

But finding two people both of whom have the right blood types is unlikely, and we can actually do far better! We can turn this into a housing allocation problem as follows. Anyone with a kidney to donate is a “house,” and anyone who needs a kidney is an “agent.” And to start off with, we say that each agent “owns” the kidney of their willing donor. And the preferences of each agent are determined by which kidney donors have the right blood type (with ties split, say, by geographical distance). Then when you do the top trading cycles algorithm you find these chains where Anastasia, instead of donating to Manuel, donates to another person who has the right blood type. On the other end of the cycle, Manuel receives a kidney from someone with the right blood type.

The big twist is that not everyone who needs a kidney knows someone willing to donate. So there are agents who are “new” to the market and don’t already own a house. Moreover, maybe you have someone who is willing to donate a kidney but isn’t asking for anything in return.

Because of this the algorithm changes slightly. You can no longer guarantee the existence of a cycle (though you can still guarantee that no two cycles will share an edge). But as new people are added to the graph, cycles will eventually form and you can make the trades. There are a few extra details if you want to ensure that everyone is being honest (if you’re thinking about it like a market in the economic sense, where people could be lying about their preferences).

The resulting mechanism is called You Request My House I Get Your Turn (YRMHIGYT). In short, the idea is that you pick an order on the agents, say for kidney exchanges it’s the order in which the patients are diagnosed. And you have them add edges to the graph in that order. At each step you look for a cycle, and when one appears you remove it as usual. The twist, and the source of the name, is that when someone who has no house requests a house which is already owned, the agent who owns the house gets to jump forward in the queue. This turns out to make everything “fair” (in that everyone is guaranteed to get a house at least as good as the one they own) and one can prove analogous optimality theorems to the ones we did for serial dictatorship.

This mechanism was implemented by Alvin Roth in the US hospital system, and by some measure it has saved many lives. If you want to hear more about the process and how successful the kidney exchange program is, you can listen to this Freakonomics podcast episode where they interviewed Al Roth and some of the patients who benefited from this new allocation market.

It would be an excellent exercise to go deeper into the guts of the kidney exchange program (see this paper by Alvin Roth et al.), and implement the matching system in code. At the very least, implementing the YRMHIGYT mechanism is only a minor modification of our existing Top Trading Cycles code.

Until next time!

# One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

## Statistical parity

Here is one way to define fairness.

Your population is a set $X$ and there is some known subset $S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $X$ is people and $S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $D$ over $X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $D$ takes that into account. Generally we impose no restrictions on $D$, and the definition of fairness will have to work no matter what $D$ is.

Now suppose we have a (possibly randomized) classifier $h:X \to \{-1,1\}$ giving labels to $X$. When given a person $x$ as input $h(x)=1$ if $x$ gets a loan and $-1$ otherwise. The bias, or statistical imparity, of $h$ on $S$ with respect to $X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $S$ is labeled 1 and the probability that a random individual from the complement $S^C$ is labeled 1.

$\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference in the rate at which the majority and protected classes receive a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $h:X \to \{-1,1\}$ is said to have statistical parity on $D$ with respect to $S$ up to bias $\varepsilon$ if $|\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):
    protectedClass = [(x,l) for (x,l) in zip(data, labels)
                      if x[protectedIndex] == protectedValue]
    elseClass = [(x,l) for (x,l) in zip(data, labels)
                 if x[protectedIndex] != protectedValue]

    if len(protectedClass) == 0 or len(elseClass) == 0:
        raise Exception("One of the classes is empty!")
    else:
        protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
        elseProb = sum(1 for (x,l) in elseClass if l == 1) / len(elseClass)

    return elseProb - protectedProb
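For example, on a tiny made-up dataset where the first feature marks the protected class:

points = [[1], [1], [1], [0], [0]]
labels = [1, -1, -1, 1, -1]

# 1/3 of the protected class gets the +1 label, versus 1/2 of everyone else,
# so the signed bias is 1/2 - 1/3
labelBias(points, labels, protectedIndex=0, protectedValue=1)  # 0.1666...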


Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], hypothesis, int, obj -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
    return labelBias(data, [h(x) for x in data], protectedIndex, protectedValue)


Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":
    from data import adult
    # the repo's helpers load the census data; train is a (points, labels) pair

    # [(test name, (index, value))]
    tests = [('gender', (1,0)),
             ('private employment', (2,1)),
             ('asian race', (33,1)),
             ('divorced', (12, 1))]

    for (name, (index, value)) in tests:
        print("'%s' bias in training data: %.4f" %
              (name, labelBias(train[0], train[1], index, value)))


(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582


Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
>>> signedBias(train[0], indian, 1, 0)
0.0030631816119030884


So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least \$50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $T$ in place of $S^C$, or looking at discrepancies among $k$ different sub-populations or with $m$ different outcome labels. In fact, the mathematical name for this measurement (which is a distance between distributions) is the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.
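In the binary-label two-class case the connection is easy to state: conditioning on $x \in S$ and on $x \in S^{C}$ gives two distributions on the labels $\{-1, 1\}$, and the total variation distance between two distributions on two outcomes is exactly the difference of the probabilities they assign to the outcome $1$. In symbols,

$\displaystyle d_{\textup{TV}} = \left| \Pr[h(x) = 1 \mid x \in S^{C}] - \Pr[h(x) = 1 \mid x \in S] \right| = |\textup{bias}_h(X,S,D)|$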

Now it is important to note that statistical parity says nothing about the truth about the protected class $S$. I mean two things by this. First, you could have some historical data you want to train a classifier $h$ on, and usually you’ll be given training labels for the data that tell you whether $h(x)$ should be $1$ or $-1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $S$ necessarily reduces the accuracy of $h$. In other words, when there is bias in the data, accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, this is also why statistical parity says nothing about whether the individuals $h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

## Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can be both unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes me think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s known about it) in a future post.

Until then!