# One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

## Statistical parity

Here is one way to define fairness.

Your population is a set $X$ and there is some known subset $S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $X$ is people and $S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $D$ over $X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $D$ takes that into account. Generally we impose no restrictions on $D$, and the definition of fairness will have to work no matter what $D$ is.

Now suppose we have a (possibly randomized) classifier $h:X \to \{-1,1\}$ giving labels to $X$. When given a person $x$ as input $h(x)=1$ if $x$ gets a loan and $-1$ otherwise. The bias, or statistical imparity, of $h$ on $S$ with respect to $X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $S$ is labeled 1 and the probability that a random individual from the complement $S^C$ is labeled 1.

$\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $h:X \to \{-1,1\}$ is said to have statistical parity on $D$ with respect to $S$ up to bias $\varepsilon$ if $|\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):
protectedClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] == protectedValue]
elseClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] != protectedValue]

if len(protectedClass) == 0 or len(elseClass) == 0:
raise Exception("One of the classes is empty!")
else:
protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

return elseProb - protectedProb


Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)


Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":

# [(test name, (index, value))]
tests = [('gender', (1,0)),
('private employment', (2,1)),
('asian race', (33,1)),
('divorced', (12, 1))]

for (name, (index, value)) in tests:
print("'%s' bias in training data: %.4f" %
(name, labelBias(train[0], train[1], index, value)))


(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582


Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
>>> signedBias(train[0], indian, 1, 0)
0.0030631816119030884


So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least \$50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $T$ in place of $S^C$, or looking at discrepancies among $k$ different sub-populations or with $m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $S$. I mean two things by this. First, you could have some historical data you want to train a classifier $h$ on, and usually you’ll be given training labels for the data that tell you whether $h(x)$ should be $1$ or $-1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $S$ necessarily reduces the accuracy of $h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

## Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!

# The Boosting Margin, or Why Boosting Doesn’t Overfit

There’s a well-understood phenomenon in machine learning called overfitting. The idea is best shown by a graph:

Let me explain. The vertical axis represents the error of a hypothesis. The horizontal axis represents the complexity of the hypothesis. The blue curve represents the error of a machine learning algorithm’s output on its training data, and the red curve represents the generalization of that hypothesis to the real world. The overfitting phenomenon is marker in the middle of the graph, before which the training error and generalization error both go down, but after which the training error continues to fall while the generalization error rises.

The explanation is a sort of numerical version of Occam’s Razor that says more complex hypotheses can model a fixed data set better and better, but at some point a simpler hypothesis better models the underlying phenomenon that generates the data. To optimize a particular learning algorithm, one wants to set parameters of their model to hit the minimum of the red curve.

This is where things get juicy. Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t overfit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty.

Schapire, Freund, Bartlett, and Lee proposed a theoretical explanation for this based on the notion of a margin, and the goal of this post is to go through the details of their theorem and proof. Remember that the standard AdaBoost algorithm produces a set of weak hypotheses $h_i(x)$ and a corresponding weight $\alpha_i \in [-1,1]$ for each round $i=1, \dots, T$. The classifier at the end is a weighted majority vote of all the weak learners (roughly: weak learners with high error on “hard” data points get less weight).

Definition: The signed confidence of a labeled example $(x,y)$ is the weighted sum:

$\displaystyle \textup{conf}(x) = \sum_{i=1}^T \alpha_i h_i(x)$

The margin of $(x,y)$ is the quantity $\textup{margin}(x,y) = y \textup{conf}(x)$. The notation implicitly depends on the outputs of the AdaBoost algorithm via “conf.”

We use the product of the label and the confidence for the observation that $y \cdot \textup{conf}(x) \leq 0$ if and only if the classifier is incorrect. The theorem we’ll prove in this post is

Theorem: With high probability over a random choice of training data, for any $0 < \theta < 1$ generalization error of boosting is bounded from above by

$\displaystyle \Pr_{\textup{train}}[\textup{margin}(x) \leq \theta] + O \left ( \frac{1}{\theta} (\textup{typical error terms}) \right )$

In words, the generalization error of the boosting hypothesis is bounded by the distribution of margins observed on the training data. To state and prove the theorem more generally we have to return to the details of PAC-learning. Here and in the rest of this post, $\Pr_D$ denotes $\Pr_{x \sim D}$, the probability over a random example drawn from the distribution $D$, and $\Pr_S$ denotes the probability over a random (training) set of examples drawn from $D$.

Theorem: Let $S$ be a set of $m$ random examples chosen from the distribution $D$ generating the data. Assume the weak learner corresponds to a finite hypothesis space $H$ of size $|H|$, and let $\delta > 0$. Then with probability at least $1 - \delta$ (over the choice of $S$), every weighted-majority vote function $f$ satisfies the following generalization bound for every $\theta > 0$.

$\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{\log m \log |H|}{\theta^2} + \log(1/\delta)} \right )$

In other words, this phenomenon is a fact about voting schemes, not boosting in particular. From now on, a “majority vote” function $f(x)$ will mean to take the sign of a sum of the form $\sum_{i=1}^N a_i h_i(x)$, where $a_i \geq 0$ and $\sum_i a_i = 1$. This is the “convex hull” of the set of weak learners $H$. If $H$ is infinite (in our proof it will be finite, but we’ll state a generalization afterward), then only finitely many of the $a_i$ in the sum may be nonzero.

To prove the theorem, we’ll start by defining a class of functions corresponding to “unweighted majority votes with duplicates:”

Definition: Let $C_N$ be the set of functions $f(x)$ of the form $\frac{1}{N} \sum_{i=1}^N h_i(x)$ where $h_i \in H$ and the $h_i$ may contain duplicates (some of the $h_i$ may be equal to some other of the $h_j$).

Now every majority vote function $f$ can be written as a weighted sum of $h_i$ with weights $a_i$ (I’m using $a$ instead of $\alpha$ to distinguish arbitrary weights from those weights arising from Boosting). So any such $f(x)$ defines a natural distribution over $H$ where you draw function $h_i$ with probability $a_i$. I’ll call this distribution $A_f$. If we draw from this distribution $N$ times and take an unweighted sum, we’ll get a function $g(x) \in C_N$. Call the random process (distribution) generating functions in this way $Q_f$. In diagram form, the logic goes

$f \to$ weights $a_i \to$ distribution over $H \to$ function in $C_N$ by drawing $N$ times according to $H$.

The main fact about the relationship between $f$ and $Q_f$ is that each is completely determined by the other. Obviously $Q_f$ is determined by $f$ because we defined it that way, but $f$ is also completely determined by $Q_f$ as follows:

$\displaystyle f(x) = \mathbb{E}_{g \sim Q_f}[g(x)]$

Proving the equality is an exercise for the reader.

Proof of Theorem. First we’ll split the probability $\Pr_D[y f(x) \leq 0]$ into two pieces, and then bound each piece.

First a probability reminder. If we have two events $A$ and $B$ (in what’s below, this will be $yg(x) \leq \theta/2$ and $yf(x) \leq 0$, we can split up $\Pr[A]$ into $\Pr[A \textup{ and } B] + \Pr[A \textup{ and } \overline{B}]$ (where $\overline{B}$ is the opposite of $B$). This is called the law of total probability. Moreover, because $\Pr[A \textup{ and } B] = \Pr[A | B] \Pr[B]$ and because these quantities are all at most 1, it’s true that $\Pr[A \textup{ and } B] \leq \Pr[A \mid B]$ (the conditional probability) and that $\Pr[A \textup{ and } B] \leq \Pr[B]$.

Back to the proof. Notice that for any $g(x) \in C_N$ and any $\theta > 0$, we can write $\Pr_D[y f(x) \leq 0]$ as a sum:

$\displaystyle \Pr_D[y f(x) \leq 0] =\\ \Pr_D[yg(x) \leq \theta/2 \textup{ and } y f(x) \leq 0] + \Pr_D[yg(x) > \theta/2 \textup{ and } y f(x) \leq 0]$

Now I’ll loosen the first term by removing the second event (that only makes the whole probability bigger) and loosen the second term by relaxing it to a conditional:

$\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_D[y g(x) \leq \theta / 2] + \Pr_D[yg(x) > \theta/2 \mid yf(x) \leq 0]$

Now because the inequality is true for every $g(x) \in C_N$, it’s also true if we take an expectation of the RHS over any distribution we choose. We’ll choose the distribution $Q_f$ to get

$\displaystyle \Pr_D[yf(x) \leq 0] \leq T_1 + T_2$

And $T_1$ (term 1) is

$\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

And $T_2$ is

$\displaystyle \Pr_{x \sim D, g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0] = \mathbb{E}_D[\Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]]$

We can rewrite the probabilities using expectations because (1) the variables being drawn in the distributions are independent, and (2) the probability of an event is the expectation of the indicator function of the event.

Now we’ll bound the terms $T_1, T_2$ separately. We’ll start with $T_2$.

Fix $(x,y)$ and look at the quantity inside the expectation of $T_2$.

$\displaystyle \Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]$

This should intuitively be very small for the following reason. We’re sampling $g$ according to a distribution whose expectation is $f$, and we know that $yf(x) \leq 0$. Of course $yg(x)$ is unlikely to be large.

Mathematically we can prove this by transforming the thing inside the probability to a form suitable for the Chernoff bound. Saying $yg(x) > \theta / 2$ is the same as saying $|yg(x) - \mathbb{E}[yg(x)]| > \theta /2$, i.e. that some random variable which is a sum of independent random variables (the $h_i$) deviates from its expectation by at least $\theta/2$. Since the $y$‘s are all $\pm 1$ and constant inside the expectation, they can be removed from the absolute value to get

$\displaystyle \leq \Pr_{g \sim Q_f}[g(x) - \mathbb{E}[g(x)] > \theta/2]$

The Chernoff bound allows us to bound this by an exponential in the number of random variables in the sum, i.e. $N$. It turns out the bound is $e^{-N \theta^2 / 8}$.

Now recall $T_1$

$\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

For $T_1$, we don’t want to bound it absolutely like we did for $T_2$, because there is nothing stopping the classifier $f$ from being a bad classifier and having lots of error. Rather, we want to bound it in terms of the probability that $yf(x) \leq \theta$. We’ll do this in two steps. In step 1, we’ll go from $\Pr_D$ of the $g$‘s to $\Pr_S$ of the $g$‘s.

Step 1: For any fixed $g, \theta$, if we take a sample $S$ of size $m$, then consider the event in which the sample probability deviates from the true distribution by some value $\varepsilon_N$, i.e. the event

$\displaystyle \Pr_D[yg(x) \leq \theta /2] > \Pr_{S, x \sim S}[yg(x) \leq \theta/2] + \varepsilon_N$

The claim is this happens with probability at most $e^{-2m\varepsilon_N^2}$. This is again the Chernoff bound in disguise, because the expected value of $\Pr_S$ is $\Pr_D$, and the probability over $S$ is an average of random variables (it’s a slightly different form of the Chernoff bound; see this post for more). From now on we’ll drop the $x \sim S$ when writing $\Pr_S$.

The bound above holds true for any fixed $g,\theta$, but we want a bound over all $g$ and $\theta$. To do that we use the union bound. Note that there are only $(N+1)$ possible choices for a nonnegative $\theta$ because $g(x)$ is a sum of $N$ values each of which is either $\pm1$. And there are only $|C_N| \leq |H|^N$ possibilities for $g(x)$. So the union bound says the above event will occur with probability at most $(N+1)|H|^N e^{-2m\varepsilon_N^2}$.

If we want the event to occur with probability at most $\delta_N$, we can judiciously pick

$\displaystyle \varepsilon_N = \sqrt{(1/2m) \log ((N+1)|H|^N / \delta_N)}$

And since the bound holds in general, we can take expectation with respect to $Q_f$ and nothing changes. This means that for any $\delta_N$, our chosen $\varepsilon_N$ ensures that the following is true with probability at least $1-\delta_N$:

$\displaystyle \Pr_{D, g \sim Q_f}[yg(x) \leq \theta/2] \leq \Pr_{S, g \sim Q_f}[yg(x) \leq \theta/2] + \varepsilon_N$

Now for step 2, we bound the probability that $yg(x) \leq \theta/2$ on a sample to the probability that $yf(x) \leq \theta$ on a sample.

Step 2: The first claim is that

$\displaystyle \Pr_{S, g \sim Q_f}[yg(x) \leq \theta / 2] \leq \Pr_{S} [yf(x) \leq \theta] + \mathbb{E}_{S}[\Pr_{g \sim Q_f}[yg(x) \leq \theta/2 \mid yf(x) \geq \theta]]$

What we did was break up the LHS into two “and”s, when $yf(x) > \theta$ and $yf(x) \leq \theta$ (this was still an equality). Then we loosened the first term to $\Pr_{S}[yf(x) \leq \theta]$ since that is only more likely than both $yg(x) \leq \theta/2$ and $yf(x) \leq \theta$. Then we loosened the second term again using the fact that a probability of an “and” is bounded by the conditional probability.

Now we have the probability of $yg(x) \leq \theta / 2$ bounded by the probability that $yf(x) \leq 0$ plus some stuff. We just need to bound the “plus some stuff” absolutely and then we’ll be done. The argument is the same as our previous use of the Chernoff bound: we assume $yf(x) \geq \theta$, and yet $yg(x) \leq \theta / 2$. So the deviation of $yg(x)$ from its expectation is large, and the probability that happens is exponentially small in the amount of deviation. The bound you get is

$\displaystyle \Pr_{g \sim Q}[yg(x) \leq \theta/2 \mid yf(x) > \theta] \leq e^{-N\theta^2 / 8}.$

And again we use the union bound to ensure the failure of this bound for any $N$ will be very small. Specifically, if we want the total failure probability to be at most $\delta$, then we need to pick some $\delta_j$‘s so that $\delta = \sum_{j=0}^{\infty} \delta_j$. Choosing $\delta_N = \frac{\delta}{N(N+1)}$ works.

Putting everything together, we get that with probability at least $1-\delta$ for every $\theta$ and every $N$, this bound on the failure probability of $f(x)$:

$\displaystyle \Pr_{x \sim D}[yf(x) \leq 0] \leq \Pr_{S, x \sim S}[yf(x) \leq \theta] + 2e^{-N \theta^2 / 8} + \sqrt{\frac{1}{2m} \log \left ( \frac{N(N+1)^2 |H|^N}{\delta} \right )}.$

This claim is true for every $N$, so we can pick $N$ that minimizes it. Doing a little bit of behind-the-scenes calculus that is left as an exercise to the reader, a tight choice of $N$ is $(4/ \theta)^2 \log(m/ \log |H|)$. And this gives the statement of the theorem.

$\square$

We proved this for finite hypothesis classes, and if you know what VC-dimension is, you’ll know that it’s a central tool for reasoning about the complexity of infinite hypothesis classes. An analogous theorem can be proved in terms of the VC dimension. In that case, calling $d$ the VC-dimension of the weak learner’s output hypothesis class, the bound is

$\displaystyle \Pr_D[yf(x) \leq 0] \leq \Pr_S[yf(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)} \right )$

How can we interpret these bounds with so many parameters floating around? That’s where asymptotic notation comes in handy. If we fix $\theta \leq 1/2$ and $\delta = 0.01$, then the big-O part of the theorem simplifies to $\sqrt{(\log |H| \cdot \log m) / m}$, which is easier to think about since $(\log m)/m$ goes to zero very fast.

Now the theorem we just proved was about any weighted majority function. The question still remains: why is AdaBoost good? That follows from another theorem, which we’ll state and leave as an exercise (it essentially follows by unwrapping the definition of the AdaBoost algorithm from last time).

Theorem: Suppose that during AdaBoost the weak learners produce hypotheses with training errors $\varepsilon_1, \dots , \varepsilon_T$. Then for any $\theta$,

$\displaystyle \Pr_{(x,y) \sim S} [yf(x) \leq \theta] \leq 2^T \prod_{t=1}^T \sqrt{\varepsilon_t^{(1-\theta)} (1-\varepsilon_t)^{(1+\theta)}}$

Let’s interpret this for some concrete numbers. Say that $\theta = 0$ and $\varepsilon_t$ is any fixed value less than $1/2$. In this case the term inside product becomes $\sqrt{\varepsilon (1-\varepsilon)} < 1/2$ and the whole bound tends exponentially quickly to zero in the number of rounds $T$. On the other hand, if we raise $\theta$ to about 1/3, then in order to maintain the LHS tending to zero we would need $\varepsilon < \frac{1}{4} ( 3 - \sqrt{5} )$ which is about 20% error.

If you’re interested in learning more about Boosting, there is an excellent book by Freund and Schapire (the inventors of boosting) called Boosting: Foundations and Algorithms. There they include a tighter analysis based on the idea of Rademacher complexity. The bound I presented in this post is nice because the proof doesn’t require any machinery past basic probability, but if you want to reach the cutting edge of knowledge about boosting you need to invest in the technical stuff.

Until next time!

# What does it mean for an algorithm to be fair?

In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.

Warning: algorithms can facilitate illegal discrimination!

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.

There are many facets to the algorithm fairness problem one that has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem that stupid people aren’t listening to smart people, it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” have fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results for deep learning. This means they spent a ton of time tinkering the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability is the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

# Bandits and Stocks

So far in this series we’ve seen two nontrivial algorithms for bandit learning in two different settings. The first was the UCB1 algorithm, which operated under the assumption that the rewards for the trials were independent and stochastic. That is, each slot machine was essentially a biased coin flip, and the algorithm was trying to find the machine with the best odds. The second was the Exp3 algorithm, which held the belief that the payoffs were arbitrary. In particular this includes the possibility that an adversary is setting the payoffs against you, and so we measured the success of an algorithm in terms of how it fares against the best single action (just as we did with UCB1, but with Exp3 it’s a nontrivial decision).

Before we move on to other bandit settings it’s natural to try to experiment with the ones we have on real world data. On one hand it’s interesting to see how they fare outside academia. And more relevantly to the design of the future bandit algorithms we’ll see on this blog, we need to know what worldly problems actually provide in terms of inputs to our learning algorithm in each round.

But another interesting issue goes like this. In the real world we can’t ever really know whether the rewards of the actions are stochastic or adversarial. Many people believe that adversarial settings are far too pathological to be realistic, while others claim that the assumptions made by stochastic models are too strict. To weigh in on this dispute, we’ll dip into a bit of experimental science and see which of the two algorithms performs better on the problem of stock trading. The result is then evidence that stocks behave stochastically or adversarially. But we don’t want to stir up too many flames, so we can always back up behind the veil of applied mathematics (“this model is too simple anyway”).

Indeed the model we use in this post is rather simplistic. I don’t know as much as I should (or as my father would have me know) about stock markets. In fact, I’m more partial to not trade stocks on principle. But I must admit that average-quality stock data is easy to come by, and the basic notions of market interactions lend themselves naturally to many machine learning problems. If the reader has any ideas about how to strengthen the model, I welcome suggestions in the comments (or a fork on github).

A fair warning to the reader, we do not solve the problem of trading stocks by any means. We use a model that’s almost entirely unrealistic, and the results aren’t even that good. I’m quite nervous to publish this at all, just because above all else it reveals my gaping ignorance on how stock markets work. But this author believes in revealing ignorance as learning, if for nothing else than that it provides extremely valuable insight into the nature of a problem and an appreciation of its complexity. So criticize away, dear readers.

As usual, all of the code and data we use in this post is available on this blog’s Github page. Our language of choice for this post is Python.

This little trader got lucky. Could it be because he’s got TEN MONITORS?!

## Stocks for Dummies (me)

A quick primer on stocks, which is only as detailed as it needs to be for this post: a stock is essentially the sum of the value of all the assets of a company. A publicly traded company divides their stock into a number of “shares,” and owning a share represents partial ownership of the company. If you own 50% of the shares, you own 50% of the company. Companies sell shares or give them to employees as benefits (or options), and use the money gained through their sale for whatever they see fit. The increase in the price of a stock generally signifies the company is successful and growing; for example, stocks generally rise when a hotly anticipated product is announced.

The stock of a company is traded through one of a number of markets called stock exchanges. The buying and selling interactions are recorded and public, and there are many people in the world who monitor the interactions as they happen (via television, or programmatically) in the hopes of noticing opportunities before others and capitalizing on them. Each interaction induces a change on the price of a share of stock: whenever a share is bought at a certain price, that is the established and recorded price of a share (up to some fudging by brokers which is entirely mysterious to me). In any case, the prices go up and down, and they’re often bundled into “bars” which summarize the data over a certain period of time. The bars we use in this post are daily, and consist of four numbers: the open, the price at the beginning of the day, the high and low, which are self-explanatory, and the close, which is the price at the end of the day.

## Bandits and Daily Stock Trading

Now let’s simplify things as much as possible. Our bandit learning algorithm will interact with the market as follows: each day it chooses whether or not to buy a single dollar’s worth of a stock, and at the end of the day it sells the stock and observes the profit. There are no brokers involved, and the price the algorithm sees is the price it gets. In bandit language: the stocks represent actions, and the amount of profit at the end of a day constitutes the payoff of an action in one round. Since small-scale stock price movement is generally very poorly understood, it makes some level of sense to assume the price movements within a given day are adversarial. On the other hand, since we understand them so poorly, we might be tempted to just call them “random” fluctuations, i.e. stochastic. So this is a nice little testbed for seeing which assumption yields a more successful algorithm.

Unlike the traditional image of stock trading where an individual owns shares of a stock over a long period of time, our program will operate on a daily time scale, and hence cannot experience the typical kinds of growth. Nevertheless, we can try to make some money over time, and if it’s a good strategy, we could scale up the single dollar to whatever we’re willing to risk. Specifically, the code we used to compute the payoff is

def payoff(stockTable, t, stock, amountToInvest=1.0):
openPrice, closePrice = stockTable[stock][t]

sharesBought = amountToInvest / openPrice
amountAfterSale = sharesBought * closePrice

return amountAfterSale - amountToInvest


The remainder of the code is interfacing with the Exp3 and UCB1 functions we gave in previous posts, and data shuffling. We got our data from Google Finance, and we provide it, along with all of the code used in the making of this post, on this blog’s Github page. Before we run our experiments, let’s give a few reasons why this model is unrealistic.

1. We assume we can buy/sell fractional shares of a stock, which to my knowledge is not possible. Though this experiment could be redone where you buy a single share of a stock, or with mutual funds/currency exchange/whatever replacing stocks, we didn’t do it this way.
2. Brokerage fees can drastically change the success of an algorithm which trades frequently.
3. Open and close prices are not typical prices. People will often make decisions based on the time of day, but then again we might expect this to be just another reason that Exp3 would perform better than UCB1.
4. We’re not actually trading in the stock market, and so we’re ignoring the effects of our own algorithm on the prices in the market.
5. It’s impossible to guarantee you get to use the opening price and closing price in your transactions.
6. UCB1 and Exp3 don’t use all of the information available. Indeed, they assume that they would not be able to see the outcome of an action they did not take, but with stocks you can get a good estimate of how much money you would have made had you chosen a different stock.
7. Each trial in a bandit learning problem is identical from the learner’s perspective, but one often keeps a stock around while making other decisions.

We’ll come back to #6 after seeing the raw experiments for an unaltered UCB1 and Exp3, because there is a natural extension of the algorithm to handle additional information. I’m sure there are other glaring issues with the experimental setup, and the reader should feel free to rant about it in the comments. It won’t stop me from running the algorithm and seeing what happens just for fun.

## Data Sets

We ran the experiment on two sets of stocks. The first set consisted of nine random stocks, taken from the random stocks twitter feed, with 5 years of past data. The stocks are:

lxrx, keg, cuba, tdi, brks, mux, cadx, belfb, htr

And you can view more information about these particular stocks via Google Finance. The second set was a non-random choice of nine Fortune 500 companies with 10 years of past data. The stocks were

amzn, cost, jpm, gs, wfc, msft, tgt, aapl, wmt

And again more information about these stocks is available via Google Finance. For the record, here were the cumulative payoffs of each of the nine Fortune 500 companies:

The cumulative rewards for the nine Fortune 500 companies over the last ten years of data.

Interestingly, the company which started off with the best prospects (Apple), turned out to have the worst cumulative reward by the end. The long-term winners in our little imaginary world happen to be Amazon, Costco, and Goldman Sachs. Perhaps this gives credence to the assumption that payoffs are adversarial. A learner can easily get tricked into putting too much faith in one action early on.

And for the random stocks:

The cumulative payoff for the nine randomly chosen stocks for the last five years of data.

The random stocks clearly perform worse and more variably overall (although HTR surpasses most of the Fortune 500 companies, despite its otherwise relatively modest stock growth over the last five years). To my untrained eyes these movements look more like a stochastic model than an adversarial one.

## Experiments

Here is a typical example of a run of Exp3 on the Fortune 500 data set (using $\gamma = 0.33$, recall $\gamma$ measures the amount of uniform exploration performed):

(Expected payoff, variance) over 1000 trials is (1.122463919564572, 0.5518037498918705)
For a single run:
Payoff was 1.12
Regret was 2.91
Best stock was amzn at 4.02
weights: '0.00, 0.00, 0.00, 0.46, 0.52, 0.00, 0.00, 0.00, 0.01'

And one for UCB1:

(Expected payoff, variance) over 1000 trials is (1.1529891576139333, 0.5012825847001482)
For a single run:
Payoff was 1.73
Regret was 2.29
Best stock was amzn at 4.02
ucbs: '0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234'

The results are quite curious. Indeed, the expected payoff seems to be a whopping 110% return! The variance of these results is quite high, and so it’s not at all impossible that the algorithm could have a negative return. But just as often it would return around 200% profit.

Before we go risking all our money on this strategy, let’s take a closer look at what’s happening in the algorithm. It appears that for UCB1 the upper confidence bounds assigned to each action are the same! In other words, even after ten years of trials, no single stock “shined” above the others in the eyes of UCB1. It may seem that Exp3 has a leg up on UCB1 in this respect, because it’s clear that it gives higher weights to some stocks over others. However, running the algorithm multiple times shows drastically different weight distributions, and if we average the resulting weights over a thousand rounds, we see that they all have roughly the same mean and variance (the mean being first in the pair):

weight stats for msft: (0.107, 0.025)
weight stats for jpm: (0.109, 0.027)
weight stats for tgt: (0.110, 0.029)
weight stats for gs: (0.112, 0.025)
weight stats for wmt: (0.110, 0.027)
weight stats for aapl: (0.111, 0.027)
weight stats for amzn: (0.120, 0.029)
weight stats for cost: (0.113, 0.026)
weight stats for wfc: (0.107, 0.023)
Indeed, the best stock, Amazon, had an average weight just barely larger (and more variable) than any of the other stocks. So this evidence points to the conclusion that neither EXP3 nor UCB1 has any clue which stock is better. Pairing this with the fact that both algorithms nevertheless perform well would suggest that a random choice of action at each step is equally likely to do well. Indeed, when we run with a “random bandit” that just chooses actions uniformly at random, we get the following results:
(Expected payoff, variance) over 1000 trials is (1.1094227056931132, 0.4403783017367529)
For a single run:
Payoff was 3.13
Regret was 0.90
Best stock was amzn at 4.02

It’s not quite as good as either Exp3 or UCB1, but it’s close and less variable, which means a lot to an investor. In other words, it’s starting to look like Exp3 and UCB1 aren’t doing significantly better than random at all, and that a monkey would do well in this system (for these particular stocks).

Of course, Fortune 500 companies are pretty successful by definition, so let’s turn our attention to the random stocks:

For the random bandit learner:

(Expected payoff, variance) over 1000 trials is (-0.23952295977625776, 1.0787311145181104)
For a single run:
Payoff was -2.01
Regret was 3.92
Best stock was htr at 1.91

For UCB1:

(Expected payoff, variance) over 1000 trials is (-0.3503593899029112, 1.1136234992964154)
For a single run:
Payoff was 0.26
Regret was 1.65
Best stock was htr at 1.91
ucbs: '0.315, 0.315, 0.315, 0.316, 0.315, 0.315, 0.315, 0.315, 0.316'

And for Exp3:

(Expected payoff, variance) over 1000 trials is (-0.25827976810345593, 1.2946101887058519)
For a single run:
Payoff was -0.34
Regret was 2.25
Best stock was htr at 1.91
weights: '0.08, 0.00, 0.14, 0.06, 0.48, 0.00, 0.00, 0.04, 0.19'

But again Exp3 has no idea what stocks are actually best, with the average, variance over 1000 trials being:

weight stats for lxrx: '0.11, 0.02'
weight stats for keg: '0.11, 0.02'
weight stats for htr: '0.12, 0.02'
weight stats for cadx: '0.10, 0.02'
weight stats for belfb: '0.11, 0.02'
weight stats for tdi: '0.11, 0.02'
weight stats for cuba: '0.11, 0.02'
weight stats for mux: '0.11, 0.02'
weight stats for brks: '0.11, 0.02'

The long and short of it is that the choice of Fortune 500 stocks was inherently so biased toward success than a monkey could have made money investing in them, while the average choice of stocks had, if anything, a bias toward loss. And unfortunately using an algorithm like UCB1 or Exp3 straight out of the box doesn’t produce anything better than a monkey.

## Issues and Improvements

There are two glaring theoretical issues here that we haven’t yet addressed. One of these goes back to issue #5 in that list we gave at the beginning of the post: the bandit algorithms are assuming they have less information than they actually have! Indeed, at the end of a day of stock trading, you have a good idea what would have happened to you had you bought a different stock, and in our simplified world you can know exactly what your profit would have been. Recalling that UCB1 and Exp3 both maintained some numbers representing the strength of an action (Exp3 had a “weight” and UCB1 an upper confidence bound), the natural extension to both UCB1 and Exp3 is simply to modify the beliefs about all actions after any given round. This is a pretty simple improvement to make in our implementation, since it just changes a single weight update to a loop. For Exp3:

for choice in range(numActions):
rewardForUpdate = reward(choice, t)
scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
estimatedReward = 1.0 * scaledReward / probabilityDistribution[choice]
weights[choice] *= math.exp(estimatedReward * gamma / numActions)


With a similar loop for UCB1. This code should be familiar from our previous posts on bandits. We then rerun the new algorithms on the same data sets, and the results are somewhat surprising. First, UCB1 on Fortune 500:

(Expected payoff, variance) over 1000 trials is (3.530670654982728, 0.007713190816014095)
For a single run:
Payoff was 3.56
Regret was 0.47

This is clearly outperforming the random bandit learning algorithm, with an average return of 350%! In fact, it does almost as well as the best stock, and the variance is quite low. UCB1 also outperforms Exp3, which fares comparably to its pre-improved self. That is, it’s still not much better than random:

(Expected payoff, variance) over 1000 trials is (1.1424797906901956, 0.434335471375294)
For a single run:
Payoff was 1.24
Regret was 2.79

And also for the random stocks, UCB1 with improvements outperforms Exp3 and UCB1 without improvements. UCB1:

(Expected payoff, variance) over 1000 trials is (0.680211923900068, 0.04226672915962647)
For a single run:
Payoff was 0.82
Regret was 1.09

And Exp3:

(Expected payoff, variance) over 1000 trials is (-0.2242152508929378, 1.1312843329929194)
For a single run:
Payoff was -0.16
Regret was 2.07

We might wonder why this is the case, and there is a plausible explanation. See, Exp3 has a difficult life: it has to assume that at any time the adversary can completely change the game. And so Exp3 must remain vigilant, continuing to try options it knows to be terrible for fear that they may spontaneously do well. Exp3 is the grandfather who, after 75 years of not winning the lotto, continues to buy tickets every week. A better analogy might be a lioness who, even after being moved to the zoo, stays up all night to protect a cub from predators. This gives us quite a new perspective on Exp3: the world really has to be that messed up for Exp3 to be useful. As we saw, UCB1 is much more eager to jump on a winning bandwagon, and it paid off in both the good (Fortune 500) and bad (random stock) scenarios. All in all, this experiment would provide some minor evidence that the stock market (or just this cheesy version of it) is more stochastic than adversarial.

The second problem is that we’re treating these stocks as if they were isolated from the rest of the world. Indeed, along with each stock comes some kind of context in the form of information about that stock. Historical prices, corporate announcements, cyclic boom and bust, what the talking heads think, all of this may be relevant to the price fluctuations of a stock on any given day. While Exp3 and UCB1 are ill-equipped to handle such a rich landscape, researchers in bandit learning have recognized the importance of context in decision making. So much so, in fact, that an entire subfield of “Contextual Bandits” was born. John Langford, perhaps the world’s leading expert on bandit learning, wrote on his blog in 2007,

I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.

I tend to agree with him. Bandit problems almost always come with some inherent additional structure in the real world, and the best algorithms will always take advantage of that structure. A “context” associated with each round is perhaps the weakest kind of structure, so it’s a natural place to look for better algorithms.

So that’s what we’ll do in the future of this series. But before then we might decide to come up with another experiment to run Exp3 and UCB1 on. It would be nice to see an instance in which Exp3 seriously outperforms UCB1, but maybe the real world is just stochastic and there’s nothing we can do about it.

Until next time!