# One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

## Statistical parity

Here is one way to define fairness.

Your population is a set $X$ and there is some known subset $S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $X$ is people and $S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $D$ over $X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $D$ takes that into account. Generally we impose no restrictions on $D$, and the definition of fairness will have to work no matter what $D$ is.

Now suppose we have a (possibly randomized) classifier $h:X \to \{-1,1\}$ giving labels to $X$. When given a person $x$ as input $h(x)=1$ if $x$ gets a loan and $-1$ otherwise. The bias, or statistical imparity, of $h$ on $S$ with respect to $X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $S$ is labeled 1 and the probability that a random individual from the complement $S^C$ is labeled 1.

$\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $h:X \to \{-1,1\}$ is said to have statistical parity on $D$ with respect to $S$ up to bias $\varepsilon$ if $|\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):
protectedClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] == protectedValue]
elseClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] != protectedValue]

if len(protectedClass) == 0 or len(elseClass) == 0:
raise Exception("One of the classes is empty!")
else:
protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

return elseProb - protectedProb


Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)


Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":

# [(test name, (index, value))]
tests = [('gender', (1,0)),
('private employment', (2,1)),
('asian race', (33,1)),
('divorced', (12, 1))]

for (name, (index, value)) in tests:
print("'%s' bias in training data: %.4f" %
(name, labelBias(train[0], train[1], index, value)))


(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582


Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
>>> signedBias(train[0], indian, 1, 0)
0.0030631816119030884


So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least \$50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $T$ in place of $S^C$, or looking at discrepancies among $k$ different sub-populations or with $m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $S$. I mean two things by this. First, you could have some historical data you want to train a classifier $h$ on, and usually you’ll be given training labels for the data that tell you whether $h(x)$ should be $1$ or $-1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $S$ necessarily reduces the accuracy of $h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

## Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!

# Optimism in the Face of Uncertainty: the UCB1 Algorithm

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

## Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of $K$ actions labeled by the integers $\left \{ 1, 2, \dots, K \right \}$. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action $i$ be drawn from a fixed distribution $D_i$ (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm $A$ and a set of actions $\left \{1, 2, \dots, K \right \}$, the cumulative regret of $A$ in rounds $1, \dots, T$ is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of $A$ for the first $T$ rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

## Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers $\left \{ 1, \dots, K \right \}$. The reward of action $i$ is a $[0,1]$-valued random variable $X_i$ distributed according to an unknown distribution and possessing an unknown expected value $\mu_i$. The game progresses in rounds $t = 1, 2, \dots$ so that in each round we have different random variables $X_{i,t}$ for the reward of action $i$ in round $t$ (in particular, $X_{i,t}$ and $X_{i,s}$ are identically distributed). The $X_{i,t}$ are independent as both $t$ and $i$ vary, although when $i$ varies the distribution changes.

So if we were to play action 2 over and over for $T$ rounds, then the total payoff would be the random variable $G_2(T) = \sum_{t=1}^T X_{2,t}$. But by independence across rounds and the linearity of expectation, the expected payoff is just $\mu_2 T$. So we can describe the best action as the action with the highest expected payoff. Define

$\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i$

We call the action which achieves the maximum $i^*$.

A policy is a randomized algorithm $A$ which picks an action in each round based on the history of chosen actions and observed rewards so far. Define $I_t$ to be the action played by $A$ in round $t$ and $P_i(n)$ to be the number of times we’ve played action $i$ in rounds $1 \leq t \leq n$. These are both random variables. Then the cumulative payoff for the algorithm $A$ over the first $T$ rounds, denoted $G_A(T)$, is just

$\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}$

and its expected value is simply

$\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T))$.

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy $A$ after the first $T$ steps, denoted $R_A(T)$ can be written as

$\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)$

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

$\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T))$.

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than $\Omega(\sqrt{KT})$. It is also known that there are algorithms that guarantee no worse than $O(\sqrt{KT})$ expected regret. The algorithm we’ll see in the next section, however, only guarantees $O(\sqrt{KT \log T})$. We present it on this blog because of its simplicity and ubiquity in the field.

## The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action $\mu_i$ is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose $Y_1, \dots, Y_n$ are independent random variables whose values lie in $[0,1]$ and whose expected values are $\mu_i$. Call $Y = \frac{1}{n}\sum_{i}Y_i$ and $\mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i$. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of $Y$ deviates from its mean. Specifically,

$\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}$

For us, the $Y_i$ will be the payoff variables for a single action $j$ in the rounds for which we choose action $j$. Then the variable $Y$ is just the empirical average payoff for action $j$ over all the times we’ve tried it. Moreover, $a$ is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for $a$ to find an upper bound big enough to be confident that we’re within $a$ of the true mean.

Indeed, if we call $n_j$ the number of times we played action $j$ thus far, then $n = n_j$ in the equation above, and using $a = a(j,T) = \sqrt{2 \log(T) / n_j}$ we get that $\textup{P}(Y > \mu + a) \leq T^{-4}$, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action $j$, its upper bound $a$ grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound $a(j, T)$ is getting bigger for actions we haven’t played; any round in which we play an action $j$, it must be that $a(j, T+1) = a(j,T)$, although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the $K$ actions once, giving initial values for empirical mean payoffs $\overline{x}_i$ of each action $i$.
For each round $t = K, K+1, \dots$:

Let $n_j$ represent the number of times action $j$ was played so far.
Play the action $j$ maximizing $\overline{x}_j + \sqrt{2 \log t / n_j}$.
Observe the reward $X_{j,t}$ and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means $x_j$ change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with $K$ actions, each of whose reward distribution $X_{i,t}$ has values in [0,1]. Then its expected cumulative regret after $T$ rounds is at most $O(\sqrt{KT \log T})$.

Actually, we’ll prove a more specific theorem. Let $\Delta_i$ be the difference $\mu^* - \mu_i$, where $\mu^*$ is the expected payoff of the best action, and let $\Delta$ be the minimal nonzero $\Delta_i$. That is, $\Delta_i$ represents how suboptimal an action is and $\Delta$ is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret $\mathbb{E}(R_{\textup{UCB1}}(T))$ is at most

$\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )$

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if $\Delta_i$ is small we will require more tries to know that action $i$ is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the $1 + \pi^2 / 3$ part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the $O(\sqrt{KT \log T})$ bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function $R(\Delta_1, \dots, \Delta_K)$ (excluding of course $\Delta^*$), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

$\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$

and it’s zero if and only if for each $i$, $\Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T})$. However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the $\Delta_i = O(\sqrt{\log T})$ (which are all the same) gives a total bound of $O(K \sqrt{\log T})$. If we look at the only possible endpoint (the $\Delta_i = 1$), then we get a local maximum of $O(K \sqrt{\log T})$. But this isn’t the $O(\sqrt{KT \log T})$ we promised, what gives? Well, this upper bound grows arbitrarily large as the $\Delta_i$ go to zero. But at the same time, if all the $\Delta_i$ are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the $\Delta_i = \Delta$ are the same, then another trivial regret bound is $\Delta T$ (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at $\Delta = O(\sqrt{(K \log T) / T})$. Hence the regret bound at the switch is $\Delta T = O(\sqrt{KT \log T})$, as desired.

## Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on $P_i(T)$, the expected number of times UCB chooses an action up to round $T$. Using the $\Delta$ notation, the regret is then just $\sum_i \Delta_i \mathbb{E}(P_i(T))$, and bounding the $P_i$‘s will bound the regret.

Recall the notation for our upper bound $a(j, T) = \sqrt{2 \log T / P_j(T)}$ and let’s loosen it a bit to $a(y, T) = \sqrt{2 \log T / y}$ so that we’re allowed to “pretend” a action has been played $y$ times. Recall further that the random variable $I_t$ has as its value the index of the machine chosen. We denote by $\chi(E)$ the indicator random variable for the event $E$. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., $\overline{x}^*$ is the empirical mean of the optimal action).

Indeed for any action $i$, the only way we know how to write down $P_i(T)$ is as

$\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)$

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action $i$. Now we’re just going to pull some number $m-1$ of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than $m$ times overall, this requires an inequality.

$P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)$

These indicator functions should be read as sentences: we’re just saying that we’re picking action $i$ in round $t$ and we’ve already played $i$ at least $m$ times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action $i$ in round $t$ means that the upper bound for action $i$ exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and $i$ might coincide with the best action, but that’s fine). In notation this event is

$\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)$

Denote the upper bound $\overline{x}_i + a(i,t)$ for action $i$ in round $t$ by $U_i(t)$. Since this event must occur every time we pick action $i$ (though not necessarily vice versa), we have

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)$

We’ll do this process again but with a slightly more complicated event. If the upper bound of action $i$ exceeds that of the optimal machine, it is also the case that the maximum upper bound for action $i$ we’ve seen after the first $m$ trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round $t$ we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine $i$ (except that it’s more than $m$). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by $\overline{x}_{i,s}$ the random variable for the empirical mean after playing action $i$ a total of $s$ times, and $\overline{x}^*_s$ the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )$

Indeed, at each $t$ for which the max is greater than the min, there will be at least one pair $s,s'$ for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs $s, s'$ for which it happens. That is, we can expand the event above into the double sum which is at least as large:

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )$

We can make one other odd inequality by increasing the sum to go from $t=1$ to $\infty$. This will become clear later, but it means we can replace $t-1$ with $t$ and thus have

$\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )$

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that $\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)$. Then what can we say? Well, consider the following three events:

(1) $\displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)$
(2) $\displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)$
(3) $\displaystyle \mu^* < \mu_i + 2a(s, t)$

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability $t^{-4}$. Likewise, (2) is the event that the empirical mean payoff of action $i$ is larger than the upper confidence bound, which also occurs with probability $t^{-4}$. We will see momentarily that (3) is impossible for a well-chosen $m$ (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

$\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}$

and

$\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}$

But putting these two inequalities together gives us precisely that (3) is true:

$\mu^* < \mu_i + 2a(s,t)$

This proves the claim.

By the union bound, the probability that at least one of these events happens is $2t^{-4}$ plus whatever the probability of (3) being true is. But as we said, we’ll pick $m$ to make (3) always false. Indeed $m$ depends on which action $i$ is being played, and if $s \geq m > 8 \log T / \Delta_i^2$ then $2a(s,t) \leq \Delta_i$, and by the definition of $\Delta_i$ we have

$\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0$.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

$\square$

## Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
return math.sqrt(2 * math.log(step + 1) / numPlays)


And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
payoffSums = [0] * numActions
numPlays = [1] * numActions
ucbs = [0] * numActions

# initialize empirical sums
for t in range(numActions):
payoffSums[t] = reward(t,t)
yield t, payoffSums[t], ucbs

t = numActions

while True:
ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
action = max(range(numActions), key=lambda i: ucbs[i])
theReward = reward(action, t)
numPlays[action] += 1
payoffSums[action] += theReward

yield action, theReward, ucbs
t = t + 1


The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from $[0,1]$, biased by $1/k$ for some $5 \leq k \leq 15$. The regret and theoretical regret bound are given in the graph below.

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

## Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

# Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

## The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $X, Y$. Here, if we want to be completely formal, $X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $X$, to be the independent variable and the other, say $Y$, to be the dependent variable. All that means in is that we are assuming there is a relationship between $X$ and $Y$, and that we intend to use the value of $X$ to predict the value of $Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $X$ could be age, and $Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $X$ corresponds to a proportional change in $Y$, and so one could hope there exist constants $a,b$ so that (as random variables) $Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $X$ and $Y$, and try to estimate the value of $a$ and $b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $\hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $(x,y)$ where $x$ is a value of $X$ and $y$ of $Y$, then the error of our predictor on this example is $|y - \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $y$ value to our prediction for the same $x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $|x| = \sqrt(x^2)$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as $(y - \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $(x_i, y_i)_{i=1}^n$ and a potential prediction line $\hat{f}(x) = ax + b$, we define the error of $\hat{f}$ on the examples to be

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - \hat{f}(x_i))^2$

Which can also be written as

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - ax_i - b)^2$

Note that since we’re fixing our data sample, the function $S$ is purely a function of the variables $a,b$. Now we want to minimize this quantity with respect to $a,b$, so we can take a gradient,

$\displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i - ax_i - b) x_i$

$\displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i - b)$

and set them simultaneously equal to zero. In the first we solve for $b$:

$\displaystyle 0 = -2 \sum_{i=1}^n y_i - ax_i - b = -2 \left (-nb + \sum_{i=1}^n y_i - ax_i \right )$

$\displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i - ax_i$

If we denote by $x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $b = y_{\textup{avg}} - ax_{\textup{avg}}$. Substituting $b$ into the other equation we get

$\displaystyle -2 \sum_{i=1}^n (y_ix_i - ax_i^2 - y_{\textup{avg}}x_i - ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $a$, further simplifies to

$\displaystyle 0 = \sum_{i=1}^n y_ix_i - y_{\textup{avg}}x_i - a \sum_{i=1}^n (x_i^2 - x_{\textup{avg}}x_i)$

And so

$\displaystyle a = \frac{\sum_{i=1}^n (y_i - y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i - x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
xAvg, yAvg = map(avg, zip(*points))

aNum = 0
for (x,y) in points:
aNum += (y - yAvg) * x
aDenom += (x - xAvg) * x

b = yAvg - a * xAvg
return (a, b), lambda x: a*x + b


and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

## Many Variables and Matrix Form

If we take those two variables $x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $\mathbf{w} = (a,b)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$\displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $\mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $\mathbf{x} = (x_0, x_1)$. The output is still $y$. Here we understand implicitly that the indices line up: if $w_0$ is the constant term, then that makes $x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $x_1$ is the linear term with coefficient $w_1$. Now we can just write the prediction function as

$\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $Y$ depending on the $n$ random variables $X_1, \dots, X_n$, then our data will come in tuples of the form $(\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $x_0$ is fixed to 1. Then expanding our line $\mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $\mathbf{x}_j$ as rows of a matrix $X$ and the output values $y_j$ as entries of a vector $\mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$y - Xw$

Indeed, each entry of the vector $Xw$ is a dot product of a row of $X$ (an input data point) with the coefficients of the line $w$. It’s just $\hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$\displaystyle S(w) = \| y - Xw \|^2$

This is just taking a dot product of $y-Xw$ with itself. This form is awkward to differentiate because the variable $w$ is nested in the norm. Luckily, we can get the same result by viewing $y - Xw$ as a 1-by-$n$ matrix, transposing it, and multiplying by $y-Xw$.

$\displaystyle S(w) = (y - Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $S$ with respect to each of the $w_i$ variables in $w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$\displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $x$, the partial $\frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for $w$ gives $w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
# input points are n+1 tuples of n inputs and 1 output
X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
y = array([p[-1] for p in points])

Xt = transpose(X)
theInverse = inv(dot(Xt, X))
w = dot(dot(theInverse, Xt), y)
return w, lambda x: dot(w, x)


Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

## Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $X,Y$. If we assume that $Y$ is perfectly determined by $X$ via some linear equation $Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $x$ example, its corresponding $y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $y_i$ would be $y_i = ax_i + b + \eta_i$, and we say all the $\eta_i$ are drawn independently and uniformly at random from the set $\left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $\eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $H$ the random variable producing the noise (in Greek $H$ is the capital letter for $\eta$), and write $Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $\pm 1$, but is rather chosen to be Gaussian with mean $0$ and some variance $\sigma$. We’d denote this by $\eta_i \sim N(\mu, \sigma)$, and say the $\eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $y$ are given by $y = Xw + H$ where $X$ is the observed data and $H$ is Gaussian noise with mean zero and some (unknown) standard deviation $\sigma$. Then if we call $\hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$\displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $y$ by $Xw + H$,

$\displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term is a fat matrix ($X^{\textup{T}}X$) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

$\textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $w$ is constant (so its expected value is just itself) and $\textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $\hat{w}$ is just $w$. Because this is true for all choices of data $X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of $\hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the $w_i$ as variables depending on $X$, we apply the definition:

$\textup{Var}(\hat{w} | X) = \textup{E}(\| w - \textup{E}(w) \|^2 | X)$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $H$ is just the diagonal matrix $\sigma^2 I_n$, we get that

$\textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $(X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!

# Principal Component Analysis

Problem: Reduce the dimension of a data set, translating each data point into a representation that captures the “most important” features.

Solution: in Python

import numpy

def principalComponents(matrix):
# Columns of matrix correspond to data points, rows to dimensions.

deviationMatrix = (matrix.T - numpy.mean(matrix, axis=1)).T
covarianceMatrix = numpy.cov(deviationMatrix)
eigenvalues, principalComponents = numpy.linalg.eig(covarianceMatrix)

# sort the principal components in decreasing order of corresponding eigenvalue
indexList = numpy.argsort(-eigenvalues)
eigenvalues = eigenvalues[indexList]
principalComponents = principalComponents[:, indexList]

return eigenvalues, principalComponents


Discussion: The problem of reducing the dimension of a dataset in a meaningful way shows up all over modern data analysis. Sophisticated techniques are used to select the most important dimensions, and even more sophisticated techniques are used to reason about what it means for a dimension to be “important.”

One way to solve this problem is to compute the principal components of a datasetFor the method of principal components, “important” is interpreted as the direction of largest variability. Note that these “directions” are vectors which may incorporate a portion of many or all of the “standard” dimensions in question. For instance, the picture below obviously has two different intrinsic dimensions from the standard axes.

The regular reader of this blog may recognize this idea from our post on eigenfaces. Indeed, eigenfaces are simply the principal components of a dataset of face images. We will briefly discuss how the algorithm works here, but leave the why to the post on eigenfaces. The crucial interpretation to make is that finding principal components amounts to a linear transformation of the data (that is, only such operations as rotation, translation, scaling, shear, etc. are allowed) which overlays the black arrows above on the standard axes. In the parlance of linear algebra, we’re re-plotting the data with respect to a convenient orthonormal basis of eigenvectors.

Here we first represent the dataset as a matrix whose columns are the data points, and whose rows represent the different dimensions. For example, if it were financial data then the columns might be the instances of time at which the data was collected, and the rows might represent the prices of the commodities recorded at those times. From here we compute two statistical properties of the dataset: the average datapoint, and the standard deviation of each data point. This is done on line 6 above, where the arithmetic operations are entrywise (a convenient feature of Python’s numpy library).

Next, we compute the covariance matrix for the data points. That is, interpreting each dimension as a random variable and the data points are observations of that random variable, we want to compute how the different dimensions are correlated. One way to estimate this from a sample is to compute the dot products of the deviation vectors and divide by the number of data points. For more details, see this Wikipedia entry.

Now (again, for reasons which we detail in our post on eigenfaces), the eigenvectors of this covariance matrix point in the directions of maximal variance, and the magnitude of the eigenvalues correspond to the magnitude of the variance. Even more, regarding the dimensions a random variables, the correlation between the axes of this new representation are zero! This is part of why this method is so powerful; it represents the data in terms of unrelated features. One downside to this is that the principal component features may have no tractable interpretation in terms of real-life phenomena.

Finally, one common thing to do is only use the first few principal components, where by ‘first’ we mean those whose corresponding eigenvalues are the largest. Then one projects the original data points onto the chosen principal components, thus controlling precisely the dimension the data is reduced to. One important question is: how does one decide how many principal components to use?

Because the principal components with larger eigenvalues correspond to features with more variability, one can compute the total variation accounted for with a given set of principal components. Here, the ‘total variation’ is the sum of the variance of each of the random variables (that is, the trace of the covariance matrix, i.e. the sum of its eigenvalues). Since the eigenvalues correspond to the variation in the chosen principal components, we can naturally compute the accounted variation as a proportion. Specifically, if $\lambda_1, \dots \lambda_k$ are the eigenvalues of the chosen principal components, and $\textup{tr}(A)$ is the trace of the covariance matrix, then the total variation covered by the chosen principal components is simply $(\lambda_1 + \dots + \lambda_k)/\textup{tr}(A)$.

In many cases of high-dimensional data, one can encapsulate more than 90% of the total variation using a small fraction of the principal components. In our post on eigenfaces we used a relatively homogeneous dataset of images; our recognition algorithm performed quite well using only about 20 out of 36,000 principal components. Note that there were also some linear algebra tricks to compute only those principal components which had nonzero eigenvalues. In any case, it is clear that if the data is nice enough, principal component analysis is a vey powerful tool.