One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

Statistical parity

Here is one way to define fairness.

Your population is a set $ X$ and there is some known subset $ S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $ X$ is people and $ S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $ D$ over $ X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $ D$ takes that into account. Generally we impose no restrictions on $ D$, and the definition of fairness will have to work no matter what $ D$ is.

Now suppose we have a (possibly randomized) classifier $ h:X \to \{-1,1\}$ giving labels to $ X$. When given a person $ x$ as input $ h(x)=1$ if $ x$ gets a loan and $ -1$ otherwise. The bias, or statistical imparity, of $ h$ on $ S$ with respect to $ X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $ S$ is labeled 1 and the probability that a random individual from the complement $ S^C$ is labeled 1.

$ \textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] – \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $ D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $ h:X \to \{-1,1\}$ is said to have statistical parity on $ D$ with respect to $ S$ up to bias $ \varepsilon$ if $ |\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $ x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -&gt; float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):   
   protectedClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] == protectedValue]   
   elseClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] != protectedValue]

   if len(protectedClass) == 0 or len(elseClass) == 0:
      raise Exception(&quot;One of the classes is empty!&quot;)
   else:
      protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
      elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

   return elseProb - protectedProb

Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -&gt; float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
   return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)

Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == &quot;__main__&quot;:
   from data import adult
   train, test = adult.load(separatePointsAndLabels=True)

   # [(test name, (index, value))]
   tests = [('gender', (1,0)), 
            ('private employment', (2,1)), 
            ('asian race', (33,1)),
            ('divorced', (12, 1))]

   for (name, (index, value)) in tests:
      print(&quot;'%s' bias in training data: %.4f&quot; %
         (name, labelBias(train[0], train[1], index, value)))

(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582

Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

&gt;&gt;&gt; indian = lambda x: x[47] == 1
&gt;&gt;&gt; len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
&gt;&gt;&gt; signedBias(train[0], indian, 1, 0)
0.0030631816119030884

So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least $50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $ T$ in place of $ S^C$, or looking at discrepancies among $ k$ different sub-populations or with $ m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $ S$. I mean two things by this. First, you could have some historical data you want to train a classifier $ h$ on, and usually you’ll be given training labels for the data that tell you whether $ h(x)$ should be $ 1$ or $ -1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $ S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $ S$ necessarily reduces the accuracy of $ h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $ h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

  1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
  2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!

The Boosting Margin, or Why Boosting Doesn’t Overfit

There’s a well-understood phenomenon in machine learning called overfitting. The idea is best shown by a graph:

overfitting

Let me explain. The vertical axis represents the error of a hypothesis. The horizontal axis represents the complexity of the hypothesis. The blue curve represents the error of a machine learning algorithm’s output on its training data, and the red curve represents the generalization of that hypothesis to the real world. The overfitting phenomenon is marker in the middle of the graph, before which the training error and generalization error both go down, but after which the training error continues to fall while the generalization error rises.

The explanation is a sort of numerical version of Occam’s Razor that says more complex hypotheses can model a fixed data set better and better, but at some point a simpler hypothesis better models the underlying phenomenon that generates the data. To optimize a particular learning algorithm, one wants to set parameters of their model to hit the minimum of the red curve.

This is where things get juicy. Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t overfit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty.

Schapire, Freund, Bartlett, and Lee proposed a theoretical explanation for this based on the notion of a margin, and the goal of this post is to go through the details of their theorem and proof. Remember that the standard AdaBoost algorithm produces a set of weak hypotheses $ h_i(x)$ and a corresponding weight $ \alpha_i \in [-1,1]$ for each round $ i=1, \dots, T$. The classifier at the end is a weighted majority vote of all the weak learners (roughly: weak learners with high error on “hard” data points get less weight).

Definition: The signed confidence of a labeled example $ (x,y)$ is the weighted sum:

$ \displaystyle \textup{conf}(x) = \sum_{i=1}^T \alpha_i h_i(x)$

The margin of $ (x,y)$ is the quantity $ \textup{margin}(x,y) = y \textup{conf}(x)$. The notation implicitly depends on the outputs of the AdaBoost algorithm via “conf.”

We use the product of the label and the confidence for the observation that $ y \cdot \textup{conf}(x) \leq 0$ if and only if the classifier is incorrect. The theorem we’ll prove in this post is

Theorem: With high probability over a random choice of training data, for any $ 0 < \theta < 1$ generalization error of boosting is bounded from above by

$ \displaystyle \Pr_{\textup{train}}[\textup{margin}(x) \leq \theta] + O \left ( \frac{1}{\theta} (\textup{typical error terms}) \right )$

In words, the generalization error of the boosting hypothesis is bounded by the distribution of margins observed on the training data. To state and prove the theorem more generally we have to return to the details of PAC-learning. Here and in the rest of this post, $ \Pr_D$ denotes $ \Pr_{x \sim D}$, the probability over a random example drawn from the distribution $ D$, and $ \Pr_S$ denotes the probability over a random (training) set of examples drawn from $ D$.

Theorem: Let $ S$ be a set of $ m$ random examples chosen from the distribution $ D$ generating the data. Assume the weak learner corresponds to a finite hypothesis space $ H$ of size $ |H|$, and let $ \delta > 0$. Then with probability at least $ 1 – \delta$ (over the choice of $ S$), every weighted-majority vote function $ f$ satisfies the following generalization bound for every $ \theta > 0$.

$ \displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{\log m \log |H|}{\theta^2} + \log(1/\delta)} \right )$

In other words, this phenomenon is a fact about voting schemes, not boosting in particular. From now on, a “majority vote” function $ f(x)$ will mean to take the sign of a sum of the form $ \sum_{i=1}^N a_i h_i(x)$, where $ a_i \geq 0$ and $ \sum_i a_i = 1$. This is the “convex hull” of the set of weak learners $ H$. If $ H$ is infinite (in our proof it will be finite, but we’ll state a generalization afterward), then only finitely many of the $ a_i$ in the sum may be nonzero.

To prove the theorem, we’ll start by defining a class of functions corresponding to “unweighted majority votes with duplicates:”

Definition: Let $ C_N$ be the set of functions $ f(x)$ of the form $ \frac{1}{N} \sum_{i=1}^N h_i(x)$ where $ h_i \in H$ and the $ h_i$ may contain duplicates (some of the $ h_i$ may be equal to some other of the $ h_j$).

Now every majority vote function $ f$ can be written as a weighted sum of $ h_i$ with weights $ a_i$ (I’m using $ a$ instead of $ \alpha$ to distinguish arbitrary weights from those weights arising from Boosting). So any such $ f(x)$ defines a natural distribution over $ H$ where you draw function $ h_i$ with probability $ a_i$. I’ll call this distribution $ A_f$. If we draw from this distribution $ N$ times and take an unweighted sum, we’ll get a function $ g(x) \in C_N$. Call the random process (distribution) generating functions in this way $ Q_f$. In diagram form, the logic goes

$ f \to $ weights $ a_i \to$ distribution over $ H \to$ function in $ C_N$ by drawing $ N$ times according to $ H$.

The main fact about the relationship between $ f$ and $ Q_f$ is that each is completely determined by the other. Obviously $ Q_f$ is determined by $ f$ because we defined it that way, but $ f$ is also completely determined by $ Q_f$ as follows:

$ \displaystyle f(x) = \mathbb{E}_{g \sim Q_f}[g(x)]$

Proving the equality is an exercise for the reader.

Proof of Theorem. First we’ll split the probability $ \Pr_D[y f(x) \leq 0]$ into two pieces, and then bound each piece.

First a probability reminder. If we have two events $ A$ and $ B$ (in what’s below, this will be $ yg(x) \leq \theta/2$ and $ yf(x) \leq 0$, we can split up $ \Pr[A]$ into $ \Pr[A \textup{ and } B] + \Pr[A \textup{ and } \overline{B}]$ (where $ \overline{B}$ is the opposite of $ B$). This is called the law of total probability. Moreover, because $ \Pr[A \textup{ and } B] = \Pr[A | B] \Pr[B]$ and because these quantities are all at most 1, it’s true that $ \Pr[A \textup{ and } B] \leq \Pr[A \mid B]$ (the conditional probability) and that $ \Pr[A \textup{ and } B] \leq \Pr[B]$.

Back to the proof. Notice that for any $ g(x) \in C_N$ and any $ \theta > 0$, we can write $ \Pr_D[y f(x) \leq 0]$ as a sum:

$ \displaystyle \Pr_D[y f(x) \leq 0] =\\ \Pr_D[yg(x) \leq \theta/2 \textup{ and } y f(x) \leq 0] + \Pr_D[yg(x) > \theta/2 \textup{ and } y f(x) \leq 0]$

Now I’ll loosen the first term by removing the second event (that only makes the whole probability bigger) and loosen the second term by relaxing it to a conditional:

$ \displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_D[y g(x) \leq \theta / 2] + \Pr_D[yg(x) > \theta/2 \mid yf(x) \leq 0]$

Now because the inequality is true for every $ g(x) \in C_N$, it’s also true if we take an expectation of the RHS over any distribution we choose. We’ll choose the distribution $ Q_f$ to get

$ \displaystyle \Pr_D[yf(x) \leq 0] \leq T_1 + T_2$

And $ T_1$ (term 1) is

$ \displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

And $ T_2$ is

$ \displaystyle \Pr_{x \sim D, g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0] = \mathbb{E}_D[\Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]]$

We can rewrite the probabilities using expectations because (1) the variables being drawn in the distributions are independent, and (2) the probability of an event is the expectation of the indicator function of the event.

Now we’ll bound the terms $ T_1, T_2$ separately. We’ll start with $ T_2$.

Fix $ (x,y)$ and look at the quantity inside the expectation of $ T_2$.

$ \displaystyle \Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]$

This should intuitively be very small for the following reason. We’re sampling $ g$ according to a distribution whose expectation is $ f$, and we know that $ yf(x) \leq 0$. Of course $ yg(x)$ is unlikely to be large.

Mathematically we can prove this by transforming the thing inside the probability to a form suitable for the Chernoff bound. Saying $ yg(x) > \theta / 2$ is the same as saying $ |yg(x) – \mathbb{E}[yg(x)]| > \theta /2$, i.e. that some random variable which is a sum of independent random variables (the $ h_i$) deviates from its expectation by at least $ \theta/2$. Since the $ y$’s are all $ \pm 1$ and constant inside the expectation, they can be removed from the absolute value to get

$ \displaystyle \leq \Pr_{g \sim Q_f}[g(x) – \mathbb{E}[g(x)] > \theta/2]$

The Chernoff bound allows us to bound this by an exponential in the number of random variables in the sum, i.e. $ N$. It turns out the bound is $ e^{-N \theta^2 / 8}$.

Now recall $ T_1$

$ \displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

For $ T_1$, we don’t want to bound it absolutely like we did for $ T_2$, because there is nothing stopping the classifier $ f$ from being a bad classifier and having lots of error. Rather, we want to bound it in terms of the probability that $ yf(x) \leq \theta$. We’ll do this in two steps. In step 1, we’ll go from $ \Pr_D$ of the $ g$’s to $ \Pr_S$ of the $ g$’s.

Step 1: For any fixed $ g, \theta$, if we take a sample $ S$ of size $ m$, then consider the event in which the sample probability deviates from the true distribution by some value $ \varepsilon_N$, i.e. the event

$ \displaystyle \Pr_D[yg(x) \leq \theta /2] > \Pr_{S, x \sim S}[yg(x) \leq \theta/2] + \varepsilon_N$

The claim is this happens with probability at most $ e^{-2m\varepsilon_N^2}$. This is again the Chernoff bound in disguise, because the expected value of $ \Pr_S$ is $ \Pr_D$, and the probability over $ S$ is an average of random variables (it’s a slightly different form of the Chernoff bound; see this post for more). From now on we’ll drop the $ x \sim S$ when writing $ \Pr_S$.

The bound above holds true for any fixed $ g,\theta$, but we want a bound over all $ g$ and $ \theta$. To do that we use the union bound. Note that there are only $ (N+1)$ possible choices for a nonnegative $ \theta$ because $ g(x)$ is a sum of $ N$ values each of which is either $ \pm1$. And there are only $ |C_N| \leq |H|^N$ possibilities for $ g(x)$. So the union bound says the above event will occur with probability at most $ (N+1)|H|^N e^{-2m\varepsilon_N^2}$.

If we want the event to occur with probability at most $ \delta_N$, we can judiciously pick

$ \displaystyle \varepsilon_N = \sqrt{(1/2m) \log ((N+1)|H|^N / \delta_N)}$

And since the bound holds in general, we can take expectation with respect to $ Q_f$ and nothing changes. This means that for any $ \delta_N$, our chosen $ \varepsilon_N$ ensures that the following is true with probability at least $ 1-\delta_N$:

$ \displaystyle \Pr_{D, g \sim Q_f}[yg(x) \leq \theta/2] \leq \Pr_{S, g \sim Q_f}[yg(x) \leq \theta/2] + \varepsilon_N$

Now for step 2, we bound the probability that $ yg(x) \leq \theta/2$ on a sample to the probability that $ yf(x) \leq \theta$ on a sample.

Step 2: The first claim is that

$ \displaystyle \Pr_{S, g \sim Q_f}[yg(x) \leq \theta / 2] \leq \Pr_{S} [yf(x) \leq \theta] + \mathbb{E}_{S}[\Pr_{g \sim Q_f}[yg(x) \leq \theta/2 \mid yf(x) \geq \theta]]$

What we did was break up the LHS into two “and”s, when $ yf(x) > \theta$ and $ yf(x) \leq \theta$ (this was still an equality). Then we loosened the first term to $ \Pr_{S}[yf(x) \leq \theta]$ since that is only more likely than both $ yg(x) \leq \theta/2$ and $ yf(x) \leq \theta$. Then we loosened the second term again using the fact that a probability of an “and” is bounded by the conditional probability.

Now we have the probability of $ yg(x) \leq \theta / 2$ bounded by the probability that $ yf(x) \leq 0$ plus some stuff. We just need to bound the “plus some stuff” absolutely and then we’ll be done. The argument is the same as our previous use of the Chernoff bound: we assume $ yf(x) \geq \theta$, and yet $ yg(x) \leq \theta / 2$. So the deviation of $ yg(x)$ from its expectation is large, and the probability that happens is exponentially small in the amount of deviation. The bound you get is

$ \displaystyle \Pr_{g \sim Q}[yg(x) \leq \theta/2 \mid yf(x) > \theta] \leq e^{-N\theta^2 / 8}.$

And again we use the union bound to ensure the failure of this bound for any $ N$ will be very small. Specifically, if we want the total failure probability to be at most $ \delta$, then we need to pick some $ \delta_j$’s so that $ \delta = \sum_{j=0}^{\infty} \delta_j$. Choosing $ \delta_N = \frac{\delta}{N(N+1)}$ works.

Putting everything together, we get that with probability at least $ 1-\delta$ for every $ \theta$ and every $ N$, this bound on the failure probability of $ f(x)$:

$ \displaystyle \Pr_{x \sim D}[yf(x) \leq 0] \leq \Pr_{S, x \sim S}[yf(x) \leq \theta] + 2e^{-N \theta^2 / 8} + \sqrt{\frac{1}{2m} \log \left ( \frac{N(N+1)^2 |H|^N}{\delta} \right )}.$

This claim is true for every $ N$, so we can pick $ N$ that minimizes it. Doing a little bit of behind-the-scenes calculus that is left as an exercise to the reader, a tight choice of $ N$ is $ (4/ \theta)^2 \log(m/ \log |H|)$. And this gives the statement of the theorem.

$ \square$

We proved this for finite hypothesis classes, and if you know what VC-dimension is, you’ll know that it’s a central tool for reasoning about the complexity of infinite hypothesis classes. An analogous theorem can be proved in terms of the VC dimension. In that case, calling $ d$ the VC-dimension of the weak learner’s output hypothesis class, the bound is

$ \displaystyle \Pr_D[yf(x) \leq 0] \leq \Pr_S[yf(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)} \right )$

How can we interpret these bounds with so many parameters floating around? That’s where asymptotic notation comes in handy. If we fix $ \theta \leq 1/2$ and $ \delta = 0.01$, then the big-O part of the theorem simplifies to $ \sqrt{(\log |H| \cdot \log m) / m}$, which is easier to think about since $ (\log m)/m$ goes to zero very fast.

Now the theorem we just proved was about any weighted majority function. The question still remains: why is AdaBoost good? That follows from another theorem, which we’ll state and leave as an exercise (it essentially follows by unwrapping the definition of the AdaBoost algorithm from last time).

Theorem: Suppose that during AdaBoost the weak learners produce hypotheses with training errors $ \varepsilon_1, \dots , \varepsilon_T$. Then for any $ \theta$,

$ \displaystyle \Pr_{(x,y) \sim S} [yf(x) \leq \theta] \leq 2^T \prod_{t=1}^T \sqrt{\varepsilon_t^{(1-\theta)} (1-\varepsilon_t)^{(1+\theta)}}$

Let’s interpret this for some concrete numbers. Say that $ \theta = 0$ and $ \varepsilon_t$ is any fixed value less than $ 1/2$. In this case the term inside product becomes $ \sqrt{\varepsilon (1-\varepsilon)} < 1/2$ and the whole bound tends exponentially quickly to zero in the number of rounds $ T$. On the other hand, if we raise $ \theta$ to about 1/3, then in order to maintain the LHS tending to zero we would need $ \varepsilon < \frac{1}{4} ( 3 – \sqrt{5} )$ which is about 20% error.

If you’re interested in learning more about Boosting, there is an excellent book by Freund and Schapire (the inventors of boosting) called Boosting: Foundations and Algorithms. There they include a tighter analysis based on the idea of Rademacher complexity. The bound I presented in this post is nice because the proof doesn’t require any machinery past basic probability, but if you want to reach the cutting edge of knowledge about boosting you need to invest in the technical stuff.

Until next time!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $ p$. If $ X$ denotes the number of trials needed to get the first success, then clearly $ \Pr(X = k) = (1-p)^{k-1} p$ (since first we need $ k-1$ failures which occur independently with probability $ 1-p$, then we need one success which happens with probability $ p$). Thus the expected value of $ X$ is

$ \displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $ p=1/6$ thus the expected time is $ 1/p=6$.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $ \frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $ \left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$ \displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $ 11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $ 11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly $ \frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $ \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $ T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $ T$ dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $ \$26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $ \$26^{11}$. Thus our casino will have to give out $ 26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $ T$ dollars, the expected value of our expenses is $ 26^{11}+26^4+26$ dollars, thus $ E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $ X_0$ will denote the gambler’s fortune before the game starts, $ X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $ X_n$, given $ X_1, X_2, \ldots, X_{n-1}$ is the same as $ X_{n-1}$. This can be written as $ E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $ E(X_n – X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let $ X’_n = X_n$ if $ n < T$ and $ X’_n = X_T$ if $ n \ge T$ where $ T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $ T$ itself is a random variable.

We require our stopping time $ T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $ X_n$ be a martingale stopped at step $ T$, and suppose one of the following three conditions hold:

  1. The stopping time $ T$ is almost surely bounded by some constant;
  2. The stopping time $ T$ is almost surely finite and every step of the stopped martingale $ X_n$ is almost surely bounded by some constant; or
  3. The expected stopping time $ E(T)$ is finite and the absolute value of the martingale increments $ |X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $ E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $ 26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $ 26^{11}+26^4+26$. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $ a$ dollars, the second player has $ b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $ X_n$ denote the change in the second player’s fortune, and set $ X_0 = 0$. Let $ T_k$ denote the first time $ s$ when $ X_s = k$. Then our first question can be formalized as trying to determine $ \Pr(T_{-b} < T_a)$. Let $ t = \min \{ T_{-b}, T_a\}$. Clearly $ t$ is a stopping time. By the optional stopping theorem we have that

$ \displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $ \Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $ X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $ k$-SAT for some $ k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $ k$ since $ k$-SAT is a special case of $ (k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $ k$-SAT for any $ k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $ x$ and $ y$ are connected by a directed edge if there is a clause $ (\bar x \lor y)$. Recall that $ \bar x \lor y$ is equivalent to $ x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths $ x \to \bar x$ and $ \bar x \to x$ (since $ x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $ O(n^2)$ rounds where $ n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $ O(n^2)$ steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $ \pm 1$ in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $ \frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $ n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $ n$, on the other hand, is called a reflecting barrier: we cannot reach $ n+1$, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $ O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $ X, Y$. Here, if we want to be completely formal, $ X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $ X$, to be the independent variable and the other, say $ Y$, to be the dependent variable. All that means in is that we are assuming there is a relationship between $ X$ and $ Y$, and that we intend to use the value of $ X$ to predict the value of $ Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $ X$ could be age, and $ Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $ X$ corresponds to a proportional change in $ Y$, and so one could hope there exist constants $ a,b$ so that (as random variables) $ Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $ X$ and $ Y$, and try to estimate the value of $ a$ and $ b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $ a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $ \hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $ (x,y)$ where $ x$ is a value of $ X$ and $ y$ of $ Y$, then the error of our predictor on this example is $ |y – \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $ y$ value to our prediction for the same $ x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $ |x| = \sqrt(x^2)$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as $ (y – \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $ (x_i, y_i)_{i=1}^n$ and a potential prediction line $ \hat{f}(x) = ax + b$, we define the error of $ \hat{f}$ on the examples to be

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – \hat{f}(x_i))^2$

Which can also be written as

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – ax_i – b)^2$

Note that since we’re fixing our data sample, the function $ S$ is purely a function of the variables $ a,b$. Now we want to minimize this quantity with respect to $ a,b$, so we can take a gradient,

$ \displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i – ax_i – b) x_i$

$ \displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i – b)$

and set them simultaneously equal to zero. In the first we solve for $ b$:

$ \displaystyle 0 = -2 \sum_{i=1}^n y_i – ax_i – b = -2 \left (-nb + \sum_{i=1}^n y_i – ax_i \right )$

$ \displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i – ax_i$

If we denote by $ x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $ b = y_{\textup{avg}} – ax_{\textup{avg}}$. Substituting $ b$ into the other equation we get

$ \displaystyle -2 \sum_{i=1}^n (y_ix_i – ax_i^2 – y_{\textup{avg}}x_i – ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $ a$, further simplifies to

$ \displaystyle 0 = \sum_{i=1}^n y_ix_i – y_{\textup{avg}}x_i – a \sum_{i=1}^n (x_i^2 – x_{\textup{avg}}x_i)$

And so

$ \displaystyle a = \frac{\sum_{i=1}^n (y_i – y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i – x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
   xAvg, yAvg = map(avg, zip(*points))

   aNum = 0
   aDenom = 0
   for (x,y) in points:
      aNum += (y - yAvg) * x
      aDenom += (x - xAvg) * x

   a = float(aNum) / aDenom
   b = yAvg - a * xAvg
   return (a, b), lambda x: a*x + b

and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

Many Variables and Matrix Form

If we take those two variables $ x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $ x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $ \mathbf{w} = (a,b)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$ \displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $ \mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $ \mathbf{x} = (x_0, x_1)$. The output is still $ y$. Here we understand implicitly that the indices line up: if $ w_0$ is the constant term, then that makes $ x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $ x_1$ is the linear term with coefficient $ w_1$. Now we can just write the prediction function as

$ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $ Y$ depending on the $ n$ random variables $ X_1, \dots, X_n$, then our data will come in tuples of the form $ (\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $ x_0$ is fixed to 1. Then expanding our line $ \mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $ \mathbf{x}_j$ as rows of a matrix $ X$ and the output values $ y_j$ as entries of a vector $ \mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$ y – Xw$

Indeed, each entry of the vector $ Xw$ is a dot product of a row of $ X$ (an input data point) with the coefficients of the line $ w$. It’s just $ \hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$ \displaystyle S(w) = \| y – Xw \|^2$

This is just taking a dot product of $ y-Xw$ with itself. This form is awkward to differentiate because the variable $ w$ is nested in the norm. Luckily, we can get the same result by viewing $ y – Xw$ as a 1-by-$ n$ matrix, transposing it, and multiplying by $ y-Xw$.

$ \displaystyle S(w) = (y – Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $ S$ with respect to each of the $ w_i$ variables in $ w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$ \displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $ x$, the partial $ \frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for $ w$ gives $ w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $ X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
   # input points are n+1 tuples of n inputs and 1 output
   X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
   y = array([p[-1] for p in points])

   Xt = transpose(X)
   theInverse = inv(dot(Xt, X))
   w = dot(dot(theInverse, Xt), y)
   return w, lambda x: dot(w, x)

Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $ X,Y$. If we assume that $ Y$ is perfectly determined by $ X$ via some linear equation $ Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $ x$ example, its corresponding $ y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $ y_i$ would be $ y_i = ax_i + b + \eta_i$, and we say all the $ \eta_i$ are drawn independently and uniformly at random from the set $ \left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $ \eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $ H$ the random variable producing the noise (in Greek $ H$ is the capital letter for $ \eta$), and write $ Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $ \pm 1$, but is rather chosen to be Gaussian with mean $ 0$ and some variance $ \sigma$. We’d denote this by $ \eta_i \sim N(\mu, \sigma)$, and say the $ \eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $ y$ are given by $ y = Xw + H$ where $ X$ is the observed data and $ H$ is Gaussian noise with mean zero and some (unknown) standard deviation $ \sigma$. Then if we call $ \hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$ \displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $ y$ by $ Xw + H$,

$ \displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term is a fat matrix ($ X^{\textup{T}}X$) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

$ \textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $ w$ is constant (so its expected value is just itself) and $ \textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $ \hat{w}$ is just $ w$. Because this is true for all choices of data $ X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of $ \hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the $ w_i$ as variables depending on $ X$, we apply the definition:

$ \textup{Var}(\hat{w} | X) = \textup{E}(\| w – \textup{E}(w) \|^2 | X)$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $ H$ is just the diagonal matrix $ \sigma^2 I_n$, we get that

$ \textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $ (X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!