Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, X, Y. Here, if we want to be completely formal, X,Y are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say X, to be the independent variable and the other, say Y, to be the dependent variable. All that means in is that we are assuming there is a relationship between X and Y, and that we intend to use the value of X to predict the value of Y. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, X could be age, and Y could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in X corresponds to a proportional change in Y, and so one could hope there exist constants a,b so that (as random variables) Y = aX + b.  If this were the case then we could just draw many pairs of sample values of X and Y, and try to estimate the value of a and b.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants a, b, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function \hat{f}. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point (x,y) where x is a value of X and y of Y, then the error of our predictor on this example is |y - \hat{f}(x)|. Geometrically this is the vertical distance from the actual y value to our prediction for the same x, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, |x| = \sqrt(x^2), and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as (y - \hat{f}(x))^2, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points (x_i, y_i)_{i=1}^n and a potential prediction line \hat{f}(x) = ax + b, we define the error of \hat{f} on the examples to be

\displaystyle S(a,b) = \sum_{i=1}^n (y_i - \hat{f}(x_i))^2

Which can also be written as

\displaystyle S(a,b) = \sum_{i=1}^n (y_i - ax_i - b)^2

Note that since we’re fixing our data sample, the function S is purely a function of the variables a,b. Now we want to minimize this quantity with respect to a,b, so we can take a gradient,

\displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i - ax_i - b) x_i

\displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i - b)

and set them simultaneously equal to zero. In the first we solve for b:

\displaystyle 0 = -2 \sum_{i=1}^n y_i - ax_i - b = -2 \left ( nb + \sum_{i=1}^n y_i - ax_i \right )

\displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i - ax_i

If we denote by x_{\textup{avg}} = \frac{1}{n} \sum_i x_i this is just b = y_{\textup{avg}} - ax_{\textup{avg}}. Substituting b into the other equation we get

\displaystyle -2 \sum_{i=1}^n (y_ix_i - ax_i^2 - y_{\textup{avg}}x_i - ax_{\textup{avg}}x_i ) = 0

Which, by factoring out a, further simplifies to

\displaystyle 0 = \sum_{i=1}^n y_ix_i - y_{\textup{avg}}x_i - a \sum_{i=1}^n (x_i^2 - x_{\textup{avg}}x_i)

And so

\displaystyle a = \frac{\sum_{i=1}^n (y_i - y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i - x_{\textup{avg}})x_i}

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
   xAvg, yAvg = map(avg, zip(*points))

   aNum = 0
   aDenom = 0
   for (x,y) in points:
      aNum += (y - yAvg) * x
      aDenom += (x - xAvg) * x

   a = float(aNum) / aDenom
   b = yAvg - a * xAvg
   return (a, b), lambda x: a*x + b

and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

Many Variables and Matrix Form

If we take those two variables x,y and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable x_0 which we force to be 1, and then we can write our prediction line in a vector form as \mathbf{w} = (a,b). What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

\displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line \mathbf{w} = (w_0, w_1), and the coefficients of the input data \mathbf{x} = (x_0, x_1). The output is still y. Here we understand implicitly that the indices line up: if w_0 is the constant term, then that makes x_0 = 1 our extra variable (often called a bias variable by statistically minded folks), and x_1 is the linear term with coefficient w_1. Now we can just write the prediction function as

\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable Y depending on the n random variables X_1, \dots, X_n, then our data will come in tuples of the form (\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y), where again the x_0 is fixed to 1. Then expanding our line \mathbf{w} = (w_0 , \dots, w_n), our evaluation function is still \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points \mathbf{x}_j as rows of a matrix X and the output values y_j as entries of a vector \mathbf{y}. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

y - Xw

Indeed, each entry of the vector Xw is a dot product of a row of X (an input data point) with the coefficients of the line w. It’s just \hat{f} applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

\displaystyle S(w) = \| y - Xw \|^2

This is just taking a dot product of y-Xw with itself. This form is awkward to differentiate because the variable w is nested in the norm. Luckily, we can get the same result by viewing y - Xw as a 1-by-n matrix, transposing it, and multiplying by y-Xw.

\displaystyle S(w) = (y - Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of S with respect to each of the w_i variables in w at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

\displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw

Indeed, it’s quite trivial to prove the latter formula, that for any vector x, the partial \frac{\partial x^{\textup{T}}x}{\partial x} = 2x. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for w gives w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y, assuming the inverse of X^{\textup{T}}X exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
   # input points are n+1 tuples of n inputs and 1 output
   X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
   y = array([p[-1] for p in points])

   Xt = transpose(X)
   theInverse = inv(dot(Xt, X))
   w = dot(dot(theInverse, Xt), y)
   return w, lambda x: dot(w, x)

Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables X,Y. If we assume that Y is perfectly determined by X via some linear equation Y = aX + b, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new x example, its corresponding y value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of y_i would be y_i = ax_i + b + \eta_i, and we say all the \eta_i are drawn independently and uniformly at random from the set \left \{ -1,1 \right \}. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the \eta_i depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call H the random variable producing the noise (in Greek H is the capital letter for \eta), and write Y = aX + b + H.

More realistically, the noise isn’t chosen uniformly from \pm 1, but is rather chosen to be Gaussian with mean 0 and some variance \sigma. We’d denote this by \eta_i \sim N(\mu, \sigma), and say the \eta_i are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points y are given by y = Xw + H where X is the observed data and H is Gaussian noise with mean zero and some (unknown) standard deviation \sigma. Then if we call \hat{w} our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

\displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)

Replacing y by Xw + H,

\displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}

Notice that the first term is a fat matrix (X^{\textup{T}}X) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

\textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)

but w is constant (so its expected value is just itself) and \textup{E}(H | X) = 0 by assumption that the noise is symmetric. So then the expected value of \hat{w} is just w. Because this is true for all choices of data X, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of \hat{w} actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the w_i as variables depending on X, we apply the definition:

\textup{Var}(\hat{w} | X) = \textup{E}(\| w - \textup{E}(w) \|^2 | X)

And after some tedious expanding and rewriting and recalling that the covariance matrix of H is just the diagonal matrix \sigma^2 I_n, we get that

\textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}

This means that if we get unlucky and draw some sample which makes some entries of (X^{\textup{T}}X)^{-1} big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!

About these ads

Probabilistic Bounds — A Primer

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable X. Moreover suppose that the expected value of X is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

Two Easy Bounds of Markov and Chebyshev

The first bound we’ll investigate is almost trivial in nature, but comes in handy. Suppose we have a random variable X which is non-negative (as a function). Markov’s inequality is the statement that, for any constant a > 0,

\displaystyle \textup{P}(X \geq a) \leq \frac{\textup{E}(X)}{a}

In words, the probability that X grows larger than some fixed constant is bounded by a quantity that is inversely proportional to the constant.

The proof is quite simple. Let \chi_a be the indicator random variable for the event that X \geq a (\chi_a = 1 when X \geq a and zero otherwise). As with all indicator random variables, the expected value of \chi_a is the probability that the event happens (if this is mysterious, use the definition of expected value). So \textup{E}(\chi_a) = \textup{P}(X \geq a), and linearity of expectation allows us to include a factor of a:

\textup{E}(a \chi_a) = a \textup{P}(X \geq a)

The rest of the proof is simply the observation that \textup{E}(a \chi_a) \leq \textup{E}(X). Indeed, as random variables we have the inequality a \chi_a \leq X. Whenever a < X, the former is equal to zero while the latter is nonnegative. And whenever a \geq X, the former is precisely a while the latter is by assumption at least a. It follows that \textup{E}(a \chi_a) \leq \textup{E}(X).

This last point is a simple property of expectation we omitted from our first primer. It usually goes by monotonicity of expectation, and we prove it here. First, if X \geq 0 then \textup{E}(X) \geq 0 (this is trivial). Second, if 0 \leq X \leq Y, then define a new random variable Z = Y-X. Since Z \geq 0 and using linearity of expectation, it must be that \textup{E}(Z) = \textup{E}(Y) - \textup{E}(X) \geq 0. Hence \textup{E}(X) \leq \textup{E}(Y). Note that we do require that X has a finite expected value for this argument to work, but if this is not the case then Markov’s inequality is nonsensical anyway.

Markov’s inequality by itself is not particularly impressive or useful. For example, if X is the number of heads in a hundred coin flips, Markov’s inequality ensures us that the probability of getting at least 99 heads is at most 50/99, which is about 1/2. Shocking. We know that the true probability is much closer to 2^{-100}, so Markov’s inequality is a bust.

However, it does give us a more useful bound as a corollary. This bound is known as Chebyshev’s inequality, and its use is sometimes referred to as the second moment method because it gives a bound based on the variance of a random variable (instead of the expected value, the “first moment”).

The statement is as follows.

Chebyshev’s Inequality: Let X be a random variable with finite expected value and positive variance. Then we can bound the probability that X deviates from its expected value by a quantity that is proportional to the variance of X. In particular, for any \lambda > 0,

\displaystyle \textup{P}(|X - \textup{E}(X)| \geq \lambda) \leq \frac{\textup{Var}(X)}{\lambda^2}

And without any additional assumptions on X, this bound is sharp.

Proof. The proof is a simple application of Markov’s inequality. Let Y = (X - \textup{E}(X))^2, so that \textup{E}(Y) = \textup{Var}(X). Then by Markov’s inequality

\textup{P}(Y \geq \lambda^2) \leq \frac{\textup{E}(Y)}{\lambda^2}

Since Y is nonnegative |X - \textup{E}(X)| = \sqrt(Y), and \textup{P}(Y \geq \lambda^2) = \textup{P}(|X - \textup{E}(X)| \geq \lambda). The theorem is proved. \square

Chebyshev’s inequality shows up in so many different places (and usually in rather dry, technical bits), that it’s difficult to give a good example application.  Here is one that shows up somewhat often.

Say X is a nonnegative integer-valued random variable, and we want to argue about when X = 0 versus when X > 0, given that we know \textup{E}(X). No matter how large \textup{E}(X) is, it can still be possible that \textup{P}(X = 0) is arbitrarily close to 1. As a colorful example, let X is the number of alien lifeforms discovered in the next ten years. We might debate that \textup{E}(X) can arbitrarily large: if some unexpected scientific and technological breakthroughs occur tomorrow, we could discover an unbounded number of lifeforms. On the other hand, we are very likely not to discover any, and probability theory allows for such a random variable to exist.

If we know everything about \textup{Var}(X), however, we can get more informed bounds.

Theorem: If \textup{E}(X) \neq 0, then \displaystyle \textup{P}(X = 0) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}.

Proof. Simply choose \lambda = \textup{E}(X) and apply Chebyshev’s inequality.

\displaystyle \textup{P}(X = 0) \leq \textup{P}(|X - \textup{E}(X)| \geq \textup{E}(X)) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}

The first inequality follows from the fact that the only time X can ever be zero is when |X - \textup{E}(X)| = \textup{E}(X), and X=0 only accounts for one such possibility. \square

This theorem says more. If we know that \textup{Var}(X) is significantly smaller than \textup{E}(X)^2, then X > 0 is more certain to occur. More precisely, and more computationally minded, suppose we have a sequence of random variables X_n so that \textup{E}(X_n) \to \infty as n \to \infty. Then the theorem says that if \textup{Var}(X_n) = o(\textup{E}(X_n)^2), then \textup{P}(X_n > 0) \to 1. Remembering one of our very early primers on asymptotic notation, f = o(g) means that f grows asymptotically slower than g, and in terms of this fraction \textup{Var}(X) / \textup{E}(X)^2, this means that the denominator dominates the fraction so that the whole thing tends to zero.

The Chernoff Bound

The Chernoff bound takes advantage of an additional hypothesis: our random variable is a sum of independent coin flips. We can use this to get exponential bounds on the deviation of the sum. More rigorously,

Theorem: Let X_1 , \dots, X_n be independent random \left \{ 0,1 \right \}-valued variables, and let X = \sum X_i. Suppose that \mu = \textup{E}(X). Then the probability that X deviates from \mu by more than a factor of \lambda > 0 is bounded from above:

\displaystyle \textup{P}(X > (1+\lambda)\mu) \leq \frac{e^{\lambda \mu}}{(1+\lambda)^{(1+\lambda)\mu}}

The proof is beyond the scope of this post, but we point the interested reader to these lecture notes.

We can apply the Chernoff bound in an easy example. Say all X_i are fair coin flips, and we’re interested in the probability of getting more than 3/4 of the coins heads. Here \mu = n/2 and \lambda = 1/2, so the probability is bounded from above by

\displaystyle \left ( \frac{e}{(3/2)^3} \right )^{n/4} \approx \frac{1}{5^n}

So as the number of coin flips grows, the probability of seeing such an occurrence diminishes extremely quickly to zero. This is important because if we want to test to see if, say, the coins are biased toward flipping heads, we can simply run an experiment with n sufficiently large. If we observe that more than 3/4 of the flips give heads, then we proclaim the coins are biased and we can be assured we are correct with high probability. Of course, after seeing 3/4 of more heads we’d be really confident that the coin is biased. A more realistic approach is to define some \varepsilon that is small enough so as to say, “if some event occurs whose probability is smaller than \varepsilon, then I call shenanigans.” Then decide how many coins and what bound one would need to make the bad event have probability approximately \varepsilon. Finding this balance is one of the more difficult aspects of probabilistic algorithms, and as we’ll see later all of these quantities are left as variables and the correct values are discovered in the course of the proof.

Chernoff-Hoeffding Inequality

The Hoeffding inequality (named after the Finnish statistician, Wassily Høffding) is a variant of the Chernoff bound, but often the bounds are collectively known as Chernoff-Hoeffding inequalities. The form that Hoeffding is known for can be thought of as a simplification and a slight generalization of Chernoff’s bound above.

Theorem: Let X_1, \dots, X_n be independent random variables whose values are within some range [a,b]. Call \mu_i = \textup{E}(X_i), X = \sum_i X_i, and \mu = \textup{E}(X) = \sum_i \mu_i. Then for all t > 0,

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{-2t^2 / n(b-a)^2}

For example, if we are interested in the sum of n rolls of a fair six-sided die, then the probability that we deviate from (7/2)n by more than 5 \sqrt{n \log n} is bounded by 2e^{(-2 \log n)} = 2/n^2. Supposing we want to know how many rolls we need to guarantee with probability 0.01 that we don’t deviate too much, we just do the algebra:

2n^{-2} < 0.01
n^2 > 200
n > \sqrt{200} \approx 14

So with 15 rolls we can be confident that the sum of the rolls will lie between 20 and 85. It’s not the best possible bound we could come up with, because we’re completely ignoring the known structure on dice rolls (that they follow a uniform distribution!). The benefit is that it’s a quick and easy bound that works for any kind of random variable with that expected value.

Another version of this theorem concerns the average of the X_i, and is only a minor modification of the above.

Theorem: If X_1, \dots, X_n are as above, and X = \frac{1}{n} \sum_i X_i, with \mu = \frac{1}{n}(\sum_i \mu_i), then for all t > 0, we get the following bound

\displaystyle \textup{P}(|X - \mu| > t) \leq 2e^{-2nt^2/(b-a)^2}

The only difference here is the extra factor of n in the exponent. So the deviation is exponential both in the amount of deviation (t^2), and in the number of trials.

This theorem comes up very often in learning theory, in particular to prove Boosting works. Mathematicians will joke about how all theorems in learning theory are just applications of Chernoff-Hoeffding-type bounds. We’ll of course be seeing it again as we investigate boosting and the PAC-learning model in future posts, so we’ll see the theorems applied to their fullest extent then.

Until next time!

Probability Theory — A Primer

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to  future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each event happening. For example, a set \left \{ 0,1 \right \} might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

Definition: A finite set \Omega equipped with a function f: \Omega \to [0,1] is a probability space if the function f satisfies the property

\displaystyle \sum_{\omega \in \Omega} f(\omega) = 1

That is, the sum of all the values of f must be 1.

Sometimes the set \Omega is called the sample space, and the act of choosing an element of \Omega according to the probabilities given by f is called drawing an example. The function f is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows. Because we don’t really care about the probability of a single outcome as much as we do the probability of an event.

Definition: An event E \subset \Omega is a subset of a sample space.

For instance, suppose our probability space is \Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \} and f is defined by setting f(x) = 1/6 for all x \in \Omega (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset \left \{ 2, 4, 6 \right \}, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function f (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure \textup{P}: 2^\Omega \to \mathbb{R}, where 2^\Omega denotes the set of all subsets of \Omega. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. i.e. we define

\displaystyle \textup{P}(E) = \sum_{e \in E} f(e)

where by convention the empty sum has value zero. Note that the function \textup{P} is often denoted \textup{Pr}.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the sum of the probabilities of all outcomes must sum to 1. More coherently: \textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1 by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities p and 1-p for some p. Such a probability space is often called a Bernoulli trial.

Now that the function \textup{P} is defined on all events, we can simplify our notation considerably. Because the probability mass function f uniquely determines \textup{P} and because \textup{P} contains all information about f in it (\textup{P}(\left \{ \omega \right \}) = f(\omega)), we may speak of \textup{P} as the probability measure of \Omega, and leave f out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of \textup{P} is understood as above.

There are some other quick properties we can state or prove about probability measures: \textup{P}(\left \{ \right \}) = 0 by convention, if E, F are disjoint then \textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F), and if E \subset F \subset \Omega then \textup{P}(E) \leq \textup{P}(F). The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.

Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

\displaystyle \Omega = \left \{ (1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6) \right \}

Where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of \Omega because all of the randomness inherent in the game comes from the generation of an output in \Omega (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

Definition: random variable X is a real-valued function on the sample space \Omega \to \mathbb{R}.

So for example the random variable for the sum of the two dice would be X(a,b) = a+b. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of all random variables \textup{RV}(\Omega). It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if X, Y are two random variables, so is aX + bY for real numbers a, b. This function operates linearly, in the sense that its value is (aX + bY)(\omega) = aX(\omega) + bY(\omega). We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, XY is the product of two random variables (defined by XY(\omega) = X(\omega)Y(\omega)) and one can imagine such awkward constructions as X/Y or X^Y. We will see in a bit why it these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event E, we can define a random variable which is 0 or 1 depending on whether the input is a member of E. That is,

Definition: An indicator random variable 1_E is defined by setting 1_E(\omega) = 1 when \omega \in E and 0 otherwise. A common abuse of notation for singleton sets is to denote 1_{\left \{ \omega \right \} } by 1_\omega.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

Definition: Denote by \left \{ X = k \right \} the set of outcomes \omega \in \Omega for which X(\omega) = k. With the function notation, \left \{ X = k \right \} = X^{-1}(k).

This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define \left \{ X < 5 \right \} or \left \{ X \textup{ is even} \right \} just as we would naively construct sets. It works in general for any subset of S \subset \mathbb{R}. The notation is \left \{ X \in S \right \} = X^{-1}(S), and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure \textup{P}. That is, we want to write things like \textup{P}(X \textup{ is even}) and read it in our head “the probability that X is even”.

This is made rigorous by simply setting

\displaystyle \textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\omega)

In words, it is just the sum of the probabilities that individual outcomes will have a value under X that lands in S. We will also use for \textup{P}(\left \{ X \in S \right \} \cap \left \{ Y \in T \right \}) the shorthand notation \textup{P}(X \in S, Y \in T) or \textup{P}(X \in S \textup{ and } Y \in T).

Often times \left \{ X \in S \right \} will be smaller than \Omega itself, even if S is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), let X be the profit function. Then \textup{P}(X > 0) is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable \textup{im}(X) is a finite subset of real numbers. In other words, the set of all events of the form \left \{ X = x_i \right \} where x_i \in \textup{im}(X) form a partition of \Omega. As such, we get the following immediate identity:

\displaystyle 1 = \sum_{x_i \in \textup{im} (X)} P(X = x_i)

The set of such events is called the probability distribution of the random variable X.

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events E,F \subset \Omega are independent if the probability of both E, F occurring is the product of the probabilities of each event occurring. That is, \textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F). There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from E,F being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, X,Y are independent random variables if

\displaystyle \textup{P}(X = x, Y = y) = \textup{P}(X=x)\textup{P}(Y=y)

for all x,y \in \mathbb{R}.

Expectation

We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory.

Definition: Let X be a random variable on a finite probability space \Omega. The expected value of X, denoted \textup{E}(X), is the quantity

\displaystyle \textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\omega)

Note that if we label the image of X by x_1, \dots, x_n then this is equivalent to

\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)

The most important fact about expectation is that it is a linear functional on random variables. That is,

Theorem: If X,Y are random variables on a finite probability space and a,b \in \mathbb{R}, then

\displaystyle \textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)

Proof. The only real step in the proof is to note that for each possible pair of values x, y in the images of X,Y resp., the events E_{x,y} = \left \{ X = x, Y=y \right \} form a partition of the sample space \Omega. That is, because aX + bY has a constant value on E_{x,y}, the second definition of expected value gives

\displaystyle \textup{E}(aX + bY) = \sum_{x \in \textup{im} (X)} \sum_{y \in \textup{im} (Y)} (ax + by) \textup{P}(X = x, Y = y)

and a little bit of algebraic elbow grease reduces this expression to a\textup{E}(X) + b\textup{E}(Y). We leave this as an exercise to the reader, with the additional note that the sum \sum_{y \in \textup{im}(Y)} \textup{P}(X = x, Y = y) is identical to \textup{P}(X = x). \square

If we additionally know that X,Y are independent random variables, then the same technique used above allows one to say something about the expectation of the product \textup{E}(XY) (again by definition, XY(\omega) = X(\omega)Y(\omega)). In this case \textup{E}(XY) = \textup{E}(X)\textup{E}(Y). We leave the proof as an exercise to the reader.

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that \textup{E}(X) \in \textup{im}(X). For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

\displaystyle 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable X into simpler (usually indicator) random variables, the computation of \textup{E}(X) becomes trivial.

tournament  T is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers (v_1, \dots, v_n), where we visit vertex v_i at stage i of the traversal. The condition for this to be a valid Hamiltonian path is that (v_i, v_{i+1}) is an edge in T for all i.

Now if we construct a tournament on n vertices by choosing the direction of each edges independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, X is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in \textup{E}(X).

To compute this, simply note that we can break X = \sum_p X_p, where p ranges over all possible lists of the vertices. Then \textup{E}(X) = \sum_p \textup{E}(X_p), and it suffices to compute the number of possible paths and the expected value of any given path. It isn’t hard to see the number of paths is n! as this is the number of possible lists of n items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

\textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})

which by independence is

\displaystyle \prod_{i = 1}^n \textup{P}(i^\textup{th} \textup{ edge is chosen}) = \frac{1}{2^{n-1}}

That is, the expected number of Hamiltonian paths is n!2^{-(n-1)}.

Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable X are throughout the real line.

Definition: The variance of a random variable X is the quantity \textup{E}((X - \textup{E}(X))^2).

That is, \textup{E}(X) is a number, and so X - \textup{E}(X) is the random variable defined by (X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X). It is the expectation of the square of the deviation of X from its expected value.

One often denotes the variance by \textup{Var}(X) or \sigma^2. The square is for silly reasons: the standard deviation, denoted \sigma and equivalent to \sqrt{\textup{Var}(X)} has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall our directly from the definition. We encourage the reader to prove a few:

  • \textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2.
  • \textup{Var}(aX) = a^2\textup{Var}(X).
  • When X,Y are independent then variance is additive: \textup{Var}(X+Y) = \textup{Var}(X) + \textup{Var}(Y).
  • Variance is invariant under constant additives: \textup{Var}(X+c) = \textup{Var}(X).

In addition, the quantity \textup{Var}(aX + bY) is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is covariance.

Definition: Let X,Y be random variables. The covariance of X and Y, denoted \textup{Cov}(X,Y), is the quantity \textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y))).

Note the similarities between the variance definition and this one: if X=Y then the two quantities coincide. That is, \textup{Cov}(X,X) = \textup{Var}(X).

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

Theorem: Let X,Y be random variables with variances \sigma_X^2, \sigma_Y^2. Then their covariance is at most the product of the standard deviations in magnitude:

|\textup{Cov}(X,Y)| \leq \sigma_X \sigma_Y

Proof. Take any two non-constant random variables X and Y (we will replace these later with X - \textup{E}(X), Y - \textup{E}(Y)). Construct a new random variable (tX + Y)^2 where t is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, \textup{E}((tX + Y)^2). Expanding this and using linearity gives

\displaystyle f(t) = t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0

This is a quadratic function of a single variable t which is nonnegative. From elementary algebra this means the discriminant is at most zero. i.e.

\displaystyle 4 \textup{E}(XY)^2 - 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0

and so dividing by 4 and replacing X,Y with X - \textup{E}(X), Y - \textup{E}(Y), resp., gives

\textup{Cov}(X,Y)^2 \leq \sigma_X^2 \sigma_Y^2

and the result follows. \square

Note that equality holds in the discriminant formula precisely when Y = -tX (the discriminant is zero), and after the replacement this translates to Y - \textup{E}(Y) = -t(X - \textup{E}(X)) for some fixed value of t. In other words, for some real numbers a,b we have Y = aX + b.

This has important consequences even in English: the covariance is maximized when Y is a linear function of X, and otherwise is bounded from above and below. By dividing both sides of the inequality by \sigma_X \sigma_Y we get the following definition:

Definition: The Pearson correlation coefficient of two random variables X,Y is defined by

\displaystyle r= \frac{\textup{Cov}(X,Y)}{\sigma_X \sigma_Y}

If r is close to 1, we call X and Y positively correlated. If r is close to -1 we call them negatively correlated, and if r is close to zero we call them uncorrelated.

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

covariance

The  horizontal axis plots a sample of values of the random variable X and the vertical plots a sample of Y. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of Y given values of X (same goes for -1, but the predictor has negative slope).

But this strays a bit far from our original point: we still want to find a formula for \textup{Var}(aX + bY). Expanding the definition, it is not hard to see that this amounts to the following proposition:

Proposition: The variance operator satisfies

\displaystyle \textup{Var}(aX+bY) = a^2\textup{Var}(X) + b^2\textup{Var}(Y) + 2ab \textup{Cov}(X,Y)

And using induction we get a general formula:

\displaystyle \textup{Var} \left ( \sum_{i=1}^n a_i X_i \right ) = \sum_{i=1}^n \sum_{j = 1}^n a_i a_j \textup{Cov}(X_i,X_j)

Note that in the general sum, we get a bunch of terms \textup{Cov}(X_i,X_i) = \textup{Var}(X_i).

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

Definition: The covariance matrix of a collection of random variables X_1, \dots, X_n is the matrix whose (i,j) entry is \textup{Cov}(X_i,X_j).

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and nonnegative, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables Y_j (which are linear combinations of the original variables X_i) such that the covariance of distinct pairs Y_j, Y_k are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called principal component analysis.

Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!