Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $ p$. If $ X$ denotes the number of trials needed to get the first success, then clearly $ \Pr(X = k) = (1-p)^{k-1} p$ (since first we need $ k-1$ failures which occur independently with probability $ 1-p$, then we need one success which happens with probability $ p$). Thus the expected value of $ X$ is

$ \displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $ p=1/6$ thus the expected time is $ 1/p=6$.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $ \frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $ \left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$ \displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $ 11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $ 11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly $ \frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $ \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $ T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $ T$ dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $ \$26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $ \$26^{11}$. Thus our casino will have to give out $ 26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $ T$ dollars, the expected value of our expenses is $ 26^{11}+26^4+26$ dollars, thus $ E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $ X_0$ will denote the gambler’s fortune before the game starts, $ X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $ X_n$, given $ X_1, X_2, \ldots, X_{n-1}$ is the same as $ X_{n-1}$. This can be written as $ E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $ E(X_n – X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let $ X’_n = X_n$ if $ n < T$ and $ X’_n = X_T$ if $ n \ge T$ where $ T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $ T$ itself is a random variable.

We require our stopping time $ T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $ X_n$ be a martingale stopped at step $ T$, and suppose one of the following three conditions hold:

  1. The stopping time $ T$ is almost surely bounded by some constant;
  2. The stopping time $ T$ is almost surely finite and every step of the stopped martingale $ X_n$ is almost surely bounded by some constant; or
  3. The expected stopping time $ E(T)$ is finite and the absolute value of the martingale increments $ |X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $ E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $ 26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $ 26^{11}+26^4+26$. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $ a$ dollars, the second player has $ b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $ X_n$ denote the change in the second player’s fortune, and set $ X_0 = 0$. Let $ T_k$ denote the first time $ s$ when $ X_s = k$. Then our first question can be formalized as trying to determine $ \Pr(T_{-b} < T_a)$. Let $ t = \min \{ T_{-b}, T_a\}$. Clearly $ t$ is a stopping time. By the optional stopping theorem we have that

$ \displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $ \Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $ X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $ k$-SAT for some $ k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $ k$ since $ k$-SAT is a special case of $ (k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $ k$-SAT for any $ k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $ x$ and $ y$ are connected by a directed edge if there is a clause $ (\bar x \lor y)$. Recall that $ \bar x \lor y$ is equivalent to $ x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths $ x \to \bar x$ and $ \bar x \to x$ (since $ x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $ O(n^2)$ rounds where $ n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $ O(n^2)$ steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $ \pm 1$ in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $ \frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $ n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $ n$, on the other hand, is called a reflecting barrier: we cannot reach $ n+1$, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $ O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $ X, Y$. Here, if we want to be completely formal, $ X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $ X$, to be the independent variable and the other, say $ Y$, to be the dependent variable. All that means in is that we are assuming there is a relationship between $ X$ and $ Y$, and that we intend to use the value of $ X$ to predict the value of $ Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $ X$ could be age, and $ Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $ X$ corresponds to a proportional change in $ Y$, and so one could hope there exist constants $ a,b$ so that (as random variables) $ Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $ X$ and $ Y$, and try to estimate the value of $ a$ and $ b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $ a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $ \hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $ (x,y)$ where $ x$ is a value of $ X$ and $ y$ of $ Y$, then the error of our predictor on this example is $ |y – \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $ y$ value to our prediction for the same $ x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $ |x| = \sqrt(x^2)$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as $ (y – \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $ (x_i, y_i)_{i=1}^n$ and a potential prediction line $ \hat{f}(x) = ax + b$, we define the error of $ \hat{f}$ on the examples to be

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – \hat{f}(x_i))^2$

Which can also be written as

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – ax_i – b)^2$

Note that since we’re fixing our data sample, the function $ S$ is purely a function of the variables $ a,b$. Now we want to minimize this quantity with respect to $ a,b$, so we can take a gradient,

$ \displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i – ax_i – b) x_i$

$ \displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i – b)$

and set them simultaneously equal to zero. In the first we solve for $ b$:

$ \displaystyle 0 = -2 \sum_{i=1}^n y_i – ax_i – b = -2 \left (-nb + \sum_{i=1}^n y_i – ax_i \right )$

$ \displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i – ax_i$

If we denote by $ x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $ b = y_{\textup{avg}} – ax_{\textup{avg}}$. Substituting $ b$ into the other equation we get

$ \displaystyle -2 \sum_{i=1}^n (y_ix_i – ax_i^2 – y_{\textup{avg}}x_i – ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $ a$, further simplifies to

$ \displaystyle 0 = \sum_{i=1}^n y_ix_i – y_{\textup{avg}}x_i – a \sum_{i=1}^n (x_i^2 – x_{\textup{avg}}x_i)$

And so

$ \displaystyle a = \frac{\sum_{i=1}^n (y_i – y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i – x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
   xAvg, yAvg = map(avg, zip(*points))

   aNum = 0
   aDenom = 0
   for (x,y) in points:
      aNum += (y - yAvg) * x
      aDenom += (x - xAvg) * x

   a = float(aNum) / aDenom
   b = yAvg - a * xAvg
   return (a, b), lambda x: a*x + b

and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

Many Variables and Matrix Form

If we take those two variables $ x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $ x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $ \mathbf{w} = (a,b)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$ \displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $ \mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $ \mathbf{x} = (x_0, x_1)$. The output is still $ y$. Here we understand implicitly that the indices line up: if $ w_0$ is the constant term, then that makes $ x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $ x_1$ is the linear term with coefficient $ w_1$. Now we can just write the prediction function as

$ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $ Y$ depending on the $ n$ random variables $ X_1, \dots, X_n$, then our data will come in tuples of the form $ (\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $ x_0$ is fixed to 1. Then expanding our line $ \mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $ \mathbf{x}_j$ as rows of a matrix $ X$ and the output values $ y_j$ as entries of a vector $ \mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$ y – Xw$

Indeed, each entry of the vector $ Xw$ is a dot product of a row of $ X$ (an input data point) with the coefficients of the line $ w$. It’s just $ \hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$ \displaystyle S(w) = \| y – Xw \|^2$

This is just taking a dot product of $ y-Xw$ with itself. This form is awkward to differentiate because the variable $ w$ is nested in the norm. Luckily, we can get the same result by viewing $ y – Xw$ as a 1-by-$ n$ matrix, transposing it, and multiplying by $ y-Xw$.

$ \displaystyle S(w) = (y – Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $ S$ with respect to each of the $ w_i$ variables in $ w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$ \displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $ x$, the partial $ \frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for $ w$ gives $ w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $ X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
   # input points are n+1 tuples of n inputs and 1 output
   X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
   y = array([p[-1] for p in points])

   Xt = transpose(X)
   theInverse = inv(dot(Xt, X))
   w = dot(dot(theInverse, Xt), y)
   return w, lambda x: dot(w, x)

Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $ X,Y$. If we assume that $ Y$ is perfectly determined by $ X$ via some linear equation $ Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $ x$ example, its corresponding $ y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $ y_i$ would be $ y_i = ax_i + b + \eta_i$, and we say all the $ \eta_i$ are drawn independently and uniformly at random from the set $ \left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $ \eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $ H$ the random variable producing the noise (in Greek $ H$ is the capital letter for $ \eta$), and write $ Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $ \pm 1$, but is rather chosen to be Gaussian with mean $ 0$ and some variance $ \sigma$. We’d denote this by $ \eta_i \sim N(\mu, \sigma)$, and say the $ \eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $ y$ are given by $ y = Xw + H$ where $ X$ is the observed data and $ H$ is Gaussian noise with mean zero and some (unknown) standard deviation $ \sigma$. Then if we call $ \hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$ \displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $ y$ by $ Xw + H$,

$ \displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term is a fat matrix ($ X^{\textup{T}}X$) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

$ \textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $ w$ is constant (so its expected value is just itself) and $ \textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $ \hat{w}$ is just $ w$. Because this is true for all choices of data $ X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of $ \hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the $ w_i$ as variables depending on $ X$, we apply the definition:

$ \textup{Var}(\hat{w} | X) = \textup{E}(\| w – \textup{E}(w) \|^2 | X)$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $ H$ is just the diagonal matrix $ \sigma^2 I_n$, we get that

$ \textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $ (X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!

Probability Theory — A Primer

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to  future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each event happening. For example, a set $ \left \{ 0,1 \right \}$ might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

Definition: A finite set $ \Omega$ equipped with a function $ f: \Omega \to [0,1]$ is a probability space if the function $ f$ satisfies the property

$ \displaystyle \sum_{\omega \in \Omega} f(\omega) = 1$

That is, the sum of all the values of $ f$ must be 1.

Sometimes the set $ \Omega$ is called the sample space, and the act of choosing an element of $ \Omega$ according to the probabilities given by $ f$ is called drawing an example. The function $ f$ is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows. Because we don’t really care about the probability of a single outcome as much as we do the probability of an event.

Definition: An event $ E \subset \Omega$ is a subset of a sample space.

For instance, suppose our probability space is $ \Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \}$ and $ f$ is defined by setting $ f(x) = 1/6$ for all $ x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset $ \left \{ 2, 4, 6 \right \}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function $ f$ (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure $ \textup{P}: 2^\Omega \to \mathbb{R}$, where $ 2^\Omega$ denotes the set of all subsets of $ \Omega$. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. i.e. we define

$ \displaystyle \textup{P}(E) = \sum_{e \in E} f(e)$

where by convention the empty sum has value zero. Note that the function $ \textup{P}$ is often denoted $ \textup{Pr}$.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the sum of the probabilities of all outcomes must sum to 1. More coherently: $ \textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1$ by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities $ p$ and $ 1-p$ for some $ p$. Such a probability space is often called a Bernoulli trial.

Now that the function $ \textup{P}$ is defined on all events, we can simplify our notation considerably. Because the probability mass function $ f$ uniquely determines $ \textup{P}$ and because $ \textup{P}$ contains all information about $ f$ in it ($ \textup{P}(\left \{ \omega \right \}) = f(\omega)$), we may speak of $ \textup{P}$ as the probability measure of $ \Omega$, and leave $ f$ out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of $ \textup{P}$ is understood as above.

There are some other quick properties we can state or prove about probability measures: $ \textup{P}(\left \{ \right \}) = 0$ by convention, if $ E, F$ are disjoint then $ \textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F)$, and if $ E \subset F \subset \Omega$ then $ \textup{P}(E) \leq \textup{P}(F)$. The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.

Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

$ \displaystyle \Omega = \left \{ (1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6) \right \}$

Where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of $ \Omega$ because all of the randomness inherent in the game comes from the generation of an output in $ \Omega$ (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

Definition: random variable $ X$ is a real-valued function on the sample space $ \Omega \to \mathbb{R}$.

So for example the random variable for the sum of the two dice would be $ X(a,b) = a+b$. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of all random variables $ \textup{RV}(\Omega)$. It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if $ X, Y$ are two random variables, so is $ aX + bY$ for real numbers $ a, b$. This function operates linearly, in the sense that its value is $ (aX + bY)(\omega) = aX(\omega) + bY(\omega)$. We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, $ XY$ is the product of two random variables (defined by $ XY(\omega) = X(\omega)Y(\omega)$) and one can imagine such awkward constructions as $ X/Y$ or $ X^Y$. We will see in a bit why it these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event $ E$, we can define a random variable which is 0 or 1 depending on whether the input is a member of $ E$. That is,

Definition: An indicator random variable $ 1_E$ is defined by setting $ 1_E(\omega) = 1$ when $ \omega \in E$ and 0 otherwise. A common abuse of notation for singleton sets is to denote $ 1_{\left \{ \omega \right \} }$ by $ 1_\omega$.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

Definition: Denote by $ \left \{ X = k \right \}$ the set of outcomes $ \omega \in \Omega$ for which $ X(\omega) = k$. With the function notation, $ \left \{ X = k \right \} = X^{-1}(k)$.

This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define $ \left \{ X < 5 \right \}$ or $ \left \{ X \textup{ is even} \right \}$ just as we would naively construct sets. It works in general for any subset of $ S \subset \mathbb{R}$. The notation is $ \left \{ X \in S \right \} = X^{-1}(S)$, and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure $ \textup{P}$. That is, we want to write things like $ \textup{P}(X \textup{ is even})$ and read it in our head “the probability that $ X$ is even”.

This is made rigorous by simply setting

$ \displaystyle \textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\omega)$

In words, it is just the sum of the probabilities that individual outcomes will have a value under $ X$ that lands in $ S$. We will also use for $ \textup{P}(\left \{ X \in S \right \} \cap \left \{ Y \in T \right \})$ the shorthand notation $ \textup{P}(X \in S, Y \in T)$ or $ \textup{P}(X \in S \textup{ and } Y \in T)$.

Often times $ \left \{ X \in S \right \}$ will be smaller than $ \Omega$ itself, even if $ S$ is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), let $ X$ be the profit function. Then $ \textup{P}(X > 0)$ is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable $ \textup{im}(X)$ is a finite subset of real numbers. In other words, the set of all events of the form $ \left \{ X = x_i \right \}$ where $ x_i \in \textup{im}(X)$ form a partition of $ \Omega$. As such, we get the following immediate identity:

$ \displaystyle 1 = \sum_{x_i \in \textup{im} (X)} P(X = x_i)$

The set of such events is called the probability distribution of the random variable $ X$.

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events $ E,F \subset \Omega$ are independent if the probability of both $ E, F$ occurring is the product of the probabilities of each event occurring. That is, $ \textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F)$. There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from $ E,F$ being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, $ X,Y$ are independent random variables if

$ \displaystyle \textup{P}(X = x, Y = y) = \textup{P}(X=x)\textup{P}(Y=y)$

for all $ x,y \in \mathbb{R}$.

Expectation

We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory.

Definition: Let $ X$ be a random variable on a finite probability space $ \Omega$. The expected value of $ X$, denoted $ \textup{E}(X)$, is the quantity

$ \displaystyle \textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\omega)$

Note that if we label the image of $ X$ by $ x_1, \dots, x_n$ then this is equivalent to

$ \displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$

The most important fact about expectation is that it is a linear functional on random variables. That is,

Theorem: If $ X,Y$ are random variables on a finite probability space and $ a,b \in \mathbb{R}$, then

$ \displaystyle \textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)$

Proof. The only real step in the proof is to note that for each possible pair of values $ x, y$ in the images of $ X,Y$ resp., the events $ E_{x,y} = \left \{ X = x, Y=y \right \}$ form a partition of the sample space $ \Omega$. That is, because $ aX + bY$ has a constant value on $ E_{x,y}$, the second definition of expected value gives

$ \displaystyle \textup{E}(aX + bY) = \sum_{x \in \textup{im} (X)} \sum_{y \in \textup{im} (Y)} (ax + by) \textup{P}(X = x, Y = y)$

and a little bit of algebraic elbow grease reduces this expression to $ a\textup{E}(X) + b\textup{E}(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $ \sum_{y \in \textup{im}(Y)} \textup{P}(X = x, Y = y)$ is identical to $ \textup{P}(X = x)$. $ \square$

If we additionally know that $ X,Y$ are independent random variables, then the same technique used above allows one to say something about the expectation of the product $ \textup{E}(XY)$ (again by definition, $ XY(\omega) = X(\omega)Y(\omega)$). In this case $ \textup{E}(XY) = \textup{E}(X)\textup{E}(Y)$. We leave the proof as an exercise to the reader.

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $ \textup{E}(X) \in \textup{im}(X)$. For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

$ \displaystyle 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}$

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable $ X$ into simpler (usually indicator) random variables, the computation of $ \textup{E}(X)$ becomes trivial.

tournament  $ T$ is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers $ (v_1, \dots, v_n)$, where we visit vertex $ v_i$ at stage $ i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $ (v_i, v_{i+1})$ is an edge in $ T$ for all $ i$.

Now if we construct a tournament on $ n$ vertices by choosing the direction of each edges independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, $ X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $ \textup{E}(X)$.

To compute this, simply note that we can break $ X = \sum_p X_p$, where $ p$ ranges over all possible lists of the vertices. Then $ \textup{E}(X) = \sum_p \textup{E}(X_p)$, and it suffices to compute the number of possible paths and the expected value of any given path. It isn’t hard to see the number of paths is $ n!$ as this is the number of possible lists of $ n$ items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

$ \textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})$

which by independence is

$ \displaystyle \prod_{i = 1}^n \textup{P}(i^\textup{th} \textup{ edge is chosen}) = \frac{1}{2^{n-1}}$

That is, the expected number of Hamiltonian paths is $ n!2^{-(n-1)}$.

Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable $ X$ are throughout the real line.

Definition: The variance of a random variable $ X$ is the quantity $ \textup{E}((X – \textup{E}(X))^2)$.

That is, $ \textup{E}(X)$ is a number, and so $ X – \textup{E}(X)$ is the random variable defined by $ (X – \textup{E}(X))(\omega) = X(\omega) – \textup{E}(X)$. It is the expectation of the square of the deviation of $ X$ from its expected value.

One often denotes the variance by $ \textup{Var}(X)$ or $ \sigma^2$. The square is for silly reasons: the standard deviation, denoted $ \sigma$ and equivalent to $ \sqrt{\textup{Var}(X)}$ has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall our directly from the definition. We encourage the reader to prove a few:

  • $ \textup{Var}(X) = \textup{E}(X^2) – \textup{E}(X)^2$.
  • $ \textup{Var}(aX) = a^2\textup{Var}(X)$.
  • When $ X,Y$ are independent then variance is additive: $ \textup{Var}(X+Y) = \textup{Var}(X) + \textup{Var}(Y)$.
  • Variance is invariant under constant additives: $ \textup{Var}(X+c) = \textup{Var}(X)$.

In addition, the quantity $ \textup{Var}(aX + bY)$ is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is covariance.

Definition: Let $ X,Y$ be random variables. The covariance of $ X$ and $ Y$, denoted $ \textup{Cov}(X,Y)$, is the quantity $ \textup{E}((X – \textup{E}(X))(Y – \textup{E}(Y)))$.

Note the similarities between the variance definition and this one: if $ X=Y$ then the two quantities coincide. That is, $ \textup{Cov}(X,X) = \textup{Var}(X)$.

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

Theorem: Let $ X,Y$ be random variables with variances $ \sigma_X^2, \sigma_Y^2$. Then their covariance is at most the product of the standard deviations in magnitude:

$ |\textup{Cov}(X,Y)| \leq \sigma_X \sigma_Y$

Proof. Take any two non-constant random variables $ X$ and $ Y$ (we will replace these later with $ X – \textup{E}(X), Y – \textup{E}(Y)$). Construct a new random variable $ (tX + Y)^2$ where $ t$ is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, $ \textup{E}((tX + Y)^2)$. Expanding this and using linearity gives

$ \displaystyle f(t) = t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0$

This is a quadratic function of a single variable $ t$ which is nonnegative. From elementary algebra this means the discriminant is at most zero. i.e.

$ \displaystyle 4 \textup{E}(XY)^2 – 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0$

and so dividing by 4 and replacing $ X,Y$ with $ X – \textup{E}(X), Y – \textup{E}(Y)$, resp., gives

$ \textup{Cov}(X,Y)^2 \leq \sigma_X^2 \sigma_Y^2$

and the result follows. $ \square$

Note that equality holds in the discriminant formula precisely when $ Y = -tX$ (the discriminant is zero), and after the replacement this translates to $ Y – \textup{E}(Y) = -t(X – \textup{E}(X))$ for some fixed value of $ t$. In other words, for some real numbers $ a,b$ we have $ Y = aX + b$.

This has important consequences even in English: the covariance is maximized when $ Y$ is a linear function of $ X$, and otherwise is bounded from above and below. By dividing both sides of the inequality by $ \sigma_X \sigma_Y$ we get the following definition:

Definition: The Pearson correlation coefficient of two random variables $ X,Y$ is defined by

$ \displaystyle r= \frac{\textup{Cov}(X,Y)}{\sigma_X \sigma_Y}$

If $ r$ is close to 1, we call $ X$ and $ Y$ positively correlated. If $ r$ is close to -1 we call them negatively correlated, and if $ r$ is close to zero we call them uncorrelated.

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

covariance

The  horizontal axis plots a sample of values of the random variable $ X$ and the vertical plots a sample of $ Y$. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of $ Y$ given values of $ X$ (same goes for -1, but the predictor has negative slope).

But this strays a bit far from our original point: we still want to find a formula for $ \textup{Var}(aX + bY)$. Expanding the definition, it is not hard to see that this amounts to the following proposition:

Proposition: The variance operator satisfies

$ \displaystyle \textup{Var}(aX+bY) = a^2\textup{Var}(X) + b^2\textup{Var}(Y) + 2ab \textup{Cov}(X,Y)$

And using induction we get a general formula:

$ \displaystyle \textup{Var} \left ( \sum_{i=1}^n a_i X_i \right ) = \sum_{i=1}^n \sum_{j = 1}^n a_i a_j \textup{Cov}(X_i,X_j)$

Note that in the general sum, we get a bunch of terms $ \textup{Cov}(X_i,X_i) = \textup{Var}(X_i)$.

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

Definition: The covariance matrix of a collection of random variables $ X_1, \dots, X_n$ is the matrix whose $ (i,j)$ entry is $ \textup{Cov}(X_i,X_j)$.

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and nonnegative, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables $ Y_j$ (which are linear combinations of the original variables $ X_i$) such that the covariance of distinct pairs $ Y_j, Y_k$ are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called principal component analysis.

Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!