Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability p. If X denotes the number of trials needed to get the first success, then clearly \Pr(X = k) = (1-p)^{k-1} p (since first we need k-1 failures which occur independently with probability 1-p, then we need one success which happens with probability p). Thus the expected value of X is

\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p

by basic calculus. In particular, if success is defined as getting a six, then p=1/6 thus the expected time is 1/p=6.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is \frac{1}{26}, thus the probability of a random eleven-letter word being ABRACADABRA is exactly \left(\frac{1}{26}\right)^{11}. So if typing 11 letters is one trial, the expected number of trials is

\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}

which means 11\cdot 26^{11} keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that 11\cdot 26^{11} is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly \frac{1}{26} (thus the expected value of the change in the gambler’s fortune is \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as T. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly T dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins \$26^4. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be \$26^{11}. Thus our casino will have to give out 26^{11}+26^4+26 dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is T dollars, the expected value of our expenses is 26^{11}+26^4+26 dollars, thus E(T)=26^{11}+26^4+26. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. X_0 will denote the gambler’s fortune before the game starts, X_1 the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of X_n, given X_1, X_2, \ldots, X_{n-1} is the same as X_{n-1}. This can be written as E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1} or, equivalently, E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let X'_n = X_n if n < T and X'_n = X_T if n \ge T where T denotes the stopping time, i.e. the time at which the desired event occurs. Notice that T itself is a random variable.

We require our stopping time T to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let X_n be a martingale stopped at step T, and suppose one of the following three conditions hold:

  1. The stopping time T is almost surely bounded by some constant;
  2. The stopping time T is almost surely finite and every step of the stopped martingale X_n is almost surely bounded by some constant; or
  3. The expected stopping time E(T) is finite and the absolute value of the martingale increments |X_n-X_{n-1}| are almost surely bounded by a constant.

Then E(X_T) = E(X_0).

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than 26^{12}) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by 26^{11}+26^4+26. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has a dollars, the second player has b dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let X_n denote the change in the second player’s fortune, and set X_0 = 0. Let T_k denote the first time s when X_s = k. Then our first question can be formalized as trying to determine \Pr(T_{-b} < T_a). Let t = \min \{ T_{-b}, T_a\}. Clearly t is a stopping time. By the optional stopping theorem we have that

\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))

thus \Pr(T_{-b} < T_a)=\frac{a}{a+b}.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: X_n^2-n is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with k-SAT for some k \ne 3 instead? Clearly the hardness of the problem is monotone increasing in k since k-SAT is a special case of (k+1)-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as k-SAT for any k>3. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals x and y are connected by a directed edge if there is a clause (\bar x \lor y). Recall that \bar x \lor y is equivalent to x \implies y, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths x \to \bar x and \bar x \to x (since x \Leftrightarrow \bar x is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after O(n^2) rounds where n denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in O(n^2) steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by \pm 1 in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least \frac{1}{2}. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond n. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number n, on the other hand, is called a reflecting barrier: we cannot reach n+1, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in O(n^2) steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, X, Y. Here, if we want to be completely formal, X,Y are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say X, to be the independent variable and the other, say Y, to be the dependent variable. All that means in is that we are assuming there is a relationship between X and Y, and that we intend to use the value of X to predict the value of Y. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, X could be age, and Y could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in X corresponds to a proportional change in Y, and so one could hope there exist constants a,b so that (as random variables) Y = aX + b.  If this were the case then we could just draw many pairs of sample values of X and Y, and try to estimate the value of a and b.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants a, b, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function \hat{f}. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point (x,y) where x is a value of X and y of Y, then the error of our predictor on this example is |y - \hat{f}(x)|. Geometrically this is the vertical distance from the actual y value to our prediction for the same x, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, |x| = \sqrt(x^2), and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as (y - \hat{f}(x))^2, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points (x_i, y_i)_{i=1}^n and a potential prediction line \hat{f}(x) = ax + b, we define the error of \hat{f} on the examples to be

\displaystyle S(a,b) = \sum_{i=1}^n (y_i - \hat{f}(x_i))^2

Which can also be written as

\displaystyle S(a,b) = \sum_{i=1}^n (y_i - ax_i - b)^2

Note that since we’re fixing our data sample, the function S is purely a function of the variables a,b. Now we want to minimize this quantity with respect to a,b, so we can take a gradient,

\displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i - ax_i - b) x_i

\displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i - b)

and set them simultaneously equal to zero. In the first we solve for b:

\displaystyle 0 = -2 \sum_{i=1}^n y_i - ax_i - b = -2 \left ( nb + \sum_{i=1}^n y_i - ax_i \right )

\displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i - ax_i

If we denote by x_{\textup{avg}} = \frac{1}{n} \sum_i x_i this is just b = y_{\textup{avg}} - ax_{\textup{avg}}. Substituting b into the other equation we get

\displaystyle -2 \sum_{i=1}^n (y_ix_i - ax_i^2 - y_{\textup{avg}}x_i - ax_{\textup{avg}}x_i ) = 0

Which, by factoring out a, further simplifies to

\displaystyle 0 = \sum_{i=1}^n y_ix_i - y_{\textup{avg}}x_i - a \sum_{i=1}^n (x_i^2 - x_{\textup{avg}}x_i)

And so

\displaystyle a = \frac{\sum_{i=1}^n (y_i - y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i - x_{\textup{avg}})x_i}

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
   xAvg, yAvg = map(avg, zip(*points))

   aNum = 0
   aDenom = 0
   for (x,y) in points:
      aNum += (y - yAvg) * x
      aDenom += (x - xAvg) * x

   a = float(aNum) / aDenom
   b = yAvg - a * xAvg
   return (a, b), lambda x: a*x + b

and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

Many Variables and Matrix Form

If we take those two variables x,y and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable x_0 which we force to be 1, and then we can write our prediction line in a vector form as \mathbf{w} = (a,b). What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

\displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line \mathbf{w} = (w_0, w_1), and the coefficients of the input data \mathbf{x} = (x_0, x_1). The output is still y. Here we understand implicitly that the indices line up: if w_0 is the constant term, then that makes x_0 = 1 our extra variable (often called a bias variable by statistically minded folks), and x_1 is the linear term with coefficient w_1. Now we can just write the prediction function as

\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable Y depending on the n random variables X_1, \dots, X_n, then our data will come in tuples of the form (\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y), where again the x_0 is fixed to 1. Then expanding our line \mathbf{w} = (w_0 , \dots, w_n), our evaluation function is still \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points \mathbf{x}_j as rows of a matrix X and the output values y_j as entries of a vector \mathbf{y}. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

y - Xw

Indeed, each entry of the vector Xw is a dot product of a row of X (an input data point) with the coefficients of the line w. It’s just \hat{f} applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

\displaystyle S(w) = \| y - Xw \|^2

This is just taking a dot product of y-Xw with itself. This form is awkward to differentiate because the variable w is nested in the norm. Luckily, we can get the same result by viewing y - Xw as a 1-by-n matrix, transposing it, and multiplying by y-Xw.

\displaystyle S(w) = (y - Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of S with respect to each of the w_i variables in w at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

\displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw

Indeed, it’s quite trivial to prove the latter formula, that for any vector x, the partial \frac{\partial x^{\textup{T}}x}{\partial x} = 2x. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for w gives w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y, assuming the inverse of X^{\textup{T}}X exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
   # input points are n+1 tuples of n inputs and 1 output
   X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
   y = array([p[-1] for p in points])

   Xt = transpose(X)
   theInverse = inv(dot(Xt, X))
   w = dot(dot(theInverse, Xt), y)
   return w, lambda x: dot(w, x)

Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables X,Y. If we assume that Y is perfectly determined by X via some linear equation Y = aX + b, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new x example, its corresponding y value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of y_i would be y_i = ax_i + b + \eta_i, and we say all the \eta_i are drawn independently and uniformly at random from the set \left \{ -1,1 \right \}. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the \eta_i depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call H the random variable producing the noise (in Greek H is the capital letter for \eta), and write Y = aX + b + H.

More realistically, the noise isn’t chosen uniformly from \pm 1, but is rather chosen to be Gaussian with mean 0 and some variance \sigma. We’d denote this by \eta_i \sim N(\mu, \sigma), and say the \eta_i are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points y are given by y = Xw + H where X is the observed data and H is Gaussian noise with mean zero and some (unknown) standard deviation \sigma. Then if we call \hat{w} our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

\displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)

Replacing y by Xw + H,

\displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}

Notice that the first term is a fat matrix (X^{\textup{T}}X) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

\textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)

but w is constant (so its expected value is just itself) and \textup{E}(H | X) = 0 by assumption that the noise is symmetric. So then the expected value of \hat{w} is just w. Because this is true for all choices of data X, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of \hat{w} actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the w_i as variables depending on X, we apply the definition:

\textup{Var}(\hat{w} | X) = \textup{E}(\| w - \textup{E}(w) \|^2 | X)

And after some tedious expanding and rewriting and recalling that the covariance matrix of H is just the diagonal matrix \sigma^2 I_n, we get that

\textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}

This means that if we get unlucky and draw some sample which makes some entries of (X^{\textup{T}}X)^{-1} big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!

Conditional (Partitioned) Probability — A Primer

One of the main areas of difficulty in elementary probability, and one that requires the highest levels of scrutiny and rigor, is conditional probability. The ideas are simple enough: that we assign probabilities relative to the occurrence of some event. But shrewd applications of conditional probability (and in particular, efficient ways to compute conditional probability) are key to successful applications of this subject. This is the basis for Nate Silver‘s success, the logical flaws of many a political pundit, and the ability for a robot to tell where it is in an environment. As this author usually touts, the best way to avoid the pitfalls of such confusing subjects is to be mathematically rigorous. In doing so we will develop intuition for when conditional probability that experts show off as if it were trivial.

But before we can get to all of that, we will cover a few extra ideas from finite probability theory that were left out of the last post.

Our entire discussion will revolve around a finite probability space, as defined last time. Let’s briefly (and densely) recall some of the notation presented there. We will always denote our probability space by \Omega, and the corresponding probability mass function will be f: \Omega \to [0,1]. Recall that events are subsets E \subset \Omega, and the probability function P accepts as inputs events E, and produces as output the sum of the probabilities of members of E. We abuse notation by saying \textup{P}(x) = \textup{P}(\left \{ x \right \}) = f(x) and disregarding f for the most part. We really think of \textup{P} as an extension of f to subsets of \Omega instead of just single values of \Omega. Further recall that a random variable X is a real-valued function function \Omega \to \mathbb{R}.

Partitions and Total Probability

A lot of reasoning in probability theory involves decomposing a complicated event into simpler events, or decomposing complicated random variables into simpler ones. Conditional probability is one way to do that, and conditional probability has very nice philosophical interpretations, but it fits into this more general scheme of “decomposing” events and variables into components.

The usual way to break up a set into pieces is via a partition. Recall the following set-theoretic definition.

Definition: partition of a set X is a collection of subsets X_i \in X so that every element x \in X occurs in exactly one of the X_i.

Here are a few examples. We can partition the natural numbers \mathbb{N} into even and odd numbers. We can partition the set of people in the world into subsets where each subset corresponds to a country and a person is placed in the subset corresponding to where they were born (an obvious simplification of the real world, but illustrates the point). The avid reader of this blog will remember how we used partitions to define quotient groups and quotient spaces. With a more applied flavor, finding a “good” partition is the ultimate goal of the clustering problem, and we saw a heuristic approach to this in our post on Lloyd’s algorithm.

You should think of a partition as a way to “cut up” a set into pieces. This colorful diagram is an example of a partition of a disc.

In fact, any time we have a canonical way to associate two things in a set, we can create a partition by putting all mutually associated things in the same piece of the partition. The rigorous name for this is an equivalence relation, but we won’t need that for the present discussion (partitions are the same thing as equivalence relations, just viewed in a different way).

Of course, the point is to apply this idea to probability spaces. Points (elements) in our probability space \Omega are outcomes of some random experiment, and subsets E \subset \Omega are events. So we can rephrase a partition for probability spaces as a choice of events E_i \subset \Omega so that every outcome in \Omega is part of exactly one event. Our first observation is quite a trivial one: the probabilities of the events in a partition sum to one. In symbols, if E_1, \dots, E_m form our partition, then

\displaystyle \sum_{i=1}^m \textup{P}(E_i) = 1

Indeed, the definition of \textup{P} is to sum over the probabilities of outcomes in an event. Since each outcome occurs exactly once among all the E_i, the above sum expands to

\displaystyle \sum_{\omega \in \Omega} \textup{P}(\omega)

Which by our axioms for a probability space is just one. We will give this observation the (non-standard) name the Lemma of Total Probability.

This was a nice warmup proof, but we can beef it up to make it more useful. If we have some other event A which is not related to a partition in any way, we can break up A with respect to the partition. Then, assuming this is simpler, we compute the probability that A happens in terms of the probabilities of the pieces.

Theorem: Let E_1, \dots , E_m be a partition of \Omega, and let A be an arbitrary event. Then

\displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(E_i \cap A)

Proof. The proof is only marginally more complicated than that of the lemma of total probability. The probability of the event A occurring is the sum of the probabilities of each of its outcomes occurring. Each outcome in A occurs in exactly one of the E_i, and hence in exactly one of the sets E_i \cap A. If E_i \cap A is empty, then its probability of occurring is zero (as per our definitions last time). So the sum on the right expands directly into the definition of \textup{P}(A). \square

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E's

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E’s. That is, the E’s give us a partition of A.

A more useful way of thinking of this is that we can use the E_i to define a partition of A in a natural way. The subsets in the partition will just be the sets E_i \cap A, and we will throw out any of these that turn out to be empty. Then we can think of our “new” probability space being A, and the theorem is just a special case of the lemma of total probability. Interestingly enough, this special case is often called the Theorem of Total Probability.

The idea to think of the event A as our “new” probability space is extremely useful. It shows its worth most prominently when we interpret the shift as, “gaining the information that A has occurred.” Then the question becomes: given that A occurs, what is the probability that some other event will occur? That is, we’re interested in the probability of some event B relative to A. This is called the conditional probability of B with respect to A, and is denoted P(B | A) (read “the probability of B given A”).

To compute the conditional probability, simply scale \textup{P}(A \cap B) by the assumed event \textup{P}(A). That is,

\displaystyle \textup{P}(B | A) = \frac{\textup{P}(A \cap B)}{\textup{P}(A)}

Wikipedia provides a straightforward derivation of the formula, but the spirit of the proof is exactly what we said above. The denominator is our new sample space, and the numerator is the probability of outcomes that cause B to occur which also cause A to occur. Multiplying both sides of this formula by \textup{P}(A), this identity can be used to arrive at another version of the theorem of total probability:

 \displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(A | E_i) \textup{P}(E_i)

That is, if we know how to compute the probabilities of the E_i, and we know how likely A is to occur in each of those scenarios, then we can compute the total probability of A occurring independently of the E_i.

We can come up with loads of more or less trivial examples of the theorem of total probability on simple probability spaces. Say you play a craps-like game where you roll a die twice. If you get a one on the first roll, you lose, and otherwise you have to match your initial roll on the second to win. The probability you win can be analyzed with the theorem on total probability. We partition the sample space into events corresponding to the outcome of the first roll.

\displaystyle \textup{P}(\textup{Win}) = \sum_{i=1}^6 \textup{P}(\textup{Win } | \textup{ 1st roll }= i) \textup{P}(\textup{1st roll } = i)

The probability the first roll is i is 1/6, and if the first roll is a 1 then the probability of winning after that is zero. In the other 5 cases the conditional probability is the same regardless of i: to match i on the second roll has a 1/6 chance. So the probability of winning is

\displaystyle 5 \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{5}{36}

For the working mathematician, these kinds of examples are relatively low-tech, but it illustrates the main way conditional probability is used in practice. We have some process we want to analyze, and we break it up into steps and condition on the results of a given step. We will see in a moment a more complicated example of this.

Partitions via Random Variables

The most common kind of partition is created via a random variable with finitely many values (or countably many, but we haven’t breached infinite probability spaces yet). In this case, we can partition the sample space \Omega based on the values of X. That is, for each value x = X(\omega), we will have a subset of the partition S_x be the set of all \omega which map to x. In the parlance of functions, it is the preimage of a single value x;

\displaystyle S_x = X^{-1}(x) = \left \{ \omega \in \Omega : X(\omega) = x\right \}

And as the reader is probably expecting, we can use this to define a “relative” expected value of a random variable. Recall that if the image of X is a finite set x_1, \dots, x_n, the expected value of X is a sum

\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)

Suppose X,Y are two such random variables, then the conditional probability of X relative to the event Y=y is the quantity

\displaystyle \textup{P}(X=x | Y=y) = \frac{\textup{P}(X=x \textup{ and } Y=y)}{\textup{P}(Y=y)}

And the conditional expectation of X relative to the event Y = y, denoted \textup{E}(X | Y = y) is a similar sum

\displaystyle \textup{E}(X|Y=y) = \sum_{i=1}^n x_i \textup{P}(X = x_i | Y = y)

Indeed, just as we implicitly “defined” a new sample space when we were partitioning based on events, here we are defining a new random variable (with the odd notation X | Y=y) whose domain is the preimage Y^{-1}(y). We can then ask what the probability of it assuming a value x is, and moreover what its expected value is.

Of course there is an analogue to the theorem of total probability lurking here. We want to say something like the true expected value of X is a sum of the conditional expectations over all possible values of Y. We have to remember, though, that different values of y can occur with very different probabilities, and the expected values of X | Y=y can change wildly between them. Just as a quick (and morbid) example, if X is the number of people who die on a randomly chosen day, and Y is the number of atomic bombs dropped on that day, it is clear that the probability of Y being positive is quite small, and the expected value of X = Y=y will be dramatically larger if y is positive than if it’s zero. (A few quick calculations based on tragic historic events show it would roughly double, using contemporary non-violent death rate estimates.)

And so instead of simply summing the expectation, we need to take an expectation over the values of Y. Thinking again of X | Y=y as a random variable based on values of Y, it makes sense mathematically to take expectation. To distinguish between the two types of expectation, we will subscript the variable being “expected,” as in \textup{E}_X(X|Y). That is, we have the following theorem.

TheoremThe expected value of X satisfies

\textup{E}_X(X) = \textup{E}_Y(\textup{E}_X(X|Y))

Proof. Expand the definitions of what these values mean, and use the definition of conditional probability \textup{P}(A \cap B) = \textup{P}(A | B) \textup{P}(B). We leave the proof as a trivial exercise to the reader, but if one cannot bear it, see Wikipedia for a full proof. \square

Let’s wrap up this post with a non-trivial example of all of this theory in action.

A Nontrivial Example: the Galton-Watson Branching Process

We are interested (as was the eponymous Sir Francis Galton in the 1800’s) in the survival of surnames through generations of marriage and children. The main tool to study such a generational phenomenon is the Galton-Watson branching process. The idea is quite simple, but its analysis quickly blossoms into a rich and detailed theoretical puzzle and a more general practical tool. Just before we get too deep into things, we should note that these ideas (along with other types of branching processes) are used to analyze a whole host of problems in probability theory and computer science. A few the author has recently been working with are the evolution of random graphs and graph property testing.

The gist is as follows: say we live in a patriarchal society in which surnames are passed down on the male side. We can image a family tree being grown step by step in this way At the root there is a single male, and he has k children, some of which are girls and some of which are boys. They all go on to have some number of children, but only the men pass on the family name to their children, and only their male children pass on the family name further. If we only record the family tree along the male lines, we can ask whether the tree will be finite; that is, whether the family name will die out.

To make this rigorous, let us define an infinite sequence of random variables X_1 X_2, \dots which represent the number of children each person in the tree has, and suppose further that all of these variables are independent and uniformly distributed from 1, \dots, n for some fixed n. This may be an unrealistic assumption, but it makes the analysis a bit simpler. The number of children more likely follows a Poisson distribution where the mean is a parameter we would estimate from real-world data, but we haven’t spoken of Poisson distributions on this blog yet so we will leave it out.

We further imagine the tree growing step by step: at step i the i-th individual in the tree has X_i children and then dies. If the individual is a woman we by default set X_i = 0. We can recursively describe the size of the tree at each step by another random variable Y_i. Clearly Y_0 = 1, and the recursion is Y_n = Y_{n-1} + X_i - 1. In words, Y_i represents the current living population with the given surname. We say the tree is finite (the family name dies off), if for some i we get Y_i = 0. The first time at which this happens is when the family name dies off, but abstractly we can imagine the sequence of random variables continuing forever. This is sometimes called fictitious continuation.

At last, we assume that the probability of having a boy or girl is a split 1/2. Now we can start asking questions. What is the probability that the surname dies off? What is the expected size of the tree in terms of n?

For the first question we use the theorem of total probability. In particular, suppose the first person has two boys. Then the whole tree is finite precisely when both boys’ sub-trees are finite. Indeed, the two boys’ sub-trees are independent of one another, and so the probability of both being finite is the product of the probabilities of each being finite. That is, more generally

\displaystyle \textup{P}(\textup{finite } | k \textup{ boys}) = \textup{P}(\textup{finite})^k \textup{P}(\textup{two boys})

Setting z = \textup{P}(\textup{the tree is finite}), we can compute z directly by conditioning on all possibilities of the first person’s children. Notice how we must condition twice here.

\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \textup{P}(k \textup{ boys } | i \textup{ children}) \textup{P}(i \textup{ children}) z^k

The probability of getting k boys is the same as flipping i coins and getting k heads, which is just

\displaystyle \textup{P}(k \textup{ boys } | i \textup{ children}) = \binom{i}{k}\frac{1}{2^i}

So the equation is

\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \binom{i}{k} \frac{1}{2^i} \cdot \frac{1}{n} z^k

From here, we’ve reduced the problem down to picking the correct root of a polynomial. For example, when n=4, the polynomial equation to solve is

\displaystyle 64z = 5 + 10z + 10z^2 + 5z^3 + z^4

We have to be a bit careful, here though. Not all solutions to this equation are valid answers. For instance, the roots must be between 0 and 1 (inclusive), and if there are multiple then one must rule out the irrelevant roots by some additional argument. Moreover, we would need to use a calculus argument to prove there is always a solution between 0 and 1 in the first place. But after all that is done, we can estimate the correct root computationally (or solve for exactly when our polynomials have small degree). Here for n=4, the probability of being finite is about 0.094.

We leave the second question, on the expected size of the tree, for the reader to ponder. Next time we’ll devote an entire post to Bayes Theorem (a trivial consequence of the definition of conditional probability), and see how it helps us compute probabilities for use in programs.

Until then!