It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a *huge* portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

## Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each event happening. For example, a set might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

**Definition:** A finite set $\Omega$ equipped with a function $f: \Omega \to [0,1]$ is a *probability space* if the function $f$ satisfies the property

$$\sum_{\omega \in \Omega} f(\omega) = 1$$

That is, the sum of all the values of $f$ must be 1.

Sometimes the set $\Omega$ is called the *sample space*, and the act of choosing an element of $\Omega$ according to the probabilities given by $f$ is called *drawing* an example. The function $f$ is usually called the *probability mass function*. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows, because we don’t really care about the probability of a single outcome as much as we do the probability of an *event*.

**Definition:** An *event* $E \subset \Omega$ is a subset of a sample space.

For instance, suppose our probability space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is *even*? This event would be the subset $\{2, 4, 6\}$, and if any of these are the outcome of the experiment, the event is said to *occur*. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function $f$ (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a *probability measure* $\textup{P}: 2^\Omega \to [0,1]$, where $2^\Omega$ denotes the set of all subsets of $\Omega$. The example of a die roll guides our intuition: the probability of any event should be the *sum* of the probabilities of the outcomes contained in it, i.e., we define

$$\textup{P}(E) = \sum_{e \in E} f(e)$$

where by convention the empty sum has value zero. Note that the function $\textup{P}$ is often denoted $\textup{Pr}$.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the probabilities of all outcomes must sum to 1. More coherently: $\textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1$ by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities $p$ and $1-p$ for some $p \in [0,1]$. Such a probability space is often called a *Bernoulli trial*.

Now that the function $\textup{P}$ is defined on all events, we can simplify our notation considerably. Because the probability mass function $f$ uniquely determines $\textup{P}$ and because $\textup{P}$ contains all information about $f$ in it ($\textup{P}(\{\omega\}) = f(\omega)$), we may speak of $\textup{P}$ as the *probability measure* of $\Omega$, and leave $f$ out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of $\textup{P}$ is understood as above.

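There are some other quick properties we can state or prove about probability measures: by convention $\textup{P}(\emptyset) = 0$, if $E, F$ are disjoint then $\textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F)$, and if $E \subset F \subset \Omega$ then $\textup{P}(E) \leq \textup{P}(F)$. The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.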
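To make the definitions concrete, here is a minimal Python sketch of the single-die probability space. The names `omega`, `f`, and `prob` are our own choices for illustration, and exact rational arithmetic via `fractions.Fraction` is an assumption made to avoid floating-point noise.

```python
from fractions import Fraction

# Sample space and probability mass function for a single fair die.
omega = [1, 2, 3, 4, 5, 6]
f = {w: Fraction(1, 6) for w in omega}

def prob(event, mass=f):
    """Probability measure: sum the mass of every outcome in the event."""
    return sum((mass[w] for w in event), Fraction(0))

even = {2, 4, 6}
print(prob(even))     # 1/2
print(prob(set()))    # 0, the empty sum
print(prob(omega))    # 1, the defining property of a probability space

# Additivity on disjoint events and monotonicity on nested events.
assert prob({1, 2} | {5}) == prob({1, 2}) + prob({5})
assert prob({2}) <= prob(even)
```

Representing events as plain Python sets keeps the code close to the set-theoretic definitions; nothing here depends on the die being fair beyond the values stored in `f`.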

## Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

$$\Omega = \{ (a, b) : 1 \leq a, b \leq 6 \}$$

where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the *sum* of the two dice than the actual numbers on the dice. In fact, we are *really* more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of $\Omega$ because all of the randomness inherent in the game comes from the generation of an output in $\Omega$ (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

**Definition:** A *random variable* $X$ is a real-valued function on the sample space, $X: \Omega \to \mathbb{R}$.

So for example the random variable for the sum of the two dice would be $X(a,b) = a + b$. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of *all* random variables $\textup{RV}(\Omega)$. It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if $X, Y$ are two random variables, so is $aX + bY$ for real numbers $a, b$. This function operates *linearly*, in the sense that its value is $(aX + bY)(\omega) = aX(\omega) + bY(\omega)$. We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, $XY$ is the product of two random variables (defined by $XY(\omega) = X(\omega)Y(\omega)$) and one can imagine such awkward constructions as $X/Y$ or $X^Y$. We will see in a bit why these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event $E$, we can define a random variable which is 0 or 1 depending on whether the input is a member of $E$. That is,

**Definition:** An *indicator random variable* $1_E$ is defined by setting $1_E(\omega) = 1$ when $\omega \in E$ and 0 otherwise. A common abuse of notation for singleton sets is to denote $1_{\{\omega\}}$ by $1_\omega$.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

**Definition:** Denote by $\{X = k\}$ the set of outcomes $\omega \in \Omega$ for which $X(\omega) = k$. With the function notation, $\{X = k\} = X^{-1}(k)$.

This definition extends to constructing ranges of outcomes of a random variable, i.e., we can define $\{X < 5\}$ or $\{X \textup{ is even}\}$ just as we would naively construct sets. It works in general for any subset $S \subset \mathbb{R}$. The notation is $\{X \in S\} = X^{-1}(S)$, and we will also call these sets *events*. The notation becomes useful and elegant when we combine it with the probability measure $\textup{P}$. That is, we want to write things like $\textup{P}(X \textup{ is even})$ and read it in our head “the probability that $X$ is even”.

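This is made rigorous by simply setting

$$\textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\{\omega\})$$

In words, it is just the sum of the probabilities that individual outcomes will have a value under $X$ that lands in $S$. We will also use for $\textup{P}(\{X \in S\} \cap \{Y \in T\})$ the shorthand notation $\textup{P}(X \in S, Y \in T)$ or $\textup{P}(X \in S \textup{ and } Y \in T)$.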

Often times $\{X \in S\}$ will be smaller than $\Omega$ itself, even if $S$ is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), and let $X$ be the profit function. Then $\{X > 0\}$ is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable $\textup{Im}(X)$ is a finite subset of real numbers. In other words, the events of the form $\{X = x_i\}$, where $x_i$ ranges over $\textup{Im}(X)$, form a partition of $\Omega$. As such, we get the following immediate identity:

$$1 = \sum_{x_i \in \textup{Im}(X)} \textup{P}(X = x_i)$$

The set of such events is called the *probability distribution* of the random variable $X$.
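As an illustration of the event notation, the following sketch builds the two-dice space and computes probabilities of the form $\textup{P}(X \in S)$ for the sum of the dice. The helper `prob_rv` and the choice to represent the set $S$ as a predicate on values are our own assumptions, not part of the definitions above.

```python
from fractions import Fraction
from itertools import product

# Two-dice sample space with the uniform probability measure.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def prob_rv(rv, in_S):
    """P(rv in S): total mass of the outcomes whose value satisfies in_S."""
    return sum((P[w] for w in omega if in_S(rv(w))), Fraction(0))

p_ten = prob_rv(X, lambda v: v == 10)        # outcomes (4,6), (5,5), (6,4)
p_even = prob_rv(X, lambda v: v % 2 == 0)

# The probability distribution of X: one probability per value in its image.
image = sorted({X(w) for w in omega})
distribution = {k: prob_rv(X, lambda v, k=k: v == k) for k in image}

print(p_ten)                        # 1/12
print(sum(distribution.values()))   # 1, since the events {X = k} partition omega
```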

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events $E, F \subset \Omega$ are *independent* if the probability of both occurring is the product of the probabilities of each event occurring. That is, $\textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F)$. There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from $E, F$ being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, $X, Y$ are *independent* random variables if

$$\textup{P}(X = x, Y = y) = \textup{P}(X = x)\textup{P}(Y = y)$$

for all $x, y \in \mathbb{R}$.
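The independence condition can be checked by brute force on a finite space. The sketch below is only an illustration under our own naming (`prob`, `independent`, `first`, `second`, `total` are hypothetical helpers); it tests the product rule over the images of the two random variables, which suffices because both sides are zero for values outside the images.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def prob(pred):
    """Probability of the event {w : pred(w)}."""
    return sum((P[w] for w in omega if pred(w)), Fraction(0))

def independent(A, B):
    """Check P(A = a, B = b) == P(A = a) P(B = b) for all a, b in the images."""
    for a in {A(w) for w in omega}:
        for b in {B(w) for w in omega}:
            joint = prob(lambda w: A(w) == a and B(w) == b)
            if joint != prob(lambda w: A(w) == a) * prob(lambda w: B(w) == b):
                return False
    return True

def first(w):
    return w[0]

def second(w):
    return w[1]

def total(w):
    return w[0] + w[1]

print(independent(first, second))  # True: the two dice do not affect each other
print(independent(first, total))   # False: e.g. first = 1 rules out total = 12
```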

## Expectation

We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory.

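**Definition:** Let $X$ be a random variable on a finite probability space $\Omega$. The *expected value* of $X$, denoted $\textup{E}(X)$, is the quantity

$$\textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\{\omega\})$$

Note that if we label the image of $X$ by $x_1, \dots, x_n$ then this is equivalent to

$$\textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$$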

The most important fact about expectation is that it is a linear functional on random variables. That is,

**Theorem:** If $X, Y$ are random variables on a finite probability space and $a, b \in \mathbb{R}$, then

$$\textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)$$

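*Proof*. The only real step in the proof is to note that for each possible pair of values $x, y$ in the images of $X, Y$ resp., the events $E_{x,y} = \{X = x, Y = y\}$ form a partition of the sample space $\Omega$. That is, because $aX + bY$ has a constant value on $E_{x,y}$, the second definition of expected value gives

$$\textup{E}(aX + bY) = \sum_{x \in \textup{Im}(X)} \sum_{y \in \textup{Im}(Y)} (ax + by) \textup{P}(X = x, Y = y)$$

and a little bit of algebraic elbow grease reduces this expression to $a\textup{E}(X) + b\textup{E}(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $\sum_{y \in \textup{Im}(Y)} \textup{P}(X = x, Y = y)$ is identical to $\textup{P}(X = x)$.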

If we additionally know that $X, Y$ are independent random variables, then the same technique used above allows one to say something about the expectation of the product $\textup{E}(XY)$ (again by definition, $XY(\omega) = X(\omega)Y(\omega)$). In this case $\textup{E}(XY) = \textup{E}(X)\textup{E}(Y)$. We leave the proof as an exercise to the reader.

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $\textup{E}(X) \in \textup{Im}(X)$. For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

$$\textup{E}(X) = 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}$$

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).
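The expected value and its linearity are easy to check numerically on small spaces. What follows is a sketch under our own naming (`expectation`, `uniform`, `first`, `second` are hypothetical), using exact fractions so that the equalities hold exactly rather than approximately.

```python
from fractions import Fraction
from itertools import product

def expectation(rv, P):
    """E(rv) = sum over outcomes w of rv(w) * P({w})."""
    return sum((rv(w) * p for w, p in P.items()), Fraction(0))

# Uniform choice from {1, 2, 3, 4}; the random variable is the identity.
uniform = {w: Fraction(1, 4) for w in range(1, 5)}
print(expectation(lambda w: w, uniform))   # 5/2, not a value the variable attains

# Two dice: check linearity, and the product rule for independent variables.
dice = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def first(w):
    return w[0]

def second(w):
    return w[1]

e_first = expectation(first, dice)
e_second = expectation(second, dice)
e_combo = expectation(lambda w: 2 * first(w) + 3 * second(w), dice)
e_product = expectation(lambda w: first(w) * second(w), dice)

assert e_combo == 2 * e_first + 3 * e_second   # linearity
assert e_product == e_first * e_second         # independence of the two dice
```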

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable $X$ into simpler (usually indicator) random variables, the computation of $\textup{E}(X)$ becomes trivial.

A *tournament* is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a *Hamiltonian path*, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers $(v_1, \dots, v_n)$, where we visit vertex $v_i$ at stage $i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $(v_i, v_{i+1})$ is an edge in the tournament for all $1 \leq i < n$.

Now if we construct a tournament on $n$ vertices by choosing the direction of each edge independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, $X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $\textup{E}(X)$.

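To compute this, simply note that we can break $X = \sum_p X_p$, where $p$ ranges over all possible lists of the vertices and $X_p$ is the indicator random variable for the event that $p$ forms a Hamiltonian path. Then $\textup{E}(X) = \sum_p \textup{E}(X_p)$, and it suffices to compute the number of possible paths and the expected value of any given path. It isn’t hard to see the number of paths is $n!$ as this is the number of possible lists of $n$ items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

$$\textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})$$

which by independence is

$$\prod_{i=1}^{n-1} \textup{P}(i^{\textup{th}} \textup{ edge is chosen correctly}) = \frac{1}{2^{n-1}}$$

That is, the expected number of Hamiltonian paths is $n! \, 2^{-(n-1)}$.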
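For small $n$ the result can be confirmed by exhaustive enumeration rather than by linearity. The function below is a brute-force sketch of our own (the name `expected_hamiltonian_paths` is hypothetical): it lists every tournament on $n$ vertices, counts the Hamiltonian paths in each, and averages, which equals the expectation because every tournament is equally likely. It is only feasible for very small $n$.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import factorial

def expected_hamiltonian_paths(n):
    """Average number of Hamiltonian paths over all tournaments on n vertices."""
    pairs = list(combinations(range(n), 2))
    total_paths = 0
    tournaments = 0
    # Each bit chooses the direction of one edge, so all tournaments are covered.
    for orientation in product([0, 1], repeat=len(pairs)):
        edges = {(u, v) if bit == 0 else (v, u)
                 for (u, v), bit in zip(pairs, orientation)}
        # A vertex ordering is a Hamiltonian path if every consecutive pair is an edge.
        total_paths += sum(
            all((p[i], p[i + 1]) in edges for i in range(n - 1))
            for p in permutations(range(n))
        )
        tournaments += 1
    return Fraction(total_paths, tournaments)

for n in (3, 4):
    assert expected_hamiltonian_paths(n) == Fraction(factorial(n), 2 ** (n - 1))
```

The enumeration touches $2^{\binom{n}{2}} \cdot n!$ cases, which is exactly the kind of work that the indicator-variable argument lets us avoid.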

## Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable are throughout the real line.

**Definition:** The *variance* of a random variable $X$ is the quantity $\textup{E}((X - \textup{E}(X))^2)$.

That is, $\textup{E}(X)$ is a number, and so $X - \textup{E}(X)$ is the random variable defined by $(X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X)$. It is the expectation of the square of the deviation of $X$ from its expected value.

One often denotes the variance by $\textup{Var}(X)$ or $\sigma^2$. The square is for silly reasons: the *standard deviation*, denoted $\sigma$ and equivalent to $\sqrt{\textup{Var}(X)}$, has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall out directly from the definition. We encourage the reader to prove a few:

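- $\textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2$.
- $\textup{Var}(aX) = a^2 \textup{Var}(X)$.
- When $X, Y$ are independent then variance is additive: $\textup{Var}(X + Y) = \textup{Var}(X) + \textup{Var}(Y)$.
- Variance is invariant under constant additives: $\textup{Var}(X + c) = \textup{Var}(X)$.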
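Before proving the properties in the list above, it can help to see them hold on an example. This is a sketch on the two-dice space with helper names of our own choosing (`E`, `Var`, `first`, `second`); it checks particular instances only and is no substitute for the proofs.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def E(rv):
    """Expected value of rv on the two-dice space."""
    return sum((rv(w) * P[w] for w in omega), Fraction(0))

def Var(rv):
    """Variance: expectation of the squared deviation from the mean."""
    mu = E(rv)
    return E(lambda w: (rv(w) - mu) ** 2)

def first(w):
    return w[0]

def second(w):
    return w[1]

v = Var(first)
print(v)  # 35/12

assert v == E(lambda w: first(w) ** 2) - E(first) ** 2           # Var(X) = E(X^2) - E(X)^2
assert Var(lambda w: 3 * first(w)) == 9 * v                      # Var(aX) = a^2 Var(X)
assert Var(lambda w: first(w) + second(w)) == v + Var(second)    # additive when independent
assert Var(lambda w: first(w) + 10) == v                         # invariant under shifts
```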

In addition, the quantity $\textup{Var}(aX + bY)$ is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is *covariance*.

**Definition:** Let $X, Y$ be random variables. The *covariance* of $X$ and $Y$, denoted $\textup{Cov}(X, Y)$, is the quantity $\textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y)))$.

Note the similarities between the variance definition and this one: if $X = Y$ then the two quantities coincide. That is, $\textup{Cov}(X, X) = \textup{Var}(X)$.

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

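**Theorem:** Let $X, Y$ be random variables with variances $\sigma_X^2, \sigma_Y^2$. Then their covariance is at most the product of the standard deviations in magnitude:

$$|\textup{Cov}(X, Y)| \leq \sigma_X \sigma_Y$$

*Proof.* Take any two non-constant random variables $X$ and $Y$ (we will replace these later with $X - \textup{E}(X), Y - \textup{E}(Y)$). Construct a new random variable $(tX + Y)^2$ where $t$ is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, $\textup{E}((tX + Y)^2) \geq 0$. Expanding this and using linearity gives

$$t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0$$

This is a quadratic function of a single variable $t$ which is nonnegative. From elementary algebra this means the discriminant is at most zero, i.e.,

$$4 \textup{E}(XY)^2 - 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0$$

and so dividing by 4 and replacing $X, Y$ with $X - \textup{E}(X), Y - \textup{E}(Y)$, resp., gives

$$\textup{Cov}(X, Y)^2 \leq \sigma_X^2 \sigma_Y^2$$

and the result follows.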

Note that equality holds in the discriminant formula precisely when $Y = -tX$ (the discriminant is zero), and after the replacement this translates to $Y - \textup{E}(Y) = -t(X - \textup{E}(X))$ for some fixed value of $t$. In other words, for some real numbers $a, b$ we have $Y = aX + b$.

This has important consequences even in English: the covariance is maximized when $Y$ is a linear function of $X$, and otherwise is bounded from above and below. By dividing both sides of the inequality by $\sigma_X \sigma_Y$ we get the following definition:

**Definition:** The *Pearson correlation coefficient* of two random variables $X, Y$ is defined by

$$r = \frac{\textup{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

If $r$ is close to 1, we call $X$ and $Y$ *positively correlated*. If $r$ is close to -1 we call them *negatively correlated*, and if $r$ is close to zero we call them *uncorrelated*.

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

The horizontal axis plots a sample of values of the random variable $X$ and the vertical plots a sample of $Y$. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which *have* a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.
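A small finite example shows both the linear case and the failure on a non-linear relationship. The sketch below uses a uniform space on $\{-2, -1, 0, 1, 2\}$, which is our own choice of example, and the helper names (`E`, `cov`, `pearson`) are hypothetical. The square root forces a switch from exact fractions to floating point.

```python
from fractions import Fraction
from math import sqrt

omega = [-2, -1, 0, 1, 2]
P = {w: Fraction(1, 5) for w in omega}

def E(rv):
    return sum((rv(w) * P[w] for w in omega), Fraction(0))

def cov(A, B):
    """Covariance: E((A - E(A)) (B - E(B)))."""
    ma, mb = E(A), E(B)
    return E(lambda w: (A(w) - ma) * (B(w) - mb))

def pearson(A, B):
    """Pearson correlation coefficient, computed in floating point."""
    return float(cov(A, B)) / sqrt(float(cov(A, A)) * float(cov(B, B)))

def identity(w):
    return w

def linear(w):
    return 2 * w + 1      # an increasing linear function of the identity

def negated(w):
    return -w             # a decreasing linear function

def square(w):
    return w * w          # fully determined by w, but not linearly

r_linear = pearson(identity, linear)
r_negated = pearson(identity, negated)
r_square = pearson(identity, square)

assert abs(r_linear - 1) < 1e-12
assert abs(r_negated + 1) < 1e-12
assert abs(r_square) < 1e-12   # zero correlation despite total dependence
```

The last case is the point of the caveat above: `square` is a deterministic function of `identity`, yet the coefficient reports no correlation at all.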

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of $Y$ given values of $X$ (same goes for -1, but the predictor has negative slope).

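But this strays a bit far from our original point: we still want to find a formula for $\textup{Var}(aX + bY)$. Expanding the definition, it is not hard to see that this amounts to the following proposition:

**Proposition:** The variance operator satisfies

$$\textup{Var}(aX + bY) = a^2 \textup{Var}(X) + b^2 \textup{Var}(Y) + 2ab \, \textup{Cov}(X, Y)$$

And using induction we get a general formula:

$$\textup{Var} \left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j \textup{Cov}(X_i, X_j)$$

Note that in the general sum, we get a bunch of terms $\textup{Cov}(X_i, X_i) = \textup{Var}(X_i)$.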
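The two-variable formula can be checked on a pair of dependent random variables, where the covariance term actually matters. As a sketch, we take the first die and the sum of both dice on the two-dice space; the coefficients 2 and -3 and the helper names (`E`, `Cov`, `Var`) are arbitrary choices of ours.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def E(rv):
    return sum((rv(w) * P[w] for w in omega), Fraction(0))

def Cov(A, B):
    ma, mb = E(A), E(B)
    return E(lambda w: (A(w) - ma) * (B(w) - mb))

def Var(A):
    return Cov(A, A)

def first(w):
    return w[0]

def total(w):
    return w[0] + w[1]    # depends on the first die, so Cov(first, total) != 0

a, b = 2, -3
lhs = Var(lambda w: a * first(w) + b * total(w))
rhs = a**2 * Var(first) + b**2 * Var(total) + 2 * a * b * Cov(first, total)

assert lhs == rhs
print(lhs)  # 175/6
```

Dropping the covariance term here would give the wrong answer, since `first` and `total` are not independent.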

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

**Definition:** The *covariance matrix* of a collection of random variables $X_1, \dots, X_n$ is the matrix whose $(i, j)$ entry is $\textup{Cov}(X_i, X_j)$.
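As an illustration of the definition, here is a sketch that builds the covariance matrix for three random variables on the two-dice space as a nested list and uses it to evaluate the variance of a linear combination as a double sum over the entries. The variable choices, coefficients, and helper names are our own assumptions; a real application would more likely use a numerical library.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def E(rv):
    return sum((rv(w) * P[w] for w in omega), Fraction(0))

def Cov(A, B):
    ma, mb = E(A), E(B)
    return E(lambda w: (A(w) - ma) * (B(w) - mb))

# Three random variables: first die, second die, and their sum.
variables = [lambda w: w[0], lambda w: w[1], lambda w: w[0] + w[1]]
n = len(variables)

# Covariance matrix: entry (i, j) is Cov(X_i, X_j).
C = [[Cov(variables[i], variables[j]) for j in range(n)] for i in range(n)]

# The matrix is symmetric, and its diagonal holds the variances.
assert all(C[i][j] == C[j][i] for i in range(n) for j in range(n))

# Variance of a linear combination as a double sum over matrix entries.
coeffs = [1, -2, 3]
quadratic_form = sum(coeffs[i] * coeffs[j] * C[i][j]
                     for i in range(n) for j in range(n))
direct = Cov(lambda w: sum(c * X(w) for c, X in zip(coeffs, variables)),
             lambda w: sum(c * X(w) for c, X in zip(coeffs, variables)))

assert quadratic_form == direct
```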

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and positive semidefinite, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a *new* collection of random variables $Y_j$ (which are *linear* combinations of the original variables $X_i$) such that the covariances of distinct pairs $Y_j, Y_k$ are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called *principal component analysis*.

## Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!

I’m definitely looking forward to read this in more detail, thanks for sharing!

Hello Jeremy, would you consider using MathJax or something similar on your blog? It would help tremendously with content consumption on mobile devices (like my smartphone). Right now, all the images are centered on their own separate line.

Keep up the good work, I love reading this blog.

Oh I’ve wished for wordpress to provide a lot of features (or not implement these anti-features) like that. Unfortunately they don’t, and I can’t really justify setting up my own hosting. Maybe I can file a bug.

Nice! You write very clearly, for a mathematician.

A primer I never might that is so clear and intuitive, for a non mathematician.

Look forward “Next up”

could you shed some light on some intuitive way to interpret the standard deviation? Often with data sets the standard deviation is given along with the mean, mode, etc. But I’ve never quite understood the significance of the standard deviation. What meaningful information does a std dev give us? Thanks!

In general the standard deviation doesn’t tell you anything more than the variance. They are both measures of how “spread out” the data is in the following sense: if the variance is small, then randomly sampled outcomes tend to be closer to the expected value; if the variance is large, they tend to be farther away. This makes sense because the image of a random variable is the real numbers (we can talk about real numbers being “close”). It gives you even more power if you know more information about the distribution of your random variable. For instance, if it’s a normal distribution, then you can start making confidence estimates, a la classical statistics. In particular, you can say that about 68% of the data lies within one standard deviation of the expected value, 95% within two, etc. Coupling this with more beefy theorems like the central limit theorem, you can start saying interesting things about collections of (non-normally distributed) random variables.

The beauty of standard deviation is that it’s on the same scale as the data.

The central limit theorem only tells you about means of i.i.d. draws from distributions with finite means and variances (that is, not the Cauchy distribution, which has very fat tails).

If your distribution is very skewed, measuring deviations can be a very bad approximation to posterior probability mass, which is why people use exact tests rather than normal approximations where possible.

Also, the highest density interval, such as the narrowest interval with 95% of the probability mass, can be quite different than a central interval; the Beta(0.1,10) distribution provides an example.

It is still not apparent to me why the “scale” makes it a more elegant or beautiful quantity. For normal distributions we get these nice interval bounds, but in general the Chernoff bounds always use variance. Maybe this is my mathematical training inhibiting me.

You do say “this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation …”, but it’s perhaps worth saying explicitly that a small or zero correlation between two variables does not imply that they are unrelated. (X = Y^2 is the simplest example). People often say “there’s no relationship because there’s no correlation”, even in press reports and on TV.

Your blogs are excellent.

It’s a lot of work, but keep it up!

Precisely! I believe I mentioned that (if perhaps only briefly) in a figure with some examples of nonlinear relationships giving zero correlation. It’s alarming how little the public understands about maths, even when the very same maths go into law-making and politics :)

I reckon you meant intersection instead of union, when you typed the formula related to the independence of two events as:

P( union(E, F) ) = P(E)P(F)

Quite right!

I think you mean , not

I have no idea why math people think that expressing things in symbols not found in the typical programming language or keyboard is in any way clearer to a programmer.

“More coherently”, we read, and I honestly believe that line was written with no deliberate sense of irony, “\textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1″ – making even a medley of Perl and Lisp look like line noise.

Like regular expressions, such notation is not designed for either clarity or readability.

It is a concise way to represent things – but a very bad way indeed to explain things.

You repeated the LaTeX code, but I hope you really actually saw the image as well.

I hate to say it but if you don’t like the sigma notation and the letter omega, you probably shouldn’t read this blog. Everything in that equation is defined in this primer; if you’re not willing to remember or refer back to the definitions of things, it would be equally incoherent to express it in any notation. As all web developers know, all things being the same the simplest option is the best. Mathematical notation is the simplest and most widely understood, and making up my own programmer-friendly notation would be far worse: then *all* of my readers would have to puzzle over it (instead of just my readers who are afraid of math, which is a small fraction to be sure).

I think you have to remember that math came along *way* before keyboards and programming languages, and I’m not targeting programmers. Indeed, the whole point of this blog is that I believe programmers stand to learn a lot by getting involved with some real mathematics. In my mind, the programming world has too much of a focus on vapid web development and useless mobile apps, and not enough on doing anything interesting. This blog is a testament to the fact that many of those interesting things require (or stand to benefit from) mathematics.

Well said!

I gave up half-way true… But really nice article… I’ll give it a go again very shortly… It’s the beauty of advanced maths all over again… Well done… But just thinking, maybe I’m a more visually biased learner for abstract concepts (typical of advanced mathematics), however if you could add up a few more illustrations, to explain some of the basics progressively, this would be very helpful.. Well anyways, great article… And your delivery is awesome…

I recommend reading with a pencil and paper handy. Jotting down a quick example of your own invention is much more illustrative to the learning mind than reading an example made by someone else. A fluid comprehension of working with sets is also important. I have an old primer for that too :)

But of course, these primers are supposed to be terse; I don’t expect any of my primers to replace the time-tested textbooks on the subjects.

Error?

> That is, \textup{E}(X) is a number, and so X – \textup{E}(X) is the random variable defined by $latex