It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.
And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).
We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.
So let us begin with probability spaces and random variables.
Finite Probability Spaces
We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each outcome happening. For example, a set $\{0, 1\}$ might represent the heads and tails outcomes of a coin flip, while the function assigns a probability of one half to each. As usual, this intuition is made rigorous by the following definition.
Definition: A finite set $\Omega$ equipped with a function $f: \Omega \to [0,1]$ is a probability space if $f$ satisfies the property $\sum_{\omega \in \Omega} f(\omega) = 1$.
That is, the sum of all the values of $f$ must be 1.
Sometimes the set $\Omega$ is called the sample space, and the act of choosing an element of $\Omega$ according to the probabilities given by $f$ is called drawing an example. The function $f$ is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows, because we don't really care about the probability of a single outcome as much as we do the probability of an event.
Definition: An event $E$ is a subset of a sample space $\Omega$.
For instance, suppose our probability space is $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $f$ is defined by setting $f(x) = 1/6$ for all $x \in \Omega$ (here the "experiment" is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what the probability is that the outcome is even. This event would be the subset $\{2, 4, 6\}$, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).
As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?
Of course, we want to extend the probability mass function $f$ (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure $\textup{P}: 2^\Omega \to \mathbb{R}$, where $2^\Omega$ denotes the set of all subsets of $\Omega$. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. That is, we define $\textup{P}(E) = \sum_{e \in E} f(e)$,
where by convention the empty sum has value zero. Note that the function $\textup{P}$ is often denoted $\Pr$.
So for example, the coin flip experiment can't have zero probability for both of the two outcomes 0 and 1; the probabilities of all outcomes must sum to 1. More coherently, $\textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1$ by the defining property of a probability space.
Now that the function $\textup{P}$ is defined on all events, we can simplify our notation considerably. Because the probability mass function $f$ uniquely determines $\textup{P}$, and because $\textup{P}$ contains all the information of $f$ in it ($f(\omega) = \textup{P}(\{\omega\})$), we may speak of $\textup{P}$ as the probability measure of $\Omega$ and leave $f$ out of the picture. When we define a probability space in the future, we will often just give the probability mass function, with $\textup{P}$ understood to be defined as above.
There are some other quick properties we can state or prove about probability measures (see the code sketch below for these definitions in action):
- $\textup{P}(\emptyset) = 0$.
- If $E \subset F$ then $\textup{P}(E) \leq \textup{P}(F)$.
- $\textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F) - \textup{P}(E \cap F)$, and in particular $\textup{P}(E \cup F) \leq \textup{P}(E) + \textup{P}(F)$.
- $\textup{P}(\Omega \setminus E) = 1 - \textup{P}(E)$.
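For the programmers following along, here is one way these definitions might look in code. This is a minimal sketch of our own (the names `f` and `P` simply mirror the definitions above; representing the mass function as a dict of `Fraction`s is just one convenient choice):

```python
# A minimal sketch of a finite probability space: the mass function f is
# a dict from outcomes to masses, and the measure P of an event (any
# subset of the sample space) is the sum of the masses of its outcomes.
from fractions import Fraction

# The single-die experiment: six outcomes, each with mass 1/6.
f = {outcome: Fraction(1, 6) for outcome in range(1, 7)}
assert sum(f.values()) == 1  # the defining property of a probability space

def P(event):
    return sum(f[omega] for omega in event)

evens = {2, 4, 6}
print(P(evens))    # 1/2
print(P(set()))    # 0, by the empty sum convention
print(P(set(f)))   # 1, i.e. P(Omega) = 1
```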
Random Variables
The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

$\Omega = \{(1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6)\},$

where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.
Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of $\Omega$, because all of the randomness inherent in the game comes from the generation of an outcome in $\Omega$ (otherwise, the payoff would be deterministic).
And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.
Definition: A random variable $X$ is a real-valued function $X: \Omega \to \mathbb{R}$ on the sample space.
So for example the random variable for the sum of the two dice would be $X(a, b) = a + b$. We will slowly phase out the function notation as we go, reverting to it only when we need to avoid ambiguity.
We can further consider the set of all random variables on $\Omega$. It is important to note that this forms a vector space: we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if $X, Y$ are random variables and $a, b \in \mathbb{R}$, then $aX + bY$ is the random variable defined pointwise by $(aX + bY)(\omega) = aX(\omega) + bY(\omega)$. We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.
Of course, there are plenty of other things one can do to functions. For example, $XY$ is the product random variable defined by $(XY)(\omega) = X(\omega)Y(\omega)$, and one can imagine such awkward constructions as $X/Y$ or $X^Y$ (provided they are defined on every outcome). We will see in a bit why the product is useful.
The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event $E \subset \Omega$, we can define a random variable which is 1 or 0 depending on whether the input outcome is a member of $E$. That is,
Definition: An indicator random variable $1_E$ is defined by setting $1_E(\omega) = 1$ when $\omega \in E$ and 0 otherwise. A common abuse of notation for singleton sets is to denote $1_{\{\omega\}}$ by $1_\omega$.
This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can get a six, a five, or a four on the first die, and then the second die must show whatever makes the pair sum to ten.
The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, the expected value of a linear combination of random variables can be computed by finding the expected values of the simpler pieces. We can't yet make that rigorous though, because we don't yet know what it means to speak of the probability of a random variable's outcome.
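As a quick illustration, here is a hypothetical sketch of an indicator variable for the "sum is ten" event on the two-dice space (the helper names are ours, and the dict-of-Fractions representation matches the earlier sketch):

```python
# A sketch of an indicator random variable 1_E on the two-dice space.
# The event "the dice sum to ten" contains exactly (4,6), (5,5), (6,4).
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # all 36 outcomes
f = {w: Fraction(1, 36) for w in omega}

def indicator(event):
    # 1_E as a function on outcomes: 1 inside the event, 0 outside
    return lambda w: 1 if w in event else 0

sums_to_ten = {w for w in omega if sum(w) == 10}
one_E = indicator(sums_to_ten)
print(one_E((4, 6)), one_E((1, 1)))    # 1 0
print(sum(f[w] for w in sums_to_ten))  # P(E) = 3/36 = 1/12
```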
Definition: Denote by $\{X = k\}$ the set of outcomes $\omega \in \Omega$ for which $X(\omega) = k$. With the function notation, $\{X = k\} = X^{-1}(k)$.
This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define $\{X < 5\}$ or $\{a \leq X \leq b\}$ to be the set of outcomes mapping under $X$ to values less than 5, or into the interval $[a, b]$, respectively. And we may speak of the probability of these events, written $\textup{P}(X = k)$ or $\textup{P}(a \leq X \leq b)$.
This is made rigorous by simply setting $\textup{P}(X = k) = \sum_{\omega \in \{X = k\}} f(\omega)$.
In words, it is just the sum of the probabilities that individual outcomes will have a value under $X$ of $k$.
Oftentimes we drop the set braces and simply write $\textup{P}(X = k)$ for $\textup{P}(\{X = k\})$, and similarly for ranges; no confusion will arise from this.
We should also note that because our probability spaces are finite, the image of the random variable, $\textup{im}(X)$, is a finite subset of the real numbers. In other words, the set of all events of the form $\{X = x\}$ where $x \in \textup{im}(X)$ forms a partition of $\Omega$. As such, we get the immediate identity $\sum_{x \in \textup{im}(X)} \textup{P}(X = x) = 1$.
The set of such events is called the probability distribution of the random variable $X$.
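Continuing the dice example, a short sketch (again our own illustration, reusing the representation above) that computes the distribution of the sum of two dice and checks the partition identity:

```python
# Compute the probability distribution of X(a, b) = a + b on the
# two-dice space with the uniform measure.
from fractions import Fraction
from itertools import product
from collections import defaultdict

omega = list(product(range(1, 7), repeat=2))
f = {w: Fraction(1, 36) for w in omega}
X = lambda w: w[0] + w[1]

distribution = defaultdict(Fraction)
for w in omega:
    distribution[X(w)] += f[w]   # P(X = k) as a sum over {X = k}

print(dict(distribution))                 # P(X = 7) = 1/6 is the largest
assert sum(distribution.values()) == 1    # the events {X = k} partition Omega
```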
The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events $E, F \subset \Omega$ are independent if the probability of both occurring is the product of the probabilities of each occurring individually, i.e. $\textup{P}(E \cap F) = \textup{P}(E) \textup{P}(F)$. Note that independence is not the same as disjointness: if $E, F$ are disjoint and both have nonzero probability, then $\textup{P}(E \cap F) = 0 \neq \textup{P}(E)\textup{P}(F)$, so they cannot be independent.
The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, $X$ and $Y$ are independent random variables if

$\textup{P}(X = x \textup{ and } Y = y) = \textup{P}(X = x) \textup{P}(Y = y)$

for all $x, y \in \mathbb{R}$.
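A quick sketch checking this definition on the two-dice space (the predicate-based `P` helper is our own device): the two coordinate variables are independent, while the first die and the sum of both dice are not.

```python
# Check independence of random variables by brute force over events.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
f = {w: Fraction(1, 36) for w in omega}

def P(pred):
    # probability of the event {w : pred(w)}
    return sum(f[w] for w in omega if pred(w))

X, Y, S = (lambda w: w[0]), (lambda w: w[1]), (lambda w: w[0] + w[1])

independent = all(
    P(lambda w: X(w) == x and Y(w) == y)
    == P(lambda w: X(w) == x) * P(lambda w: Y(w) == y)
    for x in range(1, 7) for y in range(1, 7)
)
print(independent)                                        # True
print(P(lambda w: X(w) == 1 and S(w) == 12))              # 0
print(P(lambda w: X(w) == 1) * P(lambda w: S(w) == 12))   # 1/216, so X, S dependent
```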
Expectation
We now turn to notions of expected value and variance, which form the cornerstone of the applications of probability theory.
Definition: Let $X$ be a random variable on a finite probability space $\Omega$. The expected value of $X$, denoted $\textup{E}(X)$, is the quantity $\textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) f(\omega)$.
Note that if we label the image of $X$ by $x_1, \dots, x_n$, then this is equivalent to $\textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)$.
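The equivalence of the two formulas is easy to verify numerically. A brief sketch (our own) for the sum of two dice:

```python
# E(X) computed two ways: over outcomes weighted by f, and over the
# image of X weighted by P(X = k).
from fractions import Fraction
from itertools import product
from collections import defaultdict

omega = list(product(range(1, 7), repeat=2))
f = {w: Fraction(1, 36) for w in omega}
X = lambda w: w[0] + w[1]

by_outcomes = sum(X(w) * f[w] for w in omega)

dist = defaultdict(Fraction)
for w in omega:
    dist[X(w)] += f[w]
by_image = sum(k * p for k, p in dist.items())

print(by_outcomes, by_image)   # both equal 7
```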
The most important fact about expectation is that it is a linear functional on random variables. That is,
Theorem: If $X, Y$ are random variables on a finite probability space and $a, b \in \mathbb{R}$, then $\textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)$.
Proof. The only real step in the proof is to note that for each possible pair of values $x, y$ in the images of $X, Y$ resp., the events $E_{x,y} = \{X = x, Y = y\}$ form a partition of the sample space $\Omega$. That is, because $aX + bY$ has the constant value $ax + by$ on $E_{x,y}$, the second definition of expected value gives

$\textup{E}(aX + bY) = \sum_{x \in \textup{im}(X)} \sum_{y \in \textup{im}(Y)} (ax + by) \textup{P}(X = x \textup{ and } Y = y)$

and a little bit of algebraic elbow grease reduces this expression to $a\textup{E}(X) + b\textup{E}(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $\sum_{y \in \textup{im}(Y)} \textup{P}(X = x \textup{ and } Y = y)$ is identical to $\textup{P}(X = x)$. $\square$
If we additionally know that $X$ and $Y$ are independent, then the expected value of their product is the product of their expected values: $\textup{E}(XY) = \textup{E}(X)\textup{E}(Y)$. The proof is a similar exercise in unpacking the definitions.
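A sketch (ours) checking this for the independent dice coordinates, and showing the identity fail for the dependent pair $(X, X)$:

```python
# E(XY) = E(X)E(Y) holds for independent X, Y, but not in general.
from fractions import Fraction
from itertools import product

space = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(X):
    return sum(X(w) * p for w, p in space.items())

X, Y = (lambda w: w[0]), (lambda w: w[1])
print(E(lambda w: X(w) * Y(w)), E(X) * E(Y))   # both 49/4: independent
print(E(lambda w: X(w) * X(w)), E(X) ** 2)     # 91/6 versus 49/4: dependent
```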
Now intuitively the expected value of a random variable is the "center" of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $\textup{E}(X) \in \textup{im}(X)$. For instance, in an experiment where we pick a number uniformly at random from $\{1, 2, 3, 4\}$ (with the random variable being the identity function), the expected value is $1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}$.
But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).
Let's see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable $X$ into simpler (usually indicator) random variables, the computation of $\textup{E}(X)$ becomes trivial.
A tournament $T$ is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of vertices $(v_1, \dots, v_n)$, where we visit vertex $v_i$ at stage $i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $(v_i, v_{i+1})$ is an edge in $T$ for all $i$.
Now if we construct a tournament on $n$ vertices by choosing the direction of each edge randomly with uniform probability 1/2, then we have a very nice probability space, and we can ask what the expected number of Hamiltonian paths is. That is, $X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $\textup{E}(X)$.
To compute this, simply note that we can break $X$ into a sum of indicator variables: $X = \sum_p X_p$, where $p$ ranges over all possible lists of the $n$ vertices and $X_p$ is 1 exactly when $p$ forms a Hamiltonian path. Linearity gives $\textup{E}(X) = \sum_p \textup{E}(X_p)$, so it suffices to count the lists and compute the expected value of any single $X_p$. There are $n!$ lists, since a list is just an ordering of the $n$ vertices; and since $X_p$ is an indicator, $\textup{E}(X_p)$ is just the probability that $p$ forms a Hamiltonian path. Because each edge direction is chosen independently with probability 1/2, this is the probability that each of the $n - 1$ consecutive pairs in $p$ is oriented correctly,

$\textup{P}(\textup{edge } 1 \textup{ correct}, \dots, \textup{edge } n-1 \textup{ correct}),$

which by independence is

$\prod_{i=1}^{n-1} \textup{P}(\textup{edge } i \textup{ correct}) = \left(\frac{1}{2}\right)^{n-1}.$

That is, the expected number of Hamiltonian paths is $\frac{n!}{2^{n-1}}$.
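As a sanity check, here is a short sketch of ours that exhaustively averages the number of Hamiltonian paths over every tournament on a small number of vertices and compares with the formula (exhaustive enumeration is only feasible for small $n$, of course):

```python
# Average the number of Hamiltonian paths over all 2^(n choose 2)
# tournaments on n vertices, and compare with n!/2^(n-1).
from itertools import combinations, permutations, product
from math import factorial

def expected_hamiltonian_paths(n):
    pairs = list(combinations(range(n), 2))
    total = 0
    # Each bit vector orients every pair: bit 0 means i -> j, bit 1 means j -> i.
    for bits in product([0, 1], repeat=len(pairs)):
        edges = {(j, i) if b else (i, j) for (i, j), b in zip(pairs, bits)}
        total += sum(
            all((p[k], p[k + 1]) in edges for k in range(n - 1))
            for p in permutations(range(n))
        )
    return total / 2 ** len(pairs)

print(expected_hamiltonian_paths(4))   # 3.0
print(factorial(4) / 2 ** 3)           # 3.0, i.e. n!/2^(n-1)
```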
Variance and Covariance
Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable $X$ are across the real line.
Definition: The variance of a random variable $X$ is the quantity $\textup{Var}(X) = \textup{E}((X - \textup{E}(X))^2)$.
That is, $\textup{E}(X)$ is a number, so $X - \textup{E}(X)$ is the random variable defined by $(X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X)$, and the variance is the expected square of the deviation of $X$ from its expected value.
One often denotes the variance by $\textup{Var}(X)$ or $\sigma^2$. The square is there so that the standard deviation, denoted $\sigma$ and defined as $\sqrt{\textup{Var}(X)}$, has the same units as the values of $X$ itself.
The variance operator has a few properties that make it quite different from expectation, but which nonetheless fall out directly from the definition. We encourage the reader to prove a few (two of them are verified in the sketch after this list):

- $\textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2$.
- For a constant $a$, variance is quadratically homogeneous: $\textup{Var}(aX) = a^2 \textup{Var}(X)$.
- When $X, Y$ are independent then variance is additive: $\textup{Var}(X + Y) = \textup{Var}(X) + \textup{Var}(Y)$.
- Variance is invariant under constant additives: $\textup{Var}(X + c) = \textup{Var}(X)$.
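Here is the promised sketch (our own code, with `E` and `Var` written straight from the definitions), checking the first and third properties on fair dice:

```python
# Verify Var(X) = E(X^2) - E(X)^2 and additivity of variance for
# independent variables, on a fair die D and an independent pair.
from fractions import Fraction
from itertools import product

def E(X, space):
    return sum(X(w) * p for w, p in space.items())

def Var(X, space):
    mu = E(X, space)
    return E(lambda w: (X(w) - mu) ** 2, space)

die = {k: Fraction(1, 6) for k in range(1, 7)}
D = lambda w: w
print(Var(D, die))                                    # 35/12
print(E(lambda w: D(w) ** 2, die) - E(D, die) ** 2)   # 35/12 again

two = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}
S = lambda w: w[0] + w[1]
print(Var(S, two), 2 * Var(D, die))                   # both 35/6
```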
In addition, the quantity $\textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y)))$ is quite useful, and it gets its own name.
Definition: Let $X, Y$ be random variables. The covariance of $X$ and $Y$, denoted $\textup{Cov}(X, Y)$, is the quantity $\textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y)))$.
Note the similarities between the variance definition and this one: if $X = Y$ then the two quantities coincide. That is, $\textup{Cov}(X, X) = \textup{Var}(X)$.
There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.
Theorem: Let $X, Y$ be random variables with variances $\sigma_X^2, \sigma_Y^2$. Then the magnitude of their covariance is at most the product of the standard deviations: $|\textup{Cov}(X, Y)| \leq \sigma_X \sigma_Y$.
Proof. Take any two non-constant random variables $X$ and $Y$ (we will replace these later with $X - \textup{E}(X)$ and $Y - \textup{E}(Y)$). Construct the random variable $(tX + Y)^2$, where $t$ is a real variable, and inspect its expected value. Because the variable is squared, its values are all nonnegative, and hence its expected value is nonnegative. Expanding and using linearity gives

$0 \leq \textup{E}((tX + Y)^2) = t^2 \textup{E}(X^2) + 2t\textup{E}(XY) + \textup{E}(Y^2).$

This is a quadratic function of a single variable $t$ which is everywhere nonnegative, so from elementary algebra its discriminant must be at most zero. That is,

$4\textup{E}(XY)^2 - 4\textup{E}(X^2)\textup{E}(Y^2) \leq 0,$

and so dividing by 4 and replacing $X, Y$ with $X - \textup{E}(X), Y - \textup{E}(Y)$, resp., gives

$\textup{Cov}(X, Y)^2 \leq \sigma_X^2 \sigma_Y^2,$
and the result follows.
Note that equality holds in the discriminant formula precisely when the quadratic has a real root, i.e. when $Y = -tX$ for some fixed $t$; after the replacement this says $Y - \textup{E}(Y) = -t(X - \textup{E}(X))$. In other words, equality holds precisely when $Y$ is a linear function of $X$.
This has important consequences even in English: the covariance is maximized in magnitude when $Y$ is a linear function of $X$, and in general it is bounded above and below by $\pm\sigma_X \sigma_Y$. Dividing the covariance by $\sigma_X \sigma_Y$ therefore gives a normalized quantity, which gets its own definition:
Definition: The Pearson correlation coefficient of two random variables $X, Y$ is defined by $r = \frac{\textup{Cov}(X, Y)}{\sigma_X \sigma_Y}$.
If $r$ is close to 1, we call $X$ and $Y$ positively correlated. If $r$ is close to $-1$, we call them negatively correlated, and if $r$ is close to zero, we call them uncorrelated.
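A concrete sketch of ours, computed exactly on the two-dice space: take $X$ to be the first die and $S$ the sum of both. Since $S = X + Y$ with $Y$ independent of $X$, one expects $\textup{Cov}(X, S) = \textup{Var}(X)$ and $r = 1/\sqrt{2} \approx 0.707$.

```python
# Covariance and Pearson correlation of X = first die and S = sum.
from fractions import Fraction
from itertools import product
from math import sqrt

space = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(X):
    return sum(X(w) * p for w, p in space.items())

def Cov(X, Y):
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

X = lambda w: w[0]
S = lambda w: w[0] + w[1]
cov, vx, vs = Cov(X, S), Cov(X, X), Cov(S, S)
print(cov, vx)                      # both 35/12, i.e. Cov(X, S) = Var(X)
print(float(cov) / sqrt(vx * vs))   # about 0.707, i.e. 1/sqrt(2)
```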
The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

[Figure: a scatter plot of paired samples from two positively correlated random variables.]
The horizontal axis plots a sample of values of the random variable $X$, and the vertical axis plots the corresponding sample of values of $Y$. The linear correspondence is clearly visible. Of course, all of this must be taken with a grain of salt: the correlation coefficient is only appropriate for analyzing random variables with a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails to detect them.
Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:
[Figure: various examples of the Pearson correlation coefficient, credit Wikipedia.]
Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer the correlation coefficient is to 1, the better a linear predictor will perform in "guessing" values of $Y$ given values of $X$ (and vice versa).
But this strays a bit far from our original point: we still want to find a formula for $\textup{Var}(X + Y)$. Expanding the definition, it is not hard to see that this amounts to the following proposition:
Proposition: The variance operator satisfies $\textup{Var}(X + Y) = \textup{Var}(X) + \textup{Var}(Y) + 2\textup{Cov}(X, Y)$.
And using induction we get a general formula:

$\textup{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \textup{Var}(X_i) + 2\sum_{i < j} \textup{Cov}(X_i, X_j).$

Note that in the general sum, we get a bunch of terms $\textup{Cov}(X_i, X_j)$, one for each pair $i \neq j$.
Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.
Definition: The covariance matrix of a collection of random variables $X_1, \dots, X_n$ is the $n \times n$ matrix whose $(i, j)$ entry is $\textup{Cov}(X_i, X_j)$.
As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and positive semidefinite, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables $Y_1, \dots, Y_n$, each a linear combination of the original $X_i$, such that $\textup{Cov}(Y_i, Y_j) = 0$ whenever $i \neq j$. In one sense, this is the "best perspective" with which to analyze the random variables.
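Here is a miniature sketch of that idea (assuming numpy is available; the synthetic samples and names are ours, and this is PCA in miniature rather than a faithful reproduction of the eigenfaces post):

```python
# Decorrelate two variables via the eigenvectors of their covariance
# matrix. Rows of `samples` are draws of (X1, X2).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(1000)
x2 = 0.6 * x1 + 0.8 * rng.standard_normal(1000)   # correlated with x1
samples = np.stack([x1, x2])

C = np.cov(samples)                  # the (sample) covariance matrix
eigenvalues, Q = np.linalg.eigh(C)   # orthonormal eigenvectors: C is symmetric
Y = Q.T @ (samples - samples.mean(axis=1, keepdims=True))
print(np.cov(Y).round(10))           # diagonal: the new variables are uncorrelated
```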
Next Up
So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.
Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.
Until then!