One of the main areas of difficulty in elementary probability, and one that requires the highest levels of scrutiny and rigor, is conditional probability. The ideas are simple enough: we assign probabilities relative to the occurrence of some event. But shrewd applications of conditional probability (and in particular, efficient ways to compute conditional probability) are key to successful applications of this subject. This is the basis for Nate Silver’s success, the logical flaws of many a political pundit, and the ability of a robot to tell where it is in an environment. As this author usually touts, the best way to avoid the pitfalls of such confusing subjects is to be mathematically rigorous. In doing so we will develop the intuition for conditional probability that experts show off as if it were trivial.
But before we can get to all of that, we will cover a few extra ideas from finite probability theory that were left out of the last post.
Our entire discussion will revolve around a finite probability space, as defined last time. Let’s briefly (and densely) recall some of the notation presented there. We will always denote our probability space by $\Omega$, with its probability mass function $f : \Omega \to [0,1]$, and the probability of an event $E \subset \Omega$ is $\textup{P}(E) = \sum_{\omega \in E} f(\omega)$.
Partitions and Total Probability
A lot of reasoning in probability theory involves decomposing a complicated event into simpler events, or decomposing complicated random variables into simpler ones. Conditional probability is one way to do that, and while it has very nice philosophical interpretations, it fits into this more general scheme of “decomposing” events and variables into components.
The usual way to break up a set into pieces is via a partition. Recall the following set-theoretic definition.
Definition: A partition of a set $X$ is a collection of non-empty subsets $X_1, \dots, X_n$ such that the $X_i$ are pairwise disjoint ($X_i \cap X_j = \emptyset$ whenever $i \neq j$) and their union is all of $X$.
Here are a few examples. We can partition the natural numbers $\mathbb{N}$ into the even numbers and the odd numbers.
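For the programmers following along, here is a quick sketch in Python (the function and names are invented for this post) that checks the defining properties of a partition on the even/odd example, restricted to a finite set:

```python
def is_partition(pieces, whole):
    """Check that `pieces` are nonempty, pairwise disjoint, and cover `whole`."""
    if any(len(piece) == 0 for piece in pieces):
        return False
    union = set()
    for piece in pieces:
        if union & piece:  # this piece overlaps an earlier one
            return False
        union |= piece
    return union == whole

numbers = set(range(10))
evens = {k for k in numbers if k % 2 == 0}
odds = {k for k in numbers if k % 2 == 1}
print(is_partition([evens, odds], numbers))  # True
```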
You should think of a partition as a way to “cut up” a set into pieces. This colorful diagram is an example of a partition of a disc.
In fact, any time we have a canonical way to associate two things in a set, we can create a partition by putting all mutually associated things in the same piece of the partition. The rigorous name for this is an equivalence relation, but we won’t need that for the present discussion (partitions are the same thing as equivalence relations, just viewed in a different way).
Of course, the point is to apply this idea to probability spaces. Points (elements) in our probability space $\Omega$ are outcomes, and events are subsets of $\Omega$, so a partition of $\Omega$ into events $E_1, \dots, E_n$ decomposes the experiment into mutually exclusive alternatives: exactly one of the $E_i$ occurs.
Indeed, the definition of $\textup{P}$ gives

$\displaystyle \sum_{i=1}^{n} \textup{P}(E_i) = \textup{P}\left( \bigcup_{i=1}^{n} E_i \right) = \textup{P}(\Omega),$

which by our axioms for a probability space is just one. We will give this observation the (non-standard) name the Lemma of Total Probability.
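To make this concrete, here is a minimal sketch in Python, with a made-up probability mass function, verifying the lemma on a small space:

```python
from fractions import Fraction

# A small probability space: outcomes mapped to (made-up) probability masses.
f = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}

def prob(event):
    """P(E) is the sum of the masses of the outcomes in E."""
    return sum(f[outcome] for outcome in event)

# A partition of the sample space into disjoint events covering everything.
partition = [{"a"}, {"b", "c"}, {"d"}]
print(sum(prob(E) for E in partition))  # 1, as the lemma promises
```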
This was a nice warmup proof, but we can beef it up to make it more useful. If we have some other event $A$, then the partition $E_1, \dots, E_n$ of $\Omega$ chops $A$ into the pieces $A \cap E_1, \dots, A \cap E_n$, and these pieces form a partition of $A$.
Theorem: Let $E_1, \dots, E_n$ be a partition of $\Omega$, and let $A$ be any event. Then

$\displaystyle \textup{P}(A) = \sum_{i=1}^{n} \textup{P}(A \cap E_i).$
Proof. The proof is only marginally more complicated than that of the lemma of total probability. The probability of the event $A$ is the sum of the masses of its outcomes, and every outcome of $A$ lies in exactly one of the pieces $A \cap E_i$. That is,

$\displaystyle \textup{P}(A) = \sum_{\omega \in A} f(\omega) = \sum_{i=1}^{n} \sum_{\omega \in A \cap E_i} f(\omega) = \sum_{i=1}^{n} \textup{P}(A \cap E_i). \ \square$

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E's
A more useful way of thinking of this is that we can use the $E_i$ to compute the probability of $A$ in pieces, by asking how likely $A$ is relative to each $E_i$. This is precisely the notion of conditional probability.
The idea is to think of the event $E$ as a new, smaller sample space, and to ask how likely $A$ is relative to that smaller space. To compute the conditional probability, simply scale the probability of $A \cap E$ by the probability of the new sample space:

$\displaystyle \textup{P}(A | E) = \frac{\textup{P}(A \cap E)}{\textup{P}(E)}.$
Wikipedia provides a straightforward derivation of the formula, but the spirit of the proof is exactly what we said above. The denominator is our new sample space, and the numerator is the probability of outcomes that cause $A$ to occur within that smaller space.
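Here is the definition in code, using a fair die as the sample space (an example of our own choosing, not anything canonical):

```python
from fractions import Fraction

die = set(range(1, 7))  # a fair six-sided die; every outcome has mass 1/6

def prob(event):
    return Fraction(len(event), len(die))

def cond_prob(A, E):
    """P(A | E) = P(A ∩ E) / P(E), assuming P(E) > 0."""
    return prob(A & E) / prob(E)

evens = {2, 4, 6}
high = {5, 6}
# Within the new sample space {2, 4, 6}, only the outcome 6 lands us in `high`.
print(cond_prob(high, evens))  # 1/3
```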
Combining this definition with the theorem above: if we know how to compute the probabilities of the $E_i$ and the probabilities of $A$ conditioned on each $E_i$, then we can compute $\textup{P}(A)$. Since $\textup{P}(A \cap E_i) = \textup{P}(A | E_i) \textup{P}(E_i)$,

$\displaystyle \textup{P}(A) = \sum_{i=1}^{n} \textup{P}(A | E_i) \textup{P}(E_i),$

which is the form of the theorem of total probability we will use most often.
We can come up with loads of more or less trivial examples of the theorem of total probability on simple probability spaces. Say you play a craps-like game where you roll a die twice. If you get a one on the first roll, you lose, and otherwise you have to match your initial roll on the second to win. The probability you win can be analyzed with the theorem of total probability. We partition the sample space into events corresponding to the outcome of the first roll.
The probability the first roll is $i$ is $1/6$ for each $i$. Given a first roll of $1$ the probability of winning is zero, and given a first roll of $i \neq 1$ the probability of matching it on the second roll is $1/6$. By the theorem of total probability,

$\displaystyle \textup{P}(\text{win}) = \sum_{i=2}^{6} \frac{1}{6} \cdot \frac{1}{6} = \frac{5}{36}.$
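We can double-check this in code, once by brute-force enumeration of all 36 outcomes and once via the theorem of total probability. This is just a sketch for this post, not part of any library:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two rolls of a fair die.
outcomes = list(product(range(1, 7), repeat=2))

# Direct enumeration: win if the first roll isn't 1 and the second matches it.
wins = [(a, b) for a, b in outcomes if a != 1 and a == b]
direct = Fraction(len(wins), len(outcomes))

# Theorem of total probability, conditioning on the first roll:
# P(win) = Σ_i P(win | first = i) · P(first = i).
p_win_given_first = lambda i: Fraction(0) if i == 1 else Fraction(1, 6)
by_conditioning = sum(p_win_given_first(i) * Fraction(1, 6) for i in range(1, 7))

print(direct, by_conditioning)  # both 5/36
```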
For the working mathematician, these kinds of examples are relatively low-tech, but they illustrate the main way conditional probability is used in practice. We have some process we want to analyze, and we break it up into steps and condition on the results of a given step. We will see in a moment a more complicated example of this.
Partitions via Random Variables
The most common kind of partition is created via a random variable with finitely many values (or countably many, but we haven’t breached infinite probability spaces yet). In this case, we can partition the sample space $\Omega$ into the events on which $X$ takes each of its values: for every $x$ in the image of $X$, the event denoted $X = x$ is the set $\{ \omega \in \Omega : X(\omega) = x \}$, and these events are disjoint and cover $\Omega$.
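In code, this partition is just grouping outcomes by the value of the random variable. A small sketch, again using two dice as our own running example:

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair die rolls
X = lambda w: w[0] + w[1]                     # the random variable: sum of the rolls

# The events {X = x} for each value x in the image of X.
events = {}
for w in omega:
    events.setdefault(X(w), set()).add(w)

# The events are disjoint by construction and cover omega:
print(sum(len(E) for E in events.values()) == len(omega))  # True
print(sorted(events))  # the image of X: [2, 3, ..., 12]
```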
And as the reader is probably expecting, we can use this to define a “relative” expected value of a random variable. Recall that if the image of $X$ is the finite set $\{x_1, \dots, x_k\}$, then the expected value of $X$ is

$\displaystyle \textup{E}(X) = \sum_{i=1}^{k} x_i \textup{P}(X = x_i).$
Suppose $X$ and $Y$ are random variables on our probability space, and let $y$ be a value of $Y$ with $\textup{P}(Y = y) > 0$. The conditional probabilities $\textup{P}(X = x_i | Y = y)$ make the event $Y = y$ into a probability space of its own.
And the conditional expectation of $X$ given the event $Y = y$ is

$\displaystyle \textup{E}(X | Y = y) = \sum_{i=1}^{k} x_i \textup{P}(X = x_i | Y = y).$
Indeed, just as we implicitly “defined” a new sample space when we were partitioning based on events, here we are defining a new random variable (with the odd notation $\textup{E}(X | Y)$), whose value on any outcome in the event $Y = y$ is the number $\textup{E}(X | Y = y)$.
Of course there is an analogue to the theorem of total probability lurking here. We want to say something like the true expected value of $X$ is the sum of the conditional expectations over all values of $Y$. But the events $Y = y$ can occur with different probabilities, so they should not all count equally in such a sum.
And so instead of simply summing the expectation, we need to take an expectation over the values of $Y$, weighting each conditional expectation $\textup{E}(X | Y = y)$ by $\textup{P}(Y = y)$.
Theorem: The expected value of $X$ satisfies

$\displaystyle \textup{E}(X) = \textup{E}(\textup{E}(X | Y)) = \sum_{y} \textup{E}(X | Y = y) \, \textup{P}(Y = y),$

where the sum ranges over the values $y$ in the image of $Y$.
Proof. Expand the definitions of what these values mean, and use the definition of conditional probability:

$\displaystyle \sum_{y} \textup{E}(X | Y = y) \, \textup{P}(Y = y) = \sum_{y} \sum_{i=1}^{k} x_i \, \textup{P}(X = x_i | Y = y) \, \textup{P}(Y = y) = \sum_{i=1}^{k} x_i \sum_{y} \textup{P}((X = x_i) \cap (Y = y)) = \sum_{i=1}^{k} x_i \, \textup{P}(X = x_i) = \textup{E}(X),$

where the second-to-last equality is the theorem of total probability applied to the event $X = x_i$ and the partition of $\Omega$ by the values of $Y$. $\square$
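Here is a quick numerical sanity check of the theorem, taking $X$ to be the sum of two fair dice and $Y$ the first roll (an example of our own choosing):

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of fair die rolls, each outcome with mass 1/36.
omega = list(product(range(1, 7), repeat=2))
X = lambda w: w[0] + w[1]  # the sum of the two rolls
Y = lambda w: w[0]         # the first roll

def cond_expectation(y):
    """E(X | Y = y): average X over the smaller sample space where Y = y."""
    event = [w for w in omega if Y(w) == y]
    return Fraction(sum(X(w) for w in event), len(event))

def prob_Y(y):
    return Fraction(sum(1 for w in omega if Y(w) == y), len(omega))

expected_X = Fraction(sum(X(w) for w in omega), len(omega))
tower = sum(cond_expectation(y) * prob_Y(y) for y in range(1, 7))
print(expected_X, tower)  # both 7, as the theorem promises
```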
Let’s wrap up this post with a non-trivial example of all of this theory in action.
A Nontrivial Example: the Galton-Watson Branching Process
We are interested (as was the eponymous Sir Francis Galton in the 1800s) in the survival of surnames through generations of marriage and children. The main tool to study such a generational phenomenon is the Galton-Watson branching process. The idea is quite simple, but its analysis quickly blossoms into a rich and detailed theoretical puzzle and a more general practical tool. Before we get too deep into things, we should note that these ideas (along with other types of branching processes) are used to analyze a whole host of problems in probability theory and computer science. A few that the author has recently been working with are the evolution of random graphs and graph property testing.
The gist is as follows: say we live in a patriarchal society in which surnames are passed down on the male side. We can imagine a family tree being grown step by step in this way: at the root there is a single male, and he has $n$ children. Each child is either a boy or a girl; the girls marry off and their children take other surnames, while each boy grows the tree further by having $n$ children of his own in the same fashion.
To make this rigorous, let us define an infinite sequence of random variables $X_1, X_2, X_3, \dots$, where $X_i$ counts the number of boys among the $n$ children of the $i$-th male added to the tree (in the order they are added). The $X_i$ are mutually independent.
We further imagine the tree growing step by step: at step $i$ the $i$-th male has his $n$ children, and the $X_i$ boys among them are added to the tree to have children of their own at later steps. If at some step no males remain who have yet to have children, the tree is finite and the surname dies off.
At last, we assume that the probability of having a boy or a girl is an even $1/2$ split, independently for each child. Now we can start asking questions. What is the probability that the surname dies off? What is the expected size of the tree in terms of $n$?
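Before any analysis, we can get an empirical feel for the first question by simulation. The sketch below is our own; in particular, the generation cap and population cutoff are made-up stand-ins for deciding that a tree is “infinite”:

```python
import random

def survives(n, max_generations=50, cap=1_000):
    """Simulate one tree: every male has n children, each independently a boy
    with probability 1/2.  We treat a tree that still has males after
    max_generations generations (or that exceeds `cap` males) as infinite."""
    males = 1
    for _ in range(max_generations):
        if males == 0:
            return False  # the surname has died off
        males = sum(1 for _ in range(males * n) if random.random() < 0.5)
        if males > cap:   # the population is exploding; call it infinite
            return True
    return males > 0

def extinction_estimate(n, trials=2_000):
    return sum(not survives(n) for _ in range(trials)) / trials

print(extinction_estimate(3))  # roughly 0.24 when n = 3
```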
For the first question we use the theorem of total probability. In particular, suppose the first person has two boys. Then the whole tree is finite precisely when both boys’ sub-trees are finite. Indeed, the two boys’ sub-trees are independent of one another, and so the probability of both being finite is the product of the probabilities of each being finite. That is, more generally, the probability the tree is finite given that the root has $k$ boys is $p^k$, where $p$ is the (unconditional) probability that such a tree is finite.
Setting up the theorem of total probability with the partition by the number of boys $k = 0, 1, \dots, n$, we get

$\displaystyle p = \sum_{k=0}^{n} \textup{P}(\text{finite} \mid k \text{ boys}) \, \textup{P}(k \text{ boys}) = \sum_{k=0}^{n} p^k \, \textup{P}(k \text{ boys}).$
The probability of getting $k$ boys among the $n$ children is $\binom{n}{k} \frac{1}{2^n}$, since each child is independently a boy with probability $1/2$.
So the equation is

$\displaystyle p = \sum_{k=0}^{n} \binom{n}{k} \frac{1}{2^n} p^k = \frac{(1 + p)^n}{2^n},$

where the last equality is the binomial theorem.
From here, we’ve reduced the problem down to picking the correct root of a polynomial. For example, when $n = 2$ the equation becomes $p = (1+p)^2/4$, which rearranges to $(p - 1)^2 = 0$, so the only root is $p = 1$: with two children per male, the surname dies off with probability one.
We have to be a bit careful here, though. Not all solutions to this equation are valid answers. For instance, the roots must be between 0 and 1 (inclusive), and if there are multiple such roots then one must rule out the irrelevant ones by some additional argument (the probability of dying off turns out to be the smallest root in $[0,1]$). Moreover, we would need to use a calculus argument to prove there is always a solution between 0 and 1 in the first place. But after all that is done, we can estimate the correct root computationally (or solve for it exactly when our polynomials have small degree). Here for $n = 3$ the equation is $p = (1+p)^3/8$, and the relevant root is $\sqrt{5} - 2 \approx 0.236$.
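Computationally, a convenient way to find this root is fixed-point iteration: starting from $p = 0$ and repeatedly applying $p \mapsto ((1+p)/2)^n$, the iterates increase to the smallest root in $[0,1]$, which is exactly the probability we want. A minimal sketch:

```python
def extinction_probability(n, iterations=500):
    """Iterate p <- ((1 + p) / 2)^n starting from p = 0; the iterates increase
    to the smallest fixed point in [0, 1], the extinction probability."""
    p = 0.0
    for _ in range(iterations):
        p = ((1 + p) / 2) ** n
    return p

for n in (2, 3, 4):
    print(n, extinction_probability(n))
# n = 2 creeps toward 1 (extinction is certain, but the convergence is slow),
# while n = 3 settles quickly near 0.2360679..., which is sqrt(5) - 2.
```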
We leave the second question, on the expected size of the tree, for the reader to ponder. Next time we’ll devote an entire post to Bayes’ Theorem (a trivial consequence of the definition of conditional probability), and see how it helps us compute probabilities for use in programs.
Until then!