Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability p. If X denotes the number of trials needed to get the first success, then clearly \Pr(X = k) = (1-p)^{k-1} p (since first we need k-1 failures which occur independently with probability 1-p, then we need one success which happens with probability p). Thus the expected value of X is

\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p

by basic calculus. In particular, if success is defined as getting a six, then p=1/6 thus the expected time is 1/p=6.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is \frac{1}{26}, thus the probability of a random eleven-letter word being ABRACADABRA is exactly \left(\frac{1}{26}\right)^{11}. So if typing 11 letters is one trial, the expected number of trials is

\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}

which means 11\cdot 26^{11} keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that 11\cdot 26^{11} is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly \frac{1}{26} (thus the expected value of the change in the gambler’s fortune is \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as T. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly T dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins \$26^4. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be \$26^{11}. Thus our casino will have to give out 26^{11}+26^4+26 dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is T dollars, the expected value of our expenses is 26^{11}+26^4+26 dollars, thus E(T)=26^{11}+26^4+26. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. X_0 will denote the gambler’s fortune before the game starts, X_1 the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of X_n, given X_1, X_2, \ldots, X_{n-1} is the same as X_{n-1}. This can be written as E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1} or, equivalently, E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let X'_n = X_n if n < T and X'_n = X_T if n \ge T where T denotes the stopping time, i.e. the time at which the desired event occurs. Notice that T itself is a random variable.

We require our stopping time T to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let X_n be a martingale stopped at step T, and suppose one of the following three conditions hold:

  1. The stopping time T is almost surely bounded by some constant;
  2. The stopping time T is almost surely finite and every step of the stopped martingale X_n is almost surely bounded by some constant; or
  3. The expected stopping time E(T) is finite and the absolute value of the martingale increments |X_n-X_{n-1}| are almost surely bounded by a constant.

Then E(X_T) = E(X_0).

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than 26^{12}) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by 26^{11}+26^4+26. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has a dollars, the second player has b dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let X_n denote the change in the second player’s fortune, and set X_0 = 0. Let T_k denote the first time s when X_s = k. Then our first question can be formalized as trying to determine \Pr(T_{-b} < T_a). Let t = \min \{ T_{-b}, T_a\}. Clearly t is a stopping time. By the optional stopping theorem we have that

\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))

thus \Pr(T_{-b} < T_a)=\frac{a}{a+b}.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: X_n^2-n is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with k-SAT for some k \ne 3 instead? Clearly the hardness of the problem is monotone increasing in k since k-SAT is a special case of (k+1)-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as k-SAT for any k>3. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals x and y are connected by a directed edge if there is a clause (\bar x \lor y). Recall that \bar x \lor y is equivalent to x \implies y, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths x \to \bar x and \bar x \to x (since x \Leftrightarrow \bar x is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after O(n^2) rounds where n denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in O(n^2) steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by \pm 1 in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least \frac{1}{2}. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond n. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number n, on the other hand, is called a reflecting barrier: we cannot reach n+1, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in O(n^2) steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

How to Conquer Tensorphobia

A professor at Stanford once said,

If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the \otimes symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products. It’s often for good reason. People who really understand tensors feel obligated to explain it using abstract language (specifically, universal properties). And the people who explain it in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. Tensors are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a tensor product. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word product is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual direct product of vector spaces. In fact they are related (in very precise sense), but they are far from similar. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces V, W of dimensions n, m. Recall that the direct product  V \times W is the vector space of pairs (v,w) where v comes from V and w from W. Recall that addition in this vector space is defined componentwise ((v_1,w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2)) and scalar multiplication scales both components \lambda (v,w) = (\lambda v, \lambda w).

To get the tensor product space V \otimes W, we make the following modifications. First, we redefine what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is declared to be the same as scalar multiplication of any component you want. In symbols,

\displaystyle \lambda (v, w) = (\lambda v, w) = (v, \lambda w)

for all choices of scalars \lambda and vectors v, w. Second, we change the addition operation so that it only works if one of the two components are the same. In symbols, we declare that

(v, w) + (v', w) = (v + v', w)

only works because w is the same in both pieces, and with the same rule applying if we switch the positions of v,w above. All other additions are simply declared to be new vectors. I.e. (x,y) + (z,w) is simply itself. It’s a valid addition — we need to be able to add stuff to be a vector space — but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that y=w or x=z. To say it still one more time, a general element of the tensor V \otimes W is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we rename the pair (v,w) to v \otimes w, to distinguish it from the old vector space V \times W that we’ve totally butchered and reanimated, and we call the tensor product space as a whole V \otimes W. Those familiar with this kind of abstract algebra will recognize quotient spaces at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of \mathbb{R}. This means that our space \mathbb{R} \otimes \mathbb{R} is comprised of vectors like 3 \otimes 5, and moreover that the following operations are completely legitimate.

3 \otimes 5 + 1 \otimes (-5) = 3 \otimes 5 + (-1) \otimes 5 = 2 \otimes 5

6 \otimes 1 + 3\pi \otimes \pi = 3 \otimes 2 + 3 \otimes \pi^2 = 3 \otimes (2 + \pi^2)

Cool. This seemingly innocuous change clearly has huge implications on the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just declaring that all the axioms of a vector space hold when we want them to. So if you were wondering whether

\lambda (a \otimes b + c \otimes d) = \lambda(a \otimes b) + \lambda(c \otimes d)

The answer is yes, by force of will.

So just to recall, the axioms of a tensor space V \otimes W are

  1. The “basic” vectors are v \otimes w for v \in V, w \in W, and they’re used to build up all other vectors.
  2. Addition is symbolic, unless one of the components is the same in both addends, in which case (v_1, w) + (v_2, w) = (v_1+ v_2, w) and (v, w_1) + (v,w_2) = (v, w_1 + w_2).
  3. You can freely move scalar multiples around the components of v \otimes w.
  4. The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.

Naturally, one can extend this definition to n-fold tensor products, like V_1 \otimes V_2 \otimes \dots \otimes V_d. Here we write the vectors as sums of things like v_1 \otimes \dots \otimes v_d, and we enforce that addition can only be combined if all but one coordinates are the same in the addends, and scalar multiples move around to all coordinates equally freely.

So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the maps between objects are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a linear map between vector spaces. That is, a map f: V \to W that factors through addition and scalar multiplication (i.e. f(v + v') = f(v) + f(v') and f(\lambda v) = \lambda f(v)).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it m, then m takes in two matrices and spits out their product

m(A,B) = AB

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix B then

m(A+C, B) = (A+C)B = AB + CB = m(A,B) + m(C,B)

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates simultaneously. In other words,

m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD \neq AC + BD = m(A,C) + m(B,D)

In fact, if the only operation satisfying linearity in its two coordinates separately and also this kind of linearity is the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that m is linear in each coordinate when all else is fixed. Note that this property allows us to shift around scalar multiples, too. For example,

\displaystyle m(\lambda A, B) = \lambda AB = A (\lambda B) = m(A, \lambda B) = \lambda m(A,B)

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called bilinearity when we only have two coordinates, and multilinearity when we have more.

Here are some examples of nice multilinear maps that show up everywhere:

  • If V is an inner product space over \mathbb{R}, then the inner product is bilinear.
  • The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
  • The cross product of vectors in \mathbb{R}^3 is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.

And so what tensors do is give a sort of classification of multilinear maps. The idea is that every multilinear map f from a product vector space U_1 \times \dots \times U_d to any vector space Y can be written first as a multilinear map to the tensor space

\displaystyle \alpha : U_1 \times \dots \times U_d \to U_1 \otimes \dots \otimes U_d

Followed by a linear map to Y,

\displaystyle \hat{f} : U_1 \otimes \dots \otimes U_d \to Y

And the important part is that \alpha doesn’t depend on the original f (but \hat{f} does). One usually draws this as a single diagram:


And to say this diagram commutes is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. f = \hat{f} \alpha).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and \alpha is the gate. Yet another way to say this is that \alpha is the most general possible multilinear map that can be constructed from U_1 \times \dots \times U_d. Moreover, the tensor product itself is uniquely defined by having a “most-general” \alpha (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of \alpha (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix V, W as vector spaces and look at the diagram


What is \alpha in this case? Well it just sends (v,w) \mapsto v \otimes w. Is this map multilinear? Well if we fix w then

\displaystyle \alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha (v_2, w)


\displaystyle \alpha(\lambda v, w) = (\lambda v) \otimes w = (\lambda) (v \otimes w) = \lambda \alpha(v,w)

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we designed tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all maps f : V \times W \to Y can be decomposed into an \alpha part and a \hat{f} part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of V \times W is, it’s just the union of the bases of the pieces. Say that V has a basis v_1, \dots, v_n and W has w_1, \dots, w_m, then a basis for the product is just ((v_1, 0), \dots, (v_n,0), (0,w_1), \dots, (0,w_m)).

But multilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on all possible pairs of basis elements. For how else could we determine f(v_1 + v_2, w_1)? If there are n of the v_i‘s and m of the w_i‘s, then there are nm such pairs f(v_i, w_j).

Uncoincidentally, as V \otimes W is a vector space, its basis can also be constructed in terms of the bases of V and W. You simply take all possible tensors v_i \otimes w_j. Since every v \in V, w \in W can be written in terms of their bases, it’s clear than any tensor \sum_{k} a_k \otimes b_k can also be written in terms of the basis tensors v_i \otimes w_j (by simply expanding each a_k, b_k in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if (e_1, e_2, e_3) is a basis for \mathbb{R}^3, and (g_1, g_2) a basis for \mathbb{R}^2, then the tensor space \mathbb{R}^3 \otimes \mathbb{R}^2 has basis

(e_1 \otimes g_1, e_1 \otimes g_2, e_2 \otimes g_1, e_2 \otimes g_2, e_3 \otimes g_1, e_3 \otimes g_2)

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that \mathbb{R}^3 \otimes \mathbb{R}^2 \cong \mathbb{R}^6.

So fine, back to decomposing f. All we have left to do is use the data given by f (the effect on pairs of basis elements) to define \hat{f} : V \otimes W \to Y. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space (v_i \otimes w_j) and the definition of f(v_i, w_j) are essentially the same.

That is, just take \hat{f}(v_i \otimes w_j) = f(v_i, w_j). Note that this is just defined on the basis elements, and so we extend to all other vectors in the tensor space by imposing linearity (defining \hat{f} to split across sums of tensors as needed). Is this well defined? Well, multilinearity of f forces it to be so. For if we had two equal tensors, say, \lambda v \otimes w = v \otimes \lambda w, then we know that f has to respect their equality, because f(\lambda v_i, w_j) = f(v_i, \lambda w_j), so \hat{f} will take the same value on equal tensors regardless of which representative we pick (where we decide to put the \lambda). The same idea works for sums, so everything checks out, and f(v,w) is equal to \hat{f} \alpha, as desired. Moreover, we didn’t make any choices in constructing \hat{f}. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decisions in defining V \otimes W). Since the construction would be isomorphic if we changed the basis, our choice of \hat{f} is unique.

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear why we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

Computability and Stuff

It should be clear by now that plain product spaces V \times W and tensor product spaces V \otimes W are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in V and W. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time so they’re not crazy impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space U_1 \otimes \dots \otimes U_d is as a sum of the basic-looking tensors.

\sum_k \displaystyle a_{1,k} \otimes a_{2,k} \otimes \dots \otimes a_{d,k}

where the a_{i,d} are linear combinations of basis vectors in the U_i. But as we saw with our examples over \mathbb{R}, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is just a tensor a_1 \otimes \dots \otimes a_d. If you can do this we call the tensor a pure tensor, or a rank 1 tensor. We then have the following natural definition and problem:

Definition: The rank of a tensor x \in U_1 \otimes \dots \otimes U_d is the minimum number of terms in any representation of x as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

Problem: Given a tensor x \in k^{n_1} \otimes k^{n_2} \otimes k^{n_3} where k is a field, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict k to be either the rational numbers \mathbb{Q} or a finite field \mathbb{F}_{q}.

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

Theorem: Computing tensor rank is NP-hard when k = \mathbb{Q} and NP-complete when k is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!

Homology Theory — A Primer

This series on topology has been long and hard, but we’re are quickly approaching the topics where we can actually write programs. For this and the next post on homology, the most important background we will need is a solid foundation in linear algebra, specifically in row-reducing matrices (and the interpretation of row-reduction as a change of basis of a linear operator).

Last time we engaged in a whirlwind tour of the fundamental group and homotopy theory. And we mean “whirlwind” as it sounds; it was all over the place in terms of organization. The most important fact that one should take away from that discussion is the idea that we can compute, algebraically, some qualitative features about a topological space related to “n-dimensional holes.” For one-dimensional things, a hole would look like a circle, and for two dimensional things, it would look like a hollow sphere, etc. More importantly, we saw that this algebraic data, which we called the fundamental group, is a topological invariant. That is, if two topological spaces have different fundamental groups, then they are “fundamentally” different under the topological lens (they are not homeomorphic, and not even homotopy equivalent).

Unfortunately the main difficulty of homotopy theory (and part of what makes it so interesting) is that these “holes” interact with each other in elusive and convoluted ways, and the algebra reflects it almost too well. Part of the problem with the fundamental group is that it deftly eludes our domain of interest: computing them is complicated!

What we really need is a coarser invariant. If we can find a “stupider” invariant, it might just be simple enough to compute efficiently. Perhaps unsurprisingly, these will take the form of finitely-generated abelian groups (the most well-understood class of groups), with one for each dimension. Now we’re starting to see exactly why algebraic topology is so difficult; it has an immense list of prerequisite topics! If we’re willing to skip over some of the more nitty gritty details (and we must lest we take a huge diversion to discuss Tor and the exact sequences in the universal coefficient theorem), then we can also do the same calculations over a field. In other words, the algebraic objects we’ll define called “homology groups” are really vector spaces, and so row-reduction will be our basic computational tool to analyze them.

Once we have the basic theory down, we’ll see how we can write a program which accepts as input any topological space (represented in a particular form) and produces as output a list of the homology groups in every dimension. The dimensions of these vector spaces (their ranks, as finitely-generated abelian groups) are interpreted as the number of holes in the space for each dimension.

Recall Simplicial Complexes

In our post on constructing topological spaces, we defined the standard k-simplex and the simplicial complex. We recall the latter definition here, and expand upon it.

Definition: A simplicial complex is a topological space realized as a union of any collection of simplices (of possibly varying dimension) \Sigma which has the following two properties:

  • Any face of a simplex \Sigma is also in \Sigma.
  • The intersection of any two simplices of \Sigma is also a simplex of \Sigma.

We can realize a simplicial complex by gluing together pieces of increasing dimension. First start by taking a collection of vertices (0-simplices) X_0. Then take a collection of intervals (1-simplices) X_1 and glue their endpoints onto the vertices in any way. Note that because we require every face of an interval to again be a simplex in our complex, we must glue each endpoint of an interval onto a vertex in X_0. Continue this process with X_2, a set of 2-simplices, we must glue each edge precisely along an edge of X_1. We can continue this process until we reach a terminating set X_n. It is easy to see that the union of the X_i form a simplicial complex. Define the dimension of the cell complex to be n.

There are some picky restrictions on how we glue things that we should mention. For instance, we could not contract all edges of a 2-simplex \sigma and glue it all to a single vertex in X_0. The reason for this is that \sigma would no longer be a 2-simplex! Indeed, we’ve destroyed its original vertex set. The gluing process hence needs to preserve the original simplex’s boundary. Moreover, one property that follows from the two conditions above is that any simplex in the complex is uniquely determined by its vertices (for otherwise, the intersection of two such non-uniquely specified simplices would not be a single simplex).

We also have to remember that we’re imposing a specific ordering on the vertices of a simplex. In particular, if we label the vertices of an n-simplex 0, \dots, n, then this imposes an orientation on the edges where an edge of the form \left \{ i,j \right \} has the orientation (i,j) if i < j, and (j,i) otherwise. The faces, then, are “oriented” in increasing order of their three vertices. Higher dimensional simplices are oriented in a similar way, though we rarely try to picture this (the theory of orientations is a question best posted for smooth manifolds; we won’t be going there any time soon). Here are, for example, two different ways to pick orientations of a 2-simplex:

Two possible orientations of a 2-simplex

Two possible orientations of a 2-simplex.

It is true, but a somewhat lengthy exercise, that the topology of a simplicial complex does not change under a consistent shuffling of the orientations across all its simplices. Nor does it change depending on how we realize a space as a simplicial complex. These kinds of results are crucial to the welfare of the theory, but have been proved once and we won’t bother reproving them here.

As a larger example, here is a simplicial complex representing the torus. It’s quite a bit more complicated than our usual quotient of a square, but it’s based on the same idea. The left and right edges are glued together, as are the top and bottom, with appropriate orientations. The only difficulty is that we need each simplex to be uniquely determined by its vertices. While this construction does not use the smallest possible number of simplices to satisfy that condition, it is the simplest to think about.

A possible realization of the torus as a simplicial complex. As an exercise, the reader is invited to fill in the orientations on the simplices to be consistent across the entire complex.

A possible realization of the torus as a simplicial complex. As an exercise, the reader is invited to label the edges and fill in the orientations on the simplices to be consistent across the entire complex. Remember that the result should coincide with our classical construction via the quotient of the disk, so some of the edges on the sides will coincide with those on the opposite sides, and the orientations must line up.

Taking a known topological space (like the torus) and realizing it as a simplicial complex is known as triangulating the space. A space which can be realized as a simplicial complex is called triangulable.

The nicest thing about the simplex is that it has an easy-to-describe boundary. Geometrically, it’s obvious: the boundary of the line segment is the two endpoints; the boundary of the triangle is the union of all three of its edges; the tetrahedron has four triangular faces as its boundary; etc. But because we need an algebraic way to describe holes, we want an algebraic way to describe the boundary. In particular, we have two important criterion that any algebraic definition must satisfy to be reasonable:

  1. A boundary itself has no boundary.
  2. The property of being boundariless (at least in low dimensions) coincides with our intuitive idea of what it means to be a loop.

Of course, just as with homotopy these holes interact in ways we’re about to see, so we need to be totally concrete before we can proceed.

The Chain Group and the Boundary Operator

In order to define an algebraic boundary, we have to realize simplices themselves as algebraic objects.  This is not so difficult to do: just take all “formal sums” of simplices in the complex. More rigorously, let X_k be the set of k-simplices in the simplicial complex X. Define the chain group C_k(X) to be the \mathbb{Q}-vector space with X_k for a basis. The elements of the k-th chain group are called k-chainon X. That’s right, if \sigma, \sigma' are two k-simplices, then we just blindly define a bunch of new “chains” as all possible “sums” and scalar multiples of the simplices. For example, sums involving two elements would look like a\sigma + b\sigma' for some a,b \in \mathbb{Q}. Indeed, we include any finite sum of such simplices, as is standard in taking the span of a set of basis vectors in linear algebra.

Just for a quick example, take this very simple simplicial complex:


We’ve labeled all of the simplices above, and we can describe the chain groups quite easily. The zero-th chain group C_0(X) is the \mathbb{Q}-linear span of the set of vertices \left \{ v_1, v_2, v_3, v_4 \right \}. Geometrically, we might think of “the union” of two points as being, e.g., the sum v_1 + v_3. And if we want to have two copies of v_1 and five copies of v_3, that might be thought of as 2v_1 + 5v_3. Of course, there are geometrically meaningless sums like \frac{1}{2}v_4 - v_2 - \frac{11}{6}v_1, but it will turn out that the algebra we use to talk about holes will not falter because of it. It’s nice to have this geometric idea of what an algebraic expression can “mean,” but in light of this nonsense it’s not a good idea to get too wedded to the interpretations.

Likewise, C_1(X) is the linear span of the set \left \{ e_1, e_2, e_3, e_4, e_5 \right \} with coefficients in \mathbb{Q}. So we can talk about a “path” as a sum of simplices like e_1 + e_4 - e_5 + e_3. Here we use a negative coefficient to signify that we’re travelling “against” the orientation of an edge. Note that since the order of the terms is irrelevant, the same “path” is given by, e.g. -e_5 + e_4 + e_1 + e_3, which geometrically is ridiculous if we insist on reading the terms from left to right.

The same idea extends to higher dimensional groups, but as usual the visualization grows difficult. For example, in C_2(X) above, the chain group is the vector space spanned by \left \{ \sigma_1, \sigma_2 \right \}. But does it make sense to have a path of triangles? Perhaps, but the geometric analogies certainly become more tenuous as dimension grows. The benefit, however, is if we come up with good algebraic definitions for the low-dimensional cases, the algebra is easy to generalize to higher dimensions.

So now we will define the boundary operator on chain groups, a linear map \partial : C_k(X) \to C_{k-1}(X) by starting in lower dimensions and generalizing. A single vertex should always be boundariless, so \partial v = 0 for each vertex. Extending linearly to the entire chain group, we have \partial is identically the zero map on zero-chains. For 1-simplices we have a more substantial definition: if a simplex has its orientation as (v_1, v_2), then the boundary \partial (v_1, v_2) should be v_2 - v_1. That is, it’s the front end of the edge minus the back end. This defines the boundary operator on the basis elements, and we can again extend linearly to the entire group of 1-chains.

Why is this definition more sensible than, say, v_1 + v_2? Using our example above, let’s see how it operates on a “path.” If we have a sum like e_1 + e_4 - e_5 - e_3, then the boundary is computed as

\displaystyle \partial (e_1 + e_4 - e_5 - e_3) = \partial e_1 + \partial e_4 - \partial e_5 - \partial e_3
\displaystyle = (v_2 - v_1) + (v_4 - v_2) - (v_4 - v_3) - (v_3 - v_2) = v_2 - v_1

That is, the result was the endpoint of our path v_2 minus the starting point of our path v_1. It is not hard to prove that this will work in general, since each successive edge in a path will cancel out the ending vertex of the edge before it and the starting vertex of the edge after it: the result is just one big alternating sum.

Even more importantly is that if the “path” is a loop (the starting and ending points are the same in our naive way to write the paths), then the boundary is zero. Indeed, any time the boundary is zero then one can rewrite the sum as a sum of “loops,” (though one might have to trivially introduce cancelling factors). And so our condition for a chain to be a “loop,” which is just one step away from being a “hole,” is if it is in the kernel of the boundary operator. We have a special name for such chains: they are called cycles.

For 2-simplices, the definition is not so much harder: if we have a simplex like (v_0, v_1, v_2), then the boundary should be (v_1,v_2) - (v_0,v_2) + (v_0,v_1). If one rewrites this in a different order, then it will become apparent that this is just a path traversing the boundary of the simplex with the appropriate orientations. We wrote it in this “backwards” way to lead into the general definition: the simplices are ordered by which vertex does not occur in the face in question (v_0 omitted from the first, v_1 from the second, and v_2 from the third).

We are now ready to extend this definition to arbitrary simplices, but a nice-looking definition requires a bit more notation. Say we have a k-simplex which looks like (v_0, v_1, \dots, v_k). Abstractly, we can write it just using the numbers, as [0,1,\dots, k]. And moreover, we can denote the removal of a vertex from this list by putting a hat over the removed index. So [0,1,\dots, \hat{i}, \dots, k] represents the simplex which has all of the vertices from 0 to k excluding the vertex v_i. To represent a single-vertex simplex, we will often drop the square brackets, e.g. 3 for [3]. This can make for some awkward looking math, but is actually standard notation once the correct context has been established.

Now the boundary operator is defined on the standard n-simplex with orientation [0,1,\dots, n] via the alternating sum

\displaystyle \partial([0,1,\dots, n]) = \sum_{k=0}^n (-1)^k [0, \dots, \hat{k}, \dots, n]

It is trivial (but perhaps notationally hard to parse) to see that this coincides with our low-dimensional examples above. But now that we’ve defined it for the basis elements of a chain group, we automatically get a linear operator on the entire chain group by extending \partial linearly on chains.

Definition: The k-cycles on X are those chains in the kernel of \partial. We will call k-cycles boundariless. The k-boundaries are the image of \partial.

We should note that we are making a serious abuse of notation here, since technically \partial is defined on only a single chain group. What we should do is define \partial_k : C_k(X) \to C_{k-1}(X) for a fixed dimension, and always put the subscript. In practice this is only done when it is crucial to be clear which dimension is being talked about, and otherwise the dimension is always inferred from the context. If we want to compose the boundary operator in one dimension with the boundary operator in another dimension (say, \partial_{k-1} \partial_k), it is usually written \partial^2. This author personally supports the abolition of the subscripts for the boundary map, because subscripts are a nuisance in algebraic topology.

All of that notation discussion is so we can make the following observation: \partial^2 = 0. That is, every chain which is a boundary of a higher-dimensional chain is boundariless! This should make sense in low-dimension: if we take the boundary of a 2-simplex, we get a cycle of three 1-simplices, and the boundary of this chain is zero. Indeed, we can formally prove it from the definition for general simplices (and extend linearly to achieve the result for all simplices) by writing out \partial^2([0,1,\dots, n]). With a keen eye, the reader will notice that the terms cancel out and we get zero. The reason is entirely in which coefficients are negative; the second time we apply the boundary operator the power on (-1) shifts by one index. We will leave the full details as an exercise to the reader.

So this fits our two criteria: low-dimensional examples make sense, and boundariless things (cycles) represent loops.

Recasting in Algebraic Terms, and the Homology Group

For the moment let’s give boundary operators subscripts \partial_k : C_k(X) \to C_{k-1}(X). If we recast things in algebraic terms, we can call the k-cycles Z_k(X) = \textup{ker}(\partial_k), and this will be a subspace (and a subgroup) of C_k(X) since kernels are always linear subspaces. Moreover, the set B_k(X) of k-boundaries, that is, the image of \partial_{k+1}, is a subspace (subgroup) of Z_k(X). As we just saw, every boundary is itself boundariless, so B_k(X) is a subset of Z_k(X), and since the image of a linear map is always a linear subspace of the range, we get that it is a subspace too.

All of this data is usually expressed in one big diagram: each of the chain groups are organized in order of decreasing dimension, and the boundary maps connect them.


Since our example (the “simple space” of two triangles from the previous section) only has simplices in dimensions zero, one, and two, we additionally extend the sequence of groups to an infinite sequence by adding trivial groups and zero maps, as indicated. The condition that \textup{im} \partial_{k+1} \subset \textup{ker} \partial_k, which is equivalent to \partial^2 = 0, is what makes this sequence a chain complex. As a side note, every sequence of abelian groups and group homomorphisms which satisfies the boundary requirement is called an algebraic chain complex. This foreshadows that there are many different types of homology theory, and they are unified by these kinds of algebraic conditions.

Now, geometrically we want to say, “The holes are all those cycles (loops) which don’t arise as the boundaries of higher-dimensional things.” In algebraic terms, this would correspond to a quotient space (really, a quotient group, which we covered in our first primer on groups) of the k-cycles by the k-boundaries. That is, a cycle would be considered a “trivial hole” if it is a boundary, and two “different” cycles would be considered the same hole if their difference is a k-boundary. This is the spirit of homology, and formally, we define the homology group (vector space) as follows.

Definition: The k-th homology group of a simplicial complex X, denoted H_k(X), is the quotient vector space Z_k(X) / B_k(X) = \textup{ker}(\partial_k) / \textup{im}(\partial_{k+1}). Two elements of a homology group which are equivalent (their difference is a boundary) are called homologous.

The number of k-dimensional holes in X is thus realized as the dimension of H_k(X) as a vector space.

The quotient mechanism really is doing all of the work for us here. Any time we have two holes and we’re wondering whether they represent truly different holes in the space (perhaps we have a closed loop of edges, and another which is slightly longer but does not quite use the same edges), we can determine this by taking their difference and seeing if it bounds a higher-dimensional chain. If it does, then the two chains are the same, and if it doesn’t then the two chains carry intrinsically different topological information.

For particular dimensions, there are some neat facts (which obviously require further proof) that make this definition more concrete.

  • The dimension of H_0(X) is the number of connected components of X. Therefore, computing homology generalizes the graph-theoretic methods of computing connected components.
  • H_1(X) is the abelianization of the fundamental group of X. Roughly speaking, H_1(X) is the closest approximation of \pi_1(X) by a \mathbb{Q} vector space.

Now that we’ve defined the homology group, let’s end this post by computing all the homology groups for this example space:


This is a sphere (which can be triangulated as the boundary of a tetrahedron) with an extra “arm.” Note how the edge needs an extra vertex to maintain uniqueness. This space is a nice example because it has one-dimensional homology in dimension zero (one connected component), dimension one (the arm is like a copy of the circle), and dimension two (the hollow sphere part). Let’s verify this algebraically.

Let’s start by labelling the vertices of the tetrahedron 0, 1, 2, 3, so that the extra arm attaches at 0 and 2, and call the extra vertex on the arm 4. This completely determines the orientations for the entire simplex, as seen below.

Indeed, the chain groups are easy to write down:

\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}

\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}

\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}

We can easily write down the images of each \partial_k, they’re just the span of the images of each basis element under \partial_k.

\displaystyle \textup{im} \partial_1 = \textup{span} \left \{ 1 - 0, 2 - 0, 3 - 0, 4 - 0, 2 - 1, 3 - 1, 3 - 2, 4 - 2 \right \}

The zero-th homology H_0(X) is the kernel of \partial_0 modulo the image of \partial_1. The angle brackets are a shorthand for “span.”

\displaystyle \frac{\left \langle 0,1,2,3,4 \right \rangle}{\left \langle 1-0,2-0,3-0,4-0,2-1,3-1,3-2,4-2 \right \rangle}

Since \partial_0 is actually the zero map, Z_0(X) = C_0(X) and all five vertices generate the kernel. The quotient construction imposes that two vertices (two elements of the homology group) are considered equivalent if their difference is a boundary. It is easy to see that (indeed, just by the first four generators of the image) all vertices are equivalent to 0, so there is a unique generator of homology, and the vector space is isomorphic to \mathbb{Q}. There is exactly one connected component. Geometrically we can realize this, because two vertices are homologous if and only if there is a “path” of edges from one vertex to the other. This chain will indeed have as its image the difference of the two vertices.

We can compute the first homology H_1(X) in an analogous way, compute the kernel and image separately, and then compute the quotient.

\textup{ker} \partial_1 = \textup{span} \left \{ [0,1] + [0,3] - [1,3], [0,2] + [2,3] - [0,3], [1,2] + [2,3] - [1,3], [0,1] + [1,2] - [0,2], [0,2] + [2,4] - [0,4] \right \}

It takes a bit of combinatorial analysis to show that this is precisely the kernel of \partial_1, and we will have a better method for it in the next post, but indeed this is it. As the image of \partial_2 is precisely the first four basis elements, the quotient is just the one-dimensional vector space spanned by [0,2] + [2,4] - [0,4]. Hence H_1(X) = \mathbb{Q}, and there is one one-dimensional hole.

Since there are no 3-simplices, the homology group H_2(X) is simply the kernel of \partial_2, which is not hard to see is just generated by the chain representing the “sphere” part of the space: [1,2,3] - [0,2,3] + [0,1,3] - [0,1,2]. The second homology group is thus again \mathbb{Q} and there is one two-dimensional hole in X.

So there we have it!

Looking Forward

Next time, we will give a more explicit algorithm for computing homology for finite simplicial complexes, and it will essentially be a variation of row-reduction which simultaneously rewrites the matrix representations of the boundary operators \partial_{k+1}, \partial_k with respect to a canonical basis. This will allow us to simply count entries on the digaonals of the two matrices, and the difference will be the dimension of the quotient space, and hence the number of holes.

Until then!

Conditional (Partitioned) Probability — A Primer

One of the main areas of difficulty in elementary probability, and one that requires the highest levels of scrutiny and rigor, is conditional probability. The ideas are simple enough: that we assign probabilities relative to the occurrence of some event. But shrewd applications of conditional probability (and in particular, efficient ways to compute conditional probability) are key to successful applications of this subject. This is the basis for Nate Silver‘s success, the logical flaws of many a political pundit, and the ability for a robot to tell where it is in an environment. As this author usually touts, the best way to avoid the pitfalls of such confusing subjects is to be mathematically rigorous. In doing so we will develop intuition for when conditional probability that experts show off as if it were trivial.

But before we can get to all of that, we will cover a few extra ideas from finite probability theory that were left out of the last post.

Our entire discussion will revolve around a finite probability space, as defined last time. Let’s briefly (and densely) recall some of the notation presented there. We will always denote our probability space by \Omega, and the corresponding probability mass function will be f: \Omega \to [0,1]. Recall that events are subsets E \subset \Omega, and the probability function P accepts as inputs events E, and produces as output the sum of the probabilities of members of E. We abuse notation by saying \textup{P}(x) = \textup{P}(\left \{ x \right \}) = f(x) and disregarding f for the most part. We really think of \textup{P} as an extension of f to subsets of \Omega instead of just single values of \Omega. Further recall that a random variable X is a real-valued function function \Omega \to \mathbb{R}.

Partitions and Total Probability

A lot of reasoning in probability theory involves decomposing a complicated event into simpler events, or decomposing complicated random variables into simpler ones. Conditional probability is one way to do that, and conditional probability has very nice philosophical interpretations, but it fits into this more general scheme of “decomposing” events and variables into components.

The usual way to break up a set into pieces is via a partition. Recall the following set-theoretic definition.

Definition: partition of a set X is a collection of subsets X_i \in X so that every element x \in X occurs in exactly one of the X_i.

Here are a few examples. We can partition the natural numbers \mathbb{N} into even and odd numbers. We can partition the set of people in the world into subsets where each subset corresponds to a country and a person is placed in the subset corresponding to where they were born (an obvious simplification of the real world, but illustrates the point). The avid reader of this blog will remember how we used partitions to define quotient groups and quotient spaces. With a more applied flavor, finding a “good” partition is the ultimate goal of the clustering problem, and we saw a heuristic approach to this in our post on Lloyd’s algorithm.

You should think of a partition as a way to “cut up” a set into pieces. This colorful diagram is an example of a partition of a disc.

In fact, any time we have a canonical way to associate two things in a set, we can create a partition by putting all mutually associated things in the same piece of the partition. The rigorous name for this is an equivalence relation, but we won’t need that for the present discussion (partitions are the same thing as equivalence relations, just viewed in a different way).

Of course, the point is to apply this idea to probability spaces. Points (elements) in our probability space \Omega are outcomes of some random experiment, and subsets E \subset \Omega are events. So we can rephrase a partition for probability spaces as a choice of events E_i \subset \Omega so that every outcome in \Omega is part of exactly one event. Our first observation is quite a trivial one: the probabilities of the events in a partition sum to one. In symbols, if E_1, \dots, E_m form our partition, then

\displaystyle \sum_{i=1}^m \textup{P}(E_i) = 1

Indeed, the definition of \textup{P} is to sum over the probabilities of outcomes in an event. Since each outcome occurs exactly once among all the E_i, the above sum expands to

\displaystyle \sum_{\omega \in \Omega} \textup{P}(\omega)

Which by our axioms for a probability space is just one. We will give this observation the (non-standard) name the Lemma of Total Probability.

This was a nice warmup proof, but we can beef it up to make it more useful. If we have some other event A which is not related to a partition in any way, we can break up A with respect to the partition. Then, assuming this is simpler, we compute the probability that A happens in terms of the probabilities of the pieces.

Theorem: Let E_1, \dots , E_m be a partition of \Omega, and let A be an arbitrary event. Then

\displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(E_i \cap A)

Proof. The proof is only marginally more complicated than that of the lemma of total probability. The probability of the event A occurring is the sum of the probabilities of each of its outcomes occurring. Each outcome in A occurs in exactly one of the E_i, and hence in exactly one of the sets E_i \cap A. If E_i \cap A is empty, then its probability of occurring is zero (as per our definitions last time). So the sum on the right expands directly into the definition of \textup{P}(A). \square

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E's

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E’s. That is, the E’s give us a partition of A.

A more useful way of thinking of this is that we can use the E_i to define a partition of A in a natural way. The subsets in the partition will just be the sets E_i \cap A, and we will throw out any of these that turn out to be empty. Then we can think of our “new” probability space being A, and the theorem is just a special case of the lemma of total probability. Interestingly enough, this special case is often called the Theorem of Total Probability.

The idea to think of the event A as our “new” probability space is extremely useful. It shows its worth most prominently when we interpret the shift as, “gaining the information that A has occurred.” Then the question becomes: given that A occurs, what is the probability that some other event will occur? That is, we’re interested in the probability of some event B relative to A. This is called the conditional probability of B with respect to A, and is denoted P(B | A) (read “the probability of B given A”).

To compute the conditional probability, simply scale \textup{P}(A \cap B) by the assumed event \textup{P}(A). That is,

\displaystyle \textup{P}(B | A) = \frac{\textup{P}(A \cap B)}{\textup{P}(A)}

Wikipedia provides a straightforward derivation of the formula, but the spirit of the proof is exactly what we said above. The denominator is our new sample space, and the numerator is the probability of outcomes that cause B to occur which also cause A to occur. Multiplying both sides of this formula by \textup{P}(A), this identity can be used to arrive at another version of the theorem of total probability:

 \displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(A | E_i) \textup{P}(E_i)

That is, if we know how to compute the probabilities of the E_i, and we know how likely A is to occur in each of those scenarios, then we can compute the total probability of A occurring independently of the E_i.

We can come up with loads of more or less trivial examples of the theorem of total probability on simple probability spaces. Say you play a craps-like game where you roll a die twice. If you get a one on the first roll, you lose, and otherwise you have to match your initial roll on the second to win. The probability you win can be analyzed with the theorem on total probability. We partition the sample space into events corresponding to the outcome of the first roll.

\displaystyle \textup{P}(\textup{Win}) = \sum_{i=1}^6 \textup{P}(\textup{Win } | \textup{ 1st roll }= i) \textup{P}(\textup{1st roll } = i)

The probability the first roll is i is 1/6, and if the first roll is a 1 then the probability of winning after that is zero. In the other 5 cases the conditional probability is the same regardless of i: to match i on the second roll has a 1/6 chance. So the probability of winning is

\displaystyle 5 \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{5}{36}

For the working mathematician, these kinds of examples are relatively low-tech, but it illustrates the main way conditional probability is used in practice. We have some process we want to analyze, and we break it up into steps and condition on the results of a given step. We will see in a moment a more complicated example of this.

Partitions via Random Variables

The most common kind of partition is created via a random variable with finitely many values (or countably many, but we haven’t breached infinite probability spaces yet). In this case, we can partition the sample space \Omega based on the values of X. That is, for each value x = X(\omega), we will have a subset of the partition S_x be the set of all \omega which map to x. In the parlance of functions, it is the preimage of a single value x;

\displaystyle S_x = X^{-1}(x) = \left \{ \omega \in \Omega : X(\omega) = x\right \}

And as the reader is probably expecting, we can use this to define a “relative” expected value of a random variable. Recall that if the image of X is a finite set x_1, \dots, x_n, the expected value of X is a sum

\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)

Suppose X,Y are two such random variables, then the conditional probability of X relative to the event Y=y is the quantity

\displaystyle \textup{P}(X=x | Y=y) = \frac{\textup{P}(X=x \textup{ and } Y=y)}{\textup{P}(Y=y)}

And the conditional expectation of X relative to the event Y = y, denoted \textup{E}(X | Y = y) is a similar sum

\displaystyle \textup{E}(X|Y=y) = \sum_{i=1}^n x_i \textup{P}(X = x_i | Y = y)

Indeed, just as we implicitly “defined” a new sample space when we were partitioning based on events, here we are defining a new random variable (with the odd notation X | Y=y) whose domain is the preimage Y^{-1}(y). We can then ask what the probability of it assuming a value x is, and moreover what its expected value is.

Of course there is an analogue to the theorem of total probability lurking here. We want to say something like the true expected value of X is a sum of the conditional expectations over all possible values of Y. We have to remember, though, that different values of y can occur with very different probabilities, and the expected values of X | Y=y can change wildly between them. Just as a quick (and morbid) example, if X is the number of people who die on a randomly chosen day, and Y is the number of atomic bombs dropped on that day, it is clear that the probability of Y being positive is quite small, and the expected value of X = Y=y will be dramatically larger if y is positive than if it’s zero. (A few quick calculations based on tragic historic events show it would roughly double, using contemporary non-violent death rate estimates.)

And so instead of simply summing the expectation, we need to take an expectation over the values of Y. Thinking again of X | Y=y as a random variable based on values of Y, it makes sense mathematically to take expectation. To distinguish between the two types of expectation, we will subscript the variable being “expected,” as in \textup{E}_X(X|Y). That is, we have the following theorem.

TheoremThe expected value of X satisfies

\textup{E}_X(X) = \textup{E}_Y(\textup{E}_X(X|Y))

Proof. Expand the definitions of what these values mean, and use the definition of conditional probability \textup{P}(A \cap B) = \textup{P}(A | B) \textup{P}(B). We leave the proof as a trivial exercise to the reader, but if one cannot bear it, see Wikipedia for a full proof. \square

Let’s wrap up this post with a non-trivial example of all of this theory in action.

A Nontrivial Example: the Galton-Watson Branching Process

We are interested (as was the eponymous Sir Francis Galton in the 1800’s) in the survival of surnames through generations of marriage and children. The main tool to study such a generational phenomenon is the Galton-Watson branching process. The idea is quite simple, but its analysis quickly blossoms into a rich and detailed theoretical puzzle and a more general practical tool. Just before we get too deep into things, we should note that these ideas (along with other types of branching processes) are used to analyze a whole host of problems in probability theory and computer science. A few the author has recently been working with are the evolution of random graphs and graph property testing.

The gist is as follows: say we live in a patriarchal society in which surnames are passed down on the male side. We can image a family tree being grown step by step in this way At the root there is a single male, and he has k children, some of which are girls and some of which are boys. They all go on to have some number of children, but only the men pass on the family name to their children, and only their male children pass on the family name further. If we only record the family tree along the male lines, we can ask whether the tree will be finite; that is, whether the family name will die out.

To make this rigorous, let us define an infinite sequence of random variables X_1 X_2, \dots which represent the number of children each person in the tree has, and suppose further that all of these variables are independent and uniformly distributed from 1, \dots, n for some fixed n. This may be an unrealistic assumption, but it makes the analysis a bit simpler. The number of children more likely follows a Poisson distribution where the mean is a parameter we would estimate from real-world data, but we haven’t spoken of Poisson distributions on this blog yet so we will leave it out.

We further imagine the tree growing step by step: at step i the i-th individual in the tree has X_i children and then dies. If the individual is a woman we by default set X_i = 0. We can recursively describe the size of the tree at each step by another random variable Y_i. Clearly Y_0 = 1, and the recursion is Y_n = Y_{n-1} + X_i - 1. In words, Y_i represents the current living population with the given surname. We say the tree is finite (the family name dies off), if for some i we get Y_i = 0. The first time at which this happens is when the family name dies off, but abstractly we can imagine the sequence of random variables continuing forever. This is sometimes called fictitious continuation.

At last, we assume that the probability of having a boy or girl is a split 1/2. Now we can start asking questions. What is the probability that the surname dies off? What is the expected size of the tree in terms of n?

For the first question we use the theorem of total probability. In particular, suppose the first person has two boys. Then the whole tree is finite precisely when both boys’ sub-trees are finite. Indeed, the two boys’ sub-trees are independent of one another, and so the probability of both being finite is the product of the probabilities of each being finite. That is, more generally

\displaystyle \textup{P}(\textup{finite } | k \textup{ boys}) = \textup{P}(\textup{finite})^k \textup{P}(\textup{two boys})

Setting z = \textup{P}(\textup{the tree is finite}), we can compute z directly by conditioning on all possibilities of the first person’s children. Notice how we must condition twice here.

\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \textup{P}(k \textup{ boys } | i \textup{ children}) \textup{P}(i \textup{ children}) z^k

The probability of getting k boys is the same as flipping i coins and getting k heads, which is just

\displaystyle \textup{P}(k \textup{ boys } | i \textup{ children}) = \binom{i}{k}\frac{1}{2^i}

So the equation is

\displaystyle z = \sum_{i=0}^n \sum_{k=0}^i \binom{i}{k} \frac{1}{2^i} \cdot \frac{1}{n} z^k

From here, we’ve reduced the problem down to picking the correct root of a polynomial. For example, when n=4, the polynomial equation to solve is

\displaystyle 64z = 5 + 10z + 10z^2 + 5z^3 + z^4

We have to be a bit careful, here though. Not all solutions to this equation are valid answers. For instance, the roots must be between 0 and 1 (inclusive), and if there are multiple then one must rule out the irrelevant roots by some additional argument. Moreover, we would need to use a calculus argument to prove there is always a solution between 0 and 1 in the first place. But after all that is done, we can estimate the correct root computationally (or solve for exactly when our polynomials have small degree). Here for n=4, the probability of being finite is about 0.094.

We leave the second question, on the expected size of the tree, for the reader to ponder. Next time we’ll devote an entire post to Bayes Theorem (a trivial consequence of the definition of conditional probability), and see how it helps us compute probabilities for use in programs.

Until then!