Homology Theory — A Primer

This series on topology has been long and hard, but we’re quickly approaching the topics where we can actually write programs. For this and the next post on homology, the most important background we will need is a solid foundation in linear algebra, specifically in row-reducing matrices (and the interpretation of row-reduction as a change of basis of a linear operator).

Last time we engaged in a whirlwind tour of the fundamental group and homotopy theory. And we mean “whirlwind” as it sounds; it was all over the place in terms of organization. The most important fact that one should take away from that discussion is the idea that we can compute, algebraically, some qualitative features about a topological space related to “n-dimensional holes.” For one-dimensional things, a hole would look like a circle, and for two dimensional things, it would look like a hollow sphere, etc. More importantly, we saw that this algebraic data, which we called the fundamental group, is a topological invariant. That is, if two topological spaces have different fundamental groups, then they are “fundamentally” different under the topological lens (they are not homeomorphic, and not even homotopy equivalent).

Unfortunately the main difficulty of homotopy theory (and part of what makes it so interesting) is that these “holes” interact with each other in elusive and convoluted ways, and the algebra reflects it almost too well. Part of the problem with the fundamental group is that it deftly eludes our domain of interest: we don’t know a general method to compute the damn things!

What we really need is a coarser invariant. If we can find a “stupider” invariant, it might just be simple enough to compute. Perhaps unsurprisingly, these will take the form of finitely-generated abelian groups (the most well-understood class of groups), with one for each dimension. Now we’re starting to see exactly why algebraic topology is so difficult; it has an immense list of prerequisite topics! If we’re willing to skip over some of the more nitty gritty details (and we must lest we take a huge diversion to discuss Tor and the exact sequences in the universal coefficient theorem), then we can also do the same calculations over a field. In other words, the algebraic objects we’ll define called “homology groups” are really vector spaces, and so row-reduction will be our basic computational tool to analyze them.

Once we have the basic theory down, we’ll see how we can write a program which accepts as input any topological space (represented in a particular form) and produces as output a list of the homology groups in every dimension. The dimensions of these vector spaces (their ranks, as finitely-generated abelian groups) are interpreted as the number of holes in the space for each dimension.

Recall Simplicial Complexes

In our post on constructing topological spaces, we defined the standard k-simplex and the simplicial complex. We recall the latter definition here, and expand upon it.

Definition: A simplicial complex is a topological space realized as a union of any collection of simplices (of possibly varying dimension) \Sigma which has the following two properties:

  • Any face of a simplex in \Sigma is also in \Sigma.
  • The intersection of any two simplices of \Sigma is also a simplex of \Sigma.

We can realize a simplicial complex by gluing together pieces of increasing dimension. First start by taking a collection of vertices (0-simplices) X_0. Then take a collection of intervals (1-simplices) X_1 and glue their endpoints onto the vertices in any way. Note that because we require every face of an interval to again be a simplex in our complex, we must glue each endpoint of an interval onto a vertex in X_0. Continue this process with X_2, a set of 2-simplices: we must glue each edge of a 2-simplex precisely along an edge in X_1. We can continue this process until we reach a terminating set X_n. It is easy to see that the union of the X_i forms a simplicial complex. Define the dimension of the simplicial complex to be n.

There are some picky restrictions on how we glue things that we should mention. For instance, we could not contract all edges of a 2-simplex \sigma and glue it all to a single vertex in X_0. The reason for this is that \sigma would no longer be a 2-simplex! Indeed, we’ve destroyed its original vertex set. The gluing process hence needs to preserve the original simplex’s boundary. Moreover, one property that follows from the two conditions above is that any simplex in the complex is uniquely determined by its vertices (for otherwise, the intersection of two such non-uniquely specified simplices would not be a single simplex).
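Since every simplex is determined by its vertices, one way to store a complex on a computer is as a set of vertex tuples. The following is a minimal sketch under that assumption (the names `faces` and `is_closed_under_faces` are made up for illustration). It checks only the first defining property, that every face of a simplex is again in the collection.

```python
from itertools import combinations

def faces(simplex):
    """All nonempty proper faces of a simplex given as a sorted tuple of vertices."""
    return {
        face
        for size in range(1, len(simplex))
        for face in combinations(simplex, size)
    }

def is_closed_under_faces(simplices):
    """Check that every face of every simplex is itself in the collection."""
    simplices = set(simplices)
    return all(faces(s) <= simplices for s in simplices)

# A filled triangle together with all of its edges and vertices.
triangle = {(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)}
# The same collection with the edge (0, 2) removed.
broken = triangle - {(0, 2)}

print(is_closed_under_faces(triangle))  # True
print(is_closed_under_faces(broken))    # False
```

In this combinatorial encoding the intersection of two simplices is just the set of shared vertices, which is a face of both, so the second property comes along for free once the first one holds.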

We also have to remember that we’re imposing a specific ordering on the vertices of a simplex. In particular, if we label the vertices of an n-simplex 0, \dots, n, then this imposes an orientation on the edges where an edge of the form \left \{ i,j \right \} has the orientation (i,j) if i < j, and (j,i) otherwise. The faces, then, are “oriented” in increasing order of their three vertices. Higher dimensional simplices are oriented in a similar way, though we rarely try to picture this (the theory of orientations is a question best posed for smooth manifolds; we won’t be going there any time soon). Here are, for example, two different ways to pick orientations of a 2-simplex:

Two possible orientations of a 2-simplex.

It is true, but a somewhat lengthy exercise, that the topology of a simplicial complex does not change under a consistent shuffling of the orientations across all its simplices. Nor does it change depending on how we realize a space as a simplicial complex. These kinds of results are crucial to the welfare of the theory, but have been proved once and we won’t bother reproving them here.

As a larger example, here is a simplicial complex representing the torus. It’s quite a bit more complicated than our usual quotient of a square, but it’s based on the same idea. The left and right edges are glued together, as are the top and bottom, with appropriate orientations. The only difficulty is that we need each simplex to be uniquely determined by its vertices. While this construction does not use the smallest possible number of simplices to satisfy that condition, it is the simplest to think about.

A possible realization of the torus as a simplicial complex. As an exercise, the reader is invited to label the edges and fill in the orientations on the simplices to be consistent across the entire complex. Remember that the result should coincide with our classical construction via the quotient of the disk, so some of the edges on the sides will coincide with those on the opposite sides, and the orientations must line up.

Taking a known topological space (like the torus) and realizing it as a simplicial complex is known as triangulating the space. A space which can be realized as a simplicial complex is called triangulable.

The nicest thing about the simplex is that it has an easy-to-describe boundary. Geometrically, it’s obvious: the boundary of the line segment is the two endpoints; the boundary of the triangle is the union of all three of its edges; the tetrahedron has four triangular faces as its boundary; etc. But because we need an algebraic way to describe holes, we want an algebraic way to describe the boundary. In particular, we have two important criteria that any algebraic definition must satisfy to be reasonable:

  1. A boundary itself has no boundary.
  2. The property of being boundariless (at least in low dimensions) coincides with our intuitive idea of what it means to be a loop.

Of course, just as with homotopy these holes interact in ways we’re about to see, so we need to be totally concrete before we can proceed.

The Chain Group and the Boundary Operator

In order to define an algebraic boundary, we have to realize simplices themselves as algebraic objects. This is not so difficult to do: just take all “formal sums” of simplices in the complex. More rigorously, let X_k be the set of k-simplices in the simplicial complex X. Define the chain group C_k(X) to be the \mathbb{Q}-vector space with X_k for a basis. The elements of the k-th chain group are called k-chains on X. That’s right, if \sigma, \sigma' are two k-simplices, then we just blindly define a bunch of new “chains” as all possible “sums” and scalar multiples of the simplices. For example, sums involving two elements would look like a\sigma + b\sigma' for some a,b \in \mathbb{Q}. Indeed, we include any finite sum of such simplices, as is standard in taking the span of a set of basis vectors in linear algebra.

Just for a quick example, take this very simple simplicial complex:

simple-space

We’ve labeled all of the simplices above, and we can describe the chain groups quite easily. The zero-th chain group C_0(X) is the \mathbb{Q}-linear span of the set of vertices \left \{ v_1, v_2, v_3, v_4 \right \}. Geometrically, we might think of “the union” of two points as being, e.g., the sum v_1 + v_3. And if we want to have two copies of v_1 and five copies of v_3, that might be thought of as 2v_1 + 5v_3. Of course, there are geometrically meaningless sums like \frac{1}{2}v_4 - v_2 - \frac{11}{6}v_1, but it will turn out that the algebra we use to talk about holes will not falter because of it. It’s nice to have this geometric idea of what an algebraic expression can “mean,” but in light of this nonsense it’s not a good idea to get too wedded to the interpretations.
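To make “formal sums” less mysterious, here is one hypothetical way to store a chain: a dictionary from simplex names to rational coefficients. The helper names `add_chains` and `scale_chain` are assumptions for this sketch, not part of any library.

```python
from fractions import Fraction

def add_chains(a, b):
    """Sum of two chains, each stored as {simplex name: rational coefficient}."""
    total = dict(a)
    for simplex, coeff in b.items():
        total[simplex] = total.get(simplex, 0) + coeff
    # Drop simplices whose coefficients cancelled to zero.
    return {s: c for s, c in total.items() if c != 0}

def scale_chain(scalar, chain):
    """Scalar multiple of a chain."""
    return {s: scalar * c for s, c in chain.items() if scalar * c != 0}

# The two 0-chains mentioned above: 2 v_1 + 5 v_3 and (1/2) v_4 - v_2 - (11/6) v_1.
a = {"v1": Fraction(2), "v3": Fraction(5)}
b = {"v4": Fraction(1, 2), "v2": Fraction(-1), "v1": Fraction(-11, 6)}

total = add_chains(a, b)
print(total["v1"])                         # 1/6
print(add_chains(a, scale_chain(-1, a)))   # {}
```

The zero chain is the empty dictionary, and nothing in the code cares whether a given sum is geometrically meaningful.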

Likewise, C_1(X) is the linear span of the set \left \{ e_1, e_2, e_3, e_4, e_5 \right \} with coefficients in \mathbb{Q}. So we can talk about a “path” as a sum of simplices like e_1 + e_4 - e_5 + e_3. Here we use a negative coefficient to signify that we’re travelling “against” the orientation of an edge. Note that since the order of the terms is irrelevant, the same “path” is given by, e.g. -e_5 + e_4 + e_1 + e_3, which geometrically is ridiculous if we insist on reading the terms from left to right.

The same idea extends to higher dimensional groups, but as usual the visualization grows difficult. For example, in C_2(X) above, the chain group is the vector space spanned by \left \{ \sigma_1, \sigma_2 \right \}. But does it make sense to have a path of triangles? Perhaps, but the geometric analogies certainly become more tenuous as dimension grows. The benefit, however, is if we come up with good algebraic definitions for the low-dimensional cases, the algebra is easy to generalize to higher dimensions.

So now we will define the boundary operator on chain groups, a linear map \partial : C_k(X) \to C_{k-1}(X) by starting in lower dimensions and generalizing. A single vertex should always be boundariless, so \partial v = 0 for each vertex. Extending linearly to the entire chain group, we have \partial is identically the zero map on zero-chains. For 1-simplices we have a more substantial definition: if a simplex has its orientation as (v_1, v_2), then the boundary \partial (v_1, v_2) should be v_2 - v_1. That is, it’s the front end of the edge minus the back end. This defines the boundary operator on the basis elements, and we can again extend linearly to the entire group of 1-chains.

Why is this definition more sensible than, say, v_1 + v_2? Using our example above, let’s see how it operates on a “path.” If we have a sum like e_1 + e_4 - e_5 - e_3, then the boundary is computed as

\displaystyle \partial (e_1 + e_4 - e_5 - e_3) = \partial e_1 + \partial e_4 - \partial e_5 - \partial e_3
\displaystyle = (v_2 - v_1) + (v_4 - v_2) - (v_4 - v_3) - (v_3 - v_2) = v_2 - v_1

That is, the result was the endpoint of our path v_2 minus the starting point of our path v_1. It is not hard to prove that this will work in general, since the starting vertex of each successive edge in a path cancels the ending vertex of the edge before it: the result is just one big telescoping sum.
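The computation above can be replayed mechanically. In this sketch the edge endpoints are read off from the displayed calculation (e_2 is left out because its endpoints are not stated in the text), and the function name `boundary_of_1_chain` is hypothetical.

```python
# Oriented edges as (tail, head), taken from the boundary computation above.
edges = {
    "e1": ("v1", "v2"),
    "e3": ("v2", "v3"),
    "e4": ("v2", "v4"),
    "e5": ("v3", "v4"),
}

def boundary_of_1_chain(chain):
    """Boundary of a 1-chain {edge: coefficient}: head minus tail, extended linearly."""
    result = {}
    for name, coeff in chain.items():
        tail, head = edges[name]
        result[head] = result.get(head, 0) + coeff
        result[tail] = result.get(tail, 0) - coeff
    return {v: c for v, c in result.items() if c != 0}

path = {"e1": 1, "e4": 1, "e5": -1, "e3": -1}
print(boundary_of_1_chain(path))  # {'v2': 1, 'v1': -1}

# A closed loop v2 -> v3 -> v4 -> v2 has empty (zero) boundary.
loop = {"e3": 1, "e5": 1, "e4": -1}
print(boundary_of_1_chain(loop))  # {}
```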

Even more important is that if the “path” is a loop (the starting and ending points are the same in our naive way to write the paths), then the boundary is zero. Indeed, any time the boundary is zero then one can rewrite the sum as a sum of “loops,” (though one might have to trivially introduce cancelling factors). And so our condition for a chain to be a “loop,” which is just one step away from being a “hole,” is if it is in the kernel of the boundary operator. We have a special name for such chains: they are called cycles.

For 2-simplices, the definition is not so much harder: if we have a simplex like (v_0, v_1, v_2), then the boundary should be (v_1,v_2) - (v_0,v_2) + (v_0,v_1). If one rewrites this in a different order, then it will become apparent that this is just a path traversing the boundary of the simplex with the appropriate orientations. We wrote it in this “backwards” way to lead into the general definition: the simplices are ordered by which vertex does not occur in the face in question (v_0 omitted from the first, v_1 from the second, and v_2 from the third).

We are now ready to extend this definition to arbitrary simplices, but a nice-looking definition requires a bit more notation. Say we have a k-simplex which looks like (v_0, v_1, \dots, v_k). Abstractly, we can write it just using the numbers, as [0,1,\dots, k]. And moreover, we can denote the removal of a vertex from this list by putting a hat over the removed index. So [0,1,\dots, \hat{i}, \dots, k] represents the simplex which has all of the vertices from 0 to k excluding the vertex v_i. To represent a single-vertex simplex, we will often drop the square brackets, e.g. 3 for [3]. This can make for some awkward looking math, but is actually standard notation once the correct context has been established.

Now the boundary operator is defined on the standard n-simplex with orientation [0,1,\dots, n] via the alternating sum

\displaystyle \partial([0,1,\dots, n]) = \sum_{k=0}^n (-1)^k [0, \dots, \hat{k}, \dots, n]

It is trivial (but perhaps notationally hard to parse) to see that this coincides with our low-dimensional examples above. But now that we’ve defined it for the basis elements of a chain group, we automatically get a linear operator on the entire chain group by extending \partial linearly on chains.
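The alternating sum translates almost word for word into code. This is a minimal sketch that assumes simplices are stored as tuples of vertex labels in increasing order and chains as dictionaries of coefficients; the names `boundary` and `boundary_of_chain` are made up for illustration.

```python
def boundary(simplex):
    """Boundary of an oriented simplex (a tuple of vertices) as {face: coefficient}."""
    if len(simplex) == 1:
        return {}  # a single vertex is boundariless
    # Delete the k-th vertex and attach the sign (-1)^k.
    return {
        simplex[:k] + simplex[k + 1:]: (-1) ** k
        for k in range(len(simplex))
    }

def boundary_of_chain(chain):
    """Extend the boundary linearly to a chain {simplex: coefficient}."""
    result = {}
    for simplex, coeff in chain.items():
        for face, sign in boundary(simplex).items():
            result[face] = result.get(face, 0) + coeff * sign
    return {f: c for f, c in result.items() if c != 0}

print(boundary((0, 1)))     # {(1,): 1, (0,): -1}
print(boundary((0, 1, 2)))  # {(1, 2): 1, (0, 2): -1, (0, 1): 1}

# The path [0,1] + [1,2] has boundary 2 - 0.
print(boundary_of_chain({(0, 1): 1, (1, 2): 1}))  # {(0,): -1, (2,): 1}
```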

Definition: The k-cycles on X are those chains in the kernel of \partial. We will call k-cycles boundariless. The k-boundaries are the image of \partial.

We should note that we are making a serious abuse of notation here, since technically \partial is defined on only a single chain group. What we should do is define \partial_k : C_k(X) \to C_{k-1}(X) for a fixed dimension, and always put the subscript. In practice this is only done when it is crucial to be clear which dimension is being talked about, and otherwise the dimension is always inferred from the context. If we want to compose the boundary operator in one dimension with the boundary operator in another dimension (say, \partial_{k-1} \partial_k), it is usually written \partial^2. This author personally supports the abolition of the subscripts for the boundary map, because subscripts are a nuisance in algebraic topology.

All of that notation discussion is so we can make the following observation: \partial^2 = 0. That is, every chain which is a boundary of a higher-dimensional chain is boundariless! This should make sense in low-dimension: if we take the boundary of a 2-simplex, we get a cycle of three 1-simplices, and the boundary of this chain is zero. Indeed, we can formally prove it from the definition for general simplices (and extend linearly to achieve the result for all chains) by writing out \partial^2([0,1,\dots, n]). With a keen eye, the reader will notice that the terms cancel out and we get zero. The reason is entirely in which coefficients are negative; the second time we apply the boundary operator the power on (-1) shifts by one index. We will leave the full details as an exercise to the reader.
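As a sanity check in the smallest interesting case, here is the computation written out for the 2-simplex [0,1,2], using the definition of the boundary on 2-simplices and on edges:

\displaystyle \partial^2([0,1,2]) = \partial([1,2] - [0,2] + [0,1]) = (2 - 1) - (2 - 0) + (1 - 0) = 0

Each vertex appears exactly twice, once with each sign, and the same pairing of terms is what happens in the general case.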

So this fits our two criteria: boundaries are themselves boundariless, and in low dimensions the boundariless things (cycles) represent loops.

Recasting in Algebraic Terms, and the Homology Group

For the moment let’s give boundary operators subscripts \partial_k : C_k(X) \to C_{k-1}(X). If we recast things in algebraic terms, we can call the k-cycles Z_k(X) = \textup{ker}(\partial_k), and this will be a subspace (and a subgroup) of C_k(X) since kernels are always linear subspaces. Moreover, the set B_k(X) of k-boundaries, that is, the image of \partial_{k+1}, is a subspace (subgroup) of Z_k(X). Indeed, as we just saw, every boundary is itself boundariless, so B_k(X) is a subset of Z_k(X), and since the image of a linear map is always a linear subspace of the codomain, it is a subspace too.

All of this data is usually expressed in one big diagram: each of the chain groups are organized in order of decreasing dimension, and the boundary maps connect them.

chain-complex

Since our example (the “simple space” of two triangles from the previous section) only has simplices in dimensions zero, one, and two, we additionally extend the sequence of groups to an infinite sequence by adding trivial groups and zero maps, as indicated. The condition that \textup{im} \partial_{k+1} \subset \textup{ker} \partial_k, which is equivalent to \partial^2 = 0, is what makes this sequence a chain complex. As a side note, every sequence of abelian groups and group homomorphisms which satisfies the boundary requirement is called an algebraic chain complex. This foreshadows that there are many different types of homology theory, and they are unified by these kinds of algebraic conditions.

Now, geometrically we want to say, “The holes are all those cycles (loops) which don’t arise as the boundaries of higher-dimensional things.” In algebraic terms, this would correspond to a quotient space (really, a quotient group, which we covered in our first primer on groups) of the k-cycles by the k-boundaries. That is, a cycle would be considered a “trivial hole” if it is a boundary, and two “different” cycles would be considered the same hole if their difference is a k-boundary. This is the spirit of homology, and formally, we define the homology group (vector space) as follows.

Definition: The k-th homology group of a simplicial complex X, denoted H_k(X), is the quotient vector space Z_k(X) / B_k(X) = \textup{ker}(\partial_k) / \textup{im}(\partial_{k+1}). Two elements of a homology group which are equivalent (their difference is a boundary) are called homologous.

The number of k-dimensional holes in X is thus realized as the dimension of H_k(X) as a vector space.
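Because all of these spaces are finite-dimensional, this dimension can be written purely in terms of ranks. The dimension of a quotient is the difference of the dimensions, and the rank-nullity theorem gives the dimension of the kernel, so

\displaystyle \dim H_k(X) = \dim Z_k(X) - \dim B_k(X) = \left ( \dim C_k(X) - \textup{rank}(\partial_k) \right ) - \textup{rank}(\partial_{k+1})

In other words, counting holes in a fixed dimension only requires the number of k-simplices and the ranks of two matrices.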

The quotient mechanism really is doing all of the work for us here. Any time we have two cycles and we’re wondering whether they represent truly different holes in the space (perhaps we have a closed loop of edges, and another which is slightly longer but does not quite use the same edges), we can determine this by taking their difference and seeing if it is the boundary of a higher-dimensional chain. If it is, then the two cycles represent the same hole, and if it isn’t then the two cycles carry intrinsically different topological information.

For particular dimensions, there are some neat facts (which obviously require further proof) that make this definition more concrete.

  • The dimension of H_0(X) is the number of connected components of X. Therefore, computing homology generalizes the graph-theoretic methods of computing connected components.
  • With integer coefficients, H_1(X) is the abelianization of the fundamental group of X (when X is connected). Roughly speaking, with our rational coefficients H_1(X) is the closest approximation of \pi_1(X) by a \mathbb{Q} vector space.

Now that we’ve defined the homology group, let’s end this post by computing all the homology groups for this example space:

circle-wedge-sphere

This is a sphere (which can be triangulated as the boundary of a tetrahedron) with an extra “arm.” Note how the edge needs an extra vertex to maintain uniqueness. This space is a nice example because it has one-dimensional homology in dimension zero (one connected component), dimension one (the arm is like a copy of the circle), and dimension two (the hollow sphere part). Let’s verify this algebraically.

Let’s start by labelling the vertices of the tetrahedron 0, 1, 2, 3, so that the extra arm attaches at 0 and 2, and call the extra vertex on the arm 4. This completely determines the orientations for the entire complex, as seen below.

Indeed, the chain groups are easy to write down:

\displaystyle C_0(X) = \textup{span} \left \{ 0,1,2,3,4 \right \}

\displaystyle C_1(X) = \textup{span} \left \{ [0,1], [0,2], [0,3], [0,4], [1,2], [1,3],[2,3],[2,4] \right \}

\displaystyle C_2(X) = \textup{span} \left \{ [0,1,2], [0,1,3], [0,2,3], [1,2,3] \right \}

We can easily write down the image of each \partial_k: it is just the span of the images of the basis elements under \partial_k.

\displaystyle \textup{im} \partial_1 = \textup{span} \left \{ 1 - 0, 2 - 0, 3 - 0, 4 - 0, 2 - 1, 3 - 1, 3 - 2, 4 - 2 \right \}

The zero-th homology H_0(X) is the kernel of \partial_0 modulo the image of \partial_1. The angle brackets are a shorthand for “span.”

\displaystyle \frac{\left \langle 0,1,2,3,4 \right \rangle}{\left \langle 1-0,2-0,3-0,4-0,2-1,3-1,3-2,4-2 \right \rangle}

Since \partial_0 is actually the zero map, Z_0(X) = C_0(X) and all five vertices generate the kernel. The quotient construction imposes that two vertices (two elements of the homology group) are considered equivalent if their difference is a boundary. It is easy to see that (indeed, just by the first four generators of the image) all vertices are equivalent to 0, so there is a unique generator of homology, and the vector space is isomorphic to \mathbb{Q}. There is exactly one connected component. Geometrically we can realize this, because two vertices are homologous if and only if there is a “path” of edges from one vertex to the other. This chain will indeed have as its boundary the difference of the two vertices.
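The geometric statement suggests a purely graph-theoretic check: count connected components using only the vertices and edges. Below is a small union-find sketch of that idea (the function name `count_components` is hypothetical, and this is a stand-in for the linear algebra rather than a replacement for it).

```python
def count_components(vertices, edges):
    """Count connected components of a graph with a simple union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    return len({find(v) for v in vertices})

vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 3), (2, 4)]

print(count_components(vertices, edges))             # 1
print(count_components(vertices, [(0, 1), (2, 3)]))  # 3
```

For our example space the answer is one, matching the dimension of H_0(X).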

We can compute the first homology H_1(X) in an analogous way: compute the kernel and image separately, and then compute the quotient.

\displaystyle \textup{ker} \partial_1 = \textup{span} \left \{ [0,1] + [1,3] - [0,3], [0,2] + [2,3] - [0,3], [1,2] + [2,3] - [1,3], [0,1] + [1,2] - [0,2], [0,2] + [2,4] - [0,4] \right \}

It takes a bit of combinatorial analysis to show that this is precisely the kernel of \partial_1, and we will have a better method for it in the next post, but indeed this is it. As the image of \partial_2 is precisely the span of the first four generators (which are linearly dependent, so they span a three-dimensional subspace of the four-dimensional kernel), the quotient is just the one-dimensional vector space spanned by [0,2] + [2,4] - [0,4]. Hence H_1(X) = \mathbb{Q}, and there is one one-dimensional hole.

Since there are no 3-simplices, the homology group H_2(X) is simply the kernel of \partial_2, which, it is not hard to see, is generated by the chain representing the “sphere” part of the space: [1,2,3] - [0,2,3] + [0,1,3] - [0,1,2]. The second homology group is thus again \mathbb{Q} and there is one two-dimensional hole in X.
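For readers who want to double-check the three dimensions by machine, here is a rough sketch that builds the boundary matrices for this complex and computes their ranks by ordinary Gaussian elimination over \mathbb{Q}. It relies on the fact that the dimension of each homology group is the number of k-simplices minus the ranks of the two adjacent boundary maps. The function names are hypothetical, and this is plain rank computation, not the simultaneous reduction promised for the next post.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix (list of rows) over Q via Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    pivots = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(pivots, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        for r in range(pivots + 1, len(rows)):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivots])]
        pivots += 1
    return pivots

def boundary_matrix(k_simplices, faces):
    """Boundary map as a matrix: one row per (k-1)-face, one column per k-simplex."""
    index = {face: i for i, face in enumerate(faces)}
    matrix = [[0] * len(k_simplices) for _ in faces]
    for col, simplex in enumerate(k_simplices):
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            matrix[index[face]][col] = (-1) ** i
    return matrix

C0 = [(0,), (1,), (2,), (3,), (4,)]
C1 = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 3), (2, 4)]
C2 = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

rank_d1 = rank(boundary_matrix(C1, C0))
rank_d2 = rank(boundary_matrix(C2, C1))

betti = [
    len(C0) - 0 - rank_d1,        # the boundary map on vertices is zero
    len(C1) - rank_d1 - rank_d2,
    len(C2) - rank_d2 - 0,        # there are no 3-simplices
]
print(betti)  # [1, 1, 1]
```

Exact `Fraction` arithmetic is used so that the rank is never thrown off by floating-point round-off.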

So there we have it!

Looking Forward

Next time, we will give a more explicit algorithm for computing homology for finite simplicial complexes, and it will essentially be a variation of row-reduction which simultaneously rewrites the matrix representations of the boundary operators \partial_{k+1}, \partial_k with respect to a canonical basis. This will allow us to simply count entries on the diagonals of the two matrices, and the difference will be the dimension of the quotient space, and hence the number of holes.

Until then!

Conditional (Partitioned) Probability — A Primer

One of the main areas of difficulty in elementary probability, and one that requires the highest levels of scrutiny and rigor, is conditional probability. The ideas are simple enough: that we assign probabilities relative to the occurrence of some event. But shrewd applications of conditional probability (and in particular, efficient ways to compute conditional probability) are key to successful applications of this subject. This is the basis for Nate Silver’s success, the logical flaws of many a political pundit, and the ability for a robot to tell where it is in an environment. As this author usually touts, the best way to avoid the pitfalls of such confusing subjects is to be mathematically rigorous. In doing so we will develop the intuition for conditional probability that experts show off as if it were trivial.

But before we can get to all of that, we will cover a few extra ideas from finite probability theory that were left out of the last post.

Our entire discussion will revolve around a finite probability space, as defined last time. Let’s briefly (and densely) recall some of the notation presented there. We will always denote our probability space by \Omega, and the corresponding probability mass function will be f: \Omega \to [0,1]. Recall that events are subsets E \subset \Omega, and the probability function \textup{P} accepts as inputs events E, and produces as output the sum of the probabilities of members of E. We abuse notation by saying \textup{P}(x) = \textup{P}(\left \{ x \right \}) = f(x) and disregarding f for the most part. We really think of \textup{P} as an extension of f to subsets of \Omega instead of just single values of \Omega. Further recall that a random variable X is a real-valued function \Omega \to \mathbb{R}.

Partitions and Total Probability

A lot of reasoning in probability theory involves decomposing a complicated event into simpler events, or decomposing complicated random variables into simpler ones. Conditional probability is one way to do that, and conditional probability has very nice philosophical interpretations, but it fits into this more general scheme of “decomposing” events and variables into components.

The usual way to break up a set into pieces is via a partition. Recall the following set-theoretic definition.

Definition: A partition of a set X is a collection of subsets X_i \subset X so that every element x \in X occurs in exactly one of the X_i.

Here are a few examples. We can partition the natural numbers \mathbb{N} into even and odd numbers. We can partition the set of people in the world into subsets where each subset corresponds to a country and a person is placed in the subset corresponding to where they were born (an obvious simplification of the real world, but illustrates the point). The avid reader of this blog will remember how we used partitions to define quotient groups and quotient spaces. With a more applied flavor, finding a “good” partition is the ultimate goal of the clustering problem, and we saw a heuristic approach to this in our post on Lloyd’s algorithm.

You should think of a partition as a way to “cut up” a set into pieces. This colorful diagram is an example of a partition of a disc.

In fact, any time we have a canonical way to associate two things in a set, we can create a partition by putting all mutually associated things in the same piece of the partition. The rigorous name for this is an equivalence relation, but we won’t need that for the present discussion (partitions are the same thing as equivalence relations, just viewed in a different way).

Of course, the point is to apply this idea to probability spaces. Points (elements) in our probability space \Omega are outcomes of some random experiment, and subsets E \subset \Omega are events. So we can rephrase a partition for probability spaces as a choice of events E_i \subset \Omega so that every outcome in \Omega is part of exactly one event. Our first observation is quite a trivial one: the probabilities of the events in a partition sum to one. In symbols, if E_1, \dots, E_m form our partition, then

\displaystyle \sum_{i=1}^m \textup{P}(E_i) = 1

Indeed, the definition of \textup{P} is to sum over the probabilities of outcomes in an event. Since each outcome occurs exactly once among all the E_i, the above sum expands to

\displaystyle \sum_{\omega \in \Omega} \textup{P}(\omega)

Which by our axioms for a probability space is just one. We will give this observation the (non-standard) name the Lemma of Total Probability.

This was a nice warmup proof, but we can beef it up to make it more useful. If we have some other event A which is not related to a partition in any way, we can break up A with respect to the partition. Then, assuming this is simpler, we compute the probability that A happens in terms of the probabilities of the pieces.

Theorem: Let E_1, \dots , E_m be a partition of \Omega, and let A be an arbitrary event. Then

\displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(E_i \cap A)

Proof. The proof is only marginally more complicated than that of the lemma of total probability. The probability of the event A occurring is the sum of the probabilities of each of its outcomes occurring. Each outcome in A occurs in exactly one of the E_i, and hence in exactly one of the sets E_i \cap A. If E_i \cap A is empty, then its probability of occurring is zero (as per our definitions last time). So the sum on the right expands directly into the definition of \textup{P}(A). \square

The area taken up by the set A is the same as the area taken up by the pieces of A which overlap the E’s. That is, the E’s give us a partition of A.

A more useful way of thinking of this is that we can use the E_i to define a partition of A in a natural way. The subsets in the partition will just be the sets E_i \cap A, and we will throw out any of these that turn out to be empty. Then we can think of our “new” probability space being A, and the theorem is just a special case of the lemma of total probability. Interestingly enough, this special case is often called the Theorem of Total Probability.
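On a small example the theorem can be checked by brute force. This sketch assumes a fair six-sided die as the probability space and picks a partition and an event arbitrarily; the names `omega`, `prob`, `partition`, and `A` are chosen here for illustration.

```python
from fractions import Fraction

# A fair die: each outcome has probability 1/6.
omega = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def prob(event):
    """P(event): the sum of the probabilities of the outcomes in the event."""
    return sum((omega[w] for w in event), Fraction(0))

partition = [{1}, {2, 3}, {4, 5, 6}]
A = {2, 4, 6}  # "the roll is even"

# Lemma of total probability: the pieces of a partition sum to one.
print(sum(prob(E) for E in partition))  # 1

# Theorem: P(A) is the sum of P(E_i and A); note that {1} & A is empty.
total = sum(prob(E & A) for E in partition)
print(prob(A), total)  # 1/2 1/2
```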

The idea to think of the event A as our “new” probability space is extremely useful. It shows its worth most prominently when we interpret the shift as, “gaining the information that A has occurred.” Then the question becomes: given that A occurs, what is the probability that some other event will occur? That is, we’re interested in the probability of some event B relative to A. This is called the conditional probability of B with respect to A, and is denoted P(B | A) (read “the probability of B given A”).

To compute the conditional probability, simply scale \textup{P}(A \cap B) by the probability of the assumed event, \textup{P}(A) (which must be nonzero). That is,

\displaystyle \textup{P}(B | A) = \frac{\textup{P}(A \cap B)}{\textup{P}(A)}

Wikipedia provides a straightforward derivation of the formula, but the spirit of the proof is exactly what we said above. The denominator is the probability of our new sample space, and the numerator is the probability of outcomes that cause B to occur which also cause A to occur. Multiplying both sides of this formula by \textup{P}(A) gives \textup{P}(A \cap B) = \textup{P}(B | A) \textup{P}(A). Applying this identity with E_i in place of A and A in place of B to each term of the previous theorem, we arrive at another version of the theorem of total probability (valid when every \textup{P}(E_i) is nonzero):

 \displaystyle \textup{P}(A) = \sum_{i=1}^m \textup{P}(A | E_i) \textup{P}(E_i)

That is, if we know how to compute the probabilities of the E_i, and we know how likely A is to occur in each of those scenarios, then we can compute the total probability of A occurring independently of the E_i.
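The conditional form can be checked the same way. The sketch below again assumes a fair die, and `cond_prob` is a hypothetical helper that refuses to condition on an event of probability zero.

```python
from fractions import Fraction

omega = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def prob(event):
    """P(event) on the finite space omega."""
    return sum((omega[w] for w in event), Fraction(0))

def cond_prob(B, A):
    """P(B | A) = P(A and B) / P(A), defined only when P(A) is nonzero."""
    if prob(A) == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return prob(A & B) / prob(A)

partition = [{1}, {2, 3}, {4, 5, 6}]
A = {2, 4, 6}

print(cond_prob({2}, A))  # 1/3

# Total probability in conditional form: sum of P(A | E_i) P(E_i).
total = sum(cond_prob(A, E) * prob(E) for E in partition)
print(total == prob(A))  # True
```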

We can come up with loads of more or less trivial examples of the theorem of total probability on simple probability spaces. Say you play a craps-like game where you roll a die twice. If you get a one on the first roll, you lose, and otherwise you have to match your initial roll on the second to win. The probability you win can be analyzed with the theorem on total probability. We partition the sample space into events corresponding to the outcome of the first roll.

\displaystyle \textup{P}(\textup{Win}) = \sum_{i=1}^6 \textup{P}(\textup{Win } | \textup{ 1st roll }= i) \textup{P}(\textup{1st roll } = i)

The probability the first roll is i is 1/6, and if the first roll is a 1 then the probability of winning after that is zero. In the other 5 cases the conditional probability is the same regardless of i: to match i on the second roll has a 1/6 chance. So the probability of winning is

\displaystyle 5 \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{5}{36}
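The same number can be recovered by brute force. The sketch below enumerates all 36 equally likely pairs of rolls and also evaluates the conditioned sum directly; the function name `wins` is our own.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
rolls = list(product(range(1, 7), repeat=2))

def wins(first, second):
    # Lose on a first roll of 1; otherwise win by matching the first roll.
    return first != 1 and second == first

# Direct count of winning outcomes.
p_win = Fraction(sum(wins(a, b) for a, b in rolls), len(rolls))

# The same value via the theorem of total probability, conditioning on the first roll.
p_win_conditioned = sum(
    (Fraction(0) if i == 1 else Fraction(1, 6)) * Fraction(1, 6)
    for i in range(1, 7)
)
print(p_win, p_win_conditioned)
```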

For the working mathematician, these kinds of examples are relatively low-tech, but they illustrate the main way conditional probability is used in practice. We have some process we want to analyze, and we break it up into steps and condition on the results of a given step. We will see in a moment a more complicated example of this.

Partitions via Random Variables

The most common kind of partition is created via a random variable with finitely many values (or countably many, but we haven’t broached infinite probability spaces yet). In this case, we can partition the sample space \Omega based on the values of X. That is, for each value x = X(\omega), we will have a subset of the partition S_x be the set of all \omega which map to x. In the parlance of functions, it is the preimage of a single value x:

\displaystyle S_x = X^{-1}(x) = \left \{ \omega \in \Omega : X(\omega) = x\right \}

And as the reader is probably expecting, we can use this to define a “relative” expected value of a random variable. Recall that if the image of X is a finite set x_1, \dots, x_n, the expected value of X is a sum

\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)

Suppose X,Y are two such random variables. Then the conditional probability of X=x relative to the event Y=y is the quantity

\displaystyle \textup{P}(X=x | Y=y) = \frac{\textup{P}(X=x \textup{ and } Y=y)}{\textup{P}(Y=y)}

And the conditional expectation of X relative to the event Y = y, denoted \textup{E}(X | Y = y) is a similar sum

\displaystyle \textup{E}(X|Y=y) = \sum_{i=1}^n x_i \textup{P}(X = x_i | Y = y)

Indeed, just as we implicitly “defined” a new sample space when we were partitioning based on events, here we are defining a new random variable (with the odd notation X | Y=y) whose domain is the preimage Y^{-1}(y). We can then ask what the probability of it assuming a value x is, and moreover what its expected value is.

Of course there is an analogue to the theorem of total probability lurking here. We want to say something like the true expected value of X is a sum of the conditional expectations over all possible values of Y. We have to remember, though, that different values of y can occur with very different probabilities, and the expected values of X | Y=y can change wildly between them. Just as a quick (and morbid) example, if X is the number of people who die on a randomly chosen day, and Y is the number of atomic bombs dropped on that day, it is clear that the probability of Y being positive is quite small, and the expected value of X | Y=y will be dramatically larger if y is positive than if it’s zero. (A few quick calculations based on tragic historic events show it would roughly double, using contemporary non-violent death rate estimates.)

And so instead of simply summing the expectation, we need to take an expectation over the values of Y. Thinking again of X | Y=y as a random variable based on values of Y, it makes sense mathematically to take expectation. To distinguish between the two types of expectation, we will subscript the variable being “expected,” as in \textup{E}_X(X|Y). That is, we have the following theorem.

Theorem: The expected value of X satisfies

\textup{E}_X(X) = \textup{E}_Y(\textup{E}_X(X|Y))

Proof. Expand the definitions of what these values mean, and use the definition of conditional probability \textup{P}(A \cap B) = \textup{P}(A | B) \textup{P}(B). We leave the proof as a trivial exercise to the reader, but if one cannot bear it, see Wikipedia for a full proof. \square
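For readers who would rather check the theorem numerically than prove it, here is a sketch on the two-dice space. Taking X to be the sum of the dice and Y the first die is our own illustrative choice, as are the helper names.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))          # uniform mass on each outcome

X = lambda w: w[0] + w[1]            # sum of the two dice
Y = lambda w: w[0]                   # value of the first die

def expectation(rv):
    return sum(rv(w) * p for w in omega)

def cond_expectation(rv, given, y):
    """E(rv | given = y), computed on the preimage given^{-1}(y)."""
    fiber = [w for w in omega if given(w) == y]
    return sum(rv(w) * p for w in fiber) / (len(fiber) * p)

def p_Y(y):
    return sum(p for w in omega if Y(w) == y)

# E_Y(E_X(X | Y)): weight each conditional expectation by P(Y = y).
tower = sum(cond_expectation(X, Y, y) * p_Y(y) for y in range(1, 7))
print(tower, expectation(X))
```

Note that the conditional expectations differ from one value of y to the next (here E(X | Y = y) = y + 7/2), and it is only their weighted average that recovers E(X).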

Let’s wrap up this post with a non-trivial example of all of this theory in action.

A Nontrivial Example: the Galton-Watson Branching Process

We are interested (as was the eponymous Sir Francis Galton in the 1800′s) in the survival of surnames through generations of marriage and children. The main tool to study such a generational phenomenon is the Galton-Watson branching process. The idea is quite simple, but its analysis quickly blossoms into a rich and detailed theoretical puzzle and a more general practical tool. Just before we get too deep into things, we should note that these ideas (along with other types of branching processes) are used to analyze a whole host of problems in probability theory and computer science. A few the author has recently been working with are the evolution of random graphs and graph property testing.

The gist is as follows: say we live in a patriarchal society in which surnames are passed down on the male side. We can imagine a family tree being grown step by step in this way: at the root there is a single male, and he has k children, some of whom are girls and some of whom are boys. They all go on to have some number of children, but only the men pass on the family name to their children, and only their male children pass on the family name further. If we only record the family tree along the male lines, we can ask whether the tree will be finite; that is, whether the family name will die out.

To make this rigorous, let us define an infinite sequence of random variables X_1, X_2, \dots which represent the number of children each person in the tree has, and suppose further that all of these variables are independent and uniformly distributed on 1, \dots, n for some fixed n. This may be an unrealistic assumption, but it makes the analysis a bit simpler. The number of children more likely follows a Poisson distribution where the mean is a parameter we would estimate from real-world data, but we haven’t spoken of Poisson distributions on this blog yet so we will leave it out.

We further imagine the tree growing step by step: at step i the i-th individual in the tree has X_i children and then dies. If the individual is a woman we by default set X_i = 0. We can recursively describe the size of the tree at each step by another random variable Y_i. Clearly Y_0 = 1, and the recursion is Y_i = Y_{i-1} + X_i - 1. In words, Y_i represents the current living population with the given surname. We say the tree is finite (the family name dies off) if for some i we get Y_i = 0. The first time at which this happens is when the family name dies off, but abstractly we can imagine the sequence of random variables continuing forever. This is sometimes called fictitious continuation.

At last, we assume that the probability of having a boy or girl is a split 1/2. Now we can start asking questions. What is the probability that the surname dies off? What is the expected size of the tree in terms of n?

For the first question we use the theorem of total probability. In particular, suppose the first person has two boys. Then the whole tree is finite precisely when both boys’ sub-trees are finite. Indeed, the two boys’ sub-trees are independent of one another, and so the probability of both being finite is the product of the probabilities of each being finite. That is, more generally

\displaystyle \textup{P}(\textup{finite } | k \textup{ boys}) = \textup{P}(\textup{finite})^k

Setting z = \textup{P}(\textup{the tree is finite}), we can compute z directly by conditioning on all possibilities of the first person’s children. Notice how we must condition twice here.

\displaystyle z = \sum_{i=1}^n \sum_{k=0}^i \textup{P}(k \textup{ boys } | i \textup{ children}) \textup{P}(i \textup{ children}) z^k

The outer sum starts at i=1 because the number of children is uniformly distributed on 1, \dots, n. The probability of getting k boys is the same as flipping i coins and getting k heads, which is just

\displaystyle \textup{P}(k \textup{ boys } | i \textup{ children}) = \binom{i}{k}\frac{1}{2^i}

So the equation is

\displaystyle z = \sum_{i=1}^n \sum_{k=0}^i \binom{i}{k} \frac{1}{2^i} \cdot \frac{1}{n} z^k

By the binomial theorem the inner sum over k collapses to ((1+z)/2)^i, and so we’ve reduced the problem down to picking the correct root of a polynomial. For example, when n=4, multiplying both sides by 64 gives the polynomial equation

\displaystyle 64z = 15 + 26z + 16z^2 + 6z^3 + z^4

We have to be a bit careful here, though. Not all solutions to this equation are valid answers. For instance, the roots must be between 0 and 1 (inclusive), and z=1 is always a solution because the probabilities on the right-hand side sum to 1. If there are multiple roots in this range then one must rule out the irrelevant roots by some additional argument. Moreover, we would need to use a calculus argument to determine when there is a solution strictly between 0 and 1 in the first place. But after all that is done, we can estimate the correct root computationally (or solve for it exactly when our polynomials have small degree). Here for n=4, the probability of being finite is about 0.552.
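One way to estimate the root computationally is fixed-point iteration. The sketch below assumes the model as stated (children uniform on 1, \dots, n, so the outer sum runs over i = 1, \dots, n, and each child is a boy with probability 1/2) and uses the binomial theorem to write the inner sum as ((1+z)/2)^i. Starting from z = 0, the iterates increase toward the smallest nonnegative solution, which is the standard choice of root for branching processes. The function name and the iteration count are our own choices, not something prescribed above.

```python
def extinction_probability(n, iterations=200):
    """Iterate z -> (1/n) * sum_{i=1}^{n} ((1 + z) / 2)**i, starting from z = 0.

    A sketch under the assumptions stated in the text: children uniform on
    1..n and each child independently a boy with probability 1/2.
    """
    z = 0.0
    for _ in range(iterations):
        z = sum(((1 + z) / 2) ** i for i in range(1, n + 1)) / n
    return z

print(extinction_probability(4))
```

The fixed number of iterations is a simplification; convergence slows considerably when the expected number of boys per person, (n+1)/4, is close to 1 (that is, near n = 3), so a tolerance-based stopping rule would be more robust there.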

We leave the second question, on the expected size of the tree, for the reader to ponder. Next time we’ll devote an entire post to Bayes’ Theorem (a trivial consequence of the definition of conditional probability), and see how it helps us compute probabilities for use in programs.

Until then!

Probability Theory — A Primer

It is a wonder that we have yet to officially write about probability theory on this blog. Probability theory underlies a huge portion of artificial intelligence, machine learning, and statistics, and a number of our future posts will rely on the ideas and terminology we lay out in this post. Our first formal theory of machine learning will be deeply ingrained in probability theory, we will derive and analyze probabilistic learning algorithms, and our entire treatment of mathematical finance will be framed in terms of random variables.

And so it’s about time we got to the bottom of probability theory. In this post, we will begin with a naive version of probability theory. That is, everything will be finite and framed in terms of naive set theory without the aid of measure theory. This has the benefit of making the analysis and definitions simple. The downside is that we are restricted in what kinds of probability we are allowed to speak of. For instance, we aren’t allowed to work with probabilities defined on all real numbers. But for the majority of our purposes on this blog, this treatment will be enough. Indeed, most programming applications restrict infinite problems to finite subproblems or approximations (although in their analysis we often appeal to the infinite).

We should make a quick disclaimer before we get into the thick of things: this primer is not meant to connect probability theory to the real world. Indeed, to do so would be decidedly unmathematical. We are primarily concerned with the mathematical formalisms involved in the theory of probability, and we will leave the philosophical concerns and applications to  future posts. The point of this primer is simply to lay down the terminology and basic results needed to discuss such topics to begin with.

So let us begin with probability spaces and random variables.

Finite Probability Spaces

We begin by defining probability as a set with an associated function. The intuitive idea is that the set consists of the outcomes of some experiment, and the function gives the probability of each outcome happening. For example, a set \left \{ 0,1 \right \} might represent heads and tails outcomes of a coin flip, while the function assigns a probability of one half (or some other numbers) to the outcomes. As usual, this is just intuition and not rigorous mathematics. And so the following definition will lay out the necessary condition for this probability to make sense.

Definition: A finite set \Omega equipped with a function f: \Omega \to [0,1] is a probability space if the function f satisfies the property

\displaystyle \sum_{\omega \in \Omega} f(\omega) = 1

That is, the sum of all the values of f must be 1.

Sometimes the set \Omega is called the sample space, and the act of choosing an element of \Omega according to the probabilities given by f is called drawing an example. The function f is usually called the probability mass function. Despite being part of our first definition, the probability mass function is relatively useless except to build what follows, because we don’t really care about the probability of a single outcome as much as we do the probability of an event.

Definition: An event E \subset \Omega is a subset of a sample space.

For instance, suppose our probability space is \Omega = \left \{ 1, 2, 3, 4, 5, 6 \right \} and f is defined by setting f(x) = 1/6 for all x \in \Omega (here the “experiment” is rolling a single die). Then we are likely interested in more exquisite kinds of outcomes; instead of asking the probability that the outcome is 4, we might ask what is the probability that the outcome is even? This event would be the subset \left \{ 2, 4, 6 \right \}, and if any of these are the outcome of the experiment, the event is said to occur. In this case we would expect the probability of the die roll being even to be 1/2 (but we have not yet formalized why this is the case).

As a quick exercise, the reader should formulate a two-dice experiment in terms of sets. What would the probability space consist of as a set? What would the probability mass function look like? What are some interesting events one might consider (if playing a game of craps)?

Of course, we want to extend the probability mass function f (which is only defined on single outcomes) to all possible events of our probability space. That is, we want to define a probability measure \textup{P}: 2^\Omega \to \mathbb{R}, where 2^\Omega denotes the set of all subsets of \Omega. The example of a die roll guides our intuition: the probability of any event should be the sum of the probabilities of the outcomes contained in it. That is, we define

\displaystyle \textup{P}(E) = \sum_{e \in E} f(e)

where by convention the empty sum has value zero. Note that the function \textup{P} is often denoted \textup{Pr}.

So for example, the coin flip experiment can’t have zero probability for both of the two outcomes 0 and 1; the sum of the probabilities of all outcomes must sum to 1. More coherently: \textup{P}(\Omega) = \sum_{\omega \in \Omega} f(\omega) = 1 by the defining property of a probability space. And so if there are only two outcomes of the experiment, then they must have probabilities p and 1-p for some p. Such a probability space is often called a Bernoulli trial.

Now that the function \textup{P} is defined on all events, we can simplify our notation considerably. Because the probability mass function f uniquely determines \textup{P} and because \textup{P} contains all information about f in it (\textup{P}(\left \{ \omega \right \}) = f(\omega)), we may speak of \textup{P} as the probability measure of \Omega, and leave f out of the picture. Of course, when we define a probability measure, we will allow ourselves to just define the probability mass function and the definition of \textup{P} is understood as above.

There are some other quick properties we can state or prove about probability measures: \textup{P}(\left \{ \right \}) = 0 by convention, if E, F are disjoint then \textup{P}(E \cup F) = \textup{P}(E) + \textup{P}(F), and if E \subset F \subset \Omega then \textup{P}(E) \leq \textup{P}(F). The proofs of these facts are trivial, but a good exercise for the uncomfortable reader to work out.
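In code, a finite probability space is little more than a dictionary from outcomes to masses. The sketch below builds \textup{P} from a mass function and spot-checks the properties just listed on a fair die; the name `make_measure` and the particular events are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def make_measure(mass):
    """Return the probability measure P for a finite space.

    `mass` is a dict mapping each outcome to f(outcome); the values must sum to 1.
    """
    if sum(mass.values()) != 1:
        raise ValueError("masses must sum to 1")

    def P(event):
        # The empty sum is zero by convention.
        return sum((mass[w] for w in event), Fraction(0))

    return P

die = {w: Fraction(1, 6) for w in range(1, 7)}
P = make_measure(die)

even, low = {2, 4, 6}, {1, 3}        # two disjoint events
print(P(even), P(set()), P(even | low))
```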

Random Variables

The next definition is crucial to the entire theory. In general, we want to investigate many different kinds of random quantities on the same probability space. For instance, suppose we have the experiment of rolling two dice. The probability space would be

\displaystyle \Omega = \left \{ (1,1), (1,2), (1,3), \dots, (6,4), (6,5), (6,6) \right \}

Where the probability measure is defined uniformly by setting all single outcomes to have probability 1/36. Now this probability space is very general, but rarely are we interested only in its events. If this probability space were interpreted as part of a game of craps, we would likely be more interested in the sum of the two dice than the actual numbers on the dice. In fact, we are really more interested in the payoff determined by our roll.

Sums of numbers on dice are certainly predictable, but a payoff can conceivably be any function of the outcomes. In particular, it should be a function of \Omega because all of the randomness inherent in the game comes from the generation of an output in \Omega (otherwise we would define a different probability space to begin with).

And of course, we can compare these two different quantities (the amount of money and the sum of the two dice) within the framework of the same probability space. This “quantity” we speak of goes by the name of a random variable.

Definition: A random variable X is a real-valued function on the sample space, X: \Omega \to \mathbb{R}.

So for example the random variable for the sum of the two dice would be X(a,b) = a+b. We will slowly phase out the function notation as we go, reverting to it when we need to avoid ambiguity.

We can further define the set of all random variables \textup{RV}(\Omega). It is important to note that this forms a vector space. For those readers unfamiliar with linear algebra, the salient fact is that we can add two random variables together and multiply them by arbitrary constants, and the result is another random variable. That is, if X, Y are two random variables, so is aX + bY for real numbers a, b. This function operates linearly, in the sense that its value is (aX + bY)(\omega) = aX(\omega) + bY(\omega). We will use this property quite heavily, because in most applications the analysis of a random variable begins by decomposing it into a combination of simpler random variables.

Of course, there are plenty of other things one can do to functions. For example, XY is the product of two random variables (defined by XY(\omega) = X(\omega)Y(\omega)) and one can imagine such awkward constructions as X/Y or X^Y. We will see in a bit why these last two aren’t often used (it is difficult to say anything about them).

The simplest possible kind of random variable is one which identifies events as either occurring or not. That is, for an event E, we can define a random variable which is 0 or 1 depending on whether the input is a member of E. That is,

Definition: An indicator random variable 1_E is defined by setting 1_E(\omega) = 1 when \omega \in E and 0 otherwise. A common abuse of notation for singleton sets is to denote 1_{\left \{ \omega \right \} } by 1_\omega.

This is what we intuitively do when we compute probabilities: to get a ten when rolling two dice, one can either get a six, a five, or a four on the first die, and then the second die must match it to add to ten.

The most important thing about breaking up random variables into simpler random variables will make itself clear when we see that expected value is a linear functional. That is, probabilistic computations of linear combinations of random variables can be computed by finding the values of the simpler pieces. We can’t yet make that rigorous though, because we don’t yet know what it means to speak of the probability of a random variable’s outcome.

Definition: Denote by \left \{ X = k \right \} the set of outcomes \omega \in \Omega for which X(\omega) = k. With the function notation, \left \{ X = k \right \} = X^{-1}(k).

This definition extends to constructing ranges of outcomes of a random variable. i.e., we can define \left \{ X < 5 \right \} or \left \{ X \textup{ is even} \right \} just as we would naively construct sets. It works in general for any subset of S \subset \mathbb{R}. The notation is \left \{ X \in S \right \} = X^{-1}(S), and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure \textup{P}. That is, we want to write things like \textup{P}(X \textup{ is even}) and read it in our head “the probability that X is even”.

This is made rigorous by simply setting

\displaystyle \textup{P}(X \in S) = \sum_{\omega \in X^{-1}(S)} \textup{P}(\omega)

In words, it is just the sum of the probabilities that individual outcomes will have a value under X that lands in S. We will also use for \textup{P}(\left \{ X \in S \right \} \cap \left \{ Y \in T \right \}) the shorthand notation \textup{P}(X \in S, Y \in T) or \textup{P}(X \in S \textup{ and } Y \in T).

Oftentimes \left \{ X \in S \right \} will be smaller than \Omega itself, even if S is large. For instance, let the probability space be the set of possible lottery numbers for one week’s draw of the lottery (with uniform probabilities), and let X be the profit function. Then \textup{P}(X > 0) is very small indeed.

We should also note that because our probability spaces are finite, the image of the random variable \textup{im}(X) is a finite subset of real numbers. Moreover, the set of all events of the form \left \{ X = x_i \right \} where x_i \in \textup{im}(X) forms a partition of \Omega. As such, we get the following immediate identity:

\displaystyle 1 = \sum_{x_i \in \textup{im} (X)} \textup{P}(X = x_i)

The set of such events is called the probability distribution of the random variable X.
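To make the notation concrete, the following sketch tabulates \textup{P}(X = x) for the sum of two fair dice by grouping outcomes according to their value under X. The helper name `distribution` is ours.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice
p_outcome = Fraction(1, 36)                    # uniform probability of each outcome

def distribution(rv):
    """Map each value x in im(rv) to P(rv = x)."""
    dist = defaultdict(Fraction)               # missing keys start at Fraction(0)
    for w in omega:
        dist[rv(w)] += p_outcome
    return dict(dist)

dist = distribution(lambda w: w[0] + w[1])
for value in sorted(dist):
    print(value, dist[value])
```

Each key of `dist` corresponds to one block \left \{ X = x \right \} of the partition, so the values necessarily sum to 1.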

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. We say that two events E,F \subset \Omega are independent if the probability of both E, F occurring is the product of the probabilities of each event occurring. That is, \textup{P}(E \cap F) = \textup{P}(E)\textup{P}(F). There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way. One should note that this is distinct from E,F being disjoint as sets, because there may be a zero-probability outcome in both sets.

The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events. In particular, X,Y are independent random variables if

\displaystyle \textup{P}(X = x, Y = y) = \textup{P}(X=x)\textup{P}(Y=y)

for all x,y \in \mathbb{R}.

Expectation

We now turn to notions of expected value and variance, which form the cornerstone of the applications of probability theory.

Definition: Let X be a random variable on a finite probability space \Omega. The expected value of X, denoted \textup{E}(X), is the quantity

\displaystyle \textup{E}(X) = \sum_{\omega \in \Omega} X(\omega) \textup{P}(\omega)

Note that if we label the image of X by x_1, \dots, x_n then this is equivalent to

\displaystyle \textup{E}(X) = \sum_{i=1}^n x_i \textup{P}(X = x_i)

The most important fact about expectation is that it is a linear functional on random variables. That is,

Theorem: If X,Y are random variables on a finite probability space and a,b \in \mathbb{R}, then

\displaystyle \textup{E}(aX + bY) = a\textup{E}(X) + b\textup{E}(Y)

Proof. The only real step in the proof is to note that for each possible pair of values x, y in the images of X,Y resp., the events E_{x,y} = \left \{ X = x, Y=y \right \} form a partition of the sample space \Omega. That is, because aX + bY has a constant value on E_{x,y}, the second definition of expected value gives

\displaystyle \textup{E}(aX + bY) = \sum_{x \in \textup{im} (X)} \sum_{y \in \textup{im} (Y)} (ax + by) \textup{P}(X = x, Y = y)

and a little bit of algebraic elbow grease reduces this expression to a\textup{E}(X) + b\textup{E}(Y). We leave this as an exercise to the reader, with the additional note that the sum \sum_{y \in \textup{im}(Y)} \textup{P}(X = x, Y = y) is identical to \textup{P}(X = x). \square

If we additionally know that X,Y are independent random variables, then the same technique used above allows one to say something about the expectation of the product \textup{E}(XY) (again by definition, XY(\omega) = X(\omega)Y(\omega)). In this case \textup{E}(XY) = \textup{E}(X)\textup{E}(Y). We leave the proof as an exercise to the reader.
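Both facts are easy to test on the two-dice space. In this sketch X and Y are the two individual dice (which are independent) and S is their sum (which is not independent of X); these choices and the coefficients are ours, made only for illustration.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice, uniform

def E(rv):
    """Expected value: sum of rv(w) * P(w) over all outcomes w."""
    return Fraction(1, 36) * sum(rv(w) for w in omega)

X = lambda w: w[0]            # first die
Y = lambda w: w[1]            # second die, independent of X
S = lambda w: w[0] + w[1]     # the sum, NOT independent of X

# Linearity holds with no independence assumption.
print(E(lambda w: 3 * X(w) - 2 * S(w)), 3 * E(X) - 2 * E(S))

# The product rule needs independence: it holds for X, Y but fails for X, S.
print(E(lambda w: X(w) * Y(w)), E(X) * E(Y))
print(E(lambda w: X(w) * S(w)), E(X) * E(S))
```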

Now intuitively the expected value of a random variable is the “center” of the values assumed by the random variable. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that \textup{E}(X) \in \textup{im}(X). For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

\displaystyle 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}

But the random variable never achieves this value. Nevertheless, it would not make intuitive sense to call either 2 or 3 the “center” of the random variable (for both 2 and 3, there are two outcomes on one side and one on the other).

Let’s see a nice application of the linearity of expectation to a purely mathematical problem. The power of this example lies in the method: after a shrewd decomposition of a random variable X into simpler (usually indicator) random variables, the computation of \textup{E}(X) becomes trivial.

A tournament T is a directed graph in which every pair of distinct vertices has exactly one edge between them (going one direction or the other). We can ask whether such a graph has a Hamiltonian path, that is, a path through the graph which visits each vertex exactly once. The datum of such a path is a list of numbers (v_1, \dots, v_n), where we visit vertex v_i at stage i of the traversal. The condition for this to be a valid Hamiltonian path is that (v_i, v_{i+1}) is an edge in T for all i.

Now if we construct a tournament on n vertices by choosing the direction of each edge independently with equal probability 1/2, then we have a very nice probability space and we can ask what is the expected number of Hamiltonian paths. That is, X is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in \textup{E}(X).

To compute this, simply note that we can break X = \sum_p X_p, where p ranges over all possible lists of the vertices and X_p is the indicator random variable for the event that p is a Hamiltonian path. Then \textup{E}(X) = \sum_p \textup{E}(X_p), and it suffices to compute the number of possible paths and the expected value of X_p for any given path p, which is just the probability that p is a Hamiltonian path. It isn’t hard to see the number of paths is n! as this is the number of possible lists of n items. Because each edge direction is chosen with probability 1/2 and they are all chosen independently of one another, the probability that any given path forms a Hamiltonian path depends on whether each edge was chosen with the correct orientation. That’s just

\textup{P}(\textup{first edge and second edge and } \dots \textup{ and last edge})

which by independence is

\displaystyle \prod_{i = 1}^{n-1} \textup{P}(i^\textup{th} \textup{ edge is chosen}) = \frac{1}{2^{n-1}}

That is, the expected number of Hamiltonian paths is n!2^{-(n-1)}.
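For small n the claim can be verified exhaustively: enumerate every tournament on n vertices (there are 2^{n(n-1)/2} of them, all equally likely), count the Hamiltonian paths in each, and average. The brute-force sketch below is only feasible for tiny n, and the function name is our own.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import factorial

def expected_hamiltonian_paths(n):
    """Average number of Hamiltonian paths over all tournaments on n vertices."""
    pairs = list(combinations(range(n), 2))
    total = 0
    for bits in product([0, 1], repeat=len(pairs)):
        # Bit 1 orients the pair (u, v) as u -> v; bit 0 orients it as v -> u.
        edges = {(u, v) if b else (v, u) for (u, v), b in zip(pairs, bits)}
        # Count the vertex orderings whose consecutive pairs are all edges.
        total += sum(
            all((p[i], p[i + 1]) in edges for i in range(n - 1))
            for p in permutations(range(n))
        )
    return Fraction(total, 2 ** len(pairs))

print(expected_hamiltonian_paths(4), Fraction(factorial(4), 2 ** 3))
```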

Variance and Covariance

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable X are throughout the real line.

Definition: The variance of a random variable X is the quantity \textup{E}((X - \textup{E}(X))^2).

That is, \textup{E}(X) is a number, and so X - \textup{E}(X) is the random variable defined by (X - \textup{E}(X))(\omega) = X(\omega) - \textup{E}(X). It is the expectation of the square of the deviation of X from its expected value.

One often denotes the variance by \textup{Var}(X) or \sigma^2. The square is for silly reasons: the standard deviation, denoted \sigma and equivalent to \sqrt{\textup{Var}(X)} has the same “units” as the outcomes of the experiment and so it’s preferred as the “base” frame of reference by some. We won’t bother with such physical nonsense here, but we will have to deal with the notation.

The variance operator has a few properties that make it quite different from expectation, but nonetheless fall out directly from the definition. We encourage the reader to prove a few:

  • \textup{Var}(X) = \textup{E}(X^2) - \textup{E}(X)^2.
  • \textup{Var}(aX) = a^2\textup{Var}(X).
  • When X,Y are independent then variance is additive: \textup{Var}(X+Y) = \textup{Var}(X) + \textup{Var}(Y).
  • Variance is invariant under constant additives: \textup{Var}(X+c) = \textup{Var}(X).
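Before proving these, one can spot-check all four on a concrete space. The sketch below uses two fair dice, with X and Y the individual dice (so they are independent); the constants 3 and 5 are arbitrary illustrative values.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice, uniform

def E(rv):
    return Fraction(1, 36) * sum(rv(w) for w in omega)

def Var(rv):
    """Variance straight from the definition: E((X - E(X))^2)."""
    mu = E(rv)
    return E(lambda w: (rv(w) - mu) ** 2)

X = lambda w: w[0]   # first die
Y = lambda w: w[1]   # second die, independent of X

print(Var(X), E(lambda w: X(w) ** 2) - E(X) ** 2)         # first property
print(Var(lambda w: 3 * X(w)), 9 * Var(X))                # second property
print(Var(lambda w: X(w) + Y(w)), Var(X) + Var(Y))        # third property
print(Var(lambda w: X(w) + 5), Var(X))                    # fourth property
```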

In addition, the quantity \textup{Var}(aX + bY) is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables. The formal name for this is covariance.

Definition: Let X,Y be random variables. The covariance of X and Y, denoted \textup{Cov}(X,Y), is the quantity \textup{E}((X - \textup{E}(X))(Y - \textup{E}(Y))).

Note the similarities between the variance definition and this one: if X=Y then the two quantities coincide. That is, \textup{Cov}(X,X) = \textup{Var}(X).

There is a nice interpretation to covariance that should accompany every treatment of probability: it measures the extent to which one random variable “follows” another. To make this rigorous, we need to derive a special property of the covariance.

Theorem: Let X,Y be random variables with variances \sigma_X^2, \sigma_Y^2. Then their covariance is at most the product of the standard deviations in magnitude:

|\textup{Cov}(X,Y)| \leq \sigma_X \sigma_Y

Proof. Take any two non-constant random variables X and Y (we will replace these later with X - \textup{E}(X), Y - \textup{E}(Y)). Construct a new random variable (tX + Y)^2 where t is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. That is, \textup{E}((tX + Y)^2) \geq 0. Expanding this and using linearity gives

\displaystyle f(t) = t^2 \textup{E}(X^2) + 2t \textup{E}(XY) + \textup{E}(Y^2) \geq 0

This is a quadratic function of a single variable t which is nonnegative. From elementary algebra this means the discriminant is at most zero. i.e.

\displaystyle 4 \textup{E}(XY)^2 - 4 \textup{E}(X^2) \textup{E}(Y^2) \leq 0

and so dividing by 4 and replacing X,Y with X - \textup{E}(X), Y - \textup{E}(Y), resp., gives

\textup{Cov}(X,Y)^2 \leq \sigma_X^2 \sigma_Y^2

and the result follows. \square

Note that equality holds in the discriminant formula precisely when Y = -tX (the discriminant is zero), and after the replacement this translates to Y - \textup{E}(Y) = -t(X - \textup{E}(X)) for some fixed value of t. In other words, for some real numbers a,b we have Y = aX + b.

This has important consequences even in English: the covariance is maximized when Y is a linear function of X, and otherwise is bounded from above and below. By dividing both sides of the inequality by \sigma_X \sigma_Y we get the following definition:

Definition: The Pearson correlation coefficient of two random variables X,Y is defined by

\displaystyle r= \frac{\textup{Cov}(X,Y)}{\sigma_X \sigma_Y}

If r is close to 1, we call X and Y positively correlated. If r is close to -1 we call them negatively correlated, and if r is close to zero we call them uncorrelated.
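Here is a small sketch that computes r on the two-dice space for a few pairs of random variables. It uses floating point because of the square root, and the example pairs (an exact linear function, its negation, an independent die, and the sum of the dice) are our own illustrative choices.

```python
from itertools import product
from math import sqrt

omega = list(product(range(1, 7), repeat=2))   # two fair dice, uniform

def E(rv):
    return sum(rv(w) for w in omega) / len(omega)

def cov(X, Y):
    """Covariance from the definition: E((X - E(X))(Y - E(Y)))."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

def pearson(X, Y):
    # sigma_X * sigma_Y = sqrt(Var(X) * Var(Y)); both variances must be nonzero.
    return cov(X, Y) / sqrt(cov(X, X) * cov(Y, Y))

first = lambda w: w[0]
second = lambda w: w[1]

print(pearson(first, lambda w: 2 * w[0] + 1))    # Y = 2X + 1
print(pearson(first, lambda w: -w[0]))           # Y = -X
print(pearson(first, second))                    # independent dice
print(pearson(first, lambda w: w[0] + w[1]))     # X against the sum
```

The last pair lands strictly between 0 and 1: the sum "follows" the first die, but only partially.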

The idea is that if two random variables are positively correlated, then a higher value for one variable (with respect to its expected value) corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other. The picture is as follows:

[Figure: covariance]

The horizontal axis plots a sample of values of the random variable X and the vertical plots a sample of Y. The linear correspondence is clear. Of course, all of this must be taken with a grain of salt: this correlation coefficient is only appropriate for analyzing random variables which have a linear correlation. There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them.

Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various (continuous, but the issue still applies to the finite case) probability distributions:

Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. In one sense, the closer that the correlation coefficient is to 1, the better a linear predictor will perform in “guessing” values of Y given values of X (same goes for -1, but the predictor has negative slope).

But this strays a bit far from our original point: we still want to find a formula for \textup{Var}(aX + bY). Expanding the definition, it is not hard to see that this amounts to the following proposition:

Proposition: The variance operator satisfies

\displaystyle \textup{Var}(aX+bY) = a^2\textup{Var}(X) + b^2\textup{Var}(Y) + 2ab \textup{Cov}(X,Y)

And using induction we get a general formula:

\displaystyle \textup{Var} \left ( \sum_{i=1}^n a_i X_i \right ) = \sum_{i=1}^n \sum_{j = 1}^n a_i a_j \textup{Cov}(X_i,X_j)

Note that in the general sum, we get a bunch of terms \textup{Cov}(X_i,X_i) = \textup{Var}(X_i).
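Both formulas are easy to sanity-check numerically. The sketch below assumes equally likely outcomes (so expectations are plain averages) and uses made-up variables and coefficients; the helper name cov is our own.

```python
import numpy as np

rng = np.random.default_rng(1)
# Rows are hypothetical variables X_1, X_2, X_3; columns are equally likely outcomes.
samples = rng.normal(size=(3, 500))
samples[2] += 0.8 * samples[0]           # make X_3 depend on X_1
coeffs = np.array([2.0, -3.0, 0.5])      # the a_i

def cov(u, v):
    """Cov(U, V) = E[(U - E U)(V - E V)] with uniform weights."""
    return np.mean((u - u.mean()) * (v - v.mean()))

# General formula: Var(sum_i a_i X_i) versus the double sum of covariances.
combined = coeffs @ samples
lhs = cov(combined, combined)
rhs = sum(coeffs[i] * coeffs[j] * cov(samples[i], samples[j])
          for i in range(3) for j in range(3))

# Two-variable proposition: Var(aX + bY).
a, b = coeffs[0], coeffs[1]
x, y = samples[0], samples[1]
two_lhs = cov(a * x + b * y, a * x + b * y)
two_rhs = a**2 * cov(x, x) + b**2 * cov(y, y) + 2 * a * b * cov(x, y)
```

In both cases the two sides agree up to floating point rounding, since the identity is purely algebraic and holds for any finite distribution.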

Another way to look at the linear relationships between a collection of random variables is via a covariance matrix.

Definition: The covariance matrix of a collection of random variables X_1, \dots, X_n is the matrix whose (i,j) entry is \textup{Cov}(X_i,X_j).

As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and positive semidefinite, and so by the spectral theorem it has an orthonormal basis of eigenvectors with nonnegative eigenvalues, which allows us to diagonalize it. In more direct words: we can form a new collection of random variables Y_j (which are linear combinations of the original variables X_i) such that the covariances of distinct pairs Y_j, Y_k are all zero. In one sense, this is the “best perspective” with which to analyze the random variables. We gave a general algorithm to do this in our program gallery, and the technique is called principal component analysis.
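Here is a minimal sketch of that diagonalization using NumPy. It is not the program gallery code; the data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Rows are hypothetical variables X_1, X_2, X_3; columns are equally likely outcomes.
samples = rng.normal(size=(3, 1000))
samples[1] += 2.0 * samples[0]       # X_2 depends on X_1
samples[2] -= samples[1]             # X_3 depends on X_2

# Covariance matrix: the (i, j) entry is Cov(X_i, X_j). bias=True divides by N.
C = np.cov(samples, bias=True)

# eigh is meant for symmetric matrices; the columns of `eigenvectors` are orthonormal.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# New variables Y_j = sum_i eigenvectors[i, j] * X_i.
new_samples = eigenvectors.T @ samples
C_new = np.cov(new_samples, bias=True)
```

The matrix C_new comes out diagonal (up to rounding) with the eigenvalues of C on the diagonal, so the new variables Y_j are pairwise uncorrelated and their variances are the eigenvalues.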

Next Up

So far in this primer we’ve seen a good chunk of the kinds of theorems one can prove in probability theory. Fortunately, much of what we’ve said for finite probability spaces holds for infinite (discrete) probability spaces and has natural analogues for continuous probability spaces.

Next time, we’ll investigate how things change for discrete probability spaces, and should we need it, we’ll follow that up with a primer on continuous probability. This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.

Until then!

The Fourier Transform — A Primer

In our last primer we saw the Fourier series, which fleshed out the notion that a periodic function can be represented as an infinite series of sines and cosines. While this is fine and dandy, and quite a powerful tool, it does not suffice for the real world. In the real world, very little is truly periodic, especially since human measurements can only record a finite period of time. Even things we wish to explore on this humble blog are hardly periodic (for instance, image analysis). In order to compensate for this, we will develop the Fourier transform, which can be thought of as a limiting case of the Fourier series.

We will approach the Fourier transform from two different angles. First, we will take the “limiting case of the Fourier series” notion as far as it will take us (this will be quite far) to motivate the appropriate definitions. In this naive world, we will perform a few interesting computations, and establish the most useful properties of the Fourier transform as an operation. On the other hand, this naive world will be fraught with hand-waving and mathematical laxity. We will make statements that analysts (and people who know about the issues of convergence) would find uncouth.

And so we will redevelop the bulk of the theory from the ground up, and define the Fourier transform on a larger class of things called distributions. In doing so, we will circumvent (or rather, properly handle) all of the issues of convergence.

Fourier and Naivete

The Fourier transform is best thought of as an operation on functions which has some nice properties. One such property is linearity, and a more complex one is its effect on the convolution operation. But if one wants to know where the transform comes from, and this is an important thing to know, then we could interpret it as a generalization of the Fourier series coefficient. Similarly, its inverse can be thought of as a generalization of the Fourier series representation (or, the map taking a function to its Fourier series representation). The generalization we speak of, and the “limiting case” we mentioned earlier, is in the size of the period. In rough terms,

The Fourier transform is the limit of the Fourier coefficient as the period of the function tends to infinity.

This is how we will develop the definition of the Fourier transform, and the reader should understand why this is a sensible place to start: a function which has no period is simply a function which has an infinitely large period. Colloquially, one can “spot” periodicity much more easily if the period is shorter, and as the period increases, functions start to “look” less and less periodic. So in the limiting case, there is no period.

In order to do this correctly, we should alter some of the definitions we made in our post on Fourier series. Specifically, we want to have an arbitrary period T. So instead of making the 1-periodic complex exponential e^{2\pi i k t} our basic building block, we choose e^{2 \pi i t k/T}. The reader will check that as k varies over all integers, these new complex exponentials still form an orthonormal basis of L^2 (well not quite, we have to modify the inner product slightly; see below). Then, using the notation we used last time for the Fourier coefficients c_k, the series is calculated as

\displaystyle \sum_{k = -\infty}^{\infty} c_k e^{\frac{2 \pi i k t}{T}}

where the c_k are computed via the new inner product:

\displaystyle c_k = \frac{1}{T}\int_{0}^{T}e^{\frac{-2 \pi i k t}{T}} f(t)dt

We make another slight alteration in the limits of integration:

\displaystyle c_k = \frac{1}{T} \int_{-T/2}^{T/2} e^{\frac{-2 \pi i k t}{T}} f(t)dt

This doesn’t change the integral, since a T-periodic function has the same integral on any interval of length T.

Before we continue, we should show an intuitive aspect of what happens when T \to \infty. We can think of the usual Fourier series representation of a 1-periodic function f as a function on the integers whose values are the Fourier coefficients k \mapsto c_k, since the coefficients completely determine the representation of f. The interval between the inputs to this mapping is 1, because the function is 1-periodic. On the other hand, if we let T vary, the coefficient c_k belongs to the exponential with frequency k/T, and so the intervals between adjacent inputs shrink to 1/T. As T grows, the inputs are moving closer and closer together. In other words, the Fourier series representation is becoming a continuous mapping of the frequency! This viewpoint will motivate our seemingly magical choices to follow, and it partially justifies the common notion that the Fourier transform takes a function “in the time domain” and represents it “in the frequency domain.” It’s a stupid notion for mathematicians, but everyone says it anyway.

So if we try to take the limit immediately, that is, if we use the exact formula above for c_k and try to evaluate \lim_{T \to \infty} c_k, we have issues. The problem rears its ugly head when we let f be a function with bounded support (that is, f(x) is zero everywhere except possibly on a finite interval). If f is zero outside of [a,b] and \int_a^b |f(x)|dx \leq M is finite, then once T is large enough that [a,b] \subset [-T/2, T/2], every Fourier coefficient is bounded by M/T, and so all of them go to zero as T \to \infty. The details:

\displaystyle |c_k| = \left | \frac{1}{T}\int_{-T/2}^{T/2} e^{\frac{-2 \pi i kt}{T}} f(t)dt \right | \leq \frac{1}{T} \int_{-T/2}^{T/2} \left | e^{\frac{-2 \pi i k t}{T}} \right | |f(t)| dt

But as the absolute value of the complex exponential is 1, we can bound this by

\displaystyle \frac{1}{T} \int_{-T/2}^{T/2} |f(t)|dt \leq \frac{M}{T},

and as T \to \infty, we see that the whole thing goes to 0.

The solution is (magic!) to scale linearly by T, and pull the balancing factor of 1/T outside of the coefficients c_k, and into the Fourier series itself. In other words, our new Fourier series is (written with the terms rearranged for good reason)

\displaystyle \sum_{k = -\infty}^{\infty} e^{\frac{2 \pi i k t}{T}} c_k (1/T) \frac{1}{T}

where the coefficients c_k(1/T) are

\displaystyle c_k(1/T) = \int_{-T/2}^{T/2}e^{\frac{-2 \pi i k t}{T}} f(t)dt

Now we suddenly (magically!) realize that the first equation is just the usual Riemann sum from calculus for the estimate of an integral. If we think of x = k/T as our variable, so that adjacent values of x are 1/T apart, then we’d be integrating e^{2 \pi i t x} c_k(x). And in particular, when the interval between the x values goes to 0, the discrete sum converges to an integral by definition. Let us denote by s the continuous variable that these sample points become in the limit as 1/T goes to zero. Then we redefine the two above equations with glorious new names:

Definition: The Fourier transform of a function f: \mathbb{R} \to \mathbb{C} is the integral

\displaystyle \mathscr{F}f(s) = \int_{-\infty}^{\infty}e^{-2 \pi i s t} f(t)dt

whenever such an integral converges.

The inverse Fourier transform of a function g: \mathbb{R} \to \mathbb{C} is the integral

\displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^{\infty} e^{2\pi i s t}g(s)ds

whenever such an integral converges.

And so, the Fourier transform above generalizes the Fourier coefficient c_k (the limits of integration go to infinity), while the inverse transform generalizes the Fourier series reconstruction, by our conversion from a discrete sum to an integral.
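To see the rescaling at work, here is a small numerical sketch. It assumes a specific test function (a triangle supported on [-1, 1], so that M = 1) and approximates the coefficient integral with a Riemann sum; the function names and the sampling density are our own choices. We compute the raw coefficient c_k and the rescaled value T c_k at the same frequency k/T = 1/2 for three different periods.

```python
import numpy as np

def triangle(t):
    """Test function with bounded support: 1 - |t| on [-1, 1], zero elsewhere."""
    return np.maximum(1.0 - np.abs(t), 0.0)

def coefficient(f, k, T, samples_per_unit=1000):
    """Riemann sum for c_k = (1/T) * integral over [-T/2, T/2] of exp(-2 pi i k t / T) f(t) dt."""
    n = int(samples_per_unit * T) + 1
    t = np.linspace(-T / 2, T / 2, n)
    dt = t[1] - t[0]
    return np.sum(np.exp(-2j * np.pi * k * t / T) * f(t)) * dt / T

frequency = 0.5                      # the fixed value of k / T
periods = [4, 16, 64]

# Raw coefficients shrink like 1/T ...
raw = [coefficient(triangle, int(frequency * T), T) for T in periods]
# ... but the rescaled ones stay put.
rescaled = [T * c for T, c in zip(periods, raw)]
```

The magnitudes in raw respect the bound M/T and decay as T grows, while the entries of rescaled agree with each other to several decimal places. That common value is what the definition above calls \mathscr{F}f(1/2).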

We should note a few things about this definition, because it is quite a mouthful. First, \mathscr{F} and \mathscr{F}^{-1} operate on functions. In other words, they accept a function as input, and their values are functions. Still in other words, the parentheses are like (\mathscr{F}f)(s), and not like \mathscr{F}(f(s)). We will often omit the parentheses with this implicitly understood precedence. This is also part of why we choose different variables. f(t) will often use a different variable than its transform \mathscr{F}f(s).

Second, returning to our remark about stupid notions, the function f(t) can be thought of as being in the “time domain,” where the inputs are instances of time, while the transformed function \mathscr{F}f is in the “frequency domain.” That is, for a given input s, the Fourier transform \mathscr{F}f(s) describes how the complex exponential with frequency s contributes to the overall function f. The set of values of the Fourier transform of f is called the spectrum of f. One can visualize the spectrum, but only indirectly. A complex-valued function is always hard to visualize, so instead one graphs |\mathscr{F}f(s)| as a function on the real numbers. We then get pretty pictures like this one giving the spectrum of some human-spoken words:

This also explains the humorous comic at the beginning of this post: the thing saying “meow” is the spectrum of a cat, complete with whiskers. The comic also reinforces the idea that \mathscr{F} is simply an operation on functions. One does not need to restrict \mathscr{F} to operate on functions whose domain is time (indeed, a cat is not a function of time). It’s just an instance in which one can concretely interpret the transform of a function. For example, if one wanted to (and we will shortly), one could wonder about \mathscr{FF}f, or even apply \mathscr{F} arbitrarily many times and see what happens under the limit. The same thing goes for the inverse Fourier transform.

The last big issue we have with the above definition is that it only makes sense when the integral actually converges. We will run into a few examples where this becomes a big problem, but we will sweep these issues under the rug for now (remember, this is still the land of naivete).

Nevertheless, we can do some wonderful computations with this mentality and this new definition. It will benefit us in the long run, because we’ll discover the useful properties of the Fourier transform now, and use those properties to steer the more rigorous formalities later.

Elementary Transforms and Elementary Properties

Armed with the mighty definition of the Fourier transform, we can take two paths. We can compute the transforms of various elementary functions, or we can develop tools to construct transforms of combinations of functions by computing the transforms of their constituent parts. This is largely the same approach one takes in studying derivatives and integrals in classical calculus: one learns to compute the derivatives of polynomials, logarithms, and exponentials; and then one learns to compute derivatives of products, sums, and compositions of functions. We will operate with the same philosophy for now.

Example: Let f = \chi_{[-1/2, 1/2]} be the characteristic function of the interval from -1/2 to 1/2. That is, it is the function which is 1 on that interval and zero everywhere else. We will show \mathscr{F}f = \frac{\sin(\pi s)}{\pi s} by appealing directly to the definition of the Fourier transform.

\displaystyle \mathscr{F}f(s) = \int_{-\infty}^{\infty} e^{-2 \pi i s t} \chi_{[-1/2, 1/2]}(t) dt

Since f is zero outside of the chosen interval, and one inside, we can simplify the integral by adjusting the limits to -1/2 and 1/2, and inside simply using f(t) = 1:

\displaystyle \mathscr{F}f(s) = \int_{-1/2}^{1/2}e^{-2 \pi ist} dt

And this is quite tractable. Integrating the complex exponential as usual, we have:

\displaystyle \mathscr{F}f(s) = \left ( \frac{1}{-2 \pi is} e^{-2 \pi ist} \right ) \Big |_{-1/2}^{1/2} = \frac{e^{- \pi is} - e^{\pi is}}{-2 \pi is} = \frac{\sin(\pi s)}{\pi s}

Where the last equality follows from the classic identity e^{ix} = \cos(x) + i\sin(x). The result of this Fourier transform is so pervasive that it has its own name: \textup{sinc}(s) = \frac{\sin(\pi s)}{\pi s}.
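The computation is simple enough to double-check numerically. The sketch below approximates the integral over [-1/2, 1/2] with a midpoint rule (our choice of quadrature, with an arbitrary number of subintervals) and compares against NumPy's np.sinc, which uses the same normalized convention \sin(\pi s)/(\pi s).

```python
import numpy as np

def box_transform(s, n=10000):
    """Midpoint-rule approximation of the integral of exp(-2 pi i s t) over [-1/2, 1/2]."""
    dt = 1.0 / n
    t = -0.5 + (np.arange(n) + 0.5) * dt            # midpoints of n equal subintervals
    s = np.atleast_1d(np.asarray(s, dtype=float))
    return np.exp(-2j * np.pi * np.outer(s, t)).sum(axis=1) * dt

s = np.linspace(-5.0, 5.0, 41)
approx = box_transform(s)
exact = np.sinc(s)                                   # sin(pi s) / (pi s), with value 1 at s = 0
max_error = np.max(np.abs(approx - exact))
```

The imaginary parts of approx vanish up to rounding, as they should for an even real function, and max_error is limited only by the quadrature step.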

Exercise: Let \Lambda(t) be the piecewise function defined as 1 - |t| if |t| < 1, and zero otherwise. Prove that \mathscr{F}\Lambda(s) = \textup{sinc}^2(s) = \frac{\sin^2(\pi s)}{(\pi s)^2}.

Again, this one follows straight from the definition, which must be computed piecewise to handle the absolute value.

Example: Let f be the Gaussian f(t) = e^{- \pi t^2}. Then \mathscr{F}f(s) = f(s). That is, the Gaussian is fixed by the Fourier transform.

This is a very special property of the Gaussian, hinting at the special relationship between Fourier transforms and smoothness of curves. In order to prove it we need to borrow a fact from complex analysis, that \int_{-\infty}^{\infty} e^{- \pi t^2} dt = 1. Note that here the indefinite integral of f cannot be expressed in elementary terms, so basic calculus tools are insufficient to prove this fact. A proof is most easily accessible using complex integration and residue theory, and Wikipedia provides a proof that does the same thing using a real parameterization to make it seem more elementary.

To find the Fourier transform, we again appeal to the definition, except this time we use some tricks. First, we differentiate the definition with respect to s, and then integrate the result by parts to arrive at an ordinary differential equation, which we know how to solve. Set F(s) = \mathscr{F}f(s) for ease of notation.

\displaystyle F(s) = \int_{-\infty}^{\infty} e^{-2 \pi ist} e^{-\pi t^2}dt

Differentiating with respect to s, we have

\displaystyle F'(s) = \frac{d}{ds}\int_{-\infty}^{\infty} e^{-2 \pi ist}e^{-\pi t^2}dt = \int_{-\infty}^{\infty} \frac{d}{ds}(e^{-2 \pi ist}) e^{-\pi t^2} dt

Performing the differentiation and regrouping terms we have

\displaystyle F'(s) = i \int_{-\infty}^{\infty}e^{-2\pi ist}(-2\pi t e^{-\pi t^2}) dt

Now integrating by parts with respect to t, and recognizing that the term e^{-2 \pi ist} e^{-\pi t^2} tends to zero as t \to \pm \infty, we get

\displaystyle F'(s) = -i \int_{-\infty}^{\infty} e^{-\pi t^2}(-2\pi is)e^{-2 \pi ist} dt = -2\pi s F(s)

As we claimed earlier, this is a simple ordinary differential equation, which has solution

\displaystyle F(s) = F(0)e^{-\pi s^2} = F(0)f(s)

And here F(0) = \int_{-\infty}^{\infty} e^{-2 \pi i t 0}e^{-\pi t^2} dt = 1, as we claimed from the beginning. This completes the proof, as F(s) = f(s).
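As a sanity check on the proof, we can approximate the transform of the Gaussian by a Riemann sum and compare it with the Gaussian itself. This is only a sketch: the helper fourier_transform, the truncation of the integral to [-8, 8], and the grid size are all assumptions we make for the example.

```python
import numpy as np

def fourier_transform(f, s, half_width=8.0, n=1601):
    """Riemann-sum approximation of (Ff)(s), truncating the integral to [-half_width, half_width]."""
    t = np.linspace(-half_width, half_width, n)
    dt = t[1] - t[0]
    s = np.atleast_1d(np.asarray(s, dtype=float))
    kernel = np.exp(-2j * np.pi * np.outer(s, t))    # rows index s, columns index t
    return (kernel @ f(t)) * dt

def gaussian(t):
    return np.exp(-np.pi * t**2)

s = np.linspace(-3.0, 3.0, 61)
F = fourier_transform(gaussian, s)
max_error = np.max(np.abs(F - gaussian(s)))          # distance between Ff and f on the grid
F_at_zero = fourier_transform(gaussian, 0.0)[0]      # should be the integral of the Gaussian, i.e. 1
```

Because the Gaussian decays so quickly, even this crude quadrature reproduces f(s) essentially to machine precision.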

Next, we will focus on the rules for taking Fourier transforms of functions combined in various ways. First, we note that the Fourier transform is linear, in the very same sense as linear maps between vector spaces in elementary linear algebra. In particular, the linearity of the integral gives

\displaystyle \mathscr{F}(af + bg)(s) = a \mathscr{F}f(s) + b\mathscr{F}g(s)

Other easy properties arise from modifying the input of f, and using multiplicative properties of the complex exponential. For instance, if we let f_a(t) = f(t - a), we see that \mathscr{F}f_a(s) = e^{-2 \pi ias}\mathscr{F}f(s). This follows by a simple change of variables in the integral. Letting u = t-a,

\displaystyle \mathscr{F}f_a(s) = \int_{-\infty}^{\infty} e^{-2\pi is(u+a)}f(u)du

And we can trivially factor out the needed complex exponential coefficient, and work with u as usual. One convenient interpretation of this formula is that a shift in time corresponds to a phase shift in frequency. Once again, we caution that these interpretations are a tool; they can massively confuse the issue when, say, the domain of f is frequency to begin with.

Similar considerations give us a formula for the scaled function f(at) whose transform is \frac{1}{|a|} \mathscr{F}f(\frac{s}{a}). We leave this as an exercise to the reader (hint: break it into two cases for when a<0, a>0).
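Without spoiling the exercise, both the shift and the scaling formulas can be tested numerically. The sketch below uses an arbitrary rapidly decaying test function and a Riemann-sum transform; the helper, the test function, and the particular shift and (negative) scale factor are assumptions chosen for illustration.

```python
import numpy as np

def fourier_transform(f, s, half_width=8.0, n=1601):
    """Riemann-sum approximation of (Ff)(s), truncating the integral to [-half_width, half_width]."""
    t = np.linspace(-half_width, half_width, n)
    dt = t[1] - t[0]
    s = np.atleast_1d(np.asarray(s, dtype=float))
    kernel = np.exp(-2j * np.pi * np.outer(s, t))
    return (kernel @ f(t)) * dt

def f(t):
    """A non-symmetric, rapidly decaying test function (hypothetical choice)."""
    return (1.0 + t) * np.exp(-np.pi * t**2)

s = np.linspace(-2.0, 2.0, 41)

# Shift: the transform of f(t - a) should be exp(-2 pi i a s) * Ff(s).
shift = 0.75
shift_lhs = fourier_transform(lambda t: f(t - shift), s)
shift_rhs = np.exp(-2j * np.pi * shift * s) * fourier_transform(f, s)

# Scaling: the transform of f(a t) should be (1/|a|) * Ff(s / a); a < 0 exercises the absolute value.
scale = -2.0
scale_lhs = fourier_transform(lambda t: f(scale * t), s)
scale_rhs = fourier_transform(f, s / scale) / abs(scale)
```

Both pairs agree to high precision on the grid, including for the negative scale factor, which is the case where forgetting the absolute value would flip a sign.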

Next, we note that the Fourier transform turns derivatives of some functions into a very manageable product. Precisely, if f and its first n-1 derivatives tend to zero as t \to \pm \infty (and the transforms involved exist), then

\displaystyle \mathscr{F}\frac{d^nf}{dt^n}(s) = (2\pi i s)^n \mathscr{F}f(s)

We can prove this by induction. We’ll just prove the base case:

\displaystyle \mathscr{F}f'(s) = \int_{-\infty}^{\infty} e^{-2 \pi ist}f'(t)dt

Integrating by parts, we get

\displaystyle \mathscr{F}f'(s) = \left ( e^{-2\pi ist}f(t) \right ) \Big |_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-2\pi is)e^{-2\pi ist}f(t)dt

And by our decay hypothesis and the fact that the complex exponential has a constant norm, the first term (evaluated from -\infty to \infty) tends to zero, leaving our desired product. The inductive step follows with the ease of iterated integration by parts. Note that although this result only holds for functions which tend to zero at \pm \infty, next time we will rectify the situation by restricting our theory to functions which are “the best candidates” for the theory of Fourier analysis and eliminate the need for such hypotheses.
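For a concrete check, take the Gaussian f(t) = e^{-\pi t^2}, whose first two derivatives we can write down by hand. The sketch below compares Riemann-sum transforms of f' and f'' against (2\pi i s)\mathscr{F}f(s) and (2\pi i s)^2\mathscr{F}f(s); the helper function and the grid are assumptions of the example.

```python
import numpy as np

def fourier_transform(f, s, half_width=8.0, n=1601):
    """Riemann-sum approximation of (Ff)(s), truncating the integral to [-half_width, half_width]."""
    t = np.linspace(-half_width, half_width, n)
    dt = t[1] - t[0]
    s = np.atleast_1d(np.asarray(s, dtype=float))
    kernel = np.exp(-2j * np.pi * np.outer(s, t))
    return (kernel @ f(t)) * dt

def f(t):
    return np.exp(-np.pi * t**2)

def f_prime(t):
    return -2.0 * np.pi * t * np.exp(-np.pi * t**2)

def f_second(t):
    return (4.0 * np.pi**2 * t**2 - 2.0 * np.pi) * np.exp(-np.pi * t**2)

s = np.linspace(-2.0, 2.0, 41)
Ff = fourier_transform(f, s)

first_lhs = fourier_transform(f_prime, s)
first_rhs = (2j * np.pi * s) * Ff            # n = 1

second_lhs = fourier_transform(f_second, s)
second_rhs = (2j * np.pi * s) ** 2 * Ff      # n = 2
```

Differentiation in t really does turn into multiplication by 2\pi i s on the transform side, once for each derivative.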

The More Interesting Properties

The final two properties of the Fourier transform that we will inspect are in a sense deeper and more substantial than those above. In particular, we will establish the duality of the Fourier transform, and the effect of Fourier transforms on convolution.

First, the Fourier transform has a few notions of duality. Let f^-(t) denote f(-t). One such duality notion is the following, which is a trivial consequence of the definitions of the transform and its inverse:

\displaystyle (\mathscr{F}f)^- = \mathscr{F}^{-1}f

Similarly, a minor change of variables shows that \mathscr{F}(f^-) = \mathscr{F}^{-1}f. Chaining these together, we have the nice identity

\displaystyle \mathscr{F}(f^-) = (\mathscr{F}f)^-

A simple corollary is that \mathscr{FF}f = f^-. This allows us to compute the Fourier transforms of some potentially unmanageable functions. For instance, let us return to our friend the sinc function.

\displaystyle \mathscr{F}\textup{sinc}(s) = \mathscr{FF}(\chi_{[-1/2,1/2]}) = \chi^-_{[-1/2, 1/2]} = \chi_{[-1/2, 1/2]}

by the symmetry of the characteristic function. On the other hand, it’s ridiculously counterintuitive that the following integral is actually the characteristic function of a finite interval:

\displaystyle \int_{-\infty}^{\infty} e^{-2 \pi ist} \frac{\sin(\pi t)}{\pi t} dt

In fact, even though we just “proved” that the sinc function has a nice transform, it is hardly clear how to integrate it. Worse, the sinc function is not even (Lebesgue) integrable! Without further qualifications, the above expression is complete nonsense.
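For a function where every integral in sight does converge, the corollary \mathscr{FF}f = f^- can be checked numerically without any such worries. The sketch below uses a Gaussian shifted off center (so that f and f^- are visibly different) and applies a Riemann-sum transform twice on a shared grid; the grid and helper are our own assumptions.

```python
import numpy as np

grid = np.linspace(-8.0, 8.0, 801)       # used both as the t-grid and as the s-grid
step = grid[1] - grid[0]
kernel = np.exp(-2j * np.pi * np.outer(grid, grid))

def transform(values):
    """Riemann-sum Fourier transform of samples on `grid`, evaluated at the points of `grid`."""
    return (kernel @ values) * step

def f(t):
    """A Gaussian centered at t = 1, so that f(-t) differs from f(t)."""
    return np.exp(-np.pi * (t - 1.0) ** 2)

twice = transform(transform(f(grid)))    # samples of FFf
reflected = f(-grid)                     # samples of f^-
max_error = np.max(np.abs(twice - reflected))
```

Applying the transform twice reflects the function about the origin, so the bump at t = 1 reappears at t = -1.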

Historically, this is the point at which the physicists contact the mathematicians and say, “We dun broked it!” Because the physicists went ahead and used these naively impossible transforms to do amazing things and discover elegant identities, the mathematicians are left to design a sensible theory to support their findings. The field is rife with such inconsistencies, and this is not the last one we will see before consolidating the theory. Perhaps this is in part because successful applications in engineering outpace mathematical rigor. Glibly, they’re racing for profit while the mathematicians want to architect a flawless foundation in which deep theorems are manifestly obvious.

Getting back to the properties of Fourier transforms, we have saved perhaps the most useful one for last. In short, Fourier transforms turn convolution into multiplication. Convolutions, both continuous and discrete, make cameo appearances all over applied mathematics, from signal processing and image analysis to quantum mechanics and mathematical finance. In our applications, we will use the following properties of convolution to modify the spectrum of a signal, for such purposes as removing noise, or filtering out low/high/mid frequency regions. Without further ado, the definition:

Definition: The convolution of f and g, denoted f \ast g, is the integral

\displaystyle (f \ast g)(x) = \int_{-\infty}^{\infty} g(x-y)f(y)dy,

should such an integral converge. Otherwise the convolution is undefined.

Often convolution is interpreted as some sort of stretch+translation of a function by another, but we find such meager interpretations mathematically flaccid. Convolution is simply an operation that combines functions in an interesting way (perhaps its definition is motivated by the question below). Nevertheless, Wikipedia provides a number of relevant animations showing convolution in action.

So the leading question here is, what happens when one takes the product of (\mathscr{F}f)(\mathscr{F}g)? From the definition, this is

\displaystyle \left ( \int_{-\infty}^{\infty}e^{-2\pi isx}f(x)dx \right ) \left ( \int_{-\infty}^{\infty} e^{-2 \pi ist} g(t) dt \right )

We may combine the integrals into a double integral, and further combine the complex exponentials, getting

\displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2 \pi is (t+x)}g(t) dt \right ) f(x) dx

Substituting u = t+x, we have

\displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} e^{-2\pi isu} g(u-x)du \right ) f(x) dx

And swapping the order of integration,

\displaystyle \int_{-\infty}^{\infty} \left ( \int_{-\infty}^{\infty} g(u-x)f(x)dx \right ) e^{-2\pi isu} du = \mathscr{F}(f \ast g)(s)

(The parenthetical quantity drove our definition of the convolution to begin with.) And so we have the beautiful identity:

\mathscr{F}(f \ast g)(s) = \mathscr{F}f(s) \mathscr{F}g(s)
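Here is a hedged numerical check of the identity. We sample two rapidly decaying functions (arbitrary choices) on a symmetric grid, approximate the convolution integral with np.convolve scaled by the grid spacing, and compare the transform of the result with the product of the individual transforms. All names below are assumptions of the sketch.

```python
import numpy as np

t = np.linspace(-8.0, 8.0, 801)          # symmetric grid with an odd number of points
dt = t[1] - t[0]

f = np.exp(-np.pi * t**2)
g = (1.0 + t) * np.exp(-2.0 * np.pi * (t - 0.5) ** 2)

# Riemann-sum approximation of (f * g)(x) at the same grid points.
# mode="same" keeps the central part of the full convolution, which lines up with t
# because the grid is symmetric about zero and has odd length.
conv = np.convolve(f, g, mode="same") * dt

def transform(values, s):
    """Riemann-sum Fourier transform of samples on t, evaluated at the frequencies s."""
    kernel = np.exp(-2j * np.pi * np.outer(s, t))
    return (kernel @ values) * dt

s = np.linspace(-2.0, 2.0, 41)
lhs = transform(conv, s)                       # F(f * g)
rhs = transform(f, s) * transform(g, s)        # Ff times Fg
max_error = np.max(np.abs(lhs - rhs))
```

The two sides agree to within rounding error on the frequency grid, which is the discrete shadow of the identity above.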

We will use this as follows: multiply the Fourier transform of a signal by an appropriate characteristic function (the characteristic function of the set of “good” frequencies of, say, a sound clip) and then take the inverse transform of the product, getting as a result a modified signal with certain frequencies removed.
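To make the plan concrete, here is a rough sketch of such a filter. It jumps ahead of the theory by borrowing NumPy's discrete transform (np.fft.rfft and np.fft.irfft) as a stand-in for \mathscr{F} and \mathscr{F}^{-1}, and the signal, sample rate, and cutoff are made-up values for illustration.

```python
import numpy as np

sample_rate = 1000                              # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate        # one second of sample times
low = np.sin(2 * np.pi * 5 * t)                 # 5 Hz component we want to keep
high = 0.5 * np.sin(2 * np.pi * 50 * t)         # 50 Hz component we want to remove
signal = low + high

# Discrete analogue of the transform for real signals, and the frequency of each entry.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Characteristic function of the set of "good" frequencies: everything up to 10 Hz.
keep = freqs <= 10

# Multiply the spectrum by the characteristic function, then invert.
filtered = np.fft.irfft(spectrum * keep, n=len(signal))
max_error = np.max(np.abs(filtered - low))
```

Because both sinusoids complete a whole number of cycles in the sampled window, the filtered signal matches the 5 Hz component almost exactly. Real recordings are rarely so cooperative, which is part of why the discrete theory deserves its own careful treatment.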

There are a few hurdles between here and there (at least, as far as this blog goes). First, we must compensate for our convergence naivete with mathematical rigor. Next time, we will define the class of Schwartz functions, from which we will derive a class of “generalized functions,” intuitively constituting the class of “transformable” functions. After that, we must needs find a suitable discrete approximation of the Fourier transform. In real life, all signals are sampled sequences of numbers. As such, we cannot take their integrals, and must convert these continuous notions to operations on sequences. Finally, we need to investigate an algorithm to efficiently compute such a discrete Fourier transform. Then, and only then, may we proceed with writing programs to do great things.

So look forward to all of the above in the coming weeks. Until next time!