Markov Chain Monte Carlo Without all the Bullshit

I have a little secret: I don’t like the terminology, notation, and style of writing in statistics. I find it unnecessarily complicated. This shows up when trying to read about Markov Chain Monte Carlo methods. Take, for example, the abstract to the Markov Chain Monte Carlo article in the Encyclopedia of Biostatistics.

Markov chain Monte Carlo (MCMC) is a technique for estimating by simulation the expectation of a statistic in a complex model. Successive random selections form a Markov chain, the stationary distribution of which is the target distribution. It is particularly useful for the evaluation of posterior distributions in complex Bayesian models. In the Metropolis–Hastings algorithm, items are selected from an arbitrary “proposal” distribution and are retained or not according to an acceptance rule. The Gibbs sampler is a special case in which the proposal distributions are conditional distributions of single components of a vector parameter. Various special cases and applications are considered.

I can only vaguely understand what the author is saying here (and really only because I know ahead of time what MCMC is). There are certainly references to more advanced things than what I’m going to cover in this post. But it seems very difficult to find an explanation of Markov Chain Monte Carlo without all any superfluous jargon. The “bullshit” here is the implicit claim of an author that such jargon is needed. Maybe it is to explain advanced applications (like attempts to do “inference in Bayesian networks”), but it is certainly not needed to define or analyze the basic ideas.

So to counter, here’s my own explanation of Markov Chain Monte Carlo, inspired by the treatment of John Hopcroft and Ravi Kannan.

The Problem is Drawing from a Distribution

Markov Chain Monte Carlo is a technique to solve the problem of sampling from a complicated distribution. Let me explain by the following imaginary scenario. Say I have a magic box which can estimate probabilities of baby names very well. I can give it a string like “Malcolm” and it will tell me the exact probability p_{\textup{Malcolm}} that you will choose this name for your next child. So there’s a distribution D over all names, it’s very specific to your preferences, and for the sake of argument say this distribution is fixed and you don’t get to tamper with it.

Now comes the problem: I want to efficiently draw a name from this distribution D. This is the problem that Markov Chain Monte Carlo aims to solve. Why is it a problem? Because I have no idea what process you use to pick a name, so I can’t simulate that process myself. Here’s another method you could try: generate a name x uniformly at random, ask the machine for p_x, and then flip a biased coin with probability p_x and use x if the coin lands heads. The problem with this is that there are exponentially many names! The variable here is the number of bits needed to write down a name n = |x|. So either the probabilities p_x will be exponentially small and I’ll be flipping for a very long time to get a single name, or else there will only be a few names with nonzero probability and it will take me exponentially many draws to find them. Inefficiency is the death of me.

So this is a serious problem! Let’s restate it formally just to be clear.

Definition (The sampling problem):  Let D be a distribution over a finite set X. You are given black-box access to the probability distribution function p(x) which outputs the probability of drawing x \in X according to D. Design an efficient randomized algorithm A which outputs an element of X so that the probability of outputting x is approximately p(x). More generally, output a sample of elements from X drawn according to p(x).

Assume that A has access to only fair random coins, though this allows one to efficiently simulate flipping a biased coin of any desired probability.

Notice that with such an algorithm we’d be able to do things like estimate the expected value of some random variable f : X \to \mathbb{R}. We could take a large sample S \subset X via the solution to the sampling problem, and then compute the average value of f on that sample. This is what a Monte Carlo method does when sampling is easy. In fact, the Markov Chain solution to the sampling problem will allow us to do the sampling and the estimation of \mathbb{E}(f) in one fell swoop if you want.

But the core problem is really a sampling problem, and “Markov Chain Monte Carlo” would be more accurately called the “Markov Chain Sampling Method.” So let’s see why a Markov Chain could possibly help us.

Random Walks, the “Markov Chain” part of MCMC

Markov Chain is essentially a fancy term for a random walk on a graph.

You give me a directed graph G = (V,E), and for each edge e = (u,v) \in E you give me a number p_{u,v} \in [0,1]. In order to make a random walk make sense, the p_{u,v} need to satisfy the following constraint:

For any vertex x \in V, the set all values p_{x,y} on outgoing edges (x,y) must sum to 1, i.e. form a probability distribution.

If this is satisfied then we can take a random walk on G according to the probabilities as follows: start at some vertex x_0. Then pick an outgoing edge at random according to the probabilities on the outgoing edges, and follow it to x_1. Repeat if possible.

I say “if possible” because an arbitrary graph will not necessarily have any outgoing edges from a given vertex. We’ll need to impose some additional conditions on the graph in order to apply random walks to Markov Chain Monte Carlo, but in any case the idea of randomly walking is well-defined, and we call the whole object (V,E, \{ p_e \}_{e \in E})Markov chain.

Here is an example where the vertices in the graph correspond to emotional states.

An example Markov chain [image source http://www.mathcs.emory.edu/~cheung/]

An example Markov chain; image source http://www.mathcs.emory.edu/~cheung/

In statistics land, they take the “state” interpretation of a random walk very seriously. They call the edge probabilities “state-to-state transitions.”

The main theorem we need to do anything useful with Markov chains is the stationary distribution theorem (sometimes called the “Fundamental Theorem of Markov Chains,” and for good reason). What it says intuitively is that for a very long random walk, the probability that you end at some vertex v is independent of where you started! All of these probabilities taken together is called the stationary distribution of the random walk, and it is uniquely determined by the Markov chain.

However, for the reasons we stated above (“if possible”), the stationary distribution theorem is not true of every Markov chain. The main property we need is that the graph G is strongly connected. Recall that a directed graph is called connected if, when you ignore direction, there is a path from every vertex to every other vertex. It is called strongly connected if you still get paths everywhere when considering direction. If we additionally require the stupid edge-case-catcher that no edge can have zero probability, then strong connectivity (of one component of a graph) is equivalent to the following property:

For every vertex v \in V(G), an infinite random walk started at v will return to v with probability 1.

In fact it will return infinitely often. This property is called the persistence of the state v by statisticians. I dislike this term because it appears to describe a property of a vertex, when to me it describes a property of the connected component containing that vertex. In any case, since in Markov Chain Monte Carlo we’ll be picking the graph to walk on (spoiler!) we will ensure the graph is strongly connected by design.

Finally, in order to describe the stationary distribution in a more familiar manner (using linear algebra), we will write the transition probabilities as a matrix A where entry a_{j,i} = p_{(i,j)} if there is an edge (i,j) \in E and zero otherwise. Here the rows and columns correspond to vertices of G, and each column i forms the probability distribution of going from state i to some other state in one step of the random walk. Note A is the transpose of the weighted adjacency matrix of the directed weighted graph G where the weights are the transition probabilities (the reason I do it this way is because matrix-vector multiplication will have the matrix on the left instead of the right; see below).

This matrix allows me to describe things nicely using the language of linear algebra. In particular if you give me a basis vector e_i interpreted as “the random walk currently at vertex i,” then Ae_i gives a vector whose j-th coordinate is the probability that the random walk would be at vertex j after one more step in the random walk. Likewise, if you give me a probability distribution q over the vertices, then Aq gives a probability vector interpreted as follows:

If a random walk is in state i with probability q_i, then the j-th entry of Aq is the probability that after one more step in the random walk you get to vertex j.

Interpreted this way, the stationary distribution is a probability distribution \pi such that A \pi = \pi, in other words \pi is an eigenvector of A with eigenvalue 1.

A quick side note for avid readers of this blog: this analysis of a random walk is exactly what we did back in the early days of this blog when we studied the PageRank algorithm for ranking webpages. There we called the matrix A “a web matrix,” noted it was column stochastic (as it is here), and appealed to a special case of the Perron-Frobenius theorem to show that there is a unique maximal eigenvalue equal to one (with a dimension one eigenspace) whose eigenvector we used as a sort of “stationary distribution” and the final ranking of web pages. There we described an algorithm to actually find that eigenvector by iterated multiplication by A. The following theorem is essentially a variant of this algorithm but works under weaker conditions; for the web matrix we added additional “fake” edges that give the needed stronger conditions.

Theorem: Let G be a strongly connected graph with associated edge probabilities \{ p_e \}_e \in E forming a Markov chain. For a probability vector x_0, define x_{t+1} = Ax_t for all t \geq 1, and let v_t be the long-term average v_t = \frac1t \sum_{s=1}^t x_s. Then:

  1. There is a unique probability vector \pi with A \pi = \pi.
  2. For all x_0, the limit \lim_{t \to \infty} v_t = \pi.

Proof. Since v_t is a probability vector we just want to show that |Av_t - v_t| \to 0 as t \to \infty. Indeed, we can expand this quantity as

\displaystyle \begin{aligned} Av_t - v_t &=\frac1t (Ax_0 + Ax_1 + \dots + Ax_{t-1}) - \frac1t (x_0 + \dots + x_{t-1}) \\ &= \frac1t (x_t - x_0) \end{aligned}

But x_t, x_0 are unit vectors, so their difference is at most 2, meaning |Av_t - v_t| \leq \frac2t \to 0. Now it’s clear that this does not depend on v_0. For uniqueness we will cop out and appeal to the Perron-Frobenius theorem that says any matrix of this form has a unique such (normalized) eigenvector.

\square

One additional remark is that, in addition to computing the stationary distribution by actually computing this average or using an eigensolver, one can analytically solve for it as the inverse of a particular matrix. Define B = A-I_n, where I_n is the n \times n identity matrix. Let C be B with a row of ones appended to the bottom and the topmost row removed. Then one can show (quite opaquely) that the last column of C^{-1} is \pi. We leave this as an exercise to the reader, because I’m pretty sure nobody uses this method in practice.

One final remark is about why we need to take an average over all our x_t in the theorem above. There is an extra technical condition one can add to strong connectivity, called aperiodicity, which allows one to beef up the theorem so that x_t itself converges to the stationary distribution. Rigorously, aperiodicity is the property that, regardless of where you start your random walk, after some sufficiently large number of steps n the random walk has a positive probability of being at every vertex at every subsequent step. As an example of a graph where aperiodicity fails: an undirected cycle on an even number of vertices. In that case there will only be a positive probability of being at certain vertices every other step, and averaging those two long term sequences gives the actual stationary distribution.

Screen Shot 2015-04-07 at 6.55.39 PM

Image source: Wikipedia

One way to guarantee that your Markov chain is aperiodic is to ensure there is a positive probability of staying at any vertex. I.e., that your graph has a self-loop. This is what we’ll do in the next section.

Constructing a graph to walk on

Recall that the problem we’re trying to solve is to draw from a distribution over a finite set X with probability function p(x). The MCMC method is to construct a Markov chain whose stationary distribution is exactly p, even when you just have black-box access to evaluating p. That is, you (implicitly) pick a graph G and (implicitly) choose transition probabilities for the edges to make the stationary distribution p. Then you take a long enough random walk on G and output the x corresponding to whatever state you land on.

The easy part is coming up with a graph that has the right stationary distribution (in fact, “most” graphs will work). The hard part is to come up with a graph where you can prove that the convergence of a random walk to the stationary distribution is fast in comparison to the size of X. Such a proof is beyond the scope of this post, but the “right” choice of a graph is not hard to understand.

The one we’ll pick for this post is called the Metropolis-Hastings algorithm. The input is your black-box access to p(x), and the output is a set of rules that implicitly define a random walk on a graph whose vertex set is X.

It works as follows: you pick some way to put X on a lattice, so that each state corresponds to some vector in \{ 0,1, \dots, n\}^d. Then you add (two-way directed) edges to all neighboring lattice points. For n=5, d=2 it would look like this:

And for d=3, n \in \{2,3\} it would look like this:

You have to be careful here to ensure the vertices you choose for X are not disconnected, but in many applications X is naturally already a lattice.

Now we have to describe the transition probabilities. Let r be the maximum degree of a vertex in this lattice (r=2d). Suppose we’re at vertex i and we want to know where to go next. We do the following:

  1. Pick neighbor j with probability 1/r (there is some chance to stay at i).
  2. If you picked neighbor j and p(j) \geq p(i) then deterministically go to j.
  3. Otherwise, p(j) < p(i), and you go to j with probability p(j) / p(i).

We can state the probability weight p_{i,j} on edge (i,j) more compactly as

\displaystyle p_{i,j} = \frac1r \min(1, p(j) / p(i)) \\ p_{i,i} = 1 - \sum_{(i,j) \in E(G); j \neq i} p_{i,j}

It is easy to check that this is indeed a probability distribution for each vertex i. So we just have to show that p(x) is the stationary distribution for this random walk.

Here’s a fact to do that: if a probability distribution v with entries v(x) for each x \in X has the property that v(x)p_{x,y} = v(y)p_{y,x} for all x,y \in X, the v is the stationary distribution. To prove it, fix x and take the sum of both sides of that equation over all y. The result is exactly the equation v(x) = \sum_{y} v(y)p_{y,x}, which is the same as v = Av. Since the stationary distribution is the unique vector satisfying this equation, v has to be it.

Doing this with out chosen p(i) is easy, since p(i)p_{i,j} and p(i)p_{j,i} are both equal to \frac1r \min(p(i), p(j)) by applying a tiny bit of algebra to the definition. So we’re done! One can just randomly walk according to these probabilities and get a sample.

Last words

The last thing I want to say about MCMC is to show that you can estimate the expected value of a function \mathbb{E}(f) simultaneously while random-walking through your Metropolis-Hastings graph (or any graph whose stationary distribution is p(x)). By definition the expected value of f is \sum_x f(x) p(x).

Now what we can do is compute the average value of f(x) just among those states we’ve visited during our random walk. With a little bit of extra work you can show that this quantity will converge to the true expected value of f at about the same time that the random walk converges to the stationary distribution. (Here the “about” means we’re off by a constant factor depending on f). In order to prove this you need some extra tools I’m too lazy to write about in this post, but the point is that it works.

The reason I did not start by describing MCMC in terms of estimating the expected value of a function is because the core problem is a sampling problem. Moreover, there are many applications of MCMC that need nothing more than a sample. For example, MCMC can be used to estimate the volume of an arbitrary (maybe high dimensional) convex set. See these lecture notes of Alistair Sinclair for more.

If demand is popular enough, I could implement the Metropolis-Hastings algorithm in code (it wouldn’t be industry-strength, but perhaps illuminating? I’m not so sure…).

Until next time!

Hamming’s Code

Or how to detect and correct errors

Last time we made a quick tour through the main theorems of Claude Shannon, which essentially solved the following two problems about communicating over a digital channel.

  1. What is the best encoding for information when you are guaranteed that your communication channel is error free?
  2. Are there any encoding schemes that can recover from random noise introduced during transmission?

The answers to these questions were purely mathematical theorems, of course. But the interesting shortcoming of Shannon’s accomplishment was that his solution for the noisy coding problem (2) was nonconstructive. The question remains: can we actually come up with efficiently computable encoding schemes? The answer is yes! Marcel Golay was the first to discover such a code in 1949 (just a year after Shannon’s landmark paper), and Golay’s construction was published on a single page! We’re not going to define Golay’s code in this post, but we will mention its interesting status in coding theory later. The next year Richard Hamming discovered another simpler and larger family of codes, and went on to do some of the major founding work in coding theory. For his efforts he won a Turing Award and played a major part in bringing about the modern digital age. So we’ll start with Hamming’s codes.

We will assume some basic linear algebra knowledge, as detailed our first linear algebra primer. We will also use some basic facts about polynomials and finite fields, though the lazy reader can just imagine everything as binary \{ 0,1 \} and still grok the important stuff.

hamming-3

Richard Hamming, inventor of Hamming codes. [image source]

What is a code?

The formal definition of a code is simple: a code C is just a subset of \{ 0,1 \}^n for some n. Elements of C are called codewords.

This is deceptively simple, but here’s the intuition. Say we know we want to send messages of length k, so that our messages are in \{ 0,1 \}^k. Then we’re really viewing a code C as the image of some encoding function \textup{Enc}: \{ 0,1 \}^k \to \{ 0,1 \}^n. We can define C by just describing what the set is, or we can define it by describing the encoding function. Either way, we will make sure that \textup{Enc} is an injective function, so that no two messages get sent to the same codeword. Then |C| = 2^k, and we can call k = \log |C| the message length of C even if we don’t have an explicit encoding function.

Moreover, while in this post we’ll always work with \{ 0,1 \}, the alphabet of your encoded messages could be an arbitrary set \Sigma. So then a code C would be a subset of tuples in \Sigma^n, and we would call q = |\Sigma|.

So we have these parameters n, k, q, and we need one more. This is the minimum distance of a code, which we’ll denote by d. This is defined to be the minimum Hamming distance between all distinct pairs of codewords, where by Hamming distance I just mean the number of coordinates that two tuples differ in. Recalling the remarks we made last time about Shannon’s nonconstructive proof, when we decode an encoded message y (possibly with noisy bits) we look for the (unencoded) message x whose encoding \textup{Enc}(x) is as close to y as possible. This will only work in the worst case if all pairs of codewords are sufficiently far apart. Hence we track the minimum distance of a code.

So coding theorists turn this mess of parameters into notation.

Definition: A code C is called an (n, k, d)_q-code if

  • C \subset \Sigma^n for some alphabet \Sigma,
  • k = \log |C|,
  • C has minimum distance d, and
  • the alphabet \Sigma has size q.

The basic goals of coding theory are:

  1. For which values of these four parameters do codes exist?
  2. Fixing any three parameters, how can we optimize the other one?

In this post we’ll see how simple linear-algebraic constructions can give optima for one of these problems, optimizing k for d=3, and we’ll state a characterization theorem for optimizing k for a general d. Next time we’ll continue with a second construction that optimizes a different bound called the Singleton bound.

Linear codes and the Hamming code

A code is called linear if it can be identified with a linear subspace of some finite-dimensional vector space. In this post all of our vector spaces will be \{ 0,1 \}^n, that is tuples of bits under addition mod 2. But you can do the same constructions with any finite scalar field \mathbb{F}_q for a prime power q, i.e. have your vector space be \mathbb{F}_q^n. We’ll go back and forth between describing a binary code q=2 over \{ 0,1 \} and a code in $\mathbb{F}_q^n$. So to say a code is linear means:

  • The zero vector is a codeword.
  • The sum of any two codewords is a codeword.
  • Any scalar multiple of a codeword is a codeword.

Linear codes are the simplest kinds of codes, but already they give a rich variety of things to study. The benefit of linear codes is that you can describe them in a lot of different and useful ways besides just describing the encoding function. We’ll use two that we define here. The idea is simple: you can describe everything about a linear subspace by giving a basis for the space.

Definition: generator matrix of a (n,k,d)_q-code C is a k \times n matrix G whose rows form a basis for C.

There are a lot of equivalent generator matrices for a linear code (we’ll come back to this later), but the main benefit is that having a generator matrix allows one to encode messages x \in \{0,1 \}^k by left multiplication xG. Intuitively, we can think of the bits of x as describing the coefficients of the chosen linear combination of the rows of G, which uniquely describes an element of the subspace. Note that because a k-dimensional subspace of \{ 0,1 \}^n has 2^k elements, we’re not abusing notation by calling k = \log |C| both the message length and the dimension.

For the second description of C, we’ll remind the reader that every linear subspace C has a unique orthogonal complement C^\perp, which is the subspace of vectors that are orthogonal to vectors in C.

Definition: Let H^T be a generator matrix for C^\perp. Then H is called a parity check matrix.

Note H has the basis for C^\perp as columns. This means it has dimensions n \times (n-k). Moreover, it has the property that x \in C if and only if the left multiplication xH = 0. Having zero dot product with all columns of H characterizes membership in C.

The benefit of having a parity check matrix is that you can do efficient error detection: just compute yH on your received message y, and if it’s nonzero there was an error! What if there were so many errors, and just the right errors that y coincided with a different codeword than it started? Then you’re screwed. In other words, the parity check matrix is only guarantee to detect errors if you have fewer errors than the minimum distance of your code.

So that raises an obvious question: if you give me the generator matrix of a linear code can I compute its minimum distance? It turns out that this problem is NP-hard in general. In fact, you can show that this is equivalent to finding the smallest linearly dependent set of rows of the parity check matrix, and it is easier to see why such a problem might be hard. But if you construct your codes cleverly enough you can compute their distance properties with ease.

Before we do that, one more definition and a simple proposition about linear codes. The Hamming weight of a vector x, denoted wt(x), is the number of nonzero entries in x.

Proposition: The minimum distance of a linear code C is the minimum Hamming weight over all nonzero vectors x \in C.

Proof. Consider a nonzero x \in C. On one hand, the zero vector is a codeword and wt(x) is by definition the Hamming distance between x and zero, so it is an upper bound on the minimum distance. In fact, it’s also a lower bound: if x,y are two nonzero codewords, then x-y is also a codeword and wt(x-y) is the Hamming distance between x and y.

\square

So now we can define our first code, the Hamming code. It will be a (n, k, 3)_2-code. The construction is quite simple. We have fixed d=3, q=2, and we will also fix l = n-k. One can think of this as fixing n and maximizing k, but it will only work for n of a special form.

We’ll construct the Hamming code by describing a parity-check matrix H. In fact, we’re going to see what conditions the minimum distance d=3 imposes on H, and find out those conditions are actually sufficient to get d=3. We’ll start with 2. If we want to ensure d \geq 2, then you need it to be the case that no nonzero vector of Hamming weight 1 is a code word. Indeed, if e_i is a vector with all zeros except a one in position i, then e_i H = h_i is the i-th row of H. We need e_i H \neq 0, so this imposes the condition that no row of H can be zero. It’s easy to see that this is sufficient for d \geq 2.

Likewise for d \geq 3, given a vector y = e_i + e_j for some positions i \neq j, then yH = h_i + h_j may not be zero. But because our sums are mod 2, saying that h_i + h_j \neq 0 is the same as saying h_i \neq h_j. Again it’s an if and only if. So we have the two conditions.

  • No row of H may be zero.
  • All rows of H must be distinct.

That is, any parity check matrix with those two properties defines a distance 3 linear code. The only question that remains is how large can n  be if the vectors have length n-k = l? That’s just the number of distinct nonzero binary strings of length l, which is 2^l - 1. Picking any way to arrange these strings as the rows of a matrix (say, in lexicographic order) gives you a good parity check matrix.

Theorem: For every l > 0, there is a (2^l - 1, 2^l - l - 1, 3)_2-code called the Hamming code.

Since the Hamming code has distance 3, we can always detect if at most a single error occurs. Moreover, we can correct a single error using the Hamming code. If x \in C and wt(e) = 1 is an error bit in position i, then the incoming message would be y = x + e. Now compute yH = xH + eH = 0 + eH = h_i and flip bit i of y. That is, whichever row of H you get tells you the index of the error, so you can flip the corresponding bit and correct it. If you order the rows lexicographically like we said, then h_i = i as a binary number. Very slick.

Before we move on, we should note one interesting feature of linear codes.

Definition: A code is called systematic if it can be realized by an encoding function that appends some number n-k “check bits” to the end of each message.

The interesting feature is that all linear codes are systematic. The reason is as follows. The generator matrix G of a linear code has as rows a basis for the code as a linear subspace. We can perform Gaussian elimination on G and get a new generator matrix that looks like [I \mid A] where I is the identity matrix of the appropriate size and A is some junk. The point is that encoding using this generator matrix leaves the message unchanged, and adds a bunch of bits to the end that are determined by A. It’s a different encoding function on \{ 0,1\}^k, but it has the same image in \{ 0,1 \}^n, i.e. the code is unchanged. Gaussian elimination just performed a change of basis.

If you work out the parameters of the Hamming code, you’ll see that it is a systematic code which adds \Theta(\log n) check bits to a message, and we’re able to correct a single error in this code. An obvious question is whether this is necessary? Could we get away with adding fewer check bits? The answer is no, and a simple “information theoretic” argument shows this. A single index out of n requires \log n bits to describe, and being able to correct a single error is like identifying a unique index. Without logarithmically many bits, you just don’t have enough information.

The Hamming bound and perfect codes

One nice fact about Hamming codes is that they optimize a natural problem: the problem of maximizing d given a fixed choice of n, k, and q. To get this let’s define V_n(r) denote the volume of a ball of radius r in the space \mathbb{F}_2^n. I.e., if you fix any string (doesn’t matter which) x, V_n(r) is the size of the set \{ y : d(x,y) \leq r \}, where d(x,y) is the hamming distance.

There is a theorem called the Hamming bound, which describes a limit to how much you can pack disjoint balls of radius r inside \mathbb{F}_2^n.

Theorem: If an (n,k,d)_2-code exists, then

\displaystyle 2^k V_n \left ( \left \lfloor \frac{d-1}{2} \right \rfloor \right ) \leq 2^n

Proof. The proof is quite simple. To say a code C has distance d means that for every string x \in C there is no other string y within Hamming distance d of x. In other words, the balls centered around both x,y of radius r = \lfloor (d-1)/2 \rfloor are disjoint. The extra difference of one is for odd d, e.g. when d=3 you need balls of radius 1 to guarantee no overlap. Now |C| = 2^k, so the total number of strings covered by all these balls is the left-hand side of the expression. But there are at most 2^n strings in \mathbb{F}_2^n, establishing the desired inequality.

\square

Now a code is called perfect if it actually meets the Hamming bound exactly. As you probably guessed, the Hamming codes are perfect codes. It’s not hard to prove this, and I’m leaving it as an exercise to the reader.

The obvious follow-up question is whether there are any other perfect codes. The answer is yes, some of which are nonlinear. But some of them are “trivial.” For example, when d=1 you can just use the identity encoding to get the code C = \mathbb{F}_2^n. You can also just have a code which consists of a single codeword. There are also some codes that encode by repeating the message multiple times. These are called “repetition codes,” and all three of these examples are called trivial (as a definition). Now there are some nontrivial and nonlinear perfect codes I won’t describe here, but here is the nice characterization theorem.

Theorem [van Lint ’71, Tietavainen ‘73]: Let C be a nontrivial perfect (n,d,k)_q code. Then the parameters must either be that of a Hamming code, or one of the two:

  • A (23, 12, 7)_2-code
  • A (11, 6, 5)_3-code

The last two examples are known as the binary and ternary Golay codes, respectively, which are also linear. In other words, every possible set of parameters for a perfect code can be realized as one of these three linear codes.

So this theorem was a big deal in coding theory. The Hamming and Golay codes were both discovered within a year of each other, in 1949 and 1950, but the nonexistence of other perfect linear codes was open for twenty more years. This wrapped up a very neat package.

Next time we’ll discuss the Singleton bound, which optimizes for a different quantity and is incomparable with perfect codes. We’ll define the Reed-Solomon and show they optimize this bound as well. These codes are particularly famous for being the error correcting codes used in DVDs. We’ll then discuss the algorithmic issues surrounding decoding, and more recent connections to complexity theory.

Until then!

Multiple Qubits and the Quantum Circuit

Last time we left off with the tantalizing question: how do you do a quantum “AND” operation on two qubits? In this post we’ll see why the tensor product is the natural mathematical way to represent the joint state of multiple qubits. Then we’ll define some basic quantum gates, and present the definition of a quantum circuit.

Working with Multiple Qubits

In a classical system, if you have two bits with values b_1, b_2, then the “joint state” of the two bits is given by the concatenated string b_1b_2. But if we have two qubits v, w, which are vectors in \mathbb{C}^2, how do we represent their joint state?

There are seemingly infinitely many things we could try, but let’s entertain the simplest idea for the sake of exercising our linear algebra intuition. The simplest idea is to just “concatenate” the vectors as one does in linear algebra: represent the joint system as (v, w) \in \mathbb{C}^2 \oplus \mathbb{C}^2. Recall that the direct sum of two vector spaces is just what you’d want out of “concatenation” of vectors. It treats the two components as completely independent of each other, and there’s an easy way to take any vector in the sum and decompose it into two vectors in the pieces.

Why does this fail to meet our requirements of qubits? Here’s one reason: (v, w) is not a unit vector when v and w are separately unit vectors. Indeed, \left \| (v,w) \right \|^2 = \left \| v \right \|^2 + \left \| w \right \|^2 = 2. We could normalize everything, and that would work for a while, but we would still run into problems. A better reason is that direct sums screw up measurement. In particular, if you have two qubits (and they’re independent, in a sense we’ll make clear later), you should be able to measure one without affecting the other. But if we use the direct sum method for combining qubits, then measuring one qubit would collapse the other! There are times when we want this to happen, but we don’t always want it to happen. Alas, there should be better reasons out there (besides, “physics says so”) but I haven’t come across them yet.

So the nice mathematical alternative is to make the joint state of two qubits v,w the tensor product v \otimes w. For a review of the basic properties of tensors and multilinear maps, see our post on the subject. Suffice it for now to remind the reader that the basis of the tensor space U \otimes V consists of all the tensors of the basis elements of the pieces U and V: u_i \otimes v_j. As such, the dimension of U \otimes V is the product of the dimensions \text{dim}(U) \text{dim}(V).

As a consequence of this and the fact that all \mathbb{C}-vector spaces of the same dimension are the same (isomorphic), the state space of a set of n qubits can be identified with \mathbb{C}^{2^n}. This is one way to see why quantum computing has the potential to be strictly more powerful than classical computing: n qubits provide a state space with 2^n coefficients, each of which is a complex number. With classical probabilistic computing we only get n “coefficients.” This isn’t a proof that quantum computing is more powerful, but a wink and a nudge that it could be.

While most of the time we’ll just write our states in terms of tensors (using the \otimes symbol), we could write out the vector representation of v \otimes w in terms of the vectors v = (v_1, v_2), w=(w_1, w_2). It’s just (v_1w_1, v_1w_2, v_2w_1, v_2w_2), with the obvious generalization to vectors of any dimension. This already fixes our earlier problem with norms: the norm of a tensor of two vectors is the product of the two norms. So tensors of unit vectors are unit vectors. Moreover, if you measure the first qubit, that just sets the v_1, v_2 above to zero or one, leaving a joint state that is still a valid

Likewise, given two linear maps A, B, we can describe the map A \otimes B on the tensor space both in terms of pure tensors ((A \otimes B)(v \otimes w) = Av \otimes Bw) and in terms of a matrix. In the same vein as the representation for vectors, the matrix corresponding to A \otimes B is

\displaystyle \begin{pmatrix}  a_{1,1}B & a_{1,2}B & \dots & a_{1,n}B \\  a_{2,1}B & a_{2,2}B & \dots & a_{2,n}B \\  \vdots & \vdots & \ddots & \vdots \\  a_{n,1}B & a_{n,2}B & \dots & a_{n,n}B  \end{pmatrix}

This is called the Kronecker product.

One of the strange things about tensor products, which very visibly manifests itself in “strange quantum behavior,” is that not every vector in a tensor space can be represented as a single tensor product of some vectors. Let’s work with an example: \mathbb{C}^2 \otimes \mathbb{C}^2, and denote by e_0, e_1 the computational basis vectors (the same letters are used for each copy of \mathbb{C}^2). Sometimes you’ll get a vector like

\displaystyle v = \frac{1}{\sqrt{2}} e_0 \otimes e_0 + \frac{1}{\sqrt{2}} e_1 \otimes e_0

And if you’re lucky you’ll notice that this can be factored and written as \frac{1}{\sqrt{2}}(e_0 + e_1) \otimes e_0. Other times, though, you’ll get a vector like

\displaystyle \frac{1}{\sqrt{2}}(e_0 \otimes e_0 + e_1 \otimes e_1)

And it’s a deep fact that this cannot be factored into a tensor product of two vectors (prove it as an exercise). If a vector v in a tensor space can be written as a single tensor product of vectors, we call v a pure tensor. Otherwise, using some physics lingo, we call the state represented by v entangled. So if you did the exercise you proved that not all tensors are pure tensors, or equivalently that there exist entangled quantum states. The latter sounds so much more impressive. We’ll see in a future post why these entangled states are so important in quantum computing.

Now we need to explain how to extend gates and qubit measurements to state spaces with multiple qubits. The first is easy: just as we often restrict our classical gates to a few bits (like the AND of two bits), we restrict multi-qubit quantum gates to operate on at most three qubits.

Definition: A quantum gate G is a unitary map \mathbb{C}^{2^n} \to \mathbb{C}^{2^n} where n is at most 3, (recall, (\mathbb{C}^2)^{\otimes 3} = \mathbb{C}^{2^3} is the state space for 3 qubits).

Now let’s see how to implement AND and OR for two qubits. You might be wondering why we need three qubits in the definition above, and, perhaps surprisingly, we’ll see that AND and OR require us to work with three qubits.

Because how would one compute an AND of two qubits? Taking a naive approach from how we did the quantum NOT, we would label e_0 as “false” and e_1 as “true,” and we’d want to map e_1 \otimes e_1 \mapsto e_1 and all other possibilities to e_0. The main problem is that this is not an invertible function! Remember, all quantum operations are unitary matrices and all unitary matrices have inverses, so we have to model AND and OR as an invertible operation. We also have a “type error,” since the output is not even in the same vector space as the input, but any way to fix that would still run into the invertibility problem.

The way to deal with this is to add an extra “scratch work” qubit that is used for nothing else except to make the operation invertible. So now say we have three qubits a, b, c, and we want to compute a AND b in the sensible way described above. What we do is map

\displaystyle a \otimes b \otimes c \mapsto a \otimes b \otimes (c \oplus (a \wedge b))

Here a \wedge b is the usual AND (where we interpret, e.g., e_1 \wedge e_0 = e_0), and \oplus is the exclusive or operation on bits. It’s clear that this mapping makes sense for “bits” (the true/false interpretation of basis vectors) and so we can extend it to a linear map by writing down the matrix.

quantum-AND

This gate is often called the Toffoli gate by physicists, but we’ll just call it the (quantum) AND gate. Note that the column ijk represents the input e_i \otimes e_j \otimes e_k, and the 1 in that column denotes the row whose label is the output. In particular, if we want to do an AND then we’ll ensure the “scratch work” qubit is e_0, so we can ignore half the columns above where the third qubit is 1. The reader should write down the analogous construction for a quantum OR.

From now on, when we’re describing a basis state like e_1 \otimes e_0 \otimes e_1, we’ll denote it as e_{101}, and more generally when i is a nonnegative integer or a binary string we’ll denote the basis state as e_i. We’re taking advantage of the correspondence between the 2^n binary strings and the 2^n basis states, and it compactifies notation.

Once we define a quantum circuit, it will be easy to show that using quantum AND’s, OR’s and NOT’s, we can achieve any computation that a classical circuit can.

We have one more issue we’d like to bring up before we define quantum circuits. We’re being a bit too slick when we say we’re working with “at most three qubits.” If we have ten qubits, potentially all entangled up in a weird way, how can we apply a mapping to only some of those qubits? Indeed, we only defined AND for \mathbb{C}^8, so how can we extend that to an AND of three qubits sitting inside any \mathbb{C}^{2^n} we please? The answer is to apply the Kronecker product with the identity matrix appropriately. Let’s do a simple example of this to make everything stick.

Say I want to apply the quantum NOT gate to a qubit v, and I have four other qubits w_1, w_2, w_3, w_4 so that they’re all in the joint state x = v \otimes w_1 \otimes w_2 \otimes w_3 \otimes w_4. I form the NOT gate, which I’ll call A, and then I apply the gate A \otimes I_{2^4} to x (since there are 4 of the w_i). This will compute the tensor Av \otimes I_2 w_1 \otimes I_2 w_2 \otimes I_2 w_3 \otimes I_2 w_4, as desired.

In particular, you can represent a gate that depends on only 3 qubits by writing down the 3×3 matrix and the three indices it operates on. Note that this requires only 12 (possibly complex) numbers to write down, and so it takes “constant space” to represent a single gate.

Quantum Circuits

Here we are at the definition of a quantum circuit.

Definition: quantum circuit is a list G_1, \dots, G_T of 2^m \times 2^m unitary matrices, such that each G_i depends on at most 3 qubits.

We’ll write down what it means to “compute” something with a quantum circuit, but for now we can imagine drawing it like a usual circuit. We write the input state as some unit vector x \in C^{2^n} (which may or may not be a pure tensor), each qubit making up the vector is associated to a “wire,” and at each step we pick three of the wires, send them to the next quantum gate G_i, and use the three output wires for further computations. The final output is the matrix product applied to the input G_T \dots G_1x. We imagine that each gate takes only one step to compute (recall, in our first post one “step” was a photon flying through a special material, so it’s not like we have to multiply these matrices by hand).

So now we have to say how a quantum circuit could solve a problem. At all levels of mathematical maturity we should have some idea how a regular circuit solves a problem: there is some distinguished output wire or set of wires containing the answer. For a quantum circuit it’s basically the same, except that at the end of the circuit we get a single quantum state (a tensor in this big vector space), and we just measure that state. Like the case of a single qubit, if the vector has coordinates x = (x_1, \dots, x_{2^n}), they must satisfy \sum_i |x_i|^2 = 1, and the probability of the measurement producing index j is |x_j|^2. The result of that measurement is an integer (some classical bits) that represent our answer. As a side effect, the vector x is mutated into the basis state e_j. As we’ve said we may need to repeat a quantum computation over and over to get a good answer with high probability, so we can imagine that a quantum circuit is used as some subroutine in a larger (otherwise classical) algorithm that allows for pre- and post-processing on the quantum part.

The final caveat is that we allow one to include as many scratchwork qubits as one needs in their circuit. This makes it possible already to simulate any classical circuit using a quantum circuit. Let’s prove it as a theorem.

Theorem: Given a classical circuit C with a single output bit, there is a quantum circuit D that computes the same function.

Proof. Let x be a binary string input to C, and suppose that C has s gates g_1, \dots, g_s, each being either AND, OR, or NOT, and with g_s being the output gate. To construct D, we can replace every g_i with their quantum counterparts G_i. Recall that this takes e_{b_1b_20} \mapsto e_{b_1b_2(g_i(b_1, b_2))}. And so we need to add a single scratchwork qubit for each one (really we only need it for the ANDs and ORs, but who cares). This means that our start state is e_{x} \otimes e_{0^s} = e_{x0^s}. Really, we need one of these gates G_i for each wire going out of the classical gate g_i, but with some extra tricks one can do it with a single quantum gate that uses multiple scratchwork qubits. The crucial thing to note is that the state vector is always a basis vector!

If we call z the contents of all the scratchwork after the quantum circuit described above runs and z_0 the initial state of the scratchwork, then what we did was extend the function x \mapsto C(x) to a function e_{xz_0} \mapsto e_{xz}. In particular, one of the bits in the z part is the output of the last gate of C, and everything is 0-1 valued. So we can measure the state vector, get the string xz and inspect the bit of z which corresponds to the output wire of the final gate of the original circuit C. This is your answer.

\square

It should be clear that the single output bit extends to the general case easily. We can split a circuit with lots of output bits into a bunch of circuits with single output bits in the obvious way and combine the quantum versions together.

Next time we’ll finally look at our first quantum algorithms. And along the way we’ll see some more significant quantum operations that make use of the properties that make the quantum world interesting. Until then!

The Quantum Bit

The best place to start our journey through quantum computing is to recall how classical computing works and try to extend it. Since our final quantum computing model will be a circuit model, we should informally discuss circuits first.

A circuit has three parts: the “inputs,” which are bits (either zero or one); the “gates,” which represent the lowest-level computations we perform on bits; and the “wires,” which connect the outputs of gates to the inputs of other gates. Typically the gates have one or two input bits and one output bit, and they correspond to some logical operation like AND, NOT, or XOR.

A simple example of a circuit.

A simple example of a circuit. The V’s are “OR” and the Λ’s are “AND.” Image source: Ryan O’Donnell

If we want to come up with a different model of computing, we could start regular circuits and generalize some or all of these pieces. Indeed, in our motivational post we saw a glimpse of a probabilistic model of computation, where instead of the inputs being bits they were probabilities in a probability distribution, and instead of the gates being simple boolean functions they were linear maps that preserved probability distributions (we called such a matrix “stochastic”).

Rather than go through that whole train of thought again let’s just jump into the definitions for the quantum setting. In case you missed last time, our goal is to avoid as much physics as possible and frame everything purely in terms of linear algebra.

Qubits are Unit Vectors

The generalization of a bit is simple: it’s a unit vector in \mathbb{C}^2. That is, our most atomic unit of data is a vector (a,b) with the constraints that a,b are complex numbers and |a|^2 + |b|^2 = 1. We call such a vector a qubit.

A qubit can assume “binary” values much like a regular bit, because you could pick two distinguished unit vectors, like (1,0) and (0,1), and call one “zero” and the other “one.” Obviously there are many more possible unit vectors, such as \frac{1}{\sqrt{2}}(1, 1) and (-i,0). But before we go romping about with what qubits can do, we need to understand how we can extract information from a qubit. The definitions we make here will motivate a lot of the rest of what we do, and is in my opinion one of the major hurdles to becoming comfortable with quantum computing.

A bittersweet fact of life is that bits are comforting. They can be zero or one, you can create them and change them and read them whenever you want without an existential crisis. The same is not true of qubits. This is a large part of what makes quantum computing so weird: you can’t just read the information in a qubit! Before we say why, notice that the coefficients in a qubit are complex numbers, so being able to read them exactly would potentially encode an infinite amount of information (in the infinite binary expansion)! Not only would this be an undesirably powerful property of a circuit, but physicists’ experiments tell us it’s not possible either.

So as we’ll see when we get to some algorithms, the main difficulty in getting useful quantum algorithms is not necessarily figuring out how to compute what you want to compute, it’s figuring out how to tease useful information out of the qubits that otherwise directly contain what you want. And the reason it’s so hard is that when you read a qubit, most of the information in the qubit is destroyed. And what you get to see is only a small piece of the information available. Here is the simplest example of that phenomenon, which is called the measurement in the computational basis.

Definition: Let v = (a,b) \in \mathbb{C}^2 be a qubit. Call the standard basis vectors e_0 = (1,0), e_1 = (0,1) the computational basis of \mathbb{C}^2. The process of measuring v in the computational basis consists of two parts.

  1. You observe (get as output) a random choice of e_0 or e_1. The probability of getting e_0 is |a|^2, and the probability of getting e_1 is |b|^2.
  2. As a side effect, the qubit v instantaneously becomes whatever state was observed in 1. This is often called a collapse of the waveform by physicists.

There are more sophisticated ways to measure, and more sophisticated ways to express the process of measurement, but we’ll cover those when we need them. For now this is it.

Why is this so painful? Because if you wanted to try to estimate the probabilities |a|^2 or |b|^2, not only would you get an estimate at best, but you’d have to repeat whatever computation prepared v for measurement over and over again until you get an estimate you’re satisfied with. In fact, we’ll see situations like this, where we actually have a perfect representation of the data we need to solve our problem, but we just can’t get at it because the measurement process destroys it once we measure.

Before we can talk about those algorithms we need to see how we’re allowed to manipulate qubits. As we said before, we use unitary matrices to preserve unit vectors, so let’s recall those and make everything more precise.

Qubit Mappings are Unitary Matrices

Suppose v = (a,b) \in \mathbb{C}^2 is a qubit. If we are to have any mapping between vector spaces, it had better be a linear map, and the linear maps that send unit vectors to unit vectors are called unitary matrices. An equivalent definition that seems a bit stronger is:

Definition: A linear map \mathbb{C}^2 \to \mathbb{C}^2 is called unitary if it preserves the inner product on \mathbb{C}^2.

Let’s remember the inner product on \mathbb{C}^n is defined by \left \langle v,w \right \rangle = \sum_{i=1}^n v_i \overline{w_i} and has some useful properties.

  • The square norm of a vector is \left \| v \right \|^2 = \left \langle v,v \right \rangle.
  • Swapping the coordinates of the complex inner product conjugates the result: \left \langle v,w \right \rangle = \overline{\left \langle w,v \right \rangle}
  • The complex inner product is a linear map if you fix the second coordinate, and a conjugate-linear map if you fix the first. That is, \left \langle au+v, w \right \rangle = a \left \langle u, w \right \rangle + \left \langle v, w \right \rangle and \left \langle u, aw + v \right \rangle = \overline{a} \left \langle u, w \right \rangle + \left \langle u,v \right \rangle

By the first bullet, it makes sense to require unitary matrices to preserve the inner product instead of just the norm, though the two are equivalent (see the derivation on page 2 of these notes). We can obviously generalize unitary matrices to any complex vector space, and unitary matrices have some nice properties. In particular, if U is a unitary matrix then the important property is that the columns (and rows) of U form an orthonormal basis. As an immediate result, if we take the product U\overline{U}^\text{T}, which is just the matrix of all possible inner products of columns of U, we get the identity matrix. This means that unitary matrices are invertible and their inverse is \overline{U}^\text{T}.

Already we have one interesting philosophical tidbit. Any unitary transformation of a qubit is reversible because all unitary matrices are invertible. Apparently the only non-reversible thing we’ve seen so far is measurement.

Recall that \overline{U}^\text{T} is the conjugate transpose of the matrix, which I’ll often write as U^*. Note that there is a way to define U^* without appealing to matrices: it is a notion called the adjoint, which is that linear map U^* such that \left \langle Uv, w \right \rangle = \left \langle v, U^*w \right \rangle for all v,w. Also recall that “unitary matrix” for complex vector spaces means precisely the same thing as “orthogonal matrix” does for real numbers. The only difference is the inner product being used (indeed, if the complex matrix happens to have real entries, then orthogonal matrix and unitary matrix mean the same thing).

Definition: single qubit gate is a unitary matrix \mathbb{C}^2 \to \mathbb{C}^2.

So enough with the properties and definitions, let’s see some examples. For all of these examples we’ll fix the basis to the computational basis e_0, e_1. One very important, but still very simple example of a single qubit gate is the Hadamard gate. This is the unitary map given by the matrix

\displaystyle \frac{1}{\sqrt{2}}\begin{pmatrix}  1 & 1 \\  1 & -1  \end{pmatrix}

It’s so important because if you apply it to a basis vector, say, e_0 = (1,0), you get a uniform linear combination \frac{1}{\sqrt{2}}(e_1 + e_2). One simple use of this is to allow for unbiased coin flips, and as readers of this blog know unbiased coins can efficiently simulate biased coins. But it has many other uses we’ll touch on as they come.

Just to give another example, the quantum NOT gate, often called a Pauli X gate, is the following matrix

\displaystyle \begin{pmatrix}  0 & 1 \\  1 & 0  \end{pmatrix}

It’s called this because, if we consider e_0 to be the “zero” bit and e_1 to be “one,” then this mapping swaps the two. In general, it takes (a,b) to (b,a).

As the reader can probably imagine by the suggestive comparison with classical operations, quantum circuits can do everything that classical circuits can do. We’ll save the proof for a future post, but if we want to do some kind of “quantum AND” operation, we get an obvious question. How do you perform an operation that involves multiple qubits? The short answer is: you represent a collection of bits by their tensor product, and apply a unitary matrix to that tensor.

We’ll go into more detail on this next time, and in the mean time we suggest checking out this blog’s primer on the tensor product. Until then!