A Proofless Introduction to Information Theory

There are two basic problems in information theory that are very easy to explain. Two people, Alice and Bob, want to communicate over a digital channel over some long period of time, and they know the probability that certain messages will be sent ahead of time. For example, English language sentences are more likely than gibberish, and “Hi” is much more likely than “asphyxiation.” The problems are:

  1. Say communication is very expensive. Then the problem is to come up with an encoding scheme for the messages which minimizes the expected length of an encoded message and guarantees the ability to unambiguously decode a message. This is called the noiseless coding problem.
  2. Say communication is not expensive, but error prone. In particular, each bit i of your message is erroneously flipped with some known probability p, and all the errors are independent. Then the question is, how can one encode their messages so as to guarantee (with high probability) the ability to decode any sent message? This is called the noisy coding problem.

There are actually many models of “communication with noise” that generalize (2), such as models based on Markov chains. We are not going to cover them here.

Here is a simple example for the noiseless problem. Say you are just sending binary digits as your messages, and you know that the string “00000000” (eight zeros) occurs half the time, and all other eight-bit strings are equally likely the rest of the time. It would make sense, then, to encode the “eight zeros” string as a 0, and prefix all other strings with a 1 to distinguish them from zero. You would save on average 7 \cdot 1/2 + (-1) \cdot 1/2 = 3 bits in every message: seven bits whenever the all-zeros string is sent, at the cost of one extra bit otherwise.
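To double-check the arithmetic, here is a small Python sketch that computes the expected encoded length of this toy scheme exactly. The function name expected_length and its parameters are made up for illustration.

```python
from fractions import Fraction

def expected_length(p_zero=Fraction(1, 2), n_bits=8):
    """Expected encoded length of the toy scheme: the all-zeros string
    becomes the single bit 0, and every other string gets a 1 prefix."""
    len_zero = 1              # the codeword "0"
    len_other = 1 + n_bits    # "1" followed by the original n_bits bits
    return p_zero * len_zero + (1 - p_zero) * len_other

# Unencoded messages always cost 8 bits, so the saving is the difference.
saving = 8 - expected_length()
print(saving)  # 3
```

Changing p_zero shows how quickly the saving shrinks as the all-zeros string becomes less common.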

One amazing thing about these two problems is that they were posed and solved in the same paper by Claude Shannon in 1948. One byproduct of his work was the notion of entropy, which in this context measures the “information content” of a message, or the expected “compressibility” of a single bit under the best encoding. For the extremely dedicated reader of this blog, note this differs from Kolmogorov complexity in that we’re not analyzing the compressibility of a string by itself, but rather when compared to a distribution. So really we should think of (the domain of) the distribution as being compressed, not the string.

Claude Shannon. Image credit: Wikipedia

Entropy and noiseless encoding

Before we can state Shannon’s theorems we have to define entropy.

Definition: Suppose D is a distribution on a finite set X, and I’ll use D(x) to denote the probability of drawing x from D. The entropy of D, denoted H(D), is defined (with logarithms taken base 2, so that entropy is measured in bits) as

H(D) = \sum_{x \in X} D(x) \log \frac{1}{D(x)}

It is strange to think about this sum in the abstract, so let’s suppose D is a biased coin flip with bias 0 \leq p \leq 1 of landing heads. Then we can plot the entropy as follows.

Image source: Wikipedia


The horizontal axis is the bias p, and the vertical axis is the value of H(D), which with some algebra is - p \log p - (1-p) \log (1-p). From the graph above we can see that the entropy is maximized when p=1/2 and minimized at p=0, 1. You can verify all of this with calculus, and you can prove that the uniform distribution maximizes entropy in general as well.
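If you would rather check this numerically than with calculus, here is a minimal sketch. It assumes logarithms base 2 and the usual convention that outcomes of probability zero contribute nothing to the sum; the function names are mine.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as a list of
    probabilities. Zero-probability outcomes are skipped (0 * log(1/0) := 0)."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

def coin_entropy(p):
    """Entropy of a coin that lands heads with probability p."""
    return entropy([p, 1 - p])

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p, round(coin_entropy(p), 4))
```

The printed values are symmetric around p = 1/2, where the entropy reaches its maximum of one bit.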

So what is this saying? High entropy means something is hard to compress, and low entropy gives us lots of compressibility. Indeed, if our message consisted of the results of 10 such coin flips, and p was close to 1, we would be able to compress a lot by encoding strings with lots of 1’s (heads) using few bits. On the other hand, if p=1/2 we couldn’t get any compression at all. All strings would be equally likely.

Shannon’s famous theorem shows that the entropy of the distribution is actually all that matters. Some quick notation: \{ 0,1 \}^* is the set of all binary strings.

Theorem (Noiseless Coding Theorem) [Shannon 1948]: For every finite set X and distribution D over X, there are encoding and decoding functions \textup{Enc}: X \to \{0,1 \}^*, \textup{Dec}: \{ 0,1 \}^* \to X such that

  1. The encoding/decoding actually works, i.e. \textup{Dec}(\textup{Enc}(x)) = x for all x.
  2. The expected length of an encoded message is between H(D) and H(D) + 1.

Moreover, no encoding scheme can do better.

Item 2 and the last sentence are the magical parts. In other words, if you know your distribution over messages, you precisely know how long to expect your messages to be. And you know that you can’t hope to do any better!
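To see the bound in action, here is a sketch that builds a Huffman code and compares its expected length with the entropy. To be clear, Huffman coding is a later construction that I am swapping in for illustration, not the encoding from Shannon’s paper, and the distributions below are made up.

```python
import heapq
import math

def huffman_lengths(dist):
    """Return {symbol: codeword length} for a Huffman code on dist,
    a dict mapping symbols to probabilities."""
    if len(dist) == 1:
        return {next(iter(dist)): 1}
    # Heap entries: (probability, tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {x: 0}) for i, (x, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {x: d + 1 for x, d in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def entropy(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

D = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_lengths(D)
avg = sum(D[x] * lengths[x] for x in D)
print(lengths, avg, entropy(D))
```

For this particular D every probability is a power of two, so the expected length matches the entropy exactly; for other distributions it lands somewhere in the window between H(D) and H(D) + 1.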

As the title of this post says, we aren’t going to give a proof here. Wikipedia has a proof if you’re really interested in the details.

Noisy Coding

The noisy coding problem is more interesting because in a certain sense (that was not solved by Shannon) it is still being studied today in the field of coding theory. The interpretation of the noisy coding problem is that you want to be able to recover from white noise errors introduced during transmission. The concept is called error correction. To restate what we said earlier, we want to recover from error with probability asymptotically close to 1, where the probability is over the errors.

It should be intuitively clear that you can’t do so without your encoding “blowing up” the length of the messages. Indeed, if your encoding does not blow up the message length then a single error will confound you since many valid messages would differ by only a single bit. So the question is does such an encoding exist, and if so how much do we need to blow up the message length? Shannon’s second theorem answers both questions.

Theorem (Noisy Coding Theorem) [Shannon 1948]: For any constant noise rate p < 1/2, there is an encoding scheme \textup{Enc} : \{ 0,1 \}^k \to \{0,1\}^{ck}, \textup{Dec} : \{ 0,1 \}^{ck} \to \{ 0,1\}^k with the following property. If x is the message sent by Alice, and y is the message received by Bob (i.e. \textup{Enc}(x) with random noise), then \Pr[\textup{Dec}(y) = x] \to 1 as a function of n=ck. In addition, if we denote by H(p) the entropy of the distribution of an error on a single bit, then choosing any c > \frac{1}{1-H(p)} guarantees the existence of such an encoding scheme, and no scheme exists for any smaller c.

This theorem formalizes a “yes” answer to the noisy coding problem, but moreover it characterizes the blowup needed for such a scheme to exist. The deep fact is that it only depends on the noise rate.
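To get a feel for the numbers, here is a small sketch that evaluates the threshold 1/(1-H(p)) from the theorem for a few noise rates. The helper names are mine, and the entropy is taken base 2.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a single bit that is flipped with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_blowup(p):
    """The threshold 1 / (1 - H(p)); any c strictly above it suffices.
    Only meaningful for noise rates p < 1/2."""
    return 1 / (1 - binary_entropy(p))

for p in [0.01, 0.1, 0.25, 0.4]:
    print(p, round(min_blowup(p), 3))
```

The required blowup grows without bound as p approaches 1/2, where a received bit carries no information about the bit that was sent.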

A word about the proof: it’s probabilistic. That is, Shannon proved such an encoding scheme exists by picking \textup{Enc} to be a random function (!). Then \textup{Dec}(y) finds (nonconstructively) the string x such that the number of bits different between \textup{Enc}(x) and y is minimized. This “number of bits that differ” measure is called the Hamming distance. Then he showed using relatively standard probability tools that this scheme has the needed properties with high probability, the implication being that some scheme has to exist for such a probability to even be positive. The sharp threshold for c takes a bit more work. If you want the details, check out the first few lectures of Madhu Sudan’s MIT class.
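Here is a toy version of that scheme, just to make the moving parts concrete. It is a sketch under assumptions of my own: the parameters are tiny and arbitrary, the codewords are sampled without repetition (a small departure from a truly random function), and the decoder is a brute-force search that would be hopeless for large k.

```python
import random

def hamming(a, b):
    """Number of bit positions where the integers a and b differ."""
    return bin(a ^ b).count("1")

def random_code(k, n, rng):
    """Assign a distinct random n-bit codeword to each of the 2**k messages."""
    return rng.sample(range(2 ** n), 2 ** k)

def add_noise(word, n, p, rng):
    """Flip each of the n bits independently with probability p."""
    for i in range(n):
        if rng.random() < p:
            word ^= 1 << i
    return word

def decode(received, code):
    """Return the message whose codeword is closest in Hamming distance."""
    return min(range(len(code)), key=lambda x: hamming(code[x], received))

rng = random.Random(0)
k, n, p = 4, 32, 0.05   # hypothetical parameters, chosen only for the demo
code = random_code(k, n, rng)
trials = 1000
ok = sum(decode(add_noise(code[x], n, p, rng), code) == x
         for x in (rng.randrange(2 ** k) for _ in range(trials)))
print(ok / trials)  # fraction of messages decoded correctly
```

The exponential cost of decode is exactly the kind of thing that makes the algorithmic questions interesting.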

The non-algorithmic nature of his solution is what opened the door to more research. The question has moved beyond “Are there any encodings that work?” to the more interesting “What is the algorithmic cost of constructing such an encoding?” It became a question of complexity, not computability. Moreover, the guarantees people wanted were strengthened to worst case guarantees. In other words, if I can guarantee at most 12 errors, is there an encoding scheme that will allow me to always recover the original message, and not just with high probability? One can imagine that if your message contains nuclear codes or your bank balance, you’d definitely want to have 100% recovery ability.

Indeed, two years later Richard Hamming spawned the theory of error correcting codes and defined codes that can always correct a single error. This theory has expanded and grown over the last sixty years, and these days the algorithmic problems of coding theory have deep connections to most areas of computer science, including learning theory, cryptography, and quantum computing.

We’ll cover Hamming’s basic codes next time, and then move on to Reed-Solomon codes and others. Until then!

Zero-One Laws for Random Graphs

Last time we saw a number of properties of graphs, such as connectivity, where the probability that an Erdős–Rényi random graph G(n,p) satisfies the property is asymptotically either zero or one. And this zero or one depends on whether the parameter p is above or below a universal threshold (that depends only on n and the property in question).

To remind the reader, the Erdős–Rényi random “graph” G(n,p) is a distribution over graphs that you draw from by including each edge independently with probability p. Last time we saw that the existence of an isolated vertex has a sharp threshold at (\log n) / n, meaning if p is asymptotically smaller than the threshold there will be isolated vertices with probability tending to 1, and if p is asymptotically larger there will be no isolated vertices with probability tending to 1. We also gave a laundry list of other properties with such thresholds.
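As a reminder of what drawing from G(n,p) looks like in practice, here is a rough simulation sketch. The function names, the choice of n, and the multiples of the threshold are all mine, and with n this small the transition is only suggestive.

```python
import itertools
import math
import random

def sample_gnp(n, p, rng):
    """Draw a graph from G(n, p) as a set of edges (u, v) with u < v."""
    return {e for e in itertools.combinations(range(n), 2) if rng.random() < p}

def has_isolated_vertex(n, edges):
    """True if some vertex in range(n) touches no edge."""
    touched = {v for e in edges for v in e}
    return len(touched) < n

def isolated_frequency(n, p, trials, rng):
    """Fraction of sampled graphs that contain an isolated vertex."""
    hits = sum(has_isolated_vertex(n, sample_gnp(n, p, rng)) for _ in range(trials))
    return hits / trials

rng = random.Random(1)
n = 100
threshold = math.log(n) / n
for factor in [0.25, 1.0, 4.0]:
    print(factor, isolated_frequency(n, factor * threshold, 50, rng))
```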

One might want to study this phenomenon in general. Even if we might not be able to find all the thresholds we want for a given property, can we classify which properties have thresholds and which do not?

The answer turns out to be mostly yes! For large classes of properties, there are proofs that say things like, “either this property holds with probability tending to one, or it holds with probability tending to zero.” These are called “zero-one laws,” and they’re sort of meta theorems. We’ll see one such theorem in this post relating to constant edge-probabilities in random graphs, and we’ll remark on another at the end.

Sentences about graphs in first order logic

A zero-one law generally works by defining a class of properties, and then applying a generic first/second moment-type argument to every property in the class.

So first we define what kinds of properties we’ll discuss. We’ll pick a large class: anything that can be expressed in first-order logic in the language of graphs. That is, any finite logical statement that uses existential and universal quantifiers over variables, and whose only relation (test) is whether an edge exists between two vertices. We’ll call this test e(x,y). So you write some sentence P in this language, and you take a graph G, and you can ask P(G) = 1, whether the graph satisfies the sentence.

This seems like a really large class of properties, and it is, but let’s think carefully about what kinds of properties can be expressed this way. Clearly the existence of a triangle can be written this way, it’s just the sentence

\exists x,y,z : e(x,y) \wedge e(y,z) \wedge e(x,z)

I’m using \wedge for AND, and \vee for OR, and \neg for NOT. Similarly, one can express the existence of a clique of size k, or the existence of an independent set of size k, or a path of a fixed length, or whether there is a vertex of maximal degree n-1.
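On a finite graph you can evaluate such a sentence by brute force, turning each \exists into any and each \forall into all. The sketch below does this for the triangle sentence and for the existence of a vertex of degree n-1. It assumes graphs without self-loops, and the second sentence uses equality of vertices, which I am assuming is available in the language as is standard in first-order logic.

```python
import itertools

def make_e(edges):
    """Build the edge test e(x, y) for an undirected graph without self-loops."""
    sym = set(edges) | {(v, u) for (u, v) in edges}
    return lambda x, y: (x, y) in sym

def has_triangle(vertices, e):
    """Evaluate: exists x, y, z with e(x,y) and e(y,z) and e(x,z)."""
    return any(e(x, y) and e(y, z) and e(x, z)
               for x, y, z in itertools.product(vertices, repeat=3))

def has_full_degree_vertex(vertices, e):
    """Evaluate: exists x such that for all y, (y = x) or e(x, y)."""
    return any(all(y == x or e(x, y) for y in vertices) for x in vertices)

triangle = make_e([(0, 1), (1, 2), (0, 2)])
path = make_e([(0, 1), (1, 2), (2, 3)])
print(has_triangle(range(3), triangle), has_triangle(range(4), path))
```

Note that the running time is n to the number of quantified variables, which is fine here because the sentence is fixed while n grows.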

Here’s a question: can we write a formula which will be true for a graph if and only if it’s connected? Well such a formula seems like it would have to know about how many vertices there are in the graph, so it could say something like “for all x,y there is a path from x to y.” It seems like you’d need a family of such formulas that grows with n to make anything work. But this isn’t a proof; the question remains whether there is some other tricky way to encode connectivity.

But as it turns out, connectivity is not a property you can express in first-order logic. We won’t prove it here, but we will note at the end of the article that connectivity is in a different class of properties that you can prove has a similar zero-one law.

The zero-one law for first order logic

So the theorem about first-order expressible sentences is as follows.

Theorem: Let P be a property of graphs that can be expressed in the first order language of graphs (with the e(x,y) relation). Then for any constant p, the probability that P holds in G(n,p) has a limit of zero or one as n \to \infty.

Proof. We’ll prove the simpler case of p=1/2, but the general case is analogous. Given a graph G drawn from G(n,p), what we’ll do is define a countably infinite family of first-order sentences \varphi_{k,l}, and argue that they form a sort of “basis” for all first-order sentences about graphs.

First let’s describe the \varphi_{k,l}. For any k,l \in \mathbb{N}, the sentence will assert that for every set of k vertices and every disjoint set of l vertices, there is some other vertex connected to each of the first k but to none of the last l.

\displaystyle \varphi_{k,l} : \forall x_1, \dots, x_k, y_1, \dots, y_l \exists z : \\ e(z,x_1) \wedge \dots \wedge e(z,x_k) \wedge \neg e(z,y_1) \wedge \dots \wedge \neg e(z,y_l).

In other words, these formulas encapsulate every possible incidence pattern for a single vertex. It is a strange set of formulas, but they have a very nice property we’re about to get to. So for a fixed \varphi_{k,l}, what is the probability that it’s false on n vertices? We want to give an upper bound and hence show that the formula is true with probability approaching 1. That is, we want to show that all the \varphi_{k,l} are true with probability tending to 1.

Computing the probability: we have \binom{n}{k} \binom{n-k}{l} possibilities to choose these sets, and the probability that some other fixed vertex z has the good connections is 2^{-(k+l)}, so the probability z is not good is 1 - 2^{-(k+l)}. The remaining n - (k + l) vertices are good or bad independently of each other, so the probability that none of them is good is this quantity raised to the power n - (k + l). Taking a union bound over all choices of the two sets gives an upper bound on the probability of \varphi_{k,l} being false of:

\displaystyle \binom{n}{k}\binom{n-k}{l} (1-2^{-k-l})^{n-k-l}

And k, l are constant, so the left two terms are polynomials while the rightmost term is an exponentially small function, and this implies that the whole expression tends to zero, as desired.
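It can be reassuring to plug in numbers. The sketch below evaluates the bound \binom{n}{k}\binom{n-k}{l}(1 - 2^{-(k+l)})^{n-k-l} for fixed k and l and growing n; the function name and the sample values are mine.

```python
from math import comb

def failure_bound(n, k, l):
    """Upper bound on Pr[phi_{k,l} is false in G(n, 1/2)]:
    C(n, k) * C(n-k, l) * (1 - 2^-(k+l))^(n-k-l)."""
    return comb(n, k) * comb(n - k, l) * (1 - 2.0 ** -(k + l)) ** (n - k - l)

for n in [10, 100, 1000, 10000]:
    print(n, failure_bound(n, 2, 2))
```

For small n the bound is larger than 1 and so says nothing, but once the exponential term kicks in it collapses toward zero very quickly.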

Break from proof.

A bit of model theory

So what we’ve proved so far is that the probability of every formula of the form \varphi_{k,l} being satisfied in G(n,1/2) tends to 1.

Now look at the set of all such formulas

\displaystyle \Phi = \{ \varphi_{k,l} : k,l \in \mathbb{N} \}

We ask: is there any graph which satisfies all of these formulas? Certainly it cannot be finite, because a finite graph would not be able to satisfy formulas with sufficiently large values of l, k > n. But indeed, there is a countably infinite graph that works. It’s called the Rado graph, pictured below.

The Rado graph.

The Rado graph has some really interesting properties, such as that it contains every finite and countably infinite graph as induced subgraphs. Basically this means, as far as countably infinite graphs go, it’s the big momma of all graphs. It’s the graph in a very concrete sense of the word. It satisfies all of the formulas in \Phi, and in fact it’s uniquely determined by this, meaning that if any other countably infinite graph satisfies all the formulas in \Phi, then that graph is isomorphic to the Rado graph.
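One standard way to write down the Rado graph explicitly (not something we need for the proof) takes the natural numbers as vertices and joins a < b exactly when bit a in the binary expansion of b is 1. Under that assumption, the sketch below constructs the witness z demanded by a formula \varphi_{k,l} for any two disjoint finite sets; the function names are mine.

```python
def rado_edge(a, b):
    """Edge test for the bit-based construction of the Rado graph:
    distinct a < b are adjacent iff bit a of b is 1."""
    if a == b:
        return False
    lo, hi = min(a, b), max(a, b)
    return ((hi >> lo) & 1) == 1

def witness(xs, ys):
    """Return a vertex z adjacent to every x in xs and to no y in ys.
    Assumes xs and ys are disjoint finite sets of natural numbers."""
    # One extra high bit makes z larger than everything in xs and ys,
    # so adjacency to z is decided by the bits of z alone.
    m = max(set(xs) | set(ys), default=0) + 1
    return sum(1 << x for x in set(xs)) + (1 << m)

xs, ys = {0, 3, 5}, {1, 2, 7}
z = witness(xs, ys)
print(z, all(rado_edge(z, x) for x in xs), not any(rado_edge(z, y) for y in ys))
# 297 True True
```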

But for our purposes (proving a zero-one law), there’s a better perspective than graph theory on this object. In the logic perspective, the set \Phi is called a theory, meaning a set of statements that you consider “axioms” in some logical system. And we’re asking whether there is any model realizing the theory. That is, is there some logical system with a semantic interpretation (some mathematical object based on numbers, or sets, or whatever) that satisfies all the axioms?

A good analogy comes from the rational numbers, because they satisfy a similar property among all ordered sets. In fact, the rational numbers are the unique countable, ordered set with the property that it has no biggest/smallest element and is dense. That is, in the ordering there is always another element between any two elements you want. So the theorem says if you have two countable sets with these properties, then they are actually isomorphic as ordered sets, and they are isomorphic to the rational numbers.

So, while we won’t prove that the Rado graph is a model for our theory \Phi, we will use that fact to great benefit. One consequence of having a theory with a model is that the theory is consistent, meaning it can’t imply any contradictions. Another fact is that this theory \Phi is complete. Completeness means that any formula or its negation is logically implied by the theory. Note these are syntactical implications (using standard rules of first-order logic), and have nothing to do with the model interpreting the theory.

The proof that \Phi is complete actually follows from the uniqueness of the Rado graph as the only countable model of \Phi. Suppose the contrary, that \Phi is not complete. Then there has to be some formula \psi that is not provable, and whose negation is also not provable, starting from \Phi. Now extend \Phi in two ways: by adding \psi and by adding \neg \psi. Both of the new theories are still countable, and by a theorem from logic this means they both still have countable models. But both of these new models are also countable models of \Phi, so they have to both be the Rado graph. But this is very embarrassing for them, because we assumed they disagree on the truth of \psi.

So now we can go ahead and prove the zero-one law theorem.

Return to proof.

Take an arbitrary property \varphi expressible in the first-order language of graphs (it need not be in \Phi). Either \varphi or its negation can be derived from \Phi. Without loss of generality suppose it’s \varphi. Take all the formulas from the theory you need to derive \varphi, and note that since it is a finite proof in first-order logic you will only need finitely many such \varphi_{k,l}. Now look at the probabilities of the \varphi_{k,l}: they are all true with probability tending to 1, so by a union bound over these finitely many formulas they all hold simultaneously with probability tending to 1, and hence the implied statement of the proof of \varphi (i.e., \varphi itself) must also hold with probability tending to 1. And we’re done!

\square

If you don’t like model theory, there is another “purely combinatorial” proof of the zero-one law using something called Ehrenfeucht–Fraïssé games. It is a bit longer, though.

Other zero-one laws

One might naturally ask two questions: what if your probability is not constant, and what other kinds of properties have zero-one laws? Both great questions.

For the first, there are some extra theorems. I’ll just describe one that has always seemed very strange to me. If your probability is of the form p = n^{-\alpha} but \alpha is irrational, then the zero-one law still holds! This is a theorem of Baldwin-Shelah-Spencer, and it really makes you wonder why irrational numbers would be so well behaved while rational numbers are not :)

For the second question, there is another theorem about monotone properties of graphs. Monotone properties come in two flavors, so called “increasing” and “decreasing.” I’ll describe increasing monotone properties and the decreasing counterpart should be obvious. A property is called monotone increasing if adding edges can never destroy the property. That is, with an empty graph you don’t have the property (or maybe you do), and as you start adding edges eventually you suddenly get the property, but then adding more edges can’t cause you to lose the property again. Good examples of this include connectivity, or the existence of a triangle.

So the theorem is that there is an identical zero-one law for monotone properties. Great!

It’s not so often that you get to see these neat applications of logic and model theory to graph theory and (by extension) computer science. But when you do get to apply them they seem very powerful and mysterious. I think it’s a good thing.

Until next time!

The Quantum Bit

The best place to start our journey through quantum computing is to recall how classical computing works and try to extend it. Since our final quantum computing model will be a circuit model, we should informally discuss circuits first.

A circuit has three parts: the “inputs,” which are bits (either zero or one); the “gates,” which represent the lowest-level computations we perform on bits; and the “wires,” which connect the outputs of gates to the inputs of other gates. Typically the gates have one or two input bits and one output bit, and they correspond to some logical operation like AND, NOT, or XOR.


A simple example of a circuit. The V’s are “OR” and the Λ’s are “AND.” Image source: Ryan O’Donnell

If we want to come up with a different model of computing, we could start with regular circuits and generalize some or all of these pieces. Indeed, in our motivational post we saw a glimpse of a probabilistic model of computation, where instead of the inputs being bits they were probabilities in a probability distribution, and instead of the gates being simple boolean functions they were linear maps that preserved probability distributions (we called such a matrix “stochastic”).

Rather than go through that whole train of thought again let’s just jump into the definitions for the quantum setting. In case you missed last time, our goal is to avoid as much physics as possible and frame everything purely in terms of linear algebra.

Qubits are Unit Vectors

The generalization of a bit is simple: it’s a unit vector in \mathbb{C}^2. That is, our most atomic unit of data is a vector (a,b) with the constraints that a,b are complex numbers and |a|^2 + |b|^2 = 1. We call such a vector a qubit.

A qubit can assume “binary” values much like a regular bit, because you could pick two distinguished unit vectors, like (1,0) and (0,1), and call one “zero” and the other “one.” Obviously there are many more possible unit vectors, such as \frac{1}{\sqrt{2}}(1, 1) and (-i,0). But before we go romping about with what qubits can do, we need to understand how we can extract information from a qubit. The definitions we make here will motivate a lot of the rest of what we do, and are in my opinion one of the major hurdles to becoming comfortable with quantum computing.

A bittersweet fact of life is that bits are comforting. They can be zero or one, you can create them and change them and read them whenever you want without an existential crisis. The same is not true of qubits. This is a large part of what makes quantum computing so weird: you can’t just read the information in a qubit! Before we say why, notice that the coefficients in a qubit are complex numbers, so being able to read them exactly would potentially encode an infinite amount of information (in the infinite binary expansion)! Not only would this be an undesirably powerful property of a circuit, but physicists’ experiments tell us it’s not possible either.

So as we’ll see when we get to some algorithms, the main difficulty in getting useful quantum algorithms is not necessarily figuring out how to compute what you want to compute, it’s figuring out how to tease useful information out of the qubits that otherwise directly contain what you want. And the reason it’s so hard is that when you read a qubit, most of the information in the qubit is destroyed. And what you get to see is only a small piece of the information available. Here is the simplest example of that phenomenon, which is called the measurement in the computational basis.

Definition: Let v = (a,b) \in \mathbb{C}^2 be a qubit. Call the standard basis vectors e_0 = (1,0), e_1 = (0,1) the computational basis of \mathbb{C}^2. The process of measuring v in the computational basis consists of two parts.

  1. You observe (get as output) a random choice of e_0 or e_1. The probability of getting e_0 is |a|^2, and the probability of getting e_1 is |b|^2.
  2. As a side effect, the qubit v instantaneously becomes whatever state was observed in 1. This is often called a collapse of the wave function by physicists.

There are more sophisticated ways to measure, and more sophisticated ways to express the process of measurement, but we’ll cover those when we need them. For now this is it.

Why is this so painful? Because if you wanted to try to estimate the probabilities |a|^2 or |b|^2, not only would you get an estimate at best, but you’d have to repeat whatever computation prepared v for measurement over and over again until you get an estimate you’re satisfied with. In fact, we’ll see situations like this, where we actually have a perfect representation of the data we need to solve our problem, but we just can’t get at it because the measurement process destroys it once we measure.
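Here is a classical simulation sketch of measurement in the computational basis, along with the “prepare and measure over and over” estimate just described. The names measure, estimate_p0, and prepare are hypothetical, and a pseudorandom generator stands in for the randomness of a real measurement.

```python
import random

def measure(qubit, rng=random):
    """Measure (a, b) in the computational basis.
    Returns (outcome, collapsed_state), where outcome is 0 or 1."""
    a, b = qubit
    outcome = 0 if rng.random() < abs(a) ** 2 else 1
    # The state collapses to the basis vector that was observed.
    collapsed = (1, 0) if outcome == 0 else (0, 1)
    return outcome, collapsed

def estimate_p0(prepare, shots, rng=random):
    """Estimate |a|^2 by re-preparing the qubit and measuring `shots` times."""
    zeros = sum(measure(prepare(), rng)[0] == 0 for _ in range(shots))
    return zeros / shots

rng = random.Random(0)
plus = lambda: (2 ** -0.5, 2 ** -0.5)   # the qubit (1/sqrt(2))(1, 1)
print(estimate_p0(plus, 10_000, rng))    # an estimate near 1/2, never exact knowledge
```

Notice that measure never hands back a or b themselves. All you ever see is a single bit per preparation.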

Before we can talk about those algorithms we need to see how we’re allowed to manipulate qubits. As we said before, we use unitary matrices to preserve unit vectors, so let’s recall those and make everything more precise.

Qubit Mappings are Unitary Matrices

Suppose v = (a,b) \in \mathbb{C}^2 is a qubit. If we are to have any mapping between vector spaces, it had better be a linear map, and the linear maps that send unit vectors to unit vectors are called unitary matrices. An equivalent definition that seems a bit stronger is:

Definition: A linear map \mathbb{C}^2 \to \mathbb{C}^2 is called unitary if it preserves the inner product on \mathbb{C}^2.

Let’s remember the inner product on \mathbb{C}^n is defined by \left \langle v,w \right \rangle = \sum_{i=1}^n v_i \overline{w_i} and has some useful properties.

  • The square norm of a vector is \left \| v \right \|^2 = \left \langle v,v \right \rangle.
  • Swapping the coordinates of the complex inner product conjugates the result: \left \langle v,w \right \rangle = \overline{\left \langle w,v \right \rangle}
  • The complex inner product is a linear map if you fix the second coordinate, and a conjugate-linear map if you fix the first. That is, \left \langle au+v, w \right \rangle = a \left \langle u, w \right \rangle + \left \langle v, w \right \rangle and \left \langle u, aw + v \right \rangle = \overline{a} \left \langle u, w \right \rangle + \left \langle u,v \right \rangle

By the first bullet, it makes sense to require unitary matrices to preserve the inner product instead of just the norm, though the two are equivalent (see the derivation on page 2 of these notes). We can obviously generalize unitary matrices to any complex vector space, and unitary matrices have some nice properties. In particular, if U is a unitary matrix then the important property is that the columns (and rows) of U form an orthonormal basis. As an immediate result, if we take the product U\overline{U}^\text{T}, which is just the matrix of all possible inner products of rows of U, we get the identity matrix. This means that unitary matrices are invertible and their inverse is \overline{U}^\text{T}.
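These properties are easy to check numerically. The sketch below uses plain Python lists and complex numbers rather than a linear algebra library; the helper names are mine, and the example matrix S is just an arbitrary unitary matrix chosen for the demonstration.

```python
def conj_transpose(U):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[U[j][i].conjugate() for j in range(len(U))] for i in range(len(U[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def is_unitary(U, tol=1e-12):
    """Check that U times its conjugate transpose is the identity."""
    P = matmul(U, conj_transpose(U))
    n = len(U)
    return all(abs(P[i][j] - (1 if i == j else 0)) <= tol
               for i in range(n) for j in range(n))

def inner(v, w):
    """<v, w> = sum_i v_i * conj(w_i), linear in the first argument."""
    return sum(x * complex(y).conjugate() for x, y in zip(v, w))

def apply(U, v):
    return [sum(U[i][k] * v[k] for k in range(len(v))) for i in range(len(U))]

S = [[1, 0], [0, 1j]]
v, w = [1, 1j], [2, 1 - 1j]
print(is_unitary(S), inner(v, w), inner(apply(S, v), apply(S, w)))
```

The last two printed values agree, which is the inner product being preserved. A matrix like [[1, 1], [0, 1]] fails the is_unitary check.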

Already we have one interesting philosophical tidbit. Any unitary transformation of a qubit is reversible because all unitary matrices are invertible. Apparently the only non-reversible thing we’ve seen so far is measurement.

Recall that \overline{U}^\text{T} is the conjugate transpose of the matrix, which I’ll often write as U^*. Note that there is a way to define U^* without appealing to matrices: it is a notion called the adjoint, which is that linear map U^* such that \left \langle Uv, w \right \rangle = \left \langle v, U^*w \right \rangle for all v,w. Also recall that “unitary matrix” for complex vector spaces means precisely the same thing as “orthogonal matrix” does for real numbers. The only difference is the inner product being used (indeed, if the complex matrix happens to have real entries, then orthogonal matrix and unitary matrix mean the same thing).

Definition: A single qubit gate is a unitary matrix \mathbb{C}^2 \to \mathbb{C}^2.

So enough with the properties and definitions, let’s see some examples. For all of these examples we’ll fix the basis to the computational basis e_0, e_1. One very important, but still very simple example of a single qubit gate is the Hadamard gate. This is the unitary map given by the matrix

\displaystyle \frac{1}{\sqrt{2}}\begin{pmatrix}  1 & 1 \\  1 & -1  \end{pmatrix}

It’s so important because if you apply it to a basis vector, say, e_0 = (1,0), you get a uniform linear combination \frac{1}{\sqrt{2}}(e_0 + e_1). One simple use of this is to allow for unbiased coin flips, and as readers of this blog know unbiased coins can efficiently simulate biased coins. But it has many other uses we’ll touch on as they come.

Just to give another example, the quantum NOT gate, often called a Pauli X gate, is the following matrix

\displaystyle \begin{pmatrix}  0 & 1 \\  1 & 0  \end{pmatrix}

It’s called this because, if we consider e_0 to be the “zero” bit and e_1 to be “one,” then this mapping swaps the two. In general, it takes (a,b) to (b,a).
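Both gates are small enough to try by hand, but here is a sketch that applies them to qubits represented as pairs (a, b). The helper apply is my own name for plain 2-by-2 matrix-vector multiplication.

```python
import math

# The Hadamard gate and the Pauli X gate as 2x2 matrices (lists of rows).
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]
X = [[0, 1],
     [1, 0]]

def apply(gate, qubit):
    """Multiply a 2x2 gate by a qubit (a, b)."""
    a, b = qubit
    return (gate[0][0] * a + gate[0][1] * b,
            gate[1][0] * a + gate[1][1] * b)

e0, e1 = (1, 0), (0, 1)
print(apply(H, e0))            # equal amplitudes on e_0 and e_1
print(apply(X, e0))            # (0, 1), i.e. e_1
print(apply(H, apply(H, e0)))  # approximately e_0, since H is its own inverse
```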

As the reader can probably imagine by the suggestive comparison with classical operations, quantum circuits can do everything that classical circuits can do. We’ll save the proof for a future post, but if we want to do some kind of “quantum AND” operation, we get an obvious question. How do you perform an operation that involves multiple qubits? The short answer is: you represent a collection of qubits by their tensor product, and apply a unitary matrix to that tensor.

We’ll go into more detail on this next time, and in the meantime we suggest checking out this blog’s primer on the tensor product. Until then!

A Motivation for Quantum Computing

Quantum mechanics is one of the leading scientific theories describing the rules that govern the universe. Its discovery and formulation were among the most important revolutions in the history of mankind, contributing in no small part to the invention of the transistor and the laser.

Here at Math ∩ Programming we don’t put too much emphasis on physics or engineering, so it might seem curious to study quantum physics. But as the reader is likely aware, quantum mechanics forms the basis of one of the most interesting models of computing since the Turing machine: the quantum circuit. My goal with this series is to elucidate the algorithmic insights in quantum algorithms, and explain the mathematical formalisms while minimizing the amount of “interpreting” and “debating” and “experimenting” that dominates so much of the discourse by physicists.

Indeed, the more I learn about quantum computing the more it’s become clear that the shroud of mystery surrounding quantum topics has a lot to do with their presentation. The people teaching quantum (writing the textbooks, giving the lectures, writing the Wikipedia pages) are almost all purely physicists, and they almost unanimously follow the same path of teaching it.

Scott Aaronson (one of the few people who explains quantum in a way I understand) describes the situation superbly.

There are two ways to teach quantum mechanics. The first way – which for most physicists today is still the only way – follows the historical order in which the ideas were discovered. So, you start with classical mechanics and electrodynamics, solving lots of grueling differential equations at every step. Then, you learn about the “blackbody paradox” and various strange experimental results, and the great crisis that these things posed for physics. Next, you learn a complicated patchwork of ideas that physicists invented between 1900 and 1926 to try to make the crisis go away. Then, if you’re lucky, after years of study, you finally get around to the central conceptual point: that nature is described not by probabilities (which are always nonnegative), but by numbers called amplitudes that can be positive, negative, or even complex.

The second way to teach quantum mechanics eschews a blow-by-blow account of its discovery, and instead starts directly from the conceptual core – namely, a certain generalization of the laws of probability to allow minus signs (and more generally, complex numbers). Once you understand that core, you can then sprinkle in physics to taste, and calculate the spectrum of whatever atom you want.

Indeed, the sequence of experiments and debate has historical value. But the mathematics needed to have a basic understanding of quantum mechanics is quite simple, and it is often blurred by physicists in favor of discussing interpretations. To start thinking about quantum mechanics you only need a healthy dose of linear algebra, and most of it we’ve covered in the three linear algebra primers on this blog. More importantly for computing-minded folks, one only needs a basic understanding of quantum mechanics to understand quantum computing.

The position I want to assume on this blog is that we don’t care about whether quantum mechanics is an accurate description of the real world. The real world gave an invaluable inspiration, but at the end of the day the mathematics stands on its own merits. The really interesting question to me is how the quantum computing model compares to classical computing. Most people believe it is strictly stronger in terms of efficiency. And so the murky depths of the quantum swamp must be hiding some fascinating algorithmic ideas. I want to understand those ideas, and explain them up to my own standards of mathematical rigor and lucidity.

So let’s begin this process with a discussion of an experiment that motivates most of the ideas we’ll need for quantum computing. Hopefully this will be the last experiment we discuss.

Shooting Photons and The Question of Randomness

Does the world around us have inherent randomness in it? This is a deep question open to a lot of philosophical debate, but what evidence do we have that there is randomness?

Here’s the experiment. You set up a contraption that shoots photons in a straight line, aimed at what’s called a “beam splitter.” A beam splitter seems to have the property that when photons are shot at it, they will either be reflected at a 90 degree angle or stay in a straight line, each with probability 1/2. Indeed, if you put little photon receptors at the end of each possible route (straight or up, as below) to measure the number of photons that end at each receptor, you’ll find that on average half of the photons went up and half went straight.

[Figure: photon-experiment]

The triangle is the photon shooter, and the camera-looking things are receptors.

If you accept that the photon shooter is sufficiently good and the beam splitter is not tricking us somehow, then this is evidence that the universe has some inherent randomness in it! Moreover, the probability that a photon goes up or straight seems to be independent of what other photons do, so this is evidence that whatever randomness we’re seeing follows the classical laws of probability. Now let’s augment the experiment as follows. First, put two beam splitters on the corners of a square, and mirrors at the other two corners, as below.

The thicker black lines are mirrors which always reflect the photons.


This is where things get really weird. If you assume that the beam splitter splits photons randomly (as in, according to an independent coin flip), then after the first beam splitter half go up and half go straight, and the same thing would happen after the second beam splitter. So each of the two receptors should measure half the total number of photons on average.

But that’s not what happens. Rather, all the photons go to the top receptor! Somehow the “probability” that the photon goes straight or up in the first beam splitter is connected to the probability that it goes straight or up in the second. This seems to be a counterexample to the claim that the universe behaves on the principles of independent probability. Obviously there is some deeper mystery at work.


Complex Probabilities

One interesting explanation is that the beam splitter modifies something intrinsic to the photon, something that carries with it until the next beam splitter. You can imagine the photon is carrying information as it shambles along, but regardless of the interpretation it can’t follow the laws of classical probability. The classical probability explanation would go something like this:

There are two states, RIGHT and UP, and we model the state of a photon by a probability distribution (p, q) such that the photon has a probability p of being in state RIGHT and a probability q of being in state UP, and like any probability distribution p + q = 1. A photon hence starts in state (1,0), and the process of traveling through the beam splitter is the random choice to switch states. This is modeled by multiplication by a particular so-called stochastic matrix (which just means the entries are nonnegative and the rows sum to 1)

\displaystyle A = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}

Of course, we chose this matrix because when we apply it to (1,0) and (0,1) we get (1/2, 1/2) for both outcomes. By doing the algebra, applying it twice to (1,0) will give the state (1/2, 1/2), and so the chance of ending up in the top receptor is the same as for the right receptor.
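To make the algebra concrete, here is a minimal Python sketch of the classical model. The helper name `apply` and the use of exact fractions are my own choices for illustration. Since this particular matrix is symmetric, it makes no difference whether the state is treated as a row or a column vector.

```python
from fractions import Fraction

def apply(matrix, state):
    """Multiply a 2x2 matrix by a length-2 state vector."""
    return tuple(sum(row[j] * state[j] for j in range(2)) for row in matrix)

half = Fraction(1, 2)
A = ((half, half), (half, half))    # the stochastic beam splitter matrix

start = (Fraction(1), Fraction(0))  # the photon starts in state RIGHT
after_one = apply(A, start)         # state after the first beam splitter
after_two = apply(A, after_one)     # state after the second beam splitter

# Both states are the uniform distribution (1/2, 1/2).
print(after_one)
print(after_two)
```

Under this model the second beam splitter changes nothing: the distribution is uniform after one pass and stays uniform after two.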

But as we already know this isn’t what happens in real life, so something is amiss. Here is an alternative explanation that gives a nice preview of quantum mechanics.

The idea is that, rather than have the state of the traveling photon be a probability distribution over RIGHT and UP, we have it be a unit vector in a vector space (over \mathbb{C}). That is, now RIGHT and UP are the (basis) unit vectors e_1 = (1,0), e_2 = (0,1), respectively, and a state x is a linear combination c_1 e_1 + c_2 e_2, where we require \left \| x \right \|^2 = |c_1|^2 + |c_2|^2 = 1. And now the “probability” that the photon is in the RIGHT state is the squared magnitude of the coefficient for that basis vector, p_{\text{right}} = |c_1|^2. Likewise, the probability of being in the UP state is p_{\text{up}} = |c_2|^2.

This might seem like an innocuous modification (even a pointless one!), but changing the sum (or 1-norm) to the Euclidean sum-of-squares (or the 2-norm) is at the heart of why quantum mechanics is so different. Now rather than have stochastic matrices for state transitions, which are defined the way they are because they preserve probability distributions, we use unitary matrices, which are those complex-valued matrices that preserve the 2-norm. In both cases, we want “valid states” to be transformed into “valid states,” but we just change precisely what we mean by a state, and pick the transformations that preserve that.
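Here is a small sketch of the difference between the two notions of “valid state.” The 45 degree rotation matrix is my own choice of a simple real-valued unitary matrix, picked only for illustration, and the helper names are hypothetical.

```python
import math

def apply(matrix, state):
    """Multiply a 2x2 matrix by a length-2 state vector."""
    return tuple(sum(row[j] * state[j] for j in range(2)) for row in matrix)

def one_norm(v):
    return sum(abs(c) for c in v)

def two_norm(v):
    return math.sqrt(sum(abs(c) ** 2 for c in v))

stochastic = ((0.5, 0.5), (0.5, 0.5))  # rows sum to 1

s = 1 / math.sqrt(2)
rotation = ((s, -s), (s, s))  # rotation by 45 degrees (an assumed example)

x = (1.0, 0.0)  # valid under both norms
p = apply(stochastic, x)
r = apply(rotation, x)

print(one_norm(p), two_norm(p))  # the 1-norm stays 1, the 2-norm does not
print(one_norm(r), two_norm(r))  # the 2-norm stays 1, the 1-norm does not
```

Each kind of matrix keeps its own kind of state valid and, in general, breaks the other kind.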

In fact, as we’ll see later in this series, using complex numbers is totally unnecessary. Everything that can be done with complex numbers can be done without them (up to a good enough approximation for computing), but using complex numbers just happens to make things more elegant mathematically. It’s the kind of situation where there are more and better theorems in linear algebra about complex-valued matrices than real-valued matrices.

But back to our experiment. Now we can hypothesize that the beam splitter corresponds to the following transformation of states:

\displaystyle A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & i \\ i & 1 \end{pmatrix}

We’ll talk a lot more about unitary matrices later, so for now the reader can rest assured that this is one. And then how does it transform the initial state x =(1,0)?

\displaystyle y = Ax = \frac{1}{\sqrt{2}}(1, i)

So at this stage the probability of being in the RIGHT state is 1/2 = (1/\sqrt{2})^2 and the probability of being in state UP is also 1/2 = |i/\sqrt{2}|^2. So far it matches the first experiment. Applying A again,

\displaystyle Ay = A^2x = \frac{1}{2}(0, 2i) = (0, i)

And the photon is in state UP with probability 1. Stunning. This time Science is impressed by mathematics.
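For readers who want to check the arithmetic, here is a minimal Python sketch of the two-beam-splitter computation using the built-in complex type. The helper names `apply` and `probabilities` are hypothetical, and the unitarity check is done by brute force rather than by any theorem.

```python
import math

def apply(matrix, state):
    """Multiply a 2x2 matrix by a length-2 state vector."""
    return tuple(sum(row[j] * state[j] for j in range(2)) for row in matrix)

def probabilities(state):
    """Squared magnitudes of the amplitudes: (p_right, p_up)."""
    return tuple(abs(c) ** 2 for c in state)

s = 1 / math.sqrt(2)
A = ((complex(s, 0), complex(0, s)),
     (complex(0, s), complex(s, 0)))  # the beam splitter matrix

# Unitarity check: A times its conjugate transpose should be the identity.
A_dagger = tuple(tuple(A[j][i].conjugate() for j in range(2)) for i in range(2))
product = tuple(
    tuple(sum(A[i][k] * A_dagger[k][j] for k in range(2)) for j in range(2))
    for i in range(2)
)

x = (complex(1, 0), complex(0, 0))  # the photon starts in state RIGHT
y = apply(A, x)                     # after the first beam splitter
z = apply(A, y)                     # after the second beam splitter

print(probabilities(y))  # roughly (0.5, 0.5)
print(probabilities(z))  # roughly (0, 1)
```

The RIGHT amplitude after the second beam splitter is 1/2 + i \cdot i/2 = 1/2 - 1/2 = 0. The two paths into the right receptor cancel, which nonnegative probabilities could never do.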

Next time we’ll continue this train of thought by generalizing the situation to the appropriate mathematical setting. Then we’ll dive into the quantum circuit model, and start churning out some algorithms.

Until then!