Zero Knowledge Proofs — A Primer

In this post we’ll get a strong taste for zero knowledge proofs by exploring the graph isomorphism problem in detail. In the next post, we’ll see how this relates to cryptography and the bigger picture. The goal of this post is to get a strong understanding of the terms “prover,” “verifier,” and “simulator,” and “zero knowledge” in the context of a specific zero-knowledge proof. Then next time we’ll see how the same concepts (though not the same proof) generalizes to a cryptographically interesting setting.

Graph isomorphism

Let’s start with an extended example. We are given two graphs $ G_1, G_2$, and we’d like to know whether they’re isomorphic, meaning they’re the same graph, but “drawn” different ways.

The problem of telling if two graphs are isomorphic seems hard. The pictures above, which are all different drawings of the same graph (or are they?), should give you pause if you thought it was easy.

To add a tiny bit of formalism, a graph $ G$ is a list of edges, and each edge $ (u,v)$ is a pair of integers between 1 and the total number of vertices of the graph, say $ n$. Using this representation, an isomorphism between $ G_1$ and $ G_2$ is a permutation $ \pi$ of the numbers $ \{1, 2, \dots, n \}$ with the property that $ (i,j)$ is an edge in $ G_1$ if and only if $ (\pi(i), \pi(j))$ is an edge of $ G_2$. You swap around the labels on the vertices, and that’s how you get from one graph to another isomorphic one.

Given two arbitrary graphs as input on a large number of vertices $ n$, nobody knows of an efficient—i.e., polynomial time in $ n$—algorithm that can always decide whether the input graphs are isomorphic. Even if you promise me that the inputs are isomorphic, nobody knows of an algorithm that could construct an isomorphism. (If you think about it, such an algorithm could be used to solve the decision problem!)

A game

Now let’s play a game. In this game, we’re given two enormous graphs on a billion nodes. I claim they’re isomorphic, and I want to prove it to you. However, my life’s fortune is locked behind these particular graphs (somehow), and if you actually had an isomorphism between these two graphs you could use it to steal all my money. But I still want to convince you that I do, in fact, own all of this money, because we’re about to start a business and you need to know I’m not broke.

Is there a way for me to convince you beyond a reasonable doubt that these two graphs are indeed isomorphic? And moreover, could I do so without you gaining access to my secret isomorphism? It would be even better if I could guarantee you learn nothing about my isomorphism or any isomorphism, because even the slightest chance that you can steal my money is out of the question.

Zero knowledge proofs have exactly those properties, and here’s a zero knowledge proof for graph isomorphism. For the record, $ G_1$ and $ G_2$ are public knowledge, (common inputs to our protocol for the sake of tracking runtime), and the protocol itself is common knowledge. However, I have an isomorphism $ f: G_1 \to G_2$ that you don’t know.

Step 1: I will start by picking one of my two graphs, say $ G_1$, mixing up the vertices, and sending you the resulting graph. In other words, I send you a graph $ H$ which is chosen uniformly at random from all isomorphic copies of $ G_1$. I will save the permutation $ \pi$ that I used to generate $ H$ for later use.

Step 2: You receive a graph $ H$ which you save for later, and then you randomly pick an integer $ t$ which is either 1 or 2, with equal probability on each. The number $ t$ corresponds to your challenge for me to prove $ H$ is isomorphic to $ G_1$ or $ G_2$. You send me back $ t$, with the expectation that I will provide you with an isomorphism between $ H$ and $ G_t$.

Step 3: Indeed, I faithfully provide you such an isomorphism. If I you send me $ t=1$, I’ll give you back $ \pi^{-1} : H \to G_1$, and otherwise I’ll give you back $ f \circ \pi^{-1}: H \to G_2$. Because composing a fixed permutation with a uniformly random permutation is again a uniformly random permutation, in either case I’m sending you a uniformly random permutation.

Step 4: You receive a permutation $ g$, and you can use it to verify that $ H$ is isomorphic to $ G_t$. If the permutation I sent you doesn’t work, you’ll reject my claim, and if it does, you’ll accept my claim.

Before we analyze, here’s some Python code that implements the above scheme. You can find the full, working example in a repository on this blog’s Github page.

First, a few helper functions for generating random permutations (and turning their list-of-zero-based-indices form into a function-of-positive-integers form)

import random

def randomPermutation(n):
    L = list(range(n))
    random.shuffle(L)
    return L

def makePermutationFunction(L):
    return lambda i: L[i - 1] + 1

def makeInversePermutationFunction(L):
    return lambda i: 1 + L.index(i - 1)

def applyIsomorphism(G, f):
    return [(f(i), f(j)) for (i, j) in G]

Here’s a class for the Prover, the one who knows the isomorphism and wants to prove it while keeping the isomorphism secret:

class Prover(object):
    def __init__(self, G1, G2, isomorphism):
        '''
            isomomorphism is a list of integers representing
            an isomoprhism from G1 to G2.
        '''
        self.G1 = G1
        self.G2 = G2
        self.n = numVertices(G1)
        assert self.n == numVertices(G2)

        self.isomorphism = isomorphism
        self.state = None

    def sendIsomorphicCopy(self):
        isomorphism = randomPermutation(self.n)
        pi = makePermutationFunction(isomorphism)

        H = applyIsomorphism(self.G1, pi)

        self.state = isomorphism
        return H

    def proveIsomorphicTo(self, graphChoice):
        randomIsomorphism = self.state
        piInverse = makeInversePermutationFunction(randomIsomorphism)

        if graphChoice == 1:
            return piInverse
        else:
            f = makePermutationFunction(self.isomorphism)
            return lambda i: f(piInverse(i))

The prover has two methods, one for each round of the protocol. The first creates an isomorphic copy of $ G_1$, and the second receives the challenge and produces the requested isomorphism.

And here’s the corresponding class for the verifier

class Verifier(object):
    def __init__(self, G1, G2):
        self.G1 = G1
        self.G2 = G2
        self.n = numVertices(G1)
        assert self.n == numVertices(G2)

    def chooseGraph(self, H):
        choice = random.choice([1, 2])
        self.state = H, choice
        return choice

    def accepts(self, isomorphism):
        '''
            Return True if and only if the given isomorphism
            is a valid isomorphism between the randomly
            chosen graph in the first step, and the H presented
            by the Prover.
        '''
        H, choice = self.state
        graphToCheck = [self.G1, self.G2][choice - 1]
        f = isomorphism

        isValidIsomorphism = (graphToCheck == applyIsomorphism(H, f))
        return isValidIsomorphism

Then the protocol is as follows:

def runProtocol(G1, G2, isomorphism):
    p = Prover(G1, G2, isomorphism)
    v = Verifier(G1, G2)

    H = p.sendIsomorphicCopy()
    choice = v.chooseGraph(H)
    witnessIsomorphism = p.proveIsomorphicTo(choice)

    return v.accepts(witnessIsomorphism)

Analysis: Let’s suppose for a moment that everyone is honestly following the rules, and that $ G_1, G_2$ are truly isomorphic. Then you’ll always accept my claim, because I can always provide you with an isomorphism. Now let’s suppose that, actually I’m lying, the two graphs aren’t isomorphic, and I’m trying to fool you into thinking they are. What’s the probability that you’ll rightfully reject my claim?

Well, regardless of what I do, I’m sending you a graph $ H$ and you get to make a random choice of $ t = 1, 2$ that I can’t control. If $ H$ is only actually isomorphic to either $ G_1$ or $ G_2$ but not both, then so long as you make your choice uniformly at random, half of the time I won’t be able to produce a valid isomorphism and you’ll reject. And unless you can actually tell which graph $ H$ is isomorphic to—an open problem, but let’s say you can’t—then probability 1/2 is the best you can do.

Maybe the probability 1/2 is a bit unsatisfying, but remember that we can amplify this probability by repeating the protocol over and over again. So if you want to be sure I didn’t cheat and get lucky to within a probability of one-in-one-trillion, you only need to repeat the protocol 30 times. To be surer than the chance of picking a specific atom at random from all atoms in the universe, only about 400 times.

If you want to feel small, think of the number of atoms in the universe. If you want to feel big, think of its logarithm.

Here’s the code that repeats the protocol for assurance.

def convinceBeyondDoubt(G1, G2, isomorphism, errorTolerance=1e-20):
    probabilityFooled = 1

    while probabilityFooled > errorTolerance:
        result = runProtocol(G1, G2, isomorphism)
        assert result
        probabilityFooled *= 0.5
        print(probabilityFooled)

Running it, we see it succeeds

$ python graph-isomorphism.py
0.5
0.25
0.125
0.0625
0.03125
 ...
<SNIP>
 ...
1.3552527156068805e-20
6.776263578034403e-21

So it’s clear that this protocol is convincing.

But how can we be sure that there’s no leakage of knowledge in the protocol? What does “leakage” even mean? That’s where this topic is the most difficult to nail down rigorously, in part because there are at least three a priori different definitions! The idea we want to capture is that anything that you can efficiently compute after the protocol finishes (i.e., you have the content of the messages sent to you by the prover) you could have computed efficiently given only the two graphs $ G_1, G_2$, and the claim that they are isomorphic.

Another way to say it is that you may go through the verification process and feel happy and confident that the two graphs are isomorphic. But because it’s a zero-knowledge proof, you can’t do anything with that information more than you could have done if you just took the assertion on blind faith. I’m confident there’s a joke about religion lurking here somewhere, but I’ll just trust it’s funny and move on.

In the next post we’ll expand on this “leakage” notion, but before we get there it should be clear that the graph isomorphism protocol will have the strongest possible “no-leakage” property we can come up with. Indeed, in the first round the prover sends a uniform random isomorphic copy of $ G_1$ to the verifier, but the verifier can compute such an isomorphism already without the help of the prover. The verifier can’t necessarily find the isomorphism that the prover used in retrospect, because the verifier can’t solve graph isomorphism. Instead, the point is that the probability space of “$ G_1$ paired with an $ H$ made by the prover” and the probability space of “$ G_1$ paired with $ H$ as made by the verifier” are equal. No information was leaked by the prover.

For the second round, again the permutation $ \pi$ used by the prover to generate $ H$ is uniformly random. Since composing a fixed permutation with a uniform random permutation also results in a uniform random permutation, the second message sent by the prover is uniformly random, and so again the verifier could have constructed a similarly random permutation alone.

Let’s make this explicit with a small program. We have the honest protocol from before, but now I’m returning the set of messages sent by the prover, which the verifier can use for additional computation.

def messagesFromProtocol(G1, G2, isomorphism):
    p = Prover(G1, G2, isomorphism)
    v = Verifier(G1, G2)

    H = p.sendIsomorphicCopy()
    choice = v.chooseGraph(H)
    witnessIsomorphism = p.proveIsomorphicTo(choice)

    return [H, choice, witnessIsomorphism]

To say that the protocol is zero-knowledge (again, this is still colloquial) is to say that anything that the verifier could compute, given as input the return value of this function along with $ G_1, G_2$ and the claim that they’re isomorphic, the verifier could also compute given only $ G_1, G_2$ and the claim that $ G_1, G_2$ are isomorphic.

It’s easy to prove this, and we’ll do so with a python function called simulateProtocol.

def simulateProtocol(G1, G2):
    # Construct data drawn from the same distribution as what is
    # returned by messagesFromProtocol
    choice = random.choice([1, 2])
    G = [G1, G2][choice - 1]
    n = numVertices(G)

    isomorphism = randomPermutation(n)
    pi = makePermutationFunction(isomorphism)
    H = applyIsomorphism(G, pi)

    return H, choice, pi

The claim is that the distribution of outputs to messagesFromProtocol and simulateProtocol are equal. But simulateProtocol will work regardless of whether $ G_1, G_2$ are isomorphic. Of course, it’s not convincing to the verifier because the simulating function made the choices in the wrong order, choosing the graph index before making $ H$. But the distribution that results is the same either way.

So if you were to use the actual Prover/Verifier protocol outputs as input to another algorithm (say, one which tries to compute an isomorphism of $ G_1 \to G_2$), you might as well use the output of your simulator instead. You’d have no information beyond hard-coding the assumption that $ G_1, G_2$ are isomorphic into your program. Which, as I mentioned earlier, is no help at all.

In this post we covered one detailed example of a zero-knowledge proof. Next time we’ll broaden our view and see the more general power of zero-knowledge (that it captures all of NP), and see some specific cryptographic applications. Keep in mind the preceding discussion, because we’re going to re-use the terms “prover,” “verifier,” and “simulator” to mean roughly the same things as the classes Prover, Verifier and the function simulateProtocol.

Until then!

Tensorphobia and the Outer Product

Variations on a theme

Back in 2014 I wrote a post called How to Conquer Tensorphobia that should end up on Math $ \cap$ Programming’s “greatest hits” album. One aspect of tensors I neglected to discuss was the connection between the modern views of tensors and the practical views of linear algebra. I feel I need to write this because every year or two I forget why it makes sense.

The basic question is:

What the hell is going on with the outer product of vectors?

The simple answer, the one that has never satisfied me, is that the outer product of $ v,w$ is the matrix $ vw^T$ whose $ i,j$ entry is the product $ v_i w_j$. This doesn’t satisfy me because it’s an explanation by fiat. It lacks motivation and you’re supposed to trust (or verify) things magically work out. To me this definition is like the definition of matrix multiplication: having it dictated to you before you understand why it makes sense is a cop out. Math isn’t magic, it needs to make perfect sense.

The answer I like, and the one I have to re-derive every few years because I never wrote it down, is a little bit longer.

To borrow a programming term, the basic confusion is a type error. We start with two vectors, $ v,w$ in some vector space $ V$ (let’s say everything is finite dimensional), and we magically turn them into a matrix. Let me reiterate for dramatic effect: we start with vectors, which I have always thought of as objects, things you can stretch, rotate, or give as a gift to your cousin Mauricio. While a matrix is a mapping, a thing that takes vectors as input and spits out new vectors. Sure, you can play with the mappings, feed them kibble and wait for them to poop or whatever. And sure, sometimes vectors are themselves maps, like in the vector space of real-valued functions (where the word “matrix” is a stretch, since it’s infinite dimensional).

But to take two vectors and *poof* get a mapping of all vectors, it’s a big jump. And part of the discomfort is that it feels random. To be happy we have to understand that the construction is natural, or even better canonical, meaning this should be the only way to turn two vectors into a linear map. Then the definition would make sense.

So let’s see how we can do that. Just to be clear, everything we do in this post will be for finite-dimensional vector spaces over $ \mathbb{R}$, but we’ll highlight the caveats when they come up.

Dual vector spaces

The first step is understanding how to associate a vector with a linear map in a “natural” or “canonical” way. There is one obvious candidate: if you give me a vector $ v$, I can make a linear map by taking the dot product of $ v$ with the input. So I can associate

$ \displaystyle v \mapsto \langle v,- \rangle$

The dash is a placeholder for the input. Another way to say it is to define $ \varphi_v(w) = \langle v,w \rangle$ and say the association takes $ v \mapsto \varphi_v$. So this “association,” taking $ v$ to the inner product, is itself  a mapping from $ V$ to “maps from $ V$ to $ \mathbb{R}$.” Note that $ \varphi_v(w)$ is linear in $ v$ and $ w$ because that’s part of the definition of an inner product.

To avoid saying “maps from $ V$ to $ \mathbb{R}$” all the time, we’ll introduce some notation.

Definition: Let $ V$ be a vector space over a field $ k$. The set of $ k$-linear maps from $ V \to k$ is called $ \textup{Hom}(V,k)$. More generally, the set of $ k$-linear maps from $ V$ to another $ k$-vector space $ W$ is called $ \textup{Hom}(V,W)$.

“Hom” stands for “homomorphism,” and in general it just means “maps with the structure I care about.” For this post $ k = \mathbb{R}$, but most of what we say here will be true for any field. If you go deeply into this topic, it matters whether $ k$ is algebraically closed, or has finite characteristic, but for simplicity we’ll ignore all of that. We’ll also ignore the fact that these maps are called linear functionals and this is where the name “functional analysis” comes from. All we really want to do is understand the definition of the outer product.

Another bit of notation for brevity:

Definition: Let $ V$ be a $ \mathbb{R}$-vector space. The dual vector space for $ V$, denoted $ V^*$, is $ \textup{Hom}(V, \mathbb{R})$.

So the “vector-to-inner-product” association we described above is a map $ V \to V^*$. It takes in $ v \in V$ and spits out $ \varphi_v \in V^*$.

Now here’s where things start to get canonical (interesting). First, $ V^*$ is itself a vector space. This is an easy exercise, and the details are not too important for us, but I’ll say the key: if you want to add two functions, you just add their (real number) outputs. In fact we can say more:

Theorem: $ V$ and $ V^*$ are isomorphic as vector spaces, and the map $ v \mapsto \varphi_v$ is the canonical isomorphism.

Confessions of a mathematician: we’re sweeping some complexity under the rug. When we upgraded our vector space to an inner product space, we fixed a specific (but arbitrary) inner product on $ V$. For finite dimensional vector spaces it makes no difference, because every finite-dimensional $ \mathbb{R}$-inner product space is isomorphic to $ \mathbb{R}^n$ with the usual inner product. But the theorem is devastatingly false for infinite-dimensional vector spaces. There are two reasons: (1) there are many (non-canonical) choices of inner products and (2) the mapping for any given inner product need not span $ V^*$. Luckily we’re in finite dimensions so we can ignore all that. [Edit: see Emilio’s comments for a more detailed discussion of what’s being swept under the rug, and how we’re ignoring the categorical perspective when we say “natural” and “canonical.”]

Before we make sense of the isomorphism let’s talk more about $ V^*$. First off, it’s not even entirely obvious that $ V^*$ is finite-dimensional. On one hand, if $ v_1, \dots, v_n$ is a basis of $ V$ then we can quickly prove that $ \varphi_{v_1}, \dots ,\varphi_{v_n}$ are linearly independent in $ V^*$. Indeed, if they weren’t then there’d be some linear combination $ a_1 \varphi_{v_1} + \dots + a_n \varphi_{v_n}$ that is the zero function, meaning that for every vector $ w$, the following is zero

$ \displaystyle a_1 \langle v_1, w \rangle + \dots + a_n \langle v_n, w \rangle = 0.$

But since the inner product is linear in both arguments we get that $ \langle a_1 v_1 + \dots + a_n v_n , w \rangle = 0$ for every $ w$. And this can only happen when $ a_1v_1 + \dots + a_nv_n$ is the zero vector (prove this).

One consequence is that the linear map $ v \mapsto \varphi_v$ is injective. So we can think of $ V$ as “sitting inside” $ V^*$. Now here’s a very slick way to show that the $ \varphi_{v_i}$ span all of $ V^*$. First we can assume our basis $ v_1, \dots, v_n$ is actually an orthonormal basis with respect to our inner product (this is without loss of generality). Then we write any linear map $ f \in V^*$ as

$ f = f(v_1)\varphi_{v_1} + \dots + f(v_n) \varphi_{v_n}$

To show these two are actually equal, it’s enough to show they agree on a basis for $ V$. That is, if you plug in $ v_1$ to the function on the left- and right-hand side of the above, you’ll get the same thing. The orthonormality of the basis makes it work, since all the irrelevant inner products are zero.

In case you missed it, that completes the proof that $ V$ and $ V^*$ are isomorphic. Now when I say that the isomorphism is “canonical,” I mean that if you’re willing to change the basis of $ V$ and $ V^*$, then $ v \mapsto \varphi_v$ is the square identity matrix, i.e. the only isomorphism between any two finite vector spaces (up to a change of basis).

Tying in tensors

At this point we have a connection between single vectors $ v$ and linear maps $ V^*$ whose codomain has dimension 1. If we want to understand the outer product, we need a connection between pairs of vectors and matrices, i.e. $ \textup{Hom}(V,V)$. In other words, we’d like to find a canonical isomorphism between $ V \times V$ and $ \textup{Hom}(V,V)$. But already it’s not possible because the spaces have different dimensions. If $ \textup{dim}(V) = n$ then the former has dimension $ 2n$ and the latter has dimension $ n^2$. So any “natural” relation between these spaces has to be a way to embed $ V \times V \subset \textup{Hom}(V,V)$ as a subspace via some injective map.

There are two gaping problems with this approach. First, the outer product is not linear as a map from $ V \times V \to \textup{Hom}(V,V)$. To see this, take any $ v,w \in V$, pick any scalar $ \lambda \in \mathbb{R}$. Scaling the pair $ (v,w)$ means scaling both components to $ (\lambda v, \lambda w)$, and so the outer product is the matrix $ (\lambda v)(\lambda w^T) = \lambda^2 vw^T$.

The second problem is that the only way to make $ V \times V$ a subspace of $ \textup{Hom}(V,V)$ (up to a change of basis) is to map $ v,w$ to the first two rows of a matrix with zeros elsewhere. This is canonical but it doesn’t have the properties that the outer product promises us. Indeed, the outer product let’s us uniquely decompose a matrix as a “sum of rank 1 matrices,” but we don’t get a unique decomposition of a matrix as a sum of these two-row things. We also don’t even get a well-defined rank by decomposing into a sum of two-row matrices (you can get cancellation by staggering the sum). This injection is decisively useless.

It would seem like we’re stuck, until we think back to our association between $ V$ and $ V^*$. If we take one of our two vectors, say $ v$, and pair it with $ w$, we can ask how $ (\varphi_v, w)$ could be turned into a linear map in $ \textup{Hom}(V,V)$. A few moments of guessing and one easily discovers the map

$ \displaystyle x \mapsto \varphi_v(x) w = \langle v,x \rangle w$

In words, we’re scaling $ w$ by the inner product of $ x$ and $ v$. In geometric terms, we project onto $ v$ and scale $ w$ by the signed length of that projection. Let’s call this map $ \beta_{v,w}(x)$, so that the association maps $ (v,w) \mapsto \beta_{v,w}$. The thought process of “easily discovering” this is to think, “What can you do with a function $ \varphi_v$ and an input $ x$? Plug it in. Then what can you do with the resulting number and a vector $ w$? Scale $ w$.”

If you look closely you’ll see we’ve just defined the outer product. This is because the outer product works by saying $ uv^T$ is a matrix, which acts on a vector $ x$ by doing $ (uv^T)x = u(v^T x)$. But the important thing is that, because $ V$ and $ V^*$ are canonically isomorphic, this is a mapping

$ \displaystyle V \times V = V^* \times V \to \textup{Hom}(V,V)$

Now again, this mapping is not linear. In fact, it’s bilinear, and if there’s one thing we know about bilinear maps, it’s that tensors are their gatekeepers. If you recall our previous post on tensorphobia, this means that this bilinear map “factors through” the tensor product in a canonical way. So the true heart of this association $ (v,w) \mapsto \beta_{v,w}$ is a map $ B: V \otimes V \to \textup{Hom}(V,V)$ defined by

$ \displaystyle B(v \otimes w) = \beta_{v,w}$

And now the punchline,

Theorem: $ B$ is an isomorphism of vector spaces.

Proof. If $ v_1, \dots, v_n$ is a basis for $ V$ then it’s enough to show that $ \beta_{v_i, v_j}$ forms a basis for $ \textup{Hom}(V,V)$. Since we already know $ \dim(\textup{Hom}(V,V)) = n^2$ and there are $ n^2$ of the $ \beta_{v_i, v_j}$, all we need to do is show that the $ \beta$’s are linearly independent. For brevity let me remove the $ v$’s and call $ \beta_{i,j} =\beta_{v_i, v_j}$.

Suppose they are not linearly independent. Then there is some choice of scalars $ a_{i,j}$ so that the linear combination below is the identically zero function

$ \displaystyle \sum_{i,j=1}^n a_{i,j}\beta_{i,j} = 0$

In other words, if I plug in any $ v_i$ from my (orthonormal) basis, the result is zero. So let’s plug in $ v_1$.

$ \displaystyle \begin{aligned} 0 &= \sum_{i,j=1}^n a_{i,j} \beta_{i,j}(v_1) \\ &= \sum_{i,j=1}^n a_{i,j} \langle v_i, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} \langle v_1, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} v_j \end{aligned}$

The orthonormality makes all of the $ \langle v_i ,v_1 \rangle = 0$ when $ i \neq 1$, so we get a linear combination of the $ v_j$ being zero. Since the $ v_i$ form a basis, it must be that all the $ a_{1,j} = 0$. The same thing happens when you plug in $ v_2$ or any other $ v_k$, and so all the $ a_{i,j}$ are zero, proving linear independence.

$ \square$

This theorem immediately implies some deep facts, such as that every matrix can be uniquely decomposed as a sum of the $ \beta_{i,j}$’s. Moreover, facts like the $ \beta_{i,j}$’s being rank 1 are immediate: by definition the maps scale a single vector by some number. So of course the image will be one-dimensional. Finding a useful basis $ \{ v_i \}$ with which to decompose a matrix is where things get truly fascinating, and we’ll see that next time when we study the singular value decomposition.

In the mean time, this understanding generalizes nicely (via induction/recursion) to higher dimensional tensors. And you don’t need to talk about slices or sub-tensors or lose your sanity over $ n$-tuples of indices.

Lastly, all of this duality stuff provides a “coordinate-free” way to think about the transpose of a linear map. We can think of the “transpose” operation as a linear map $ -^T : \textup{Hom}(V,W) \to \textup{Hom}(W^*,V^*)$ which (even in infinite dimensions) has the following definition. If $ f \in \textup{Hom}(V,W)$ then $ f^T$ is a linear map taking $ h \in W^*$ to $ f^T(h) \in V^*$. The latter is a function in $ V^*$, so we need to say what it does on inputs $ v \in V$. The only definition that doesn’t introduce any type errors is $ (f^T(h))(v) = h(f(v))$. A more compact way to say this is that $ f^T(h) = h \circ f$.

Until next time!

The Complexity of Communication

satellite

One of the most interesting questions posed in the last thirty years of computer science is to ask how much “information” must be communicated between two parties in order for them to jointly compute something. One can imagine these two parties living on distant planets, so that the cost of communicating any amount of information is very expensive, but each person has an integral component of the answer that the other does not.

Since this question was originally posed by Andrew Yao in 1979, it has led to a flurry of applications in many areas of mathematics and computer science. In particular it has become a standard tool for proving lower bounds in many settings such as circuit design and streaming algorithms. And if there’s anything theory folks love more than a problem that can be solved by an efficient algorithm, it’s a proof that a problem cannot be solved by any efficient algorithm (that’s what I mean by “lower bound”).

Despite its huge applicability, the basic results in this area are elementary. In this post we’ll cover those basics, but once you get past these basic ideas and their natural extensions you quickly approach the state of the art and open research problems. Attempts to tackle these problems in recent years have used sophisticated techniques in Fourier analysis, Ramsey theory, and geometry. This makes it a very fun and exciting field.

As a quick side note before we start, the question we’re asking is different from the one of determining the information content of a specific message. That is the domain of information theory, which was posed (and answered) decades earlier. Here we’re trying to determine the complexity of a problem, where more complex messages require more information about their inputs.

The Basic Two-Player Model

The most basic protocol is simple enough to describe over a dinner table. Alice and Bob each have one piece of information $ x,y$, respectively, say they each have a number. And together they want to compute some operation that depends on both their inputs, for example whether $ x > y$. But in the beginning Alice has access only to her number $ x$, and knows nothing about $ y$. So Alice sends Bob a few bits. Depending on the message Bob computes something and replies, and this repeats until they have computed an answer. The question is: what is the minimum number of bits they need to exchange in order for both of them to be able to compute the right answer?

There are a few things to clarify here: we’re assuming that Alice and Bob have agreed on a protocol for sending information before they ever saw their individual numbers. So imagine ten years earlier Alice and Bob were on the same planet, and they agreed on the rules they’d follow for sending/replying information once they got their numbers. In other words, we’re making a worst-case assumption on Alice and Bob’s inputs, and as usual it will be measured as a function of $ n$, the lengths of their inputs. Then we take a minimum (asymptotically) over all possible protocols they could follow, and this value is the “communication complexity” of the problem. Computing the exact communication complexity of a given problem is no simple task, since there’s always the nagging question of whether there’s some cleverer protocol than the one you came up with. So most of the results are bounds on the communication complexity of a problem.

Indeed, we can give our first simple bound for the “$ x$ greater than $ y$” problem we posed above. Say the strings $ x,y$ both have $ n$ bits. What Alice does is send her entire string $ x$ to Bob, and Bob then computes the answer and sends the answer bit back to Alice. This requires $ n + 1$ bits of communication. This proves that the communication complexity of “$ x > y$” is bounded from above by $ n+1$. A much harder question is, can we do any better?

To make any progress on upper or lower bounds we need to be a bit more formal about the communication model. Basically, the useful analysis happens when the players alternate sending single bits, and this is only off by small constant factors from a more general model. This is the asymptotic analysis, that we only distinguish between things like linear complexity $ O(n)$ versus sublinear options like $ \log(n)$ or $ \sqrt{n}$ or even constant complexity $ O(1)$. Indeed, the protocol we described for $ x > y$ is the stupidest possible protocol for the problem, and it’s actually valid for any problem. For this problem it happens to be optimal, but we’re just trying to emphasize that nontrivial bounds are all sub-linear in the size of the inputs.

On to the formal model.

Definition: player is a computationally unbounded Turing machine.

And we really mean unbounded. Our players have no time or space constraints, and if they want they can solve undecidable problems like the halting problem or computing Kolmogorov complexity. This is to emphasize that the critical resource is the amount of communication between players. Moreover, it gives us a hint that lower bounds in this model won’t come form computational intractability, but instead will be “information-theoretic.”

Definition: Let $ \Sigma^*$ be the set of all binary strings. A communication protocol is a pair of functions $ A,B: \Sigma^* \times \Sigma^* \to \{ 0,1 \}$.

The input to these functions $ A(x, h)$ should be thought of as follows: $ x$ is the player’s secret input and $ h$ is the communication history so far. The output is the single bit that they will send in that round (which can be determined by the length of $ h$ since only one bit is sent in each round). The protocol then runs by having Alice send $ b_1 = A(x, \{ \})$ to Bob, then Bob replies with $ b_2 = B(y, b_1)$, Alice continues with $ b_3 = A(x, b_1b_2)$, and so on. We implicitly understand that the content of a communication protocol includes a termination condition, but we’ll omit this from the notation. We call the length of the protocol the number of rounds.

Definition: A communication protocol $ A,B$ is said to be valid for a boolean function $ f(x,y)$ if for all strings $ x, y$, the protocol for $ A, B$ terminates on some round $ t$ with $ b_t = 1$ if and only if $ f(x,y) = 1$.

So to define the communication complexity, we let the function $ L_{A,B}(n)$ be the maximum length of the protocol $ A, B$ when run on strings of length $ n$ (the worst-case for a given input size). Then the communication complexity of a function $ f$ is the minimum of $ L_{A,B}$ over all valid protocols $ A, B$. In symbols,

$ \displaystyle CC_f(n) = \min_{A,B \textup{ is valid for } f} L_{A,B}(n)$

We will often abuse the notation by writing the communication complexity of a function as $ CC(f)$, understanding that it’s measured asymptotically as a function of $ n$.

Matrices and Lower Bounds

Let’s prove a lower bound, that to compute the equality function you need to send a linear number of bits in the worst case. In doing this we’ll develop a general algebraic tool.

So let’s write out the function $ f$ as a binary matrix $ M(f)$ in the following way. Write all $ 2^n$ inputs of length $ n$ in some fixed order along the rows and columns of the matrix, and let entry $ i,j$ be the value of $ f(i,j)$. For example, the 6-bit function $ f$ which computes whether the majority of the two player’s bits are ones looks like this:

maj-matrix

The key insight to remember is that if the matrix of a function has a nice structure, then one needs very little communication to compute it. Let’s see why.

Say in the first round the row player sends a bit $ b$. This splits the matrix into two submatrices $ A_0, A_1$ by picking the rows of $ A_0$ to be those inputs for which the row player sends a $ b=0$, and likewise for $ A_1$ with $ b=1$. If you’re willing to rearrange the rows of the matrix so that $ A_0$ and $ A_1$ stack on top of each other, then this splits the matrix into two rectangles. Now we can switch to the column player and see which bit he sends in reply to each of the possible choices for $ b$ (say he sends back $ b’$). This separately splits each of $ A_0, A_1$ into two subrectangles corresponding to which inputs for the column player make him send the specific value of $ b’$. Continuing in this fashion we recurse until we find a submatrix consisting entirely of ones or entirely of zeros, and then we can say with certainty what the value of the function $ f$ is.

It’s difficult to visualize because every time we subdivide we move around the rows and columns within the submatrix corresponding to the inputs for each player. So the following would be a possible subdivision of an 8×8 matrix (with the values in the rectangles denoting which communicated bits got you there), but it’s sort of a strange one because we didn’t move the inputs around at all. It’s just a visual aid.

maj-matrix-subdivision

If we do this for $ t$ steps we get $ 2^t$ subrectangles. A crucial fact is that any valid communication protocol for a function has to give a subdivision of the matrix where all the rectangles are constant. or else there would be two pairs of inputs $ (x,y), (x’, y’)$, which are labeled identically by the communication protocol, but which have different values under $ f$.

So naturally one expects the communication complexity of $ f$ would require as many steps as there are steps in the best decomposition, that is, the decomposition with the fewest levels of subdivision. Indeed, we’ll prove this and introduce some notation to make the discourse less clumsy.

Definition: For an $ m \times n$ matrix $ M$, a rectangle is a submatrix $ A \times B$ where $ A \subset \{ 1, \dots m \}, B \subset \{ 1, \dots, n \}$. A rectangle is called monochromatic if all entires in the corresponding submatrix $ \left.M\right|_{A \times B}$ are the same. A monochromatic tiling of $ M$ is a partition of $ M$ into disjoint monochromatic rectangles. Define $ \chi(f)$ to be the minimum number of rectangles in any monochromatic tiling of $ M(f)$.

As we said, if there are $ t$ steps in a valid communication protocol for $ f$, then there are $ 2^t$ rectangles in the corresponding monochromatic tiling of $ M(f)$. Here is an easy consequence of this.

Proposition: If $ f$ has communication complexity $ CC(f)$, then there is a monochromatic tiling of $ M(f)$ with at most $ 2^{CC(f)}$ rectangles. In particular, $ \log(\chi(f)) \leq CC(f)$.

Proof. Pick any protocol that achieves the communication complexity of $ f$, and apply the process we described above to subdivide $ M(f)$. This will take exactly $ CC(f)$, and produce no more than $ 2^{CC(f)}$ rectangles.

$ \square$

This already gives us a bunch of theorems. Take the EQ function, for example. Its matrix is the identity matrix, and it’s not hard to see that every monochromatic tiling requires $ 2^n$ rectangles, one for each entry of the diagonal. I.e., $ CC(EQ) \geq n$. But we already know that one player can just send all his bits, so actually $ CC(EQ) = \Theta(n)$. Now it’s not always so easy to compute $ \chi(f)$. The impressive thing to do is to use efficiently computable information about $ M(f)$ to give bounds on $ \chi(f)$ and hence on $ CC(f)$. So can we come up with a better lower bound that depends on something we can compute? The answer is yes.

Theorem: For every function $ f$, $ \chi(f) \geq \textup{rank }M(f)$.

Proof. This just takes some basic linear algebra. One way to think of the rank of a matrix $ A$ is as the smallest way to write $ A$ as a linear combination of rank 1 matrices (smallest as in, the smallest number of terms needed to do this). The theorem is true no matter which field you use to compute the rank, although in this proof and in the rest of this post we’ll use the real numbers.

If you give me a monochromatic tiling by rectangles, I can view each rectangle as a matrix whose rank is at most one. If the entries are all zeros then the rank is zero, and if the entries are all ones then (using zero elsewhere) this is by itself a rank 1 matrix. So adding up these rectangles as separate components gives me an upper bound on the rank of $ A$. So the minimum way to do this is also an upper bound on the rank of $ A$.

$ \square$

Now computing something like $ CC(EQ)$ is even easier, because the rank of $ M(EQ) = M(I_{2^n})$ is just $ 2^n$.

Upper Bounds

There are other techniques to show lower bounds that are stronger than the rank and tiling method (because they imply the rank and tiling method). See this survey for a ton of details. But I want to discuss upper bounds a bit, because the central open conjecture in communication complexity is an upper bound.

The Log-Rank Conjecture: There is a universal constant $ c$, such that for all $ f$, the communication complexity $ CC(f) = O((\log \textup{rank }M(f))^c)$.

All known examples satisfy the conjecture, but unfortunately the farthest progress toward the conjecture is still exponentially worse than the conjecture’s statement. In 1997 the record was due to Andrei Kotlov who proved that $ CC(f) \leq \log(4/3) \textup{rank }M(f)$. It was not until 2013 that any (unconditional) improvements were made to this, when Shachar Lovett proved that $ CC(f) = O(\sqrt{\textup{rank }M(f)} \cdot \log \textup{rank }M(f))$.

The interested reader can check out this survey of Shachar Lovett from earlier this year (2014) for detailed proofs of these theorems and a discussion of the methods. I will just discuss one idea from this area that ties in nicely with our discussion: which is that finding an efficient communication protocol for a low-rank function $ f$ reduces to finding a large monochromatic rectangle in $ M(f)$.

Theorem [Nisan-Wigderson 94]: Let $ c(r)$ be a function. Suppose that for any function $ f: X \times Y \to \{ 0,1 \}$, we can find a monochromatic rectangle of size $ R \geq 2^{-c(r)} \cdot | X \times Y |$ where $ r = \textup{rank }M(f)$. Then any such $ f$ is computable by a deterministic protocol with communication complexity.

$ \displaystyle O \left ( \log^2(r) + \sum_{i=0}^{\log r} c(r/2^i) \right )$

Just to be concrete, this says that if $ c(r)$ is polylogarithmic, then finding these big rectangles implies a protocol also with polylogarithmic complexity. Since the complexity of the protocol is a function of $ r$ alone, the log-rank conjecture follows as a consequence. The best known results use the theorem for larger $ c(r) = r^b$ for some $ b < 1$, which gives communication complexity also $ O(r^b)$.

The proof of the theorem is detailed, but mostly what you’d expect. You take your function, split it up into the big monochromatic rectangle and the other three parts. Then you argue that when you recurse to one of the other three parts, either the rank is cut in half, or the size of the matrix is much smaller. In either case, you can apply the theorem once again. Then you bound the number of leaves in the resulting protocol tree by looking at each level $ i$ where the rank has dropped to $ r/2^i$. For the full details, see page 4 of the Shachar survey.

Multiple Players and More

In the future we’ll cover some applications of communication complexity, many of which are related to computing in restricted models such as parallel computation and streaming computation. For example, in parallel computing you often have processors which get arbitrary chunks of data as input and need to jointly compute something. Lower bounds on the communication complexity can help you prove they require a certain amount of communication in order to do that.

But in these models there are many players. And the type of communication matters: it can be point-to-point or broadcast, or something more exotic like MapReduce. So before we can get to these applications we need to define and study the appropriate generalizations of communication complexity to multiple interacting parties.

Until then!

Learning to Love Complex Numbers

This post is intended for people with a little bit of programming experience and no prior mathematical background.

So let’s talk about numbers.

Numbers are curious things. On one hand, they represent one of the most natural things known to humans, which is quantity. It’s so natural to humans that even newborn babies are in tune with the difference between quantities of objects between 1 and 3, in that they notice when quantity changes much more vividly than other features like color or shape.

But our familiarity with quantity doesn’t change the fact that numbers themselves (as an idea) are a human invention. And they’re not like most human inventions, the kinds where you have to tinker with gears or circuits to get a machine that makes your cappuccino. No, these are mathematical inventions. These inventions exist only in our minds.

Numbers didn’t always exist. A long time ago, back when the Greeks philosophers were doing their philosophizing, negative numbers didn’t exist! In fact, it wasn’t until 1200 AD that the number zero was first considered in Europe. Zero, along with negative numbers and fractions and square roots and all the rest, were invented primarily to help people solve more problems than they could with the numbers they had available. That is, numbers were invented primarily as a way for people to describe their ideas in a useful way. People simply  wondered “is there a number whose square gives you 2?” And after a while they just decided there was and called it $ \sqrt{2}$ because they didn’t have a better name for it. 

But with these new solutions came a host of new problems. You see, although I said mathematical inventions only exist in our minds, once they’re invented they gain a life of their own. You start to notice patterns in your mathematical objects and you have to figure out why they do the things they do. And numbers are a perfectly good example of this: once I notice that I can multiply a number by itself, I can ask how often these “perfect squares” occur. That is, what’s the pattern in the numbers $ 1^2, 2^2, 3^2, 4^2, \dots$? If you think about it for a while, you’ll find that square numbers have a very special relationship with odd numbers.

Other times, however, the things you invent turn out to make no sense at all, and you can prove they never existed in the first place! It’s an odd state of affairs, but we’re going to approach the subject of complex numbers from this mindset. We’re going to come up with a simple idea, the idea that negative numbers can be perfect squares, and explore the world of patterns it opens up. Along the way we’ll do a little bit of programming to help explore, give some simple proofs to solidify our intuition, and by the end we’ll see how these ideas can cause wonderful patterns like this one:

mandelbrot

The number i

Let’s bring the story back around to squares. One fact we all remember about numbers is that squaring a number gives you something non-negative. $ 7^2 = 49, (-2)^2 = 4, 0^2 = 0$, and so on. But it certainly doesn’t have to be this way. What if we got sick of that stupid fact and decided to invent a new number whose square was negative? Which negative, you ask? Well it doesn’t really matter, because I can always stretch it larger or smaller so that it’s square is -1.

Let’s see how: if you say that your made-up number $ x$ makes $ x^2 = -7$, then I can just use $ \frac{x}{\sqrt{7}}$ to get a number whose square is -1. If you’re going to invent a number that’s supposed to interact with our usual numbers, then you have to be allowed to add, subtract, and multiply $ x$ with regular old real numbers, and the usual properties would have to still work. So it would have to be true that $ (x / \sqrt{7})^2 = x^2 / \sqrt{7}^2 = -7/7 = -1$.

So because it makes no difference (this is what mathematicians mean by, “without loss of generality”) we can assume that the number we’re inventing will have a square of negative one. Just to line up with history, let’s call the new number $ i$. So there it is: $ i$ exists and $ i^2 = -1$. And now that we are “asserting” that $ i$ plays nicely with real numbers, we get these natural rules for adding and subtracting and multiplying and dividing. For example

  • $ 1 + i$ is a new number, which we’ll just call $ 1+i$. And if we added two of these together, $ (1+ i) + (1+i)$, we can combine the real parts and the $ i$ parts to get $ 2 + 2i$. Same goes for subtraction. In general a complex number looks like $ a + bi$, because as we’ll see in the other points you can simplify every simple arithmetic expression down to just one “real number” part and one “real number times $ i$” part.
  • We can multiply $ 3 \cdot i$, and we’ll just call it $ 3i$, and we require that multiplication distributes across addition (that the FOIL rule works). So that, for example, $ (2 – i)(1 + 3i) = (2 + 6i – i – 3i^2) = (2 + 3) + (6i – i) = (5 + 5i)$.
  • Dividing is a significantly more annoying. Say we want to figure out what $ 1 / (1+i)$ is (in fact, it’s not even obvious that this should look like a regular number! But it does). The $ 1 / a$ notation just means we’re looking for a number which, when we multiply by the denominator $ a$, we get back to 1. So we’re looking to find out when $ (a + bi)(1 + i) = 1 + 0i$ where $ a$ and $ b$ are variables we’re trying to solve for. If we multiply it out we get $ (a-b) + (a + b)i = 1 + 0i$, and since the real part and the $ i$ part have to match up, we know that $ a – b = 1$ and $ a + b = 0$. If we solve these two equations, we find that $ a = 1/2, b = -1/2$ works great. If we want to figure out something like $ (2 + 3i) / (1 – i)$, we just find out what $ 1 / (1- i)$ is first, and then multiply the result by $ (2+3i)$.

So that was tedious and extremely boring, and we imagine you didn’t even read it (that’s okay, it really is boring!). All we’re doing is establishing ground rules for the game, so if you come across some arithmetic that doesn’t make sense, you can refer back to this list to see what’s going on. And once again, for the purpose of this post, we’re asserting that all these laws hold. Maybe some laws follow from others, but as long as we don’t come up with any nasty self-contradictions we’ll be fine.

And now we turn to the real questions: is $ i$ the only square root of -1? Does $ i$ itself have a square root? If it didn’t, we’d be back to where we started, with some numbers (the non-$ i$ numbers) having square roots while others don’t. And so we’d feel the need to make all the $ i$ numbers happy by making up more numbers to be their square roots, and then worrying what if these new numbers don’t have square roots and…gah!

I’ll just let you in on the secret to save us from this crisis. It turns out that $ i$ does have a square root in terms of other $ i$ numbers, but in order to find it we’ll need to understand $ i$ from a different angle, and that angle turns out to be geometry.

Geometry? How is geometry going to help me understand numbers!?

It’s a valid question and part of why complex numbers are so fascinating. And I don’t mean geometry like triangles and circles and parallel lines (though there will be much talk of angles), I mean transformations in the sense that we’ll be “stretching,” “squishing,” and “rotating” numbers. Maybe another time I can tell you why for me “geometry” means stretching and rotating; it’s a long but very fun story.

The clever insight is that you can represent complex numbers as geometric objects in the first place. To do it, you just think of $ a + bi$ as a pair of numbers $ (a,b)$, (the pair of real part and $ i$ part), and then plot that point on a plane. For us, the $ x$-axis will be the “real” axis, and the $ y$-axis will be the $ i$-axis. So the number $ (3 – 4i)$ is plotted 3 units in the positive $ x$ direction and 4 units in the negative $ y$ direction. Like this:

single-complex-number

The “j” instead of “i” is not a typo, but a disappointing fact about the programming language we used to make this image. We’ll talk more about why later.

We draw it as an arrow for a good reason. Stretching, squishing, rotating, and reflecting will all be applied to the arrow, keeping its tail fixed at the center of the axes. Sometimes the arrow is called a “vector,” but we won’t use that word because here it’s synonymous with “complex number.”

So let’s get started squishing stuff.

Stretching, Squishing, Rotating

Before we continue I should clear up some names. We call a number that has an $ i$ in it a complex number, and we call the part without the $ i$ the real part (like 2 in $ 2-i$) and the part with $ i$ the complex part.

Python is going to be a great asset for us in exploring complex numbers, so let’s jump right into it. It turns out that Python natively supports complex numbers, and I wrote a program for drawing complex numbers. I used it to make the plot above. The program depends on a library I hate called matplotlib, and so the point of the program is to shield you from as much pain as possible and focus on complex numbers. You can use the program by downloading it from this blog’s Github page, along with everything else I made in writing this post. All you need to know how to do is call a function, and I’ve done a bit of window dressing removal to simplify things (I really hate matplotlib).

Here’s the function header:

# plotComplexNumbers : [complex] -&gt; None
# display a plot of the given list of complex numbers
def plotComplexNumbers(numbers):
   ...

Before we show some examples of how to use it, we have to understand how to use complex numbers in Python. It’s pretty simple, except that Python was written by people who hate math, and so they decided the complex number would be represented by $ j$ instead of $ i$ (people who hate math are sometimes called “engineers,” and they use $ j$ out of spite. Not really, though).

So in Python it’s just like any other computation. For example:

>>> (1 + 1j)*(4 - 2j) == (6+2j)
True
>>> 1 / (1+1j)
(0.5-0.5j)

And so calling the plotting function with a given list of complex numbers is as simple as importing the module and calling the function

from plotcomplex import plot
plot.plotComplexNumbers([(-1+1j), (1+2j), (-1.5 - 0.5j), (.6 - 1.8j)])

Here’s the result

example-complex-plot

So let’s use plots like this one to explore what “multiplication by $ i$” does to a complex number. It might not seem exciting at first, but I promise there’s a neat punchline.

Even without plotting it’s pretty easy to tell what multiplying by $ i$ does to some numbers. It takes 1 to $ i$, moves $ i$ to $ i^2 = -1$, it takes -1 to $ -i$, and $ -i$ to $ -i \cdot i = 1$.

What’s the pattern in these? well if we plot all these numbers, they’re all at right angles in counter-clockwise order. So this might suggest that multiplication by $ i$ does some kind of rotation. Is that always the case? Well lets try it with some other more complicated numbers. Click the plots below to enlarge.

Well, it looks close but it’s hard to tell. Some of the axes are squished and stretched, so it might be that our images don’t accurately represent the numbers (the real world can be such a pain). Well when visual techniques fail, we can attempt to prove it.

Clearly multiplying by $ i$ does some kind of rotation, maybe with other stuff too, and it shouldn’t be so hard to see that multiplying by $ i$ does the same thing no matter which number you use (okay, the skeptical readers will say that’s totally hard to see, but we’ll prove it super rigorously in a minute). So if we take any number and multiply it by $ i$ once, then twice, then three times, then four, and if we only get back to where we started at four multiplications, then each rotation had to be a quarter turn.

Indeed,

$ \displaystyle (a + bi) i^4 = (ai – b) i^3 = (-a – bi) i^2 = (-ai + b) i = a + bi$

This still isn’t all that convincing, and we want to be 100% sure we’re right. What we really need is a way to arithmetically compute the angle between two complex numbers in their plotted forms. What we’ll do is find a way to measure the angle of one complex number with the $ x$-axis, and then by subtraction we can get angles between arbitrary points. For example, in the figure below $ \theta = \theta_1 – \theta_2$.

angle-example

One way to do this is with trigonometry: the geometric drawing of $ a + bi$ is the hypotenuse of a right triangle with the $ x$-axis.

triangle-example

And so if $ r$ is the length of the arrow, then by the definition of sine and cosine, $ \cos(\theta) = a/r, \sin(\theta) = b/r$. If we have $ r, \theta$, and $ r > 0$, we can solve for a unique $ a$ and $ b$, so instead of representing a complex number in terms of the pair of numbers $ (a,b)$, we can represent it with the pair of numbers $ (r, \theta)$. And the conversion between the two is just

$ a + bi = r \cos(\theta) + (r \sin(\theta)) i$

The $ (r, \theta)$ representation is called the polar representation, while the $ (a,b)$ representation is called the rectangular representation or the Cartesian representation. Converting between polar and Cartesian coordinates fills the pages of many awful pre-calculus textbooks (despite the fact that complex numbers don’t exist in classical calculus). Luckily for us Python has built-in functions to convert between the two representations for us.

&gt;&gt;&gt; import cmath
&gt;&gt;&gt; cmath.polar(1 + 1j)
(1.4142135623730951, 0.7853981633974483)
&gt;&gt;&gt; z = cmath.polar(1 + 1j)
&gt;&gt;&gt; cmath.rect(z[0], z[1])
(1.0000000000000002+1j)

It’s a little bit inaccurate on the rounding, but it’s fine for our purposes.

So how do we compute the angle between two complex numbers? Just convert each to the polar form, and subtract the second coordinates. So if we get back to our true goal, to figure out what multiplication by $ i$ does, we can just do everything in polar form. Here’s a program that computes the angle between two complex numbers.

def angleBetween(z, w):
   zPolar, wPolar = cmath.polar(z), cmath.polar(w)
   return wPolar[1] - zPolar[1]

print(angleBetween(1 + 1j, (1 + 1j) * 1j))
print(angleBetween(2 - 3j, (2 - 3j) * 1j))
print(angleBetween(-0.5 + 7j, (-0.5 + 7j) * 1j))

Running it gives

1.5707963267948966
1.5707963267948966
-4.71238898038469

Note that the decimal form of $ \pi/2$ is 1.57079…, and that the negative angle is equivalent to $ \pi/2$ if you add a full turn of $ 2\pi$ to it. So programmatically we can see that for every input we try multiplying by $ i$ rotates 90 degrees.

But we still haven’t proved it works. So let’s do that now. To say what the angle is between $ r \cos (\theta) + ri \sin (\theta)$ and $ i \cdot [r \cos (\theta) + ri \sin(\theta)] = -r \sin (\theta) + ri \cos(\theta)$, we need to transform the second number into the usual polar form (where the $ i$ is on the sine part and not the cosine part). But we know, or I’m telling you now, this nice fact about sine and cosine:

$ \displaystyle \sin(\theta + \pi/2) = cos(\theta)$
$ \displaystyle \cos(\theta + \pi / 2) = -\sin(\theta)$

This fact is maybe awkward to write out algebraically, but it’s just saying that if you shift the whole sine curve a little bit you get the cosine curve, and if you keep shifting it you get the opposite of the sine curve (and if you kept shifting it even more you’d eventually get back to the sine curve; they’re called periodic for this reason).

So immediately we can rewrite the second number as $ r \cos(\theta + \pi/2) + i r \sin (\theta + \pi/2)$. The angle is the same as the original angle plus a right angle of $ \pi/2$. Neat!

Applying this same idea to $ (a + bi) \cdot (c + di)$, it’s not much harder to prove that multiplying two complex numbers in general multiplies their lengths and adds their angles. So if a complex number $ z$ has its magnitude $ r$ smaller than 1, multiplying by $ z$ squishes and rotates whatever is being multiplied. And if the magnitude is greater than 1, it stretches and rotates. So we have a super simple geometric understanding of how arithmetic with complex numbers works. And as we’re about to see, all this stretching and rotating results in some really weird (and beautifully mysterious!) mathematics and programs.

But before we do that we still have one question to address, the question that started this whole geometric train of thought: does $ i$ have a square root? Indeed, I’m just looking for a number such that, when I square its length and double its angle, I get $ i = \cos(\pi/2) + i \sin(\pi/2)$. Indeed, the angle we want is $ \pi/4$, and the length we want is $ r = 1$, which means $ \sqrt{i} = \cos(\pi/4) + i \sin(\pi/4)$. Sweet! There is another root if you play with the signs, see if you can figure it out.

In fact it’s a very deeper and more beautiful theorem (“theorem” means “really important fact”) called the fundamental theorem of algebra. And essentially it says that the complex numbers are complete. That is, we can always find square roots, cube roots, or anything roots of numbers involving $ i$. It actually says a lot more, but it’s easier to appreciate the rest of it after you do more math than we’re going to do in this post.

On to pretty patterns!

The Fractal

So here’s a little experiment. Since every point in the plane is the end of some arrow representing a complex number, we can imagine transforming the entire complex plane by transforming each number by the same rule. The most interesting simple rule we can think of: squaring! So though it might strain your capacity for imagination, try to visualize the idea like this. Squaring a complex number is the same as squaring it’s length and doubling its angle. So imagine: any numbers whose arrows are longer than 1 will grow much bigger, arrows shorter than 1 will shrink, and arrows of length exactly one will stay the same length (arrows close to length 1 will grow/shrink much more slowly than those far away from 1). And complex numbers with small positive angles will increase their angle, but only a bit, while larger angles will grow faster.

Here’s an animation made by Douglas Arnold showing what happens to the set of complex numbers $ a + bi$ with $ 0 \leq a, b \leq 1$ or $ -1 < a,b < 0$. Again, imagine every point is the end of a different arrow for the corresponding complex number. The animation is for a single squaring, and the points move along the arc they would travel if one rotated/stretched them smoothly.

complex-squaring

So that’s pretty, but this is by all accounts a well-behaved transformation. It’s “predictable,” because for example we can always tell which complex numbers will get bigger and bigger (in length) and which will get smaller.

What if, just for the sake of tinkering, we changed the transformation a little bit? That is, instead of sending $ z = a+bi$ to $ z^2$ (I’ll often write this $ z \mapsto z^2$), what if we sent

$ \displaystyle z \mapsto z^2 + 1$

Now it’s not so obvious: which numbers will grow and which will shrink? Notice that it’s odd because adding 1 only changes the real part of the number. So a number whose length is greater than 1 can become small under this transformation. For example, $ i$ is sent to $ 0$, so something slightly larger would also be close to zero. Indeed, $ 5i/4 \mapsto -9/16$.

So here’s an interesting question: are there any complex numbers that will stay small even if I keep transforming like this forever? Specifically, if I call $ f(z) = z^2$, and I call $ f^2(z) = f(f(z))$, and likewise call $ f^k(z)$ for $ k$ repeated transformations of $ z$, is there a number $ z$ so that for every $ k$, the value $ f^k(z) < 2$? “Obvious” choices like $ z=0$ don’t work, and neither do random guesses like $ z=i$ or $ z=1$. So should we guess the answer is no?

Before we jump to conclusions let’s write a program to see what happens for more than our random guesses. The program is simple: we’ll define the “square plus one” function, and then repeatedly apply that function to a number for some long number of times (say, 250 times). If the length of the number stays under 2 after so many tries, we’ll call it “small forever,” and otherwise we’ll call it “not small forever.”

def squarePlusOne(z):
   return z*z + 1

def isSmallForever(z, f):
   k = 0

   while abs(z) &lt; 2: z = f(z) k += 1 if k &gt; 250:
         return True

   return False

This isSmallForever function is generic: you can give it any function $ f$ and it will repeatedly call $ f$ on $ z$ until the result grows bigger than 2 in length. Note that the abs function is a built-in Python function for computing the length of a complex number.

Then I wrote a classify function, which you can give a window and a small increment, and it will produce a grid of zeros and ones marking the results of isSmallForever. The details of the function are not that important. I also wrote a function that turns the grid into a picture. So here’s an example of how we’d use it:

from plotcomplex.plot import gridToImage

def classifySquarePlusOne(z):
   return isSmallForever(z, squarePlusOne)

grid = classify(classifySquarePlusOne) # the other arguments are defaulted to [-2,2], [-2,2], 0.1
gridToImage(grid)

And here’s the result. Points colored black grow beyond 2, and white points stay small for the whole test.

Looks like they'll always grow big.

Looks like they’ll always grow big.

So it looks like repeated squaring plus one will always make complex numbers grow big. That’s not too exciting, but we can always make it more exciting. What happens if we replace the 1 in $ z^2 + 1$ with a different complex number? For example, if we do $ z^2 – 1$ then will things always grow big?

You can randomly guess and see that 0 will never grow big, because $ 0^2 – 1 = -1$ and $ (-1)^2 – 1 = 0$. It will just oscillate forever. So with -1 some numbers will grow and some will not! Let’s use the same routine above to see which:

def classifySquareMinusOne(z):
      return isSmallForever(z, squareMinusOne)

grid = classify(classifySquareMinusOne)
gridToImage(grid)

And the result:

second-attempt

Now that’s a more interesting picture! Let’s ramp up the resolution

grid = classify(classifySquareMinusOne, step=0.001)
gridToImage(grid)

second-attempt-zoomed

Gorgeous. If you try this at home you’ll notice, however, that this took a hell of a long time to run. Speeding up our programs is very possible, but it’s a long story for another time. For now we can just be patient.

Indeed, this image has a ton of interesting details! It looks almost circular in the middle, but if we zoom in we can see that it’s more like a rippling wave

second-attempt-zoomed2

It’s pretty incredible, and a huge question is jumping out at me: what the heck is causing this pattern to occur? What secret does -1 know that +1 doesn’t that makes the resulting pattern so intricate?

But an even bigger question is this. We just discovered that some values of $ c$ make $ z \mapsto z^2 + c$ result in interesting patterns, and some do not! So the question is which ones make interesting patterns? Even if we just, say, fix the starting point to zero: what is the pattern in the complex numbers that would tell me when this transformation makes zero blow up, and when it keeps zero small?

Sounds like a job for another program. This time we’ll use a nice little Python feature called a closure, which we define a function that saves the information that exists when it’s created for later. It will let us write a function that takes in $ c$ and produces a function that transforms according to $ z \mapsto z^2+c$.

def squarePlusC(c):
   def f(z):
      return z*z + c

   return f

And we can use the very same classification/graphing function from before to do this.

def classifySquarePlusC(c):
   return isSmallForever(0, squarePlusC(c))

grid = classify(classifySquarePlusC, xRange=(-2, 1), yRange=(-1, 1), step=0.005)
gridToImage(grid)

And the result:

mandelbrot

Stunning. This wonderful pattern, which is still largely not understood today, is known as the Mandelbrot set. That is, the white points are the points in the Mandlebrot set, and the black points are not in it. The detail on the border of this thing is infinitely intricate. For example, we can change the window in our little program to zoom in on a particular region.

mandelbrot-zoomed

And if you keep zooming in you keep getting more and more detail. This was true of the specific case of $ z^2 – 1$, but somehow the patterns in the Mandelbrot set are much more varied and interesting. And if you keep going down eventually you’ll see patterns that look like the original Mandelbrot set. We can already kind of see that happening above. The name for this idea is a fractal, and the $ z^2 – 1$ image has it too. Fractals are a fascinating and mysterious subject studied in a field called discrete dynamical systems. Many people dedicate their entire lives to studying these things, and it’s for good reason. There’s a lot to learn and even more that’s unknown!

So this is the end of our journey for now. I’ve posted all of the code we used in the making of this post so you can continue to play, but here are some interesting ideas.

  • The Mandelbrot set (and most fractals) are usually colored. The way they’re colored is as follows. Rather than just say true or false when zero blows up beyond 2 in length, you return the number of iterations $ k$ that happened. Then you pick a color based on how big $ k$ is. There’s a link below that lets you play with this. In fact, adding colors shows that there is even more intricate detail happening outside the Mandelbrot set that’s too faint to see in our pictures above. Such as this.
  • Some very simple questions about fractals are very hard to answer. For example, is the Mandelbrot set connected? That is, is it possible to “walk” from every point in the Mandelbrot set to every other point without leaving the set? Despite the scattering of points in the zoomed in picture above that suggest the answer is no, the answer is actually yes! This is a really difficult thing to prove, however.
  • The patterns in many fractals are often used to generate realistic looking landscapes and generate pseudo randomness. So fractals are not just mathematical curiosities.
  • You should definitely be experimenting with this stuff! What happens if you change the length threshold from 2 to some bigger number? What about a smaller number? What if you do powers different than $ 2$? There’s so much to explore!
  • The big picture thing to take away from this is that it’s not the numbers themselves that are particularly interesting, it’s the transformations of the numbers that generate these patterns! The interesting questions are what kinds of things are the same under these transformations, and what things are different. This is a very general idea in mathematics, and the more math you do the more you’ll find yourself wondering about useful and bizarre transformations.

For the chance to keep playing with the Mandelbrot set, check out this Mandelbrot grapher that works in your browser. It lets you drag rectangles to zoom further in on regions of interest. It’s really fun.

Until next time!