Tensorphobia and the Outer Product

Variations on a theme

Back in 2014 I wrote a post called How to Conquer Tensorphobia that should end up on Math $ \cap$ Programming’s “greatest hits” album. One aspect of tensors I neglected to discuss was the connection between the modern views of tensors and the practical views of linear algebra. I feel I need to write this because every year or two I forget why it makes sense.

The basic question is:

What the hell is going on with the outer product of vectors?

The simple answer, the one that has never satisfied me, is that the outer product of $ v,w$ is the matrix $ vw^T$ whose $ i,j$ entry is the product $ v_i w_j$. This doesn’t satisfy me because it’s an explanation by fiat. It lacks motivation and you’re supposed to trust (or verify) things magically work out. To me this definition is like the definition of matrix multiplication: having it dictated to you before you understand why it makes sense is a cop out. Math isn’t magic, it needs to make perfect sense.

The answer I like, and the one I have to re-derive every few years because I never wrote it down, is a little bit longer.

To borrow a programming term, the basic confusion is a type error. We start with two vectors, $ v,w$ in some vector space $ V$ (let’s say everything is finite dimensional), and we magically turn them into a matrix. Let me reiterate for dramatic effect: we start with vectors, which I have always thought of as objects, things you can stretch, rotate, or give as a gift to your cousin Mauricio. A matrix, on the other hand, is a mapping, a thing that takes vectors as input and spits out new vectors. Sure, you can play with the mappings, feed them kibble and wait for them to poop or whatever. And sure, sometimes vectors are themselves maps, like in the vector space of real-valued functions (where the word “matrix” is a stretch, since it’s infinite dimensional).

But taking two vectors and *poof* getting a mapping of all vectors is a big jump. And part of the discomfort is that it feels random. To be happy we have to understand that the construction is natural, or even better canonical, meaning this should be the only way to turn two vectors into a linear map. Then the definition would make sense.

So let’s see how we can do that. Just to be clear, everything we do in this post will be for finite-dimensional vector spaces over $ \mathbb{R}$, but we’ll highlight the caveats when they come up.

Dual vector spaces

The first step is understanding how to associate a vector with a linear map in a “natural” or “canonical” way. There is one obvious candidate: if you give me a vector $ v$, I can make a linear map by taking the dot product of $ v$ with the input. So I can associate

$ \displaystyle v \mapsto \langle v,- \rangle$

The dash is a placeholder for the input. Another way to say it is to define $ \varphi_v(w) = \langle v,w \rangle$ and say the association takes $ v \mapsto \varphi_v$. So this “association,” taking $ v$ to the inner product, is itself  a mapping from $ V$ to “maps from $ V$ to $ \mathbb{R}$.” Note that $ \varphi_v(w)$ is linear in $ v$ and $ w$ because that’s part of the definition of an inner product.

To avoid saying “maps from $ V$ to $ \mathbb{R}$” all the time, we’ll introduce some notation.

Definition: Let $ V$ be a vector space over a field $ k$. The set of $ k$-linear maps from $ V \to k$ is called $ \textup{Hom}(V,k)$. More generally, the set of $ k$-linear maps from $ V$ to another $ k$-vector space $ W$ is called $ \textup{Hom}(V,W)$.

“Hom” stands for “homomorphism,” and in general it just means “maps with the structure I care about.” For this post $ k = \mathbb{R}$, but most of what we say here will be true for any field. If you go deeply into this topic, it matters whether $ k$ is algebraically closed, or has finite characteristic, but for simplicity we’ll ignore all of that. We’ll also ignore the fact that these maps are called linear functionals and this is where the name “functional analysis” comes from. All we really want to do is understand the definition of the outer product.

Another bit of notation for brevity:

Definition: Let $ V$ be an $ \mathbb{R}$-vector space. The dual vector space for $ V$, denoted $ V^*$, is $ \textup{Hom}(V, \mathbb{R})$.

So the “vector-to-inner-product” association we described above is a map $ V \to V^*$. It takes in $ v \in V$ and spits out $ \varphi_v \in V^*$.

Now here’s where things start to get canonical (interesting). First, $ V^*$ is itself a vector space. This is an easy exercise, and the details are not too important for us, but I’ll say the key: if you want to add two functions, you just add their (real number) outputs. In fact we can say more:

Theorem: $ V$ and $ V^*$ are isomorphic as vector spaces, and the map $ v \mapsto \varphi_v$ is the canonical isomorphism.

Confessions of a mathematician: we’re sweeping some complexity under the rug. When we upgraded our vector space to an inner product space, we fixed a specific (but arbitrary) inner product on $ V$. For finite dimensional vector spaces it makes no difference, because every finite-dimensional $ \mathbb{R}$-inner product space is isomorphic to $ \mathbb{R}^n$ with the usual inner product. But the theorem is devastatingly false for infinite-dimensional vector spaces. There are two reasons: (1) there are many (non-canonical) choices of inner products and (2) the image of the map $ v \mapsto \varphi_v$ for any given inner product need not span $ V^*$. Luckily we’re in finite dimensions so we can ignore all that. [Edit: see Emilio’s comments for a more detailed discussion of what’s being swept under the rug, and how we’re ignoring the categorical perspective when we say “natural” and “canonical.”]

Before we make sense of the isomorphism let’s talk more about $ V^*$. First off, it’s not even entirely obvious that $ V^*$ is finite-dimensional. On one hand, if $ v_1, \dots, v_n$ is a basis of $ V$ then we can quickly prove that $ \varphi_{v_1}, \dots ,\varphi_{v_n}$ are linearly independent in $ V^*$. Indeed, if they weren’t then there’d be some linear combination $ a_1 \varphi_{v_1} + \dots + a_n \varphi_{v_n}$ that is the zero function, meaning that for every vector $ w$, the following is zero

$ \displaystyle a_1 \langle v_1, w \rangle + \dots + a_n \langle v_n, w \rangle = 0.$

But since the inner product is linear in both arguments we get that $ \langle a_1 v_1 + \dots + a_n v_n , w \rangle = 0$ for every $ w$. And this can only happen when $ a_1v_1 + \dots + a_nv_n$ is the zero vector (prove this).

One consequence is that the linear map $ v \mapsto \varphi_v$ is injective. So we can think of $ V$ as “sitting inside” $ V^*$. Now here’s a very slick way to show that the $ \varphi_{v_i}$ span all of $ V^*$. First we can assume our basis $ v_1, \dots, v_n$ is actually an orthonormal basis with respect to our inner product (this is without loss of generality). Then we write any linear map $ f \in V^*$ as

$ f = f(v_1)\varphi_{v_1} + \dots + f(v_n) \varphi_{v_n}$

To show these two are actually equal, it’s enough to show they agree on a basis for $ V$. That is, if you plug in $ v_1$ to the function on the left- and right-hand side of the above, you’ll get the same thing. The orthonormality of the basis makes it work, since all the irrelevant inner products are zero.

In case you missed it, that completes the proof that $ V$ and $ V^*$ are isomorphic. Now when I say that the isomorphism is “canonical,” I mean that if you’re willing to change the basis of $ V$ and $ V^*$, then $ v \mapsto \varphi_v$ is the square identity matrix, i.e. the only isomorphism between any two finite-dimensional vector spaces (up to a change of basis).

Tying in tensors

At this point we have a connection between single vectors $ v$ and linear maps in $ V^*$, i.e. linear maps whose codomain has dimension 1. If we want to understand the outer product, we need a connection between pairs of vectors and matrices, i.e. $ \textup{Hom}(V,V)$. In other words, we’d like to find a canonical isomorphism between $ V \times V$ and $ \textup{Hom}(V,V)$. But already it’s not possible because the spaces have different dimensions. If $ \textup{dim}(V) = n$ then the former has dimension $ 2n$ and the latter has dimension $ n^2$. So any “natural” relation between these spaces has to be a way to embed $ V \times V \subset \textup{Hom}(V,V)$ as a subspace via some injective map.

There are two gaping problems with this approach. First, the outer product is not linear as a map from $ V \times V \to \textup{Hom}(V,V)$. To see this, take any $ v,w \in V$ and pick any scalar $ \lambda \in \mathbb{R}$. Scaling the pair $ (v,w)$ means scaling both components to $ (\lambda v, \lambda w)$, and so the outer product is the matrix $ (\lambda v)(\lambda w)^T = \lambda^2 vw^T$.

The second problem is that the only way to make $ V \times V$ a subspace of $ \textup{Hom}(V,V)$ (up to a change of basis) is to map $ v,w$ to the first two rows of a matrix with zeros elsewhere. This is canonical but it doesn’t have the properties that the outer product promises us. Indeed, the outer product lets us uniquely decompose a matrix as a “sum of rank 1 matrices,” but we don’t get a unique decomposition of a matrix as a sum of these two-row things. We also don’t even get a well-defined rank by decomposing into a sum of two-row matrices (you can get cancellation by staggering the sum). This injection is decisively useless.

It would seem like we’re stuck, until we think back to our association between $ V$ and $ V^*$. If we take one of our two vectors, say $ v$, and pair it with $ w$, we can ask how $ (\varphi_v, w)$ could be turned into a linear map in $ \textup{Hom}(V,V)$. A few moments of guessing and one easily discovers the map

$ \displaystyle x \mapsto \varphi_v(x) w = \langle v,x \rangle w$

In words, we’re scaling $ w$ by the inner product of $ x$ and $ v$. In geometric terms, we project onto $ v$ and scale $ w$ by the signed length of that projection. Let’s call this map $ \beta_{v,w}(x)$, so that the association maps $ (v,w) \mapsto \beta_{v,w}$. The thought process of “easily discovering” this is to think, “What can you do with a function $ \varphi_v$ and an input $ x$? Plug it in. Then what can you do with the resulting number and a vector $ w$? Scale $ w$.”
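
Just to check this against the matrix definition, here is a quick numpy sketch (my own, with arbitrary example vectors): the map $ x \mapsto \langle v, x \rangle w$ is exactly multiplication by the matrix $ wv^T$.

import numpy

# beta_{v,w}(x) = <v, x> w, for some made-up example vectors
v = numpy.array([1.0, 2.0, 3.0])
w = numpy.array([4.0, 5.0, 6.0])
x = numpy.array([0.5, -1.0, 2.0])

beta_of_x = numpy.dot(v, x) * w      # scale w by the inner product <v, x>
matrix = numpy.outer(w, v)           # the matrix w v^T, whose (i,j) entry is w_i v_j
print(numpy.allclose(beta_of_x, matrix.dot(x)))   # True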

If you look closely you’ll see we’ve just defined the outer product. This is because the outer product works by saying $ uv^T$ is a matrix, which acts on a vector $ x$ by doing $ (uv^T)x = u(v^T x)$. But the important thing is that, because $ V$ and $ V^*$ are canonically isomorphic, this is a mapping

$ \displaystyle V \times V = V^* \times V \to \textup{Hom}(V,V)$

Now again, this mapping is not linear. In fact, it’s bilinear, and if there’s one thing we know about bilinear maps, it’s that tensors are their gatekeepers. If you recall our previous post on tensorphobia, this means that this bilinear map “factors through” the tensor product in a canonical way. So the true heart of this association $ (v,w) \mapsto \beta_{v,w}$ is a map $ B: V \otimes V \to \textup{Hom}(V,V)$ defined by

$ \displaystyle B(v \otimes w) = \beta_{v,w}$

And now the punchline,

Theorem: $ B$ is an isomorphism of vector spaces.

Proof. If $ v_1, \dots, v_n$ is an orthonormal basis for $ V$ then it’s enough to show that $ \beta_{v_i, v_j}$ forms a basis for $ \textup{Hom}(V,V)$. Since we already know $ \dim(\textup{Hom}(V,V)) = n^2$ and there are $ n^2$ of the $ \beta_{v_i, v_j}$, all we need to do is show that the $ \beta$’s are linearly independent. For brevity let me remove the $ v$’s and call $ \beta_{i,j} =\beta_{v_i, v_j}$.

Suppose they are not linearly independent. Then there is some choice of scalars $ a_{i,j}$ so that the linear combination below is the identically zero function

$ \displaystyle \sum_{i,j=1}^n a_{i,j}\beta_{i,j} = 0$

In other words, if I plug in any $ v_i$ from my (orthonormal) basis, the result is zero. So let’s plug in $ v_1$.

$ \displaystyle \begin{aligned} 0 &= \sum_{i,j=1}^n a_{i,j} \beta_{i,j}(v_1) \\ &= \sum_{i,j=1}^n a_{i,j} \langle v_i, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} \langle v_1, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} v_j \end{aligned}$

The orthonormality makes all of the $ \langle v_i ,v_1 \rangle = 0$ when $ i \neq 1$, so we get a linear combination of the $ v_j$ being zero. Since the $ v_i$ form a basis, it must be that all the $ a_{1,j} = 0$. The same thing happens when you plug in $ v_2$ or any other $ v_k$, and so all the $ a_{i,j}$ are zero, proving linear independence.

$ \square$

This theorem immediately implies some deep facts, such as that every matrix can be uniquely decomposed as a linear combination of the $ \beta_{i,j}$’s. Moreover, facts like the $ \beta_{i,j}$’s being rank 1 are immediate: by definition the maps scale a single vector by some number. So of course the image will be one-dimensional. Finding a useful basis $ \{ v_i \}$ with which to decompose a matrix is where things get truly fascinating, and we’ll see that next time when we study the singular value decomposition.
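For the standard basis of $ \mathbb{R}^n$ this unique decomposition is easy to check numerically. Here is a small sketch of my own (the matrix is an arbitrary example): since $ \beta_{i,j}$ has matrix $ e_j e_i^T$, the coefficient of $ \beta_{i,j}$ is just the $ (j,i)$ entry of the matrix.

import numpy

n = 3
M = numpy.arange(1.0, 10.0).reshape((n, n))   # an arbitrary 3x3 matrix

# rebuild M as a linear combination of the beta_{i,j}, whose matrices are e_j e_i^T
reconstruction = numpy.zeros((n, n))
for i in range(n):
   for j in range(n):
      e_i, e_j = numpy.eye(n)[i], numpy.eye(n)[j]
      reconstruction += M[j, i] * numpy.outer(e_j, e_i)

print(numpy.allclose(M, reconstruction))   # True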

In the meantime, this understanding generalizes nicely (via induction/recursion) to higher dimensional tensors. And you don’t need to talk about slices or sub-tensors or lose your sanity over $ n$-tuples of indices.

Lastly, all of this duality stuff provides a “coordinate-free” way to think about the transpose of a linear map. We can think of the “transpose” operation as a linear map $ -^T : \textup{Hom}(V,W) \to \textup{Hom}(W^*,V^*)$ which (even in infinite dimensions) has the following definition. If $ f \in \textup{Hom}(V,W)$ then $ f^T$ is a linear map taking $ h \in W^*$ to $ f^T(h) \in V^*$. The latter is a function in $ V^*$, so we need to say what it does on inputs $ v \in V$. The only definition that doesn’t introduce any type errors is $ (f^T(h))(v) = h(f(v))$. A more compact way to say this is that $ f^T(h) = h \circ f$.
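In coordinates this is the familiar identity $ \langle u, Fv \rangle = \langle F^T u, v \rangle$. Here is a tiny sanity check of my own (with made-up matrices): if $ f$ has matrix $ F$ and the functional $ h \in W^*$ is represented by a vector $ u$, then $ h \circ f$ is represented by $ F^T u$.

import numpy

F = numpy.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # f: R^3 -> R^2
u = numpy.array([7.0, -1.0])                          # h in W^*, h(w) = <u, w>
v = numpy.array([0.5, 2.0, -3.0])                     # a test input in V

print(numpy.isclose(numpy.dot(u, F.dot(v)),           # h(f(v))
                    numpy.dot(F.T.dot(u), v)))        # (f^T(h))(v); prints True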

Until next time!

Concrete Examples of Quantum Gates

So far in this series we’ve seen a lot of motivation and defined basic ideas of what a quantum circuit is. But on rereading my posts, I think we would all benefit from some concreteness.

“Local” operations

So by now we’ve understood that quantum circuits consist of a sequence of gates $ A_1, \dots, A_k$, where each $ A_i$ is an 8-by-8 matrix that operates “locally” on some choice of three (or fewer) qubits. And in your head you imagine starting with some state vector $ v$ and applying each $ A_i$ locally to its three qubits until the end when you measure the state and get some classical output.

But the point I want to make is that $ A_i$ actually changes the whole state vector $ v$, because the three qubits it acts “locally” on are part of the entire basis. Here’s an example. Suppose we have three qubits and they’re in the state

$ \displaystyle v = \frac{1}{\sqrt{14}} (e_{001} + 2e_{011} - 3e_{101})$

Recall we abbreviate basis states by subscripting them by binary strings, so $ e_{011} = e_0 \otimes e_1 \otimes e_1$, and a valid state is any unit vector over the $ 2^3 = 8$ possible basis elements. As a vector, this state is $ \frac{1}{\sqrt{14}} (0,1,0,2,0,-3,0,0)$

Say we apply the gate $ A$ that swaps the first and third qubits. “Locally” this gate has the following matrix:

$ \displaystyle \begin{pmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{pmatrix}$

where we index the rows and columns by the relevant strings in lexicographic order: 00, 01, 10, 11. So this operation leaves $ e_{00}$ and $ e_{11}$ the same while swapping the other two. However, as an operation on three qubits the operation looks quite different. And it’s sort of hard to describe a general way to write it down as a matrix because of the choice of indices. There are three different perspectives.

Perspective 1: if the qubits being operated on are sequential (like, the third, fourth, and fifth qubits), then we can write the matrix as $ I_{2^{a}} \otimes A \otimes I_{2^{b}}$ where a tensor product of matrices is the Kronecker product and $ a + b + \log \textup{dim}(A) = n$ (the number of qubits adds up). Then the final operation looks like a “tiled product” of identity matrices by $ A$, but it’s a pain to write out. Let me hurt myself for your sake, dear reader.

[Image: the tiled block matrix $ I_{2^{a}} \otimes A \otimes I_{2^{b}}$]

And each copy of $ A \otimes I_{2^b}$ looks like

[Image: a single tile $ A \otimes I_{2^b}$]

That’s a mess, but if you write it out for our example of swapping the first and third qubits of a three-qubit register you get the following:

[Image: the 8-by-8 matrix for swapping the first and third qubits of a three-qubit register]

And this makes sense: the gate swaps the entries of the state vector whose first and third qubit values differ, and leaves the rest alone. This is what happens to our state:

$ \displaystyle v = \frac{1}{\sqrt{14}} (0,1,0,2,0,-3,0,0) \mapsto \frac{1}{\sqrt{14}} (0,0,0,0,1,-3,2,0)$

Perspective 2: just assume every operation works on the first three qubits, and wrap each operation $ A$ in between an operation that swaps the first three qubits with the desired three. So like $ BAB$ for $ B$ a swap operation. Then the matrix form looks a bit simpler, and it just means we permute the columns of the matrix form we gave above so that it just has the form $ A \otimes I_a$. This allows one to retain a shred of sanity when trying to envision the matrix for an operation that acts on three qubits that are not sequential. The downside is that to actually use this perspective in an analysis you have to carry around the extra baggage of these permutation matrices. So one might use this as a simplifying assumption (a “without loss of generality” statement).

Perspective 3: ignore matrices and write things down in a summation form. So if $ \sigma$ is the permutation that swaps 1 and 3 and leaves the other indices unchanged, we can write the general operation on a state $ v = \sum_{x \in \{ 0,1 \}^n } a_x e_{x}$ as $ Av = \sum_{x \in \{ 0,1 \}^n} a_x e_{\sigma(x)}$.
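As a sanity check on the example above, here is a short numpy sketch of Perspective 3 (the helper name is mine): build the $ 8 \times 8$ matrix column by column by permuting the basis strings, and apply it to the example state.

import numpy

def swap_first_and_third(n=3):
   A = numpy.zeros((2**n, 2**n))
   for index in range(2**n):
      bits = list(format(index, '0{}b'.format(n)))
      bits[0], bits[2] = bits[2], bits[0]        # sigma swaps the first and third qubits
      A[int(''.join(bits), 2), index] = 1        # the column for e_x has a 1 in row sigma(x)
   return A

A = swap_first_and_third()
v = (1 / 14**0.5) * numpy.array([0, 1, 0, 2, 0, -3, 0, 0])
print(A.dot(v) * 14**0.5)   # [0, 0, 0, 0, 1, -3, 2, 0], as claimed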

The third option is probably the nicest way to do things, but it’s important to keep the matrix view in mind for many reasons. Just one quick reason: “errors” in quantum gates (that are meant to approximately compute something) compound linearly in the number of gates because the operations are linear. This is a key reason that allows one to design quantum analogues of error correcting codes.

So we’ve established that the basic (atomic) quantum gates are “local” in the sense that they operate on a fixed number of qubits, but they are not local in the sense that they can screw up the entire state vector.

A side note on the meaning of “local”

When I was chugging through learning this stuff (and I still have far to go), I wanted to come up with an alternate characterization of the word “local” so that I would feel better about using the word “local.” Mathematicians are as passionate about word choice as programmers are about text editors. In particular, for a long time I was ignorantly convinced that quantum gates that act on a small number of qubits don’t affect the marginal distribution of measurement outcomes for other qubits. That is, I thought that if $ A$ acts on qubits 1,2,3, then $ Av$ and $ v$ have the same probability of a measurement producing a 1 in index 4, 5, etc, conditioned on fixing a measurement outcome for qubits 1,2,3. In notation, if $ x$ is a random variable whose values are binary strings and $ v$ is a state vector, I’ll call $ x \sim v$ the random process of measuring a state vector $ v$ and getting a string $ x$. Then my claim was that the following was true for every $ b_1, b_2, b_3 \in \{0,1\}$ and every $ 1 \leq i \leq n$:

$ \displaystyle \begin{aligned}\Pr_{x \sim v}&[x_i = 1 \mid x_1 = b_1, x_2 = b_2, x_3 = b_3] = \\ \Pr_{x \sim Av}&[x_i = 1 \mid x_1 = b_1, x_2 = b_2, x_3 = b_3] \end{aligned}$

You could try to prove this, and you would fail because it’s false. In fact, it’s even false if $ A$ acts on only a single qubit! Because it’s so tedious to write out all of the notation, I decided to write a program to illustrate the counterexample. (The most brazenly dedicated readers will try to prove this false fact and identify where the proof fails.)

import itertools  # used by condProb below
import math       # used by condProb below
import numpy

H = (1/(2**0.5)) * numpy.array([[1,1], [1,-1]])
I = numpy.identity(4)
A = numpy.kron(H,I)

Here $ H$ is the 2 by 2 Hadamard matrix, which operates on a single qubit and maps $ e_0 \mapsto \frac{e_0 + e_1}{\sqrt{2}}$, and $ e_1 \mapsto \frac{e_0 - e_1}{\sqrt{2}}$. This matrix is famous for many reasons, but one simple use as a quantum gate is to generate uniform random coin flips. In particular, measuring $ He_0$ outputs 1 and 0 with equal probability.

So in the code sample above, $ A$ is the mapping which applies the Hadamard operation to the first qubit and leaves the other qubits alone.

Then we compute some arbitrary input state vector $ w$

def normalize(z):
   return (1.0 / (sum(abs(z)**2) ** 0.5)) * z

v = numpy.arange(1,9)
w = normalize(v)

And now we write a function to compute the probability of some query conditioned on some fixed bits. We simply sum up the square norms of all of the relevant indices in the state vector.

def condProb(state, query={}, fixed={}):
   num = 0
   denom = 0
   dim = int(math.log2(len(state)))

   # enumerate all measurement outcomes (binary strings of length dim)
   for x in itertools.product([0,1], repeat=dim):
      # skip outcomes inconsistent with the bits we're conditioning on
      if any(x[index] != b for (index, b) in fixed.items()):
         continue

      # index of the basis vector e_x in the state vector (lexicographic order)
      i = sum(d << k for (k, d) in enumerate(reversed(x)))
      denom += abs(state[i])**2
      if all(x[index] == b for (index, b) in query.items()):
         num += abs(state[i]) ** 2

   if num == 0:
      return 0

   return num / denom

So if the query is query = {1:0} and the fixed thing is fixed = {0:0}, then this will compute the probability that the measurement results in the second qubit being zero conditioned on the first qubit also being zero.

And the result:

Aw = A.dot(w)
query = {1:0}
fixed = {0:0}
print((condProb(w, query, fixed), condProb(Aw, query, fixed)))
# (0.16666666666666666, 0.29069767441860467)

So they are not equal in general.

Also, in general we won’t work explicitly with full quantum gate matrices, since for $ n$ qubits they have $ 2^{2n}$ entries, which is big. But for finding counterexamples to guesses and false intuition, it’s a great tool.

Some important gates on 1-3 qubits

Let’s close this post with concrete examples of quantum gates. Based on the above discussion, we can write out the 2 x 2 or 4 x 4 matrix form of the operation and understand that it can apply to any two qubits in the state of a quantum program. Gates are most interesting when they’re operating on entangled qubits, and that will come out when we visit our first quantum algorithm next time, but for now we will just discuss at a naive level how they operate on the basis vectors.

Hadamard gate:

We introduced the Hadamard gate already, but I’ll reiterate it here.

Let $ H$ be the following 2 by 2 matrix, which operates on a single qubit and maps $ e_0 \mapsto \frac{e_0 + e_1}{\sqrt{2}}$, and $ e_1 \mapsto \frac{e_0 - e_1}{\sqrt{2}}$.

$ \displaystyle H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

One can use $ H$ to generate uniform random coin flips. In particular, measuring $ He_0$ outputs 1 and 0 with equal probability.

Quantum NOT gate:

Let $ X$ be the 2 x 2 matrix formed by swapping the columns of the identity matrix.

$ \displaystyle X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

This gate is often called the “Pauli-X” gate by physicists. This matrix is far too simple to be named after a person, and I can only imagine it is still named after a person for the layer of obfuscation that so often makes people feel smarter (same goes for the Pauli-Y and Pauli-Z gates, but we’ll get to those when we need them).

If we’re thinking of $ e_0$ as the boolean value “false” and $ e_1$ as the boolean value “true”, then the quantum NOT gate simply swaps those two states. In particular, note that composing a Hadamard and a quantum NOT gate can have interesting effects: $ XH(e_0) = H(e_0)$, but $ XH(e_1) \neq H(e_1)$. In the second case, the minus sign is the culprit. Which brings us to…
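Here is a quick numpy check of that claim (a sketch of my own; $ e_0, e_1$ are the standard basis vectors):

import numpy

H = (1 / 2**0.5) * numpy.array([[1, 1], [1, -1]])
X = numpy.array([[0, 1], [1, 0]])
e0, e1 = numpy.array([1.0, 0.0]), numpy.array([0.0, 1.0])

print(numpy.allclose(X.dot(H).dot(e0), H.dot(e0)))   # True
print(numpy.allclose(X.dot(H).dot(e1), H.dot(e1)))   # False: X H e_1 = -H e_1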

Phase shift gate:

Given an angle $ \theta$, we can “shift the phase” of one qubit by an angle of $ \theta$ using the 2 x 2 matrix $ R_{\theta}$.

$ \displaystyle R_{\theta} = \begin{pmatrix} 1 & 0 \\ 0 & e^{i \theta} \end{pmatrix}$

“Phase” is a term physicists like to use for angles. Since the coefficients of a quantum state vector are complex numbers, and since complex numbers can be thought of geometrically as vectors with direction and magnitude, it makes sense to “rotate” the coefficient of a single qubit. So $ R_{\theta}$ does nothing to $ e_0$ and it rotates the coefficient of $ e_1$ by an angle of $ \theta$.

Continuing in our theme of concreteness, if I have the state vector $ v = \frac{1}{\sqrt{204}} (1,2,3,4,5,6,7,8)$ and I apply a rotation of $ \pi$ to the second qubit, then my operation is the matrix $ I_2 \otimes R_{\pi} \otimes I_2$ which maps $ e_{i0k} \mapsto e_{i0k}$ and $ e_{i1k} \mapsto -e_{i1k}$. That would map the state $ v$ to $ (1,2,-3,-4,5,6,-7,-8)$.

If we instead used the rotation by $ \pi/2$ we would get the output state $ (1,2,3i, 4i, 5, 6, 7i, 8i)$.
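Both computations are easy to reproduce with Kronecker products (a sketch of mine; the code rescales by $ \sqrt{204}$ at the end so the printed entries match the unnormalized vectors above, up to floating point roundoff):

import numpy

def phase_on_second_qubit(theta):
   I2 = numpy.identity(2)
   R = numpy.array([[1, 0], [0, numpy.exp(1j * theta)]])
   return numpy.kron(I2, numpy.kron(R, I2))     # I_2 (x) R_theta (x) I_2

v = numpy.arange(1, 9) / 204**0.5
print(phase_on_second_qubit(numpy.pi).dot(v) * 204**0.5)       # approximately [1, 2, -3, -4, 5, 6, -7, -8]
print(phase_on_second_qubit(numpy.pi / 2).dot(v) * 204**0.5)   # approximately [1, 2, 3i, 4i, 5, 6, 7i, 8i]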

Quantum AND/OR gate:

In the last post in this series we gave the quantum AND gate and left the quantum OR gate as an exercise. Rather than write out the matrix again, let me remind you of this gate using a description of the effect on the basis $ e_{ijk}$ where $ i,j,k \in \{ 0,1 \}$. Recall that we need three qubits in order to make the operation reversible (which is a consequence of all quantum gates being unitary matrices). Some notation: $ \oplus$ is the XOR of two bits, and $ \wedge$ is AND, $ \vee$ is OR. The quantum AND gate maps

$ \displaystyle e_{ijk} \mapsto e_{ij(k \oplus (i \wedge j))}$

In words, the third coordinate is XORed with the AND of the first two coordinates. We think of the third coordinate as a “scratchwork” qubit which is maybe prepared ahead of time to be in state zero.

Similarly, the quantum OR gate maps $ e_{ijk} \mapsto e_{ij(k \oplus (i \vee j))}$. As we saw last time, these combined with the quantum NOT gate (and some modest number of scratchwork qubits) allow quantum circuits to simulate any classical circuit.
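Since the gate is just a permutation of the basis vectors, it is easy to build its $ 8 \times 8$ matrix directly from the description above (a sketch of mine; the indexing uses the same lexicographic convention as before):

import numpy

def quantum_and():
   T = numpy.zeros((8, 8))
   for i in (0, 1):
      for j in (0, 1):
         for k in (0, 1):
            source = 4*i + 2*j + k                  # index of e_{ijk}
            target = 4*i + 2*j + (k ^ (i & j))      # index of e_{ij(k XOR (i AND j))}
            T[target, source] = 1
   return T

T = quantum_and()
print(numpy.allclose(T.dot(T), numpy.identity(8)))   # True: the gate is its own inverse
print(T.dot(numpy.eye(8)[6]))                        # e_{110} maps to e_{111}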

Controlled-* gate:

The last example in this post is a meta-gate that represents a conditional branching. If we’re given a gate $ A$ acting on $ k$ qubits, then we define the controlled-A to be an operation which acts on $ k+1$ qubits. Let’s call the added qubit “qubit zero.” Then controlled-A does nothing if qubit zero is in state 0, and applies $ A$ if qubit zero is in state 1. Qubit zero is generally called the “control qubit.”

The matrix representing this operation decomposes into blocks if the control qubit is actually the first qubit (or you rearrange).

$ \displaystyle \begin{pmatrix} I_{2^{k}} & 0 \\ 0 & A \end{pmatrix}$

A common example of this is the controlled-NOT gate, often abbreviated CNOT, and it has the matrix

$ \displaystyle \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$
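When the control qubit comes first, the block form above is easy to build directly, and plugging in the quantum NOT gate recovers the CNOT matrix (a small sketch of my own):

import numpy

def controlled(A):
   dim = A.shape[0]     # A is 2^k by 2^k
   return numpy.block([
      [numpy.identity(dim), numpy.zeros((dim, dim))],
      [numpy.zeros((dim, dim)), A],
   ])

X = numpy.array([[0, 1], [1, 0]])
print(controlled(X))   # the CNOT matrix above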

Looking forward

Okay let’s take a step back and evaluate our life choices. So far we’ve spent a few hours of our time motivating quantum computing, explaining the details of qubits and quantum circuits, and seeing examples of concrete quantum gates and studying measurement. I’ve hopefully hammered into your head the notion that quantum states which aren’t pure tensors (i.e. entangled) are where the “weirdness” of quantum computing comes from. But we haven’t seen any examples of quantum algorithms yet!

Next time we’ll see our first example of an algorithm that is genuinely quantum. We won’t tackle factoring yet, but we will see quantum “weirdness” in action.

Until then!

The Complexity of Communication

One of the most interesting questions posed in the last thirty years of computer science is to ask how much “information” must be communicated between two parties in order for them to jointly compute something. One can imagine these two parties living on distant planets, so that the cost of communicating any amount of information is very expensive, but each person has an integral component of the answer that the other does not.

Since this question was originally posed by Andrew Yao in 1979, it has led to a flurry of applications in many areas of mathematics and computer science. In particular it has become a standard tool for proving lower bounds in many settings such as circuit design and streaming algorithms. And if there’s anything theory folks love more than a problem that can be solved by an efficient algorithm, it’s a proof that a problem cannot be solved by any efficient algorithm (that’s what I mean by “lower bound”).

Despite its huge applicability, the basic results in this area are elementary. In this post we’ll cover those basics, but once you get past these basic ideas and their natural extensions you quickly approach the state of the art and open research problems. Attempts to tackle these problems in recent years have used sophisticated techniques in Fourier analysis, Ramsey theory, and geometry. This makes it a very fun and exciting field.

As a quick side note before we start, the question we’re asking is different from the one of determining the information content of a specific message. That is the domain of information theory, which was posed (and answered) decades earlier. Here we’re trying to determine the complexity of a problem, where more complex messages require more information about their inputs.

The Basic Two-Player Model

The most basic protocol is simple enough to describe over a dinner table. Alice and Bob each have one piece of information $ x,y$, respectively, say they each have a number. And together they want to compute some operation that depends on both their inputs, for example whether $ x > y$. But in the beginning Alice has access only to her number $ x$, and knows nothing about $ y$. So Alice sends Bob a few bits. Depending on the message Bob computes something and replies, and this repeats until they have computed an answer. The question is: what is the minimum number of bits they need to exchange in order for both of them to be able to compute the right answer?

There are a few things to clarify here: we’re assuming that Alice and Bob have agreed on a protocol for sending information before they ever saw their individual numbers. So imagine ten years earlier Alice and Bob were on the same planet, and they agreed on the rules they’d follow for sending/replying information once they got their numbers. In other words, we’re making a worst-case assumption on Alice and Bob’s inputs, and as usual it will be measured as a function of $ n$, the lengths of their inputs. Then we take a minimum (asymptotically) over all possible protocols they could follow, and this value is the “communication complexity” of the problem. Computing the exact communication complexity of a given problem is no simple task, since there’s always the nagging question of whether there’s some cleverer protocol than the one you came up with. So most of the results are bounds on the communication complexity of a problem.

Indeed, we can give our first simple bound for the “$ x$ greater than $ y$” problem we posed above. Say the strings $ x,y$ both have $ n$ bits. What Alice does is send her entire string $ x$ to Bob, and Bob then computes the answer and sends the answer bit back to Alice. This requires $ n + 1$ bits of communication. This proves that the communication complexity of “$ x > y$” is bounded from above by $ n+1$. A much harder question is, can we do any better?

To make any progress on upper or lower bounds we need to be a bit more formal about the communication model. Basically, the useful analysis happens when the players alternate sending single bits, and this is only off by small constant factors from a more general model. This is in the spirit of asymptotic analysis: we only distinguish between things like linear complexity $ O(n)$ versus sublinear options like $ \log(n)$ or $ \sqrt{n}$ or even constant complexity $ O(1)$. Indeed, the protocol we described for $ x > y$ is the stupidest possible protocol for the problem, and it’s actually valid for any problem. For this problem it happens to be optimal, but we’re just trying to emphasize that nontrivial bounds are all sub-linear in the size of the inputs.

On to the formal model.

Definition: A player is a computationally unbounded Turing machine.

And we really mean unbounded. Our players have no time or space constraints, and if they want they can solve undecidable problems like the halting problem or computing Kolmogorov complexity. This is to emphasize that the critical resource is the amount of communication between players. Moreover, it gives us a hint that lower bounds in this model won’t come from computational intractability, but instead will be “information-theoretic.”

Definition: Let $ \Sigma^*$ be the set of all binary strings. A communication protocol is a pair of functions $ A,B: \Sigma^* \times \Sigma^* \to \{ 0,1 \}$.

The input to these functions $ A(x, h)$ should be thought of as follows: $ x$ is the player’s secret input and $ h$ is the communication history so far. The output is the single bit that they will send in that round (which can be determined by the length of $ h$ since only one bit is sent in each round). The protocol then runs by having Alice send $ b_1 = A(x, \{ \})$ to Bob, then Bob replies with $ b_2 = B(y, b_1)$, Alice continues with $ b_3 = A(x, b_1b_2)$, and so on. We implicitly understand that the content of a communication protocol includes a termination condition, but we’ll omit this from the notation. We call the length of the protocol the number of rounds.

Definition: A communication protocol $ A,B$ is said to be valid for a boolean function $ f(x,y)$ if for all strings $ x, y$, the protocol for $ A, B$ terminates on some round $ t$ with $ b_t = 1$ if and only if $ f(x,y) = 1$.

So to define the communication complexity, we let the function $ L_{A,B}(n)$ be the maximum length of the protocol $ A, B$ when run on strings of length $ n$ (the worst-case for a given input size). Then the communication complexity of a function $ f$ is the minimum of $ L_{A,B}$ over all valid protocols $ A, B$. In symbols,

$ \displaystyle CC_f(n) = \min_{A,B \textup{ is valid for } f} L_{A,B}(n)$

We will often abuse the notation by writing the communication complexity of a function as $ CC(f)$, understanding that it’s measured asymptotically as a function of $ n$.
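To make the definitions concrete, here is a toy sketch (mine, not from the original discussion) of the send-everything protocol for “$ x > y$” written in this $ A, B$ form. In the alternating-single-bit model Bob has to send dummy bits while he waits for all of $ x$, so the protocol uses $ 2n$ bits rather than $ n+1$, which makes no asymptotic difference.

def alice(x, history):
   i = len(history) // 2              # how many of Alice's turns have already happened
   return int(x[i])                   # send the next bit of x

def bob(y, history):
   bits_of_x = history[0::2]          # Alice's bits sit at the even positions of the history
   if len(bits_of_x) < len(y):
      return 0                        # still waiting for the rest of x; send a dummy bit
   return int(bits_of_x > y)          # lexicographic comparison of equal-length bit strings

def run_protocol(A, B, x, y):
   history = ""
   for _ in range(len(x)):            # n rounds of (Alice's bit, Bob's bit)
      history += str(A(x, history))
      history += str(B(y, history))
   return int(history[-1]), len(history)

print(run_protocol(alice, bob, "0110", "0101"))   # (1, 8): x > y
print(run_protocol(alice, bob, "0011", "1100"))   # (0, 8): x < y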

Matrices and Lower Bounds

Let’s prove a lower bound, that to compute the equality function you need to send a linear number of bits in the worst case. In doing this we’ll develop a general algebraic tool.

So let’s write out the function $ f$ as a binary matrix $ M(f)$ in the following way. Write all $ 2^n$ inputs of length $ n$ in some fixed order along the rows and columns of the matrix, and let entry $ i,j$ be the value of $ f(i,j)$. For example, the 6-bit function $ f$ which computes whether the majority of the two players’ bits are ones looks like this:

[Image: the matrix $ M(f)$ for the 6-bit majority function]
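A quick way to reproduce this matrix yourself (a sketch of mine; I’m taking “majority” to mean strictly more than half of the six bits are ones, and the exact tie-breaking convention doesn’t affect anything that follows):

import itertools
import numpy

def function_matrix(f, n):
   # rows are Alice's inputs, columns are Bob's, in lexicographic order
   inputs = [''.join(bits) for bits in itertools.product('01', repeat=n)]
   return numpy.array([[f(x, y) for y in inputs] for x in inputs])

majority = lambda x, y: int((x + y).count('1') > 3)
print(function_matrix(majority, 3))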

The key insight to remember is that if the matrix of a function has a nice structure, then one needs very little communication to compute it. Let’s see why.

Say in the first round the row player sends a bit $ b$. This splits the matrix into two submatrices $ A_0, A_1$ by picking the rows of $ A_0$ to be those inputs for which the row player sends a $ b=0$, and likewise for $ A_1$ with $ b=1$. If you’re willing to rearrange the rows of the matrix so that $ A_0$ and $ A_1$ stack on top of each other, then this splits the matrix into two rectangles. Now we can switch to the column player and see which bit he sends in reply to each of the possible choices for $ b$ (say he sends back $ b’$). This separately splits each of $ A_0, A_1$ into two subrectangles corresponding to which inputs for the column player make him send the specific value of $ b’$. Continuing in this fashion we recurse until we find a submatrix consisting entirely of ones or entirely of zeros, and then we can say with certainty what the value of the function $ f$ is.

It’s difficult to visualize because every time we subdivide we move around the rows and columns within the submatrix corresponding to the inputs for each player. So the following would be a possible subdivision of an 8×8 matrix (with the values in the rectangles denoting which communicated bits got you there), but it’s sort of a strange one because we didn’t move the inputs around at all. It’s just a visual aid.

[Image: a subdivision of an 8×8 matrix into rectangles, labeled by the bits communicated to reach them]

If we do this for $ t$ steps we get $ 2^t$ subrectangles. A crucial fact is that any valid communication protocol for a function has to give a subdivision of the matrix where all the rectangles are constant, or else there would be two pairs of inputs $ (x,y), (x’, y’)$, which are labeled identically by the communication protocol, but which have different values under $ f$.

So naturally one expects the communication complexity of $ f$ would require as many steps as there are steps in the best decomposition, that is, the decomposition with the fewest levels of subdivision. Indeed, we’ll prove this and introduce some notation to make the discourse less clumsy.

Definition: For an $ m \times n$ matrix $ M$, a rectangle is a submatrix $ A \times B$ where $ A \subset \{ 1, \dots m \}, B \subset \{ 1, \dots, n \}$. A rectangle is called monochromatic if all entries in the corresponding submatrix $ \left.M\right|_{A \times B}$ are the same. A monochromatic tiling of $ M$ is a partition of $ M$ into disjoint monochromatic rectangles. Define $ \chi(f)$ to be the minimum number of rectangles in any monochromatic tiling of $ M(f)$.

As we said, if there are $ t$ steps in a valid communication protocol for $ f$, then there are $ 2^t$ rectangles in the corresponding monochromatic tiling of $ M(f)$. Here is an easy consequence of this.

Proposition: If $ f$ has communication complexity $ CC(f)$, then there is a monochromatic tiling of $ M(f)$ with at most $ 2^{CC(f)}$ rectangles. In particular, $ \log(\chi(f)) \leq CC(f)$.

Proof. Pick any protocol that achieves the communication complexity of $ f$, and apply the process we described above to subdivide $ M(f)$. This will take exactly $ CC(f)$ steps, and produce no more than $ 2^{CC(f)}$ rectangles.

$ \square$

This already gives us a bunch of theorems. Take the EQ function, for example. Its matrix is the identity matrix, and it’s not hard to see that every monochromatic tiling requires $ 2^n$ rectangles, one for each entry of the diagonal. I.e., $ CC(EQ) \geq n$. But we already know that one player can just send all his bits, so actually $ CC(EQ) = \Theta(n)$. Now it’s not always so easy to compute $ \chi(f)$. The impressive thing to do is to use efficiently computable information about $ M(f)$ to give bounds on $ \chi(f)$ and hence on $ CC(f)$. So can we come up with a better lower bound that depends on something we can compute? The answer is yes.

Theorem: For every function $ f$, $ \chi(f) \geq \textup{rank }M(f)$.

Proof. This just takes some basic linear algebra. One way to think of the rank of a matrix $ A$ is as the smallest way to write $ A$ as a linear combination of rank 1 matrices (smallest as in, the smallest number of terms needed to do this). The theorem is true no matter which field you use to compute the rank, although in this proof and in the rest of this post we’ll use the real numbers.

If you give me a monochromatic tiling by rectangles, I can view each rectangle as a matrix whose rank is at most one. If the entries are all zeros then the rank is zero, and if the entries are all ones then (using zero elsewhere) this is by itself a rank 1 matrix. So adding up these rectangles as separate components writes $ A$ as a sum of at most as many rank 1 matrices as there are rectangles in the tiling. In other words, the number of rectangles in any monochromatic tiling is an upper bound on the rank of $ A$, and in particular $ \chi(f) \geq \textup{rank }M(f)$.

$ \square$

Now computing something like $ CC(EQ)$ is even easier, because the rank of $ M(EQ) = M(I_{2^n})$ is just $ 2^n$.
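Here is a tiny numerical confirmation of that bound for $ n = 3$ (a sketch of my own): $ M(EQ)$ is the $ 2^n \times 2^n$ identity matrix, its rank is $ 2^n$, and taking the logarithm gives $ CC(EQ) \geq n$.

import numpy

n = 3
M_EQ = numpy.identity(2**n)
rank = numpy.linalg.matrix_rank(M_EQ)
print(rank, numpy.log2(rank))   # 8, 3.0 -- so CC(EQ) >= log2(rank) = n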

Upper Bounds

There are other techniques to show lower bounds that are stronger than the rank and tiling method (because they imply the rank and tiling method). See this survey for a ton of details. But I want to discuss upper bounds a bit, because the central open conjecture in communication complexity is an upper bound.

The Log-Rank Conjecture: There is a universal constant $ c$, such that for all $ f$, the communication complexity $ CC(f) = O((\log \textup{rank }M(f))^c)$.

All known examples satisfy the conjecture, but unfortunately the farthest progress toward the conjecture is still exponentially worse than the conjecture’s statement. In 1997 the record was due to Andrei Kotlov who proved that $ CC(f) \leq \log(4/3) \textup{rank }M(f)$. It was not until 2013 that any (unconditional) improvements were made to this, when Shachar Lovett proved that $ CC(f) = O(\sqrt{\textup{rank }M(f)} \cdot \log \textup{rank }M(f))$.

The interested reader can check out this survey of Shachar Lovett from earlier this year (2014) for detailed proofs of these theorems and a discussion of the methods. I will just discuss one idea from this area that ties in nicely with our discussion: which is that finding an efficient communication protocol for a low-rank function $ f$ reduces to finding a large monochromatic rectangle in $ M(f)$.

Theorem [Nisan-Wigderson 94]: Let $ c(r)$ be a function. Suppose that for any function $ f: X \times Y \to \{ 0,1 \}$, we can find a monochromatic rectangle of size $ R \geq 2^{-c(r)} \cdot | X \times Y |$ where $ r = \textup{rank }M(f)$. Then any such $ f$ is computable by a deterministic protocol with communication complexity

$ \displaystyle O \left ( \log^2(r) + \sum_{i=0}^{\log r} c(r/2^i) \right )$

Just to be concrete, this says that if $ c(r)$ is polylogarithmic, then finding these big rectangles implies a protocol also with polylogarithmic complexity. Since the complexity of the protocol is a function of $ r$ alone, the log-rank conjecture follows as a consequence. The best known results use the theorem for larger $ c(r) = r^b$ for some $ b < 1$, which gives communication complexity also $ O(r^b)$.

The proof of the theorem is detailed, but mostly what you’d expect. You take your function, split it up into the big monochromatic rectangle and the other three parts. Then you argue that when you recurse to one of the other three parts, either the rank is cut in half, or the size of the matrix is much smaller. In either case, you can apply the theorem once again. Then you bound the number of leaves in the resulting protocol tree by looking at each level $ i$ where the rank has dropped to $ r/2^i$. For the full details, see page 4 of the Shachar survey.

Multiple Players and More

In the future we’ll cover some applications of communication complexity, many of which are related to computing in restricted models such as parallel computation and streaming computation. For example, in parallel computing you often have processors which get arbitrary chunks of data as input and need to jointly compute something. Lower bounds on the communication complexity can help you prove they require a certain amount of communication in order to do that.

But in these models there are many players. And the type of communication matters: it can be point-to-point or broadcast, or something more exotic like MapReduce. So before we can get to these applications we need to define and study the appropriate generalizations of communication complexity to multiple interacting parties.

Until then!

The Two-Dimensional Fourier Transform and Digital Watermarking

We’ve studied the Fourier transform quite a bit on this blog: with four primers and the Fast Fourier Transform algorithm under our belt, it’s about time we opened up our eyes to higher dimensions.

Indeed, in the decades since Cooley & Tukey’s landmark paper, the most interesting applications of the discrete Fourier transform have occurred in dimensions greater than 1. But for all our work we haven’t yet discussed what it means to take an “n-dimensional” Fourier transform. Our past toiling and troubling will pay off, though, because the higher Fourier transform and its 1-dimensional cousin are quite similar. Indeed, the shortest way to describe the $ n$-dimensional transform is as the 1-dimensional transform with inner products of vector variables replacing regular products of variables.

In this post we’ll flesh out these details. We’ll define the multivariable Fourier transform and its discrete partner, implement an algorithm to compute it (FFT-style), and then apply the transform to the problem of digitally watermarking images.

As usual, all the code, images, and examples used in this post are available on this blog’s Github page.

Sweeping Some Details Under the Rug

We spent our first and second primers on Fourier analysis describing the Fourier series in one variable, and taking a limit of the period to get the Fourier transform in one variable. By all accounts, it was a downright mess of notation and symbol manipulation that culminated in the realization that the Fourier series looks a lot like a Riemann sum. As it was in one dimension, so it is in arbitrary dimension, but to save our stamina for the applications we’re going to treat the $ n$-dimensional transform differently. We’ll use the 1-dimensional transform as a model, and magically generalize it to operate on a vector-valued variable. Then the reader will take it on faith that we could achieve the same end as a limit of some kind of multidimensional Fourier series (and all that nonsense with Schwarz functions and tempered distributions is left to the analysts), or if not we’ll provide external notes with the full details.

So we start with a real-valued (or complex-valued) function $ f : \mathbb{R}^n \to \mathbb{R}$, and we write the variable as $ x = (x_1, \dots, x_n)$, so that we can stick to using the notation $ f(x)$. Rather than think of the components of $ x$ as “time variables” as we did in the one-dimensional case, we’ll usually think of $ x$ as representing physical space. And so the periodic behavior of the function $ f$ represents periodicity in space. On the other hand our transformed variables will be “frequency” in space, and this will correspond to a vector variable $ \xi = (\xi_1, \dots, \xi_n)$. We’ll come back to what the heck “periodicity in space” means momentarily.

Remember that in one dimension the Fourier transform was defined by

$ \displaystyle \mathscr{F}f(s) = \int_{-\infty}^\infty e^{-2\pi ist}f(t) dt$.

And its inverse transform was

$ \displaystyle \mathscr{F}^{-1}g(t) = \int_{-\infty}^\infty e^{2\pi ist}g(s) ds$.

Indeed, with the vector $ x$ replacing $ t$ and $ \xi$ replacing $ s$, we have to figure out how to make an analogous definition. The obvious thing to do is to take the place where $ st$ is multiplied and replace it with the inner product of $ x$ and $ \xi$, which for this post I’ll write $ x \cdot \xi$ (usually I write $ \left \langle x, \xi \right \rangle$). This gives us the $ n$-dimensional transform

$ \displaystyle \mathscr{F}f(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i x \cdot \xi}f(x) dx$,

and its inverse

$ \displaystyle \mathscr{F}^{-1}g(x) = \int_{\mathbb{R}^n} e^{2\pi i x \cdot \xi}g( \xi ) d \xi$

Note that the integral is over all of $ \mathbb{R}^n$. To give a clarifying example, if we are in two dimensions we can write everything out in coordinates: $ x = (x_1, x_2), \xi = (\xi_1, \xi_2)$, and the formula for the transform becomes

$ \displaystyle \mathscr{F}f(\xi_1, \xi_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-2 \pi i (x_1 \xi_1 + x_2 \xi_2)} f(x_1, x_2) dx_1 dx_2$.

Now that’s a nasty integral if I’ve ever seen one. But for our purposes in this post, this will be as nasty as it gets, for we’re primarily concerned with image analysis. So representing things as vectors of arbitrary dimension is more compact, and we don’t lose anything for it.

Periodicity in Space? It’s All Mostly the Same

Because arithmetic with vectors and arithmetic with numbers is so similar, it turns out that most of the properties of the 1-dimensional Fourier transform hold in arbitrary dimension. For example, the duality of the Fourier transform and its inverse holds, because for vectors $ e^{-2 \pi i x \cdot (-\xi)} = e^{2 \pi i x \cdot \xi}$. So just like in one dimension, we have

$ \mathscr{F}f(-\xi) = \mathscr{F}^{-1}f(\xi)$

And again we have correspondences between algebraic operations: convolution in the spatial domain corresponds to multiplication in the frequency domain, the spectrum is symmetric about the origin, etc.

At a more geometric level, though, the Fourier transform does the same sort of thing as it did in the one-dimensional case. Again the complex exponentials form the building blocks of any function we want, and performing a Fourier transform on an $ n$-dimensional function decomposes that function into its frequency components. So a function that is perfectly periodic corresponds to a Fourier spectrum that’s perfectly concentrated at a point.

But what the hell, the reader might ask, is ‘periodicity in space’? Since we’re talking about images anyway, the variables we care about (the coordinates of a pixel) are spatial variables. You could, if you were so inclined, have a function of multiple time variables, and to mathematicians a physical interpretation of dimension is just that, an interpretation. But as confusing as it might sound, it’s actually not so hard to understand the Fourier transform when it’s specialized to image analysis. The idea is that complex exponentials $ e^{\pm 2 \pi i x \cdot \xi}$ oscillate in the $ x$ variable for a fixed $ \xi$ (and since $ \mathscr{F}$ has $ \xi$ as its input, we do want to fix $ \xi$). The brief mathematical analysis goes like this: if we fix $ \xi$ then the complex exponential is periodic with magnitudinal peaks along parallel lines spaced out at a distance of $ 1/ \left \| \xi \right \|$ apart. In particular any image is a sum of a bunch of these “complex exponential with a fixed $ \xi$” images that look like stripes with varying widths and orientations (what you see here is just the real part of a particular complex exponential).

Any image can be made from a sum of a whole lot of images like the ones on top. They correspond to single points in the Fourier spectrum (and their symmetries), as on bottom.

What you see on top is an image, and on bottom its Fourier spectrum. That is, each brightly colored pixel corresponds to a point $ [x_1, x_2]$ with a large magnitude for that frequency component $ |\mathscr{F}f[x_1, x_2]|$.

It might be a bit surprising that every image can be constructed as a sum of stripey things, but so was it that any sound can be constructed as a sum of sines and cosines. It’s really just a statement about a basis of some vector space of functions. The long version of this story is laid out beautifully in pages 4 – 7 of these notes. The whole set of notes is wonderful, but this section is mathematically tidy and needs no background; the remainder of the notes outline the details about multidimensional Fourier series mentioned earlier, as well as a lot of other things. In higher dimensions the “parallel lines” idea is much the same, but with lines replaced by hyperplanes normal to the given vector.
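If you want to generate one of these striped images yourself, it only takes a few lines of numpy (a sketch of mine, with an arbitrary choice of frequency vector):

import numpy

xi = numpy.array([3.0, 5.0])                   # an arbitrary frequency vector
xs = numpy.linspace(0, 1, 256)
X1, X2 = numpy.meshgrid(xs, xs)
stripes = numpy.cos(2 * numpy.pi * (X1 * xi[0] + X2 * xi[1]))
# 'stripes' is the real part of e^{2 pi i x.xi} sampled on the unit square: it is
# constant along lines perpendicular to xi, with peaks spaced 1/||xi|| apart.
# Plotting it (say with matplotlib's imshow) gives the striped images described above.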

Discretizing the Transform

Recall that for a continuous function $ f$ of one variable, we spent a bit of time figuring out how to find a good discrete approximation of $ f$, how to find a good discrete approximation of the Fourier transform $ \mathscr{F}f$, and how to find a quick way to transition between the two. In brief: we approximated $ f$ by a vector of samples $ (f[0], f[1], \dots, f[N-1])$, reconstructed the original function (which was only correct at the sampled points), and computed the Fourier transform of that, calling it the discrete Fourier transform, or DFT. We got to this definition, using square brackets to denote list indexing (or vector indexing, whatever):

Definition: Let $ f = (f[0], \dots, f[N-1])$ be a vector in $ \mathbb{R}^N$. Then the discrete Fourier transform of $ f$ is defined by the vector $ (\mathscr{F}f[0], \dots, \mathscr{F}f[N-1])$, where

$ \displaystyle \mathscr{F}f[j] = \sum_{k=0}^{N-1} f[k]e^{-2 \pi i jk/N}$

Just as with the one-dimensional case, we can do the same analysis and arrive at a discrete approximation of an $ n$-dimensional function. Instead of a vector it would be an $ N \times N \times \dots \times N$ matrix, where there are $ n$ factors of $ N$ in the product, one for each variable. In two dimensions, this means the discrete approximation of a function is a matrix of samples taken at evenly-spaced intervals in both directions.

Sticking with two dimensions, the Fourier transform is then a linear operator taking matrices to matrices (which is called a tensor if you want to scare people). It has its own representation like the one above, where each term is a double sum. In terms of image analysis, we can imagine that each term in the sum requires us to look at every pixel of the original image.

Definition: Let $ f = (f[s,t])$ be a matrix in $ \mathbb{R}^{N \times M}$, where $ s$ ranges from $ 0, \dots, N-1$ and $ t$ from $ 0, \dots, M-1$. Then the discrete Fourier transform of $ f$ is defined by the matrix $ (\mathscr{F}f[x_1, x_2])$, where each entry is given by

$ \displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} f[s, t] e^{-2 \pi i (s x_1 / N + t x_2 / M)}$

In the one-dimensional case the inverse transform had a sign change in the exponent and an extra $ 1/N$ normalization factor. Similarly, in two dimensions the inverse transform has a normalization factor of $ 1/NM$ (1 over the total number of samples). Again we use a capital $ F$ to denote the transformed version of $ f$. The higher dimensional transforms are analogous: you get $ n$ sums, one for each component, and the normalization factor is the inverse of the total number of samples.

$ \displaystyle \mathscr{F}^{-1}F[x_1, x_2] = \frac{1}{NM} \sum_{s=0}^{N-1} \sum_{t=0}^{M-1} F[s,t] e^{2 \pi i (sx_1 / N + tx_2 / M)}$

Unfortunately, the world of the DFT disagrees a lot on the choice of normalization factor. It turns out that all that really matters is that the exponent is negated in the inverse, and that the product of the constant terms on both the transform and its inverse is $ 1/NM$. So some people will normalize both the Fourier transform and its inverse by $ 1/ \sqrt{NM}$. The reason for this is that it makes the transform and its inverse more similar-looking (it’s just that, cosmetic). The choice of normalization isn’t particularly important for us, but beware: non-canonical choices are out there, and they do affect formulas by adding multiplicative constants.
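Before speeding anything up, it is worth having a direct translation of the two-dimensional definition into Python to test against. This is a reference implementation of my own, and it is quadratic in the number of samples, so only use it on tiny inputs.

import cmath

def dft2d(matrix):
   N, M = len(matrix), len(matrix[0])
   transformed = [[0] * M for _ in range(N)]
   for x1 in range(N):
      for x2 in range(M):
         transformed[x1][x2] = sum(
            matrix[s][t] * cmath.exp(-2j * cmath.pi * (s * x1 / N + t * x2 / M))
            for s in range(N) for t in range(M))
   return transformed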

The Fast Fourier Transform, Revisited

Now one might expect that there is another clever algorithm to drastically reduce the runtime of the 2-dimensional DFT, akin to the fast Fourier transform algorithm (FFT). But actually there is almost no additional insight required to understand the “fast” higher dimensional Fourier transform algorithm, because all the work was done for us in the one dimensional case.

All that we do is realize that each of the inner summations is a 1-dimensional DFT. That is, if we write the inner-most sum as a function of two parameters

$ \displaystyle g(s, x_2) = \sum_{t=0}^{M-1} f(s,t) e^{-2 \pi i (tx_2 / M)}$

then the 2-dimensional FFT is simply

$ \displaystyle \mathscr{F}f[x_1, x_2] = \sum_{s=0}^{N-1} g(s, x_2) e^{-2 \pi i (sx_1/N)}$

But now notice that we can forget that $ g(s,x_2)$ was ever a separate, two-dimensional function. Indeed, for a fixed $ s$ the inner sum only depends on $ x_2$, so $ g(s, -)$ is precisely the formula for a 1-dimensional DFT of the $ s$-th row! And so if we want to compute the 2-dimensional DFT using the 1-dimensional FFT algorithm, we can compute the matrix of 1-dimensional DFT entries for all choices of $ s, x_2$ by fixing each value of $ s$ in turn and running FFT on the resulting row of values. If you followed the program from our last FFT post, then the only difficulty is in understanding how the data is shuffled around and which variables are fixed during the computation of the sub-DFT’s.

To remedy the confusion, we give an example. Say we have the following 3×3 matrix whose DFT we want to compute. Remember, these values are the sampled values of a 2-variable function.

$ \displaystyle \begin{pmatrix} f[0,0] & f[0,1] & f[0,2] \\ f[1,0] & f[1,1] & f[1,2] \\ f[2,0] & f[2,1] & f[2,2] \end{pmatrix}$

The first step in the algorithm is to fix a choice of row, $ s$, and compute the DFT of the resulting row. So let’s fix $ s = 0$, and then we have the resulting row

$ \displaystyle f_0 = (f[0,0], f[0,1], f[0,2])$

Its DFT is computed (intentionally using the same notation as the inner summation above) as

$ \displaystyle g[0,x_2] = (\mathscr{F}f_0)[x_2] = \sum_{t=0}^{M-1} f_0[t] e^{- 2 \pi i (t x_2 / M)}$

Note that $ f_0[t] = f[s,t]$ for our fixed choice of $ s=0$. And so if we do this for all $ N$ rows (all 3 rows, in this example), we’ll have performed $ N$ FFT’s of size $ M$ to get a matrix of values

$ \displaystyle \begin{pmatrix} g[0,0] & g[0,1] & g[0,2] \\ g[1,0] & g[1,1] & g[1,2] \\ g[2,0] & g[2,1] & g[2,2] \end{pmatrix}$

Now we want to finish computing the 2-dimensional DFT, and it’s easy: each column of this new matrix consists of the terms in the outermost sum above (since $ s$ is the iterating variable). So if we fix a value of $ x_2$, say $ x_2 = 1$, we get the resulting column

$ \displaystyle g_1 = (g[0, 1], g[1,1], g[2,1])$

and computing a DFT on this column gives

$ \displaystyle \mathscr{F}f[x_1, 1] = \sum_{s=0}^{N-1} g_1[s] e^{-2 \pi i sx_1 / N}$.

Expanding the definition of $ g$ as a DFT gets us back to the original formula for the 2-dimensional DFT, so we know we did it right. In the end we get a matrix of the computed DFT values for all $ x_1, x_2$.

Let’s analyze the runtime of this algorithm: in the first round of DFT’s we computed $ N$ DFT’s of size $ M$, requiring a total of $ O(N M \log M)$, since we know FFT takes time $ O(M \log M)$ for a list of length $ M$. In the second round we did it the other way around, computing $ M$ DFT’s of size $ N$ each, giving a total of

$ O(NM \log M + NM \log N) = O(NM (\log N + \log M)) = O(NM \log (NM))$

In other words, if the size of the image is $ n = NM$, then we are achieving an $ O(n \log n)$-time algorithm, which was precisely the speedup that the FFT algorithm gave us for one-dimension. We also know a lower bound on this problem: we can’t do better than $ NM$ since we have to look at every pixel at least once. So we know that we’re only a logarithmic factor away from a trivial lower bound. And indeed, all other known DFT algorithms have the same runtime. Without any assumptions on the input data (or any parallelization), nobody knows of a faster algorithm.

Now let’s turn to the code. If we use our FFT algorithm from last time, the pure Python one (read: very slow), then we can implement the 2D Fourier transform in just two lines of Python code. Full disclosure: we left out some numpy stuff in this code for readability. You can view the entire source file on this blog’s Github page.

def fft2d(matrix):
   # FFT each row, then FFT each column of the result (a column FFT is just
   # a row FFT of the transposed matrix)
   fftRows = [fft(row) for row in matrix]
   return transpose([fft(row) for row in transpose(fftRows)])

And we can test it on a simple matrix with one nonzero value in it:

A = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
for row in fft2d(A):
   print(', '.join(['%.3f + %.3fi' % (x.real, x.imag) for x in row]))

The output is (reformatted in LaTeX, obviously):

$ \displaystyle \begin{pmatrix} 1 & -i & -1 & i \\ -i & -1 & i & 1 \\ -1 & i & 1 & -i \\ i & 1 & -i & -1 \end{pmatrix}$

The reader can verify by hand that this is correct (there’s only one nonzero term in the double sum, so it just boils down to figuring out the complex exponential $ e^{-2 \pi i (x_1 + x_2)/4}$). We leave it as an additional exercise to the reader to implement the inverse transform (a hint follows below), as well as to generalize this algorithm to higher dimensional DFTs.
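As a hint for the inverse: one standard trick (a sketch, not the only approach) is to reuse the forward transform by conjugating the entries, transforming, conjugating again, and dividing by the total number of samples.

def ifft2d(matrix):
   # inverse 2D DFT via the conjugation trick:
   # inverse(X) = conjugate(forward(conjugate(X))) / (N*M)
   N, M = len(matrix), len(matrix[0])
   conjugated = [[x.conjugate() for x in row] for row in matrix]
   transformed = fft2d(conjugated)
   return [[x.conjugate() / (N * M) for x in row] for row in transformed]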

Some Experiments and Animations

As we did with the 1-dimensional FFT, we’re now going to switch to using an industry-strength FFT algorithm for the applications. We’ll be using the numpy library and its “fft2” function, along with scipy’s ndimage module for image manipulation. Getting all of this set up was a nightmare (thank goodness for people who guide users like me through this stuff, but even then the headache seemed unending!). As usual, all of the code and images used in the making of this post are available on this blog’s Github page.

And so we can start playing with a sample image, a still from one of my favorite television shows:

[image: sherlock]

The Fourier transform of this image (after we convert it to grayscale) can be computed in python:

import numpy
from scipy import ndimage, misc

def fourierSpectrumExample(filename):
   A = ndimage.imread(filename, flatten=True)
   unshiftedfft = numpy.fft.fft2(A)
   spectrum = numpy.log10(numpy.absolute(unshiftedfft) + numpy.ones(A.shape))
   misc.imsave("%s-spectrum-unshifted.png" % (filename.split('.')[0]), spectrum)

With the result:

[image: sherlock-spectrum-unshifted]

The Fourier spectrum of Sherlock and Watson (and London).

A few notes: we use the ndimage library to load the image and flatten the colors to grayscale. Then, after we compute the spectrum, we add one to every entry (to avoid taking the logarithm of zero) and take a logarithm. This is because the raw spectrum values are massive; plotting them without the logarithm makes the contrast so extreme that you can hardly see anything but a few bright pixels.

Something is odd, though, because the brightest regions are on the edges of the image, where we might expect the highest-frequency elements to be. Actually, it turns out that a raw DFT (as computed by numpy, anyhow) is “shifted.” That is, the indices are much like they were in our original FFT post, so that the “center” of the spectrum (the lowest frequency component) is actually in the corner of the image array.

The numpy folks have a special function designed to alleviate this called fftshift. Applying it before we plot the image gives the following spectrum:

[image: sherlock-spectrum]

Now that’s more like it. For more details on what’s going on with shifting and how to use the shifting functions, see this matlab thread. (As a side note, the “smudges” in this image are interesting. We wonder what property of the original image contributes to the smudges.)
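For concreteness, the shifted plot comes from a one-line tweak to the function above (a sketch reusing the same setup; the function name and output filename are just illustrative):

def fourierSpectrumShiftedExample(filename):
   A = ndimage.imread(filename, flatten=True)
   shiftedfft = numpy.fft.fftshift(numpy.fft.fft2(A))
   spectrum = numpy.log10(numpy.absolute(shiftedfft) + numpy.ones(A.shape))
   misc.imsave("%s-spectrum.png" % (filename.split('.')[0]), spectrum)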

Shifted or unshifted, this image represents the frequency spectrum of the image. In other words, we could take the inverse DFT of each pixel (and its symmetric partner) of this image separately, add them all together, and get back to our original image! We did just that using a different image (one of size 266 x 189, requiring a mere 25137 frequency components), to produce this video of the process:

Many thanks to James Hance for his relentlessly cheerful art (I have a reddish version of this particular masterpiece on my bedroom wall).
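For a rough sense of what each frame of the animation computes (this is only the gist: it ignores the symmetric-partner bookkeeping and all of the video-making details, and the function name is made up):

import numpy
from numpy.fft import fft2, ifft2

def partialReconstructions(image):
   # add one frequency component at a time, yielding the partial
   # reconstruction after each addition (one frame per component)
   spectrum = fft2(image)
   partial = numpy.zeros(image.shape, dtype=complex)
   for (i, j), value in numpy.ndenumerate(spectrum):
      component = numpy.zeros(spectrum.shape, dtype=complex)
      component[i, j] = value
      partial += ifft2(component)
      yield numpy.real(partial)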

For the interested reader, I followed this youtube video’s recommended workflow to make the time-lapsed movie, along with some additional steps to make the videos play side by side. It took quite a while to generate and process the images, and the frames take up a lot of space. So instead of storing all the frames, the interested reader may find the script used to generate the frames on this blog’s Github page (along with all of the rest of the code used in this blog post).

Digital Watermarking

Now we turn to the main application of Fourier transforms in this post, the task of adding an invisible digital watermark to an image. Just in case the reader lives in a cave, a watermark is a security device used to protect the ownership or authenticity of a particular good. Usually they’re used on money to prevent counterfeits, but they’re often applied to high-resolution images on the web to protect copyrights. But more than just protecting existing copyrights, watermarks as they’re used today are ugly, and mostly serve to prevent people from taking the image (paid for or not) in the first place. Here’s an example from a big proponent of ugly watermarks, Shutterstock.com.

[image: stock-photo]

Now if you were in the business of copyright litigation, you’d make a lot of money by suing people who took your clients’ images without permission. So rather than prevent people from stealing in the first place, you could put an invisible watermark into all of your images and then crawl the web looking for stolen images with your watermark. It would be easy enough to automate (Google already did most of the work for you, if you just want to use Google’s search by image feature).

Now I’m more on the side of Fair Use For All, so I wouldn’t hope for a company to actually implement this and make using the internet that much scarier of a place. But the idea makes for an interesting thought experiment and blog post. The idea is simply to modify the spectrum of an image by adding in small, artificial frequency components. That is, the watermarked image will look identical to the original image to a human, but the Fourier spectrum will contain suspicious entries that we can extract if we know where to look.

Implementing the watermarking feature is quite easy, so let’s do that first. Let’s work again with James Hance’s fine artwork.

[image: hance-up-sw-bw]

Let’s call our image’s pixel matrix $ A$ and say we’re working with grayscale images for simplicity (for color, we just do the same thing to all three color channels). Then we can define a watermark matrix $ W$ by the following procedure:

  1. Pick a radius $ r$, a length $ L$, a watermark strength $ \alpha$, and a secret key $ k$.
  2. Using $ k$ as a seed to a random number generator, define a random binary vector $ v$ of length $ L$.
  3. Pick a subset $ S$ of points on the circle of radius $ r$ centered at the image’s center, with each candidate point kept or discarded according to the entries of $ v$.
  4. Let $ W$ be the matrix of all zeros (of the same dimensions as $ A$) with 1’s in the entries of $ S$.
  5. Compute the watermarked image as $ \mathscr{F}^{-1}(\mathscr{F}(A) + \alpha W)$. That is, compute the DFT of $ A$, add $ \alpha W$ to it, and then compute the inverse Fourier transform of the result.

The code for this is simple enough. To create a random vector:

import random
def randomVector(seed, length):
   random.seed(seed)
   return [random.choice([0,1]) for _ in range(length)]

To make the watermark (and flesh out all of the technical details of how it’s done):

import math
import numpy

def makeWatermark(imageShape, radius, secretKey, vectorLength=50):
   watermark = numpy.zeros(imageShape)
   center = (int(imageShape[0] / 2) + 1, int(imageShape[1] / 2) + 1)

   vector = randomVector(secretKey, vectorLength)

   # place the bits of the random vector at equally spaced points around a
   # circle of the given radius, centered in the spectrum
   x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
   y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
   indices = [(x(t), y(t)) for t in range(vectorLength)]

   for i,location in enumerate(indices):
      watermark[location] = vector[i]

   return watermark

We use the usual parameterization of the circle as $ t \mapsto (\cos(2 \pi t / n), \sin(2 \pi t / n))$ scaled to the appropriate radius. Here’s what the watermark looks like as a spectrum:

[image: watermark-spectrum]

It’s hard to see the individual pixels, so click it to enlarge.

And then applying a given watermark to an image is super simple.

from numpy.fft import fft2, ifft2, fftshift, ifftshift

def applyWatermark(imageMatrix, watermarkMatrix, alpha):
   shiftedDFT = fftshift(fft2(imageMatrix))
   watermarkedDFT = shiftedDFT + alpha * watermarkMatrix
   watermarkedImage = ifft2(ifftshift(watermarkedDFT))

   # note: the result is complex (the watermark breaks the conjugate symmetry
   # of the spectrum); in practice we keep only the real part when saving
   return watermarkedImage
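To tie the pieces together, a hypothetical end-to-end run might look like the following (the filename, radius, key, and alpha are all made-up values, and we keep only the real part before saving):

image = ndimage.imread("hance-up-sw-bw.png", flatten=True)
watermark = makeWatermark(image.shape, radius=180, secretKey=5, vectorLength=50)
watermarked = numpy.real(applyWatermark(image, watermark, alpha=5000))
misc.imsave("hance-watermarked.png", watermarked)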

And that’s all there is to it! One might wonder how the choice of $ \alpha$ affects the intensity of the watermark, and indeed here we show a few example values of this method applied to Hance’s piece:

Click to enlarge. It appears that it's not until alpha becomes egregiously large that we notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background).

Click to enlarge. The effects are most visible in the rightmost image where alpha = 1,000,000

It appears that it’s not until $ \alpha$ becomes egregiously large (over 10,000) that we visibly notice the effects. This could be in part due to the fact that this is an image of a canvas (which has lots of small textures in the background). But it’s good to keep in mind the range of acceptable values when designing a decoding mechanism.

Indeed, a decoding mechanism is conceptually much messier; it’s the art to the encoding mechanism’s science. This paper details one possible way to do it, which is essentially to scale everything up or down to 512×512 pixels and try circles of every possible radius until you find one (or don’t) which is statistically similar to your random vector. And note that since we have the secret key we can generate the exact same random vector. So what the author of that paper suggests is to extract each circle of pixels from the Fourier spectrum, treating it as a single vector with first entry at angle 0. Then you do some statistical magic (compute cross-correlation or some other similarity measure) between the extracted pixels and your secret-key-generated random vector. If they’re sufficiently similar, then you’ve found your watermark, and otherwise there’s no watermark present.
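To give a flavor of what that looks like in code, here is a rough sketch only: it reuses randomVector and the circle parameterization from makeWatermark, assumes the radius is already known rather than searched over, and uses an arbitrary correlation threshold.

def detectWatermark(imageMatrix, secretKey, radius, vectorLength=50, threshold=0.5):
   vector = numpy.array(randomVector(secretKey, vectorLength))
   spectrum = numpy.absolute(fftshift(fft2(imageMatrix)))

   # read off the spectrum values on the same circle the watermark was written to
   center = (int(imageMatrix.shape[0] / 2) + 1, int(imageMatrix.shape[1] / 2) + 1)
   x = lambda t: center[0] + int(radius * math.cos(t * 2 * math.pi / vectorLength))
   y = lambda t: center[1] + int(radius * math.sin(t * 2 * math.pi / vectorLength))
   extracted = numpy.array([spectrum[x(t), y(t)] for t in range(vectorLength)])

   # normalized correlation between the extracted ring and the secret vector
   similarity = numpy.corrcoef(extracted, vector)[0, 1]
   return similarity > threshold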

Doing this only requires a few extra lines beyond the code we’ve already presented in this article (numpy does cross-correlation for you), so we leave it as an exercise to the reader: flesh out a sketch like the one above into a program that reliably determines whether an image contains our watermark, and test the algorithm on various $ \alpha$ and with modifications of the image like rotation, scaling, cropping, and jpeg compression. Part of the benefit of Fourier-based techniques is the resilience of the spectrum to mild applications of these transformations.

Next time we’ll use the Fourier transform to do other cool things to images, like designing filters and combining images in interesting ways.

Until then!