# Tensorphobia and the Outer Product

## Variations on a theme

Back in 2014 I wrote a post called How to Conquer Tensorphobia that should end up on Math $\cap$ Programming’s “greatest hits” album. One aspect of tensors I neglected to discuss was the connection between the modern views of tensors and the practical views of linear algebra. I feel I need to write this because every year or two I forget why it makes sense.

The basic question is:

What the hell is going on with the outer product of vectors?

The simple answer, the one that has never satisfied me, is that the outer product of $v,w$ is the matrix $vw^T$ whose $i,j$ entry is the product $v_i w_j$. This doesn’t satisfy me because it’s an explanation by fiat. It lacks motivation and you’re supposed to trust (or verify) things magically work out. To me this definition is like the definition of matrix multiplication: having it dictated to you before you understand why it makes sense is a cop out. Math isn’t magic, it needs to make perfect sense.

The answer I like, and the one I have to re-derive every few years because I never wrote it down, is a little bit longer.

To borrow a programming term, the basic confusion is a type error. We start with two vectors, $v,w$ in some vector space $V$ (let’s say everything is finite dimensional), and we magically turn them into a matrix. Let me reiterate for dramatic effect: we start with vectors, which I have always thought of as objects, things you can stretch, rotate, or give as a gift to your cousin Mauricio. While a matrix is a mapping, a thing that takes vectors as input and spits out new vectors. Sure, you can play with the mappings, feed them kibble and wait for them to poop or whatever. And sure, sometimes vectors are themselves maps, like in the vector space of real-valued functions (where the word “matrix” is a stretch, since it’s infinite dimensional).

But to take two vectors and *poof* get a mapping of all vectors, it’s a big jump. And part of the discomfort is that it feels random. To be happy we have to understand that the construction is natural, or even better canonical, meaning this should be the only way to turn two vectors into a linear map. Then the definition would make sense.

So let’s see how we can do that. Just to be clear, everything we do in this post will be for finite-dimensional vector spaces over $\mathbb{R}$, but we’ll highlight the caveats when they come up.

## Dual vector spaces

The first step is understanding how to associate a vector with a linear map in a “natural” or “canonical” way. There is one obvious candidate: if you give me a vector $v$, I can make a linear map by taking the dot product of $v$ with the input. So I can associate

$\displaystyle v \mapsto \langle v,- \rangle$

The dash is a placeholder for the input. Another way to say it is to define $\varphi_v(w) = \langle v,w \rangle$ and say the association takes $v \mapsto \varphi_v$. So this “association,” taking $v$ to the inner product, is itself  a mapping from $V$ to “maps from $V$ to $\mathbb{R}$.” Note that $\varphi_v(w)$ is linear in $v$ and $w$ because that’s part of the definition of an inner product.

To avoid saying “maps from $V$ to $\mathbb{R}$” all the time, we’ll introduce some notation.

Definition: Let $V$ be a vector space over a field $k$. The set of $k$-linear maps from $V \to k$ is called $\textup{Hom}(V,k)$. More generally, the set of $k$-linear maps from $V$ to another $k$-vector space $W$ is called $\textup{Hom}(V,W)$.

“Hom” stands for “homomorphism,” and in general it just means “maps with the structure I care about.” For this post $k = \mathbb{R}$, but most of what we say here will be true for any field. If you go deeply into this topic, it matters whether $k$ is algebraically closed, or has finite characteristic, but for simplicity we’ll ignore all of that. We’ll also ignore the fact that these maps are called linear functionals and this is where the name “functional analysis” comes from. All we really want to do is understand the definition of the outer product.

Another bit of notation for brevity:

Definition: Let $V$ be a $\mathbb{R}$-vector space. The dual vector space for $V$, denoted $V^*$, is $\textup{Hom}(V, \mathbb{R})$.

So the “vector-to-inner-product” association we described above is a map $V \to V^*$. It takes in $v \in V$ and spits out $\varphi_v \in V^*$.

Now here’s where things start to get canonical (interesting). First, $V^*$ is itself a vector space. This is an easy exercise, and the details are not too important for us, but I’ll say the key: if you want to add two functions, you just add their (real number) outputs. In fact we can say more:

Theorem: $V$ and $V^*$ are isomorphic as vector spaces, and the map $v \mapsto \varphi_v$ is the canonical isomorphism.

Confessions of a mathematician: we’re sweeping some complexity under the rug. When we upgraded our vector space to an inner product space, we fixed a specific (but arbitrary) inner product on $V$. For finite dimensional vector spaces it makes no difference, because every finite-dimensional $\mathbb{R}$-inner product space is isomorphic to $\mathbb{R}^n$ with the usual inner product. But the theorem is devastatingly false for infinite-dimensional vector spaces. There are two reasons: (1) there are many (non-canonical) choices of inner products and (2) the mapping for any given inner product need not span $V^*$. Luckily we’re in finite dimensions so we can ignore all that. [Edit: see Emilio’s comments for a more detailed discussion of what’s being swept under the rug, and how we’re ignoring the categorical perspective when we say “natural” and “canonical.”]

Before we make sense of the isomorphism let’s talk more about $V^*$. First off, it’s not even entirely obvious that $V^*$ is finite-dimensional. On one hand, if $v_1, \dots, v_n$ is a basis of $V$ then we can quickly prove that $\varphi_{v_1}, \dots ,\varphi_{v_n}$ are linearly independent in $V^*$. Indeed, if they weren’t then there’d be some linear combination $a_1 \varphi_{v_1} + \dots + a_n \varphi_{v_n}$ that is the zero function, meaning that for every vector $w$, the following is zero

$\displaystyle a_1 \langle v_1, w \rangle + \dots + a_n \langle v_n, w \rangle = 0.$

But since the inner product is linear in both arguments we get that $\langle a_1 v_1 + \dots + a_n v_n , w \rangle = 0$ for every $w$. And this can only happen when $a_1v_1 + \dots + a_nv_n$ is the zero vector (prove this).

One consequence is that the linear map $v \mapsto \varphi_v$ is injective. So we can think of $V$ as “sitting inside” $V^*$. Now here’s a very slick way to show that the $\varphi_{v_i}$ span all of $V^*$. First we can assume our basis $v_1, \dots, v_n$ is actually an orthonormal basis with respect to our inner product (this is without loss of generality). Then we write any linear map $f \in V^*$ as

$f = f(v_1)\varphi_{v_1} + \dots + f(v_n) \varphi_{v_n}$

To show these two are actually equal, it’s enough to show they agree on a basis for $V$. That is, if you plug in $v_1$ to the function on the left- and right-hand side of the above, you’ll get the same thing. The orthonormality of the basis makes it work, since all the irrelevant inner products are zero.

In case you missed it, that completes the proof that $V$ and $V^*$ are isomorphic. Now when I say that the isomorphism is “canonical,” I mean that if you’re willing to change the basis of $V$ and $V^*$, then $v \mapsto \varphi_v$ is the square identity matrix, i.e. the only isomorphism between any two finite vector spaces (up to a change of basis).

## Tying in tensors

At this point we have a connection between single vectors $v$ and linear maps $V^*$ whose codomain has dimension 1. If we want to understand the outer product, we need a connection between pairs of vectors and matrices, i.e. $\textup{Hom}(V,V)$. In other words, we’d like to find a canonical isomorphism between $V \times V$ and $\textup{Hom}(V,V)$. But already it’s not possible because the spaces have different dimensions. If $\textup{dim}(V) = n$ then the former has dimension $2n$ and the latter has dimension $n^2$. So any “natural” relation between these spaces has to be a way to embed $V \times V \subset \textup{Hom}(V,V)$ as a subspace via some injective map.

There are two gaping problems with this approach. First, the outer product is not linear as a map from $V \times V \to \textup{Hom}(V,V)$. To see this, take any $v,w \in V$, pick any scalar $\lambda \in \mathbb{R}$. Scaling the pair $(v,w)$ means scaling both components to $(\lambda v, \lambda w)$, and so the outer product is the matrix $(\lambda v)(\lambda w^T) = \lambda^2 vw^T$.

The second problem is that the only way to make $V \times V$ a subspace of $\textup{Hom}(V,V)$ (up to a change of basis) is to map $v,w$ to the first two rows of a matrix with zeros elsewhere. This is canonical but it doesn’t have the properties that the outer product promises us. Indeed, the outer product let’s us uniquely decompose a matrix as a “sum of rank 1 matrices,” but we don’t get a unique decomposition of a matrix as a sum of these two-row things. We also don’t even get a well-defined rank by decomposing into a sum of two-row matrices (you can get cancellation by staggering the sum). This injection is decisively useless.

It would seem like we’re stuck, until we think back to our association between $V$ and $V^*$. If we take one of our two vectors, say $v$, and pair it with $w$, we can ask how $(\varphi_v, w)$ could be turned into a linear map in $\textup{Hom}(V,V)$. A few moments of guessing and one easily discovers the map

$\displaystyle x \mapsto \varphi_v(x) w = \langle v,x \rangle w$

In words, we’re scaling $w$ by the inner product of $x$ and $v$. In geometric terms, we project onto $v$ and scale $w$ by the signed length of that projection. Let’s call this map $\beta_{v,w}(x)$, so that the association maps $(v,w) \mapsto \beta_{v,w}$. The thought process of “easily discovering” this is to think, “What can you do with a function $\varphi_v$ and an input $x$? Plug it in. Then what can you do with the resulting number and a vector $w$? Scale $w$.”

If you look closely you’ll see we’ve just defined the outer product. This is because the outer product works by saying $uv^T$ is a matrix, which acts on a vector $x$ by doing $(uv^T)x = u(v^T x)$. But the important thing is that, because $V$ and $V^*$ are canonically isomorphic, this is a mapping

$\displaystyle V \times V = V^* \times V \to \textup{Hom}(V,V)$

Now again, this mapping is not linear. In fact, it’s bilinear, and if there’s one thing we know about bilinear maps, it’s that tensors are their gatekeepers. If you recall our previous post on tensorphobia, this means that this bilinear map “factors through” the tensor product in a canonical way. So the true heart of this association $(v,w) \mapsto \beta_{v,w}$ is a map $B: V \otimes V \to \textup{Hom}(V,V)$ defined by

$\displaystyle B(v \otimes w) = \beta_{v,w}$

And now the punchline,

Theorem: $B$ is an isomorphism of vector spaces.

Proof. If $v_1, \dots, v_n$ is a basis for $V$ then it’s enough to show that $\beta_{v_i, v_j}$ forms a basis for $\textup{Hom}(V,V)$. Since we already know $\dim(\textup{Hom}(V,V)) = n^2$ and there are $n^2$ of the $\beta_{v_i, v_j}$, all we need to do is show that the $\beta$‘s are linearly independent. For brevity let me remove the $v$‘s and call $\beta_{i,j} =\beta_{v_i, v_j}$.

Suppose they are not linearly independent. Then there is some choice of scalars $a_{i,j}$ so that the linear combination below is the identically zero function

$\displaystyle \sum_{i,j=1}^n a_{i,j}\beta_{i,j} = 0$

In other words, if I plug in any $v_i$ from my (orthonormal) basis, the result is zero. So let’s plug in $v_1$.

\displaystyle \begin{aligned} 0 &= \sum_{i,j=1}^n a_{i,j} \beta_{i,j}(v_1) \\ &= \sum_{i,j=1}^n a_{i,j} \langle v_i, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} \langle v_1, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} v_j \end{aligned}

The orthonormality makes all of the $\langle v_i ,v_1 \rangle = 0$ when $i \neq 1$, so we get a linear combination of the $v_j$ being zero. Since the $v_i$ form a basis, it must be that all the $a_{1,j} = 0$. The same thing happens when you plug in $v_2$ or any other $v_k$, and so all the $a_{i,j}$ are zero, proving linear independence.

$\square$

This theorem immediately implies some deep facts, such as that every matrix can be uniquely decomposed as a sum of the $\beta_{i,j}$‘s. Moreover, facts like the $\beta_{i,j}$‘s being rank 1 are immediate: by definition the maps scale a single vector by some number. So of course the image will be one-dimensional. Finding a useful basis $\{ v_i \}$ with which to decompose a matrix is where things get truly fascinating, and we’ll see that next time when we study the singular value decomposition.

In the mean time, this understanding generalizes nicely (via induction/recursion) to higher dimensional tensors. And you don’t need to talk about slices or sub-tensors or lose your sanity over $n$-tuples of indices.

Lastly, all of this duality stuff provides a “coordinate-free” way to think about the transpose of a linear map. We can think of the “transpose” operation as a linear map $-^T : \textup{Hom}(V,W) \to \textup{Hom}(W^*,V^*)$ which (even in infinite dimensions) has the following definition. If $f \in \textup{Hom}(V,W)$ then $f^T$ is a linear map taking $h \in W^*$ to $f^T(h) \in V^*$. The latter is a function in $V^*$, so we need to say what it does on inputs $v \in V$. The only definition that doesn’t introduce any type errors is $(f^T(h))(v) = h(f(v))$. A more compact way to say this is that $f^T(h) = h \circ f$.

Until next time!

# Multiple Qubits and the Quantum Circuit

Last time we left off with the tantalizing question: how do you do a quantum “AND” operation on two qubits? In this post we’ll see why the tensor product is the natural mathematical way to represent the joint state of multiple qubits. Then we’ll define some basic quantum gates, and present the definition of a quantum circuit.

## Working with Multiple Qubits

In a classical system, if you have two bits with values $b_1, b_2$, then the “joint state” of the two bits is given by the concatenated string $b_1b_2$. But if we have two qubits $v, w$, which are vectors in $\mathbb{C}^2$, how do we represent their joint state?

There are seemingly infinitely many things we could try, but let’s entertain the simplest idea for the sake of exercising our linear algebra intuition. The simplest idea is to just “concatenate” the vectors as one does in linear algebra: represent the joint system as $(v, w) \in \mathbb{C}^2 \oplus \mathbb{C}^2$. Recall that the direct sum of two vector spaces is just what you’d want out of “concatenation” of vectors. It treats the two components as completely independent of each other, and there’s an easy way to take any vector in the sum and decompose it into two vectors in the pieces.

Why does this fail to meet our requirements of qubits? Here’s one reason: $(v, w)$ is not a unit vector when $v$ and $w$ are separately unit vectors. Indeed, $\left \| (v,w) \right \|^2 = \left \| v \right \|^2 + \left \| w \right \|^2 = 2$. We could normalize everything, and that would work for a while, but we would still run into problems. A better reason is that direct sums screw up measurement. In particular, if you have two qubits (and they’re independent, in a sense we’ll make clear later), you should be able to measure one without affecting the other. But if we use the direct sum method for combining qubits, then measuring one qubit would collapse the other! There are times when we want this to happen, but we don’t always want it to happen. Alas, there should be better reasons out there (besides, “physics says so”) but I haven’t come across them yet.

So the nice mathematical alternative is to make the joint state of two qubits $v,w$ the tensor product $v \otimes w$. For a review of the basic properties of tensors and multilinear maps, see our post on the subject. Suffice it for now to remind the reader that the basis of the tensor space $U \otimes V$ consists of all the tensors of the basis elements of the pieces $U$ and $V$: $u_i \otimes v_j$. As such, the dimension of $U \otimes V$ is the product of the dimensions $\text{dim}(U) \text{dim}(V)$.

As a consequence of this and the fact that all $\mathbb{C}$-vector spaces of the same dimension are the same (isomorphic), the state space of a set of $n$ qubits can be identified with $\mathbb{C}^{2^n}$. This is one way to see why quantum computing has the potential to be strictly more powerful than classical computing: $n$ qubits provide a state space with $2^n$ coefficients, each of which is a complex number. With classical probabilistic computing we only get $n$ “coefficients.” This isn’t a proof that quantum computing is more powerful, but a wink and a nudge that it could be.

While most of the time we’ll just write our states in terms of tensors (using the $\otimes$ symbol), we could write out the vector representation of $v \otimes w$ in terms of the vectors $v = (v_1, v_2), w=(w_1, w_2)$. It’s just $(v_1w_1, v_1w_2, v_2w_1, v_2w_2)$, with the obvious generalization to vectors of any dimension. This already fixes our earlier problem with norms: the norm of a tensor of two vectors is the product of the two norms. So tensors of unit vectors are unit vectors. Moreover, if you measure the first qubit, that just sets the $v_1, v_2$ above to zero or one, leaving a joint state that is still a valid

Likewise, given two linear maps $A, B$, we can describe the map $A \otimes B$ on the tensor space both in terms of pure tensors ($(A \otimes B)(v \otimes w) = Av \otimes Bw$) and in terms of a matrix. In the same vein as the representation for vectors, the matrix corresponding to $A \otimes B$ is

$\displaystyle \begin{pmatrix} a_{1,1}B & a_{1,2}B & \dots & a_{1,n}B \\ a_{2,1}B & a_{2,2}B & \dots & a_{2,n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1}B & a_{n,2}B & \dots & a_{n,n}B \end{pmatrix}$

This is called the Kronecker product.

One of the strange things about tensor products, which very visibly manifests itself in “strange quantum behavior,” is that not every vector in a tensor space can be represented as a single tensor product of some vectors. Let’s work with an example: $\mathbb{C}^2 \otimes \mathbb{C}^2$, and denote by $e_0, e_1$ the computational basis vectors (the same letters are used for each copy of $\mathbb{C}^2$). Sometimes you’ll get a vector like

$\displaystyle v = \frac{1}{\sqrt{2}} e_0 \otimes e_0 + \frac{1}{\sqrt{2}} e_1 \otimes e_0$

And if you’re lucky you’ll notice that this can be factored and written as $\frac{1}{\sqrt{2}}(e_0 + e_1) \otimes e_0$. Other times, though, you’ll get a vector like

$\displaystyle \frac{1}{\sqrt{2}}(e_0 \otimes e_0 + e_1 \otimes e_1)$

And it’s a deep fact that this cannot be factored into a tensor product of two vectors (prove it as an exercise). If a vector $v$ in a tensor space can be written as a single tensor product of vectors, we call $v$ a pure tensor. Otherwise, using some physics lingo, we call the state represented by $v$ entangled. So if you did the exercise you proved that not all tensors are pure tensors, or equivalently that there exist entangled quantum states. The latter sounds so much more impressive. We’ll see in a future post why these entangled states are so important in quantum computing.

Now we need to explain how to extend gates and qubit measurements to state spaces with multiple qubits. The first is easy: just as we often restrict our classical gates to a few bits (like the AND of two bits), we restrict multi-qubit quantum gates to operate on at most three qubits.

Definition: A quantum gate $G$ is a unitary map $\mathbb{C}^{2^n} \to \mathbb{C}^{2^n}$ where $n$ is at most 3, (recall, $(\mathbb{C}^2)^{\otimes 3} = \mathbb{C}^{2^3}$ is the state space for 3 qubits).

Now let’s see how to implement AND and OR for two qubits. You might be wondering why we need three qubits in the definition above, and, perhaps surprisingly, we’ll see that AND and OR require us to work with three qubits.

Because how would one compute an AND of two qubits? Taking a naive approach from how we did the quantum NOT, we would label $e_0$ as “false” and $e_1$ as “true,” and we’d want to map $e_1 \otimes e_1 \mapsto e_1$ and all other possibilities to $e_0$. The main problem is that this is not an invertible function! Remember, all quantum operations are unitary matrices and all unitary matrices have inverses, so we have to model AND and OR as an invertible operation. We also have a “type error,” since the output is not even in the same vector space as the input, but any way to fix that would still run into the invertibility problem.

The way to deal with this is to add an extra “scratch work” qubit that is used for nothing else except to make the operation invertible. So now say we have three qubits $a, b, c$, and we want to compute $a$ AND $b$ in the sensible way described above. What we do is map

$\displaystyle a \otimes b \otimes c \mapsto a \otimes b \otimes (c \oplus (a \wedge b))$

Here $a \wedge b$ is the usual AND (where we interpret, e.g., $e_1 \wedge e_0 = e_0$), and $\oplus$ is the exclusive or operation on bits. It’s clear that this mapping makes sense for “bits” (the true/false interpretation of basis vectors) and so we can extend it to a linear map by writing down the matrix.

This gate is often called the Toffoli gate by physicists, but we’ll just call it the (quantum) AND gate. Note that the column $ijk$ represents the input $e_i \otimes e_j \otimes e_k$, and the 1 in that column denotes the row whose label is the output. In particular, if we want to do an AND then we’ll ensure the “scratch work” qubit is $e_0$, so we can ignore half the columns above where the third qubit is 1. The reader should write down the analogous construction for a quantum OR.

From now on, when we’re describing a basis state like $e_1 \otimes e_0 \otimes e_1$, we’ll denote it as $e_{101}$, and more generally when $i$ is a nonnegative integer or a binary string we’ll denote the basis state as $e_i$. We’re taking advantage of the correspondence between the $2^n$ binary strings and the $2^n$ basis states, and it compactifies notation.

Once we define a quantum circuit, it will be easy to show that using quantum AND’s, OR’s and NOT’s, we can achieve any computation that a classical circuit can.

We have one more issue we’d like to bring up before we define quantum circuits. We’re being a bit too slick when we say we’re working with “at most three qubits.” If we have ten qubits, potentially all entangled up in a weird way, how can we apply a mapping to only some of those qubits? Indeed, we only defined AND for $\mathbb{C}^8$, so how can we extend that to an AND of three qubits sitting inside any $\mathbb{C}^{2^n}$ we please? The answer is to apply the Kronecker product with the identity matrix appropriately. Let’s do a simple example of this to make everything stick.

Say I want to apply the quantum NOT gate to a qubit $v$, and I have four other qubits $w_1, w_2, w_3, w_4$ so that they’re all in the joint state $x = v \otimes w_1 \otimes w_2 \otimes w_3 \otimes w_4$. I form the NOT gate, which I’ll call $A$, and then I apply the gate $A \otimes I_{2^4}$ to $x$ (since there are 4 of the $w_i$). This will compute the tensor $Av \otimes I_2 w_1 \otimes I_2 w_2 \otimes I_2 w_3 \otimes I_2 w_4$, as desired.

In particular, you can represent a gate that depends on only 3 qubits by writing down the 3×3 matrix and the three indices it operates on. Note that this requires only 12 (possibly complex) numbers to write down, and so it takes “constant space” to represent a single gate.

## Quantum Circuits

Here we are at the definition of a quantum circuit.

Definition: quantum circuit is a list $G_1, \dots, G_T$ of $2^m \times 2^m$ unitary matrices, such that each $G_i$ depends on at most 3 qubits.

We’ll write down what it means to “compute” something with a quantum circuit, but for now we can imagine drawing it like a usual circuit. We write the input state as some unit vector $x \in C^{2^n}$ (which may or may not be a pure tensor), each qubit making up the vector is associated to a “wire,” and at each step we pick three of the wires, send them to the next quantum gate $G_i$, and use the three output wires for further computations. The final output is the matrix product applied to the input $G_T \dots G_1x$. We imagine that each gate takes only one step to compute (recall, in our first post one “step” was a photon flying through a special material, so it’s not like we have to multiply these matrices by hand).

So now we have to say how a quantum circuit could solve a problem. At all levels of mathematical maturity we should have some idea how a regular circuit solves a problem: there is some distinguished output wire or set of wires containing the answer. For a quantum circuit it’s basically the same, except that at the end of the circuit we get a single quantum state (a tensor in this big vector space), and we just measure that state. Like the case of a single qubit, if the vector has coordinates $x = (x_1, \dots, x_{2^n})$, they must satisfy $\sum_i |x_i|^2 = 1$, and the probability of the measurement producing index $j$ is $|x_j|^2$. The result of that measurement is an integer (some classical bits) that represent our answer. As a side effect, the vector $x$ is mutated into the basis state $e_j$. As we’ve said we may need to repeat a quantum computation over and over to get a good answer with high probability, so we can imagine that a quantum circuit is used as some subroutine in a larger (otherwise classical) algorithm that allows for pre- and post-processing on the quantum part.

The final caveat is that we allow one to include as many scratchwork qubits as one needs in their circuit. This makes it possible already to simulate any classical circuit using a quantum circuit. Let’s prove it as a theorem.

Theorem: Given a classical circuit $C$ with a single output bit, there is a quantum circuit $D$ that computes the same function.

Proof. Let $x$ be a binary string input to $C$, and suppose that $C$ has $s$ gates $g_1, \dots, g_s$, each being either AND, OR, or NOT, and with $g_s$ being the output gate. To construct $D$, we can replace every $g_i$ with their quantum counterparts $G_i$. Recall that this takes $e_{b_1b_20} \mapsto e_{b_1b_2(g_i(b_1, b_2))}$. And so we need to add a single scratchwork qubit for each one (really we only need it for the ANDs and ORs, but who cares). This means that our start state is $e_{x} \otimes e_{0^s} = e_{x0^s}$. Really, we need one of these gates $G_i$ for each wire going out of the classical gate $g_i$, but with some extra tricks one can do it with a single quantum gate that uses multiple scratchwork qubits. The crucial thing to note is that the state vector is always a basis vector!

If we call $z$ the contents of all the scratchwork after the quantum circuit described above runs and $z_0$ the initial state of the scratchwork, then what we did was extend the function $x \mapsto C(x)$ to a function $e_{xz_0} \mapsto e_{xz}$. In particular, one of the bits in the $z$ part is the output of the last gate of $C$, and everything is 0-1 valued. So we can measure the state vector, get the string $xz$ and inspect the bit of $z$ which corresponds to the output wire of the final gate of the original circuit $C$. This is your answer.

$\square$

It should be clear that the single output bit extends to the general case easily. We can split a circuit with lots of output bits into a bunch of circuits with single output bits in the obvious way and combine the quantum versions together.

Next time we’ll finally look at our first quantum algorithms. And along the way we’ll see some more significant quantum operations that make use of the properties that make the quantum world interesting. Until then!

# How to Conquer Tensorphobia

A professor at Stanford once said,

If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the $\otimes$ symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products. It’s often for good reason. People who really understand tensors feel obligated to explain it using abstract language (specifically, universal properties). And the people who explain it in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

## Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. Tensors are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a tensor product. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word product is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual direct product of vector spaces. In fact they are related (in very precise sense), but they are far from similar. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces $V, W$ of dimensions $n, m$. Recall that the direct product  $V \times W$ is the vector space of pairs $(v,w)$ where $v$ comes from $V$ and $w$ from $W$. Recall that addition in this vector space is defined componentwise ($(v_1,w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2$)) and scalar multiplication scales both components $\lambda (v,w) = (\lambda v, \lambda w)$.

To get the tensor product space $V \otimes W$, we make the following modifications. First, we redefine what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is declared to be the same as scalar multiplication of any component you want. In symbols,

$\displaystyle \lambda (v, w) = (\lambda v, w) = (v, \lambda w)$

for all choices of scalars $\lambda$ and vectors $v, w$. Second, we change the addition operation so that it only works if one of the two components are the same. In symbols, we declare that

$(v, w) + (v', w) = (v + v', w)$

only works because $w$ is the same in both pieces, and with the same rule applying if we switch the positions of $v,w$ above. All other additions are simply declared to be new vectors. I.e. $(x,y) + (z,w)$ is simply itself. It’s a valid addition — we need to be able to add stuff to be a vector space — but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that $y=w$ or $x=z$. To say it still one more time, a general element of the tensor $V \otimes W$ is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we rename the pair $(v,w)$ to $v \otimes w$, to distinguish it from the old vector space $V \times W$ that we’ve totally butchered and reanimated, and we call the tensor product space as a whole $V \otimes W$. Those familiar with this kind of abstract algebra will recognize quotient spaces at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of $\mathbb{R}$. This means that our space $\mathbb{R} \otimes \mathbb{R}$ is comprised of vectors like $3 \otimes 5$, and moreover that the following operations are completely legitimate.

$3 \otimes 5 + 1 \otimes (-5) = 3 \otimes 5 + (-1) \otimes 5 = 2 \otimes 5$

$6 \otimes 1 + 3\pi \otimes \pi = 3 \otimes 2 + 3 \otimes \pi^2 = 3 \otimes (2 + \pi^2)$

Cool. This seemingly innocuous change clearly has huge implications on the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just declaring that all the axioms of a vector space hold when we want them to. So if you were wondering whether

$\lambda (a \otimes b + c \otimes d) = \lambda(a \otimes b) + \lambda(c \otimes d)$

The answer is yes, by force of will.

So just to recall, the axioms of a tensor space $V \otimes W$ are

1. The “basic” vectors are $v \otimes w$ for $v \in V, w \in W$, and they’re used to build up all other vectors.
2. Addition is symbolic, unless one of the components is the same in both addends, in which case $(v_1, w) + (v_2, w) = (v_1+ v_2, w)$ and $(v, w_1) + (v,w_2) = (v, w_1 + w_2)$.
3. You can freely move scalar multiples around the components of $v \otimes w$.
4. The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.

Naturally, one can extend this definition to $n$-fold tensor products, like $V_1 \otimes V_2 \otimes \dots \otimes V_d$. Here we write the vectors as sums of things like $v_1 \otimes \dots \otimes v_d$, and we enforce that addition can only be combined if all but one coordinates are the same in the addends, and scalar multiples move around to all coordinates equally freely.

## So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the maps between objects are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a linear map between vector spaces. That is, a map $f: V \to W$ that factors through addition and scalar multiplication (i.e. $f(v + v') = f(v) + f(v')$ and $f(\lambda v) = \lambda f(v)$).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it $m$, then $m$ takes in two matrices and spits out their product

$m(A,B) = AB$

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix $B$ then

$m(A+C, B) = (A+C)B = AB + CB = m(A,B) + m(C,B)$

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates simultaneously. In other words,

$m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD \neq AC + BD = m(A,C) + m(B,D)$

In fact, there is only one function that satisfies both “linearity in its two coordinates separately” and also “linearity in both coordinates simultaneously,” and it’s the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that $m$ is linear in each coordinate when all else is fixed. Note that this property allows us to shift around scalar multiples, too. For example,

$\displaystyle m(\lambda A, B) = \lambda AB = A (\lambda B) = m(A, \lambda B) = \lambda m(A,B)$

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called bilinearity when we only have two coordinates, and multilinearity when we have more.

Here are some examples of nice multilinear maps that show up everywhere:

• If $V$ is an inner product space over $\mathbb{R}$, then the inner product is bilinear.
• The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
• The cross product of vectors in $\mathbb{R}^3$ is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.

And so what tensors do is give a sort of classification of multilinear maps. The idea is that every multilinear map $f$ from a product vector space $U_1 \times \dots \times U_d$ to any vector space $Y$ can be written first as a multilinear map to the tensor space

$\displaystyle \alpha : U_1 \times \dots \times U_d \to U_1 \otimes \dots \otimes U_d$

Followed by a linear map to $Y$,

$\displaystyle \hat{f} : U_1 \otimes \dots \otimes U_d \to Y$

And the important part is that $\alpha$ doesn’t depend on the original $f$ (but $\hat{f}$ does). One usually draws this as a single diagram:

And to say this diagram commutes is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. $f = \hat{f} \alpha$).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and $\alpha$ is the gate. Yet another way to say this is that $\alpha$ is the most general possible multilinear map that can be constructed from $U_1 \times \dots \times U_d$. Moreover, the tensor product itself is uniquely defined by having a “most-general” $\alpha$ (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of $\alpha$ (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix $V, W$ as vector spaces and look at the diagram

What is $\alpha$ in this case? Well it just sends $(v,w) \mapsto v \otimes w$. Is this map multilinear? Well if we fix $w$ then

$\displaystyle \alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha (v_2, w)$

and

$\displaystyle \alpha(\lambda v, w) = (\lambda v) \otimes w = (\lambda) (v \otimes w) = \lambda \alpha(v,w)$

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we designed tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all maps $f : V \times W \to Y$ can be decomposed into an $\alpha$ part and a $\hat{f}$ part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of $V \times W$ is, it’s just the union of the bases of the pieces. Say that $V$ has a basis $v_1, \dots, v_n$ and $W$ has $w_1, \dots, w_m$, then a basis for the product is just $((v_1, 0), \dots, (v_n,0), (0,w_1), \dots, (0,w_m))$.

But multilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on all possible pairs of basis elements. For how else could we determine $f(v_1 + v_2, w_1)$? If there are $n$ of the $v_i$‘s and $m$ of the $w_i$‘s, then there are $nm$ such pairs $f(v_i, w_j)$.

Uncoincidentally, as $V \otimes W$ is a vector space, its basis can also be constructed in terms of the bases of $V$ and $W$. You simply take all possible tensors $v_i \otimes w_j$. Since every $v \in V, w \in W$ can be written in terms of their bases, it’s clear than any tensor $\sum_{k} a_k \otimes b_k$ can also be written in terms of the basis tensors $v_i \otimes w_j$ (by simply expanding each $a_k, b_k$ in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if $(e_1, e_2, e_3)$ is a basis for $\mathbb{R}^3$, and $(g_1, g_2)$ a basis for $\mathbb{R}^2$, then the tensor space $\mathbb{R}^3 \otimes \mathbb{R}^2$ has basis

$(e_1 \otimes g_1, e_1 \otimes g_2, e_2 \otimes g_1, e_2 \otimes g_2, e_3 \otimes g_1, e_3 \otimes g_2)$

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that $\mathbb{R}^3 \otimes \mathbb{R}^2 \cong \mathbb{R}^6$.

So fine, back to decomposing $f$. All we have left to do is use the data given by $f$ (the effect on pairs of basis elements) to define $\hat{f} : V \otimes W \to Y$. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space ($v_i \otimes w_j$) and the definition of $f(v_i, w_j)$ are essentially the same.

That is, just take $\hat{f}(v_i \otimes w_j) = f(v_i, w_j)$. Note that this is just defined on the basis elements, and so we extend to all other vectors in the tensor space by imposing linearity (defining $\hat{f}$ to split across sums of tensors as needed). Is this well defined? Well, multilinearity of $f$ forces it to be so. For if we had two equal tensors, say, $\lambda v \otimes w = v \otimes \lambda w$, then we know that $f$ has to respect their equality, because $f(\lambda v_i, w_j) = f(v_i, \lambda w_j)$, so $\hat{f}$ will take the same value on equal tensors regardless of which representative we pick (where we decide to put the $\lambda$). The same idea works for sums, so everything checks out, and $f(v,w)$ is equal to $\hat{f} \alpha$, as desired. Moreover, we didn’t make any choices in constructing $\hat{f}$. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decisions in defining $V \otimes W$). Since the construction would be isomorphic if we changed the basis, our choice of $\hat{f}$ is unique.

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear why we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

## Computability and Stuff

It should be clear by now that plain product spaces $V \times W$ and tensor product spaces $V \otimes W$ are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in $V$ and $W$. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time so they’re not crazy impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space $U_1 \otimes \dots \otimes U_d$ is as a sum of the basic-looking tensors.

$\sum_k \displaystyle a_{1,k} \otimes a_{2,k} \otimes \dots \otimes a_{d,k}$

where the $a_{i,d}$ are linear combinations of basis vectors in the $U_i$. But as we saw with our examples over $\mathbb{R}$, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is just a tensor $a_1 \otimes \dots \otimes a_d$. If you can do this we call the tensor a pure tensor, or a rank 1 tensor. We then have the following natural definition and problem:

Definition: The rank of a tensor $x \in U_1 \otimes \dots \otimes U_d$ is the minimum number of terms in any representation of $x$ as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

Problem: Given a tensor $x \in k^{n_1} \otimes k^{n_2} \otimes k^{n_3}$ where $k$ is a field, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict $k$ to be either the rational numbers $\mathbb{Q}$ or a finite field $\mathbb{F}_{q}$.

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

Theorem: Computing tensor rank is NP-hard when $k = \mathbb{Q}$ and NP-complete when $k$ is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!