# A parlor trick for SET

Tai-Danae Bradley is one of the hosts of PBS Infinite Series, a delightful series of vignettes into fun parts of math. The video below is about the game of SET, a favorite among mathematicians. Specifically, Tai-Danae explains how SET cards lie in (using more technical jargon) a vector space over a finite field, and that valid sets correspond to lines. If you don’t immediately see how this would work, watch the video.

In this post I want to share a parlor trick for SET that I originally heard from Charlotte Chan. It uses the same ideas from the video above, which I’ll only review briefly.

In the game of SET you see a board of cards like the following, and players look for sets.

Image source: theboardgamefamily.com

A valid set is a triple of cards where, feature by feature, the characteristics on the cards are either all the same or all different. A valid set above is {one empty blue oval, two solid blue ovals, three shaded blue ovals}. The feature of “fill” is different on all the cards, but the feature of “color” is the same, etc.

In a game of SET, the cards are dealt in order from a shuffled deck, players race to claim sets, removing the set if it’s valid, and three cards are dealt to replace the removed set. Eventually the deck is exhausted and the game is over, and the winner is the player who collected the most sets.

There are a handful of mathematical tricks you can use to help you search for sets faster, but the parlor trick in this post adds a fun variant to the end of the game.

Play the game of SET normally, but when you get down to the last card in the deck, don’t reveal it. Keep searching for sets until everyone agrees no visible sets are left. Then you start the variant: the first player to guess the last un-dealt card in the deck gets a bonus set.

The math comes in when you discover that you don’t need to guess or remember anything about the game that was just played! A clever stranger could walk into the room at the end of the game and win the bonus.

Theorem: As long as every player claimed a valid set throughout the game, the information on the remaining board uniquely determines the last (un-dealt) card.

Before we get to the proof, some reminders. Recall that there are four features on a SET card, each of which has three options. Enumerate the options for each feature (e.g., {Squiggle, Oval, Diamond} = {0, 1, 2}).

While we will not need the geometry this induces, the enumeration means each card is a vector in the vector space $\mathbb{F}_3^4$, where $\mathbb{F}_3 = \mathbb{Z}/3\mathbb{Z}$ is the finite field of three elements, and the exponent means “dimension 4.” As Tai-Danae points out in the video, each SET is an affine line in this vector space. For example, if this is the enumeration:

Source: “The Joy of Set”

Then using the enumeration, a set might be given by

$\displaystyle \{ (1, 1, 1, 1), (1, 2, 0, 1), (1, 0, 2, 1) \}$

The crucial feature for us is that the vector-sum (using the modular field arithmetic on each entry) of the cards in a valid set is the zero vector $(0, 0, 0, 0)$. This is because $0+0+0 = 0$, $1+1+1 = 0$, $2+2+2 = 0$, and $0+1+2 = 0$ are all true mod 3, and feature by feature, a valid set has options that are either all the same or all different.
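We can check this directly, representing cards as tuples in $\{0,1,2\}^4$ under the enumeration above:

```python
# The example set above; take the feature-wise sum mod 3.
cards = [(1, 1, 1, 1), (1, 2, 0, 1), (1, 0, 2, 1)]
sums = tuple(sum(card[i] for card in cards) % 3 for i in range(4))
print(sums)  # (0, 0, 0, 0): a valid set sums to the zero vector
```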

Proof of Theorem. Consider the vector-valued invariant $S_t$ equal to the sum of the remaining cards after $t$ sets have been taken. At the beginning of the game the deck has 81 cards that can be partitioned into valid sets. Because each valid set sums to the zero vector, $S_0 = (0, 0, 0, 0)$. Removing a valid set via normal play does not affect the invariant, because you’re subtracting a set of vectors whose sum is zero. So $S_t = 0$ for all $t$.

At the end of the game, the invariant still holds even if there are no valid sets left to claim. Let $x$ be the vector corresponding to the last un-dealt card, and $c_1, \dots, c_n$ be the remaining visible cards. Then $x + \sum_{i=1}^n c_i = (0,0,0,0)$, meaning $x = -\sum_{i=1}^n c_i$.

$\square$

I would provide an example, but I want to encourage everyone to play a game of SET and try it out live!

Charlotte, who originally showed me this trick, was quick enough to compute this sum in her head. So were the other math students we played SET with. It’s a bit easier than it seems since you can do the sum feature by feature. Even though I’ve known about this trick for years, I still require a piece of paper and a few minutes.

Because this is Math Intersect Programming, the reader is encouraged to implement this scheme as an exercise, and simulate a game of SET by removing randomly chosen valid sets to verify experimentally that this scheme works.
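In the spirit of that exercise, here is one way the simulation might look in Python. The function names are my own, and the set-finding is brute force rather than anything clever:

```python
import itertools
import random

def is_valid_set(a, b, c):
    # A triple is a valid set iff every feature sums to 0 mod 3.
    return all((a[i] + b[i] + c[i]) % 3 == 0 for i in range(4))

def predict_last_card(visible):
    # The un-dealt card is the negation (mod 3) of the sum of the visible cards.
    return tuple(-sum(card[i] for card in visible) % 3 for i in range(4))

deck = list(itertools.product(range(3), repeat=4))  # all 81 cards
random.shuffle(deck)
last_card, visible = deck[-1], deck[:-1]

# Simulate normal play: claim randomly chosen valid sets until none remain.
while True:
    sets = [t for t in itertools.combinations(visible, 3) if is_valid_set(*t)]
    if not sets:
        break
    for card in random.choice(sets):
        visible.remove(card)

print(predict_last_card(visible) == last_card)  # True
```

Since the invariant $S_t = 0$ holds no matter which sets are claimed, the prediction succeeds on every run.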

Until next time!

# The Inner Product as a Decision Rule

The standard inner product of two vectors has some nice geometric properties. Given two vectors $x, y \in \mathbb{R}^n$, where by $x_i$ I mean the $i$-th coordinate of $x$, the standard inner product (which I will interchangeably call the dot product) is defined by the formula

$\displaystyle \langle x, y \rangle = x_1 y_1 + \dots + x_n y_n$

This formula, simple as it is, produces a lot of interesting geometry. An important such property, one which is discussed in machine learning circles more than pure math, is that it is a very convenient decision rule.

In particular, say we’re in the Euclidean plane, and we have a line $L$ passing through the origin, with $w$ being a unit vector perpendicular to $L$ (“the normal” to the line).

If you take any vector $x$, then the dot product $\langle x, w \rangle$ is positive if $x$ is on the same side of $L$ as $w$, and negative otherwise. The dot product is zero if and only if $x$ is exactly on the line $L$, including when $x$ is the zero vector.

Left: the dot product of $w$ and $x$ is positive, meaning they are on the same side of $L$. Right: the dot product is negative, and they are on opposite sides.
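As a small sketch of the decision rule in plain Python (the helper name is mine):

```python
def side(x, w):
    """Return +1, -1, or 0: which side of the line L (with unit normal w) x lies on."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return (dot > 0) - (dot < 0)

w = (0.0, 1.0)           # unit normal to the horizontal axis L
print(side((3, 2), w))   # 1: same side as w
print(side((3, -2), w))  # -1: opposite side
print(side((5, 0), w))   # 0: exactly on L
```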

Here is an interactive demonstration of this property. Click the image below to go to the demo, and you can drag the vector arrowheads and see the decision rule change.

Click above to go to the demo

The code for this demo is available in a github repository.

It’s always curious, at first, that multiplying and summing produces such geometry. Why should this seemingly trivial arithmetic do anything useful at all?

The core fact that makes it work, however, is that the dot product tells you how one vector projects onto another. When I say “projecting” a vector $x$ onto another vector $w$, I mean you take only the components of $x$ that point in the direction of $w$. The demo shows what the result looks like using the red (or green) vector.

In two dimensions this is easy to see, as you can draw the triangle which has $x$ as the hypotenuse, with $w$ spanning one of the two legs of the triangle as follows:

If we call $a$ the (vector) leg of the triangle parallel to $w$, while $b$ is the dotted line (as a vector, parallel to $L$), then as vectors $x = a + b$. The projection of $x$ onto $w$ is just $a$.

Another way to think of this is that the projection is $x$, modified by removing any part of $x$ that is perpendicular to $w$. Using some colorful language: you put your hands on either side of $x$ and $w$, and then you squish $x$ onto $w$ along the line perpendicular to $w$ (i.e., along $b$).

And if $w$ is a unit vector, then the length of $a$—that is, the length of the projection of $x$ onto $w$—is exactly the inner product $\langle x, w \rangle$.

Moreover, if the angle between $x$ and $w$ is larger than 90 degrees, the projected vector will point in the opposite direction of $w$, so it’s really a “signed” length.

Left: the projection points in the same direction as $w$. Right: the projection points in the opposite direction.

And this is precisely why the decision rule works. This 90-degree boundary is the line perpendicular to $w$.

More technically said: Let $x, y \in \mathbb{R}^n$ be two vectors, and $\langle x,y \rangle$ their dot product. Define by $\| y \|$ the length of $y$, specifically $\sqrt{\langle y, y \rangle}$. Define $\text{proj}_{y}(x)$ by first letting $y' = \frac{y}{\| y \|}$, and then setting $\text{proj}_{y}(x) = \langle x,y' \rangle y'$. In words, you scale $y$ to a unit vector $y'$, use it to compute the inner product, and then scale $y'$ so that its length is $\langle x, y' \rangle$. Then

Theorem: Geometrically, $\text{proj}_y(x)$ is the projection of $x$ onto the line spanned by $y$.

This theorem is true for any $n$-dimensional vector space, since if you have two vectors you can simply apply the reasoning for 2 dimensions to the 2-dimensional plane containing $x$ and $y$. In that case, the decision boundary for a positive/negative output is the entire $(n-1)$-dimensional hyperplane perpendicular to $y$ (the vector being projected onto).
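A minimal sketch of the projection formula in Python; note that $\langle x, y' \rangle y' = \frac{\langle x, y\rangle}{\langle y, y \rangle} y$, which avoids a square root:

```python
def proj(x, y):
    """Project x onto the line spanned by y (y nonzero)."""
    scale = sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)
    return [scale * b for b in y]

x, y = [3.0, 4.0], [2.0, 0.0]
a = proj(x, y)                               # the leg parallel to y
b = [xi - ai for xi, ai in zip(x, a)]        # the leg perpendicular to y
print(a)                                     # [3.0, 0.0]
print(sum(bi * yi for bi, yi in zip(b, y)))  # 0.0: b is perpendicular to y
```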

In fact, the usual formula for the angle between two vectors, i.e. the formula $\langle x, y \rangle = \|x \| \cdot \| y \| \cos \theta$, is a restatement of the projection theorem in terms of trigonometry. The $\langle x, y' \rangle$ part of the projection formula (how much you scale the output) is equal to $\| x \| \cos \theta$. At the end of this post we have a proof of the cosine-angle formula above.
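A quick numerical sanity check of the angle formula in the plane, computing $\theta$ independently with `atan2`:

```python
import math

x, y = (1.0, 2.0), (3.0, 1.0)
dot = sum(a * b for a, b in zip(x, y))                        # <x, y> = 5.0
theta = abs(math.atan2(x[1], x[0]) - math.atan2(y[1], y[0]))  # angle between x and y
lhs = dot
rhs = math.hypot(*x) * math.hypot(*y) * math.cos(theta)       # ||x|| ||y|| cos(theta)
print(math.isclose(lhs, rhs))  # True
```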

Part of why this decision rule property is so important is that this is a linear function, and linear functions can be optimized relatively easily. When I say that, I specifically mean that there are many known algorithms for optimizing linear functions, which don’t have obscene runtime or space requirements. This is a big reason why mathematicians and statisticians start the mathematical modeling process with linear functions. They’re inherently simpler.

In fact, there are many techniques in machine learning—a prominent one is the so-called Kernel Trick—that exist solely to take data that is not inherently linear in nature (cannot be fruitfully analyzed by linear methods) and transform it into a dataset that is. Using the Kernel Trick as an example to foreshadow some future posts on Support Vector Machines, the idea is to take data which cannot be separated by a line, and transform it (usually by adding new coordinates) so that it can. Then the decision rule, computed in the larger space, is just a dot product. Irene Papakonstantinou neatly demonstrates this with paper folding and scissors. The tradeoff is that the size of the ambient space increases, and it might increase so much that it makes computation intractable. Luckily, the Kernel Trick avoids this by remembering where the data came from, so that one can take advantage of the smaller space to compute what would be the inner product in the larger space.

Next time we’ll see how this decision rule shows up in an optimization problem: finding the “best” hyperplane that separates an input set of red and blue points into monochromatic regions (provided that is possible). Finding this separator is the core subroutine of the Support Vector Machine technique, and therein lie interesting algorithms. After we see the core SVM algorithm, we’ll see how the Kernel Trick fits into the method to allow nonlinear decision boundaries.

## Proof of the cosine angle formula

Theorem: The inner product $\langle v, w \rangle$ is equal to $\| v \| \| w \| \cos(\theta)$, where $\theta$ is the angle between the two vectors.

Note that this angle is computed in the 2-dimensional subspace spanned by $v, w$, viewed as a typical flat plane, and this is a 2-dimensional plane regardless of the dimension of $v, w$.

Proof. If either $v$ or $w$ is zero, then both sides of the equation are zero and the theorem is trivial, so we may assume both are nonzero. Label a triangle with sides $v,w$ and the third side $v-w$. Now the length of each side is $\| v \|, \| w\|,$ and $\| v-w \|$, respectively. Assume for the moment that $\theta$ is not 0 or 180 degrees, so that this triangle is not degenerate.

The law of cosines allows us to write

$\displaystyle \| v - w \|^2 = \| v \|^2 + \| w \|^2 - 2 \| v \| \| w \| \cos(\theta)$

Moreover, the left-hand side is the inner product of $v-w$ with itself, i.e. $\| v - w \|^2 = \langle v-w , v-w \rangle$. We’ll expand $\langle v-w, v-w \rangle$ using two facts. The first, trivial from the formula, is that the inner product is symmetric: $\langle v,w \rangle = \langle w, v \rangle$. The second is that the inner product is linear in each input. In particular, for the first input: $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle cx, z \rangle = c \langle x, z \rangle$. The same holds for the second input by symmetry of the two inputs. Hence we can split up $\langle v-w, v-w \rangle$ as follows.

$\displaystyle \begin{aligned} \langle v-w, v-w \rangle &= \langle v, v-w \rangle - \langle w, v-w \rangle \\ &= \langle v, v \rangle - \langle v, w \rangle - \langle w, v \rangle + \langle w, w \rangle \\ &= \| v \|^2 - 2 \langle v, w \rangle + \| w \|^2 \end{aligned}$

Combining our two offset equations, we can subtract $\| v \|^2 + \| w \|^2$ from each side and get

$\displaystyle -2 \|v \| \|w \| \cos(\theta) = -2 \langle v, w \rangle,$

which, after dividing by $-2$, proves the theorem if $\theta \not \in \{0, 180 \}$.

Now if $\theta = 0$ or 180 degrees, the vectors are parallel, so we can write one as a scalar multiple of the other. Say $w = cv$ for $c \in \mathbb{R}$. In that case, $\langle v, cv \rangle = c \| v \| \| v \|$. Now $\| w \| = | c | \| v \|$, since a norm is a length and is hence non-negative (but $c$ can be negative). Indeed, if $v, w$ are parallel but pointing in opposite directions, then $c < 0$, so $\cos(\theta) = -1$, and $c \| v \| = - \| w \|$. Otherwise $c > 0$ and $\cos(\theta) = 1$. This allows us to write $c \| v \| \| v \| = \| w \| \| v \| \cos(\theta)$, and this completes the final case of the theorem.

$\square$

# Singular Value Decomposition Part 1: Perspectives on Linear Algebra

The singular value decomposition (SVD) of a matrix is a fundamental tool in computer science, data analysis, and statistics. It’s used for all kinds of applications from regression to prediction, to finding approximate solutions to optimization problems. In this series of two posts we’ll motivate, define, compute, and use the singular value decomposition to analyze some data. (Jump to the second post)

I want to spend the first post entirely on motivation and background. As part of this, I think we need a little reminder about how linear algebra identifies linear subspaces with matrices. I say “I think” because what I’m going to say seems rarely spelled out in detail. Indeed, I was confused myself when I first started to read about linear algebra applied to algorithms, machine learning, and data science, despite having a solid understanding of linear algebra from a mathematical perspective. The concern is the connection between matrices as transformations and matrices as a “convenient” way to organize data.

## Data vs. maps

Linear algebra aficionados like to express deep facts via statements about matrix factorization. That is, they’ll say something opaque like (and this is the complete statement for SVD we’ll get to in the post):

The SVD of an $m \times n$ matrix $A$ with real values is a factorization of $A$ as $U\Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Sigma$ is a diagonal matrix with nonnegative real entries on the diagonal.
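To make the statement concrete, here is a quick numerical check with numpy on a random $8 \times 3$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 3))                 # m = 8, n = 3

U, s, Vt = np.linalg.svd(A)            # full_matrices=True: U is 8x8, Vt is 3x3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)   # embed the singular values in an 8x3 diagonal

print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(8))) # True: U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # True: V is orthogonal
```

Note numpy returns $V^T$ directly, and the nonnegative diagonal entries of $\Sigma$ come back as the 1-d array `s`, sorted in decreasing order.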

Okay, I can understand the words individually, but what does it mean in terms of the big picture? There are two seemingly conflicting interpretations of matrices that muddle our vision.

The first is that $A$ is a linear map from some $n$-dimensional vector space to an $m$-dimensional one. Let’s work with real numbers and call the domain vector space $\mathbb{R}^n$ and the codomain $\mathbb{R}^m$. In this interpretation the factorization expresses a change of basis in the domain and codomain. Specifically, $V$ expresses a change of basis from the usual basis of $\mathbb{R}^n$ to some other basis, and $U$ does the same for the co-domain $\mathbb{R}^m$.

That’s fine and dandy, so long as the data that makes up $A$ is the description of some linear map we’d like to learn more about. Years ago on this blog we did exactly this analysis of a linear map modeling a random walk through the internet, and we ended up with Google’s PageRank algorithm. However, in most linear algebra applications $A$ actually contains data in its rows or columns. That’s the second interpretation. That is, each row of $A$ is a data point in $\mathbb{R}^n$, and there are $m$ total data points, and they represent observations of some process happening in the world. The data points are like people rating movies or weather sensors measuring wind and temperature. If you’re not indoctrinated with the linear algebra world view, you might not naturally think of this as a mapping of vectors from one vector space to another.

Most of the time when people talk about linear algebra (even mathematicians), they’ll stick entirely to the linear map perspective or the data perspective, which is kind of frustrating when you’re learning it for the first time. It seems like the data perspective is just a tidy convenience, that it just “makes sense” to put some data in a table. In my experience the singular value decomposition is the first time that the two perspectives collide, and (at least in my case) it comes with cognitive dissonance.

The way these two ideas combine is that the data is thought of as the image of the basis vectors of $\mathbb{R}^n$ under the linear map specified by $A$. Here is an example to make this concrete. Let’s say I want to express people rating movies. Each row will correspond to the ratings of a movie, and each column will correspond to a person, and the $i,j$ entry of the matrix $A$ is the rating person $j$ gives to movie $i$.

In reality they’re rated on a scale from 1 to 5 stars, but to keep things simple we’ll just say that the ratings can be any real numbers (they just happened to pick integers). So this matrix represents a linear map. The domain is $\mathbb{R}^3$, and the basis vectors are called people, and the codomain is $\mathbb{R}^8$, whose basis vectors are movies.

Now the data set is represented by $A(\vec e_{\textup{Aisha}}),A(\vec e_{\textup{Bob}}),A(\vec e_{\textup{Chandrika}})$, and by the definition of how a matrix represents a linear map, the entries of these vectors are exactly the columns of $A$. If the codomain is really big, then the image of $A$ is a small-dimensional linear subspace of the codomain. This is an important step: we’ve widened our view from the individual data points to all of their linear combinations as a subspace.
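To see “data as the image of basis vectors” in code, here is a miniature made-up ratings matrix (the numbers are hypothetical stand-ins for the table described above):

```python
import numpy as np

# Rows are movies, columns are people; entry (i, j) is person j's rating of movie i.
A = np.array([[1, 2, 1],
              [4, 5, 4],
              [2, 3, 5]])

e_aisha = np.array([1, 0, 0])   # basis vector for the first person
print(A @ e_aisha)              # [1 4 2]: the first column, i.e. Aisha's ratings
```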

Why is this helpful at all? This is where we start to see the modeling assumptions of linear algebra show through. If we’re trying to use this matrix to say something about how people rate movies (maybe we want to predict how a new person will rate these movies), we would need to be able to represent that person as a linear combination of Aisha, Bob, and Chandrika. Likewise, if we had a new movie and we wanted to use this matrix to say anything about it, we’d have to represent the movie as a linear combination of the existing movies.

Of course, I don’t literally mean that a movie (as in, the bits comprising a file containing a movie) can be represented as a linear combination of other movies. I mean that we can represent a movie formally as a linear combination in some abstract vector space for the task at hand. In other words, we’re representing those features of the movie that influence its rating abstractly as a vector. We don’t have a legitimate mathematical way to understand that, so the vector is a proxy.

It’s totally unclear what this means in terms of real life, except that you can hope (or hypothesize, or verify), that if the rating process of movies is “linear” in nature then this formal representation will accurately reflect the real world. It’s like how physicists all secretly know that mathematics doesn’t literally dictate the laws of nature, because humans made up math in their heads and if you poke nature too hard the math breaks down, but it’s so damn convenient to describe hypotheses (and so damn accurate), that we can’t avoid using it to design airplanes. And we haven’t found anything better than math for this purpose.

Likewise, movie ratings aren’t literally a linear map, but if we pretend they are we can make algorithms that predict how people rate movies with pretty good accuracy. So if you know that Skyfall gets ratings 1,2, and 1 from Aisha, Bob, and Chandrika, respectively, then a new person would rate Skyfall based on a linear combination of how well they align with these three people. In other words, up to a linear combination, in this example Aisha, Bob, and Chandrika epitomize the process of rating movies.

And now we get to the key: factoring the matrix via SVD provides an alternative and more useful way to represent the process of people rating movies. By changing the basis of one or both vector spaces involved, we isolate the different (orthogonal) characteristics of the process. In the context of our movie example, “factorization” means the following:

1. Come up with a special list of vectors $v_1, v_2, \dots, v_8$ so that every movie can be written as a linear combination of the $v_i$.
2. Do the analogous thing for people to get $p_1, p_2, p_3$.
3. Do (1) and (2) in such a way that the map $A$ is diagonal with respect to both new bases simultaneously.

One might think of the $v_i$ as “idealized movies” and the $p_j$ as “idealized critics.” If you want to use this data to say things about the world, you’d be making the assumption that any person can be written as a linear combination of the $p_j$ and any movie can be written as a linear combination of the $v_i$. These are the rows/columns of $U, V$ from the factorization. To reiterate, these linear combinations are only with respect to the task of rating movies. And they’re “special” because they make the matrix diagonal.

If the world was logical (and I’m not saying it is) then maybe $v_1$ would correspond to some idealized notion of “action movie,” and $p_1$ would correspond to some idealized notion of “action movie lover.” Then it makes sense why the mapping would be diagonal in this basis: an action movie lover only loves action movies, so $p_1$ gives a rating of zero to everything except $v_1$. A movie is represented by how it decomposes (linearly) into “idealized” movies. To make up some arbitrary numbers, maybe Skyfall is 2/3 action movie, 1/5 dystopian sci-fi, and -6/7 comedic romance. Likewise a person would be represented by how they decompose (via linear combination) into an action movie lover, rom-com lover, etc.

To be completely clear, the singular value decomposition does not find the ideal sci-fi movie. The “ideal”ness of the singular value decomposition is with respect to the inherent geometric structure of the data coupled with the assumptions of linearity. Whether this has anything at all to do with how humans classify movies is a separate question, and the answer is almost certainly no.

With this perspective we’re almost ready to talk about the singular value decomposition. I just want to take a moment to write down a list of the assumptions that we’d need if we want to ensure that, given a data set of movie ratings, we can use linear algebra to make exact claims about the world.

1. All people rate movies via the same linear map.
2. Every person can be expressed (for the sole purpose of movie ratings) as linear combinations of “ideal” people. Likewise for movies.
3. The “idealized” movies and people can be expressed as linear combinations of the movies/people in our particular data set.
4. There are no errors in the ratings.

One could have a deep and interesting discussion about the philosophical (or ethical, or cultural) aspects of these assumptions. But since the internet prefers to watch respectful discourse burn, we’ll turn to algorithms instead.

## Approximating subspaces

In our present context, the singular value decomposition (SVD) isn’t meant to be a complete description of a mapping in a new basis, as we said above. Rather, we want to use it to approximate the mapping $A$ by low-dimensional linear things. When I say “low-dimensional linear things” I mean that given $A$, we’d want to find another matrix $B$ which is measurably similar to $A$ in some way, and has low rank compared to $A$.

How do we know that $A$ isn’t already low rank? The reason is that data with even the tiniest bit of noise is full rank with overwhelming probability. A concrete way to say this is that the space of low-rank matrices has small dimension (in the sense of a manifold) inside the space of all matrices. So perturbing even a single entry by an arbitrarily small amount would almost surely increase the rank.

We don’t need to understand manifolds to understand the SVD, though. For our example of people rating movies the full-rank property should be obvious. The noise and randomness and arbitrariness in human preferences certainly destroys any “perfect” linear structure we could hope to find, and in particular that means the data set itself, i.e. the image of $A$, is a large-dimensional subspace of the codomain.

Finding a low-rank approximation can be thought of as “smoothing” the noise out of the data. This works particularly well when the underlying process is close to a linear map, that is, when the data is close to being contained entirely in a single subspace of relatively low dimension. One way this might arise: the process you’re observing is truly linear, but the data you get is corrupted by small amounts of noise. Then $A$ will be close to low rank in a measurable sense (to be defined mathematically in the sequel post), and the low-rank approximation $B$ will be a more efficient, accurate, and generalizable surrogate for $A$.

In terms of our earlier list of assumptions about when you can use linear algebra to solve problems, for the SVD we can add “approximately” to the first three assumptions, and “not too many errors” to the fourth. If those assumptions hold, the SVD will give us a matrix $B$ which accurately represents the process being measured. Conversely, if the SVD does well, then you have some evidence that the process is linear-esque.

To be more specific with notation, if $A$ is a matrix representing some dataset via the image of $A$ and you provide a small integer $k$, then the singular value decomposition computes the rank $k$ matrix $B_k$ which best approximates $A$. And since now we’re comfortable identifying a data matrix with the subspace defined by its image, this is the same thing as finding the $k$-dimensional subspace of the image of $A$ which is the best approximation of the data (i.e., the image of $B_k$). We’ll quantify what we mean by “best approximation” in the next post.

That’s it, as far as intuitively understanding what the SVD is. I should add that the SVD doesn’t only allow one to compute a rank $k$ approximation, it actually allows you to set $k=n$ and get an exact representation of $A$. We just won’t use it for that purpose in this series.
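As a sketch of what the rank-$k$ approximation computes, here is a truncated-SVD helper in numpy, applied to a made-up noisy rank-one matrix:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A: keep only the top k singular values/vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# A rank-one matrix plus a little noise: "truly linear process, corrupted slightly."
A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0]) + 0.01 * rng.standard_normal((3, 3))

B1 = rank_k_approx(A, 1)
print(np.linalg.matrix_rank(B1))     # 1
print(np.linalg.norm(A - B1) < 0.1)  # True: B1 smooths the small noise away
```

Setting `k` to the full rank recovers $A$ exactly, matching the remark that the SVD is also an exact representation.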

The second bit of intuition is the following. It’s only slightly closer to rigor, but somehow this little insight really made SVD click for me personally:

The SVD is what you get when you iteratively solve the greedy optimization problem of fitting data to a line.

By that I mean, you can compute the SVD by doing the following:

1. What’s the best line fitting my data?
2. Okay, ignoring that first line, what’s the next best line?
3. Okay, ignoring all the lines in the span of those first two lines, what’s the next best line?
4. Ignoring all the lines in the span of the first three lines, what’s the next best line?
5. (repeat)

It should be shocking that this works. For most problems, in math and in life, the greedy algorithm is far from optimal. When it happens, once in a blue moon, that the greedy algorithm is the best solution to a natural problem (and not obviously so, or just approximately so), it’s our intellectual duty to stop what we’re doing, sit up straight, and really understand and appreciate it. These wonders transcend political squabbles and sports scores. And we’ll start the next post immediately by diving into this greedy optimization problem.
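A hedged sketch of the greedy procedure in numpy, with power iteration standing in for “find the best line” and deflation for “ignore the span so far”; the comparison against `np.linalg.svd` is up to sign:

```python
import numpy as np

def greedy_lines(A, k, iters=2000):
    """Greedily find k orthogonal best-fit directions for the rows of A.
    Each round: power iteration finds a unit v maximizing ||A v||,
    then we deflate A by removing every row's component along v."""
    A = A.astype(float).copy()
    directions = []
    for _ in range(k):
        v = np.ones(A.shape[1])
        for _ in range(iters):
            v = A.T @ (A @ v)      # power iteration on A^T A
            v /= np.linalg.norm(v)
        directions.append(v)
        A -= np.outer(A @ v, v)    # ignore the span of the lines found so far
    return np.array(directions)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
V_greedy = greedy_lines(A, 3)
_, _, Vt = np.linalg.svd(A)

# Up to sign, the greedy directions are the top right singular vectors of A.
print(np.allclose(np.abs(V_greedy), np.abs(Vt[:3]), atol=1e-4))
```

This is exactly the claim above: the lines found greedily, one at a time, assemble into (part of) the SVD.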

## The geometric perspective

There are two other perspectives I want to discuss here, though it may be more appropriate for a reader who is not familiar with the SVD to wait to read this after the sequel to this post. I’m just going to relate my understanding (in terms of the greedy algorithm and data approximations) to the geometric and statistical perspectives on the SVD.

Michael Nielsen wrote a long and detailed article presenting some ideas about a “new medium” in which to think about mathematics. He demonstrates his framework by looking at the singular value decomposition for 2×2 matrices. His explanation for the intuition behind the SVD is that you can take any matrix (linear map) and break it up into three pieces: a rotation about the origin, a rescaling of each coordinate, followed by another rotation about the origin. While I have the utmost respect for Nielsen (his book on quantum computing is the best text in the field), this explanation never quite made the SVD click for me personally. It seems like a restatement of the opaque SVD definition (as a matrix factorization) into geometric terms. Indeed, an orthogonal matrix is a rotation, and a diagonal matrix is a rescaling of each coordinate.

To me, the key that’s missing from this explanation is the emphasis on the approximation. What makes the SVD so magical isn’t that the factorization exists in the first place, but rather that the SVD has these layers of increasingly good approximation. Though the terminology will come in the next post, these layers are the (ordered) singular vectors and singular values. And moreover, that the algorithmic process of constructing these layers necessarily goes in order from strongest approximation to weakest.

Another geometric perspective that highlights this is that the rank-$k$ approximation provided by the SVD is a geometric projection of a matrix onto the space of rank at most $k$ matrices with respect to the “spectral norm” on matrices (the spectral norm of $A$ is the square root of the largest eigenvalue of $A^TA$). The change of basis described above makes this projection very easy: given a singular value decomposition you just take the top $k$ singular vectors. Indeed, the Eckart-Young theorem formalizes this with the statement that the rank-$k$ SVD $B_k$ minimizes the distance (w.r.t. the spectral norm) between the original matrix $A$ and any rank $k$ matrix. So you can prove that the SVD gives you the best rank $k$ approximation of $A$ by some reasonable measure.

## Next time: algorithms

Next time we’ll connect all this to the formal definitions and rigor. We’ll study the greedy algorithm approach, and then we’ll implement the SVD and test it on some data.

Until then!

# Multiple Qubits and the Quantum Circuit

Last time we left off with the tantalizing question: how do you do a quantum “AND” operation on two qubits? In this post we’ll see why the tensor product is the natural mathematical way to represent the joint state of multiple qubits. Then we’ll define some basic quantum gates, and present the definition of a quantum circuit.

## Working with Multiple Qubits

In a classical system, if you have two bits with values $b_1, b_2$, then the “joint state” of the two bits is given by the concatenated string $b_1b_2$. But if we have two qubits $v, w$, which are vectors in $\mathbb{C}^2$, how do we represent their joint state?

There are seemingly infinitely many things we could try, but let’s entertain the simplest idea for the sake of exercising our linear algebra intuition. The simplest idea is to just “concatenate” the vectors as one does in linear algebra: represent the joint system as $(v, w) \in \mathbb{C}^2 \oplus \mathbb{C}^2$. Recall that the direct sum of two vector spaces is just what you’d want out of “concatenation” of vectors. It treats the two components as completely independent of each other, and there’s an easy way to take any vector in the sum and decompose it into two vectors in the pieces.

Why does this fail to meet our requirements for qubits? Here’s one reason: $(v, w)$ is not a unit vector when $v$ and $w$ are separately unit vectors. Indeed, $\left \| (v,w) \right \|^2 = \left \| v \right \|^2 + \left \| w \right \|^2 = 2$. We could normalize everything, and that would work for a while, but we would still run into problems. A better reason is that direct sums screw up measurement. In particular, if you have two qubits (and they’re independent, in a sense we’ll make clear later), you should be able to measure one without affecting the other. But if we use the direct sum method for combining qubits, then measuring one qubit would collapse the other! There are times when we want this to happen, but we don’t always want it to happen. Alas, there should be better reasons out there (besides “physics says so”), but I haven’t come across them yet.

So the nice mathematical alternative is to make the joint state of two qubits $v,w$ the tensor product $v \otimes w$. For a review of the basic properties of tensors and multilinear maps, see our post on the subject. Suffice it for now to remind the reader that the basis of the tensor space $U \otimes V$ consists of all the tensors of the basis elements of the pieces $U$ and $V$: $u_i \otimes v_j$. As such, the dimension of $U \otimes V$ is the product of the dimensions $\text{dim}(U) \text{dim}(V)$.

As a consequence of this and the fact that all $\mathbb{C}$-vector spaces of the same dimension are the same (isomorphic), the state space of a set of $n$ qubits can be identified with $\mathbb{C}^{2^n}$. This is one way to see why quantum computing has the potential to be strictly more powerful than classical computing: $n$ qubits provide a state space with $2^n$ coefficients, each of which is a complex number, while a classical deterministic state of $n$ bits is described by just $n$ bits. This isn’t a proof that quantum computing is more powerful, but a wink and a nudge that it could be.

While most of the time we’ll just write our states in terms of tensors (using the $\otimes$ symbol), we could write out the vector representation of $v \otimes w$ in terms of the vectors $v = (v_1, v_2), w=(w_1, w_2)$. It’s just $(v_1w_1, v_1w_2, v_2w_1, v_2w_2)$, with the obvious generalization to vectors of any dimension. This already fixes our earlier problem with norms: the norm of a tensor of two vectors is the product of the two norms, so tensors of unit vectors are unit vectors. Moreover, if you measure the first qubit, that just sets the $v_1, v_2$ above to zero or one, leaving a joint state that is still a valid unit vector.
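This coordinate formula is exactly the Kronecker product of the two vectors, which numpy computes directly; a minimal sketch (the example vectors are my own):

```python
import numpy as np

v = np.array([1, 1]) / np.sqrt(2)   # a qubit: a unit vector in C^2
w = np.array([0, 1])                # another qubit

# The joint state is the Kronecker (tensor) product:
# (v1*w1, v1*w2, v2*w1, v2*w2).
joint = np.kron(v, w)

# Norms multiply, so a tensor of unit vectors is a unit vector.
assert np.isclose(np.linalg.norm(joint), 1.0)
```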

Likewise, given two linear maps $A, B$, we can describe the map $A \otimes B$ on the tensor space both in terms of pure tensors ($(A \otimes B)(v \otimes w) = Av \otimes Bw$) and in terms of a matrix. In the same vein as the representation for vectors, the matrix corresponding to $A \otimes B$ is

$\displaystyle \begin{pmatrix} a_{1,1}B & a_{1,2}B & \dots & a_{1,n}B \\ a_{2,1}B & a_{2,2}B & \dots & a_{2,n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1}B & a_{n,2}B & \dots & a_{n,n}B \end{pmatrix}$

This is called the Kronecker product.
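In numpy the Kronecker product is `np.kron`, and we can verify the defining property $(A \otimes B)(v \otimes w) = Av \otimes Bw$ on a small example (the particular matrices are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[0, 1], [1, 0]])    # the quantum NOT on one qubit
B = np.array([[1, 0], [0, 1j]])   # some other unitary on a qubit

v = np.array([1, 0])
w = np.array([0, 1])

# The Kronecker product of the matrices acts on the Kronecker
# product of the vectors: (A tensor B)(v tensor w) = Av tensor Bw.
lhs = np.kron(A, B) @ np.kron(v, w)
rhs = np.kron(A @ v, B @ w)
assert np.allclose(lhs, rhs)
```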

One of the strange things about tensor products, which very visibly manifests itself in “strange quantum behavior,” is that not every vector in a tensor space can be represented as a single tensor product of some vectors. Let’s work with an example: $\mathbb{C}^2 \otimes \mathbb{C}^2$, and denote by $e_0, e_1$ the computational basis vectors (the same letters are used for each copy of $\mathbb{C}^2$). Sometimes you’ll get a vector like

$\displaystyle v = \frac{1}{\sqrt{2}} e_0 \otimes e_0 + \frac{1}{\sqrt{2}} e_1 \otimes e_0$

And if you’re lucky you’ll notice that this can be factored and written as $\frac{1}{\sqrt{2}}(e_0 + e_1) \otimes e_0$. Other times, though, you’ll get a vector like

$\displaystyle \frac{1}{\sqrt{2}}(e_0 \otimes e_0 + e_1 \otimes e_1)$

And it’s a deep fact that this cannot be factored into a tensor product of two vectors (prove it as an exercise). If a vector $v$ in a tensor space can be written as a single tensor product of vectors, we call $v$ a pure tensor. Otherwise, using some physics lingo, we call the state represented by $v$ entangled. So if you did the exercise you proved that not all tensors are pure tensors, or equivalently that there exist entangled quantum states. The latter sounds so much more impressive. We’ll see in a future post why these entangled states are so important in quantum computing.
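One convenient way to check for pure tensors in $\mathbb{C}^2 \otimes \mathbb{C}^2$ (useful for the exercise) uses the fact that reshaping the coordinates of $v \otimes w$ into a $2 \times 2$ matrix gives the rank-1 outer product $vw^T$; a short sketch:

```python
import numpy as np

# Reshaping the coordinates (x_00, x_01, x_10, x_11) into a 2x2
# matrix sends a pure tensor v (x) w to the rank-1 outer product
# v w^T, so matrix rank detects entanglement.
def is_pure_tensor(x):
    return np.linalg.matrix_rank(x.reshape(2, 2)) == 1

factorable = np.array([1, 0, 1, 0]) / np.sqrt(2)  # (e0 + e1) (x) e0
entangled = np.array([1, 0, 0, 1]) / np.sqrt(2)   # the entangled state above

assert is_pure_tensor(factorable)
assert not is_pure_tensor(entangled)
```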

Now we need to explain how to extend gates and qubit measurements to state spaces with multiple qubits. The first is easy: just as we often restrict our classical gates to a few bits (like the AND of two bits), we restrict multi-qubit quantum gates to operate on at most three qubits.

Definition: A quantum gate $G$ is a unitary map $\mathbb{C}^{2^n} \to \mathbb{C}^{2^n}$, where $n$ is at most 3 (recall $(\mathbb{C}^2)^{\otimes 3} = \mathbb{C}^{2^3}$ is the state space for 3 qubits).

Now let’s see how to implement AND and OR for two qubits. You might be wondering why we need three qubits in the definition above, and, perhaps surprisingly, we’ll see that AND and OR require us to work with three qubits.

Indeed, how would one compute an AND of two qubits? Taking a naive approach, as we did for the quantum NOT, we would label $e_0$ as “false” and $e_1$ as “true,” and we’d want to map $e_1 \otimes e_1 \mapsto e_1$ and all other possibilities to $e_0$. The main problem is that this is not an invertible function! Remember, all quantum operations are unitary matrices, and all unitary matrices have inverses, so we have to model AND and OR as invertible operations. We also have a “type error,” since the output is not even in the same vector space as the input, but any way to fix that would still run into the invertibility problem.

The way to deal with this is to add an extra “scratch work” qubit that is used for nothing else except to make the operation invertible. So now say we have three qubits $a, b, c$, and we want to compute $a$ AND $b$ in the sensible way described above. What we do is map

$\displaystyle a \otimes b \otimes c \mapsto a \otimes b \otimes (c \oplus (a \wedge b))$

Here $a \wedge b$ is the usual AND (where we interpret, e.g., $e_1 \wedge e_0 = e_0$), and $\oplus$ is the exclusive or operation on bits. It’s clear that this mapping makes sense for “bits” (the true/false interpretation of basis vectors) and so we can extend it to a linear map by writing down the matrix.
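With the basis ordered $e_{000}, e_{001}, \dots, e_{111}$, this map sends $e_{110} \mapsto e_{111}$ and $e_{111} \mapsto e_{110}$ and fixes the other basis vectors, so its matrix is the permutation matrix

$\displaystyle \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$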

This gate is often called the Toffoli gate by physicists, but we’ll just call it the (quantum) AND gate. In its matrix, the column labeled $ijk$ represents the input $e_i \otimes e_j \otimes e_k$, and the single 1 in that column sits in the row whose label is the output. In particular, if we want to do an AND, we ensure the “scratch work” qubit is $e_0$, so we can ignore the half of the columns where the third qubit is 1. The reader should write down the analogous construction for a quantum OR.
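To make the construction concrete, here is a small numpy sketch (my own illustration, not part of the original) that builds the $8 \times 8$ permutation matrix from the rule $abc \mapsto ab(c \oplus (a \wedge b))$ and checks that it behaves like AND when the scratch qubit starts at $e_0$:

```python
import numpy as np

# The quantum AND (Toffoli) gate as an 8x8 permutation matrix on
# basis states e_{abc}: it maps abc -> ab(c XOR (a AND b)).
T = np.zeros((8, 8))
for a in range(2):
    for b in range(2):
        for c in range(2):
            col = 4 * a + 2 * b + c
            row = 4 * a + 2 * b + (c ^ (a & b))
            T[row, col] = 1

def e(i):
    """The basis vector e_i of C^8."""
    v = np.zeros(8)
    v[i] = 1
    return v

# With the scratch qubit set to e_0, the third output bit is a AND b.
assert np.allclose(T @ e(0b110), e(0b111))  # 1 AND 1 = 1
assert np.allclose(T @ e(0b100), e(0b100))  # 1 AND 0 = 0
assert np.allclose(T.T @ T, np.eye(8))      # unitary (here: orthogonal)
```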

From now on, when we’re describing a basis state like $e_1 \otimes e_0 \otimes e_1$, we’ll denote it as $e_{101}$, and more generally when $i$ is a nonnegative integer or a binary string we’ll denote the basis state as $e_i$. We’re taking advantage of the correspondence between the $2^n$ binary strings and the $2^n$ basis states, and it compactifies notation.

Once we define a quantum circuit, it will be easy to show that using quantum AND’s, OR’s and NOT’s, we can achieve any computation that a classical circuit can.

We have one more issue we’d like to bring up before we define quantum circuits. We’re being a bit too slick when we say we’re working with “at most three qubits.” If we have ten qubits, potentially all entangled up in a weird way, how can we apply a mapping to only some of those qubits? Indeed, we only defined AND for $\mathbb{C}^8$, so how can we extend that to an AND of three qubits sitting inside any $\mathbb{C}^{2^n}$ we please? The answer is to apply the Kronecker product with the identity matrix appropriately. Let’s do a simple example of this to make everything stick.

Say I want to apply the quantum NOT gate to a qubit $v$, and I have four other qubits $w_1, w_2, w_3, w_4$ so that they’re all in the joint state $x = v \otimes w_1 \otimes w_2 \otimes w_3 \otimes w_4$. I form the NOT gate, which I’ll call $A$, and then I apply the gate $A \otimes I_{2^4}$ to $x$ (since there are 4 of the $w_i$). This will compute the tensor $Av \otimes I_2 w_1 \otimes I_2 w_2 \otimes I_2 w_3 \otimes I_2 w_4$, as desired.
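The same computation in numpy, using the Kronecker product with the identity (the random example vectors are my own sketch):

```python
import numpy as np

NOT = np.array([[0, 1], [1, 0]])

rng = np.random.default_rng(1)
v = rng.standard_normal(2)
v /= np.linalg.norm(v)
ws = []
for _ in range(4):
    w = rng.standard_normal(2)
    ws.append(w / np.linalg.norm(w))

# The joint state v (x) w1 (x) w2 (x) w3 (x) w4, a vector in C^32.
x = v
for w in ws:
    x = np.kron(x, w)

# Apply NOT to the first qubit only: (NOT (x) I_16) x.
y = np.kron(NOT, np.eye(16)) @ x

# This equals (NOT v) (x) w1 (x) w2 (x) w3 (x) w4.
expected = NOT @ v
for w in ws:
    expected = np.kron(expected, w)
assert np.allclose(y, expected)
```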

In particular, you can represent a gate that depends on only 3 qubits by writing down its $8 \times 8$ matrix together with the three indices it operates on. Note that this requires only 64 (possibly complex) matrix entries plus three indices, independent of the total number of qubits, and so it takes “constant space” to represent a single gate.

## Quantum Circuits

Here we are at the definition of a quantum circuit.

Definition: A quantum circuit is a list $G_1, \dots, G_T$ of $2^n \times 2^n$ unitary matrices, such that each $G_i$ depends on at most 3 qubits.

We’ll write down what it means to “compute” something with a quantum circuit, but for now we can imagine drawing it like a usual circuit. We write the input state as some unit vector $x \in \mathbb{C}^{2^n}$ (which may or may not be a pure tensor), each qubit making up the vector is associated to a “wire,” and at each step we pick three of the wires, send them to the next quantum gate $G_i$, and use the three output wires for further computations. The final output is the matrix product applied to the input: $G_T \dots G_1 x$. We imagine that each gate takes only one step to compute (recall, in our first post one “step” was a photon flying through a special material, so it’s not like we have to multiply these matrices by hand).

So now we have to say how a quantum circuit could solve a problem. At all levels of mathematical maturity we should have some idea how a regular circuit solves a problem: there is some distinguished output wire or set of wires containing the answer. For a quantum circuit it’s basically the same, except that at the end of the circuit we get a single quantum state (a tensor in this big vector space), and we just measure that state. As in the case of a single qubit, if the vector has coordinates $x = (x_1, \dots, x_{2^n})$, they must satisfy $\sum_i |x_i|^2 = 1$, and the probability of the measurement producing index $j$ is $|x_j|^2$. The result of that measurement is an integer (some classical bits) that represents our answer. As a side effect, the vector $x$ is mutated into the basis state $e_j$. As we’ve said, we may need to repeat a quantum computation over and over to get a good answer with high probability, so we can imagine that a quantum circuit is used as a subroutine in a larger (otherwise classical) algorithm that allows for pre- and post-processing on the quantum part.
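Measurement as just described is easy to simulate; the following sketch (the helper `measure` is my own, not a standard API) samples an index $j$ with probability $|x_j|^2$ and collapses the state to $e_j$:

```python
import numpy as np

def measure(x, rng=np.random.default_rng()):
    # Sample an index j with probability |x_j|^2, then collapse
    # the state to the basis vector e_j.
    probs = np.abs(x) ** 2
    j = rng.choice(len(x), p=probs)
    collapsed = np.zeros_like(x)
    collapsed[j] = 1
    return j, collapsed

x = np.array([1, 0, 0, 1]) / np.sqrt(2)  # equal superposition of e_0, e_3
j, post = measure(x)
assert j in (0, 3)
assert np.isclose(np.linalg.norm(post), 1.0)
```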

The final caveat is that we allow one to include as many scratchwork qubits as one needs in their circuit. This makes it possible already to simulate any classical circuit using a quantum circuit. Let’s prove it as a theorem.

Theorem: Given a classical circuit $C$ with a single output bit, there is a quantum circuit $D$ that computes the same function.

Proof. Let $x$ be a binary string input to $C$, and suppose that $C$ has $s$ gates $g_1, \dots, g_s$, each being either AND, OR, or NOT, and with $g_s$ being the output gate. To construct $D$, we can replace every $g_i$ with their quantum counterparts $G_i$. Recall that this takes $e_{b_1b_20} \mapsto e_{b_1b_2(g_i(b_1, b_2))}$. And so we need to add a single scratchwork qubit for each one (really we only need it for the ANDs and ORs, but who cares). This means that our start state is $e_{x} \otimes e_{0^s} = e_{x0^s}$. Really, we need one of these gates $G_i$ for each wire going out of the classical gate $g_i$, but with some extra tricks one can do it with a single quantum gate that uses multiple scratchwork qubits. The crucial thing to note is that the state vector is always a basis vector!

If we call $z$ the contents of all the scratchwork after the quantum circuit described above runs and $z_0$ the initial state of the scratchwork, then what we did was extend the function $x \mapsto C(x)$ to a function $e_{xz_0} \mapsto e_{xz}$. In particular, one of the bits in the $z$ part is the output of the last gate of $C$, and everything is 0-1 valued. So we can measure the state vector, get the string $xz$ and inspect the bit of $z$ which corresponds to the output wire of the final gate of the original circuit $C$. This is your answer.

$\square$

It should be clear that the single output bit extends to the general case easily. We can split a circuit with lots of output bits into a bunch of circuits with single output bits in the obvious way and combine the quantum versions together.

Next time we’ll finally look at our first quantum algorithms. And along the way we’ll see some more significant quantum operations that make use of the properties that make the quantum world interesting. Until then!