A problem that is not (properly) PAC-learnable

In a previous post we introduced a learning model called Probably Approximately Correct (PAC). We saw an example of a concept class that was easy to learn: intervals on the real line (and more generally, if you did the exercise, axis-aligned rectangles in a fixed dimension).

One of the primary goals of studying models of learning is to figure out what is learnable and what is not learnable in the various models. So as a technical aside in our study of learning theory, this post presents the standard example of a problem that isn’t learnable in the PAC model we presented last time. Afterward we’ll see that allowing the learner to be more expressive can be helpful, and by doing so we can make this unlearnable problem learnable.

Addendum: This post is dishonest in the following sense. The original definition I presented of PAC-learning is not considered the “standard” version, precisely because it forces the learning algorithm to produce hypotheses from the concept class it’s trying to learn. As this post shows, that prohibits us from learning concept classes that should be easy to learn. So to quell any misconceptions, we’re not saying that 3-term DNF formulas (defined below) are not PAC-learnable, just that they’re not PAC-learnable under the definition we gave in the previous post. In other words, we’ve set up a straw man (or, done some good mathematics) in order to illustrate why we need to add the extra bit about hypothesis classes to the definition at the end of this post.

3-Term DNF Formulas

Readers of this blog will probably have encountered a boolean formula before. A boolean formula is just a syntactic way to describe some condition (like, exactly one of these two things has to be true) using variables and logical connectives. The best way to recall it is by example: the following boolean formula encodes the “exclusive or” of two variables.

$\displaystyle (x \wedge \overline{y}) \vee (\overline{x} \wedge y)$

The wedge $\wedge$ denotes a logical AND and the vee $\vee$ denotes a logical OR. A bar above a variable represents a negation of a variable. (Please don’t ask me why the official technical way to write AND and OR is in all caps, I feel like I’m yelling math at people.)

In general a boolean formula has literals, which we can always denote by an $x_i$ or the negation $\overline{x_i}$, and connectives $\wedge$ and $\vee$, and parentheses to denote order. It’s a simple fact that any logical formula can be encoded using just these tools, but rather than try to learn general boolean formulas we look at formulas in a special form.

Definition: A formula is in three-term disjunctive normal form (DNF) if it has the form $C_1 \vee C_2 \vee C_3$ where each $C_i$ is an AND of some number of literals.

Readers who enjoyed our P vs NP primer will recall a related form of formulas: the 3-CNF form, where the “three” meant that each clause had exactly three literals and the “C” means the clauses are connected with ANDs. This is a sort of dual normal form: there are only three clauses, each clause can have any number of variables, and the roles of AND and OR are switched. In fact, if you just distribute the $\vee$‘s in a 3-term DNF formula using DeMorgan’s rules, you’ll get an equivalent 3-CNF formula. The restriction of our hypotheses to 3-term DNFs will be the crux of the difficulty: it’s not that we can’t learn DNF formulas, we just can’t learn them if we are forced to express our hypothesis as a 3-term DNF as well.

The way we’ll prove that 3-term DNF formulas “can’t be learned” in the PAC model is by an NP-hardness reduction. That is, we’ll show that if we could learn 3-term DNFs in the PAC model, then we’d be able to efficiently solve NP-hard problems with high probability. The official conjecture we’d be violating is that RP is different from NP. RP is the class of problems that you can solve in polynomial time with randomness if you can never have false positives, and the probability of a false negative is at most 1/2. Our “RP” algorithm will be a PAC-learning algorithm.

The NP-complete problem we’ll reduce from is graph 3-coloring. So if you give me a graph, I’ll produce an instance of the 3-term DNF PAC-learning problem in such a way that finding a hypothesis with low error corresponds to a valid 3-coloring of the graph. Since PAC-learning ensures that you are highly likely to find a low-error hypothesis, the existence of a PAC-learning algorithm will constitute an RP algorithm to solve this NP-complete problem.

In more detail, an “instance” of the 3-term DNF problem comes in the form of a distribution over some set of labeled examples. In this case the “set” is the set of all possible truth assignments to the variables, where we fix the number of variables to suit our needs, along with a choice of a target 3-term DNF to be learned. Then you’d have to define the distribution over these examples.

But we’ll actually do something a bit slicker. We’ll take our graph $G$, we’ll construct a set $S_G$ of labeled truth assignments, and we’ll define the distribution $D$ to be the uniform distribution over those truth assignments used in $S_G$. Then, if there happens to be a 3-term DNF that coincidentally labels the truth assignments in $S_G$ exactly how we labeled them, and we set the allowed error $\varepsilon$ to be small enough, a PAC-learning algorithm will find a consistent hypothesis (and it will correspond to a valid 3-coloring of $G$). Otherwise, no algorithm would be able to come up with a low-error hypothesis, so if our purported learning algorithm outputs a bad hypothesis we’d be certain (with high probability) that it was not bad luck but that the examples are not consistent with any 3-term DNF (and hence there is no valid 3-coloring of $G$).

This general outline has nothing to do with graphs, and so you may have guessed that the technique is commonly used to prove learning problems are hard: come up with a set of labeled examples, and a purported PAC-learning algorithm would have to come up with a hypothesis consistent with all the examples, which translates back to a solution to your NP-hard problem.

The Reduction

Now we can describe the reduction from graphs to labeled examples. The intuition is simple: each term in the 3-term DNF should correspond to a color class, and so any two adjacent vertices should correspond to an example that cannot be true. The clauses will correspond to…

For a graph $G$ with $n$ nodes $v_1, \dots, v_n$ and a set of $m$ undirected edges $E$, we construct a set of examples with positive labels $S^+$ and one with negative examples $S^-$. The examples are truth assignments to $n$ variables, which we label $x_1, \dots, x_n$, and we identify a truth assignment to the $\left \{ 0,1 \right \}$-valued vector $(x_1, x_2, \dots, x_n)$ in the usual way (true is 1, false is 0).

The positive examples $S^+$ are simple: for each $v_i$ add a truth assignment $x_i = T, x_j = F$ for $j \neq i$. I.e., the binary vector is $(1, \dots, 1,0,1, \dots, 1)$, and the zero is in the $i$-th position.

The negative examples $S^-$ come from the edges. For each edge $(v_i, v_j) \in E$, we add the example with a zero in the $i$-th and $j$-th components and ones everywhere else. Here is an example graph and the corresponding positive and negative examples:

Claim: $G$ is 3-colorable if and only if the corresponding examples are consistent with some 3-term DNF formula $\varphi$.

Again, consistent just means that $\varphi$ is satisfied by every truth assignment in $S^+$ and unsatisfied by every example in $S^-$. Since we chose our distribution to be uniform over $S^+ \cup S^-$, we don’t care what $\varphi$ does elsewhere.

Indeed, if $G$ is three-colorable we can fix some valid 3-coloring with colors red, blue, and yellow. We can construct a 3-term DNF that does what we need. Let $T_R$ be the AND of all the literals $x_i$ for which vertex $v_i$ is not red. For each such $i$, the corresponding example in $S^+$ will satisfy $T_R$, because we put a zero in the $i$-th position and ones everywhere else. Similarly, no example in $S^-$ will make $T_R$ true because to do so both vertices in the corresponding edge would have to be red.

To drive this last point home say there are three vertices and your edge is $(v_1,v_2)$. Then the corresponding negative example is $(0,0,1)$. Unless both $v_1$ and $v_2$ are colored red, one of $x_1, x_2$ will have to be ANDed as part of $T_R$. But the example has a zero for both $x_1$ and $x_2$, so $T_R$ would not be satisfied.

Doing the same thing for blue and yellow, and OR them together to get $T_R \vee T_B \vee T_Y$. Since the case is symmetrically the same for the other colors, we a consistent 3-term DNF.

On the other hand, say there is a consistent 3-term DNF $\varphi$. We need to construct a three coloring of $G$. It goes in largely the same way: label the clauses $\varphi = T_R \vee T_B \vee T_Y$ for Red, Blue, and Yellow, and then color a vertex $v_i$ the color of the clause that is satisfied by the corresponding example in $S^+$. There must be some clause that does this because $\varphi$ is consistent with $S^+$, and if there are multiple you can pick a valid color arbitrarily. Now we argue why no edge can be monochromatic. Suppose there were such an edge $(v_i, v_j)$, and both $v_i$ and $v_j$ are colored, say, blue. Look at the clause $T_B$: since $v_i$ and $v_j$ are both blue, the positive examples corresponding to those vertices  (with a 0 in the single index and 1’s everywhere else) both make $T_B$ true. Since those two positive examples differ in both their $i$-th and $j$-th positions, $T_B$ can’t have any of the literals $x_i, \overline{x_i}, x_j, \overline{x_j}$. But then the negative example for the edge would satisfy $T_B$ because it has 1’s everywhere except $i,j$! This means that the formula doesn’t consistently classify the negative examples, a contradiction. This proves the Claim.

Now we just need to show a few more details to finish the proof. In particular, we need to observe that the number of examples we generate is polynomial in the size of the graph $G$; that the learning algorithm would still run in polynomial time in the size of the input graph (indeed, this depends on our choice of the learning parameters); and that we only need to pick $\delta < 1/2$ and $\varepsilon \leq 1/(2|S^+ \cup S^-|)$ in order to enforce that an efficient PAC-learner would generate a hypothesis consistent with all the examples. Indeed, if a hypothesis errs on even one example, it will have error at least $1 / |S^+ \cup S^-|$, which is too big.

Everything’s not Lost

This might seem a bit depressing for PAC-learning, that we can’t even hope to learn 3-term DNF formulas. But we will give a sketch of why this is mostly not a problem with PAC but a problem with DNFs.

In particular, the difficulty comes in forcing a PAC-learning algorithm to express its hypothesis as a 3-term DNF, as opposed to what we might argue is a more natural representation. As we observed, distributing the ORs in a 3-term DNF produces a 3-CNF formula (an AND of clauses where each clause is an OR of exactly three literals). Indeed, one can PAC-learn 3-CNF formulas efficiently, and it suffices to show that one can learn formulas which are just ANDs of literals. Then you can blow up the number of variables only polynomially larger to get 3-CNFs. ANDs of literals are just called “conjunctions,” so the problem is to PAC-learn conjunctions. The idea that works is the same one as in our first post on PAC where we tried to learn intervals: just pick the “smallest” hypothesis that is consistent with all the examples you’ve seen so far. We leave a formal proof as an (involved) exercise to the reader.

The important thing to note is that a concept class $C$ (the thing we’re trying to learn) might be hard to learn if you’re constrained to work within $C$. If you’re allowed more expressive hypotheses (in this case, arbitrary boolean formulas), then learning $C$ suddenly becomes tractable. This compels us to add an additional caveat to the PAC definition from our first post.

Definition: A concept class $\mathsf{C}$ over a set $X$ is efficiently PAC-learnable using the hypothesis class $\mathsf{H}$ if there exists an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\text{poly}(1/\varepsilon, 1/\delta))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all $0 < \delta , \varepsilon < 1/2$, the probability that $A$ produces a hypothesis $h \in \mathsf{H}$ with error at most $\varepsilon$ is at least $1-\delta$.

And with that we’ll end this extended side note. The next post in this series will introduce and analyze a fascinating notion of dimension for concept classes, the Vapnik-Chervonenkis dimension.

Until then!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $p$. If $X$ denotes the number of trials needed to get the first success, then clearly $\Pr(X = k) = (1-p)^{k-1} p$ (since first we need $k-1$ failures which occur independently with probability $1-p$, then we need one success which happens with probability $p$). Thus the expected value of $X$ is

$\displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $p=1/6$ thus the expected time is $1/p=6$.

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomists the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $\frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $\left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$\displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses$1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that it the gambler bets $1, he should receive$26 if he wins, since the probability of getting the next letter right is exactly $\frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $\frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$.

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $T$ dollars. How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins$26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $\26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $\26^{11}$. Thus our casino will have to give out $26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $T$ dollars, the expected value of our expenses is $26^{11}+26^4+26$ dollars, thus $E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $X_0$ will denote the gambler’s fortune before the game starts, $X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $X_n$, given $X_1, X_2, \ldots, X_{n-1}$ is the same as $X_{n-1}$. This can be written as $E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale X exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale X’ as follows: let $X'_n = X_n$ if $n < T$ and $X'_n = X_T$ if $n \ge T$ where $T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $T$ itself is a random variable.

We require our stopping time $T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $X_n$ be a martingale stopped at step $T$, and suppose one of the following three conditions hold:

1. The stopping time $T$ is almost surely bounded by some constant;
2. The stopping time $T$ is almost surely finite and every step of the stopped martingale $X_n$ is almost surely bounded by some constant; or
3. The expected stopping time $E(T)$ is finite and the absolute value of the martingale increments $|X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $26^{11}+26^4+26$. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $a$ dollars, the second player has $b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $X_n$ denote the change in the second player’s fortune, and set $X_0 = 0$. Let $T_k$ denote the first time $s$ when $X_s = k$. Then our first question can be formalized as trying to determine $\Pr(T_{-b} < T_a)$. Let $t = \min \{ T_{-b}, T_a\}$. Clearly $t$ is a stopping time. By the optional stopping theorem we have that

$\displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $\Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $k$-SAT for some $k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $k$ since $k$-SAT is a special case of $(k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $k$-SAT for any $k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $x$ and $y$ are connected by a directed edge if there is a clause $(\bar x \lor y)$. Recall that $\bar x \lor y$ is equivalent to $x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable x such that there are directed paths $x \to \bar x$ and $\bar x \to x$ (since $x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there is are no such path. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $O(n^2)$ rounds where $n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $O(n^2)$ steps.

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $\pm 1$ in every round. It also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $\frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $n$, on the other hand, is called a reflecting barrier: we cannot reach $n+1$, and whenever we get close we always bounce back.

There is an equivalent version of the optimal stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett .

(Finite) Fields — A Primer

So far on this blog we’ve given some introductory notes on a few kinds of algebraic structures in mathematics (most notably groups and rings, but also monoids). Fields are the next natural step in the progression.

If the reader is comfortable with rings, then a field is extremely simple to describe: they’re just commutative rings with 0 and 1, where every nonzero element has a multiplicative inverse. We’ll give a list of all of the properties that go into this “simple” definition in a moment, but an even more simple way to describe a field is as a place where “arithmetic makes sense.” That is, you get operations for $+,-, \cdot , /$ which satisfy the expected properties of addition, subtraction, multiplication, and division. So whatever the objects in your field are (and sometimes they are quite weird objects), they behave like usual numbers in a very concrete sense.

So here’s the official definition of a field. We call a set $F$ a field if it is endowed with two binary operations addition ($+$) and multiplication ($\cdot$, or just symbol juxtaposition) that have the following properties:

• There is an element we call 0 which is the identity for addition.
• Addition is commutative and associative.
• Every element $a \in F$ has a corresponding additive inverse $b$ (which may equal $a$) for which $a + b = 0$.

These three properties are just the axioms of a (commutative) group, so we continue:

• There is an element we call 1 (distinct from 0) which is the identity for multiplication.
• Multiplication is commutative and associative.
• Every nonzero element $a \in F$ has a corresponding multiplicative inverse $b$ (which may equal $a$) for which $ab = 1$.
• Addition and multiplication distribute across each other as we expect.

If we exclude the existence of multiplicative inverses, these properties make $F$ a commutative ring, and so we have the following chain of inclusions that describes it all

$\displaystyle \textup{Fields} \subset \textup{Commutative Rings} \subset \textup{Rings} \subset \textup{Commutative Groups} \subset \textup{Groups}$

The standard examples of fields are the real numbers $\mathbb{R}$, the rationals $\mathbb{Q}$, and the complex numbers $\mathbb{C}$. But of course there are many many more. The first natural question to ask about fields is: what can they look like?

For example, can there be any finite fields? A field $F$ which as a set has only finitely many elements?

As we saw in our studies of groups and rings, the answer is yes! The simplest example is the set of integers modulo some prime $p$. We call them $\mathbb{Z} / p \mathbb{Z},$ or sometimes just $\mathbb{Z}/p$ for short, and let’s rederive what we know about them now.

As a set, $\mathbb{Z}/p$ consists of the integers $\left \{ 0, 1, \dots, p-1 \right \}$. The addition and multiplication operations are easy to define, they’re just usual addition and multiplication followed by a modulus. That is, we add by $a + b \mod p$ and multiply with $ab \mod p$. This thing is clearly a commutative ring (because the integers form a commutative ring), so to show this is a field we need to show that everything has a multiplicative inverse.

There is a nice fact that allows us to do this: an element $a$ has an inverse if and only if the only way for it to divide zero is the trivial way $0a = 0$. Here’s a proof. For one direction, suppose $a$ divides zero nontrivially, that is there is some $c \neq 0$ with $ac = 0$. Then if $a$ had an inverse $b$, then $0 = b(ac) = (ba)c = c$, but that’s very embarrassing for $c$ because it claimed to be nonzero. Now suppose $a$ only divides zero in the trivial way. Then look at all possible ways to multiply $a$ by other nonzero elements of $F$. No two can give you the same result because if $ax = ay$ then (without using multiplicative inverses) $a(x-y) = 0$, but we know that $a$ can only divide zero in the trivial way so $x=y$. In other words, the map “multiplication by $a$” is injective. Because the set of nonzero elements of $F$ is finite you have to hit everything (the map is in fact a bijection), and some $x$ will give you $ax = 1$.

Now let’s use this fact on $\mathbb{Z}/p$ in the obvious way. Since $p$ is a prime, there are no two smaller numbers $a, b < p$ so that $ab = p$. But in $\mathbb{Z}/p$ the number $p$ is equivalent to zero (mod $p$)! So $\mathbb{Z}/p$ has no nontrivial zero divisors, and so every element has an inverse, and so it’s a finite field with $p$ elements.

The next question is obvious: can we get finite fields of other sizes? The answer turns out to be yes, but you can’t get finite fields of any size. Let’s see why.

Characteristics and Vector Spaces

Say you have a finite field $k$ (lower-case k is the standard letter for a field, so let’s forget about $F$). Beacuse the field is finite, if you take 1 and keep adding it to itself you’ll eventually run out of field elements. That is, $n = 1 + 1 + \dots + 1 = 0$ at some point. How do I know it’s zero and doesn’t keep cycling never hitting zero? Well if at two points $n = m \neq 0$, then $n-m = 0$ is a time where you hit zero, contradicting the claim.

Now we define $\textup{char}(k)$, the characteristic of $k$, to be the smallest $n$ (sums of 1 with itself) for which $n = 0$. If there is no such $n$ (this can happen if $k$ is infinite, but doesn’t always happen for infinite fields), then we say the characteristic is zero. It would probably make more sense to say the characteristic is infinite, but that’s just the way it is. Of course, for finite fields the characteristic is always positive. So what can we say about this number? We have seen lots of example where it’s prime, but is it always prime? It turns out the answer is yes!

For if $ab = n = \textup{char}(k)$ is composite, then by the minimality of $n$ we get $a,b \neq 0$, but $ab = n = 0$. This can’t happen by our above observation, because being a zero divisor means you have no inverse! Contradiction, sucker.

But it might happen that there are elements of $k$ that can’t be written as $1 + 1 + \dots + 1$ for any number of terms. We’ll construct examples in a minute (in fact, we’ll classify all finite fields), but we already have a lot of information about what those fields might look like. Indeed, since every field has 1 in it, we just showed that every finite field contains a smaller field (a subfield) of all the ways to add 1 to itself. Since the characteristic is prime, the subfield is a copy of $\mathbb{Z}/p$ for $p = \textup{char}(k)$. We call this special subfield the prime subfield of $k$.

The relationship between the possible other elements of $k$ and the prime subfield is very neat. Because think about it: if $k$ is your field and $F$ is your prime subfield, then the elements of $k$ can interact with $F$ just like any other field elements. But if we separate $k$ from $F$ (make a separate copy of $F$), and just think of $k$ as having addition, then the relationship with $F$ is that of a vector space! In fact, whenever you have two fields $k \subset k'$, the latter has the structure of a vector space over the former.

Back to finite fields, $k$ is a vector space over its prime subfield, and now we can impose all the power and might of linear algebra against it. What’s it’s dimension? Finite because $k$ is a finite set! Call the dimension $m$, then we get a basis $v_1, \dots, v_m$. Then the crucial part: every element of $k$ has a unique representation in terms of the basis. So they are expanded in the form

$\displaystyle f_1v_1 + \dots + f_mv_m$

where the $f_i$ come from $F$. But now, since these are all just field operations, every possible choice for the $f_i$ has to give you a different field element. And how many choices are there for the $f_i$? Each one has exactly $|F| = \textup{char}(k) = p$. And so by counting we get that $k$ has $p^m$ many elements.

This is getting exciting quickly, but we have to pace ourselves! This is a constraint on the possible size of a finite field, but can we realize it for all choices of $p, m$? The answer is again yes, and in the next section we’ll see how.  But reader be warned: the formal way to do it requires a little bit of familiarity with ideals in rings to understand the construction. I’ll try to avoid too much technical stuff, but if you don’t know what an ideal is, you should expect to get lost (it’s okay, that’s the nature of learning new math!).

Constructing All Finite Fields

Let’s describe a construction. Take a finite field $k$ of characteristic $p$, and say you want to make a field of size $p^m$. What we need to do is construct a field extension, that is, find a bigger field containing $k$ so that the vector space dimension of our new field over $k$ is exactly $m$.

What you can do is first form the ring of polynomials with coefficients in $k$. This ring is usually denoted $k[x]$, and it’s easy to check it’s a ring (polynomial addition and multiplication are defined in the usual way). Now if I were speaking to a mathematician I would say, “From here you take an irreducible monic polynomial $p(x)$ of degree $m$, and quotient your ring by the principal ideal generated by $p$. The result is the field we want!”

In less compact terms, the idea is exactly the same as modular arithmetic on integers. Instead of doing arithmetic with integers modulo some prime (an irreducible integer), we’re doing arithmetic with polynomials modulo some irreducible polynomial $p(x)$. Now you see the reason I used $p$ for a polynomial, to highlight the parallel thought process. What I mean by “modulo a polynomial” is that you divide some element $f$ in your ring by $p$ as much as you can, until the degree of the remainder is smaller than the degree of $p(x)$, and that’s the element of your quotient. The Euclidean algorithm guarantees that we can do this no matter what $k$ is (in the formal parlance, $k[x]$ is called a Euclidean domain for this very reason). In still other words, the “quotient structure” tells us that two polynomials $f, g \in k[x]$ are considered to be the same in $k[x] / p$ if and only if $f - g$ is divisible by $p$. This is actually the same definition for $\mathbb{Z}/p$, with polynomials replacing numbers, and if you haven’t already you can start to imagine why people decided to study rings in general.

Let’s do a specific example to see what’s going on. Say we’re working with $k = \mathbb{Z}/3$ and we want to compute a field of size $27 = 3^3$. First we need to find a monic irreducible polynomial of degree $3$. For now, I just happen to know one: $p(x) = x^3 - x + 1$. In fact, we can check it’s irreducible, because to be reducible it would have to have a linear factor and hence a root in $\mathbb{Z}/3$. But it’s easy to see that if you compute $p(0), p(1), p(2)$ and take (mod 3) you never get zero.

So I’m calling this new ring

$\displaystyle \frac{\mathbb{Z}/3[x]}{(x^3 - x + 1)}$

It happens to be a field, and we can argue it with a whole lot of ring theory. First, we know an irreducible element of this ring is also prime (because the ring is a unique factorization domain), and prime elements generate maximal ideals (because it’s a principal ideal domain), and if you quotient by a maximal ideal you get a field (true of all rings).

But if we want to avoid that kind of argument and just focus on this ring, we can explicitly construct inverses. Say you have a polynomial $f(x)$, and for illustration purposes we’ll choose $f(x) = x^4 + x^2 - 1$. Now in the quotient ring we could do polynomial long division to find remainders, but another trick is just to notice that the quotient is equivalent to the condition that $x^3 = x - 1$. So we can reduce $f(x)$ by applying this rule to $x^4 = x^3 x$ to get

$\displaystyle f(x) = x^2 + x(x-1) - 1 = 2x^2 - x - 1$

Now what’s the inverse of $f(x)$? Well we need a polynomial $g(x) = ax^2 + bx + c$ whose product with $f$ gives us something which is equivalent to 1, after you reduce by $x^3 - x + 1$. A few minutes of algebra later and you’ll discover that this is equivalent to the following polynomial being identically 1

$\displaystyle (a-b+2c)x^2 + (-3a+b-c)x + (a - 2b - 2c) = 1$

In other words, we get a system of linear equations which we need to solve:

\displaystyle \begin{aligned} a & - & b & + & 2c & = 0 \\ -3a & + & b & - & c &= 0 \\ a & - & 2b & - & 2c &= 1 \end{aligned}

And from here you can solve with your favorite linear algebra techniques. This is a good exercise for working in fields, because you get to abuse the prime subfield being characteristic 3 to say terrifying things like $-1 = 2$ and $6b = 0$. The end result is that the inverse polynomial is $2x^2 + x + 1$, and if you were really determined you could write a program to compute these linear systems for any input polynomial and ensure they’re all solvable. We prefer the ring theoretic proof.

In any case, it’s clear that taking a polynomial ring like this and quotienting by a monic irreducible polynomial gives you a field. We just control the size of that field by choosing the degree of the irreducible polynomial to our satisfaction. And that’s how we get all finite fields!

One Last Word on Irreducible Polynomials

One thing we’ve avoided is the question of why irreducible monic polynomials exist of all possible degrees $m$ over any $\mathbb{Z}/p$ (and as a consequence we can actually construct finite fields of all possible sizes).

The answer requires a bit of group theory to prove this, but it turns out that the polynomial $x^{p^m} - x$ has all degree $m$ monic irreducible polynomials as factors. But perhaps a better question (for computer scientists) is how do we work over a finite field in practice? One way is to work with polynomial arithmetic as we described above, but this has some downsides: it requires us to compute these irreducible monic polynomials (which doesn’t sound so hard, maybe), to do polynomial long division every time we add, subtract, or multiply, and to compute inverses by solving a linear system.

But we can do better for some special finite fields, say where the characteristic is 2 (smells like binary) or we’re only looking at $F_{p^2}$. The benefit there is that we aren’t forced to use polynomials. We can come up with some other kind of structure (say, matrices of a special form) which happens to have the same field structure and makes computing operations relatively painless. We’ll see how this is done in the future, and see it applied to cryptography when we continue with our series on elliptic curve cryptography.

Until then!

How to Conquer Tensorphobia

A professor at Stanford once said,

If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the $\otimes$ symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products. It’s often for good reason. People who really understand tensors feel obligated to explain it using abstract language (specifically, universal properties). And the people who explain it in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. Tensors are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a tensor product. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word product is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual direct product of vector spaces. In fact they are related (in very precise sense), but they are far from similar. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces $V, W$ of dimensions $n, m$. Recall that the direct product  $V \times W$ is the vector space of pairs $(v,w)$ where $v$ comes from $V$ and $w$ from $W$. Recall that addition in this vector space is defined componentwise ($(v_1,w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2$)) and scalar multiplication scales both components $\lambda (v,w) = (\lambda v, \lambda w)$.

To get the tensor product space $V \otimes W$, we make the following modifications. First, we redefine what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is declared to be the same as scalar multiplication of any component you want. In symbols,

$\displaystyle \lambda (v, w) = (\lambda v, w) = (v, \lambda w)$

for all choices of scalars $\lambda$ and vectors $v, w$. Second, we change the addition operation so that it only works if one of the two components are the same. In symbols, we declare that

$(v, w) + (v', w) = (v + v', w)$

only works because $w$ is the same in both pieces, and with the same rule applying if we switch the positions of $v,w$ above. All other additions are simply declared to be new vectors. I.e. $(x,y) + (z,w)$ is simply itself. It’s a valid addition — we need to be able to add stuff to be a vector space — but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that $y=w$ or $x=z$. To say it still one more time, a general element of the tensor $V \otimes W$ is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we rename the pair $(v,w)$ to $v \otimes w$, to distinguish it from the old vector space $V \times W$ that we’ve totally butchered and reanimated, and we call the tensor product space as a whole $V \otimes W$. Those familiar with this kind of abstract algebra will recognize quotient spaces at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of $\mathbb{R}$. This means that our space $\mathbb{R} \otimes \mathbb{R}$ is comprised of vectors like $3 \otimes 5$, and moreover that the following operations are completely legitimate.

$3 \otimes 5 + 1 \otimes (-5) = 3 \otimes 5 + (-1) \otimes 5 = 2 \otimes 5$

$6 \otimes 1 + 3\pi \otimes \pi = 3 \otimes 2 + 3 \otimes \pi^2 = 3 \otimes (2 + \pi^2)$

Cool. This seemingly innocuous change clearly has huge implications on the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just declaring that all the axioms of a vector space hold when we want them to. So if you were wondering whether

$\lambda (a \otimes b + c \otimes d) = \lambda(a \otimes b) + \lambda(c \otimes d)$

The answer is yes, by force of will.

So just to recall, the axioms of a tensor space $V \otimes W$ are

1. The “basic” vectors are $v \otimes w$ for $v \in V, w \in W$, and they’re used to build up all other vectors.
2. Addition is symbolic, unless one of the components is the same in both addends, in which case $(v_1, w) + (v_2, w) = (v_1+ v_2, w)$ and $(v, w_1) + (v,w_2) = (v, w_1 + w_2)$.
3. You can freely move scalar multiples around the components of $v \otimes w$.
4. The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.

Naturally, one can extend this definition to $n$-fold tensor products, like $V_1 \otimes V_2 \otimes \dots \otimes V_d$. Here we write the vectors as sums of things like $v_1 \otimes \dots \otimes v_d$, and we enforce that addition can only be combined if all but one coordinates are the same in the addends, and scalar multiples move around to all coordinates equally freely.

So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the maps between objects are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a linear map between vector spaces. That is, a map $f: V \to W$ that factors through addition and scalar multiplication (i.e. $f(v + v') = f(v) + f(v')$ and $f(\lambda v) = \lambda f(v)$).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it $m$, then $m$ takes in two matrices and spits out their product

$m(A,B) = AB$

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix $B$ then

$m(A+C, B) = (A+C)B = AB + CB = m(A,B) + m(C,B)$

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates simultaneously. In other words,

$m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD \neq AC + BD = m(A,C) + m(B,D)$

In fact, there is only one function that satisfies both “linearity in its two coordinates separately” and also “linearity in both coordinates simultaneously,” and it’s the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that $m$ is linear in each coordinate when all else is fixed. Note that this property allows us to shift around scalar multiples, too. For example,

$\displaystyle m(\lambda A, B) = \lambda AB = A (\lambda B) = m(A, \lambda B) = \lambda m(A,B)$

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called bilinearity when we only have two coordinates, and multilinearity when we have more.

Here are some examples of nice multilinear maps that show up everywhere:

• If $V$ is an inner product space over $\mathbb{R}$, then the inner product is bilinear.
• The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
• The cross product of vectors in $\mathbb{R}^3$ is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.

And so what tensors do is give a sort of classification of multilinear maps. The idea is that every multilinear map $f$ from a product vector space $U_1 \times \dots \times U_d$ to any vector space $Y$ can be written first as a multilinear map to the tensor space

$\displaystyle \alpha : U_1 \times \dots \times U_d \to U_1 \otimes \dots \otimes U_d$

Followed by a linear map to $Y$,

$\displaystyle \hat{f} : U_1 \otimes \dots \otimes U_d \to Y$

And the important part is that $\alpha$ doesn’t depend on the original $f$ (but $\hat{f}$ does). One usually draws this as a single diagram:

And to say this diagram commutes is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. $f = \hat{f} \alpha$).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and $\alpha$ is the gate. Yet another way to say this is that $\alpha$ is the most general possible multilinear map that can be constructed from $U_1 \times \dots \times U_d$. Moreover, the tensor product itself is uniquely defined by having a “most-general” $\alpha$ (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of $\alpha$ (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix $V, W$ as vector spaces and look at the diagram

What is $\alpha$ in this case? Well it just sends $(v,w) \mapsto v \otimes w$. Is this map multilinear? Well if we fix $w$ then

$\displaystyle \alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha (v_2, w)$

and

$\displaystyle \alpha(\lambda v, w) = (\lambda v) \otimes w = (\lambda) (v \otimes w) = \lambda \alpha(v,w)$

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we designed tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all maps $f : V \times W \to Y$ can be decomposed into an $\alpha$ part and a $\hat{f}$ part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of $V \times W$ is, it’s just the union of the bases of the pieces. Say that $V$ has a basis $v_1, \dots, v_n$ and $W$ has $w_1, \dots, w_m$, then a basis for the product is just $((v_1, 0), \dots, (v_n,0), (0,w_1), \dots, (0,w_m))$.

But multilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on all possible pairs of basis elements. For how else could we determine $f(v_1 + v_2, w_1)$? If there are $n$ of the $v_i$‘s and $m$ of the $w_i$‘s, then there are $nm$ such pairs $f(v_i, w_j)$.

Uncoincidentally, as $V \otimes W$ is a vector space, its basis can also be constructed in terms of the bases of $V$ and $W$. You simply take all possible tensors $v_i \otimes w_j$. Since every $v \in V, w \in W$ can be written in terms of their bases, it’s clear than any tensor $\sum_{k} a_k \otimes b_k$ can also be written in terms of the basis tensors $v_i \otimes w_j$ (by simply expanding each $a_k, b_k$ in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if $(e_1, e_2, e_3)$ is a basis for $\mathbb{R}^3$, and $(g_1, g_2)$ a basis for $\mathbb{R}^2$, then the tensor space $\mathbb{R}^3 \otimes \mathbb{R}^2$ has basis

$(e_1 \otimes g_1, e_1 \otimes g_2, e_2 \otimes g_1, e_2 \otimes g_2, e_3 \otimes g_1, e_3 \otimes g_2)$

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that $\mathbb{R}^3 \otimes \mathbb{R}^2 \cong \mathbb{R}^6$.

So fine, back to decomposing $f$. All we have left to do is use the data given by $f$ (the effect on pairs of basis elements) to define $\hat{f} : V \otimes W \to Y$. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space ($v_i \otimes w_j$) and the definition of $f(v_i, w_j)$ are essentially the same.

That is, just take $\hat{f}(v_i \otimes w_j) = f(v_i, w_j)$. Note that this is just defined on the basis elements, and so we extend to all other vectors in the tensor space by imposing linearity (defining $\hat{f}$ to split across sums of tensors as needed). Is this well defined? Well, multilinearity of $f$ forces it to be so. For if we had two equal tensors, say, $\lambda v \otimes w = v \otimes \lambda w$, then we know that $f$ has to respect their equality, because $f(\lambda v_i, w_j) = f(v_i, \lambda w_j)$, so $\hat{f}$ will take the same value on equal tensors regardless of which representative we pick (where we decide to put the $\lambda$). The same idea works for sums, so everything checks out, and $f(v,w)$ is equal to $\hat{f} \alpha$, as desired. Moreover, we didn’t make any choices in constructing $\hat{f}$. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decisions in defining $V \otimes W$). Since the construction would be isomorphic if we changed the basis, our choice of $\hat{f}$ is unique.

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear why we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

Computability and Stuff

It should be clear by now that plain product spaces $V \times W$ and tensor product spaces $V \otimes W$ are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in $V$ and $W$. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time so they’re not crazy impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space $U_1 \otimes \dots \otimes U_d$ is as a sum of the basic-looking tensors.

$\sum_k \displaystyle a_{1,k} \otimes a_{2,k} \otimes \dots \otimes a_{d,k}$

where the $a_{i,d}$ are linear combinations of basis vectors in the $U_i$. But as we saw with our examples over $\mathbb{R}$, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is just a tensor $a_1 \otimes \dots \otimes a_d$. If you can do this we call the tensor a pure tensor, or a rank 1 tensor. We then have the following natural definition and problem:

Definition: The rank of a tensor $x \in U_1 \otimes \dots \otimes U_d$ is the minimum number of terms in any representation of $x$ as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

Problem: Given a tensor $x \in k^{n_1} \otimes k^{n_2} \otimes k^{n_3}$ where $k$ is a field, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict $k$ to be either the rational numbers $\mathbb{Q}$ or a finite field $\mathbb{F}_{q}$.

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

Theorem: Computing tensor rank is NP-hard when $k = \mathbb{Q}$ and NP-complete when $k$ is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!