# On Coloring Resilient Graphs

I’m pleased to announce that another paper of mine is finished. This one is submitted to ICALP, which is being held in Copenhagen this year (this whole research thing is exciting!). This is joint work with my advisor, Lev Reyzin. As with my first paper, I’d like to explain things here on my blog a bit more informally than a scholarly article allows.

## A Recent History of Graph Coloring

One of the first important things you learn when you study graphs is that coloring graphs is hard. Remember that coloring a graph with $k$ colors means that you assign each vertex a color (a number in $\left \{ 1, 2, \dots, k \right \}$) so that no vertex is adjacent to a vertex of the same color (no edge is monochromatic). In fact, even deciding whether a graph can be colored with just $3$ colors (not to mention finding such a coloring) has no known polynomial time algorithm. It’s what’s called NP-hard, which means that almost everyone believes it’s hopeless to solve efficiently in the worst case.

One might think that there’s some sort of gradient to this problem, that as the graphs get more “complicated” it becomes algorithmically harder to figure out how colorable they are. There are some notions of “simplicity” and “complexity” for graphs, but they hardly fall on a gradient. Just to give the reader an idea, here are some ways to make graph coloring easy:

• Make sure your graph is planar. Then deciding 4-colorability is easy because the answer is always yes.
• Make sure your graph is triangle-free and planar. Then finding a 3-coloring is easy.
• Make sure your graph is perfect (which again requires knowledge about how colorable it is).
• Make sure your graph has tree-width or clique-width bounded by a constant.
• Make sure your graph doesn’t have a certain kind of induced subgraph (such as having no induced paths of length 4 or 5).

Let me emphasize that these results are very difficult and tricky to compare. The properties are inherently discrete (either perfect or imperfect, planar or not planar). The fact that the world has not yet agreed upon a universal measure of complexity for graphs (or at least one that makes graph coloring easy to understand) is not a criticism of the chef but a testament to the challenge and intrigue of the dish.

Coloring general graphs is much bleaker, where the focus has turned to approximations. You can’t “approximate” the answer to whether a graph is colorable, so now the key here is that we are actually trying to find an approximate coloring. In particular, if you’re given some graph $G$ and you don’t know the minimum number of colors needed to color it (say it’s $\chi(G)$, this is called the chromatic number), can you easily color it with what turns out to be, say, $2 \chi(G)$ colors?

Garey and Johnson (the gods of NP-hardness) proved this problem is again hard. In fact, they proved that you can’t do better than twice the number of colors. This might not seem so bad in practice, but the story gets worse. This lower bound was improved by Zuckerman, building on the work of Håstad, to depend on the size of the graph! That is, unless $P=NP$, all efficient algorithms will use asymptotically more than $\chi(G) n^{1 - \varepsilon}$ colors for any $\varepsilon > 0$ in the worst case, where $n$ is the number of vertices of $G$. So the best you can hope for is being off by something like a multiplicative factor of $n / \log n$. You can actually achieve this (it’s nontrivial and takes a lot of work), but it carries that aura of pity for the hopeful graph colorer.

The next avenue is to assume you know the chromatic number of your graph, and see how well you can do then. For example: if you are given the promise that a graph $G$ is 3-colorable, can you efficiently find a coloring with 8 colors? The best would be if you could find a coloring with 4 colors, but this is already known to be NP-hard.

The best upper bounds, algorithms to find approximate colorings of 3-colorable graphs, also pitifully depend on the size of the graph. Remember I say pitiful not to insult the researchers! This decades-long line of work was extremely difficult and deserves the highest praise. It’s just frustrating that the best known algorithm to color a 3-colorable graph requires up to $n^{0.2}$ colors. At least it bypasses the barrier of $n^{1 - \varepsilon}$ mentioned above, so we know that knowing the chromatic number actually does help.

The lower bounds are a bit more hopeful; it’s known to be NP-hard to color a $k$-colorable graph using $2^{\sqrt[3]{k}}$ colors if $k$ is sufficiently large. There are a handful of other linear lower bounds that work for all $k \geq 3$, but to my knowledge this is the best asymptotic result. The big open problem (which I doubt many people have their eye on considering how hard it seems) is to find an upper bound depending only on $k$. I wonder offhand whether a ridiculous bound like $k^{k^k}$ colors would be considered progress, and I bet it would.

## Our Idea: Resilience

So without big breakthroughs on the front of approximate graph coloring, we propose a new front for investigation. The idea is that we consider graphs which are not only colorable, but remain colorable under the adversarial operation of adding a few new edges. More formally,

Definition: A graph $G = (V,E)$ is called $r$-resiliently $k$-colorable if two properties hold

1. $G$ is $k$-colorable.
2. For any set $E'$ of $r$ edges disjoint from $E$, the graph $G' = (V, E \cup E')$ is $k$-colorable.

The simplest nontrivial example of this is 1-resiliently 3-colorable graphs. That is a graph that is 3-colorable and remains 3-colorable no matter which new edge you add. And the question we ask of this example: is there a polynomial time algorithm to 3-color a 1-resiliently 3-colorable graph? We prove in our paper that this is actually NP-hard, but it’s not a trivial thing to see.

The chief benefit of thinking about resiliently colorable graphs is that it provides a clear gradient of complexity from general graphs (zero-resilient) to the empty graph (which is $(\binom{k+1}{2} - 1)$-resiliently $k$-colorable). We know that the most complex case is NP-hard, and maximally resilient graphs are trivially colorable. So finding the boundary where resilience makes things easy can shed new light on graph coloring.

Indeed, we argue in the paper that lots of important graphs have stronger resilience properties than one might expect. For example, here are the resilience properties of some famous graphs.

From left to right: the Petersen graph, 2-resiliently 3-colorable; the Dürer graph, 4-resiliently 4-colorable; the Grötzsch graph, 4-resiliently 4-colorable; and the Chvátal graph, 3-resiliently 4-colorable. These are all maximally resilient (no graph is more resilient than stated) and chromatic (no graph is colorable with fewer colors)

If I were of a mind to do applied graph theory, I would love to know about the resilience properties of graphs that occur in the wild. For example, the reader probably knows the problem of register allocation is a natural graph coloring problem. I would love to know the resilience properties of such graphs, with the dream that they might be resilient enough on average to admit efficient coloring algorithms.

Unfortunately the only way that I know how to compute resilience properties is via brute-force search, and of course this only works for small graphs and small $k$. If readers are interested I could post such a program (I wrote it in vanilla python), but for now I’ll just post a table I computed on the proportion of small graphs that have various levels of resilience (note this includes graphs that vacuously satisfy the definition).

Percentage of k-colorable graphs on 6 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       58.0    22.7     5.9     1.7
4       93.3    79.3    58.0    35.3
5       99.4    98.1    94.8    89.0
6      100.0   100.0   100.0   100.0

Percentage of k-colorable graphs on 7 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       38.1     8.2     1.2     0.3
4       86.7    62.6    35.0    14.9
5       98.7    95.6    88.5    76.2
6       99.9    99.7    99.2    98.3

Percentage of k-colorable graphs on 8 vertices which are n-resilient
k\n       1       2       3       4
----------------------------------------
3       21.3     2.1     0.2     0.0
4       77.6    44.2    17.0     4.5

The idea is this: if this trend continues, that only some small fraction of all 3-colorable graphs are, say, 2-resiliently 3-colorable graphs, then it should be easy to color them. Why? Because resilience imposes structure on the graphs, and that structure can hopefully be realized in a way that allows us to color easily. We don’t know how to characterize that structure yet, but we can give some structural implications for sufficiently resilient graphs.

For example, a 7-resiliently 5-colorable graph can’t have any subgraphs on 6 vertices with $\binom{6}{2} - 7$ edges, or else we can add enough edges to get a 6-clique which isn’t 5-colorable. This gives an obvious general property about the sizes of subgraphs in resilient graphs, but as a more concrete instance let’s think about 2-resilient 3-colorable graphs $G$. This property says that no set of 4 vertices may have more than $4 = \binom{4}{2} - 2$ edges in $G$. This rules out 4-cycles and non-isolated triangles, but is it enough to make 3-coloring easy? We can say that $G$ is a triangle-free graph and a bunch of disjoint triangles, but it’s known 3-colorable non-planar triangle-free graphs can have arbitrarily large chromatic number, and so the coloring problem is hard. Moreover, 2-resilience isn’t enough to make $G$ planar. It’s not hard to construct a non-planar counterexample, but proving it’s 2-resilient is a tedious task I relegated to my computer.

Speaking of which, the problem of how to determine whether a $k$-colorable graph is $r$-resiliently $k$-colorable is open. Is this problem even in NP? It certainly seems not to be, but if it had a nice characterization or even stronger necessary conditions than above, we might be able to use them to find efficient coloring algorithms.

In our paper we begin to fill in a table whose completion would characterize the NP-hardness of coloring resilient graphs

The known complexity of k-coloring r-resiliently k-colorable graphs

Ignoring the technical notion of 2-to-1 hardness (it’s technical), the paper accomplishes this as follows. First, we prove some relationships between cells. In particular, if a cell is NP-hard then so are all the cells to the left and below it. So our Theorem 1, that 3-coloring 1-resiliently 3-colorable graphs is NP-hard, gives us the entire black region, though more trivial arguments give all except the (3,1) cell. Also, if a cell is in P (it’s easy to $k$-color graphs with that resilience), then so are all cells above and to its right. We prove that $k$-coloring $\binom{k}{2}$-resiliently $k$-colorable graphs is easy. This is trivial: no vertex may have degree greater than $k-1$, and the greedy algorithm can color such graphs with $k$ colors. So that gives us the entire light gray region.

There is one additional lower bound comes from the fact that it’s NP-hard to $2^{\sqrt[3]{k}}$-color a $k$-colorable graph. In particular, we prove that if you have any function $f(k)$ that makes it NP-hard to $f(k)$-color a $k$-colorable graph, then it is NP-hard to $f(k)$-color an $(f(k) - k)$-resiliently $f(k)$-colorable graph. The exponential lower bound hence gives us a nice linear lower bound, and so we have the following “sufficiently zoomed out” picture of the table

The zoomed out version of the classification table above.

The paper contains the details of how these observations are proved, in addition to the NP-hardness proof for 1-resiliently 3-colorable graphs. This leaves the following open problems:

• Get an unconditional, concrete linear resilience lower bound for hardness.
• Find an algorithm that colors graphs that are less resilient than $O(k^2)$. Even determining specific cells like (4,5) or (5,9) would likely give enough insight for this.
• Classify the tantalizing (3,2) cell (determine if it’s hard or easy to 3-color a 2-resiliently 3-colorable graph) or even better the (4,2) cell.
• Find a way to relate resilient coloring back to general coloring. For example, if such and such cell is hard, then you can’t approximate k-coloring to within so many colors.

## But Wait, There’s More!

Though this paper focuses on graph coloring, our idea of resilience doesn’t stop there (and this is one reason I like it so much!). One can imagine a notion of resilience for almost any combinatorial problem. If you’re trying to satisfy boolean formulas, you can define resilience to mean that you fix the truth value of some variable (we do this in the paper to build up to our main NP-hardness result of 3-coloring 1-resiliently 3-colorable graphs). You can define resilient set cover to allow the removal of some sets. And any other sort of graph-based problem (Traveling salesman, max cut, etc) can be resiliencified by adding or removing edges, whichever makes the problem more constrained.

So this resilience notion is quite general, though it’s hard to define precisely in a general fashion. There is a general framework called Constraint Satisfaction Problems (CSPs), but resilience here seem too general. A CSP is literally just a bunch of objects which can be assigned some set of values, and a set of constraints (k-ary 0-1-valued functions) that need to all be true for the problem to succeed. If we were to define resilience by “adding any constraint” to a given CSP, then there’s nothing to stop us from adding the negation of an existing constraint (or even the tautologically unsatisfiable constraint!). This kind of resilience would be a vacuous definition, and even if we try to rule out these edge cases, I can imagine plenty of weird things that might happen in their stead. That doesn’t mean there isn’t a nice way to generalize resilience to CSPs, but it would probably involve some sort of “constraint class” of acceptable constraints, and I don’t know a reasonable property to impose on the constraint class to make things work.

So there’s lots of room for future work here. It’s exciting to think where it will take me.

Until then!

# Simulating a Biased Coin with a Fair Coin

This is a guest post by my friend and colleague Adam Lelkes. Adam’s interests are in algebra and theoretical computer science. This gem came up because Adam gave a talk on probabilistic computation in which he discussed this technique.

Problem: simulate a biased coin using a fair coin.

Solution: (in Python)

def biasedCoin(binaryDigitStream, fairCoin):
for d in binaryDigitStream:
if fairCoin() != d:
return d


Discussion: This function takes two arguments, an iterator representing the binary expansion of the intended probability of getting 1 (let us denote it as $p$) and another function that returns 1 or 0 with equal probability. At first glance this might seem like an overcomplicated way of solving this problem: why can’t the probability be a floating point number?

The point is that $p$ can have infinite precision! Assuming that fairCoin() gives us a perfectly random stream of 1′s and 0′s (independently and with probability 1/2) and we can read each bit of the binary expansion of $p$, this function returns 1 with probability exactly $p$ even if $p$ is irrational or a fraction with infinite decimal expansion. If we used floating point arithmetic there would be a small chance we get unlucky and exhaust the precision available. We would only get an approximation of the true bias at best.

Now let us explain why this algorithm works. We keep tossing our fair coins to get a sequence of random bits, until one of our random bits is different from the corresponding bit in the binary expansion of $p$. If we stop after $i$ steps, that means that the first $i-1$ bits in the two binary sequences were the same, which happens with probability $\frac{1}{2^{i-1}}$. Given that this happens, in the $i$th step we will return the $i$th bit of $p$; let us denote this bit by $p_i$. Then the probability of returning 1 is $\sum_{i=1}^\infty \frac{p_i}{2^{i-1}}$, which is the binary expansion of $p$.

This algorithm is also efficient. By efficient here we mean that the expected running time is constant. Of course, to show this we need to make some assumption about the computational complexity of calculating the bits of $p$. If we assume that the bits of $p$ are efficiently computable in the sense that the time required to compute $p_i$ is bounded by a polynomial in $i$, then this algorithm does run in constant expected time.

Indeed, the expected running time is $\sum_{i=0}^\infty \frac{i^n}{2^i}$. Showing that this sum is a constant is an easy calculus exercise: using the ratio test we get that

$\displaystyle \textup{limsup}_{i \to \infty} \left | \frac{\frac{(i+1)^n}{2^{i+1}}}{\frac{i^n}{2^i}} \right | = \limsup_{i\to\infty} \frac{\left(\frac{i+1}{i}\right)^n}{2} = \frac{1}{2} < 1$,

thus the series is convergent.

Now that we proved that our algorithm works, it’s time to try it! Let’s say that we want to simulate a coin which gives “heads” with probability 1/3.
We need to construct our binary digit stream. Since 1/3 is 0.010101… in binary, we could use the following simple generator:

def oneThird():
while True:
yield 0
yield 1


However, we might want to have a more general generator that gives us the binary representation of any number. The following function, which takes a number between 0 and 1 as its argument, does the job:

def binaryDigits(fraction):
while True:
fraction *= 2
yield int(fraction)
fraction = fraction % 1


We also need a fair coin simulator. For this simulation, let’s just use Python’s built-in pseudo-random number generator:

def fairCoin():
return random.choice([0,1])


Let us toss our biased coin 10000 times and take the sum. We expect the sum to be around 3333. Indeed, when I tried

>>> sum(biasedCoin(oneThird(), fairCoin) for i in range(10000))
3330


It might be worth noting oneThird() is approximately ten times faster than binaryDigits(fractions.Fraction(1,3)), so when a large number of biased coins is needed, you can hardwire the binary representation of $p$ into the program.

# How to Conquer Tensorphobia

A professor at Stanford once said,

If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the $\otimes$ symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products. It’s often for good reason. People who really understand tensors feel obligated to explain it using abstract language (specifically, universal properties). And the people who explain it in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

## Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. Tensors are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a tensor product. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word product is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual direct product of vector spaces. In fact they are related (in very precise sense), but they are far from similar. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces $V, W$ of dimensions $n, m$. Recall that the direct product  $V \times W$ is the vector space of pairs $(v,w)$ where $v$ comes from $V$ and $w$ from $W$. Recall that addition in this vector space is defined componentwise ($(v_1,w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2$)) and scalar multiplication scales both components $\lambda (v,w) = (\lambda v, \lambda w)$.

To get the tensor product space $V \otimes W$, we make the following modifications. First, we redefine what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is declared to be the same as scalar multiplication of any component you want. In symbols,

$\displaystyle \lambda (v, w) = (\lambda v, w) = (v, \lambda w)$

for all choices of scalars $\lambda$ and vectors $v, w$. Second, we change the addition operation so that it only works if one of the two components are the same. In symbols, we declare that

$(v, w) + (v', w) = (v + v', w)$

only works because $w$ is the same in both pieces, and with the same rule applying if we switch the positions of $v,w$ above. All other additions are simply declared to be new vectors. I.e. $(x,y) + (z,w)$ is simply itself. It’s a valid addition — we need to be able to add stuff to be a vector space — but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that $y=w$ or $x=z$. To say it still one more time, a general element of the tensor $V \otimes W$ is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we rename the pair $(v,w)$ to $v \otimes w$, to distinguish it from the old vector space $V \times W$ that we’ve totally butchered and reanimated, and we call the tensor product space as a whole $V \otimes W$. Those familiar with this kind of abstract algebra will recognize quotient spaces at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of $\mathbb{R}$. This means that our space $\mathbb{R} \otimes \mathbb{R}$ is comprised of vectors like $3 \otimes 5$, and moreover that the following operations are completely legitimate.

$3 \otimes 5 + 1 \otimes (-5) = 3 \otimes 5 + (-1) \otimes 5 = 2 \otimes 5$

$6 \otimes 1 + 3\pi \otimes \pi = 3 \otimes 2 + 3 \otimes \pi^2 = 3 \otimes (2 + \pi^2)$

Cool. This seemingly innocuous change clearly has huge implications on the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just declaring that all the axioms of a vector space hold when we want them to. So if you were wondering whether

$\lambda (a \otimes b + c \otimes d) = \lambda(a \otimes b) + \lambda(c \otimes d)$

The answer is yes, by force of will.

So just to recall, the axioms of a tensor space $V \otimes W$ are

1. The “basic” vectors are $v \otimes w$ for $v \in V, w \in W$, and they’re used to build up all other vectors.
2. Addition is symbolic, unless one of the components is the same in both addends, in which case $(v_1, w) + (v_2, w) = (v_1+ v_2, w)$ and $(v, w_1) + (v,w_2) = (v, w_1 + w_2)$.
3. You can freely move scalar multiples around the components of $v \otimes w$.
4. The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.

Naturally, one can extend this definition to $n$-fold tensor products, like $V_1 \otimes V_2 \otimes \dots \otimes V_d$. Here we write the vectors as sums of things like $v_1 \otimes \dots \otimes v_d$, and we enforce that addition can only be combined if all but one coordinates are the same in the addends, and scalar multiples move around to all coordinates equally freely.

## So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the maps between objects are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a linear map between vector spaces. That is, a map $f: V \to W$ that factors through addition and scalar multiplication (i.e. $f(v + v') = f(v) + f(v')$ and $f(\lambda v) = \lambda f(v)$).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it $m$, then $m$ takes in two matrices and spits out their product

$m(A,B) = AB$

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix $B$ then

$m(A+C, B) = (A+C)B = AB + CB = m(A,B) + m(C,B)$

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates simultaneously. In other words,

$m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD \neq AC + BD = m(A,C) + m(B,D)$

In fact, if the only operation satisfying linearity in its two coordinates separately and also this kind of linearity is the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that $m$ is linear in each coordinate when all else is fixed. Note that this property allows us to shift around scalar multiples, too. For example,

$\displaystyle m(\lambda A, B) = \lambda AB = A (\lambda B) = m(A, \lambda B) = \lambda m(A,B)$

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called bilinearity when we only have two coordinates, and multilinearity when we have more.

Here are some examples of nice multilinear maps that show up everywhere:

• If $V$ is an inner product space over $\mathbb{R}$, then the inner product is bilinear.
• The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
• The cross product of vectors in $\mathbb{R}^3$ is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.

And so what tensors do is give a sort of classification of multilinear maps. The idea is that every multilinear map $f$ from a product vector space $U_1 \times \dots \times U_d$ to any vector space $Y$ can be written first as a multilinear map to the tensor space

$\displaystyle \alpha : U_1 \times \dots \times U_d \to U_1 \otimes \dots \otimes U_d$

Followed by a linear map to $Y$,

$\displaystyle \hat{f} : U_1 \otimes \dots \otimes U_d \to Y$

And the important part is that $\alpha$ doesn’t depend on the original $f$ (but $\hat{f}$ does). One usually draws this as a single diagram:

And to say this diagram commutes is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. $f = \hat{f} \alpha$).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and $\alpha$ is the gate. Yet another way to say this is that $\alpha$ is the most general possible multilinear map that can be constructed from $U_1 \times \dots \times U_d$. Moreover, the tensor product itself is uniquely defined by having a “most-general” $\alpha$ (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of $\alpha$ (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix $V, W$ as vector spaces and look at the diagram

What is $\alpha$ in this case? Well it just sends $(v,w) \mapsto v \otimes w$. Is this map multilinear? Well if we fix $w$ then

$\displaystyle \alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha (v_2, w)$

and

$\displaystyle \alpha(\lambda v, w) = (\lambda v) \otimes w = (\lambda) (v \otimes w) = \lambda \alpha(v,w)$

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we designed tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all maps $f : V \times W \to Y$ can be decomposed into an $\alpha$ part and a $\hat{f}$ part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of $V \times W$ is, it’s just the union of the bases of the pieces. Say that $V$ has a basis $v_1, \dots, v_n$ and $W$ has $w_1, \dots, w_m$, then a basis for the product is just $((v_1, 0), \dots, (v_n,0), (0,w_1), \dots, (0,w_m))$.

But multilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on all possible pairs of basis elements. For how else could we determine $f(v_1 + v_2, w_1)$? If there are $n$ of the $v_i$‘s and $m$ of the $w_i$‘s, then there are $nm$ such pairs $f(v_i, w_j)$.

Uncoincidentally, as $V \otimes W$ is a vector space, its basis can also be constructed in terms of the bases of $V$ and $W$. You simply take all possible tensors $v_i \otimes w_j$. Since every $v \in V, w \in W$ can be written in terms of their bases, it’s clear than any tensor $\sum_{k} a_k \otimes b_k$ can also be written in terms of the basis tensors $v_i \otimes w_j$ (by simply expanding each $a_k, b_k$ in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if $(e_1, e_2, e_3)$ is a basis for $\mathbb{R}^3$, and $(g_1, g_2)$ a basis for $\mathbb{R}^2$, then the tensor space $\mathbb{R}^3 \otimes \mathbb{R}^2$ has basis

$(e_1 \otimes g_1, e_1 \otimes g_2, e_2 \otimes g_1, e_2 \otimes g_2, e_3 \otimes g_1, e_3 \otimes g_2)$

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that $\mathbb{R}^3 \otimes \mathbb{R}^2 \cong \mathbb{R}^6$.

So fine, back to decomposing $f$. All we have left to do is use the data given by $f$ (the effect on pairs of basis elements) to define $\hat{f} : V \otimes W \to Y$. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space ($v_i \otimes w_j$) and the definition of $f$ ($f(v_i, w_j)$) are essentially the same.

That is, just take $\hat{f}(v_i \otimes w_j) = f(v_i, w_j)$. Note that this is just defined on the basis elements, and so we extend to all other vectors in the tensor space by imposing linearity (defining $\hat{f}$ to split across sums of tensors as needed). Is this well defined? Well, multilinearity of $f$ forces it to be so. For if we had two equal tensors, say, $\lambda v \otimes w = v \otimes \lambda w$, then we know that $f$ has to respect their equality, because $f(\lambda v_i, w_j) = f(v_i, \lambda w_j)$, so $\hat{f}$ will take the same value on equal tensors regardless of which representative we pick (where we decide to put the $\lambda$). The same idea works for sums, so everything checks out, and $f(v,w)$ is equal to $\hat{f} \alpha$, as desired. Moreover, we didn’t make any choices in constructing $\hat{f}$. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decisions in defining $V \otimes W$). Since the construction would be isomorphic if we changed the basis, our choice of $\hat{f}$ is unique.

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear why we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

## Computability and Stuff

It should be clear by now that plain product spaces $V \times W$ and tensor product spaces $V \otimes W$ are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in $V$ and $W$. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time so they’re not crazy impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space $U_1 \otimes \dots \otimes U_d$ is as a sum of the basic-looking tensors.

$\displaystyle \sum_{k} a_{1,k} \otimes a_{2,k} \otimes \dots \otimes a_{d,k}$

where the $a_{i,k}$ may be sums of vectors from $U_i$ themselves. But as we saw with our examples over $\mathbb{R}$, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is just a tensor $a \otimes b$. If you can do this we call the tensor a pure tensor, or a rank 1 tensor. We then have the following natural definition and problem:

Definition: The rank of a tensor $x \in U_1 \otimes \dots \otimes U_d$ is the minimum number of terms in any representation of $x$ as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

Problem: Given a tensor $x \in k^{n_1} \otimes k^{n_2} \otimes k^{n_3}$ where $k$ is a field, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict $k$ to be either the rational numbers $\mathbb{Q}$ or a finite field $\mathbb{F}_{q}$.

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

Theorem: Computing tensor rank is NP-hard when $k = \mathbb{Q}$ and NP-complete when $k$ is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!

# Probably Approximately Correct — a Formal Theory of Learning

In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.

But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.

Leslie Valiant

Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.

So let’s jump right in and see what this award-winning definition is all about.

## Learning Intervals

The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers $x$ at random in some fixed way, and in Player 1′s mind he has an interval $[a,b]$. Whenever Player 1 gives out an $x$, he must also say whether it’s in the interval (that is, whether $a \leq x \leq b$). Let’s say that Player 1 gives reports a 1 if $x$ is in the interval, and a 0 otherwise. We’ll call this number the label of $x$, and call the pair of ($x$, label) a sample, or an example. We recognize that the zero and one correspond to “yes” and “no” answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead $\pm 1$, and referred to as “positive” or “negative” examples. We’ll use the positive/negative terminology here, so positive is a 1 and negative is a 0.

Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine $a$ and $b$. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 does guess at the end can be tested against Player 1′s number-producing scheme. That is, we can compute the probability that Player 2′s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.

PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability

Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.

## Distributions and Hypotheses

First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set $X$. Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from $X$ according to some fixed but arbitrary and unknown distribution $D$. To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same $D$ and independently so). This is Player 1′s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2′s algorithm to the most devastating ruin?

So then Player 1 is reduced to a choice of distribution $D$ over $X$, and since we said that Player 2′s algorithm has to do well with high probability no matter what $D$ is, then the definition becomes something like this:

A problem is PAC-learnable if there is an algorithm $A$ which will likely win the game for all distributions $D$ over $X$.

Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function $c$ for classifying points in $X$ as a 0 or a 1. We’ll call this a concept, or a target, and it’s the thing Player 2 is trying to learn. That is, Player 2 is producing her own function $h$ that also labels points in $X$, and we’re comparing it to $c$. We call a function generated by Player 2 a hypothesis (hence the use of the letter h).

And how can we compare them? Well, we can compute the probability that they differ. We call this the error:

$\displaystyle \textup{err}_{c,D}(h) = \textup{P}_D(h(x) \neq c(x))$

One would say this aloud: “The error of the hypothesis $h$ with respect to the concept $c$ and the distribution $D$ is the probability over $x$ drawn via $D$ that $h(x)$ and $c(x)$ differ”. Some might write the “differ” part as the symmetric difference of the two functions as sets. And then it becomes a probability density, if that’s your cup of tea (it’s not mine).

So now for a problem to be PAC-learnable we can say something like,

A problem is PAC-learnable if there is an algorithm $A$ which for any distribution $D$ and any concept $c$ will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.

There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s still a more important problem: there’s just too many possible concepts! Even for finite sets: there are $2^{2^n}$ $\left \{ 0,1 \right \}-$valued functions on a set of $n$ elements, and if we hope to run in polynomial time we can only possible express a miniscule fraction of those functions. Going back to the interval game, it’d be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1′s chosen concept is arbitrary. (The mathematician in me is imaging some crazy rule using non-measurable sets, but just suffice it to say: you might think you know everything about the real numbers, but you don’t.)

So we need to boil down what possibilities there are for the concepts $c$ and the allowed expressive power of the learner. This is what concept classes are for.

## Concept Classes

concept class $\mathsf{C}$ over $X$ is a family of functions $X \to \left \{ 0,1 \right \}$. That’s all.

No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking $\mathsf{C}$ to be the set of all characteristic functions of intervals, $\chi_{[a,b]}(x) = 1$ if $a \leq x \leq b$ and 0 otherwise. Now the concept class becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:

A concept class $\mathsf{C}$ is PAC-learnable if there is an algorithm $A$ which, for any distribution $D$ of samples and any concept $c \in \mathsf{C}$, will with high probability produce a hypothesis $h \in \mathsf{C}$ whose error is small.

As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.

## Probably Approximately Correct Learning

We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.

It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.

And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter $0 < \varepsilon < 1/2$ to represent our desired error bound. That is, our goal is to find a hypothesis $h$ such that $\textup{err}_{c,D}(h) \leq \varepsilon$ with high probability. And as $\varepsilon$ gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in $1/\varepsilon$.

We need another parameter to control the “high probability” part as well, so we’ll introduce $0 < \delta < 1/2$ to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least $1-\delta$, produce a hypothesis whose error is less than $\varepsilon$. In symbols, we want

$\textup{P}_D (\textup{err}_{c,D}(h) \leq \varepsilon) > 1 - \delta$

Note that the $\textup{P}_D$ refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The “high probability” hence refers to the unlikely event that you get data which is unrepresentative of the distribution generating it. Note that this is not the probability over which distribution is chosen; an algorithm which learns must still learn no matter what $D$ is.

And again as we restrict $\delta$ more and more, we want the algorithm to be allowed more time to run. So we require the algorithm runs in time polynomial in both $1/\varepsilon, 1/\delta$.

And now we have all the pieces to state the full definition.

Definition: Let $X$ be a set, and $\mathsf{C}$ be a concept class over $X$. We say that $\mathsf{C}$ is PAC-learnable if there is an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all inputs $\varepsilon, \delta$ between 0 and $1/2$, the probability that $A$ produces a hypothesis $h$ with error at most $\varepsilon$ is at least $1- \delta$. In symbols,

$\displaystyle \textup{P}_{D}(\textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon) \geq 1-\delta$

where the first $\textup{P}_D$ is the probability over samples drawn from $D$ during the execution of the program to produce $h$. Equivalently, we can express this using the error function,

$\displaystyle \textup{P}_{D}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1-\delta$

Excellent.

## Intervals are PAC-Learnable

Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.

Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:

• Leave the number of samples you pick arbitrary, say $m$.
• Figure out the probability that the total error of our produced hypothesis is $> \varepsilon$ in terms of $m$.
• Pick $m$ to be sufficiently large that this event (failing to achieve low error) happens with small probability.

So fix any distribution $D$ over the real line and say we have our $m$ samples, we picked the max and min, and our interval is $I = [a_1,b_1]$ when the target concept is $J = [a_0, b_0]$. We can notice one thing, that our hypothesis is contained in the true interval, $I \subset J$. That’s because the sample never lie, so the largest sample we saw must be smaller than the largest possible positive example, and vice versa. In other words $a_0 < a_1 < b_1 < b_0$. And so the probability of our hypothesis producing an error is just the probability that $D$ produces a positive example in the two intervals $A = [a_0, a_1], B = [b_1, b_0]$.

This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of $A, B$ separately.

$\displaystyle \textup{err}_{J, D} \leq \textup{P}_{x \sim D}(x \in A) + \textup{P}_{x \sim D}(x \in B)$

Here’s a picture.

The two green intervals are the regions where error can occur.

If we can guarantee that each of the green pieces is smaller than $\varepsilon / 2$ with high probability, then we’ll be done. Let’s look at $A$, and the same argument will hold for $B$. Define $A'$ to be the interval $[a_0, y]$ which is so big that the probability that a positive example is drawn from $A'$ under $D$ is exactly $\varepsilon / 2$. Here’s another picture to clarify that.

The pink region A’ has total probability weight epsilon/2, and if the green region A is larger, we risk too much error to be PAC-learnable.

We’ll be in great shape if it’s already the case that $A \subset A'$, because that implies the probability we draw a positive example from $A$ is at most $\varepsilon / 2$. So we’re worried about the possibility that $A' \subset A$. But this can only happen if we never saw a point from $A'$ as a sample during the run of our algorithm. Since we had $m$ samples, we can compute in terms of $m$ the probability of never seeing a sample from $A'$.

The probability of a single sample not being in $A$ is just $1 - \varepsilon/2$ (by definition!). Recalling our basic probability theory, two draws are independent events, and so the probability of missing $A$ $m$ times is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen $A$ contributes error greater than $\varepsilon / 2$ is at most

$\displaystyle \textup{P}_D(A' \subset A) \leq (1 - \varepsilon / 2)^m$

The same argument applies to $B$, so we know by the union bound that the probability of error $> \varepsilon / 2$ occurring in either $A$ or $B$ is at most the sum of the probabilities of large error in each piece, so that

$\displaystyle \textup{P}_D(\textup{err}_{J,D}(I) > \varepsilon) \leq 2(1 - \varepsilon / 2)^m$

Now for the third bullet. We want the chance that the error is big to be smaller than $\delta$, so that we’ll have low error with probability $> 1 - \delta$. So simply set

$\displaystyle 2(1 - \varepsilon / 2)^m \leq \delta$

And solve for $m$. Using the fact that $(1-x) \leq e^{-x}$ (which is proved by Taylor series), it’s enough to solve

$\displaystyle 2e^{-\varepsilon m/2} \leq \delta$,

And a fine solution is $m \geq (2 / \varepsilon \log (2 / \delta))$.

Now to cover all our bases: our algorithm simply computes $m$ for its inputs $\varepsilon, \delta$, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in $1/\varepsilon, 1/\delta$ (and our algorithm doesn’t do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify an $\varepsilon$ fraction of new points drawn from $D$ is at most $\delta$. So we have proved the theorem:

Theorem: Intervals on the real line are PAC-learnable.

$\square$

As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about to arbitrary axis-aligned boxes in $d$ dimensional space? Where does $d$ show up in the number of samples needed? Is this still efficient?

However PAC-learning is far from sacred. In particular, the choice that we require a single algorithm to succeed no matter what the distribution $D$ was a deliberate choice, and it’s quite a strong requirement for a learning algorithm. As we’ll hopefully see in the future, distributions can be defined in quite arbitrary and pathological ways just for the singular purpose of showing a concept class isn’t efficiently PAC-learnable (or that an algorithm fails to show it is learnable). That is, even if the algorithm works great in practice! This is where learning theory starts to butt heads with the real world; in this respect PAC-learnability is much more of a mathematical subject than a practical one. We’re interested in knowing how this particular definition of what it means to learn works.