# On Coloring Resilient Graphs

I’m pleased to announce that another paper of mine is finished. This one is submitted to ICALP, which is being held in Copenhagen this year (this whole research thing is exciting!). This is joint work with my advisor, Lev Reyzin. As with my first paper, I’d like to explain things here on my blog a bit more informally than a scholarly article allows.

## A Recent History of Graph Coloring

One of the first important things you learn when you study graphs is that coloring graphs is hard. Remember that coloring a graph with $k$ colors means that you assign each vertex a color (a number in $\left \{ 1, 2, \dots, k \right \}$) so that no vertex is adjacent to a vertex of the same color (no edge is monochromatic). In fact, even deciding whether a graph can be colored with just $3$ colors (not to mention finding such a coloring) has no known polynomial time algorithm. It’s what’s called NP-hard, which means that almost everyone believes it’s hopeless to solve efficiently in the worst case.

One might think that there’s some sort of gradient to this problem, that as the graphs get more “complicated” it becomes algorithmically harder to figure out how colorable they are. There are some notions of “simplicity” and “complexity” for graphs, but they hardly fall on a gradient. Just to give the reader an idea, here are some ways to make graph coloring easy:

• Make sure your graph is planar. Then deciding 4-colorability is easy because the answer is always yes.
• Make sure your graph is triangle-free and planar. Then finding a 3-coloring is easy.
• Make sure your graph is perfect (which again requires knowledge about how colorable it is).
• Make sure your graph has tree-width or clique-width bounded by a constant.
• Make sure your graph doesn’t have a certain kind of induced subgraph (such as having no induced paths of length 4 or 5).

Let me emphasize that these results are very difficult and tricky to compare. The properties are inherently discrete (either perfect or imperfect, planar or not planar). The fact that the world has not yet agreed upon a universal measure of complexity for graphs (or at least one that makes graph coloring easy to understand) is not a criticism of the chef but a testament to the challenge and intrigue of the dish.

Coloring general graphs is much bleaker, where the focus has turned to approximations. You can’t “approximate” the answer to whether a graph is colorable, so now the key here is that we are actually trying to find an approximate coloring. In particular, if you’re given some graph $G$ and you don’t know the minimum number of colors needed to color it (say it’s $\chi(G)$, this is called the chromatic number), can you easily color it with what turns out to be, say, $2 \chi(G)$ colors?

Garey and Johnson (the gods of NP-hardness) proved this problem is again hard. In fact, they proved that you can’t do better than twice the number of colors. This might not seem so bad in practice, but the story gets worse. This lower bound was improved by Zuckerman, building on the work of Håstad, to depend on the size of the graph! That is, unless $P=NP$, all efficient algorithms will use asymptotically more than $\chi(G) n^{1 - \varepsilon}$ colors for any $\varepsilon > 0$ in the worst case, where $n$ is the number of vertices of $G$. So the best you can hope for is being off by something like a multiplicative factor of $n / \log n$. You can actually achieve this (it’s nontrivial and takes a lot of work), but it carries that aura of pity for the hopeful graph colorer.

The next avenue is to assume you know the chromatic number of your graph, and see how well you can do then. For example: if you are given the promise that a graph $G$ is 3-colorable, can you efficiently find a coloring with 8 colors? The best would be if you could find a coloring with 4 colors, but this is already known to be NP-hard.

The best upper bounds, algorithms to find approximate colorings of 3-colorable graphs, also pitifully depend on the size of the graph. Remember I say pitiful not to insult the researchers! This decades-long line of work was extremely difficult and deserves the highest praise. It’s just frustrating that the best known algorithm to color a 3-colorable graph requires up to $n^{0.2}$ colors. At least it bypasses the barrier of $n^{1 - \varepsilon}$ mentioned above, so we know that knowing the chromatic number actually does help.

The lower bounds are a bit more hopeful; it’s known to be NP-hard to color a $k$-colorable graph using $2^{\sqrt[3]{k}}$ colors if $k$ is sufficiently large. There are a handful of other linear lower bounds that work for all $k \geq 3$, but to my knowledge this is the best asymptotic result. The big open problem (which I doubt many people have their eye on considering how hard it seems) is to find an upper bound depending only on $k$. I wonder offhand whether a ridiculous bound like $k^{k^k}$ colors would be considered progress, and I bet it would.

## Our Idea: Resilience

So without big breakthroughs on the front of approximate graph coloring, we propose a new front for investigation. The idea is that we consider graphs which are not only colorable, but remain colorable under the adversarial operation of adding a few new edges. More formally,

Definition: A graph $G = (V,E)$ is called $r$-resiliently $k$-colorable if two properties hold:

1. $G$ is $k$-colorable.
2. For any set $E'$ of $r$ edges disjoint from $E$, the graph $G' = (V, E \cup E')$ is $k$-colorable.

The simplest nontrivial example of this is 1-resiliently 3-colorable graphs. That is, a graph that is 3-colorable and remains 3-colorable no matter which single new edge you add. And the question we ask of this example: is there a polynomial time algorithm to 3-color a 1-resiliently 3-colorable graph? We prove in our paper that this is actually NP-hard, but it’s not a trivial thing to see.

The chief benefit of thinking about resiliently colorable graphs is that it provides a clear gradient of complexity from general graphs (zero-resilient) to the empty graph (which is $(\binom{k+1}{2} - 1)$-resiliently $k$-colorable). We know that the most complex case is NP-hard, and maximally resilient graphs are trivially colorable. So finding the boundary where resilience makes things easy can shed new light on graph coloring.

Indeed, we argue in the paper that lots of important graphs have stronger resilience properties than one might expect. For example, here are the resilience properties of some famous graphs.

From left to right: the Petersen graph, 2-resiliently 3-colorable; the Dürer graph, 4-resiliently 4-colorable; the Grötzsch graph, 4-resiliently 4-colorable; and the Chvátal graph, 3-resiliently 4-colorable. These are all maximally resilient (no graph is more resilient than stated) and chromatic (no graph is colorable with fewer colors)

If I were of a mind to do applied graph theory, I would love to know about the resilience properties of graphs that occur in the wild. For example, the reader may already know that register allocation is a natural graph coloring problem. I would love to know the resilience properties of such graphs, with the dream that they might be resilient enough on average to admit efficient coloring algorithms.

Unfortunately the only way that I know how to compute resilience properties is via brute-force search, and of course this only works for small graphs and small $k$. If readers are interested I could post the full program (I wrote it in vanilla python; a stripped-down sketch of the idea is below), but for now here is a table I computed on the proportion of small graphs that have various levels of resilience (note this includes graphs that vacuously satisfy the definition).
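To give the flavor of that brute-force search, here is a stripped-down sketch of the core idea (a simplification for illustration, not the actual script): test $k$-colorability by trying every assignment of colors, and test $r$-resilience by trying every set of $r$ absent edges.

from itertools import combinations, product

def isKColorable(vertices, edges, k):
    # Try every assignment of k colors to the vertices, looking for a proper one.
    for assignment in product(range(k), repeat=len(vertices)):
        color = dict(zip(vertices, assignment))
        if all(color[u] != color[v] for (u, v) in edges):
            return True
    return False

def isResilientlyColorable(vertices, edges, r, k):
    # Check k-colorability of the graph itself and of every graph obtained by
    # adding r absent edges (vacuously true if fewer than r edges are absent).
    present = set(frozenset(e) for e in edges)
    absent = [pair for pair in combinations(vertices, 2) if frozenset(pair) not in present]
    if not isKColorable(vertices, edges, k):
        return False
    for newEdges in combinations(absent, r):
        if not isKColorable(vertices, list(edges) + list(newEdges), k):
            return False
    return True

For example, the 4-cycle is 1-resiliently but not 2-resiliently 3-colorable, since adding both chords produces $K_4$:

>>> square = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
>>> isResilientlyColorable(*square, r=1, k=3)
True
>>> isResilientlyColorable(*square, r=2, k=3)
False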

Percentage of $k$-colorable graphs on 6 vertices which are $r$-resilient:

| $k \backslash r$ | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| 3 | 58.0 | 22.7 | 5.9 | 1.7 |
| 4 | 93.3 | 79.3 | 58.0 | 35.3 |
| 5 | 99.4 | 98.1 | 94.8 | 89.0 |
| 6 | 100.0 | 100.0 | 100.0 | 100.0 |

Percentage of $k$-colorable graphs on 7 vertices which are $r$-resilient:

| $k \backslash r$ | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| 3 | 38.1 | 8.2 | 1.2 | 0.3 |
| 4 | 86.7 | 62.6 | 35.0 | 14.9 |
| 5 | 98.7 | 95.6 | 88.5 | 76.2 |
| 6 | 99.9 | 99.7 | 99.2 | 98.3 |

Percentage of $k$-colorable graphs on 8 vertices which are $r$-resilient:

| $k \backslash r$ | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| 3 | 21.3 | 2.1 | 0.2 | 0.0 |
| 4 | 77.6 | 44.2 | 17.0 | 4.5 |

The idea is this: if this trend continues, that only some small fraction of all 3-colorable graphs are, say, 2-resiliently 3-colorable graphs, then it should be easy to color them. Why? Because resilience imposes structure on the graphs, and that structure can hopefully be realized in a way that allows us to color easily. We don’t know how to characterize that structure yet, but we can give some structural implications for sufficiently resilient graphs.

For example, a 7-resiliently 5-colorable graph can’t have any subgraph on 6 vertices with $\binom{6}{2} - 7 = 8$ or more edges, or else we could add enough edges to get a 6-clique, which isn’t 5-colorable. This gives an obvious general property about the sizes of subgraphs in resilient graphs, but as a more concrete instance let’s think about 2-resiliently 3-colorable graphs $G$. This property says that no set of 4 vertices may have $\binom{4}{2} - 2 = 4$ or more edges in $G$. This rules out 4-cycles and non-isolated triangles, but is it enough to make 3-coloring easy? We can say that $G$ is the disjoint union of a triangle-free graph and a bunch of isolated triangles, but 3-coloring triangle-free graphs is still NP-hard in general (and triangle-free graphs can have arbitrarily large chromatic number), so this structure alone doesn’t obviously help. Moreover, 2-resilience isn’t enough to make $G$ planar. It’s not hard to construct a non-planar counterexample, but proving it’s 2-resilient is a tedious task I relegated to my computer.
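Conditions like this are easy to test directly. Here is a small illustrative check (the function is mine, just for this post) for whether some set of $k+1$ vertices already spans $\binom{k+1}{2} - r$ or more edges, which would immediately rule out $r$-resilient $k$-colorability:

from itertools import combinations

def violatesCliqueBound(vertices, edges, r, k):
    # True if some k+1 vertices span at least binom(k+1, 2) - r edges,
    # so that adding r edges would complete a (k+1)-clique.
    present = set(frozenset(e) for e in edges)
    bound = (k + 1) * k // 2 - r
    for subset in combinations(vertices, k + 1):
        spanned = sum(1 for pair in combinations(subset, 2) if frozenset(pair) in present)
        if spanned >= bound:
            return True
    return False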

Speaking of which, the problem of how to determine whether a $k$-colorable graph is $r$-resiliently $k$-colorable is open. Is this problem even in NP? It certainly seems not to be, but if it had a nice characterization or even stronger necessary conditions than above, we might be able to use them to find efficient coloring algorithms.

In our paper we begin to fill in a table whose completion would characterize the NP-hardness of coloring resilient graphs:

The known complexity of k-coloring r-resiliently k-colorable graphs

Ignoring the technical notion of 2-to-1 hardness (it’s technical), the paper accomplishes this as follows. First, we prove some relationships between cells. In particular, if a cell is NP-hard then so are all the cells to the left and below it. So our Theorem 1, that 3-coloring 1-resiliently 3-colorable graphs is NP-hard, gives us the entire black region, though more trivial arguments give all except the (3,1) cell. Also, if a cell is in P (it’s easy to $k$-color graphs with that resilience), then so are all cells above and to its right. We prove that $k$-coloring $\binom{k}{2}$-resiliently $k$-colorable graphs is easy. This is trivial: no vertex may have degree greater than $k-1$, and the greedy algorithm can color such graphs with $k$ colors. So that gives us the entire light gray region.
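In code, that greedy argument is only a few lines. Here’s a sketch, assuming the graph is given as a dict mapping each vertex to its set of neighbors:

def greedyColor(vertices, neighbors, k):
    # Give each vertex the smallest color not used by an already-colored neighbor.
    coloring = {}
    for v in vertices:
        used = set(coloring[u] for u in neighbors[v] if u in coloring)
        coloring[v] = next(c for c in range(k) if c not in used)  # a free color exists when deg(v) <= k-1
    return coloring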

One additional lower bound comes from the fact that it’s NP-hard to $2^{\sqrt[3]{k}}$-color a $k$-colorable graph. In particular, we prove that if you have any function $f(k)$ that makes it NP-hard to $f(k)$-color a $k$-colorable graph, then it is NP-hard to $f(k)$-color an $(f(k) - k)$-resiliently $f(k)$-colorable graph. The exponential lower bound hence gives us a nice linear lower bound, and so we have the following “sufficiently zoomed out” picture of the table:

The zoomed out version of the classification table above.

The paper contains the details of how these observations are proved, in addition to the NP-hardness proof for 1-resiliently 3-colorable graphs. This leaves the following open problems:

• Get an unconditional, concrete linear resilience lower bound for hardness.
• Find an algorithm that colors graphs that are less resilient than $O(k^2)$. Even determining specific cells like (4,5) or (5,9) would likely give enough insight for this.
• Classify the tantalizing (3,2) cell (determine if it’s hard or easy to 3-color a 2-resiliently 3-colorable graph) or even better the (4,2) cell.
• Find a way to relate resilient coloring back to general coloring. For example, if such and such cell is hard, then you can’t approximate k-coloring to within so many colors.

## But Wait, There’s More!

Though this paper focuses on graph coloring, our idea of resilience doesn’t stop there (and this is one reason I like it so much!). One can imagine a notion of resilience for almost any combinatorial problem. If you’re trying to satisfy boolean formulas, you can define resilience to mean that you fix the truth value of some variable (we do this in the paper to build up to our main NP-hardness result of 3-coloring 1-resiliently 3-colorable graphs). You can define resilient set cover to allow the removal of some sets. And any other sort of graph-based problem (Traveling salesman, max cut, etc) can be resiliencified by adding or removing edges, whichever makes the problem more constrained.

So this resilience notion is quite general, though it’s hard to define precisely in a general fashion. There is a general framework called Constraint Satisfaction Problems (CSPs), but resilience here seems too general. A CSP is literally just a bunch of objects which can be assigned some set of values, and a set of constraints (k-ary 0-1-valued functions) that need to all be true for the problem to succeed. If we were to define resilience by “adding any constraint” to a given CSP, then there’s nothing to stop us from adding the negation of an existing constraint (or even the tautologically unsatisfiable constraint!). This kind of resilience would be a vacuous definition, and even if we try to rule out these edge cases, I can imagine plenty of weird things that might happen in their stead. That doesn’t mean there isn’t a nice way to generalize resilience to CSPs, but it would probably involve some sort of “constraint class” of acceptable constraints, and I don’t know a reasonable property to impose on the constraint class to make things work.

So there’s lots of room for future work here. It’s exciting to think where it will take me.

Until then!

# Lagrangians for the Amnesiac

For a while I’ve been meaning to do some more advanced posts on optimization problems of all flavors. One technique that comes up over and over again is Lagrange multipliers, so this post is going to be a leisurely reminder of that technique. I often forget how to do these basic calculus-type things, so it’s good practice.

We will assume something about the reader’s knowledge, but it’s a short list: know how to operate with vectors and the dot product, know how to take a partial derivative, and know that in single-variable calculus the local maxima and minima of a differentiable function $f(x)$ occur when the derivative $f'(x)$ vanishes. All of the functions we’ll work with in this post will have infinitely many derivatives (i.e. smooth). So things will be nice.

The gradient of a multivariable function is the natural extension of the derivative of a single-variable function. If $f(x_1, \dots, x_n)$ is a differentiable function, the data of the gradient of $f$ consists of all of the partial derivatives $\partial f / \partial x_i$. It’s usually written as a vector

$\displaystyle \nabla f = \left ( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right )$

To make things easier for ourselves, we’ll just call $f$ a function $f(x)$ and understand $x$ to be a vector in $\mathbb{R}^n$.

We can also think of $\nabla f$ as a function which takes in vectors and spits out vectors, by plugging in the input vector into each $\partial f / \partial x_i$. And the reason we do this is because it lets us describe the derivative of $f$ at a point as a linear map based on the gradient. That is, if we want to know how fast $f$ is growing along a particular vector $v$ and at a particular point $(x, f(x))$, we can just take a dot product of $v$ with $\nabla f(x)$. I like to call dot products inner products, and use the notation $\left \langle \nabla f(x), v \right \rangle$. Here $v$ is a vector in $\mathbb{R}^n$ which we think of as a “tangent vector” to the surface defined by $f$. And if we scale $v$ bigger or smaller, the value of the derivative scales with it (of course, because the derivative is a linear map!). Usually we use unit vectors to represent directions, but there’s no reason we have to. Calculus textbooks often require this to define a “directional derivative,” but perhaps it is better to understand the underlying linear algebra than to memorize these arbitrary conventions.

For example, let $f(x,y,z) = xyz$. Then $\nabla f = (yz, xz, xy)$, and $\nabla f(1,2,1) = (2, 1, 2)$. Now if we pick a vector to go along, say, $v = (0,-1,1)$, we get the derivative of $f$ along $v$ is $\left \langle (2,1,2), (0,-1,1) \right \rangle = 1$.
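As a quick sanity check (a little numerical sketch, not needed for anything later), we can estimate this directional derivative with a central difference and watch it agree with the inner product:

def directionalDerivative(f, x, v, h=1e-6):
    # Estimate the derivative of f at x along v with a central difference.
    forward = f([xi + h * vi for (xi, vi) in zip(x, v)])
    backward = f([xi - h * vi for (xi, vi) in zip(x, v)])
    return (forward - backward) / (2 * h)

f = lambda p: p[0] * p[1] * p[2]  # f(x, y, z) = xyz
print(directionalDerivative(f, [1, 2, 1], [0, -1, 1]))  # approximately 1.0, matching the inner product above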

As importantly as computing derivatives is finding where the derivative is zero, and the geometry of the gradient can help us here. Specifically, if we think of our function $f$ as a surface sitting in $\mathbb{R}^{n+1}$ (as in the picture below), it’s not hard to see that the gradient vector points in the direction of steepest ascent of $f$. How do we know this? Well if you fix a point $(x_1, \dots, x_n)$ and you’re forced to use a vector $v$ of the same magnitude as $\nabla f(x)$, how can you maximize the inner product $\left \langle \nabla f(x), v \right \rangle$? Well, you just pick $v$ to be equal to $\nabla f(x)$, of course! This will turn the dot product into the square norm of $\nabla f(x)$.

The gradient points in the direction of steepest ascent.

More generally, the operation of an inner product $\left \langle -, v \right \rangle$ is geometrically the size of the projection of the argument onto $v$ (scaled by the size of $v$), and projections of a vector $w$ onto different directions than $w$ can only be smaller in magnitude than $w$. Another way to see this is to know the “alternative” formula for the dot product

$\displaystyle \left \langle v,w \right \rangle = \left \| v \right \| \left \| w \right \| \cos(\theta)$

where $\theta$ is the angle between the vectors (in $\mathbb{R}^n$). We might not know how to get that angle, and in this post we don’t care, but we do know that $\cos(\theta)$ is between -1 and 1. And so if $v$ is fixed and we can’t change the norm of $w$ but only its direction, we will maximize the dot product when the two vectors point in the same direction, when $\theta$ is zero.

All of this is just to say that the gradient at a point can be interpreted as having a specific direction. It’s the direction of steepest ascent of the surface $f$, and its size tells you how steep $f$ is at that point. The opposite direction is the direction of steepest descent, and the orthogonal directions (when $\theta = \pi /2$) have derivative zero.

Now what happens if we’re at a local minimum or maximum? Well it’s necessary that $f$ is flat, and so by our discussion above the derivatives in all directions must be zero. It’s a basic linear algebra proof to show that this means the gradient is the zero vector. You can prove this by asking what sorts of vectors $w$ have a dot product of zero with all other vectors $v$?

Now once we have a local max or a local min, how do we tell which? The answer is actually a bit complicated, and it requires you to inspect the eigenvalues of the Hessian of $f$. We won’t dally on eigenvalues except to explain the idea in brief: for an $n$ variable function $f$ the Hessian of $f$ at $x$ is an $n$-by-$n$ matrix where the $i,j$ entry is the value of $(\partial^2 f / \partial x_i \partial x_j )(x)$. It just so turns out that if this matrix has only positive eigenvalues, then $x$ is a local minimum. If the eigenvalues are all negative, it’s a local max. If some are negative and some are positive, then it’s a saddle point. And if zero is an eigenvalue then we’re screwed and can’t conclude anything without more work.
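For a concrete (and entirely optional) illustration, take $f(x,y) = x^2 - y^2$ at the origin: the Hessian is constant, and its eigenvalues have mixed signs, so the origin is a saddle point. A couple of lines of numpy confirm it:

import numpy as np

# Hessian of f(x, y) = x^2 - y^2: the pure second partials are 2 and -2, the mixed partials are 0.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
print(np.linalg.eigvalsh(hessian))  # [-2.  2.], mixed signs: a saddle point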

But all of this Hessian business isn’t particularly important for us, because most of our applications of the Lagrangian will work with functions where we already know that there is a unique global maximum or minimum. Finding where the gradient is zero is enough. As much as this author stresses the importance of linear algebra, we simply won’t need to compute any eigenvalues for this one.

What we will need to do is look at optimizing functions which are constrained by some equality conditions. This is where Lagrangians come into play.

## Constrained by Equality

Often times we will want to find a minimum or maximum of a function $f(x)$, but we will have additional constraints. The simplest kind is an equality constraint.

For example, we might want to find the maximum of the function $f(x, y, z) = xyz$ requiring that the point $(x,y,z)$ lies on the unit sphere. One could write this in a “canonical form”

maximize $xyz$
subject to $x^2 + y^2 + z^2 = 1$

Way back in the scientific revolution, Fermat discovered a technique to solve such problems that was later generalized by Lagrange. The idea is to combine these constraints into one function whose gradient provides enough information to find a maximum. Clearly such information needs to include two things: that the gradient of $xyz$ is zero, and that the constraint is satisfied.

First we rewrite the constraint as $g(x,y,z) = x^2 + y^2 + z^2 - 1 = 0$, because when we’re dealing with gradients we want things to be zero. Then we form the Lagrangian of the problem. We’ll give a precise definition in a minute, but it looks like this:

$L(x,y,z,\lambda) = xyz + \lambda(x^2 + y^2 + z^2 - 1)$

That is, we’ve added a new variable $\lambda$ and added the two functions together. Let’s see what happens when we take a gradient:

$\displaystyle \frac{\partial L}{\partial x} = yz + \lambda 2x$

$\displaystyle \frac{\partial L}{\partial y} = xz + \lambda 2y$

$\displaystyle \frac{\partial L}{\partial z} = xy + \lambda 2z$

$\displaystyle \frac{\partial L}{\partial \lambda} = x^2 + y^2 + z^2 - 1$

Now if we require the gradient to be zero, the last equation is simply the original constraint, and the first three equations say that $\nabla f (x,y,z) = \lambda \nabla g (x,y,z)$. In other words, we’re saying that the two gradients must point in the same direction for the function to provide a maximum. Solving for where these equations vanish gives some trivial solutions (one variable is $\pm 1$ and the rest zero, and $\lambda = 0$), along with solutions satisfying $x^2 = y^2 = z^2 = 1/3$, and the latter (with the signs chosen so that $xyz > 0$) clearly gives the maximum.
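We can check this mechanically. Here’s a quick sketch using the sympy library (an extra dependency used only for this check): form the Lagrangian, take its gradient, and verify that the claimed maximizer, together with the multiplier $\lambda = -1/(2\sqrt{3})$, makes every component vanish.

import sympy as sp

x, y, z, lam = sp.symbols('x y z lambda', real=True)
L = x * y * z + lam * (x**2 + y**2 + z**2 - 1)
gradient = [sp.diff(L, var) for var in (x, y, z, lam)]

# The claimed maximizer x = y = z = 1/sqrt(3), with multiplier lambda = -1/(2*sqrt(3)).
candidate = {x: 1/sp.sqrt(3), y: 1/sp.sqrt(3), z: 1/sp.sqrt(3), lam: -1/(2*sp.sqrt(3))}
print([sp.simplify(g.subs(candidate)) for g in gradient])  # [0, 0, 0, 0]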

Indeed, this will work in general, and you can see a geometric and analytic proof in these notes.

Specifically, if we have an optimization problem defined by an objective function $f(x)$ to optimize, and a set of $k$ equality constraints $g_i(x) = 0$, then we can form the Lagrangian

$\displaystyle L(x, \lambda_1, \dots, \lambda_k) = f(x) + \sum_{i=1}^k \lambda_i g_i(x)$

And then a theorem of Lagrange is that all optimal solutions $x^*$ to the problem satisfy $\nabla L(x^*, \lambda_1, \dots, \lambda_k) = 0$ for some choice of $\lambda _i$. But then you have to go solve the system and figure out which of the solutions gives you your optimum.

## Convexity

As it turns out, there are some additional conditions you can add to your problem to guarantee your system has a nice solution. One such condition is that $f(x)$ is convex. A function is convex if the line segment between any two points $(x,f(x))$ and $(y,f(y))$ on its graph lies on or above the graph itself. In other words, for all $0 \leq t \leq 1$:

$\displaystyle f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$

Some important examples of convex functions: exponentials, quadratics whose leading coefficient is positive, square norms of a vector variable, and linear functions.

Convex functions have this nice property that they have a unique local minimum value, and hence it must also be the global minimum. Why is this? Well if you have a local minimum $x$, and any other point $y$, then by virtue of being a local minimum there is some $t$ sufficiently close to 1 so that:

$\displaystyle f(x) \leq f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$

And rearranging we get

$\displaystyle (1-t)f(x) \leq (1-t)f(y)$

So $f(x) \leq f(y)$, and since $y$ was arbitrary, $x$ is the global minimum.

This alleviates our problem of having to sort through multiple solutions, and in particular it helps us to write programs to solve optimization problems: we know that techniques like gradient descent will never converge to a false local minimum.
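To make that concrete, here is a toy gradient descent sketch of my own on the convex function $f(w) = \| w - c \|^2$ for a fixed vector $c$: no matter where we start, descent heads to the unique global minimum at $c$.

def gradientDescent(gradient, start, stepSize=0.1, iterations=200):
    # Repeatedly take a small step against the gradient.
    w = list(start)
    for _ in range(iterations):
        g = gradient(w)
        w = [wi - stepSize * gi for (wi, gi) in zip(w, g)]
    return w

c = [3.0, -1.0]
gradient = lambda w: [2 * (wi - ci) for (wi, ci) in zip(w, c)]  # gradient of ||w - c||^2
print(gradientDescent(gradient, start=[0.0, 0.0]))  # approximately [3.0, -1.0]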

That’s all for now! The next question we might shadowily ask: what happens if we add inequality constraints?

# Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

## The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $X, Y$. Here, if we want to be completely formal, $X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $X$, to be the independent variable and the other, say $Y$, to be the dependent variable. All that means is that we are assuming there is a relationship between $X$ and $Y$, and that we intend to use the value of $X$ to predict the value of $Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $X$ could be age, and $Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $X$ corresponds to a proportional change in $Y$, and so one could hope there exist constants $a,b$ so that (as random variables) $Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $X$ and $Y$, and try to estimate the value of $a$ and $b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $\hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $(x,y)$ where $x$ is a value of $X$ and $y$ of $Y$, then the error of our predictor on this example is $|y - \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $y$ value to our prediction for the same $x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $|x| = \sqrt{x^2}$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers. So we can describe our error as $(y - \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $(x_i, y_i)_{i=1}^n$ and a potential prediction line $\hat{f}(x) = ax + b$, we define the error of $\hat{f}$ on the examples to be

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - \hat{f}(x_i))^2$

Which can also be written as

$\displaystyle S(a,b) = \sum_{i=1}^n (y_i - ax_i - b)^2$

Note that since we’re fixing our data sample, the function $S$ is purely a function of the variables $a,b$. Now we want to minimize this quantity with respect to $a,b$, so we can take a gradient,

$\displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i - ax_i - b) x_i$

$\displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i - b)$

and set them simultaneously equal to zero. Using the second equation, we solve for $b$:

$\displaystyle 0 = -2 \sum_{i=1}^n (y_i - ax_i - b) = -2 \left ( \sum_{i=1}^n (y_i - ax_i) - nb \right )$

$\displaystyle b = \frac{1}{n} \sum_{i=1}^n (y_i - ax_i)$

If we denote by $x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $b = y_{\textup{avg}} - ax_{\textup{avg}}$. Substituting $b$ into the other equation we get

$\displaystyle -2 \sum_{i=1}^n (y_ix_i - ax_i^2 - y_{\textup{avg}}x_i + ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $a$, further simplifies to

$\displaystyle 0 = \sum_{i=1}^n (y_ix_i - y_{\textup{avg}}x_i) - a \sum_{i=1}^n (x_i^2 - x_{\textup{avg}}x_i)$

And so

$\displaystyle a = \frac{\sum_{i=1}^n (y_i - y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i - x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0 * sum(L) / len(L)

def bestLinearEstimator(points):
    # points is a list of (x, y) pairs
    xAvg, yAvg = map(avg, zip(*points))

    aNum, aDenom = 0, 0
    for (x, y) in points:
        aNum += (y - yAvg) * x
        aDenom += (x - xAvg) * x

    a = aNum / aDenom
    b = yAvg - a * xAvg
    return (a, b), lambda x: a * x + b


and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

## Many Variables and Matrix Form

If we take those two variables $x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $\mathbf{w} = (b, a)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$\displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $\mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $\mathbf{x} = (x_0, x_1)$. The output is still $y$. Here we understand implicitly that the indices line up: if $w_0$ is the constant term, then that makes $x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $x_1$ is the linear term with coefficient $w_1$. Now we can just write the prediction function as

$\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see its true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $Y$ depending on the $n$ random variables $X_1, \dots, X_n$, then our data will come in tuples of the form $(\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $x_0$ is fixed to 1. Then expanding our line $\mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $\hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $\mathbf{x}_j$ as rows of a matrix $X$ and the output values $y_j$ as entries of a vector $\mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$y - Xw$

Indeed, each entry of the vector $Xw$ is a dot product of a row of $X$ (an input data point) with the coefficients of the line $w$. It’s just $\hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$\displaystyle S(w) = \| y - Xw \|^2$

This is just taking a dot product of $y-Xw$ with itself. This form is awkward to differentiate because the variable $w$ is nested in the norm. Luckily, we can get the same result by viewing $y - Xw$ as an $n$-by-1 matrix (a column vector), transposing it, and multiplying the transpose by $y-Xw$.

$\displaystyle S(w) = (y - Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $S$ with respect to each of the $w_i$ variables in $w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivatives with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$\displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $x$, the partial $\frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.
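For instance, writing $x^{\textup{T}}x = \sum_{i=1}^n x_i^2$, the $j$-th entry of its gradient is

$\displaystyle \frac{\partial}{\partial x_j} \sum_{i=1}^n x_i^2 = 2x_j$

and collecting these entries into a vector gives exactly $2x$.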

Setting the gradient $\partial S / \partial w$ equal to zero and solving for $w$ gives $w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for many input variables and one output variable. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
    # input points are (n+1)-tuples of n inputs and 1 output
    X = array([[1] + list(p[:-1]) for p in points])  # add bias as x_0
    y = array([p[-1] for p in points])

    Xt = transpose(X)
    theInverse = inv(dot(Xt, X))
    w = dot(dot(theInverse, Xt), y)
    return w, lambda x: dot(w, x)


Here are some examples of its use. First we check consistency by verifying that it agrees with the two-variable estimator on the same data points (note the reordering of the coefficients):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> import numpy
>>> noise = lambda: random.random() * 0.4 - 0.2   # any small symmetric noise will do
>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

## Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $X,Y$. If we assume that $Y$ is perfectly determined by $X$ via some linear equation $Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $x$ example, its corresponding $y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $y_i$ would be $y_i = ax_i + b + \eta_i$, and we say all the $\eta_i$ are drawn independently and uniformly at random from the set $\left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $\eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $H$ the random variable producing the noise (in Greek $H$ is the capital letter for $\eta$), and write $Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $\pm 1$, but is rather chosen to be Gaussian with mean $0$ and some variance $\sigma^2$. We’d denote this by $\eta_i \sim N(0, \sigma^2)$, and say the $\eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $y$ are given by $y = Xw + H$ where $X$ is the observed data and $H$ is Gaussian noise with mean zero and some (unknown) standard deviation $\sigma$. Then if we call $\hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$\displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $y$ by $Xw + H$,

$\displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term contains $(X^{\textup{T}}X)^{-1}X^{\textup{T}}X$, a matrix multiplied by its own inverse, which cancels to the identity. By linearity of expectation, we can split the resulting expression up as

$\textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $w$ is constant (so its expected value is just itself) and $\textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $\hat{w}$ is just $w$. Because this is true for all choices of data $X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variances of the entries of $\hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the entries of $\hat{w}$ as variables depending on $X$, we apply the definition:

$\textup{Var}(\hat{w} | X) = \textup{E} \left ( (\hat{w} - \textup{E}(\hat{w}))(\hat{w} - \textup{E}(\hat{w}))^{\textup{T}} \; | \; X \right )$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $H$ is just the diagonal matrix $\sigma^2 I_n$, we get that

$\textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $(X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise in such situations is beyond the scope of this post. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression introduces a little bias into this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.
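Before moving on, here’s a quick empirical sanity check of both claims (a simulation cooked up for this post, not part of the derivation): fix a design matrix $X$ and a true $w$, repeatedly draw Gaussian noise, refit $\hat{w}$, and compare the empirical mean and covariance of the estimates to $w$ and $\sigma^2 (X^{\textup{T}}X)^{-1}$.

import numpy
from numpy.linalg import inv

numpy.random.seed(0)
n, sigma = 50, 0.5
trueW = numpy.array([2.0, -1.0, 3.0])
X = numpy.hstack([numpy.ones((n, 1)), numpy.random.random((n, 2))])  # bias column plus two inputs

pseudoinverse = inv(X.T.dot(X)).dot(X.T)
estimates = numpy.array([pseudoinverse.dot(X.dot(trueW) + numpy.random.normal(0, sigma, size=n))
                         for _ in range(20000)])

print(estimates.mean(axis=0))              # close to trueW, i.e. zero bias
print(numpy.cov(estimates, rowvar=False))  # close to the matrix below
print(sigma**2 * inv(X.T.dot(X)))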

Until then!

# Cauchy-Schwarz Inequality (and Amplification)

Problem: Prove that vectors $v, w$ in an inner product space satisfy the inequality

$\displaystyle |\left \langle v, w \right \rangle | \leq \| v \| \| w \|$

Solution: There is an elementary proof of the Cauchy-Schwarz inequality (see the Wikipedia article), and this proof is essentially the same. What makes this proof stand out is its insightful technique, which I first read about on Terry Tao’s blog. He calls it “textbook,” and maybe it is for an analyst, but it’s still very elegant.

We start by observing another inequality we know to be true, that $\| v - w \|^2 = \left \langle v - w, v - w \right \rangle \geq 0$, since norms are by definition nonnegative. By the properties of a complex inner product we can expand to get

$\displaystyle \| v \|^2 - 2 \textup{Re}(\left \langle v,w \right \rangle) + \| w \|^2 \geq 0$

or equivalently

$\displaystyle \textup{Re}(\left \langle v,w \right \rangle) \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

This inequality is close to the one we’re looking for, but ‘weaker’ because the inequality we seek squeezes inside the inequality we have. That is,

$\displaystyle \textup{Re}(\left \langle v,w \right \rangle) \leq |\left \langle v, w \right \rangle | \leq \| v \| \| w \| \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

The first inequality is trivial (the absolute value of a complex number is at least as big as its real part), the second is the inequality we seek to prove, and the third is a consequence of the arithmetic-geometric mean inequality. And so we have an inequality we’d like to “tighten” to get the true theorem. We do this by tightening each side of the inequality separately, and we do each by exploiting symmetries in the expressions involved.

First, we observe that norms of vectors are preserved by (complex) rotations $v \mapsto e^{i \theta}v$, but the real part is not. Since this inequality is true no matter which vectors we choose, we can choose $\theta$ to our advantage. That is,

$\displaystyle \textup{Re}(\left \langle e^{i \theta}v, w \right \rangle) \leq \frac{1}{2} \| e^{i \theta}v \|^2 + \frac{1}{2} \| w \|^2$

And by properties of inner products and norms (pulling out scalars) and the fact that $|e^{i\theta}| = 1$, we can simplify to

$\displaystyle \textup{Re}(e^{i \theta}\left \langle v,w \right \rangle) \leq \frac{1}{2}\| v \|^2 + \frac{1}{2} \| w \|^2$

where $\theta$ is arbitrary. Since we want to maximize the left hand side as much as possible, we can choose $\theta$ to be whatever is required to make the number real and nonnegative. Then the real part is just the absolute value of the number itself, and we have

$\displaystyle \left |\langle v,w \right \rangle | \leq \frac{1}{2} \| v \|^2 + \frac{1}{2} \| w \|^2$

Now we tighten the right hand side by exploiting a symmetry in inner products: the transformation $(v,w) \mapsto (\lambda v, \frac{1}{\lambda} w)$ preserves the left hand side (since $|\lambda / \bar{\lambda}| = 1$) but not the right. And so by the same reasoning, we can transform the above inequality into

$\displaystyle \left |\langle v,w \right \rangle | \leq \frac{\lambda^2}{2} \| v \|^2 + \frac{1}{2 \lambda^2} \| w \|^2$

And by plugging in $\lambda = \sqrt{\| w \| / \| v \|}$ (indeed, this minimizes the expression for nonzero $v,w$) we get exactly the Cauchy-Schwarz inequality, as desired.

$\square$

This technique is termed “amplification” by Tao, and in his blog post he gives quite a few more advanced examples in harmonic and functional analysis (which are far beyond the scope of this blog).   The asymmetrical symmetry we took advantage of is a sort of “arbitrage” (again Terry’s clever choice of words) to take a weak fact and boost it to a stronger fact. And while the details of this proof are quite trivial, the technique of actively looking for one-sided symmetries is difficult to forget.