# Mathematical Genealogy

As a fun side project to distract me from my abysmal progress on my book, I decided to play around with the math genealogy graph!

For those who don’t know, since 1996, mathematicians, starting with the labor of Harry Coonce et al, have been managing a database of all mathematicians. More specifically, they’ve been keeping track of who everyone’s thesis advisors and subsequent students were. The result is a directed graph (with a current estimated 200k nodes) that details the scientific lineage of mathematicians.

Anyone can view the database online and explore the graph by hand. In it are legends like Gauss, Euler, and Noether, along with the sizes of their descendant subtrees. Here’s little ol’ me.

It’s fun to look at who is in your math genealogy, and I’ve spent more than a few minutes clicking until I get to the top of a tree (since a person can have multiple advisors, finding the top is time consuming), like the sort of random walk that inspired Google’s PageRank and Wikipedia link clicking games.

Inspired by a personalized demo by Colin Wright, I decided it would be fun to scrape the website, get a snapshot of the database, and then visualize and play with the graph. So I did.

Here’s a github repository with the raw data and scraping script. It includes a full json dump of what I scraped as of a few days ago. It’s only ~60MB.

Then, using a combination of tools, I built a rudimentary visualizer. Go play with it!

A subset of my mathematical family tree.

A few notes:

1. It takes about 15 seconds to load before you can start playing. During this time, it loads a compressed version of the database into memory (starting from a mere 5MB). Then it converts the data into a more useful format, builds a rudimentary search index of the names, and displays the ancestors for Gauss.
2. The search index is the main bloat of the program, requiring about a gigabyte of memory to represent. Note that because I’m too lazy to set up a proper server and elasticsearch index, everything in this demo is in Javascript running in your browser. Here’s the github repo for that code.
3. You can drag and zoom the graph.
4. There was a fun little bit of graph algorithms involved in this project, such as finding the closest common ancestor of two nodes. This is happening in a general digraph, not necessarily a tree, so there are some extra considerations. I isolated all the graph algorithms to one file.
5. People with even relatively few descendants generate really wide graphs. This is because each node in the directed graph is assigned to a layer, and the potentially 100+ grandchildren of a single node will all be laid out in the same layer. I haven’t figured out how to constrain the width of the rendered graph (anyone used dagre/dagre-d3?), nor did I try very hard.
6. The dagre layout package used here is a port of the graphviz library. It uses linear programming and the simplex algorithm to determine an optimal layout that penalizes crossed edges and edges that span multiple layers, among other things. Linear programming strikes again! For more details on this, see this paper outlining the algorithm.
7. The scraping algorithm was my first time using Python 3’s asyncio features. The concepts of asynchronous programming are not strange to me, but somehow the syntax of this module is.
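
To give a flavor of the “closest common ancestor” problem from note 4, here is a minimal sketch. It is not the code from the repository. The `advisors` mapping and the names in it are made up, and “closest” is interpreted as minimizing the total number of advisor hops to both people, which is just one reasonable choice in a general digraph. Because a person can have multiple advisors, the search is a breadth-first search upward rather than a walk along a single parent pointer, and there can be more than one answer.

```python
from collections import deque


def ancestor_distances(advisors, start):
    """Breadth-first search upward: {ancestor: fewest advisor hops from start}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in advisors.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def closest_common_ancestors(advisors, a, b):
    """All common ancestors minimizing the total hop count to a and b.

    A node counts as its own ancestor (distance 0).
    """
    da = ancestor_distances(advisors, a)
    db = ancestor_distances(advisors, b)
    common = set(da) & set(db)
    if not common:
        return []
    best = min(da[n] + db[n] for n in common)
    return sorted(n for n in common if da[n] + db[n] == best)


# Hypothetical toy lineage: student -> list of advisors.
advisors = {
    "dana": ["bob", "carol"],
    "erin": ["carol"],
    "bob": ["alice"],
    "carol": ["alice"],
}
result = closest_common_ancestors(advisors, "dana", "erin")
print(result)
```

Here `alice` is also a common ancestor of `dana` and `erin`, but `carol` is strictly closer to both, so only `carol` is returned.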

Feature requests, bugs, or ideas? Open an issue on Github or feel free to contribute a pull request! Enjoy.

# Duality for the SVM

This post is a sequel to Formulating the Support Vector Machine Optimization Problem.

## The Karush-Kuhn-Tucker theorem

Generic optimization problems are hard to solve efficiently. However, optimization problems whose objective and constraints have special structure often succumb to analytic simplifications. For example, if you want to optimize a linear function subject to linear equality constraints, one can compute the Lagrangian of the system and find the zeros of its gradient. More generally, optimizing a linear function subject to linear equality and inequality constraints can be solved using various so-called “linear programming” techniques, such as the simplex algorithm.

However, when the objective is not linear, as is the case with SVM, things get harder. Likewise, if the constraints don’t form a convex set you’re (usually) out of luck from the standpoint of analysis. You have to revert to numerical techniques and cross your fingers. Note that the set of points satisfying a collection of linear inequalities forms a convex set, provided they can all be satisfied.

We are in luck. The SVM problem can be expressed as a so-called “convex quadratic” optimization problem, meaning the objective is a quadratic function and the constraints form a convex set (are linear inequalities and equalities). There is a neat theorem that addresses such, and it’s the “convex quadratic” generalization of the Lagrangian method. The result is due to Karush, Kuhn, and Tucker, (dubbed the KKT theorem) but we will state a more specific case that is directly applicable to SVM.

Theorem [Karush 1939, Kuhn-Tucker 1951]: Suppose you have an optimization problem in $\mathbb{R}^n$ of the following form:

$\displaystyle \min f(x), \text{ subject to } g_i(x) \leq 0, i = 1, \dots, m$

Where $f$ is a differentiable function of the input variables $x$ and $g_1, \dots, g_m$ are affine (degree-1 polynomials). Suppose $z$ is a local minimum of $f$. Then there exist constants (called KKT or Lagrange multipliers) $\alpha_1, \dots, \alpha_m$ such that the following are true. Note the parenthetical labels contain many intentionally undefined terms.

1. $- \nabla f(z) = \sum_{i=1}^m \alpha_i \nabla g_i(z)$ (gradient of Lagrangian is zero)
2. $g_i(z) \leq 0$ for all $i = 1, \dots, m$ (primal constraints are satisfied)
3. $\alpha_i \geq 0$ for all $i = 1, \dots, m$ (dual constraints are satisfied)
4. $\alpha_i g_i(z) = 0$ for all $i = 1, \dots, m$ (complementary slackness conditions)
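
To make the four conditions concrete, here is a small sketch on a toy problem that is not from this post: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \leq 0$. The minimum is at $z = 1$, and the first condition forces $\alpha = 2$. The function names and the tolerance are choices made for illustration.

```python
def f_grad(x):
    return 2 * x      # gradient of f(x) = x^2


def g(x):
    return 1 - x      # the constraint g(x) <= 0, i.e. x >= 1


def g_grad(x):
    return -1.0       # gradient of g


def kkt_conditions(z, alpha, tol=1e-9):
    """Evaluate each of the four KKT conditions at the candidate (z, alpha)."""
    return {
        "stationarity": abs(-f_grad(z) - alpha * g_grad(z)) <= tol,
        "primal_feasible": g(z) <= tol,
        "dual_feasible": alpha >= -tol,
        "complementary_slackness": abs(alpha * g(z)) <= tol,
    }


at_optimum = kkt_conditions(z=1.0, alpha=2.0)
not_optimal = kkt_conditions(z=2.0, alpha=0.0)
print(at_optimum)
print(not_optimal)
```

At the feasible but non-optimal point $z = 2$, stationarity would need $\alpha = 4$ while complementary slackness would need $\alpha = 0$, so no multiplier can satisfy all four conditions at once.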

We’ll discuss momentarily how to interpret these conditions, but first a few asides. A large chunk of the work in SVMs is converting the original, geometric problem statement, that of maximizing the margin of a linear separator, into a form suitable for this theorem. We did that last time. However, the conditions of this theorem also provide the structure for a more analytic algorithm, the Sequential Minimal Optimization algorithm, which allows us to avoid numerical methods. We’ll see how this works explicitly next time when we implement SMO.

You may recall that for the basic Lagrangian, each constraint in the optimization problem corresponds to one Lagrange multiplier, and hence one term of the Lagrangian. Here it’s largely the same: each constraint in the SVM problem (and hence each training point) corresponds to a KKT multiplier. In addition, each KKT multiplier corresponds to a constraint for a new optimization problem that this theorem implicitly defines (called the dual problem). So the pseudocode of the Sequential Minimal Optimization algorithm is to start with some arbitrary separating hyperplane $w$, and find any training point $x_j$ that corresponds to a violated constraint. Fix $w$ so it works for $x_j$, and repeat until you can’t find any more violated constraints.

Now to interpret those four conditions. The difficulty in this part of the discussion is in the notion of primal/dual problems. The “original” optimization problem is often called the “primal” problem. While a “primal problem” can be either a minimization or a maximization (and there is a corresponding KKT theorem for each) we’ll use the one of the form:

$\displaystyle \min f(x), \text{ subject to } g_i(x) \leq 0, i = 1, \dots, m$

Next we define a corresponding “dual” optimization problem, which is a maximization problem whose objective and constraints are related to the primal in a standard, but tedious-to-write-down way. In general, this dual maximization problem has the guarantee that its optimal solution (a max) is a lower bound on the optimal solution for the primal (a min). This can be useful in many settings. In the most pleasant settings, including SVM, you get an even stronger guarantee, that the optimal solutions for the primal and dual problems have equal objective value. That is, the bound that the dual objective provides on the primal optimum is tight. In that case, the primal and dual are two equivalent perspectives on the same problem. Solving the dual provides a solution to the primal, and vice versa.

The KKT theorem implicitly defines a dual problem, which can only possibly be clear from the statement of the theorem if you’re intimately familiar with duals and Lagrangians already. This dual problem has variables $\alpha = (\alpha_1, \dots, \alpha_m)$, one entry for each constraint of the primal. For KKT, the dual constraints are simply non-negativity of the variables

$\displaystyle \alpha_j \geq 0 \text{ for all } j$

And the objective for the dual is this nasty beast

$\displaystyle d(\alpha) = \inf_{x} L(x, \alpha)$

where $L(x, \alpha)$ is the generalized Lagrangian (which is simpler in this writeup because the primal has no equality constraints), defined as:

$\displaystyle L(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x)$
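
As a sanity check on these definitions, here is a sketch for a one-variable toy problem that is not from this post: minimize $f(x) = x^2$ subject to $1 - x \leq 0$. For this particular $f$, the infimum over $x$ can be found by hand (set $\partial L / \partial x = 2x - \alpha = 0$), which gives $d(\alpha) = \alpha - \alpha^2/4$. The grid of $\alpha$ values is an arbitrary choice for illustration.

```python
def lagrangian(x, alpha):
    # L(x, alpha) = f(x) + alpha * g(x) with f(x) = x^2 and g(x) = 1 - x
    return x ** 2 + alpha * (1 - x)


def dual(alpha):
    # For this f, the infimum over x is attained where 2x - alpha = 0.
    return lagrangian(alpha / 2, alpha)


primal_optimum = 1.0                       # f(1) = 1, attained on the boundary x = 1
alphas = [k / 100 for k in range(0, 501)]  # grid of dual-feasible values in [0, 5]
dual_values = [dual(a) for a in alphas]
best_alpha = max(alphas, key=dual)

print(best_alpha, dual(best_alpha), primal_optimum)
```

Every dual value on the grid is at most the primal optimum (the lower bound), and the best one matches it, which is the pleasant “tight bound” situation.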

While a proper discussion of primality and duality could fill a book, we’ll have to leave it at that. If you want to journey deeper into this rabbit hole, these notes give a great introduction from the perspective of the classical Lagrangian, without any scarring.

But we can begin to see why the KKT conditions are the way they are. The first requires the generalized Lagrangian has gradient zero. Just like with classical Lagrangians, this means the primal objective is at a local minimum. The second requires the constraints of the primal problem to be satisfied. The third does the same for the dual constraints. The fourth is the interesting one, because it says that at an optimal solution, the primal and dual constraints are intertwined.

4. $\alpha_i g_i(z) = 0$ for all $i = 1, \dots, m$ (complementary slackness conditions)

More specifically, these “complementary slackness” conditions require that for each $i$, either the dual constraint $\alpha_i \geq 0$ is actually tight ($\alpha_i = 0$), or else the primal constraint $g_i$ is tight. At least one of the two must be exactly at the limit (equal to zero, not strictly less than). The “product equals zero means one factor is zero” trick comes in handy here to express an OR, despite haunting generations of elementary algebra students. In terms of the SVM problem, complementary slackness translates to the fact that, for the optimal separating hyperplane $w$, if a data point doesn’t have functional margin exactly 1, then that data point isn’t a support vector. Indeed, when $\alpha_i = 0$ we’ll see in the next section how that affects the corresponding training point $x_i$.

## The nitty gritty for SVM

Now that we’ve recast the SVM into a form suitable for the KKT theorem, let’s compute the dual and understand how these dual constraints are related to the optimal solution of the primal SVM problem.

The primal problem statement is

$\displaystyle \min_{w} \frac{1}{2} \| w \|^2$

Subject to the constraints that all $m$ training points $x_1, \dots, x_m$ with training labels $y_1, \dots, y_m$ satisfy

$\displaystyle (\langle w, x_i \rangle + b) \cdot y_i \geq 1$

Which we can rewrite as

$\displaystyle 1 - (\langle w, x_i \rangle + b) \cdot y_i \leq 0$

The generalized Lagrangian is

$\displaystyle \begin{aligned} L(w, b, \alpha) &= \frac{1}{2} \| w \|^2 + \sum_{j=1}^m \alpha_j(1-y_j \cdot (\langle w, x_j \rangle + b)) \\ &= \frac{1}{2} \| w \|^2 + \sum_{j=1}^m \alpha_j - \sum_{j=1}^m \alpha_j y_j \cdot \langle w, x_j \rangle - \sum_{j=1}^m \alpha_j y_j b \end{aligned}$

We can compute each component of the gradient $\nabla L$, indexed by the variables $w_i, b,$ and $\alpha_j$. First, since this simplifies the Lagrangian a bit, we compute $\frac{\partial L}{\partial b}$.

$\displaystyle \frac{\partial L}{\partial b} = -\sum_{j=1}^m y_j \alpha_j$

The condition that the gradient is zero implies this entry is zero, i.e. $\sum_{j=1}^m y_j \alpha_j = 0$. In particular, and this will be a helpful reminder for next post, we could add this constraint to the dual problem formulation without changing the optimal solution, allowing us to remove the term $b \sum_{j=1}^m y_j \alpha_j$ from the Lagrangian since it’s zero. We will use this reminder again when we implement the Sequential Minimal Optimization algorithm next time.

Next, the individual components $w_i$ of $w$.

$\displaystyle \frac{\partial L}{\partial w_i} = w_i - \sum_{j=1}^m \alpha_j y_j x_{j,i}$

Note that $x_{j,i}$ is the $i$-th component of the $j$-th training point $x_j$, since this is the only part of the expression $w \cdot x_j$ that involves $w_i$.

Setting all these equal to zero means we require $w = \sum_{j=1}^m \alpha_j y_j x_j$. This is interesting! The optimality criterion, that the gradient of the Lagrangian must be zero, actually shows us how to write the optimal solution $w$ in terms of the Lagrange multipliers $\alpha_j$ and the training data/labels. It also hints at the fact that, because of this complementary slackness condition, many of the $\alpha_i$ will turn out to be zero, and hence the optimal solution can be written as a sparse sum of the training examples.

And, now that we have written $w$ in terms of the $\alpha_j$, we can eliminate $w$ in the formula for the Lagrangian and get a dual optimization objective only in terms of the $\alpha_j$. Substituting (and combining the resulting two double sums whose coefficients are $\frac{1}{2}$ and $-1$), we get

$\displaystyle L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
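
Here is a small sketch that evaluates these formulas on a made-up dataset. The three points, their labels, and the multipliers `alphas` (worked out by hand for this toy set) are all assumptions for illustration. It recovers $w$ as the weighted sum of training points and checks that the dual objective agrees with the primal objective $\frac{1}{2} \| w \|^2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical training set: two points near the boundary and one far away.
points = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
labels = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]  # hand-computed multipliers for this toy set


def recover_w(alphas, labels, points):
    """w = sum_j alpha_j * y_j * x_j, computed one coordinate at a time."""
    dim = len(points[0])
    return [
        sum(a * y * x[i] for a, y, x in zip(alphas, labels, points))
        for i in range(dim)
    ]


def dual_objective(alphas, labels, points):
    """sum_j alpha_j - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>"""
    m = len(points)
    double_sum = sum(
        alphas[i] * alphas[j] * labels[i] * labels[j] * dot(points[i], points[j])
        for i in range(m)
        for j in range(m)
    )
    return sum(alphas) - 0.5 * double_sum


w = recover_w(alphas, labels, points)
primal_value = 0.5 * dot(w, w)
dual_value = dual_objective(alphas, labels, points)
print(w, primal_value, dual_value)
```

The third point has multiplier zero, so it contributes nothing to $w$.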

Again foreshadowing, the fact that this form only depends on the inner products of the training points will allow us to replace the standard (linear) inner product with a nonlinear “inner-product-like” function, called a kernel, that will allow us to introduce nonlinearity into the decision boundary.

Now back to differentiating the Lagrangian. For the remaining entries of the gradient, where the variable is a KKT multiplier, the derivative is exactly the left-hand side of a primal constraint, so requiring it to be non-positive coincides with the requirement that the constraints of the primal are satisfied:

$\displaystyle \frac{\partial L}{\partial \alpha_j} = 1 - y_j (\langle w, x_j \rangle + b) \leq 0$

Next, the KKT theorem says that one also needs feasibility of the dual:

$\displaystyle \alpha_j \geq 0 \text{ for all } j$

And finally the complementary slackness conditions,

$\displaystyle \alpha_j (1 - y_j (\langle w, x_j \rangle + b)) = 0 \text{ for all } j = 1, \dots, m$
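
To see these conditions on actual numbers, here is a sketch that checks them for a made-up three-point dataset. The hyperplane `w, b` and multipliers `alphas` are hand-computed assumptions for this toy set, not the output of any solver. The points with nonzero multiplier are exactly the ones whose functional margin is 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical data with a hand-computed candidate solution.
points = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
labels = [1, -1, 1]
w, b = (1.0, 0.0), 0.0
alphas = [0.5, 0.5, 0.0]
tol = 1e-9

# Functional margin y_j * (<w, x_j> + b) of each training point.
margins = [y * (dot(w, x) + b) for x, y in zip(points, labels)]

primal_ok = all(m >= 1 - tol for m in margins)
dual_ok = all(a >= -tol for a in alphas)
balance_ok = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
slackness_ok = all(abs(a * (1 - m)) <= tol for a, m in zip(alphas, margins))

# Indices of points with a nonzero multiplier.
support = [j for j, a in enumerate(alphas) if a > tol]
print(margins, support)
```

The third point has functional margin 3, so its primal constraint is slack and complementary slackness forces its multiplier to zero.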

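To be completely clear, the dual problem for the SVM is to maximize the infimum of the generalized Lagrangian over the primal variables (written generically as $x$ here; for the SVM these are $w$ and $b$):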

$\displaystyle \max_{\alpha} (\inf_x L(x, \alpha))$

subject to the non-negativity constraints:

$\displaystyle \alpha_i \geq 0$

And the one (superfluous reminder) equality constraint

$\displaystyle \sum_{j=1}^m y_j \alpha_j = 0$

All of the equality constraints above (gradient of the Lagrangian being zero, complementary slackness, and this reminder constraint) are consequences of the KKT theorem.

At this point, we’re ready to derive and implement the Sequential Minimal Optimization Algorithm and run it on some data. We’ll do that next time.

# Formulating the Support Vector Machine Optimization Problem

## The hypothesis and the setup

This blog post has an interactive demo (mostly used toward the end of the post). The source for this demo is available in a Github repository.

Last time we saw how the inner product of two vectors gives rise to a decision rule: if $w$ is the normal to a line (or hyperplane) $L$, the sign of the inner product $\langle x, w \rangle$ tells you whether $x$ is on the same side of $L$ as $w$.

Let’s translate this to the parlance of machine learning. Let $x \in \mathbb{R}^n$ be a training data point, and let $y \in \{ 1, -1 \}$ be its label (green and red, in the images in this post). Suppose you want to find a hyperplane which separates all the points with -1 labels from those with +1 labels (assume for the moment that this is possible). For this and all examples in this post, we’ll use data in two dimensions, but the math will apply to any dimension.

Some data labeled red and green, which is separable by a hyperplane (line).

The hypothesis we’re proposing to separate these points is a hyperplane, i.e. a linear subspace that splits all of $\mathbb{R}^n$ into two halves. The data that represents this hyperplane is a single vector $w$, the normal to the hyperplane, so that the hyperplane is defined by the solutions to the equation $\langle x, w \rangle = 0$.

As we saw last time, $w$ encodes the following rule for deciding if a new point $z$ has a positive or negative label.

$\displaystyle h_w(z) = \textup{sign}(\langle w, z \rangle)$

You’ll notice that this formula only works for the normals $w$ of hyperplanes that pass through the origin, and generally we want to work with data that can be shifted elsewhere. We can resolve this by adding a fixed term $b \in \mathbb{R}$ (often called a bias because statisticians came up with it) so that the shifted hyperplane is the set of solutions to $\langle x, w \rangle + b = 0$. The shifted decision rule is:

$\displaystyle h_w(z) = \textup{sign}(\langle w, z \rangle + b)$

Now the hypothesis is the pair of vector-and-scalar $w, b$.
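
The shifted decision rule is short enough to write down directly. This is a minimal sketch: the hyperplane and test points are made up, and sending points that lie exactly on the hyperplane to $+1$ is an arbitrary tie-breaking choice.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def classify(w, b, z):
    """Return +1 or -1 depending on which side of the hyperplane z lies.

    Points exactly on the hyperplane are sent to +1 (an arbitrary choice).
    """
    return 1 if dot(w, z) + b >= 0 else -1


# Hypothetical hypothesis: the line x + y = 1 in the plane.
w, b = (1.0, 1.0), -1.0
above = classify(w, b, (2.0, 2.0))   # 2 + 2 - 1 = 3, positive side
below = classify(w, b, (0.0, 0.0))   # 0 + 0 - 1 = -1, negative side
print(above, below)
```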

The key intuitive idea behind the formulation of the SVM problem is that there are many possible separating hyperplanes for a given set of labeled training data. For example, here is a gif showing infinitely many choices.

The question is: how can we find the separating hyperplane that not only separates the training data, but generalizes as well as possible to new data? The assumption of the SVM is that a hyperplane which separates the points, but is also as far away from any training point as possible, will generalize best.

While contrived, it’s easy to see that the separating hyperplane is as far as possible from any training point.

More specifically, fix a labeled dataset of points $(x_i, y_i)$, or more precisely:

$\displaystyle D = \{ (x_i, y_i) \mid i = 1, \dots, m, x_i \in \mathbb{R}^{n}, y_i \in \{1, -1\} \}$

And a hypothesis defined by the normal $w \in \mathbb{R}^{n}$ and a shift $b \in \mathbb{R}$. Let’s also suppose that $(w,b)$ defines a hyperplane that correctly separates all the training data into the two labeled classes, and we just want to measure its quality. That measure of quality is the length of its margin.

Definition: The geometric margin of a hyperplane $w$ with respect to a dataset $D$ is the shortest distance from a training point $x_i$ to the hyperplane defined by $w$.

The best hyperplane has the largest possible margin.

This margin can even be computed quite easily using our work from last post. The distance from $x$ to the hyperplane defined by $w$ is the same as the length of the projection of $x$ onto $w$. And this is just computed by an inner product.

If the tip of the $x$ arrow is the point in question, then $a$ is the dot product, and $b$ the distance from $x$ to the hyperplane $L$ defined by $w$.

## A naive optimization objective

If we wanted to, we could stop now and define an optimization problem that would be very hard to solve. It would look like this:

$\displaystyle \begin{aligned} & \max_{w} \min_{x_i} \left | \left \langle x_i, \frac{w}{\|w\|} \right \rangle + b \right | & \\ \textup{subject to \ \ } & \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i) & \textup{ for every } i = 1, \dots, m \end{aligned}$

The formulation is hard. The reason is it’s horrifyingly nonlinear. In more detail:

1. The constraints are nonlinear due to the sign comparisons.
2. There’s a min and a max! A priori, we have to do this because we don’t know which point is going to be the closest to the hyperplane.
3. The objective is nonlinear in two ways: it takes an absolute value, and the projection requires you to take a norm and divide.

The rest of this post (and indeed, a lot of the work in grokking SVMs) is dedicated to converting this optimization problem to one in which the constraints are all linear inequalities and the objective is a single, quadratic polynomial we want to minimize or maximize.

Along the way, we’ll notice some neat features of the SVM.

## Trick 1: linearizing the constraints

To solve the first problem, we can use a trick. We want to know whether $\textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i)$ for a labeled training point $(x_i, y_i)$. The trick is to multiply them together. If their signs agree, then their product will be positive, otherwise it will be negative.

So each constraint becomes:

$\displaystyle (\langle x_i, w \rangle + b) \cdot y_i \geq 0$

This is still linear because $y_i$ is a constant (input) to the optimization problem. The variables are the coefficients of $w$.

The left hand side of this inequality is often called the functional margin of a training point, since, as we will see, it still works to classify $x_i$, even if $w$ is scaled so that it is no longer a unit vector. Indeed, the sign of the inner product is independent of how $w$ is scaled.
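
Here is a quick sketch of the functional margin on made-up data, showing that scaling $w$ (and $b$ along with it) changes the size of each functional margin but never its sign. The points, labels, and hyperplane are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b, points, labels):
    """y_i * (<x_i, w> + b) for each labeled training point."""
    return [y * (dot(x, w) + b) for x, y in zip(points, labels)]


# Hypothetical separable data.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -1.0), (-2.0, 0.5)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.5), 0.0

margins = functional_margins(w, b, points, labels)
scaled = functional_margins((10.0, 5.0), 10 * b, points, labels)

# A point with the wrong label gets a negative functional margin.
mislabeled = functional_margins(w, b, [(2.0, 1.0)], [-1])
print(margins, scaled, mislabeled)
```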

## Trick 1.5: the optimal solution is midway between classes

This small trick is to notice that if $w$ is the supposed optimal separating hyperplane, i.e. its margin is maximized, then it must necessarily be exactly halfway in between the closest points in the positive and negative classes.

In other words, if $x_+$ and $x_-$ are the closest points in the positive and negative classes, respectively, then $\langle x_{+}, w \rangle + b = -(\langle x_{-}, w \rangle + b)$. If this were not the case, then you could adjust the bias, shifting the decision boundary along $w$ until they are exactly equal, and you will have increased the margin. The closest point, say $x_+$, will have gotten farther away, and the closest point in the opposite class, $x_-$, will have gotten closer, but will not be closer than $x_+$.

## Trick 2: getting rid of the max + min

Resolving this problem essentially uses the fact that the hypothesis, which comes in the form of the normal vector $w$, has a degree of freedom in its length. To explain the details of this trick, we’ll set $b=0$ which simplifies the intuition.

Indeed, in the animation below, I can increase or decrease the length of $w$ without changing the decision boundary.

I have to keep my hand very steady (because I was too lazy to program it so that it only increases/decreases in length), but you can see the point. The line is perpendicular to the normal vector, and it doesn’t depend on the length.

Let’s combine this with tricks 1 and 1.5. If we increase the length of $w$, that means the absolute values of the dot products $\langle x_i, w \rangle$ used in the constraints will all increase by the same factor (without changing their sign). Indeed, for any vector $a$ we have $\langle a, w \rangle = \|w \| \cdot \langle a, w / \| w \| \rangle$.

In this world, the inner product measurement of distance from a point to the hyperplane is no longer faithful. The true distance is $\langle a, w / \| w \| \rangle$, but the distance measured by $\langle a, w \rangle$ is measured in units of $1 / \| w \|$.

In this example, the two numbers next to the green dot represent the true distance of the point from the hyperplane, and the dot product of the point with the normal (respectively). The dashed lines are the solutions to <x, w> = 1. The magnitude of w is 2.2, the inverse of that is 0.46, and indeed 2.2 = 4.8 * 0.46 (we’ve rounded the numbers).

Now suppose we had the optimal hyperplane and its normal $w$. No matter how near (or far) the nearest positively labeled training point $x$ is, we could scale the length of $w$ to force $\langle x, w \rangle = 1$. This is the core of the trick. One consequence is that the actual distance from $x$ to the hyperplane is $\frac{1}{\| w \|} = \langle x, w / \| w \| \rangle$.

The same as above, but with the roles reversed. We’re forcing the inner product of the point with w to be 1. The true distance is unchanged.

In particular, if we force the closest point to have inner product 1, then all other points will have inner product at least 1. This has two consequences. First, our constraints change to $\langle x_i, w \rangle \cdot y_i \geq 1$ instead of $\geq 0$. Second, we no longer need to ask which point is closest to the candidate hyperplane! Because after all, we never cared which point it was, just how far away that closest point was. And now we know that it’s exactly $1 / \| w \|$ away. Indeed, if the optimal points weren’t at that distance, then that means the closest point doesn’t exactly meet the constraint, i.e. that $\langle x, w \rangle > 1$ for every training point $x$. We could then scale $w$ shorter until $\langle x, w \rangle = 1$, hence increasing the margin $1 / \| w \|$.

In other words, the coup de grâce, provided all the constraints are satisfied, the optimization objective is just to maximize $1 / \| w \|$, a.k.a. to minimize $\| w \|$.
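
The rescaling trick can be checked numerically. In this sketch (with $b = 0$, as in the discussion above, and made-up data and normal vector), we shrink $w$ until the closest point has inner product exactly 1, and confirm that the true distance of that closest point to the hyperplane is then $1 / \| w \|$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(w):
    return math.sqrt(dot(w, w))


# Hypothetical separable data and a separating normal through the origin (b = 0).
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -1.0), (-2.0, 0.5)]
labels = [1, 1, -1, -1]
w = (3.0, 4.0)

functional = [y * dot(x, w) for x, y in zip(points, labels)]
smallest = min(functional)                 # the closest point's inner product
geometric_margin = smallest / norm(w)      # its true distance to the hyperplane

# Rescale w so that the closest point has inner product exactly 1.
w_scaled = tuple(wi / smallest for wi in w)
rescaled = [y * dot(x, w_scaled) for x, y in zip(points, labels)]

print(geometric_margin, 1 / norm(w_scaled), min(rescaled))
```

The decision boundary is the same before and after rescaling. Only the units of the inner product have changed.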

This intuition is clear from the following demonstration, which you can try for yourself. In it I have a bunch of positively and negatively labeled points, and the line in the center is the candidate hyperplane with normal $w$ that you can drag around. Each training point has two numbers next to it. The first is the true distance from that point to the candidate hyperplane; the second is the inner product with $w$. The two blue dashed lines are the solutions to $\langle x, w \rangle = \pm 1$. To solve the SVM by hand, you have to ensure the second number is at least 1 for all green points, at most -1 for all red points, and then you have to make $w$ as short as possible. As we’ve discussed, shrinking $w$ moves the blue lines farther away from the separator, but in order to satisfy the constraints the blue lines can’t go further than any training point. Indeed, the optimum will have those blue lines touching a training point on each side.

I bet you enjoyed watching me struggle to solve it. And while it’s probably not the optimal solution, the idea should be clear.

The final note is that, since we are now minimizing $\| w \|$, a formula which includes a square root, we may as well minimize its square $\| w \|^2 = \sum_j w_j^2$. We will also multiply the objective by $1/2$, because when we eventually analyze this problem we will take a derivative, and the square in the exponent and the $1/2$ will cancel.

## The final form of the problem

Our optimization problem is now the following (including the bias again):

$\displaystyle \begin{aligned} & \min_{w} \frac{1}{2} \| w \|^2 & \\ \textup{subject to \ \ } & (\langle x_i, w \rangle + b) \cdot y_i \geq 1 & \textup{ for every } i = 1, \dots, m \end{aligned}$

This is much simpler to analyze. The constraints are all linear inequalities (which, because of linear programming, we know are tractable to optimize). The objective to minimize, however, is a convex quadratic function of the input variables—a sum of squares of the inputs.
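
To see the final form in action, here is a sketch that evaluates a candidate $(w, b)$ against the constraints and the objective. It does not solve the problem; it only scores candidates. The dataset and candidates are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_primal(w, b, points, labels, tol=1e-9):
    """Return (feasible, objective) for a candidate hyperplane (w, b)."""
    feasible = all(
        y * (dot(x, w) + b) >= 1 - tol for x, y in zip(points, labels)
    )
    return feasible, 0.5 * dot(w, w)


# Hypothetical training data.
points = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
labels = [1, -1, 1]

good = svm_primal((1.0, 0.0), 0.0, points, labels)       # feasible, short w
wasteful = svm_primal((2.0, 0.0), 0.0, points, labels)   # feasible, but w is longer
too_short = svm_primal((0.5, 0.0), 0.0, points, labels)  # violates a constraint
print(good, wasteful, too_short)
```

Among feasible candidates the one with the smaller objective has the larger margin, and shrinking $w$ further eventually breaks a constraint.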

Such problems are generally called quadratic programming problems (or QPs, for short). There are general methods to find solutions! However, they often suffer from numerical stability issues and have less-than-satisfactory runtime. Luckily, the form in which we’ve expressed the support vector machine problem is specific enough that we can analyze it directly, and find a way to solve it without appealing to general-purpose numerical solvers.

We will tackle this problem in a future post (planned for two posts after this one). Before we close, let’s just make a few more observations about the solution to the optimization problem.

## Support Vectors

In Trick 1.5 we saw that the optimal separating hyperplane has to be exactly halfway between the two closest points of opposite classes. Moreover, we noticed that, provided we’ve scaled $\| w \|$ properly, these closest points (there may be multiple for positive and negative labels) have to be exactly “distance” 1 away from the separating hyperplane.

Another way to phrase this without putting “distance” in scare quotes is to say that, if $w$ is the normal vector of the optimal separating hyperplane, the closest points lie on the two lines $\langle x_i, w \rangle + b = \pm 1$.

Now that we have some intuition for the formulation of this problem, it isn’t a stretch to realize the following. While a dataset may include many points from either class, the optimal hyperplane itself depends only on the closest points, the ones lying on the two lines $\langle x_i, w \rangle + b = \pm 1$, and not on any of the other points.

This fact is enough to give these closest points a special name: the support vectors.

We’ll actually prove that support vectors “are all you need” with full rigor and detail next time, when we cast the optimization problem in this post into the “dual” setting. To avoid vague names, the formulation described in this post is called the “primal” problem. The dual problem is derived from the primal problem, with special variables and constraints chosen based on the primal variables and constraints. Next time we’ll describe in brief detail what the dual does and why it’s important, but we won’t have nearly enough time to give a full understanding of duality in optimization (such a treatment would fill a book).

When we compute the dual of the SVM problem, we will see explicitly that the hyperplane can be written as a linear combination of the support vectors. As such, once you’ve found the optimal hyperplane, you can compress the training set into just the support vectors, and reproducing the same optimal solution becomes much, much faster. You can also use the support vectors to augment the SVM to incorporate streaming data (throw out all non-support vectors after every retraining).

Eventually, when we get to implementing the SVM from scratch, we’ll see all this in action.

Until then!

# Testing Polynomial Equality

Problem: Determine if two polynomial expressions represent the same function. Specifically, if $p(x_1, x_2, \dots, x_n)$ and $q(x_1, x_2, \dots, x_n)$ are polynomials with inputs, outputs and coefficients in a field $F$, where $|F|$ is sufficiently large, then the problem is to determine if $p(\mathbf{x}) = q(\mathbf{x})$ for every $\mathbf{x} \in F^n$, in time polynomial in the number of bits required to write down $p$ and $q$.

Solution: Let $d$ be the maximum degree of all terms in $p, q$. Choose a finite set $S \subset F$ with $|S| > 2d$. Repeat the following process 100 times:

1. Choose inputs $z_1, z_2, \dots, z_n \in S$ uniformly at random.
2. Check if $p(z_1, \dots, z_n) = q(z_1, \dots, z_n)$.

If every single time the two polynomials agree, accept the claim that they are equal. If they disagree on any input, reject. You will be wrong with probability at most $2^{-100}$.
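
The procedure above can be sketched in a few lines of Python. The assumptions here are mine: the field is the integers modulo the prime $2^{61} - 1$, the set $S$ is $\{0, 1, \dots, 2d\}$, and the polynomials are given as black-box Python functions with integer coefficients (which are assumed not to differ only by multiples of the prime).

```python
import random

P = 2**61 - 1  # a prime; arithmetic is done in the integers mod P


def probably_equal(p, q, num_vars, degree, trials=100, rng=random):
    """Randomized identity test for black-box polynomials p and q.

    Returns False as soon as a disagreement is found (a definite answer),
    and True if all trials agree (correct with high probability).
    """
    sample_size = 2 * degree + 1  # S = {0, ..., 2d}, so |S| > 2d
    assert sample_size <= P
    for _ in range(trials):
        z = [rng.randrange(sample_size) for _ in range(num_vars)]
        if (p(*z) - q(*z)) % P != 0:
            return False
    return True


def p(x, y):
    return (x + y) ** 2


def q(x, y):
    return x * x + 2 * x * y + y * y


def r(x, y):
    return x * x + y * y


same = probably_equal(p, q, num_vars=2, degree=2)
different = probably_equal(p, r, num_vars=2, degree=2)
print(same, different)
```

Note the asymmetry: a `False` answer is always correct, since it comes with an input on which the two polynomials disagree. Only a `True` answer carries the tiny chance of error.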

Discussion: At first glance it’s unclear why this problem is hard.

If you have two representations of polynomials $p, q$, say expressed in algebraic notation, why can’t you just do the algebra to convert them both into the same format, and see if they’re equal?

Unfortunately, that conversion can take exponential time. For example, suppose you have a polynomial $p(x) = (x+1)^{1000}$. Though it only takes a few bits to write down, expressing it in a “canonical form,” often in the monomial form $a_0 + a_1x + \dots + a_d x^d$, would require exponentially many bits in the original representation. In general, it’s unknown how to algorithmically transform polynomials into a “canonical form” (so that they can be compared) in subexponential time.
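
To get a feel for the blow-up, this sketch expands $(x+1)^{1000}$ using the binomial theorem and compares the size of the compact expression with the size of the monomial form. Measuring the compact form in characters and the expanded form in bits is a rough comparison chosen for illustration.

```python
from math import comb

exponent = 1000
compact = "(x+1)^1000"

# Coefficients a_0, ..., a_1000 of the monomial form, by the binomial theorem.
coefficients = [comb(exponent, k) for k in range(exponent + 1)]

compact_chars = len(compact)
expanded_bits = sum(c.bit_length() for c in coefficients)
print(compact_chars, len(coefficients), expanded_bits)
```

The compact expression grows with the number of digits of the exponent, while the monomial form needs one coefficient per power of $x$, and the middle coefficients are themselves enormous.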

Instead, the best we know how to do is treat the polynomials as black boxes and plug values into them.

Indeed, for single variable polynomials it’s well known that a nonzero degree $d$ polynomial has at most $d$ roots. A similar result is true for polynomials with many variables, and so we can apply that result to the polynomial $p - q$ to determine if $p = q$. This theorem is so important (and easy to prove) that it deserves the name of lemma.

The Schwartz-Zippel lemma. Let $p$ be a nonzero polynomial of total degree $d \geq 0$ over a field $F$. Let $S$ be a finite subset of $F$ and let $z_1, \dots, z_n$ be chosen uniformly at random from $S$. The probability that $p(z_1, \dots, z_n) = 0$ is at most $d / |S|$.
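As a sanity check on the statement (not a proof), one can count zeros exhaustively for small cases and compare against $d / |S|$. The sketch below assumes the field is the integers modulo the prime 101 and takes $S = \{0, \dots, 9\}$; both choices, and the function name `zero_fraction`, are only for illustration.

```python
from fractions import Fraction
from itertools import product


def zero_fraction(poly, num_vars, size, modulus):
    """Exact fraction of points in S^n, with S = {0, ..., size - 1}, where poly is zero mod modulus."""
    zeros = sum(
        1
        for z in product(range(size), repeat=num_vars)
        if poly(*z) % modulus == 0
    )
    return Fraction(zeros, size**num_vars)


# Two polynomials of total degree d = 2, so the lemma's bound is d / |S| = 2/10.
bound = Fraction(2, 10)

xy = zero_fraction(lambda x, y: x * y, 2, 10, 101)
parabola = zero_fraction(lambda x, y: x * x - y, 2, 10, 101)

print(xy, xy <= bound)              # 19/100 True
print(parabola, parabola <= bound)  # 1/25 True
```

The polynomial $xy$ vanishes on 19 of the 100 points (whenever $x = 0$ or $y = 0$), which comes close to the bound of $2/10$, while $x^2 - y$ vanishes on only 4 of them.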

Proof. By induction on the number of variables $n$. For the case of $n=1$, it’s the usual fact that a single-variable polynomial can have at most $d$ roots. Now for the inductive step, assume this is true for all polynomials with $n-1$ variables, and we will prove it for $n$ variables. Write $p$ as a polynomial in the variable $x_1$, whose coefficients are other polynomials:

$\displaystyle p(x_1, \dots, x_n) = \sum_{k=0}^d Q_k(x_2, \dots, x_n) x_1^k$

Here we’ve grouped $p$ by the powers of $x_1$, so that $Q_k$ is the coefficient of $x_1^k$. This is useful because we’ll be able to apply the inductive hypothesis to one of the $Q_k$’s, which have fewer variables.

Since $p$ is not the zero polynomial, some $Q_k$ must be nonzero. If the only nonzero $Q_k$ is $Q_0$, then we’re done because $p$ doesn’t depend on $x_1$ at all, and the inductive hypothesis applies to $p$ directly. Otherwise, take the largest $k$ for which $Q_k$ is nonzero, so that $k > 0$. The degree of $Q_k$ is at most $d-k$, because the term $x_1^k Q_k$ has degree at most $d$.

By the inductive hypothesis, if we choose $z_2, \dots, z_n$ uniformly at random from $S$ and plug them into $Q_k$, we get zero with probability at most $\frac{d-k}{|S|}$. The crucial part is that if this polynomial coefficient is nonzero, then the entire polynomial $p$, viewed as a polynomial in $x_1$ alone, is nonzero. This is true even if an unlucky choice of $x_1 = z_1$ causes the resulting evaluation $p(z_1, \dots, z_n) = 0$.

To think about it a different way, imagine we’re evaluating the polynomial in phases. In the first phase, we pick the $z_2, \dots, z_n$. We could also pick $z_1$ independently but not reveal what it is, for the sake of this story. Then we plug in the $z_2, \dots, z_n$, and the result is a one-variable polynomial whose leading coefficient is $Q_k(z_2, \dots, z_n)$. The inductive hypothesis tells us that this leading coefficient is zero, and hence that the one-variable polynomial is the zero polynomial, with probability at most $\frac{d-k}{|S|}$. (It’s probably a smaller probability, since all the coefficients have to be zero, but we’re just considering the leading one for the sake of generality and simplicity.)

Indeed, when $Q_k(z_2, \dots, z_n)$ is nonzero, the resulting polynomial after we plug in $z_2, \dots, z_n$ is a single-variable polynomial of degree $k$, so the base case applies to it, and the probability that it’s zero for a random choice of $z_1$ is at most $k / |S|$.

Finally, the two bounds can be combined using basic probability algebra. Let $A$ be the event that $Q_k(z_2, \dots, z_n)$ is zero, and $B$ the event that $p(z_1, \dots, z_n)$ is zero, where $!A$ denotes the complement of $A$.

Then $\textup{Pr}[B] = \textup{Pr}[B \textup{ and } A] + \textup{Pr}[B \textup{ and } !A] = \textup{Pr}[B \mid A] \textup{Pr}[A] + \textup{Pr}[B \mid !A] \textup{Pr}[!A]$.

Note the two quantities above that we don’t know are $\textup{Pr}[B \mid A]$ and $\textup{Pr}[!A]$, so we’ll bound them from above by 1. The other two we bounded earlier, $\textup{Pr}[A] \leq \frac{d-k}{|S|}$ and $\textup{Pr}[B \mid !A] \leq \frac{k}{|S|}$, and they add up to exactly what we want, so

$\displaystyle \textup{Pr}[B] \leq \frac{d-k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|},$

which proves the theorem.

$\square$
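To connect the lemma back to the algorithm, here is a sketch of the error analysis, assuming the 100 trials are independent. If $p = q$, every trial agrees and the algorithm never rejects. If $p \neq q$, then $p - q$ is a nonzero polynomial of total degree at most $d$, so by the lemma and the choice $|S| > 2d$, a single trial fails to detect the difference with probability

$\displaystyle \textup{Pr}[(p-q)(z_1, \dots, z_n) = 0] \leq \frac{d}{|S|} < \frac{1}{2}.$

The algorithm wrongly accepts only if all 100 independent trials agree, which happens with probability

$\displaystyle \textup{Pr}[\textup{all 100 trials agree}] \leq \left( \frac{d}{|S|} \right)^{100} < 2^{-100}.$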

While this theorem is almost trivial to prove (it’s elementary induction, and the obvious kind), it can be used to solve polynomial identity testing, as well as to find perfect matchings in graphs and to test numbers for primality.

But while the practical questions are largely solved (it’s hard to imagine a setting where you’d need faster primality testing than the existing randomized algorithms), the theory and philosophy of the result are much more interesting.

Indeed, checking two polynomials for equality has no known deterministic polynomial time algorithm. It’s one of a small class of problems, like integer factoring and the discrete logarithm, which are not known to be efficiently solvable in theory, but are also not known to be NP-hard, so there is still hope. The existence of this randomized algorithm increases hope (integer factorization sure doesn’t have one!). And more generally, the fact that there are so few natural problems in this class makes one wonder whether randomness is actually beneficial at all. From a polynomial-time-defined-as-efficient perspective, can every problem efficiently solvable with access to random bits also be solved without such access? In the computational complexity lingo, does P = BPP? Many experts think the answer is yes.