# Formulating the Support Vector Machine Optimization Problem

## The hypothesis and the setup

This blog post has an interactive demo (mostly used toward the end of the post). The source for this demo is available in a Github repository.

Last time we saw how the inner product of two vectors gives rise to a decision rule: if $w$ is the normal to a line (or hyperplane) $L$, the sign of the inner product $\langle x, w \rangle$ tells you whether $x$ is on the same side of $L$ as $w$.

Let’s translate this to the parlance of machine-learning. Let $x \in \mathbb{R}^n$ be a training data point, and $y \in \{ 1, -1 \}$ is its label (green and red, in the images in this post). Suppose you want to find a hyperplane which separates all the points with -1 labels from those with +1 labels (assume for the moment that this is possible). For this and all examples in this post, we’ll use data in two dimensions, but the math will apply to any dimension.

Some data labeled red and green, which is separable by a hyperplane (line).

The hypothesis we’re proposing to separate these points is a hyperplane, i.e. a linear subspace that splits all of $\mathbb{R}^n$ into two halves. The data that represents this hyperplane is a single vector $w$, the normal to the hyperplane, so that the hyperplane is defined by the solutions to the equation $\langle x, w \rangle = 0$.

As we saw last time, $w$ encodes the following rule for deciding if a new point $z$ has a positive or negative label.

$\displaystyle h_w(z) = \textup{sign}(\langle w, x \rangle)$

You’ll notice that this formula only works for the normals $w$ of hyperplanes that pass through the origin, and generally we want to work with data that can be shifted elsewhere. We can resolve this by either adding a fixed term $b \in \mathbb{R}$—often called a bias because statisticians came up with it—so that the shifted hyperplane is the set of solutions to $\langle x, w \rangle + b = 0$. The shifted decision rule is:

$\displaystyle h_w(z) = \textup{sign}(\langle w, x \rangle + b)$

Now the hypothesis is the pair of vector-and-scalar $w, b$.

The key intuitive idea behind the formulation of the SVM problem is that there are many possible separating hyperplanes for a given set of labeled training data. For example, here is a gif showing infinitely many choices.

The question is: how can we find the separating hyperplane that not only separates the training data, but generalizes as well as possible to new data? The assumption of the SVM is that a hyperplane which separates the points, but is also as far away from any training point as possible, will generalize best.

While contrived, it’s easy to see that the separating hyperplane is as far as possible from any training point.

More specifically, fix a labeled dataset of points $(x_i, y_i)$, or more precisely:

$\displaystyle D = \{ (x_i, y_i) \mid i = 1, \dots, m, x_i \in \mathbb{R}^{n}, y_i \in \{1, -1\} \}$

And a hypothesis defined by the normal $w \in \mathbb{R}^{n}$ and a shift $b \in \mathbb{R}$. Let’s also suppose that $(w,b)$ defines a hyperplane that correctly separates all the training data into the two labeled classes, and we just want to measure its quality. That measure of quality is the length of its margin.

Definition: The geometric margin of a hyperplane $w$ with respect to a dataset $D$ is the shortest distance from a training point $x_i$ to the hyperplane defined by $w$.

The best hyperplane has the largest possible margin.

This margin can even be computed quite easily using our work from last post. The distance from $x$ to the hyperplane defined by $w$ is the same as the length of the projection of $x$ onto $w$. And this is just computed by an inner product.

If the tip of the $x$ arrow is the point in question, then $a$ is the dot product, and $b$ the distance from $x$ to the hyperplane $L$ defined by $w$.

## A naive optimization objective

If we wanted to, we could stop now and define an optimization problem that would be very hard to solve. It would look like this:

\displaystyle \begin{aligned} & \max_{w} \min_{x_i} \left | \left \langle x_i, \frac{w}{\|w\|} \right \rangle + b \right | & \\ \textup{subject to \ \ } & \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i) & \textup{ for every } i = 1, \dots, m \end{aligned}

The formulation is hard. The reason is it’s horrifyingly nonlinear. In more detail:

1. The constraints are nonlinear due to the sign comparisons.
2. There’s a min and a max! A priori, we have to do this because we don’t know which point is going to be the closest to the hyperplane.
3. The objective is nonlinear in two ways: the absolute value and the projection requires you to take a norm and divide.

The rest of this post (and indeed, a lot of the work in grokking SVMs) is dedicated to converting this optimization problem to one in which the constraints are all linear inequalities and the objective is a single, quadratic polynomial we want to minimize or maximize.

Along the way, we’ll notice some neat features of the SVM.

## Trick 1: linearizing the constraints

To solve the first problem, we can use a trick. We want to know whether $\textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i)$ for a labeled training point $(x_i, y_i)$. The trick is to multiply them together. If their signs agree, then their product will be positive, otherwise it will be negative.

So each constraint becomes:

$\displaystyle (\langle x_i, w \rangle + b) \cdot y_i \geq 0$

This is still linear because $y_i$ is a constant (input) to the optimization problem. The variables are the coefficients of $w$.

The left hand side of this inequality is often called the functional margin of a training point, since, as we will see, it still works to classify $x_i$, even if $w$ is scaled so that it is no longer a unit vector. Indeed, the sign of the inner product is independent of how $w$ is scaled.

## Trick 1.5: the optimal solution is midway between classes

This small trick is to notice that if $w$ is the supposed optimal separating hyperplane, i.e. its margin is maximized, then it must necessarily be exactly halfway in between the closest points in the positive and negative classes.

In other words, if $x_+$ and $x_-$ are the closest points in the positive and negative classes, respectively, then $\langle x_{+}, w \rangle + b = -(\langle x_{-}, w \rangle + b)$. If this were not the case, then you could adjust the bias, shifting the decision boundary along $w$ until it they are exactly equal, and you will have increased the margin. The closest point, say $x_+$ will have gotten farther away, and the closest point in the opposite class, $x_-$ will have gotten closer, but will not be closer than $x_+$.

## Trick 2: getting rid of the max + min

Resolving this problem essentially uses the fact that the hypothesis, which comes in the form of the normal vector $w$, has a degree of freedom in its length. To explain the details of this trick, we’ll set $b=0$ which simplifies the intuition.

Indeed, in the animation below, I can increase or decrease the length of $w$ without changing the decision boundary.

I have to keep my hand very steady (because I was too lazy to program it so that it only increases/decreases in length), but you can see the point. The line is perpendicular to the normal vector, and it doesn’t depend on the length.

Let’s combine this with tricks 1 and 1.5. If we increase the length of $w$, that means the absolute values of the dot products $\langle x_i, w \rangle$ used in the constraints will all increase by the same amount (without changing their sign). Indeed, for any vector $a$ we have $\langle a, w \rangle = \|w \| \cdot \langle a, w / \| w \| \rangle$.

In this world, the inner product measurement of distance from a point to the hyperplane is no longer faithful. The true distance is $\langle a, w / \| w \| \rangle$, but the distance measured by $\langle a, w \rangle$ is measured in units of $1 / \| w \|$.

In this example, the two numbers next to the green dot represent the true distance of the point from the hyperplane, and the dot product of the point with the normal (respectively). The dashed lines are the solutions to <x, w> = 1. The magnitude of w is 2.2, the inverse of that is 0.46, and indeed 2.2 = 4.8 * 0.46 (we’ve rounded the numbers).

Now suppose we had the optimal hyperplane and its normal $w$. No matter how near (or far) the nearest positively labeled training point $x$ is, we could scale the length of $w$ to force $\langle x, w \rangle = 1$. This is the core of the trick. One consequence is that the actual distance from $x$ to the hyperplane is $\frac{1}{\| w \|} = \langle x, w / \| w \| \rangle$.

The same as above, but with the roles reversed. We’re forcing the inner product of the point with w to be 1. The true distance is unchanged.

In particular, if we force the closest point to have inner product 1, then all other points will have inner product at least 1. This has two consequences. First, our constraints change to $\langle x_i, w \rangle \cdot y_i \geq 1$ instead of $\geq 0$. Second, we no longer need to ask which point is closest to the candidate hyperplane! Because after all, we never cared which point it was, just how far away that closest point was. And now we know that it’s exactly $1 / \| w \|$ away. Indeed, if the optimal points weren’t at that distance, then that means the closest point doesn’t exactly meet the constraint, i.e. that $\langle x, w \rangle > 1$ for every training point $x$. We could then scale $w$ shorter until $\langle x, w \rangle = 1$, hence increasing the margin $1 / \| w \|$.

In other words, the coup de grâce, provided all the constraints are satisfied, the optimization objective is just to maximize $1 / \| w \|$, a.k.a. to minimize $\| w \|$.

This intuition is clear from the following demonstration, which you can try for yourself. In it I have a bunch of positively and negatively labeled points, and the line in the center is the candidate hyperplane with normal $w$ that you can drag around. Each training point has two numbers next to it. The first is the true distance from that point to the candidate hyperplane; the second is the inner product with $w$. The two blue dashed lines are the solutions to $\langle x, w \rangle = \pm 1$. To solve the SVM by hand, you have to ensure the second number is at least 1 for all green points, at most -1 for all red points, and then you have to make $w$ as short as possible. As we’ve discussed, shrinking $w$ moves the blue lines farther away from the separator, but in order to satisfy the constraints the blue lines can’t go further than any training point. Indeed, the optimum will have those blue lines touching a training point on each side.

I bet you enjoyed watching me struggle to solve it. And while it’s probably not the optimal solution, the idea should be clear.

The final note is that, since we are now minimizing $\| w \|$, a formula which includes a square root, we may as well minimize its square $\| w \|^2 = \sum_j w_j^2$. We will also multiply the objective by $1/2$, because when we eventually analyze this problem we will take a derivative, and the square in the exponent and the $1/2$ will cancel.

## The final form of the problem

Our optimization problem is now the following (including the bias again):

\displaystyle \begin{aligned} & \min_{w} \frac{1}{2} \| w \|^2 & \\ \textup{subject to \ \ } & (\langle x_i, w \rangle + b) \cdot y_i \geq 1 & \textup{ for every } i = 1, \dots, m \end{aligned}

This is much simpler to analyze. The constraints are all linear inequalities (which, because of linear programming, we know are tractable to optimize). The objective to minimize, however, is a convex quadratic function of the input variables—a sum of squares of the inputs.

Such problems are generally called quadratic programming problems (or QPs, for short). There are general methods to find solutions! However, they often suffer from numerical stability issues and have less-than-satisfactory runtime. Luckily, the form in which we’ve expressed the support vector machine problem is specific enough that we can analyze it directly, and find a way to solve it without appealing to general-purpose numerical solvers.

We will tackle this problem in a future post (planned for two posts sequel to this one). Before we close, let’s just make a few more observations about the solution to the optimization problem.

## Support Vectors

In Trick 1.5 we saw that the optimal separating hyperplane has to be exactly halfway between the two closest points of opposite classes. Moreover, we noticed that, provided we’ve scaled $\| w \|$ properly, these closest points (there may be multiple for positive and negative labels) have to be exactly “distance” 1 away from the separating hyperplane.

Another way to phrase this without putting “distance” in scare quotes is to say that, if $w$ is the normal vector of the optimal separating hyperplane, the closest points lie on the two lines $\langle x_i, w \rangle + b = \pm 1$.

Now that we have some intuition for the formulation of this problem, it isn’t a stretch to realize the following. While a dataset may include many points from either class on these two lines $\langle x_i, w \rangle = \pm 1$, the optimal hyperplane itself does not depend on any of the other points except these closest points.

This fact is enough to give these closest points a special name: the support vectors.

We’ll actually prove that support vectors “are all you need” with full rigor and detail next time, when we cast the optimization problem in this post into the “dual” setting. To avoid vague names, the formulation described in this post called the “primal” problem. The dual problem is derived from the primal problem, with special variables and constraints chosen based on the primal variables and constraints. Next time we’ll describe in brief detail what the dual does and why it’s important, but we won’t have nearly enough time to give a full understanding of duality in optimization (such a treatment would fill a book).

When we compute the dual of the SVM problem, we will see explicitly that the hyperplane can be written as a linear combination of the support vectors. As such, once you’ve found the optimal hyperplane, you can compress the training set into just the support vectors, and reproducing the same optimal solution becomes much, much faster. You can also use the support vectors to augment the SVM to incorporate streaming data (throw out all non-support vectors after every retraining).

Eventually, when we get to implementing the SVM from scratch, we’ll see all this in action.

Until then!

# Big Dimensions, and What You Can Do About It

Data is abundant, data is big, and big is a problem. Let me start with an example. Let’s say you have a list of movie titles and you want to learn their genre: romance, action, drama, etc. And maybe in this scenario IMDB doesn’t exist so you can’t scrape the answer. Well, the title alone is almost never enough information. One nice way to get more data is to do the following:

1. Pick a large dictionary of words, say the most common 100,000 non stop-words in the English language.
2. Crawl the web looking for documents that include the title of a film.
3. For each film, record the counts of all other words appearing in those documents.
4. Maybe remove instances of “movie” or “film,” etc.

After this process you have a length-100,000 vector of integers associated with each movie title. IMDB’s database has around 1.5 million listed movies, and if we have a 32-bit integer per vector entry, that’s 600 GB of data to get every movie.

One way to try to find genres is to cluster this (unlabeled) dataset of vectors, and then manually inspect the clusters and assign genres. With a really fast computer we could simply run an existing clustering algorithm on this dataset and be done. Of course, clustering 600 GB of data takes a long time, but there’s another problem. The geometric intuition that we use to design clustering algorithms degrades as the length of the vectors in the dataset grows. As a result, our algorithms perform poorly. This phenomenon is called the “curse of dimensionality” (“curse” isn’t a technical term), and we’ll return to the mathematical curiosities shortly.

A possible workaround is to try to come up with faster algorithms or be more patient. But a more interesting mathematical question is the following:

Is it possible to condense high-dimensional data into smaller dimensions and retain the important geometric properties of the data?

This goal is called dimension reduction. Indeed, all of the chatter on the internet is bound to encode redundant information, so for our movie title vectors it seems the answer should be “yes.” But the questions remain, how does one find a low-dimensional condensification? (Condensification isn’t a word, the right word is embedding, but embedding is overloaded so we’ll wait until we define it) And what mathematical guarantees can you prove about the resulting condensed data? After all, it stands to reason that different techniques preserve different aspects of the data. Only math will tell.

In this post we’ll explore this so-called “curse” of dimensionality, explain the formality of why it’s seen as a curse, and implement a wonderfully simple technique called “the random projection method” which preserves pairwise distances between points after the reduction. As usual, and all the code, data, and tests used in the making of this post are on Github.

## Some curious issues, and the “curse”

We start by exploring the curse of dimensionality with experiments on synthetic data.

In two dimensions, take a circle centered at the origin with radius 1 and its bounding square.

The circle fills up most of the area in the square, in fact it takes up exactly $\pi$ out of 4 which is about 78%. In three dimensions we have a sphere and a cube, and the ratio of sphere volume to cube volume is a bit smaller, $4 \pi /3$ out of a total of 8, which is just over 52%. What about in a thousand dimensions? Let’s try by simulation.

import random

def randUnitCube(n):
return [(random.random() - 0.5)*2 for _ in range(n)]

def sphereCubeRatio(n, numSamples):
randomSample = [randUnitCube(n) for _ in range(numSamples)]
return sum(1 for x in randomSample if sum(a**2 for a in x) <= 1) / numSamples 

The result is as we computed for small dimension,

 >>> sphereCubeRatio(2,10000)
0.7857
>>> sphereCubeRatio(3,10000)
0.5196


And much smaller for larger dimension

>>> sphereCubeRatio(20,100000) # 100k samples
0.0
>>> sphereCubeRatio(20,1000000) # 1M samples
0.0
>>> sphereCubeRatio(20,2000000)
5e-07


Forget a thousand dimensions, for even twenty dimensions, a million samples wasn’t enough to register a single random point inside the unit sphere. This illustrates one concern, that when we’re sampling random points in the $d$-dimensional unit cube, we need at least $2^d$ samples to ensure we’re getting a even distribution from the whole space. In high dimensions, this face basically rules out a naive Monte Carlo approximation, where you sample random points to estimate the probability of an event too complicated to sample from directly. A machine learning viewpoint of the same problem is that in dimension $d$, if your machine learning algorithm requires a representative sample of the input space in order to make a useful inference, then you require $2^d$ samples to learn.

Luckily, we can answer our original question because there is a known formula for the volume of a sphere in any dimension. Rather than give the closed form formula, which involves the gamma function and is incredibly hard to parse, we’ll state the recursive form. Call $V_i$ the volume of the unit sphere in dimension $i$. Then $V_0 = 1$ by convention, $V_1 = 2$ (it’s an interval), and $V_n = \frac{2 \pi V_{n-2}}{n}$. If you unpack this recursion you can see that the numerator looks like $(2\pi)^{n/2}$ and the denominator looks like a factorial, except it skips every other number. So an even dimension would look like $2 \cdot 4 \cdot \dots \cdot n$, and this grows larger than a fixed exponential. So in fact the total volume of the sphere vanishes as the dimension grows! (In addition to the ratio vanishing!)

def sphereVolume(n):
values = [0] * (n+1)
for i in range(n+1):
if i == 0:
values[i] = 1
elif i == 1:
values[i] = 2
else:
values[i] = 2*math.pi / i * values[i-2]

return values[-1]


This should be counterintuitive. I think most people would guess, when asked about how the volume of the unit sphere changes as the dimension grows, that it stays the same or gets bigger.  But at a hundred dimensions, the volume is already getting too small to fit in a float.

>>> sphereVolume(20)
0.025806891390014047
>>> sphereVolume(100)
2.3682021018828297e-40
>>> sphereVolume(1000)
0.0


The scary thing is not just that this value drops, but that it drops exponentially quickly. A consequence is that, if you’re trying to cluster data points by looking at points within a fixed distance $r$ of one point, you have to carefully measure how big $r$ needs to be to cover the same proportional volume as it would in low dimension.

Here’s a related issue. Say I take a bunch of points generated uniformly at random in the unit cube.

from itertools import combinations

def distancesRandomPoints(n, numSamples):
randomSample = [randUnitCube(n) for _ in range(numSamples)]
pairwiseDistances = [dist(x,y) for (x,y) in combinations(randomSample, 2)]
return pairwiseDistances


In two dimensions, the histogram of distances between points looks like this

However, as the dimension grows the distribution of distances changes. It evolves like the following animation, in which each frame is an increase in dimension from 2 to 100.

The shape of the distribution doesn’t appear to be changing all that much after the first few frames, but the center of the distribution tends to infinity (in fact, it grows like $\sqrt{n}$). The variance also appears to stay constant. This chart also becomes more variable as the dimension grows, again because we should be sampling exponentially many more points as the dimension grows (but we don’t). In other words, as the dimension grows the average distance grows and the tightness of the distribution stays the same. So at a thousand dimensions the average distance is about 26, tightly concentrated between 24 and 28. When the average is a thousand, the distribution is tight between 998 and 1002. If one were to normalize this data, it would appear that random points are all becoming equidistant from each other.

So in addition to the issues of runtime and sampling, the geometry of high-dimensional space looks different from what we expect. To get a better understanding of “big data,” we have to update our intuition from low-dimensional geometry with analysis and mathematical theorems that are much harder to visualize.

## The Johnson-Lindenstrauss Lemma

Now we turn to proving dimension reduction is possible. There are a few methods one might first think of, such as look for suitable subsets of coordinates, or sums of subsets, but these would all appear to take a long time or they simply don’t work.

Instead, the key technique is to take a random linear subspace of a certain dimension, and project every data point onto that subspace. No searching required. The fact that this works is called the Johnson-Lindenstrauss Lemma. To set up some notation, we’ll call $d(v,w)$ the usual distance between two points.

Lemma [Johnson-Lindenstrauss (1984)]: Given a set $X$ of $n$ points in $\mathbb{R}^d$, project the points in $X$ to a randomly chosen subspace of dimension $c$. Call the projection $\rho$. For any $\varepsilon > 0$, if $c$ is at least $\Omega(\log(n) / \varepsilon^2)$, then with probability at least 1/2 the distances between points in $X$ are preserved up to a factor of $(1+\varepsilon)$. That is, with good probability every pair $v,w \in X$ will satisfy

$\displaystyle \| v-w \|^2 (1-\varepsilon) \leq \| \rho(v) - \rho(w) \|^2 \leq \| v-w \|^2 (1+\varepsilon)$

Before we do the proof, which is quite short, it’s important to point out that the target dimension $c$ does not depend on the original dimension! It only depends on the number of points in the dataset, and logarithmically so. That makes this lemma seem like pure magic, that you can take data in an arbitrarily high dimension and put it in a much smaller dimension.

On the other hand, if you include all of the hidden constants in the bound on the dimension, it’s not that impressive. If your data have a million dimensions and you want to preserve the distances up to 1% ($\varepsilon = 0.01$), the bound is bigger than a million! If you decrease the preservation $\varepsilon$ to 10% (0.1), then you get down to about 12,000 dimensions, which is more reasonable. At 45% the bound drops to around 1,000 dimensions. Here’s a plot showing the theoretical bound on $c$ in terms of $\varepsilon$ for $n$ fixed to a million.

But keep in mind, this is just a theoretical bound for potentially misbehaving data. Later in this post we’ll see if the practical dimension can be reduced more than the theory allows. As we’ll see, an algorithm run on the projected data is still effective even if the projection goes well beyond the theoretical bound. Because the theorem is known to be tight in the worst case (see the notes at the end) this speaks more to the robustness of the typical algorithm than to the robustness of the projection method.

A second important note is that this technique does not necessarily avoid all the problems with the curse of dimensionality. We mentioned above that one potential problem is that “random points” are roughly equidistant in high dimensions. Johnson-Lindenstrauss actually preserves this problem because it preserves distances! As a consequence, you won’t see strictly better algorithm performance if you project (which we suggested is possible in the beginning of this post). But you will alleviate slow runtimes if the runtime depends exponentially on the dimension. Indeed, if you replace the dimension $d$ with the logarithm of the number of points $\log n$, then $2^d$ becomes linear in $n$, and $2^{O(d)}$ becomes polynomial.

## Proof of the J-L lemma

Let’s prove the lemma.

Proof. To start we make note that one can sample from the uniform distribution on dimension-$c$ linear subspaces of $\mathbb{R}^d$ by choosing the entries of a $c \times d$ matrix $A$ independently from a normal distribution with mean 0 and variance 1. Then, to project a vector $x$ by this matrix (call the projection $\rho$), we can compute

$\displaystyle \rho(x) = \frac{1}{\sqrt{c}}A x$

Now fix $\varepsilon > 0$ and fix two points in the dataset $x,y$. We want an upper bound on the probability that the following is false

$\displaystyle \| x-y \|^2 (1-\varepsilon) \leq \| \rho(x) - \rho(y) \|^2 \leq \| x-y \|^2 (1+\varepsilon)$

Since that expression is a pain to work with, let’s rearrange it by calling $u = x-y$, and rearranging (using the linearity of the projection) to get the equivalent statement.

$\left | \| \rho(u) \|^2 - \|u \|^2 \right | \leq \varepsilon \| u \|^2$

And so we want a bound on the probability that this event does not occur, meaning the inequality switches directions.

Once we get such a bound (it will depend on $c$ and $\varepsilon$) we need to ensure that this bound is true for every pair of points. The union bound allows us to do this, but it also requires that the probability of the bad thing happening tends to zero faster than $1/\binom{n}{2}$. That’s where the $\log(n)$ will come into the bound as stated in the theorem.

Continuing with our use of $u$ for notation, define $X$ to be the random variable $\frac{c}{\| u \|^2} \| \rho(u) \|^2$. By expanding the notation and using the linearity of expectation, you can show that the expected value of $X$ is $c$, meaning that in expectation, distances are preserved. We are on the right track, and just need to show that the distribution of $X$, and thus the possible deviations in distances, is tightly concentrated around $c$. In full rigor, we will show

$\displaystyle \Pr [X \geq (1+\varepsilon) c] < e^{-(\varepsilon^2 - \varepsilon^3) \frac{c}{4}}$

Let $A_i$ denote the $i$-th column of $A$. Define by $X_i$ the quantity $\langle A_i, u \rangle / \| u \|$. This is a weighted average of the entries of $A_i$ by the entries of $u$. But since we chose the entries of $A$ from the normal distribution, and since a weighted average of normally distributed random variables is also normally distributed (has the same distribution), $X_i$ is a $N(0,1)$ random variable. Moreover, each column is independent. This allows us to decompose $X$ as

$X = \frac{k}{\| u \|^2} \| \rho(u) \|^2 = \frac{\| Au \|^2}{\| u \|^2}$

Expanding further,

$X = \sum_{i=1}^c \frac{\| A_i u \|^2}{\|u\|^2} = \sum_{i=1}^c X_i^2$

Now the event $X \leq (1+\varepsilon) c$ can be expressed in terms of the nonegative variable $e^{\lambda X}$, where $0 < \lambda < 1/2$ is parameter, to get

$\displaystyle \Pr[X \geq (1+\varepsilon) c] = \Pr[e^{\lambda X} \geq e^{(1+\varepsilon)c \lambda}]$

This will become useful because the sum $X = \sum_i X_i^2$ will split into a product momentarily. First we apply Markov’s inequality, which says that for any nonnegative random variable $Y$, $\Pr[Y \geq t] \leq \mathbb{E}[Y] / t$. This lets us write

$\displaystyle \Pr[e^{\lambda X} \geq e^{(1+\varepsilon) c \lambda}] \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{(1+\varepsilon) c \lambda}}$

Now we can split up the exponent $\lambda X$ into $\sum_{i=1}^c \lambda X_i^2$, and using the i.i.d.-ness of the $X_i^2$ we can rewrite the RHS of the inequality as

$\left ( \frac{\mathbb{E}[e^{\lambda X_1^2}]}{e^{(1+\varepsilon)\lambda}} \right )^c$

A similar statement using $-\lambda$ is true for the $(1-\varepsilon)$ part, namely that

$\displaystyle \Pr[X \leq (1-\varepsilon)c] \leq \left ( \frac{\mathbb{E}[e^{-\lambda X_1^2}]}{e^{-(1-\varepsilon)\lambda}} \right )^c$

The last thing that’s needed is to bound $\mathbb{E}[e^{\lambda X_i^2}]$, but since $X_i^2 \sim N(0,1)$, we can use the known density function for a normal distribution, and integrate to get the exact value $\mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}}$. Including this in the bound gives us a closed-form bound in terms of $\lambda, c, \varepsilon$. Using standard calculus the optimal $\lambda \in (0,1/2)$ is $\lambda = \varepsilon / 2(1+\varepsilon)$. This gives

$\displaystyle \Pr[X \geq (1+\varepsilon) c] \leq ((1+\varepsilon)e^{-\varepsilon})^{c/2}$

Using the Taylor series expansion for $e^x$, one can show the bound $1+\varepsilon < e^{\varepsilon - (\varepsilon^2 - \varepsilon^3)/2}$, which simplifies the final upper bound to $e^{-(\varepsilon^2 - \varepsilon^3) c/4}$.

Doing the same thing for the $(1-\varepsilon)$ version gives an equivalent bound, and so the total bound is doubled, i.e. $2e^{-(\varepsilon^2 - \varepsilon^3) c/4}$.

As we said at the beginning, applying the union bound means we need

$\displaystyle 2e^{-(\varepsilon^2 - \varepsilon^3) c/4} < \frac{1}{\binom{n}{2}}$

Solving this for $c$ gives $c \geq \frac{8 \log m}{\varepsilon^2 - \varepsilon^3}$, as desired.

$\square$

## Projecting in Practice

Let’s write a python program to actually perform the Johnson-Lindenstrauss dimension reduction scheme. This is sometimes called the Johnson-Lindenstrauss transform, or JLT.

First we define a random subspace by sampling an appropriately-sized matrix with normally distributed entries, and a function that performs the projection onto a given subspace (for testing).

import random
import math
import numpy

def randomSubspace(subspaceDimension, ambientDimension):
return numpy.random.normal(0, 1, size=(subspaceDimension, ambientDimension))

def project(v, subspace):
subspaceDimension = len(subspace)
return (1 / math.sqrt(subspaceDimension)) * subspace.dot(v)


We have a function that computes the theoretical bound on the optimal dimension to reduce to.

def theoreticalBound(n, epsilon):
return math.ceil(8*math.log(n) / (epsilon**2 - epsilon**3))


And then performing the JLT is simply matrix multiplication

def jlt(data, subspaceDimension):
ambientDimension = len(data[0])
A = randomSubspace(subspaceDimension, ambientDimension)
return (1 / math.sqrt(subspaceDimension)) * A.dot(data.T).T


The high-dimensional dataset we’ll use comes from a data mining competition called KDD Cup 2001. The dataset we used deals with drug design, and the goal is to determine whether an organic compound binds to something called thrombin. Thrombin has something to do with blood clotting, and I won’t pretend I’m an expert. The dataset, however, has over a hundred thousand features for about 2,000 compounds. Here are a few approximate target dimensions we can hope for as epsilon varies.

>>> [((1/x),theoreticalBound(n=2000, epsilon=1/x))
for x in [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]]
[('0.50', 487), ('0.33', 821), ('0.25', 1298), ('0.20', 1901),
('0.17', 2627), ('0.14', 3477), ('0.12', 4448), ('0.11', 5542),
('0.10', 6757), ('0.07', 14659), ('0.05', 25604)]


Going down from a hundred thousand dimensions to a few thousand is by any measure decreases the size of the dataset by about 95%. We can also observe how the distribution of overall distances varies as the size of the subspace we project to varies.

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The last three frames are for 10, 5, and 2 dimensions respectively. As you can see the histogram starts to beef up around zero. To be honest I was expecting something a bit more dramatic like a uniform-ish distribution. Of course, the distribution of distances is not all that matters. Another concern is the worst case change in distances between any two points before and after the projection. We can see that indeed when we project to the dimension specified in the theorem, that the distances are within the prescribed bounds.

def checkTheorem(oldData, newData, epsilon):

for (x,y), (x2,y2) in zip(combinations(oldData, 2), combinations(newData, 2)):
oldNorm = numpy.linalg.norm(x2-y2)**2
newNorm = numpy.linalg.norm(x-y)**2

if newNorm == 0 or oldNorm == 0:
continue

if abs(oldNorm / newNorm - 1) &gt; epsilon:

if __name__ == &quot;__main__&quot;
from data import thrombin

numPoints = len(train)
epsilon = 0.2
subspaceDim = theoreticalBound(numPoints, epsilon)
ambientDim = len(train[0])
newData = jlt(train, subspaceDim)

print(checkTheorem(train, newData, epsilon))


This program prints zero every time I try running it, which is the poor man’s way of saying it works “with high probability.” We can also plot statistics about the number of pairs of data points that are distorted by more than $\varepsilon$ as the subspace dimension shrinks. We ran this on the following set of subspace dimensions with $\varepsilon = 0.1$ and took average/standard deviation over twenty trials:

   dims = [1000, 750, 500, 250, 100, 75, 50, 25, 10, 5, 2]


The result is the following chart, whose x-axis is the dimension projected to (so the left hand is the most extreme projection to 2, 5, 10 dimensions), the y-axis is the number of distorted pairs, and the error bars represent a single standard deviation away from the mean.

This chart provides good news about this dataset because the standard deviations are low. It tells us something that mathematicians often ignore: the predictability of the tradeoff that occurs once you go past the theoretically perfect bound. In this case, the standard deviations tell us that it’s highly predictable. Moreover, since this tradeoff curve measures pairs of points, we might conjecture that the distortion is localized around a single set of points that got significantly “rattled” by the projection. This would be an interesting exercise to explore.

Now all of these charts are really playing with the JLT and confirming the correctness of our code (and hopefully our intuition). The real question is: how well does a machine learning algorithm perform on the original data when compared to the projected data? If the algorithm only “depends” on the pairwise distances between the points, then we should expect nearly identical accuracy in the unprojected and projected versions of the data. To show this we’ll use an easy learning algorithm, the k-nearest-neighbors clustering method. The problem, however, is that there are very few positive examples in this particular dataset. So looking for the majority label of the nearest $k$ neighbors for any $k > 2$ unilaterally results in the “all negative” classifier, which has 97% accuracy. This happens before and after projecting.

To compensate for this, we modify k-nearest-neighbors slightly by having the label of a predicted point be 1 if any label among its nearest neighbors is 1. So it’s not a majority vote, but rather a logical OR of the labels of nearby neighbors. Our point in this post is not to solve the problem well, but rather to show how an algorithm (even a not-so-good one) can degrade as one projects the data into smaller and smaller dimensions. Here is the code.

def nearestNeighborsAccuracy(data, labels, k=10):
from sklearn.neighbors import NearestNeighbors
trainData, trainLabels, testData, testLabels = randomSplit(data, labels) # cross validation
model = NearestNeighbors(n_neighbors=k).fit(trainData)
distances, indices = model.kneighbors(testData)
predictedLabels = []

for x in indices:
xLabels = [trainLabels[i] for i in x[1:]]
predictedLabel = max(xLabels)
predictedLabels.append(predictedLabel)

totalAccuracy = sum(x == y for (x,y) in zip(testLabels, predictedLabels)) / len(testLabels)
falsePositive = (sum(x == 0 and y == 1 for (x,y) in zip(testLabels, predictedLabels)) /
sum(x == 0 for x in testLabels))
falseNegative = (sum(x == 1 and y == 0 for (x,y) in zip(testLabels, predictedLabels)) /
sum(x == 1 for x in testLabels))



And here is the accuracy of this modified k-nearest-neighbors algorithm run on the thrombin dataset. The horizontal line represents the accuracy of the produced classifier on the unmodified data set. The x-axis represents the dimension projected to (left-hand side is the lowest), and the y-axis represents the accuracy. The mean accuracy over fifty trials was plotted, with error bars representing one standard deviation. The complete code to reproduce the plot is in the Github repository.

Likewise, we plot the proportion of false positive and false negatives for the output classifier. Note that a “positive” label made up only about 2% of the total data set. First the false positives

Then the false negatives

As we can see from these three charts, things don’t really change that much (for this dataset) even when we project down to around 200-300 dimensions. Note that for these parameters the “correct” theoretical choice for dimension was on the order of 5,000 dimensions, so this is a 95% savings from the naive approach, and 99.75% space savings from the original data. Not too shabby.

## Notes

The $\Omega(\log(n))$ worst-case dimension bound is asymptotically tight, though there is some small gap in the literature that depends on $\varepsilon$. This result is due to Noga Alon, the very last result (Section 9) of this paper. [Update: as djhsu points out in the comments, this gap is now closed thanks to Larsen and Nelson]

We did dimension reduction with respect to preserving the Euclidean distance between points. One might naturally wonder if you can achieve the same dimension reduction with a different metric, say the taxicab metric or a $p$-norm. In fact, you cannot achieve anything close to logarithmic dimension reduction for the taxicab ($l_1$) metric. This result is due to Brinkman-Charikar in 2004.

The code we used to compute the JLT is not particularly efficient. There are much more efficient methods. One of them, borrowing its namesake from the Fast Fourier Transform, is called the Fast Johnson-Lindenstrauss Transform. The technique is due to Ailon-Chazelle from 2009, and it involves something called “preconditioning a sparse projection matrix with a randomized Fourier transform.” I don’t know precisely what that means, but it would be neat to dive into that in a future post.

The central focus in this post was whether the JLT preserves distances between points, but one might be curious as to whether the points themselves are well approximated. The answer is an enthusiastic no. If the data were images, the projected points would look nothing like the original images. However, it appears the degradation tradeoff is measurable (by some accounts perhaps linear), and there appears to be some work (also this by the same author) when restricting to sparse vectors (like word-association vectors).

Note that the JLT is not the only method for dimensionality reduction. We previously saw principal component analysis (applied to face recognition), and in the future we will cover a related technique called the Singular Value Decomposition. It is worth noting that another common technique specific to nearest-neighbor is called “locality-sensitive hashing.” Here the goal is to project the points in such a way that “similar” points land very close to each other. Say, if you were to discretize the plane into bins, these bins would form the hash values and you’d want to maximize the probability that two points with the same label land in the same bin. Then you can do things like nearest-neighbors by comparing bins.

Another interesting note, if your data is linearly separable (like the examples we saw in our age-old post on Perceptrons), then you can use the JLT to make finding a linear separator easier. First project the data onto the dimension given in the theorem. With high probability the points will still be linearly separable. And then you can use a perceptron-type algorithm in the smaller dimension. If you want to find out which side a new point is on, you project and compare with the separator in the smaller dimension.

Beyond its interest for practical dimensionality reduction, the JLT has had many other interesting theoretical consequences. More generally, the idea of “randomly projecting” your data onto some small dimensional space has allowed mathematicians to get some of the best-known results on many optimization and learning problems, perhaps the most famous of which is called MAX-CUT; the result is by Goemans-Williamson and it led to a mathematical constant being named after them, $\alpha_{GW} =.878567 \dots$. If you’re interested in more about the theory, Santosh Vempala wrote a wonderful (and short!) treatise dedicated to this topic.

# The Inequality

Math and computer science are full of inequalities, but there is one that shows up more often in my work than any other. Of course, I’m talking about

$\displaystyle 1+x \leq e^{x}$

This is The Inequality. I’ve been told on many occasions that the entire field of machine learning reduces to The Inequality combined with the Chernoff bound (which is proved using The Inequality).

Why does it show up so often in machine learning? Mostly because in analyzing an algorithm you want to bound the probability that some bad event happens. Bad events are usually phrased similarly to

$\displaystyle \prod_{i=1}^m (1-p_i)$

And applying The Inequality we can bound this from above by

$\displaystyle\prod_{i=1}^m (1-p_i) \leq \prod_{i=1}^m e^{-p_i} = e^{-\sum_{i=1}^m p_i}$

The point is that usually $m$ is the size of your dataset, which you get to choose, and by picking larger $m$ you make the probability of the bad event vanish exponentially quickly in $m$. (Here $p_i$ is unrelated to how I am about to use $p_i$ as weights).

Of course, The Inequality has much deeper implications than bounds for the efficiency and correctness of machine learning algorithms. To convince you of the depth of this simple statement, let’s see its use in an elegant proof of the arithmetic geometric inequality.

Theorem: (The arithmetic-mean geometric-mean inequality, general version): For all non-negative real numbers $a_1, \dots, a_n$ and all positive $p_1, \dots, p_n$ such that $p_1 + \dots + p_n = 1$, the following inequality holds:

$\displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq p_1 a_1 + \dots + p_n a_n$

Note that when all the $p_i = 1/n$ this is the standard AM-GM inequality.

Proof. This proof is due to George Polya (in Hungarian, Pólya György).

We start by modifying The Inequality $1+x \leq e^x$ by a shift of variables $x \mapsto x-1$, so that the inequality now reads $x \leq e^{x-1}$. We can apply this to each $a_i$ giving $a_i \leq e^{a_i - 1}$, and in fact,

$\displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq e^{\sum_{i=1}^n p_ia_i - p_i} = e^{\left ( \sum_{i=1}^n p_ia_i \right ) - 1}$

Now we have something quite curious: if we call $A$ the sum $p_1a_1 + \dots + p_na_n$, the above shows that $a_1^{p_1} \cdots a_n^{p_n} \leq e^{A-1}$. Moreover, again because $A \leq e^{A-1}$ that shows that the right hand side of the inequality we’re trying to prove is also bounded by $e^{A-1}$. So we know that both sides of our desired inequality (and in particular, the max) is bounded from above by $e^{A-1}$. This seems like a conundrum until we introduce the following beautiful idea: normalize by the thing you think should be the larger of the two sides of the inequality.

Define new variables $b_i = a_i / A$ and notice that $\sum_i p_i b_i = 1$ just by unraveling the definition. Call this sum $B = \sum_i p_i b_i$. Now we know that

$b_1^{p_1} \cdots b_n^{p_n} = \left ( \frac{a_1}{A} \right )^{p_1} \cdots \left ( \frac{a_n}{A} \right )^{p_n} \leq e^{B - 1} = e^0 = 1$

Now we unpack the pieces, and multiply through by $A^{p_1}A^{p_2} \cdots A^{p_n} = A$, the result is exactly the AM-GM inequality.

$\square$

Even deeper, there is only one case when The Inequality is tight, i.e. when $1+x = e^x$, and that is $x=0$. This allows us to use the proof above to come to a full characterization of the case of equality in the proof above. Indeed, the crucial step was that $(a_i / A) = e^{A-1}$, which is only true when $(a_i / A) = 1$, i.e. when $a_i = A$. Spending a few seconds thinking about this gives the characterization of equality if and only if $a_1 = a_2 = \dots = a_n = A$.

So this is excellent: the arithmetic-geometric inequality is a deep theorem with applications all over mathematics and statistics. Adding another layer of indirection for impressiveness, one can use the AM-GM inequality to prove the Cauchy-Schwarz inequality rather directly. Sadly, the Wikipedia page for the Cauchy-Schwarz inequality hardly does it justice as far as the massive number of applications. For example, many novel techniques in geometry and number theory are proved directly from C-S. More, in fact, than I can hope to learn.

Of course, no article about The Inequality could be complete without a proof of The Inequality.

Theorem: For all $x \in \mathbb{R}$, $1+x \leq e^x$.

Proof. The proof starts by proving a simpler theorem, named after Bernoulli, that $1+nx \leq (1+x)^n$ for every $x [-1, \infty)$ and every $n \in \mathbb{N}$. This is relatively straightforward by induction. The base case is trivial, and

$\displaystyle (1+x)^{n+1} = (1+x)(1+x)^n \geq (1+x)(1+nx) = 1 + (n+1)x + nx^2$

And because $nx^2 \geq 0$, we get Bernoulli’s inequality.

Now for any $z \geq 0$ we can set $x = z/n$, and get $(1+z) = (1+nx) \leq (1+\frac{z}{n})^n$ for every $n$. Note that Bernoulli’s inequality is preserved for larger and larger $n$ because $x \geq 0$. So taking limits of both sides as $n \to \infty$ we get the definition of $e^z$ on the right hand side of the inequality. We can prove a symmetrical inequality for $-x$ when $x < 0$, and this proves the theorem.

$\square$

What other insights can we glean about The Inequality? For one, it’s a truncated version of the Taylor series approximation

$\displaystyle e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$

Indeed, the Taylor remainder theorem tells us that the first two terms approximate $e^x$ around zero with error depending on some constant times $e^x x^2 \geq 0$. In other words, $1+x$ is a lower bound on $e^x$ around zero. It is perhaps miraculous that this extends to a lower bound everywhere, until you realize that exponentials grow extremely quickly and lines do not.

One might wonder whether we can improve our approximation with higher order approximations. Indeed we can, but we have to be a bit careful. In particular, $1+x+x^2/2 \leq e^x$ is only true for nonnegative $x$ (because the remainder theorem now applies to $x^3$, but if we restrict to odd terms we win: $1+x+x^2/2 + x^3/6 \leq e^x$ is true for all $x$.

What is really surprising about The Inequality is that, at least in the applications I work with, we rarely see higher order approximations used. For most applications, The difference between an error term which is quadratic and one which is cubic or quartic is often not worth the extra work in analyzing the result. You get the same theorem: that something vanishes exponentially quickly.

If you’re interested in learning more about the theory of inequalities, I wholeheartedly recommend The Cauchy-Schwarz Master Class. This book is wonderfully written, and chock full of fun exercises. I know because I do exercises from books like this one on planes and trains. It’s my kind of sudoku 🙂

# Zero-One Laws for Random Graphs

Last time we saw a number of properties of graphs, such as connectivity, where the probability that an Erdős–Rényi random graph $G(n,p)$ satisfies the property is asymptotically either zero or one. And this zero or one depends on whether the parameter $p$ is above or below a universal threshold (that depends only on $n$ and the property in question).

To remind the reader, the Erdős–Rényi random “graph” $G(n,p)$ is a distribution over graphs that you draw from by including each edge independently with probability $p$. Last time we saw that the existence of an isolated vertex has a sharp threshold at $(\log n) / n$, meaning if $p$ is asymptotically smaller than the threshold there will certainly be isolated vertices, and if $p$ is larger there will certainly be no isolated vertices. We also gave a laundry list of other properties with such thresholds.

One might want to study this phenomenon in general. Even if we might not be able to find all the thresholds we want for a given property, can we classify which properties have thresholds and which do not?

The answer turns out to be mostly yes! For large classes of properties, there are proofs that say things like, “either this property holds with probability tending to one, or it holds with probability tending to zero.” These are called “zero-one laws,” and they’re sort of meta theorems. We’ll see one such theorem in this post relating to constant edge-probabilities in random graphs, and we’ll remark on another at the end.

## Sentences about graphs in first order logic

A zero-one law generally works by defining a class of properties, and then applying a generic first/second moment-type argument to every property in the class.

So first we define what kinds of properties we’ll discuss. We’ll pick a large class: anything that can be expressed in first-order logic in the language of graphs. That is, any finite logical statement that uses existential and universal quantifiers over variables, and whose only relation (test) is whether an edge exists between two vertices. We’ll call this test $e(x,y)$. So you write some sentence $P$ in this language, and you take a graph $G$, and you can ask $P(G) = 1$, whether the graph satisfies the sentence.

This seems like a really large class of properties, and it is, but let’s think carefully about what kinds of properties can be expressed this way. Clearly the existence of a triangle can be written this way, it’s just the sentence

$\exists x,y,z : e(x,y) \wedge e(y,z) \wedge e(x,z)$

I’m using $\wedge$ for AND, and $\vee$ for OR, and $\neg$ for NOT. Similarly, one can express the existence of a clique of size $k$, or the existence of an independent set of size $k$, or a path of a fixed length, or whether there is a vertex of maximal degree $n-1$.

Here’s a question: can we write a formula which will be true for a graph if and only if it’s connected? Well such a formula seems like it would have to know about how many vertices there are in the graph, so it could say something like “for all $x,y$ there is a path from $x$ to $y$.” It seems like you’d need a family of such formulas that grows with $n$ to make anything work. But this isn’t a proof; the question remains whether there is some other tricky way to encode connectivity.

But as it turns out, connectivity is not a formula you can express in propositional logic. We won’t prove it here, but we will note at the end of the article that connectivity is in a different class of properties that you can prove has a similar zero-one law.

## The zero-one law for first order logic

So the theorem about first-order expressible sentences is as follows.

Theorem: Let $P$ be a property of graphs that can be expressed in the first order language of graphs (with the $e(x,y)$ relation). Then for any constant $p$, the probability that $P$ holds in $G(n,p)$ has a limit of zero or one as $n \to \infty$.

Proof. We’ll prove the simpler case of $p=1/2$, but the general case is analogous. Given such a graph $G$ drawn from $G(n,p)$, what we’ll do is define a countably infinite family of propositional formulas $\varphi_{k,l}$, and argue that they form a sort of “basis” for all first-order sentences about graphs.

First let’s describe the $\varphi_{k,l}$. For any $k,l \in \mathbb{N}$, the sentence will assert that for every set of $k$ vertices and every set of $l$ vertices, there is some other vertex connected to the first $k$ but not the last $l$.

$\displaystyle \varphi_{k,l} : \forall x_1, \dots, x_k, y_1, \dots, y_l \exists z : \\ e(z,x_1) \wedge \dots \wedge e(z,x_k) \wedge \neg e(z,y_1) \wedge \dots \wedge \neg e(z,y_l)$.

In other words, these formulas encapsulate every possible incidence pattern for a single vertex. It is a strange set of formulas, but they have a very nice property we’re about to get to. So for a fixed $\varphi_{k,l}$, what is the probability that it’s false on $n$ vertices? We want to give an upper bound and hence show that the formula is true with probability approaching 1. That is, we want to show that all the $\varphi_{k,l}$ are true with probability tending to 1.

Computing the probability: we have $\binom{n}{k} \binom{n-k}{l}$ possibilities to choose these sets, and the probability that some other fixed vertex $z$ has the good connections is $2^{-(k+l)}$ so the probability $z$ is not good is $1 - 2^{-(k+l)}$, and taking a product over all choices of $z$ gives the probability that there is some bad vertex $z$ with an exponent of $(n - (k + l))$. Combining all this together gives an upper bound of $\varphi_{k,l}$ being false of:

$\displaystyle \binom{n}{k}\binom{n-k}{l} (1-2^{-k-1})^{n-k-l}$

And $k, l$ are constant, so the left two terms are polynomials while the rightmost term is an exponentially small function, and this implies that the whole expression tends to zero, as desired.

Break from proof.

## A bit of model theory

So what we’ve proved so far is that the probability of every formula of the form $\varphi_{k,l}$ being satisfied in $G(n,1/2)$ tends to 1.

Now look at the set of all such formulas

$\displaystyle \Phi = \{ \varphi_{k,l} : k,l \in \mathbb{N} \}$

We ask: is there any graph which satisfies all of these formulas? Certainly it cannot be finite, because a finite graph would not be able to satisfy formulas with sufficiently large values of $l, k > n$. But indeed, there is a countably infinite graph that works. It’s called the Rado graph, pictured below.

The Rado graph has some really interesting properties, such as that it contains every finite and countably infinite graph as induced subgraphs. Basically this means, as far as countably infinite graphs go, it’s the big momma of all graphs. It’s the graph in a very concrete sense of the word. It satisfies all of the formulas in $\Phi$, and in fact it’s uniquely determined by this, meaning that if any other countably infinite graph satisfies all the formulas in $\Phi$, then that graph is isomorphic to the Rado graph.

But for our purposes (proving a zero-one law), there’s a better perspective than graph theory on this object. In the logic perspective, the set $\Phi$ is called a theory, meaning a set of statements that you consider “axioms” in some logical system. And we’re asking whether there any model realizing the theory. That is, is there some logical system with a semantic interpretation (some mathematical object based on numbers, or sets, or whatever) that satisfies all the axioms?

A good analogy comes from the rational numbers, because they satisfy a similar property among all ordered sets. In fact, the rational numbers are the unique countable, ordered set with the property that it has no biggest/smallest element and is dense. That is, in the ordering there is always another element between any two elements you want. So the theorem says if you have two countable sets with these properties, then they are actually isomorphic as ordered sets, and they are isomorphic to the rational numbers.

So, while we won’t prove that the Rado graph is a model for our theory $\Phi$, we will use that fact to great benefit. One consequence of having a theory with a model is that the theory is consistent, meaning it can’t imply any contradictions. Another fact is that this theory $\Phi$ is complete. Completeness means that any formula or it’s negation is logically implied by the theory. Note these are syntactical implications (using standard rules of propositional logic), and have nothing to do with the model interpreting the theory.

The proof that $\Phi$ is complete actually follows from the uniqueness of the Rado graph as the only countable model of $\Phi$. Suppose the contrary, that $\Phi$ is not consistent, then there has to be some formula $\psi$ that is not provable, and it’s negation is also not provable, by starting from $\Phi$. Now extend $\Phi$ in two ways: by adding $\psi$ and by adding $\neg \psi$. Both of the new theories are still countable, and by a theorem from logic this means they both still have countable models. But both of these new models are also countable models of $\Phi$, so they have to both be the Rado graph. But this is very embarrassing for them, because we assumed they disagree on the truth of $\psi$.

So now we can go ahead and prove the zero-one law theorem.

Given an arbitrary property $\varphi \not \in \Psi$. Now either $\varphi$ or it’s negation can be derived from $\Phi$. Without loss of generality suppose it’s $\varphi$. Take all the formulas from the theory you need to derive $\varphi$, and note that since it is a proof in propositional logic you will only finitely many such $\varphi_{k,l}$. Now look at the probabilities of the $\varphi_{k,l}$: they are all true with probability tending to 1, so the implied statement of the proof of $\varphi$ (i.e., $\varphi$ itself) must also hold with probability tending to 1. And we’re done!

$\square$

If you don’t like model theory, there is another “purely combinatorial” proof of the zero-one law using something called Ehrenfeucht–Fraïssé games. It is a bit longer, though.

## Other zero-one laws

One might naturally ask two questions: what if your probability is not constant, and what other kinds of properties have zero-one laws? Both great questions.

For the first, there are some extra theorems. I’ll just describe one that has always seemed very strange to me. If your probability is of the form $p = n^{-\alpha}$ but $\alpha$ is irrational, then the zero-one law still holds! This is a theorem of Baldwin-Shelah-Spencer, and it really makes you wonder why irrational numbers would be so well behaved while rational numbers are not 🙂

For the second question, there is another theorem about monotone properties of graphs. Monotone properties come in two flavors, so called “increasing” and “decreasing.” I’ll describe increasing monotone properties and the decreasing counterpart should be obvious. A property is called monotone increasing if adding edges can never destroy the property. That is, with an empty graph you don’t have the property (or maybe you do), and as you start adding edges eventually you suddenly get the property, but then adding more edges can’t cause you to lose the property again. Good examples of this include connectivity, or the existence of a triangle.

So the theorem is that there is an identical zero-one law for monotone properties. Great!

It’s not so often that you get to see these neat applications of logic and model theory to graph theory and (by extension) computer science. But when you do get to apply them they seem very powerful and mysterious. I think it’s a good thing.

Until next time!