Key Switching in LWE

Last time we covered an operation in the LWE encryption scheme called modulus switching, which allows one to switch from one modulus to another, at the cost of introducing a small amount of extra noise, roughly $\sqrt{n}$, where $n$ is the dimension of the LWE ciphertext.

This time we’ll cover a more sophisticated operation called key switching, which allows one to switch an LWE ciphertext from being encrypted under one secret key to another, without ever knowing either secret key.

Reminder of LWE

A literal repetition of the last article. The LWE encryption scheme I’ll use has the following parameters:

  • A plaintext space $\mathbb{Z}/q\mathbb{Z}$, where $q \geq 2$ is a positive integer. This is the space that the underlying message comes from.
  • An LWE dimension $n \in \mathbb{N}$.
  • A discrete Gaussian error distribution $D$ with a mean of zero and a fixed standard deviation.

An LWE secret key is defined as a vector in $\{0, 1\}^n$ (uniformly sampled). An LWE ciphertext is defined as a vector $a = (a_1, \dots, a_n)$, sampled uniformly over $(\mathbb{Z} / q\mathbb{Z})^n$, and a scalar $b = \langle a, s \rangle + m + e$, where $e$ is drawn from $D$ and all arithmetic is done modulo $q$. Note that $e$ must be small for the encryption to be valid.

Sometimes I will denote by $\textup{LWE}_s(x)$ the LWE encryption of plaintext $x$ under the secret key $s$, and it should be understood that this is a fixed (but arbitrary) draw from the distribution of LWE ciphertexts described above.

Main idea: homomorphically almost-decrypt

The main idea is to encrypt each entry of the original secret key using the new secret key (this collection of encryptions is jointly called a key-switching key), and then use this to homomorphically evaluate the first step of the decryption function (i.e., compute $b – \langle a, s \rangle$). The result is an encryption of the (noisy) message under the new key.

First we’ll show how this works in a naïve sense. In particular, doing what I said in the last paragraph verbatim won’t work because the error will grow too large. But we’ll do it anyway, measure the error, and the remainder of the article will show how the gadget decomposition can be used to reduce the error.

Key switching, without gadget decompositions

Start with an LWE ciphertext for the plaintext $m$. Call it

$\displaystyle c = (a_1, \dots, a_n, b) \in (\mathbb{Z}/q\mathbb{Z})^{n+1}$

where

$\displaystyle b = \left ( \sum_{i=1}^n a_i s_i \right ) + m + e_{\textup{original}}$

and $s = (s_1, \dots, s_n) \in \{ 0,1\}^n$ is the secret key. Now say we have another secret key, possibly of a different dimension $t = (t_1, \dots, t_m) \in \{ 0, 1\}^m$, and we would like to switch the ciphertext $c$ to a ciphertext $c’$ which encrypts the same underlying message $m$, but under the new secret key $t$. That is, we would like to write

$\displaystyle c’ = (a’_1, \dots, a’_m, b’) \in (\mathbb{Z}/q\mathbb{Z})^{m+1}$

where

$\displaystyle b’ = \left ( \sum_{i=1}^n a’_i t_i \right ) + m + e_{\textup{original}} + e_{\textup{new}}$

implying that there is possibly some additional error introduced as a result. As usual, so long as the total error in the ciphertext remains small enough (and $m$ is stored in the significant bits of the underlying integer space), the result will still be a valid LWE ciphertext.

Define the key switching key $\textup{KSK}(s, t)$ as follows (I will omit the $s, t$ and just call it KSK from now on):

$\displaystyle \textup{KSK} = \{ \textup{KSK}_i = \textup{LWE}_t(s_i) = (x_{i, 1}, \dots, x_{i, m}, y_i) \mid i=1, \dots, n\}$

In other words, $\textup{KSK}_i$ encrypts bit $s_i$, and $y_i = \langle x_i, t \rangle + s_i + e_i$ makes it a valid LWE encryption.

Now the algorithm to switch keys is merely as follows (where the first vector has $m$ leading zeros to ensure the dimensions align):

$\displaystyle c’ = (0, \dots, 0, b) – \sum_{i=1}^n a_i \textup{KSK}_i$

This is computing a linear combination of the $\textup{KSK}_i$. The specific linear combination is the first step of LWE decryption ($b – \langle a, s \rangle$), but performed on ciphertexts of $b$ and the $s_i$. Note, $(0, \dots, 0, b)$ is a valid (but insecure) LWE ciphertext of $b$ under any secret key, in part because we’re pretending the LWE samples and error were all sampled as zero; an unlikely but coherent outcome used to jumpstart a homomorphic computation in more places than key switching. So if you wanted to, you could write $c’$ as follows, to highlight how we’re computing additions and linear scalings of LWE ciphertexts.

$\displaystyle c’ = \textup{LWE}_{\textup{t}}(b) – \sum_{i=1}^n a_i \textup{LWE}_t(s_i)$

This should be enough to show that $c’$ is a valid LWE encryption (if we accept that adding and scaling preserves LWE validity). But to warm up for the rest of the article we’ll reprove it with a slightly different technique. This will also help us understand the error growth. Because LWE naturally admits sums and scalar products with corresponding added error, we expect the error to grow proportionally to the number of additions and the magnitudes of the $a_i$’s. And you may already be able to tell that because the $a_i$’s are uniform $\mathbb{Z}/q\mathbb{Z}$ elements, this part will be far too large to be useful. Let’s make this explicit now.

To show it’s a valid LWE encryption, we define the function $\varphi_s$, defined on any LWE ciphertext $c = (a_1, \dots, a_n, b)$ as $\varphi_s(c) = b – \langle a, s \rangle$. Some authors call $\varphi_s$ the “phase” function, but I think of it as a close friend: the first step of the decryption function for LWE (the second step would be rounding off the error). Critically, an LWE encryption is valid if and only if $\varphi_s(c) = m + e$ (provided $e$ is sufficiently small).

Because $\varphi_s$ is a linear function, it factors through the definition of $c’$ nicely, and we get

$\displaystyle \begin{aligned} \varphi_t(c’) &= \varphi_t((0, \dots, 0, b)) – \sum_{i=1}^n a_i \varphi_t(\textup{KSK}_i) \\ &= b – \sum_{i=1}^n a_i (y_i – \langle x_i, t \rangle) \\ &= b – \sum_{i=1}^n a_i (s_i + e_i) \end{aligned}$

where (reminder) $e_i$ is the error sample from $\textup{KSK}_i$’s definition. Distributing $a_i$ across the $(s_i + e_i)$ simplifies everything nicely

$\displaystyle \begin{aligned} &= b – \sum_{i=1}^n a_i s_i – \sum_{i=1}^n a_i e_i \\ &= m + e_{\textup{original}} – \sum_{i=1}^n a_i e_i \end{aligned}$

Now as we foreshadowed, $e_{\textup{new}} = -\sum_{i=1}^n a_i e_i$ is simply too large. A typical LWE ciphertext will have error at least 1 (or it would be useless), and if $q = 2^{32}$, the $a_i$’s would also be of magnitude roughly $2^{31}$, so summing even two of those would corrupt even a 1-bit message stored in the most significant bit of the plaintext.

The way to deal with this is to use a bit decomposition.

Key switching, with gadget decompositions

Recall from the gadget decomposition article that the core function of a gadget decomposition is to preserve the ultimate value of a dot product while making the vectors multiplicands larger (spending space/time) but also making the size of the coefficients of one of the vectors smaller (reducing the accumulation of error due to that dot product).

This is exactly the approach we’ll take here. The “dot product” in question is $(a_1, \dots, a_n) \cdot \textup{KSK}$ (where KSK is viewed as a matrix), and we’ll expand the values $a_i$ into a vector of its digits in a base-$B$ number system, while modifying the key switching key so that those missing powers of $B$ are part of the LWE encryption. This will result in replacing the error term that looked like $\sum_{i=1}^n a_i e_i$ with an error term like $\sum_{i=1}^n c B e_i$ for some small constant $c$ (expect it to be even less than $B$).

More specifically, define decomposition parameters as a triple of numbers $(B, k, L)$. The number $B$ is a power of 2 no bigger than $q/2$, and $L$, or the number of levels of the decomposition, is the positive integer such that $B^L = q$ (this is forced by the choice of $B$). Then finally, $k$ is a number between $0$ and $L-1$ describing the “lowest level” (or least-significant digit) included in the decomposition.

An error-free decomposition sets the parameter $k=0$, and this is defined simply as a base-$B$ representation of a number. For example, suppose $q = 2^{32}$, and $(B, k, L) = (256, 0, 4)$, and we’re decomposing $x=2^{32} – 2$. Then $\textup{Decomp}_{256, 0, 4}(x) = (254, 255, 255, 255)$. I subtracted 2 to emphasize that the digits are little-Endian (the right-most entry is the most significant, representing the $256^3$ place).

An approximate decomposition is one with $k > 0$. For example, suppose $(B, k, L) = (256, 2, 4)$ and again $x=2^{32} – 2$. Setting $k=2$ means that we represent this number as if it were $(0, 0, 255, 255)$, wiping out the two least significant digits. The error of this approximation is $65534 = 254 + 255 \cdot 256^1$. As we will see, an approximate decomposition may help reduce overall error by splitting the newly introduced error into a sum of two terms, where $k$ scales the error differently in each term.

Let’s go through the key-switching key derivation again, using an error-free decomposition $(B, 0, L)$. First, re-define the key switching key as follows.

$\displaystyle \textup{KSK} = \{ \textup{KSK}_{i, j} = \textup{LWE}_t(s_i B^j) \mid i=1, \dots, n ; j = 0, \dots, L-1\}$

Note that this increases the dimension of the key-switching key by 1. Previously the key-switching key was a list of LWE ciphertexts (2-dimensional array of numbers), and now it’s a 3-dimensional array, with the new dimension corresponding to the decomposition digit $j$.

Because the powers of $B$ are attached to the message, they will factor out and allow us to reconstruct the original $a_i$’s, but they will not be included in the error part because error is added to the message during encryption.

Next, to perform the key switch, define $\textup{Decomp}(a_i) = (a_{i,0}, \dots, a_{i,L-1})$ and compute

$\displaystyle c’ = (0, \dots, 0, b) – \sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} \textup{KSK}_{i,j}$

This is the same as the original key switch, but the extra summation accounts for the extra dimension introduced by the gadget decomposition. Then we can repeat the same $\varphi_t$ trick and see how the original $a_i$’s are reconstructed.

$\displaystyle \begin{aligned} \varphi_t(c’) &= b – \sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} \varphi_t(\textup{KSK}_{i,j}) \\ &= b -\sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} (s_i B^j + e_i) \\ &= b -\sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} s_i B^j – \sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} e_i \\ &= b -\sum_{i=1}^n a_i s_i – \sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} e_i \\ &= m + e_{\textup{original}} – \sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} e_i \end{aligned}$

One key ingredient above is noticing that in $\sum_{i=1}^n \sum_{j=0}^{L-1} a_{i,j} s_i B^j$, the $s_i$ factors out of the innermost sum, and what you have left is $\sum_{j=0}^{L-1} a_{i,j} B^j$, which is exactly how to reconstruct $a_i$ from its base-$B$ digits.

The second key ingredient is that the innermost term on the second line is $a_{i,j} (s_i B^j + e_i)$, which means that only the digits $a_{i,j}$ are multiplied by the error terms, not including the powers of $B$, and so the final error can be bounded by the largest allowable value of a single digit $B-1$, resulting in the new error being $L (B-1) \sum_{i=1}^n e_i$. For a Gaussian centered at zero, the expectation of these errors is zero, and using standard bounding arguments like Chernoff bounds, you can prove that with high probability this new error is at most $L(B-1) \sigma \sqrt{2n \log n}$, where $\sigma$ is the standard deviation of the error distribution.

Now, finally, we can run through this argument one more time, but using an approximate decomposition. This merely changes the sum’s lower bound from $j=0$ to $j=k$. Start by calling $\tilde{a}_i = \sum_{j=k}^{L-1} a_{i,j} B^j$, the approximation of $a_i$ from its most significant bits. Then the error of this approximation is $a_i – \tilde{a}_i = \sum_{j=0}^{k-1} a_{i,j} B^j$, a relatively small quantity at most $(B^k – 1) / (B-1)$ (if each $a_{i,j} = B-1$ is as large as possible).

$\displaystyle \begin{aligned} \varphi_t(c’) &= b – \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} \varphi_t(\textup{KSK}_{i,j}) \\ &= b -\sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} (s_i B^j + e_i) \\ &= b -\sum_{i=1}^n s_i \sum_{j=k}^{L-1} a_{i,j} B^j – \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} e_i \\ &= b -\sum_{i=1}^n s_i \tilde{a}_i – \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} e_i \end{aligned}$

Mentally zoom in on the first sum $\sum_{i=1}^n s_i \tilde{a}_i$. Use the trick of adding zero to get

$\displaystyle \sum_{i=1}^n s_i \tilde{a}_i = \sum_{i=1}^n s_i (a_i + \tilde{a}_i – a_i) = \sum_{i=1}^n s_i a_i – \sum_{i=1}^n s_i(a_i – \tilde{a}_i)$

The term $\sum_{i=1}^n s_i(a_i – \tilde{a}_i)$ is part of our new error term, and recalling that the secret key bits are binary, you should think of this in expectation as roughly $\frac{n}{2} B^{k-1}$ (more precisely, $\frac{n}{2} (B^{k}-1)/(B-1)$).

Continuing, we arrive at

$\displaystyle \begin{aligned} \varphi_t(c’) &= b -\sum_{i=1}^n a_i s_i – \sum_{i=1}^n s_i(a_i – \tilde{a}_i) – \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} e_i \\ &= m + e_{\textup{original}} – \sum_{i=1}^n s_i(a_i – \tilde{a}_i) – \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} e_i \end{aligned}$

Rough error analysis

Now the choice of $k$ admits a tradeoff that one can optimize for to minimize the total newly introduced error. I’m going to switch to a sloppy mode of math to heuristically navigate this tradeoff.

The triangle inequality lets us bound the magnitude of the error by the sum of the magnitudes of the parts, i.e., the error is bounded from above by

$\displaystyle \left | \sum_{i=1}^n s_i(a_i – \tilde{a}_i) \right | + \left | \sum_{i=1}^n \sum_{j=k}^{L-1} a_{i,j} e_i \right |$

The left term is like $\frac{n}{2} B^{k-1}$ as we stated earlier, and with high probability it’s at most $(n/2 + \sqrt{n \log n}) B^{k-1}$. The right term is at most $(L-k)B \sum_{i=1}^n e_i$, (worst case size of $a_{i,j}$, increasing $B-1$ to $B$ because why not), and with high probability the sum of the $e_i$ is like $\sigma \sqrt{2n \log n}$, making the whole term bounded by $(L-k)B \sigma \sqrt{2n \log n}$. So we want to minimize the sum

$\displaystyle (n/2 + \sqrt{n \log n}) B^{k-1} + (L-k)B \sigma \sqrt{2n \log n}$

We could try to explicitly optimize this for $k$, treating the other terms as constant, but it won’t be nice because $k$ is present in both a linear term and an exponent. We could also just stare at it and think. The approximation error (the term on the left) is going to get exponentially larger as $k$ grows, so we want to keep $k$ relatively small. But on the other hand, the standard deviation $\sigma$ should be much larger than $n$ to keep LWE secure. This is effectively what we’re trying to suppress: error that grows like $O(n)$ is small enough to deal with, but error that grows like $\omega(n)$ is problematic. Increasing $k$ gives us a meager (but nontrivial) means to reduce the constant coefficient on that part of the error in exchange for $\Theta(n)$ growth with in the other term.

I admit, as of the time of this writing I still don’t understand how to set production security parameters for LWE. Is it still linear in $n$? Super-linear? Not sure. I’m betting future Jeremy will clarify this to me in another article. Even if it were linear in $n$, the right term multiplies $\sigma$ by $\sqrt{n \log n}$ which makes the whole thing super-linear, whereas the left term adds a square root factor. So the tradeoff in $k$ should still help.

Until I understand LWE security, I won’t have the asymptotics I need to analyze this further. Moreover, the allowed values of $B, k$ are so small that we can brute force evaluate all options. For example, if $B = 16$ then $k$ can be between 0 and 7. And realistically, if $n \approx 2^{10}$, then letting $k = 4$ makes the first term roughly $2^{26}$, which leaves only 6 bits left for the message (further reduced by any error introduced by the second term).

Thanks to Cathie Yun and Asra Ali for providing feedback on an early draft of this article.

Until next time!

Formulating the Support Vector Machine Optimization Problem

The hypothesis and the setup

This blog post has an interactive demo (mostly used toward the end of the post). The source for this demo is available in a Github repository.

Last time we saw how the inner product of two vectors gives rise to a decision rule: if $ w$ is the normal to a line (or hyperplane) $ L$, the sign of the inner product $ \langle x, w \rangle$ tells you whether $ x$ is on the same side of $ L$ as $ w$.

Let’s translate this to the parlance of machine-learning. Let $ x \in \mathbb{R}^n$ be a training data point, and $ y \in \{ 1, -1 \}$ is its label (green and red, in the images in this post). Suppose you want to find a hyperplane which separates all the points with -1 labels from those with +1 labels (assume for the moment that this is possible). For this and all examples in this post, we’ll use data in two dimensions, but the math will apply to any dimension.

problem_setup

Some data labeled red and green, which is separable by a hyperplane (line).

The hypothesis we’re proposing to separate these points is a hyperplane, i.e. a linear subspace that splits all of $ \mathbb{R}^n$ into two halves. The data that represents this hyperplane is a single vector $ w$, the normal to the hyperplane, so that the hyperplane is defined by the solutions to the equation $ \langle x, w \rangle = 0$.

As we saw last time, $ w$ encodes the following rule for deciding if a new point $ z$ has a positive or negative label.

$ \displaystyle h_w(z) = \textup{sign}(\langle w, x \rangle)$

You’ll notice that this formula only works for the normals $ w$ of hyperplanes that pass through the origin, and generally we want to work with data that can be shifted elsewhere. We can resolve this by either adding a fixed term $ b \in \mathbb{R}$—often called a bias because statisticians came up with it—so that the shifted hyperplane is the set of solutions to $ \langle x, w \rangle + b = 0$. The shifted decision rule is:

$ \displaystyle h_w(z) = \textup{sign}(\langle w, x \rangle + b)$

Now the hypothesis is the pair of vector-and-scalar $ w, b$.

The key intuitive idea behind the formulation of the SVM problem is that there are many possible separating hyperplanes for a given set of labeled training data. For example, here is a gif showing infinitely many choices.

svm_lots_of_choices.gif

The question is: how can we find the separating hyperplane that not only separates the training data, but generalizes as well as possible to new data? The assumption of the SVM is that a hyperplane which separates the points, but is also as far away from any training point as possible, will generalize best.

optimal_example.png

While contrived, it’s easy to see that the separating hyperplane is as far as possible from any training point.

More specifically, fix a labeled dataset of points $ (x_i, y_i)$, or more precisely:

$ \displaystyle D = \{ (x_i, y_i) \mid i = 1, \dots, m, x_i \in \mathbb{R}^{n}, y_i \in \{1, -1\}  \}$

And a hypothesis defined by the normal $ w \in \mathbb{R}^{n}$ and a shift $ b \in \mathbb{R}$. Let’s also suppose that $ (w,b)$ defines a hyperplane that correctly separates all the training data into the two labeled classes, and we just want to measure its quality. That measure of quality is the length of its margin.

Definition: The geometric margin of a hyperplane $ w$ with respect to a dataset $ D$ is the shortest distance from a training point $ x_i$ to the hyperplane defined by $ w$.

The best hyperplane has the largest possible margin.

This margin can even be computed quite easily using our work from last post. The distance from $ x$ to the hyperplane defined by $ w$ is the same as the length of the projection of $ x$ onto $ w$. And this is just computed by an inner product.

decision-rule-3

If the tip of the $ x$ arrow is the point in question, then $ a$ is the dot product, and $ b$ the distance from $ x$ to the hyperplane $ L$ defined by $ w$.

A naive optimization objective

If we wanted to, we could stop now and define an optimization problem that would be very hard to solve. It would look like this:

$ \displaystyle \begin{aligned} & \max_{w} \min_{x_i} \left | \left \langle x_i, \frac{w}{\|w\|} \right \rangle + b \right | & \\ \textup{subject to \ \ } & \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i) & \textup{ for every } i = 1, \dots, m \end{aligned}$

The formulation is hard. The reason is it’s horrifyingly nonlinear. In more detail:

  1. The constraints are nonlinear due to the sign comparisons.
  2. There’s a min and a max! A priori, we have to do this because we don’t know which point is going to be the closest to the hyperplane.
  3. The objective is nonlinear in two ways: the absolute value and the projection requires you to take a norm and divide.

The rest of this post (and indeed, a lot of the work in grokking SVMs) is dedicated to converting this optimization problem to one in which the constraints are all linear inequalities and the objective is a single, quadratic polynomial we want to minimize or maximize.

Along the way, we’ll notice some neat features of the SVM.

Trick 1: linearizing the constraints

To solve the first problem, we can use a trick. We want to know whether $ \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i)$ for a labeled training point $ (x_i, y_i)$. The trick is to multiply them together. If their signs agree, then their product will be positive, otherwise it will be negative.

So each constraint becomes:

$ \displaystyle (\langle x_i, w \rangle + b) \cdot y_i \geq 0$

This is still linear because $ y_i$ is a constant (input) to the optimization problem. The variables are the coefficients of $ w$.

The left hand side of this inequality is often called the functional margin of a training point, since, as we will see, it still works to classify $ x_i$, even if $ w$ is scaled so that it is no longer a unit vector. Indeed, the sign of the inner product is independent of how $ w$ is scaled.

Trick 1.5: the optimal solution is midway between classes

This small trick is to notice that if $ w$ is the supposed optimal separating hyperplane, i.e. its margin is maximized, then it must necessarily be exactly halfway in between the closest points in the positive and negative classes.

In other words, if $ x_+$ and $ x_-$ are the closest points in the positive and negative classes, respectively, then $ \langle x_{+}, w \rangle + b = -(\langle x_{-}, w \rangle + b)$. If this were not the case, then you could adjust the bias, shifting the decision boundary along $ w$ until it they are exactly equal, and you will have increased the margin. The closest point, say $ x_+$ will have gotten farther away, and the closest point in the opposite class, $ x_-$ will have gotten closer, but will not be closer than $ x_+$.

Trick 2: getting rid of the max + min

Resolving this problem essentially uses the fact that the hypothesis, which comes in the form of the normal vector $ w$, has a degree of freedom in its length. To explain the details of this trick, we’ll set $ b=0$ which simplifies the intuition.

Indeed, in the animation below, I can increase or decrease the length of $ w$ without changing the decision boundary.

svm_w_length.gif

I have to keep my hand very steady (because I was too lazy to program it so that it only increases/decreases in length), but you can see the point. The line is perpendicular to the normal vector, and it doesn’t depend on the length.

Let’s combine this with tricks 1 and 1.5. If we increase the length of $ w$, that means the absolute values of the dot products $ \langle x_i, w \rangle $ used in the constraints will all increase by the same amount (without changing their sign). Indeed, for any vector $ a$ we have $ \langle a, w \rangle = \|w \| \cdot \langle a, w / \| w \| \rangle$.

In this world, the inner product measurement of distance from a point to the hyperplane is no longer faithful. The true distance is $ \langle a, w / \| w \| \rangle$, but the distance measured by $ \langle a, w \rangle$ is measured in units of $ 1 / \| w \|$.

units.png

In this example, the two numbers next to the green dot represent the true distance of the point from the hyperplane, and the dot product of the point with the normal (respectively). The dashed lines are the solutions to <x, w> = 1. The magnitude of w is 2.2, the inverse of that is 0.46, and indeed 2.2 = 4.8 * 0.46 (we’ve rounded the numbers).

Now suppose we had the optimal hyperplane and its normal $ w$. No matter how near (or far) the nearest positively labeled training point $ x$ is, we could scale the length of $ w$ to force $ \langle x, w \rangle = 1$. This is the core of the trick. One consequence is that the actual distance from $ x$ to the hyperplane is $ \frac{1}{\| w \|} = \langle x, w / \| w \| \rangle$.

units2.png

The same as above, but with the roles reversed. We’re forcing the inner product of the point with w to be 1. The true distance is unchanged.

In particular, if we force the closest point to have inner product 1, then all other points will have inner product at least 1. This has two consequences. First, our constraints change to $ \langle x_i, w \rangle \cdot y_i \geq 1$ instead of $ \geq 0$. Second, we no longer need to ask which point is closest to the candidate hyperplane! Because after all, we never cared which point it was, just how far away that closest point was. And now we know that it’s exactly $ 1 / \| w \|$ away. Indeed, if the optimal points weren’t at that distance, then that means the closest point doesn’t exactly meet the constraint, i.e. that $ \langle x, w \rangle > 1$ for every training point $ x$. We could then scale $ w$ shorter until $ \langle x, w \rangle = 1$, hence increasing the margin $ 1 / \| w \|$.

In other words, the coup de grâce, provided all the constraints are satisfied, the optimization objective is just to maximize $ 1 / \| w \|$, a.k.a. to minimize $ \| w \|$.

This intuition is clear from the following demonstration, which you can try for yourself. In it I have a bunch of positively and negatively labeled points, and the line in the center is the candidate hyperplane with normal $ w$ that you can drag around. Each training point has two numbers next to it. The first is the true distance from that point to the candidate hyperplane; the second is the inner product with $ w$. The two blue dashed lines are the solutions to $ \langle x, w \rangle = \pm 1$. To solve the SVM by hand, you have to ensure the second number is at least 1 for all green points, at most -1 for all red points, and then you have to make $ w$ as short as possible. As we’ve discussed, shrinking $ w$ moves the blue lines farther away from the separator, but in order to satisfy the constraints the blue lines can’t go further than any training point. Indeed, the optimum will have those blue lines touching a training point on each side.

svm_solve_by_hand

 

I bet you enjoyed watching me struggle to solve it. And while it’s probably not the optimal solution, the idea should be clear.

The final note is that, since we are now minimizing $ \| w \|$, a formula which includes a square root, we may as well minimize its square $ \| w \|^2 = \sum_j w_j^2$. We will also multiply the objective by $ 1/2$, because when we eventually analyze this problem we will take a derivative, and the square in the exponent and the $ 1/2$ will cancel.

The final form of the problem

Our optimization problem is now the following (including the bias again):

$ \displaystyle \begin{aligned} & \min_{w}  \frac{1}{2} \| w \|^2 & \\ \textup{subject to \ \ } & (\langle x_i, w \rangle + b) \cdot y_i \geq 1 & \textup{ for every } i = 1, \dots, m \end{aligned}$

This is much simpler to analyze. The constraints are all linear inequalities (which, because of linear programming, we know are tractable to optimize). The objective to minimize, however, is a convex quadratic function of the input variables—a sum of squares of the inputs.

Such problems are generally called quadratic programming problems (or QPs, for short). There are general methods to find solutions! However, they often suffer from numerical stability issues and have less-than-satisfactory runtime. Luckily, the form in which we’ve expressed the support vector machine problem is specific enough that we can analyze it directly, and find a way to solve it without appealing to general-purpose numerical solvers.

We will tackle this problem in a future post (planned for two posts sequel to this one). Before we close, let’s just make a few more observations about the solution to the optimization problem.

Support Vectors

In Trick 1.5 we saw that the optimal separating hyperplane has to be exactly halfway between the two closest points of opposite classes. Moreover, we noticed that, provided we’ve scaled $ \| w \|$ properly, these closest points (there may be multiple for positive and negative labels) have to be exactly “distance” 1 away from the separating hyperplane.

Another way to phrase this without putting “distance” in scare quotes is to say that, if $ w$ is the normal vector of the optimal separating hyperplane, the closest points lie on the two lines $ \langle x_i, w \rangle + b = \pm 1$.

Now that we have some intuition for the formulation of this problem, it isn’t a stretch to realize the following. While a dataset may include many points from either class on these two lines $ \langle x_i, w \rangle = \pm 1$, the optimal hyperplane itself does not depend on any of the other points except these closest points.

This fact is enough to give these closest points a special name: the support vectors.

We’ll actually prove that support vectors “are all you need” with full rigor and detail next time, when we cast the optimization problem in this post into the “dual” setting. To avoid vague names, the formulation described in this post called the “primal” problem. The dual problem is derived from the primal problem, with special variables and constraints chosen based on the primal variables and constraints. Next time we’ll describe in brief detail what the dual does and why it’s important, but we won’t have nearly enough time to give a full understanding of duality in optimization (such a treatment would fill a book).

When we compute the dual of the SVM problem, we will see explicitly that the hyperplane can be written as a linear combination of the support vectors. As such, once you’ve found the optimal hyperplane, you can compress the training set into just the support vectors, and reproducing the same optimal solution becomes much, much faster. You can also use the support vectors to augment the SVM to incorporate streaming data (throw out all non-support vectors after every retraining).

Eventually, when we get to implementing the SVM from scratch, we’ll see all this in action.

Until then!

Big Dimensions, and What You Can Do About It

Data is abundant, data is big, and big is a problem. Let me start with an example. Let’s say you have a list of movie titles and you want to learn their genre: romance, action, drama, etc. And maybe in this scenario IMDB doesn’t exist so you can’t scrape the answer. Well, the title alone is almost never enough information. One nice way to get more data is to do the following:

  1. Pick a large dictionary of words, say the most common 100,000 non stop-words in the English language.
  2. Crawl the web looking for documents that include the title of a film.
  3. For each film, record the counts of all other words appearing in those documents.
  4. Maybe remove instances of “movie” or “film,” etc.

After this process you have a length-100,000 vector of integers associated with each movie title. IMDB’s database has around 1.5 million listed movies, and if we have a 32-bit integer per vector entry, that’s 600 GB of data to get every movie.

One way to try to find genres is to cluster this (unlabeled) dataset of vectors, and then manually inspect the clusters and assign genres. With a really fast computer we could simply run an existing clustering algorithm on this dataset and be done. Of course, clustering 600 GB of data takes a long time, but there’s another problem. The geometric intuition that we use to design clustering algorithms degrades as the length of the vectors in the dataset grows. As a result, our algorithms perform poorly. This phenomenon is called the “curse of dimensionality” (“curse” isn’t a technical term), and we’ll return to the mathematical curiosities shortly.

A possible workaround is to try to come up with faster algorithms or be more patient. But a more interesting mathematical question is the following:

Is it possible to condense high-dimensional data into smaller dimensions and retain the important geometric properties of the data?

This goal is called dimension reduction. Indeed, all of the chatter on the internet is bound to encode redundant information, so for our movie title vectors it seems the answer should be “yes.” But the questions remain, how does one find a low-dimensional condensification? (Condensification isn’t a word, the right word is embedding, but embedding is overloaded so we’ll wait until we define it) And what mathematical guarantees can you prove about the resulting condensed data? After all, it stands to reason that different techniques preserve different aspects of the data. Only math will tell.

In this post we’ll explore this so-called “curse” of dimensionality, explain the formality of why it’s seen as a curse, and implement a wonderfully simple technique called “the random projection method” which preserves pairwise distances between points after the reduction. As usual, and all the code, data, and tests used in the making of this post are on Github.

Some curious issues, and the “curse”

We start by exploring the curse of dimensionality with experiments on synthetic data.

In two dimensions, take a circle centered at the origin with radius 1 and its bounding square.

circle.png

The circle fills up most of the area in the square, in fact it takes up exactly $ \pi$ out of 4 which is about 78%. In three dimensions we have a sphere and a cube, and the ratio of sphere volume to cube volume is a bit smaller, $ 4 \pi /3$ out of a total of 8, which is just over 52%. What about in a thousand dimensions? Let’s try by simulation.

import random

def randUnitCube(n):
   return [(random.random() - 0.5)*2 for _ in range(n)]

def sphereCubeRatio(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   return sum(1 for x in randomSample if sum(a**2 for a in x) &lt;= 1) / numSamples 

The result is as we computed for small dimension,

 &gt;&gt;&gt; sphereCubeRatio(2,10000)
0.7857
&gt;&gt;&gt; sphereCubeRatio(3,10000)
0.5196

And much smaller for larger dimension

&gt;&gt;&gt; sphereCubeRatio(20,100000) # 100k samples
0.0
&gt;&gt;&gt; sphereCubeRatio(20,1000000) # 1M samples
0.0
&gt;&gt;&gt; sphereCubeRatio(20,2000000)
5e-07

Forget a thousand dimensions, for even twenty dimensions, a million samples wasn’t enough to register a single random point inside the unit sphere. This illustrates one concern, that when we’re sampling random points in the $ d$-dimensional unit cube, we need at least $ 2^d$ samples to ensure we’re getting a even distribution from the whole space. In high dimensions, this face basically rules out a naive Monte Carlo approximation, where you sample random points to estimate the probability of an event too complicated to sample from directly. A machine learning viewpoint of the same problem is that in dimension $ d$, if your machine learning algorithm requires a representative sample of the input space in order to make a useful inference, then you require $ 2^d$ samples to learn.

Luckily, we can answer our original question because there is a known formula for the volume of a sphere in any dimension. Rather than give the closed form formula, which involves the gamma function and is incredibly hard to parse, we’ll state the recursive form. Call $ V_i$ the volume of the unit sphere in dimension $ i$. Then $ V_0 = 1$ by convention, $ V_1 = 2$ (it’s an interval), and $ V_n = \frac{2 \pi V_{n-2}}{n}$. If you unpack this recursion you can see that the numerator looks like $ (2\pi)^{n/2}$ and the denominator looks like a factorial, except it skips every other number. So an even dimension would look like $ 2 \cdot 4 \cdot \dots \cdot n$, and this grows larger than a fixed exponential. So in fact the total volume of the sphere vanishes as the dimension grows! (In addition to the ratio vanishing!)

def sphereVolume(n):
   values = [0] * (n+1)
   for i in range(n+1):
      if i == 0:
         values[i] = 1
      elif i == 1:
         values[i] = 2
      else:
         values[i] = 2*math.pi / i * values[i-2]

   return values[-1]

This should be counterintuitive. I think most people would guess, when asked about how the volume of the unit sphere changes as the dimension grows, that it stays the same or gets bigger.  But at a hundred dimensions, the volume is already getting too small to fit in a float.

&gt;&gt;&gt; sphereVolume(20)
0.025806891390014047
&gt;&gt;&gt; sphereVolume(100)
2.3682021018828297e-40
&gt;&gt;&gt; sphereVolume(1000)
0.0

The scary thing is not just that this value drops, but that it drops exponentially quickly. A consequence is that, if you’re trying to cluster data points by looking at points within a fixed distance $ r$ of one point, you have to carefully measure how big $ r$ needs to be to cover the same proportional volume as it would in low dimension.

Here’s a related issue. Say I take a bunch of points generated uniformly at random in the unit cube.

from itertools import combinations

def distancesRandomPoints(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   pairwiseDistances = [dist(x,y) for (x,y) in combinations(randomSample, 2)]
   return pairwiseDistances

In two dimensions, the histogram of distances between points looks like this

2d-distances.png

However, as the dimension grows the distribution of distances changes. It evolves like the following animation, in which each frame is an increase in dimension from 2 to 100.

distances-animation.gif

The shape of the distribution doesn’t appear to be changing all that much after the first few frames, but the center of the distribution tends to infinity (in fact, it grows like $ \sqrt{n}$). The variance also appears to stay constant. This chart also becomes more variable as the dimension grows, again because we should be sampling exponentially many more points as the dimension grows (but we don’t). In other words, as the dimension grows the average distance grows and the tightness of the distribution stays the same. So at a thousand dimensions the average distance is about 26, tightly concentrated between 24 and 28. When the average is a thousand, the distribution is tight between 998 and 1002. If one were to normalize this data, it would appear that random points are all becoming equidistant from each other.

So in addition to the issues of runtime and sampling, the geometry of high-dimensional space looks different from what we expect. To get a better understanding of “big data,” we have to update our intuition from low-dimensional geometry with analysis and mathematical theorems that are much harder to visualize.

The Johnson-Lindenstrauss Lemma

Now we turn to proving dimension reduction is possible. There are a few methods one might first think of, such as look for suitable subsets of coordinates, or sums of subsets, but these would all appear to take a long time or they simply don’t work.

Instead, the key technique is to take a random linear subspace of a certain dimension, and project every data point onto that subspace. No searching required. The fact that this works is called the Johnson-Lindenstrauss Lemma. To set up some notation, we’ll call $ d(v,w)$ the usual distance between two points.

Lemma [Johnson-Lindenstrauss (1984)]: Given a set $ X$ of $ n$ points in $ \mathbb{R}^d$, project the points in $ X$ to a randomly chosen subspace of dimension $ c$. Call the projection $ \rho$. For any $ \varepsilon > 0$, if $ c$ is at least $ \Omega(\log(n) / \varepsilon^2)$, then with probability at least 1/2 the distances between points in $ X$ are preserved up to a factor of $ (1+\varepsilon)$. That is, with good probability every pair $ v,w \in X$ will satisfy

$ \displaystyle \| v-w \|^2 (1-\varepsilon) \leq \| \rho(v) – \rho(w) \|^2 \leq \| v-w \|^2 (1+\varepsilon)$

Before we do the proof, which is quite short, it’s important to point out that the target dimension $ c$ does not depend on the original dimension! It only depends on the number of points in the dataset, and logarithmically so. That makes this lemma seem like pure magic, that you can take data in an arbitrarily high dimension and put it in a much smaller dimension.

On the other hand, if you include all of the hidden constants in the bound on the dimension, it’s not that impressive. If your data have a million dimensions and you want to preserve the distances up to 1% ($ \varepsilon = 0.01$), the bound is bigger than a million! If you decrease the preservation $ \varepsilon$ to 10% (0.1), then you get down to about 12,000 dimensions, which is more reasonable. At 45% the bound drops to around 1,000 dimensions. Here’s a plot showing the theoretical bound on $ c$ in terms of $ \varepsilon$ for $ n$ fixed to a million.

boundplot

 

But keep in mind, this is just a theoretical bound for potentially misbehaving data. Later in this post we’ll see if the practical dimension can be reduced more than the theory allows. As we’ll see, an algorithm run on the projected data is still effective even if the projection goes well beyond the theoretical bound. Because the theorem is known to be tight in the worst case (see the notes at the end) this speaks more to the robustness of the typical algorithm than to the robustness of the projection method.

A second important note is that this technique does not necessarily avoid all the problems with the curse of dimensionality. We mentioned above that one potential problem is that “random points” are roughly equidistant in high dimensions. Johnson-Lindenstrauss actually preserves this problem because it preserves distances! As a consequence, you won’t see strictly better algorithm performance if you project (which we suggested is possible in the beginning of this post). But you will alleviate slow runtimes if the runtime depends exponentially on the dimension. Indeed, if you replace the dimension $ d$ with the logarithm of the number of points $ \log n$, then $ 2^d$ becomes linear in $ n$, and $ 2^{O(d)}$ becomes polynomial.

Proof of the J-L lemma

Let’s prove the lemma.

Proof. To start we make note that one can sample from the uniform distribution on dimension-$ c$ linear subspaces of $ \mathbb{R}^d$ by choosing the entries of a $ c \times d$ matrix $ A$ independently from a normal distribution with mean 0 and variance 1. Then, to project a vector $ x$ by this matrix (call the projection $ \rho$), we can compute

$ \displaystyle \rho(x) = \frac{1}{\sqrt{c}}A x$

Now fix $ \varepsilon > 0$ and fix two points in the dataset $ x,y$. We want an upper bound on the probability that the following is false

$ \displaystyle \| x-y \|^2 (1-\varepsilon) \leq \| \rho(x) – \rho(y) \|^2 \leq \| x-y \|^2 (1+\varepsilon)$

Since that expression is a pain to work with, let’s rearrange it by calling $ u = x-y$, and rearranging (using the linearity of the projection) to get the equivalent statement.

$ \left | \| \rho(u) \|^2 – \|u \|^2 \right | \leq \varepsilon \| u \|^2$

And so we want a bound on the probability that this event does not occur, meaning the inequality switches directions.

Once we get such a bound (it will depend on $ c$ and $ \varepsilon$) we need to ensure that this bound is true for every pair of points. The union bound allows us to do this, but it also requires that the probability of the bad thing happening tends to zero faster than $ 1/\binom{n}{2}$. That’s where the $ \log(n)$ will come into the bound as stated in the theorem.

Continuing with our use of $ u$ for notation, define $ X$ to be the random variable $ \frac{c}{\| u \|^2} \| \rho(u) \|^2$. By expanding the notation and using the linearity of expectation, you can show that the expected value of $ X$ is $ c$, meaning that in expectation, distances are preserved. We are on the right track, and just need to show that the distribution of $ X$, and thus the possible deviations in distances, is tightly concentrated around $ c$. In full rigor, we will show

$ \displaystyle \Pr [X \geq (1+\varepsilon) c] < e^{-(\varepsilon^2 – \varepsilon^3) \frac{c}{4}}$

Let $ A_i$ denote the $ i$-th column of $ A$. Define by $ X_i$ the quantity $ \langle A_i, u \rangle / \| u \|$. This is a weighted average of the entries of $ A_i$ by the entries of $ u$. But since we chose the entries of $ A$ from the normal distribution, and since a weighted average of normally distributed random variables is also normally distributed (has the same distribution), $ X_i$ is a $ N(0,1)$ random variable. Moreover, each column is independent. This allows us to decompose $ X$ as

$ X = \frac{k}{\| u \|^2} \| \rho(u) \|^2 = \frac{\| Au \|^2}{\| u \|^2}$

Expanding further,

$ X = \sum_{i=1}^c \frac{\| A_i u \|^2}{\|u\|^2} = \sum_{i=1}^c X_i^2$

Now the event $ X \leq (1+\varepsilon) c$ can be expressed in terms of the nonegative variable $ e^{\lambda X}$, where $ 0 < \lambda < 1/2$ is parameter, to get

$ \displaystyle \Pr[X \geq (1+\varepsilon) c] = \Pr[e^{\lambda X} \geq e^{(1+\varepsilon)c \lambda}]$

This will become useful because the sum $ X = \sum_i X_i^2$ will split into a product momentarily. First we apply Markov’s inequality, which says that for any nonnegative random variable $ Y$, $ \Pr[Y \geq t] \leq \mathbb{E}[Y] / t$. This lets us write

$ \displaystyle \Pr[e^{\lambda X} \geq e^{(1+\varepsilon) c \lambda}] \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{(1+\varepsilon) c \lambda}}$

Now we can split up the exponent $ \lambda X$ into $ \sum_{i=1}^c \lambda X_i^2$, and using the i.i.d.-ness of the $ X_i^2$ we can rewrite the RHS of the inequality as

$ \left ( \frac{\mathbb{E}[e^{\lambda X_1^2}]}{e^{(1+\varepsilon)\lambda}} \right )^c$

A similar statement using $ -\lambda$ is true for the $ (1-\varepsilon)$ part, namely that

$ \displaystyle \Pr[X \leq (1-\varepsilon)c] \leq \left ( \frac{\mathbb{E}[e^{-\lambda X_1^2}]}{e^{-(1-\varepsilon)\lambda}} \right )^c$

The last thing that’s needed is to bound $ \mathbb{E}[e^{\lambda X_i^2}]$, but since $ X_i^2 \sim N(0,1)$, we can use the known density function for a normal distribution, and integrate to get the exact value $ \mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}}$. Including this in the bound gives us a closed-form bound in terms of $ \lambda, c, \varepsilon$. Using standard calculus the optimal $ \lambda \in (0,1/2)$ is $ \lambda = \varepsilon / 2(1+\varepsilon)$. This gives

$ \displaystyle \Pr[X \geq (1+\varepsilon) c] \leq ((1+\varepsilon)e^{-\varepsilon})^{c/2}$

Using the Taylor series expansion for $ e^x$, one can show the bound $ 1+\varepsilon < e^{\varepsilon – (\varepsilon^2 – \varepsilon^3)/2}$, which simplifies the final upper bound to $ e^{-(\varepsilon^2 – \varepsilon^3) c/4}$.

Doing the same thing for the $ (1-\varepsilon)$ version gives an equivalent bound, and so the total bound is doubled, i.e. $ 2e^{-(\varepsilon^2 – \varepsilon^3) c/4}$.

As we said at the beginning, applying the union bound means we need

$ \displaystyle 2e^{-(\varepsilon^2 – \varepsilon^3) c/4} < \frac{1}{\binom{n}{2}}$

Solving this for $ c$ gives $ c \geq \frac{8 \log m}{\varepsilon^2 – \varepsilon^3}$, as desired.

$ \square$

Projecting in Practice

Let’s write a python program to actually perform the Johnson-Lindenstrauss dimension reduction scheme. This is sometimes called the Johnson-Lindenstrauss transform, or JLT.

First we define a random subspace by sampling an appropriately-sized matrix with normally distributed entries, and a function that performs the projection onto a given subspace (for testing).

import random
import math
import numpy

def randomSubspace(subspaceDimension, ambientDimension):
   return numpy.random.normal(0, 1, size=(subspaceDimension, ambientDimension))

def project(v, subspace):
   subspaceDimension = len(subspace)
   return (1 / math.sqrt(subspaceDimension)) * subspace.dot(v)

We have a function that computes the theoretical bound on the optimal dimension to reduce to.

def theoreticalBound(n, epsilon):
   return math.ceil(8*math.log(n) / (epsilon**2 - epsilon**3))

And then performing the JLT is simply matrix multiplication

def jlt(data, subspaceDimension):
   ambientDimension = len(data[0])
   A = randomSubspace(subspaceDimension, ambientDimension)
   return (1 / math.sqrt(subspaceDimension)) * A.dot(data.T).T

The high-dimensional dataset we’ll use comes from a data mining competition called KDD Cup 2001. The dataset we used deals with drug design, and the goal is to determine whether an organic compound binds to something called thrombin. Thrombin has something to do with blood clotting, and I won’t pretend I’m an expert. The dataset, however, has over a hundred thousand features for about 2,000 compounds. Here are a few approximate target dimensions we can hope for as epsilon varies.

&gt;&gt;&gt; [((1/x),theoreticalBound(n=2000, epsilon=1/x))
       for x in [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]]
[('0.50', 487), ('0.33', 821), ('0.25', 1298), ('0.20', 1901),
 ('0.17', 2627), ('0.14', 3477), ('0.12', 4448), ('0.11', 5542),
 ('0.10', 6757), ('0.07', 14659), ('0.05', 25604)]

Going down from a hundred thousand dimensions to a few thousand is by any measure decreases the size of the dataset by about 95%. We can also observe how the distribution of overall distances varies as the size of the subspace we project to varies.

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The last three frames are for 10, 5, and 2 dimensions respectively. As you can see the histogram starts to beef up around zero. To be honest I was expecting something a bit more dramatic like a uniform-ish distribution. Of course, the distribution of distances is not all that matters. Another concern is the worst case change in distances between any two points before and after the projection. We can see that indeed when we project to the dimension specified in the theorem, that the distances are within the prescribed bounds.

def checkTheorem(oldData, newData, epsilon):
   numBadPoints = 0

   for (x,y), (x2,y2) in zip(combinations(oldData, 2), combinations(newData, 2)):
      oldNorm = numpy.linalg.norm(x2-y2)**2
      newNorm = numpy.linalg.norm(x-y)**2

      if newNorm == 0 or oldNorm == 0:
         continue

      if abs(oldNorm / newNorm - 1) &amp;gt; epsilon:
         numBadPoints += 1

   return numBadPoints

if __name__ == &amp;quot;__main__&amp;quot;
   from data import thrombin
   train, labels = thrombin.load() 

   numPoints = len(train)
   epsilon = 0.2
   subspaceDim = theoreticalBound(numPoints, epsilon)
   ambientDim = len(train[0])
   newData = jlt(train, subspaceDim)

   print(checkTheorem(train, newData, epsilon))

This program prints zero every time I try running it, which is the poor man’s way of saying it works “with high probability.” We can also plot statistics about the number of pairs of data points that are distorted by more than $ \varepsilon$ as the subspace dimension shrinks. We ran this on the following set of subspace dimensions with $ \varepsilon = 0.1$ and took average/standard deviation over twenty trials:

   dims = [1000, 750, 500, 250, 100, 75, 50, 25, 10, 5, 2]

The result is the following chart, whose x-axis is the dimension projected to (so the left hand is the most extreme projection to 2, 5, 10 dimensions), the y-axis is the number of distorted pairs, and the error bars represent a single standard deviation away from the mean.

thrombin-worst-case

This chart provides good news about this dataset because the standard deviations are low. It tells us something that mathematicians often ignore: the predictability of the tradeoff that occurs once you go past the theoretically perfect bound. In this case, the standard deviations tell us that it’s highly predictable. Moreover, since this tradeoff curve measures pairs of points, we might conjecture that the distortion is localized around a single set of points that got significantly “rattled” by the projection. This would be an interesting exercise to explore.

Now all of these charts are really playing with the JLT and confirming the correctness of our code (and hopefully our intuition). The real question is: how well does a machine learning algorithm perform on the original data when compared to the projected data? If the algorithm only “depends” on the pairwise distances between the points, then we should expect nearly identical accuracy in the unprojected and projected versions of the data. To show this we’ll use an easy learning algorithm, the k-nearest-neighbors clustering method. The problem, however, is that there are very few positive examples in this particular dataset. So looking for the majority label of the nearest $ k$ neighbors for any $ k > 2$ unilaterally results in the “all negative” classifier, which has 97% accuracy. This happens before and after projecting.

To compensate for this, we modify k-nearest-neighbors slightly by having the label of a predicted point be 1 if any label among its nearest neighbors is 1. So it’s not a majority vote, but rather a logical OR of the labels of nearby neighbors. Our point in this post is not to solve the problem well, but rather to show how an algorithm (even a not-so-good one) can degrade as one projects the data into smaller and smaller dimensions. Here is the code.

def nearestNeighborsAccuracy(data, labels, k=10):
   from sklearn.neighbors import NearestNeighbors
   trainData, trainLabels, testData, testLabels = randomSplit(data, labels) # cross validation
   model = NearestNeighbors(n_neighbors=k).fit(trainData)
   distances, indices = model.kneighbors(testData)
   predictedLabels = []

   for x in indices:
      xLabels = [trainLabels[i] for i in x[1:]]
      predictedLabel = max(xLabels)
      predictedLabels.append(predictedLabel)

   totalAccuracy = sum(x == y for (x,y) in zip(testLabels, predictedLabels)) / len(testLabels)
   falsePositive = (sum(x == 0 and y == 1 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 0 for x in testLabels))
   falseNegative = (sum(x == 1 and y == 0 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 1 for x in testLabels))

   return totalAccuracy, falsePositive, falseNegative

And here is the accuracy of this modified k-nearest-neighbors algorithm run on the thrombin dataset. The horizontal line represents the accuracy of the produced classifier on the unmodified data set. The x-axis represents the dimension projected to (left-hand side is the lowest), and the y-axis represents the accuracy. The mean accuracy over fifty trials was plotted, with error bars representing one standard deviation. The complete code to reproduce the plot is in the Github repository.

thrombin-knn-accuracy

Likewise, we plot the proportion of false positive and false negatives for the output classifier. Note that a “positive” label made up only about 2% of the total data set. First the false positives

thrombin-knn-fp

Then the false negatives

thrombin-knn-fn

As we can see from these three charts, things don’t really change that much (for this dataset) even when we project down to around 200-300 dimensions. Note that for these parameters the “correct” theoretical choice for dimension was on the order of 5,000 dimensions, so this is a 95% savings from the naive approach, and 99.75% space savings from the original data. Not too shabby.

Notes

The $ \Omega(\log(n))$ worst-case dimension bound is asymptotically tight, though there is some small gap in the literature that depends on $ \varepsilon$. This result is due to Noga Alon, the very last result (Section 9) of this paper. [Update: as djhsu points out in the comments, this gap is now closed thanks to Larsen and Nelson]

We did dimension reduction with respect to preserving the Euclidean distance between points. One might naturally wonder if you can achieve the same dimension reduction with a different metric, say the taxicab metric or a $ p$-norm. In fact, you cannot achieve anything close to logarithmic dimension reduction for the taxicab ($ l_1$) metric. This result is due to Brinkman-Charikar in 2004.

The code we used to compute the JLT is not particularly efficient. There are much more efficient methods. One of them, borrowing its namesake from the Fast Fourier Transform, is called the Fast Johnson-Lindenstrauss Transform. The technique is due to Ailon-Chazelle from 2009, and it involves something called “preconditioning a sparse projection matrix with a randomized Fourier transform.” I don’t know precisely what that means, but it would be neat to dive into that in a future post.

The central focus in this post was whether the JLT preserves distances between points, but one might be curious as to whether the points themselves are well approximated. The answer is an enthusiastic no. If the data were images, the projected points would look nothing like the original images. However, it appears the degradation tradeoff is measurable (by some accounts perhaps linear), and there appears to be some work (also this by the same author) when restricting to sparse vectors (like word-association vectors).

Note that the JLT is not the only method for dimensionality reduction. We previously saw principal component analysis (applied to face recognition), and in the future we will cover a related technique called the Singular Value Decomposition. It is worth noting that another common technique specific to nearest-neighbor is called “locality-sensitive hashing.” Here the goal is to project the points in such a way that “similar” points land very close to each other. Say, if you were to discretize the plane into bins, these bins would form the hash values and you’d want to maximize the probability that two points with the same label land in the same bin. Then you can do things like nearest-neighbors by comparing bins.

Another interesting note, if your data is linearly separable (like the examples we saw in our age-old post on Perceptrons), then you can use the JLT to make finding a linear separator easier. First project the data onto the dimension given in the theorem. With high probability the points will still be linearly separable. And then you can use a perceptron-type algorithm in the smaller dimension. If you want to find out which side a new point is on, you project and compare with the separator in the smaller dimension.

Beyond its interest for practical dimensionality reduction, the JLT has had many other interesting theoretical consequences. More generally, the idea of “randomly projecting” your data onto some small dimensional space has allowed mathematicians to get some of the best-known results on many optimization and learning problems, perhaps the most famous of which is called MAX-CUT; the result is by Goemans-Williamson and it led to a mathematical constant being named after them, $ \alpha_{GW} =.878567 \dots$. If you’re interested in more about the theory, Santosh Vempala wrote a wonderful (and short!) treatise dedicated to this topic.

randomprojectionbook

The Inequality

Math and computer science are full of inequalities, but there is one that shows up more often in my work than any other. Of course, I’m talking about

$ \displaystyle 1+x \leq e^{x}$

This is The Inequality. I’ve been told on many occasions that the entire field of machine learning reduces to The Inequality combined with the Chernoff bound (which is proved using The Inequality).

Why does it show up so often in machine learning? Mostly because in analyzing an algorithm you want to bound the probability that some bad event happens. Bad events are usually phrased similarly to

$ \displaystyle \prod_{i=1}^m (1-p_i)$

And applying The Inequality we can bound this from above by

$ \displaystyle\prod_{i=1}^m (1-p_i) \leq \prod_{i=1}^m e^{-p_i} = e^{-\sum_{i=1}^m p_i}$

The point is that usually $ m$ is the size of your dataset, which you get to choose, and by picking larger $ m$ you make the probability of the bad event vanish exponentially quickly in $ m$. (Here $ p_i$ is unrelated to how I am about to use $ p_i$ as weights).

Of course, The Inequality has much deeper implications than bounds for the efficiency and correctness of machine learning algorithms. To convince you of the depth of this simple statement, let’s see its use in an elegant proof of the arithmetic geometric inequality.

Theorem: (The arithmetic-mean geometric-mean inequality, general version): For all non-negative real numbers $ a_1, \dots, a_n$ and all positive $ p_1, \dots, p_n$ such that $ p_1 + \dots + p_n = 1$, the following inequality holds:

$ \displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq p_1 a_1 + \dots + p_n a_n$

Note that when all the $ p_i = 1/n$ this is the standard AM-GM inequality.

Proof. This proof is due to George Polya (in Hungarian, Pólya György).

We start by modifying The Inequality $ 1+x \leq e^x$ by a shift of variables $ x \mapsto x-1$, so that the inequality now reads $ x \leq e^{x-1}$. We can apply this to each $ a_i$ giving $ a_i \leq e^{a_i – 1}$, and in fact,

$ \displaystyle a_1^{p_1} \cdots a_n^{p_n} \leq e^{\sum_{i=1}^n p_ia_i – p_i} = e^{\left ( \sum_{i=1}^n p_ia_i \right ) – 1}$

Now we have something quite curious: if we call $ A$ the sum $ p_1a_1 + \dots + p_na_n$, the above shows that $ a_1^{p_1} \cdots a_n^{p_n} \leq e^{A-1}$. Moreover, again because $ A \leq e^{A-1}$ that shows that the right hand side of the inequality we’re trying to prove is also bounded by $ e^{A-1}$. So we know that both sides of our desired inequality (and in particular, the max) is bounded from above by $ e^{A-1}$. This seems like a conundrum until we introduce the following beautiful idea: normalize by the thing you think should be the larger of the two sides of the inequality.

Define new variables $ b_i = a_i / A$ and notice that $ \sum_i p_i b_i = 1$ just by unraveling the definition. Call this sum $ B = \sum_i p_i b_i$. Now we know that

$ b_1^{p_1} \cdots b_n^{p_n} = \left ( \frac{a_1}{A} \right )^{p_1} \cdots \left ( \frac{a_n}{A} \right )^{p_n} \leq e^{B – 1} = e^0 = 1$

Now we unpack the pieces, and multiply through by $ A^{p_1}A^{p_2} \cdots A^{p_n} = A$, the result is exactly the AM-GM inequality.

$ \square$

Even deeper, there is only one case when The Inequality is tight, i.e. when $ 1+x = e^x$, and that is $ x=0$. This allows us to use the proof above to come to a full characterization of the case of equality in the proof above. Indeed, the crucial step was that $ (a_i / A) = e^{A-1}$, which is only true when $ (a_i / A) = 1$, i.e. when $ a_i = A$. Spending a few seconds thinking about this gives the characterization of equality if and only if $ a_1 = a_2 = \dots = a_n = A$.

So this is excellent: the arithmetic-geometric inequality is a deep theorem with applications all over mathematics and statistics. Adding another layer of indirection for impressiveness, one can use the AM-GM inequality to prove the Cauchy-Schwarz inequality rather directly. Sadly, the Wikipedia page for the Cauchy-Schwarz inequality hardly does it justice as far as the massive number of applications. For example, many novel techniques in geometry and number theory are proved directly from C-S. More, in fact, than I can hope to learn.

Of course, no article about The Inequality could be complete without a proof of The Inequality.

Theorem: For all $ x \in \mathbb{R}$, $ 1+x \leq e^x$.

Proof. The proof starts by proving a simpler theorem, named after Bernoulli, that $ 1+nx \leq (1+x)^n$ for every $ x [-1, \infty)$ and every $ n \in \mathbb{N}$. This is relatively straightforward by induction. The base case is trivial, and

$ \displaystyle (1+x)^{n+1} = (1+x)(1+x)^n \geq (1+x)(1+nx) = 1 + (n+1)x + nx^2$

And because $ nx^2 \geq 0$, we get Bernoulli’s inequality.

Now for any $ z \geq 0$ we can set $ x = z/n$, and get $ (1+z) = (1+nx) \leq (1+\frac{z}{n})^n$ for every $ n$. Note that Bernoulli’s inequality is preserved for larger and larger $ n$ because $ x \geq 0$. So taking limits of both sides as $ n \to \infty$ we get the definition of $ e^z$ on the right hand side of the inequality. We can prove a symmetrical inequality for $ -x$ when $ x < 0$, and this proves the theorem.

$ \square$

What other insights can we glean about The Inequality? For one, it’s a truncated version of the Taylor series approximation

$ \displaystyle e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$

Indeed, the Taylor remainder theorem tells us that the first two terms approximate $ e^x$ around zero with error depending on some constant times $ e^x x^2 \geq 0$. In other words, $ 1+x$ is a lower bound on $ e^x$ around zero. It is perhaps miraculous that this extends to a lower bound everywhere, until you realize that exponentials grow extremely quickly and lines do not.

One might wonder whether we can improve our approximation with higher order approximations. Indeed we can, but we have to be a bit careful. In particular, $ 1+x+x^2/2 \leq e^x$ is only true for nonnegative $ x$ (because the remainder theorem now applies to $ x^3$, but if we restrict to odd terms we win: $ 1+x+x^2/2 + x^3/6 \leq e^x$ is true for all $ x$.

What is really surprising about The Inequality is that, at least in the applications I work with, we rarely see higher order approximations used. For most applications, The difference between an error term which is quadratic and one which is cubic or quartic is often not worth the extra work in analyzing the result. You get the same theorem: that something vanishes exponentially quickly.

If you’re interested in learning more about the theory of inequalities, I wholeheartedly recommend The Cauchy-Schwarz Master Class. This book is wonderfully written, and chock full of fun exercises. I know because I do exercises from books like this one on planes and trains. It’s my kind of sudoku 🙂