Singular Value Decomposition Part 2: Theorem, Proof, Algorithm

I’m just going to jump right into the definitions and rigor, so if you haven’t read the previous post motivating the singular value decomposition, go back and do that first. This post will be theorem, proof, algorithm, data. The data set we test on is a thousand-story CNN news data set. All of the data, code, and examples used in this post is in a github repository, as usual.

We start with the best-approximating k-dimensional linear subspace.

Definition: Let X = \{ x_1, \dots, x_m \} be a set of m points in \mathbb{R}^n. The best approximating k-dimensional linear subspace of X is the k-dimensional linear subspace V \subset \mathbb{R}^n which minimizes the sum of the squared distances from the points in X to V.

Let me clarify what I mean by minimizing the sum of squared distances. First we’ll start with the simple case: we have a vector x \in X, and a candidate line L (a 1-dimensional subspace) that is the span of a unit vector v. The squared distance from x to the line spanned by v is the squared length of x minus the squared length of the projection of x onto v. Here’s a picture.

vectormax

I’m saying that the pink vector z in the picture is the difference of the black and green vectors x-y, and that the “distance” from x to v is the length of the pink vector. The reason is just the Pythagorean theorem: the vector x is the hypotenuse of a right triangle whose other two sides are the projected vector y and the difference vector z.

Let’s throw down some notation. I’ll call \textup{proj}_v: \mathbb{R}^n \to \mathbb{R}^n the linear map that takes as input a vector x and produces as output the projection of x onto v. In fact we have a brief formula for this when v is a unit vector. If we call x \cdot v the usual dot product, then \textup{proj}_v(x) = (x \cdot v)v. That’s v scaled by the inner product of x and v. In the picture above, since the line L is the span of the vector v, that means that y = \textup{proj}_v(x) and z = x -\textup{proj}_v(x) = x-y.

The dot-product formula is useful for us because it allows us to compute the squared length of the projection by taking a dot product |x \cdot v|^2. So then a formula for the distance of x from the line spanned by the unit vector v is

\displaystyle (\textup{dist}_v(x))^2 = \left ( \sum_{i=1}^n x_i^2 \right ) - |x \cdot v|^2

This formula is just a restatement of the Pythagorean theorem for perpendicular vectors.

\displaystyle \sum_{i} x_i^2 = (\textup{proj}_v(x))^2 + (\textup{dist}_v(x))^2

In particular, the difference vector we originally called z has squared length \textup{dist}_v(x)^2. The vector y, which is perpendicular to z and is also the projection of x onto L, it’s squared length is (\textup{proj}_v(x))^2. And the Pythagorean theorem tells us that summing those two squared lengths gives you the squared length of the hypotenuse x.

If we were trying to find the best approximating 1-dimensional subspace for a set of data points X, then we’d want to minimize the sum of the squared distances for every point x \in X. Namely, we want the v that solves \min_{|v|=1} \sum_{x \in X} (\textup{dist}_v(x))^2.

With some slight algebra we can make our life easier. The short version: minimizing the sum of squared distances is the same thing as maximizing the sum of squared lengths of the projections. The longer version: let’s go back to a single point x and the line spanned by v. The Pythagorean theorem told us that

\displaystyle \sum_{i} x_i^2 = (\textup{proj}_v(x))^2 + (\textup{dist}_v(x))^2

The squared length of x is constant. It’s an input to the algorithm and it doesn’t change through a run of the algorithm. So we get the squared distance by subtracting (\textup{proj}_v(x))^2 from a constant number,

\displaystyle \sum_{i} x_i^2 - (\textup{proj}_v(x))^2 = (\textup{dist}_v(x))^2

which means if we want to minimize the squared distance, we can instead maximize the squared projection. Maximizing the subtracted thing minimizes the whole expression.

It works the same way if you’re summing over all the data points in X. In fact, we can say it much more compactly this way. If the rows of A are your data points, then Av contains as each entry the (signed) dot products x_i \cdot v. And the squared norm of this vector, |Av|^2, is exactly the sum of the squared lengths of the projections of the data onto the line spanned by v. The last thing is that maximizing a square is the same as maximizing its square root, so we can switch freely between saying our objective is to find the unit vector v that maximizes |Av| and that which maximizes |Av|^2.

At this point you should be thinking,

Great, we have written down an optimization problem: \max_{v : |v|=1} |Av|. If we could solve this, we’d have the best 1-dimensional linear approximation to the data contained in the rows of A. But (1) how do we solve that problem? And (2) you promised a k-dimensional approximating subspace. I feel betrayed! Swindled! Bamboozled!

Here’s the fantastic thing. We can solve the 1-dimensional optimization problem efficiently (we’ll do it later in this post), and (2) is answered by the following theorem.

The SVD Theorem: Computing the best k-dimensional subspace reduces to k applications of the one-dimensional problem.

We will prove this after we introduce the terms “singular value” and “singular vector.”

Singular values and vectors

As I just said, we can get the best k-dimensional approximating linear subspace by solving the one-dimensional maximization problem k times. The singular vectors of A are defined recursively as the solutions to these sub-problems. That is, I’ll call v_1 the first singular vector of A, and it is:

\displaystyle v_1 = \arg \max_{v, |v|=1} |Av|

And the corresponding first singular value, denoted \sigma_1(A), is the maximal value of the optimization objective, i.e. |Av_1|. (I will use this term frequently, that |Av| is the “objective” of the optimization problem.) Informally speaking, (\sigma_1(A))^2 represents how much of the data was captured by the first singular vector. Meaning, how close the vectors are to lying on the line spanned by v_1. Larger values imply the approximation is better. In fact, if all the data points lie on a line, then (\sigma_1(A))^2 is the sum of the squared norms of the rows of A.

Now here is where we see the reduction from the k-dimensional case to the 1-dimensional case. To find the best 2-dimensional subspace, you first find the best one-dimensional subspace (spanned by v_1), and then find the best 1-dimensional subspace, but only considering those subspaces that are the spans of unit vectors perpendicular to v_1. The notation for “vectors v perpendicular to v_1” is v \perp v_1. Restating, the second singular vector v _2 is defined as

\displaystyle v_2 = \arg \max_{v \perp v_1, |v| = 1} |Av|

And the SVD theorem implies the subspace spanned by \{ v_1, v_2 \} is the best 2-dimensional linear approximation to the data. Likewise \sigma_2(A) = |Av_2| is the second singular value. Its squared magnitude tells us how much of the data that was not “captured” by v_1 is captured by v_2. Again, if the data lies in a 2-dimensional subspace, then the span of \{ v_1, v_2 \} will be that subspace.

We can continue this process. Recursively define v_k, the k-th singular vector, to be the vector which maximizes |Av|, when v is considered only among the unit vectors which are perpendicular to \textup{span} \{ v_1, \dots, v_{k-1} \}. The corresponding singular value \sigma_k(A) is the value of the optimization problem.

As a side note, because of the way we defined the singular values as the objective values of “nested” optimization problems, the singular values are decreasing, \sigma_1(A) \geq \sigma_2(A) \geq \dots \geq \sigma_n(A) \geq 0. This is obvious: you only pick v_2 in the second optimization problem because you already picked v_1 which gave a bigger singular value, so v_2‘s objective can’t be bigger.

If you keep doing this, one of two things happen. Either you reach v_n and since the domain is n-dimensional there are no remaining vectors to choose from, the v_i are an orthonormal basis of \mathbb{R}^n. This means that the data in A contains a full-rank submatrix. The data does not lie in any smaller-dimensional subspace. This is what you’d expect from real data.

Alternatively, you could get to a stage v_k with k < n and when you try to solve the optimization problem you find that every perpendicular v has Av = 0. In this case, the data actually does lie in a k-dimensional subspace, and the first-through-k-th singular vectors you computed span this subspace.

Let’s do a quick sanity check: how do we know that the singular vectors v_i form a basis? Well formally they only span a basis of the row space of A, i.e. a basis of the subspace spanned by the data contained in the rows of A. But either way the point is that each v_{i+1} spans a new dimension from the previous v_1, \dots, v_i because we’re choosing v_{i+1} to be orthogonal to all the previous v_i. So the answer to our sanity check is “by construction.”

Back to the singular vectors, the discussion from the last post tells us intuitively that the data is probably never in a small subspace.  You never expect the process of finding singular vectors to stop before step n, and if it does you take a step back and ask if something deeper is going on. Instead, in real life you specify how much of the data you want to capture, and you keep computing singular vectors until you’ve passed the threshold. Alternatively, you specify the amount of computing resources you’d like to spend by fixing the number of singular vectors you’ll compute ahead of time, and settle for however good the k-dimensional approximation is.

Before we get into any code or solve the 1-dimensional optimization problem, let’s prove the SVD theorem.

Proof of SVD theorem.

Recall we’re trying to prove that the first k singular vectors provide a linear subspace W which maximizes the squared-sum of the projections of the data onto W. For k=1 this is trivial, because we defined v_1 to be the solution to that optimization problem. The case of k=2 contains all the important features of the general inductive step. Let W be any best-approximating 2-dimensional linear subspace for the rows of A. We’ll show that the subspace spanned by the two singular vectors v_1, v_2 is at least as good (and hence equally good).

Let w_1, w_2 be any orthonormal basis for W and let |Aw_1|^2 + |Aw_2|^2 be the quantity that we’re trying to maximize (and which W maximizes by assumption). Moreover, we can pick the basis vector w_2 to be perpendicular to v_1. To prove this we consider two cases: either v_1 is already perpendicular to W in which case it’s trivial, or else v_1 isn’t perpendicular to W and you can choose w_1 to be \textup{proj}_W(v_1) and choose w_2 to be any unit vector perpendicular to w_1.

Now since v_1 maximizes |Av|, we have |Av_1|^2 \geq |Aw_1|^2. Moreover, since w_2 is perpendicular to v_1, the way we chose v_2 also makes |Av_2|^2 \geq |Aw_2|^2. Hence the objective |Av_1|^2 + |Av_2|^2 \geq |Aw_1|^2 + |Aw_2|^2, as desired.

For the general case of k, the inductive hypothesis tells us that the first k terms of the objective for k+1 singular vectors is maximized, and we just have to pick any vector w_{k+1} that is perpendicular to all v_1, v_2, \dots, v_k, and the rest of the proof is just like the 2-dimensional case.

\square

Now remember that in the last post we started with the definition of the SVD as a decomposition of a matrix A = U\Sigma V^T? And then we said that this is a certain kind of change of basis? Well the singular vectors v_i together form the columns of the matrix V (the rows of V^T), and the corresponding singular values \sigma_i(A) are the diagonal entries of \Sigma. When A is understood we’ll abbreviate the singular value as \sigma_i.

To reiterate with the thoughts from last post, the process of applying A is exactly recovered by the process of first projecting onto the (full-rank space of) singular vectors v_1, \dots, v_k, scaling each coordinate of that projection according to the corresponding singular values, and then applying this U thing we haven’t talked about yet.

So let’s determine what U has to be. The way we picked v_i to make A diagonal gives us an immediate suggestion: use the Av_i as the columns of U. Indeed, define u_i = Av_i, the images of the singular vectors under A. We can swiftly show the u_i form a basis of the image of A. The reason is because if v = \sum_i c_i v_i (using all n of the singular vectors v_i), then by linearity Av = \sum_{i} c_i Av_i = \sum_i c_i u_i. It is also easy to see why the u_i are orthogonal (prove it as an exercise). Let’s further make sure the u_i are unit vectors and redefine them as u_i = \frac{1}{\sigma_i}Av_i

If you put these thoughts together, you can say exactly what A does to any given vector x. Since the v_i form an orthonormal basis, x = \sum_i (x \cdot v_i) v_i, and then applying A gives

\displaystyle \begin{aligned}Ax &= A \left ( \sum_i (x \cdot v_i) v_i \right ) \\  &= \sum_i (x \cdot v_i) A_i v_i \\ &= \sum_i (x \cdot v_i) \sigma_i u_i \end{aligned}

If you’ve been closely reading this blog in the last few months, you’ll recognize a very nice way to write the last line of the above equation. It’s an outer product. So depending on your favorite symbols, you’d write this as either A = \sum_{i} \sigma_i u_i \otimes v_i or A = \sum_i \sigma_i u_i v_i^T. Or, if you like expressing things as matrix factorizations, as A = U\Sigma V^T. All three are describing the same object.

Let’s move on to some code.

A black box example

Before we implement SVD from scratch (an urge that commands me from the depths of my soul!), let’s see a black-box example that uses existing tools. For this we’ll use the numpy library.

Recall our movie-rating matrix from the last post:

movieratings

The code to compute the svd of this matrix is as simple as it gets:

from numpy.linalg import svd

movieRatings = [
    [2, 5, 3],
    [1, 2, 1],
    [4, 1, 1],
    [3, 5, 2],
    [5, 3, 1],
    [4, 5, 5],
    [2, 4, 2],
    [2, 2, 5],
]

U, singularValues, V = svd(movieRatings)

Printing these values out gives

[[-0.39458526  0.23923575 -0.35445911 -0.38062172 -0.29836818 -0.49464816 -0.30703202 -0.29763321]
 [-0.15830232  0.03054913 -0.15299759 -0.45334816  0.31122898  0.23892035 -0.37313346  0.67223457]
 [-0.22155201 -0.52086121  0.39334917 -0.14974792 -0.65963979  0.00488292 -0.00783684  0.25934607]
 [-0.39692635 -0.08649009 -0.41052882  0.74387448 -0.10629499  0.01372565 -0.17959298  0.26333462]
 [-0.34630257 -0.64128825  0.07382859 -0.04494155  0.58000668 -0.25806239  0.00211823 -0.24154726]
 [-0.53347449  0.19168874  0.19949342 -0.03942604  0.00424495  0.68715732 -0.06957561 -0.40033035]
 [-0.31660464  0.06109826 -0.30599517 -0.19611823 -0.01334272  0.01446975  0.85185852  0.19463493]
 [-0.32840223  0.45970413  0.62354764  0.1783041   0.17631186 -0.39879476  0.06065902  0.25771578]]
[ 15.09626916   4.30056855   3.40701739]
[[-0.54184808 -0.67070995 -0.50650649]
 [-0.75152295  0.11680911  0.64928336]
 [ 0.37631623 -0.73246419  0.56734672]]

Now this is a bit weird, because the matrices U, V are the wrong shape! Remember, there are only supposed to be three vectors since the input matrix has rank three. So what gives? This is a distinction that goes by the name “full” versus “reduced” SVD. The idea goes back to our original statement that U \Sigma V^T is a decomposition with U, V^T both orthogonal and square matrices. But in the derivation we did in the last section, the U and V were not square. The singular vectors v_i could potentially stop before even becoming full rank.

In order to get to square matrices, what people sometimes do is take the two bases v_1, \dots, v_k and u_1, \dots, u_k and arbitrarily choose ways to complete them to a full orthonormal basis of their respective vector spaces. In other words, they just make the matrix square by filling it with data for no reason other than that it’s sometimes nice to have a complete basis. We don’t care about this. To be honest, I think the only place this comes in useful is in the desire to be particularly tidy in a mathematical formulation of something.

We can still work with it programmatically. By fudging around a bit with numpy’s shapes to get a diagonal matrix, we can reconstruct the input rating matrix from the factors.

Sigma = np.vstack([
    np.diag(singularValues),
    np.zeros((5, 3)),
])

print(np.round(movieRatings - np.dot(U, np.dot(Sigma, V)), decimals=10))

And the output is, as one expects, a matrix of all zeros. Meaning that we decomposed the movie rating matrix, and built it back up from the factors.

We can actually get the SVD as we defined it (with rectangular matrices) by passing a special flag to numpy’s svd.

U, singularValues, V = svd(movieRatings, full_matrices=False)
print(U)
print(singularValues)
print(V)

Sigma = np.diag(singularValues)
print(np.round(movieRatings - np.dot(U, np.dot(Sigma, V)), decimals=10))

And the result

[[-0.39458526  0.23923575 -0.35445911]
 [-0.15830232  0.03054913 -0.15299759]
 [-0.22155201 -0.52086121  0.39334917]
 [-0.39692635 -0.08649009 -0.41052882]
 [-0.34630257 -0.64128825  0.07382859]
 [-0.53347449  0.19168874  0.19949342]
 [-0.31660464  0.06109826 -0.30599517]
 [-0.32840223  0.45970413  0.62354764]]
[ 15.09626916   4.30056855   3.40701739]
[[-0.54184808 -0.67070995 -0.50650649]
 [-0.75152295  0.11680911  0.64928336]
 [ 0.37631623 -0.73246419  0.56734672]]
[[-0. -0. -0.]
 [-0. -0.  0.]
 [ 0. -0.  0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [ 0. -0. -0.]]

This makes the reconstruction less messy, since we can just multiply everything without having to add extra rows of zeros to \Sigma.

What do the singular vectors and values tell us about the movie rating matrix? (Besides nothing, since it’s a contrived example) You’ll notice that the first singular vector \sigma_1 > 15 while the other two singular values are around 4. This tells us that the first singular vector covers a large part of the structure of the matrix. I.e., a rank-1 matrix would be a pretty good approximation to the whole thing. As an exercise to the reader, write a program that evaluates this claim (how good is “good”?).

The greedy optimization routine

Now we’re going to write SVD from scratch. We’ll first implement the greedy algorithm for the 1-d optimization problem, and then we’ll perform the inductive step to get a full algorithm. Then we’ll run it on the CNN data set.

The method we’ll use to solve the 1-dimensional problem isn’t necessarily industry strength (see this document for a hint of what industry strength looks like), but it is simple conceptually. It’s called the power method. Now that we have our decomposition of theorem, understanding how the power method works is quite easy.

Let’s work in the language of a matrix decomposition A = U \Sigma V^T, more for practice with that language than anything else (using outer products would give us the same result with slightly different computations). Then let’s observe A^T A, wherein we’ll use the fact that U is orthonormal and so U^TU is the identity matrix:

\displaystyle A^TA = (U \Sigma V^T)^T(U \Sigma V^T) = V \Sigma U^TU \Sigma V^T = V \Sigma^2 V^T

So we can completely eliminate U from the discussion, and look at just V \Sigma^2 V^T. And what’s nice about this matrix is that we can compute its eigenvectors, and eigenvectors turn out to be exactly the singular vectors. The corresponding eigenvalues are the squared singular values. This should be clear from the above derivation. If you apply (V \Sigma^2 V^T) to any v_i, the only parts of the product that aren’t zero are the ones involving v_i with itself, and the scalar \sigma_i^2 factors in smoothly. It’s dead simple to check.

Theorem: Let x be a random unit vector and let B = A^TA = V \Sigma^2 V^T. Then with high probability, \lim_{s \to \infty} B^s x is in the span of the first singular vector v_1. If we normalize B^s x to a unit vector at each s, then furthermore the limit is v_1.

Proof. Start with a random unit vector x, and write it in terms of the singular vectors x = \sum_i c_i v_i. That means Bx = \sum_i c_i \sigma_i^2 v_i. If you recursively apply this logic, you get B^s x = \sum_i c_i \sigma_i^{2s} v_i. In particular, the dot product of (B^s x) with any v_j is c_i \sigma_j^{2s}.

What this means is that so long as the first singular value \sigma_1 is sufficiently larger than the second one \sigma_2, and in turn all the other singular values, the part of B^s x  corresponding to v_1 will be much larger than the rest. Recall that if you expand a vector in terms of an orthonormal basis, in this case B^s x expanded in the v_i, the coefficient of B^s x on v_j is exactly the dot product. So to say that B^sx converges to being in the span of v_1 is the same as saying that the ratio of these coefficients, |(B^s x \cdot v_1)| / |(B^s x \cdot v_j)| \to \infty for any j. In other words, the coefficient corresponding to the first singular vector dominates all of the others. And so if we normalize, the coefficient of B^s x corresponding to v_1 tends to 1, while the rest tend to zero.

Indeed, this ratio is just (\sigma_1 / \sigma_j)^{2s} and the base of this exponential is bigger than 1.

\square

If you want to be a little more precise and find bounds on the number of iterations required to converge, you can. The worry is that your random starting vector is “too close” to one of the smaller singular vectors v_j, so that if the ratio of \sigma_1 / \sigma_j is small, then the “pull” of v_1 won’t outweigh the pull of v_j fast enough. Choosing a random unit vector allows you to ensure with high probability that this doesn’t happen. And conditioned on it not happening (or measuring “how far the event is from happening” precisely), you can compute a precise number of iterations required to converge. The last two pages of these lecture notes have all the details.

We won’t compute a precise number of iterations. Instead we’ll just compute until the angle between B^{s+1}x and B^s x is very small. Here’s the algorithm

import numpy as np
from numpy.linalg import norm

from random import normalvariate
from math import sqrt

def randomUnitVector(n):
    unnormalized = [normalvariate(0, 1) for _ in range(n)]
    theNorm = sqrt(sum(x * x for x in unnormalized))
    return [x / theNorm for x in unnormalized]

def svd_1d(A, epsilon=1e-10):
    ''' The one-dimensional SVD '''

    n, m = A.shape
    x = randomUnitVector(m)
    lastV = None
    currentV = x
    B = np.dot(A.T, A)

    iterations = 0
    while True:
        iterations += 1
        lastV = currentV
        currentV = np.dot(B, lastV)
        currentV = currentV / norm(currentV)

        if abs(np.dot(currentV, lastV)) > 1 - epsilon:
            print("converged in {} iterations!".format(iterations))
            return currentV

We start with a random unit vector x, and then loop computing x_{t+1} = Bx_t, renormalizing at each step. The condition for stopping is that the magnitude of the dot product between x_t and x_{t+1} (since they’re unit vectors, this is the cosine of the angle between them) is very close to 1.

And using it on our movie ratings example:

if __name__ == "__main__":
    movieRatings = np.array([
        [2, 5, 3],
        [1, 2, 1],
        [4, 1, 1],
        [3, 5, 2],
        [5, 3, 1],
        [4, 5, 5],
        [2, 4, 2],
        [2, 2, 5],
    ], dtype='float64')

    print(svd_1d(movieRatings))

With the result

converged in 6 iterations!
[-0.54184805 -0.67070993 -0.50650655]

Note that the sign of the vector may be different from numpy’s output because we start with a random vector to begin with.

The recursive step, getting from v_1 to the entire SVD, is equally straightforward. Say you start with the matrix A and you compute v_1. You can use v_1 to compute u_1 and \sigma_1(A). Then you want to ensure you’re ignoring all vectors in the span of v_1 for your next greedy optimization, and to do this you can simply subtract the rank 1 component of A corresponding to v_1. I.e., set A' = A - \sigma_1(A) u_1 v_1^T. Then it’s easy to see that \sigma_1(A') = \sigma_2(A) and basically all the singular vectors shift indices by 1 when going from A to A'. Then you repeat.

If that’s not clear enough, here’s the code.

def svd(A, epsilon=1e-10):
    n, m = A.shape
    svdSoFar = []

    for i in range(m):
        matrixFor1D = A.copy()

        for singularValue, u, v in svdSoFar[:i]:
            matrixFor1D -= singularValue * np.outer(u, v)

        v = svd_1d(matrixFor1D, epsilon=epsilon)  # next singular vector
        u_unnormalized = np.dot(A, v)
        sigma = norm(u_unnormalized)  # next singular value
        u = u_unnormalized / sigma

        svdSoFar.append((sigma, u, v))

    # transform it into matrices of the right shape
    singularValues, us, vs = [np.array(x) for x in zip(*svdSoFar)]

    return singularValues, us.T, vs

And we can run this on our movie rating matrix to get the following

>>> theSVD = svd(movieRatings)
>>> theSVD[0]
array([ 15.09626916,   4.30056855,   3.40701739])
>>> theSVD[1]
array([[ 0.39458528, -0.23923093,  0.35446407],
       [ 0.15830233, -0.03054705,  0.15299815],
       [ 0.221552  ,  0.52085578, -0.39336072],
       [ 0.39692636,  0.08649568,  0.41052666],
       [ 0.34630257,  0.64128719, -0.07384286],
       [ 0.53347448, -0.19169154, -0.19948959],
       [ 0.31660465, -0.0610941 ,  0.30599629],
       [ 0.32840221, -0.45971273, -0.62353781]])
>>> theSVD[2]
array([[ 0.54184805,  0.67071006,  0.50650638],
       [ 0.75151641, -0.11679644, -0.64929321],
       [-0.37632934,  0.73246611, -0.56733554]])

Checking this against our numpy output shows it’s within a reasonable level of precision (considering the power method took on the order of ten iterations!)

>>> np.round(np.abs(npSVD[0]) - np.abs(theSVD[1]), decimals=5)
array([[ -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [  0.00000000e+00,  -1.00000000e-05,   1.00000000e-05],
       [  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   1.00000000e-05],
       [ -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [ -0.00000000e+00,   1.00000000e-05,  -1.00000000e-05]])
>>> np.round(np.abs(npSVD[2]) - np.abs(theSVD[2]), decimals=5)
array([[  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [ -1.00000000e-05,  -1.00000000e-05,   1.00000000e-05],
       [  1.00000000e-05,   0.00000000e+00,  -1.00000000e-05]])
>>> np.round(np.abs(npSVD[1]) - np.abs(theSVD[0]), decimals=5)
array([ 0.,  0., -0.])

So there we have it. We added an extra little bit to the svd function, an argument k which stops computing the svd after it reaches rank k.

CNN stories

One interesting use of the SVD is in topic modeling. Topic modeling is the process of taking a bunch of documents (news stories, or emails, or movie scripts, whatever) and grouping them by topic, where the algorithm gets to choose what counts as a “topic.” Topic modeling is just the name that natural language processing folks use instead of clustering.

The SVD can help one model topics as follows. First you construct a matrix A called a document-term matrix whose rows correspond to words in some fixed dictionary and whose columns correspond to documents. The (i,j) entry of A contains the number of times word i shows up in document j. Or, more precisely, some quantity derived from that count, like a normalized count. See this table on wikipedia for a list of options related to that. We’ll just pick one arbitrarily for use in this post.

The point isn’t how we normalize the data, but what the SVD of A = U \Sigma V^T means in this context. Recall that the domain of A, as a linear map, is a vector space whose dimension is the number of stories. We think of the vectors in this space as documents, or rather as an “embedding” of the abstract concept of a document using the counts of how often each word shows up in a document as a proxy for the semantic meaning of the document. Likewise, the codomain is the space of all words, and each word is embedded by which documents it occurs in. If we compare this to the movie rating example, it’s the same thing: a movie is the vector of ratings it receives from people, and a person is the vector of ratings of various movies.

Say you take a rank 3 approximation to A. Then you get three singular vectors v_1, v_2, v_3 which form a basis for a subspace of words, i.e., the “idealized” words. These idealized words are your topics, and you can compute where a “new word” falls by looking at which documents it appears in (writing it as a vector in the domain) and saying its “topic” is the closest of the v_1, v_2, v_3. The same process applies to new documents. You can use this to cluster existing documents as well.

The dataset we’ll use for this post is a relatively small corpus of a thousand CNN stories picked from 2012. Here’s an excerpt from one of them

$ cat data/cnn-stories/story479.txt
3 things to watch on Super Tuesday
Here are three things to watch for: Romney's big day. He's been the off-and-on frontrunner throughout the race, but a big Super Tuesday could begin an end game toward a sometimes hesitant base coalescing behind former Massachusetts Gov. Mitt Romney. Romney should win his home state of Massachusetts, neighboring Vermont and Virginia, ...

So let’s first build this document-term matrix, with the normalized values, and then we’ll compute it’s SVD and see what the topics look like.

Step 1 is cleaning the data. We used a bunch of routines from the nltk library that boils down to this loop:

    for filename, documentText in documentDict.items():
        tokens = tokenize(documentText)
        tagged_tokens = pos_tag(tokens)
        wnl = WordNetLemmatizer()
        stemmedTokens = [wnl.lemmatize(word, wordnetPos(tag)).lower()
                         for word, tag in tagged_tokens]

This turns the Super Tuesday story into a list of words (with repetition):

["thing", "watch", "three", "thing", "watch", "big", ... ]

If you’ll notice the name Romney doesn’t show up in the list of words. I’m only keeping the words that show up in the top 100,000 most common English words, and then lemmatizing all of the words to their roots. It’s not a perfect data cleaning job, but it’s simple and good enough for our purposes.

Now we can create the document term matrix.

def makeDocumentTermMatrix(data):
    words = allWords(data)  # get the set of all unique words

    wordToIndex = dict((word, i) for i, word in enumerate(words))
    indexToWord = dict(enumerate(words))
    indexToDocument = dict(enumerate(data))

    matrix = np.zeros((len(words), len(data)))
    for docID, document in enumerate(data):
        docWords = Counter(document['words'])
        for word, count in docWords.items():
            matrix[wordToIndex[word], docID] = count

    return matrix, (indexToWord, indexToDocument)

This creates a matrix with the raw integer counts. But what we need is a normalized count. The idea is that a common word like “thing” shows up disproportionately more often than “election,” and we don’t want raw magnitude of a word count to outweigh its semantic contribution to the classification. This is the applied math part of the algorithm design. So what we’ll do (and this technique together with SVD is called latent semantic indexing) is normalize each entry so that it measures both the frequency of a term in a document and the relative frequency of a term compared to the global frequency of that term. There are many ways to do this, and we’ll just pick one. See the github repository if you’re interested.

So now lets compute a rank 10 decomposition and see how to cluster the results.

    data = load()
    matrix, (indexToWord, indexToDocument) = makeDocumentTermMatrix(data)
    matrix = normalize(matrix)
    sigma, U, V = svd(matrix, k=10)

This uses our svd, not numpy’s. Though numpy’s routine is much faster, it’s fun to see things work with code written from scratch. The result is too large to display here, but I can report the singular values.

>>> sigma
array([ 42.85249098,  21.85641975,  19.15989197,  16.2403354 ,
        15.40456779,  14.3172779 ,  13.47860033,  13.23795002,
        12.98866537,  12.51307445])

Now we take our original inputs and project them onto the subspace spanned by the singular vectors. This is the part that represents each word (resp., document) in terms of the idealized words (resp., documents), the singular vectors. Then we can apply a simple k-means clustering algorithm to the result, and observe the resulting clusters as documents.

    projectedDocuments = np.dot(matrix.T, U)
    projectedWords = np.dot(matrix, V.T)

    documentCenters, documentClustering = cluster(projectedDocuments)
    wordCenters, wordClustering = cluster(projectedWords)

    wordClusters = [
        [indexToWord[i] for (i, x) in enumerate(wordClustering) if x == j]
        for j in range(len(set(wordClustering)))
    ]

    documentClusters = [
        [indexToDocument[i]['text']
         for (i, x) in enumerate(documentClustering) if x == j]
        for j in range(len(set(documentClustering)))
    ]

And now we can inspect individual clusters. Right off the bat we can tell the clusters aren’t quite right simply by looking at the sizes of each cluster.

>>> Counter(wordClustering)
Counter({1: 9689, 2: 1051, 8: 680, 5: 557, 3: 321, 7: 225, 4: 174, 6: 124, 9: 123})
>>> Counter(documentClustering)
Counter({7: 407, 6: 109, 0: 102, 5: 87, 9: 85, 2: 65, 8: 55, 4: 47, 3: 23, 1: 15})

What looks wrong to me is the size of the largest word cluster. If we could group words by topic, then this is saying there’s a topic with over nine thousand words associated with it! Inspecting it even closer, it includes words like “vegan,” “skunk,” and “pope.” On the other hand, some word clusters are spot on. Examine, for example, the fifth cluster which includes words very clearly associated with crime stories.

>>> wordClusters[4]
['account', 'accuse', 'act', 'affiliate', 'allegation', 'allege', 'altercation', 'anything', 'apartment', 'arrest', 'arrive', 'assault', 'attorney', 'authority', 'bag', 'black', 'blood', 'boy', 'brother', 'bullet', 'candy', 'car', 'carry', 'case', 'charge', 'chief', 'child', 'claim', 'client', 'commit', 'community', 'contact', 'convenience', 'court', 'crime', 'criminal', 'cry', 'dead', 'deadly', 'death', 'defense', 'department', 'describe', 'detail', 'determine', 'dispatcher', 'district', 'document', 'enforcement', 'evidence', 'extremely', 'family', 'father', 'fear', 'fiancee', 'file', 'five', 'foot', 'friend', 'front', 'gate', 'girl', 'girlfriend', 'grand', 'ground', 'guilty', 'gun', 'gunman', 'gunshot', 'hand', 'happen', 'harm', 'head', 'hear', 'heard', 'hoodie', 'hour', 'house', 'identify', 'immediately', 'incident', 'information', 'injury', 'investigate', 'investigation', 'investigator', 'involve', 'judge', 'jury', 'justice', 'kid', 'killing', 'lawyer', 'legal', 'letter', 'life', 'local', 'man', 'men', 'mile', 'morning', 'mother', 'murder', 'near', 'nearby', 'neighbor', 'newspaper', 'night', 'nothing', 'office', 'officer', 'online', 'outside', 'parent', 'person', 'phone', 'police', 'post', 'prison', 'profile', 'prosecute', 'prosecution', 'prosecutor', 'pull', 'racial', 'racist', 'release', 'responsible', 'return', 'review', 'role', 'saw', 'scene', 'school', 'scream', 'search', 'sentence', 'serve', 'several', 'shoot', 'shooter', 'shooting', 'shot', 'slur', 'someone', 'son', 'sound', 'spark', 'speak', 'staff', 'stand', 'store', 'story', 'student', 'surveillance', 'suspect', 'suspicious', 'tape', 'teacher', 'teen', 'teenager', 'told', 'tragedy', 'trial', 'vehicle', 'victim', 'video', 'walk', 'watch', 'wear', 'whether', 'white', 'witness', 'young']

As sad as it makes me to see that ‘black’ and ‘slur’ and ‘racial’ appear in this category, it’s a reminder that naively using the output of a machine learning algorithm can perpetuate racism.

Here’s another interesting cluster corresponding to economic words:

>>> wordClusters[6]
['agreement', 'aide', 'analyst', 'approval', 'approve', 'austerity', 'average', 'bailout', 'beneficiary', 'benefit', 'bill', 'billion', 'break', 'broadband', 'budget', 'class', 'combine', 'committee', 'compromise', 'conference', 'congressional', 'contribution', 'core', 'cost', 'currently', 'cut', 'deal', 'debt', 'defender', 'deficit', 'doc', 'drop', 'economic', 'economy', 'employee', 'employer', 'erode', 'eurozone', 'expire', 'extend', 'extension', 'fee', 'finance', 'fiscal', 'fix', 'fully', 'fund', 'funding', 'game', 'generally', 'gleefully', 'growth', 'hamper', 'highlight', 'hike', 'hire', 'holiday', 'increase', 'indifferent', 'insistence', 'insurance', 'job', 'juncture', 'latter', 'legislation', 'loser', 'low', 'lower', 'majority', 'maximum', 'measure', 'middle', 'negotiation', 'offset', 'oppose', 'package', 'pass', 'patient', 'pay', 'payment', 'payroll', 'pension', 'plight', 'portray', 'priority', 'proposal', 'provision', 'rate', 'recession', 'recovery', 'reduce', 'reduction', 'reluctance', 'repercussion', 'rest', 'revenue', 'rich', 'roughly', 'sale', 'saving', 'scientist', 'separate', 'sharp', 'showdown', 'sign', 'specialist', 'spectrum', 'spending', 'strength', 'tax', 'tea', 'tentative', 'term', 'test', 'top', 'trillion', 'turnaround', 'unemployed', 'unemployment', 'union', 'wage', 'welfare', 'worker', 'worth']

One can also inspect the stories, though the clusters are harder to print out here. Interestingly the first cluster of documents are stories exclusively about Trayvon Martin. The second cluster is mostly international military conflicts. The third cluster also appears to be about international conflict, but what distinguishes it from the first cluster is that every story in the second cluster discusses Syria.

>>> len([x for x in documentClusters[1] if 'Syria' in x]) / len(documentClusters[1])
0.05555555555555555
>>> len([x for x in documentClusters[2] if 'Syria' in x]) / len(documentClusters[2])
1.0

Anyway, you can explore the data more at your leisure (and tinker with the parameters to improve it!).

Issues with the power method

Though I mentioned that the power method isn’t an industry strength algorithm I didn’t say why. Let’s revisit that before we finish. The problem is that the convergence rate of even the 1-dimensional problem depends on the ratio of the first and second singular values, \sigma_1 / \sigma_2. If that ratio is very close to 1, then the convergence will take a long time and need many many matrix-vector multiplications.

One way to alleviate that is to do the trick where, to compute a large power of a matrix, you iteratively square B. But that requires computing a matrix square (instead of a bunch of matrix-vector products), and that requires a lot of time and memory if the matrix isn’t sparse. When the matrix is sparse, you can actually do the power method quite quickly, from what I’ve heard and read.

But nevertheless, the industry standard methods involve computing a particular matrix decomposition that is not only faster than the power method, but also numerically stable. That means that the algorithm’s runtime and accuracy doesn’t depend on slight changes in the entries of the input matrix. Indeed, you can have two matrices where \sigma_1 / \sigma_2 is very close to 1, but changing a single entry will make that ratio much larger. The power method depends on this, so it’s not numerically stable. But the industry standard technique is not. This technique involves something called Householder reflections. So while the power method was great for a proof of concept, there’s much more work to do if you want true SVD power.

Until next time!

Big Dimensions, and What You Can Do About It

Data is abundant, data is big, and big is a problem. Let me start with an example. Let’s say you have a list of movie titles and you want to learn their genre: romance, action, drama, etc. And maybe in this scenario IMDB doesn’t exist so you can’t scrape the answer. Well, the title alone is almost never enough information. One nice way to get more data is to do the following:

  1. Pick a large dictionary of words, say the most common 100,000 non stop-words in the English language.
  2. Crawl the web looking for documents that include the title of a film.
  3. For each film, record the counts of all other words appearing in those documents.
  4. Maybe remove instances of “movie” or “film,” etc.

After this process you have a length-100,000 vector of integers associated with each movie title. IMDB’s database has around 1.5 million listed movies, and if we have a 32-bit integer per vector entry, that’s 600 GB of data to get every movie.

One way to try to find genres is to cluster this (unlabeled) dataset of vectors, and then manually inspect the clusters and assign genres. With a really fast computer we could simply run an existing clustering algorithm on this dataset and be done. Of course, clustering 600 GB of data takes a long time, but there’s another problem. The geometric intuition that we use to design clustering algorithms degrades as the length of the vectors in the dataset grows. As a result, our algorithms perform poorly. This phenomenon is called the “curse of dimensionality” (“curse” isn’t a technical term), and we’ll return to the mathematical curiosities shortly.

A possible workaround is to try to come up with faster algorithms or be more patient. But a more interesting mathematical question is the following:

Is it possible to condense high-dimensional data into smaller dimensions and retain the important geometric properties of the data?

This goal is called dimension reduction. Indeed, all of the chatter on the internet is bound to encode redundant information, so for our movie title vectors it seems the answer should be “yes.” But the questions remain, how does one find a low-dimensional condensification? (Condensification isn’t a word, the right word is embedding, but embedding is overloaded so we’ll wait until we define it) And what mathematical guarantees can you prove about the resulting condensed data? After all, it stands to reason that different techniques preserve different aspects of the data. Only math will tell.

In this post we’ll explore this so-called “curse” of dimensionality, explain the formality of why it’s seen as a curse, and implement a wonderfully simple technique called “the random projection method” which preserves pairwise distances between points after the reduction. As usual, and all the code, data, and tests used in the making of this post are on Github.

Some curious issues, and the “curse”

We start by exploring the curse of dimensionality with experiments on synthetic data.

In two dimensions, take a circle centered at the origin with radius 1 and its bounding square.

circle.png

The circle fills up most of the area in the square, in fact it takes up exactly \pi out of 4 which is about 78%. In three dimensions we have a sphere and a cube, and the ratio of sphere volume to cube volume is a bit smaller, 4 \pi /3 out of a total of 8, which is just over 52%. What about in a thousand dimensions? Let’s try by simulation.

import random

def randUnitCube(n):
   return [(random.random() - 0.5)*2 for _ in range(n)]

def sphereCubeRatio(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   return sum(1 for x in randomSample if sum(a**2 for a in x) &lt;= 1) / numSamples 

The result is as we computed for small dimension,

 &gt;&gt;&gt; sphereCubeRatio(2,10000)
0.7857
&gt;&gt;&gt; sphereCubeRatio(3,10000)
0.5196

And much smaller for larger dimension

&gt;&gt;&gt; sphereCubeRatio(20,100000) # 100k samples
0.0
&gt;&gt;&gt; sphereCubeRatio(20,1000000) # 1M samples
0.0
&gt;&gt;&gt; sphereCubeRatio(20,2000000)
5e-07

Forget a thousand dimensions, for even twenty dimensions, a million samples wasn’t enough to register a single random point inside the unit sphere. This illustrates one concern, that when we’re sampling random points in the d-dimensional unit cube, we need at least 2^d samples to ensure we’re getting a even distribution from the whole space. In high dimensions, this face basically rules out a naive Monte Carlo approximation, where you sample random points to estimate the probability of an event too complicated to sample from directly. A machine learning viewpoint of the same problem is that in dimension d, if your machine learning algorithm requires a representative sample of the input space in order to make a useful inference, then you require 2^d samples to learn.

Luckily, we can answer our original question because there is a known formula for the volume of a sphere in any dimension. Rather than give the closed form formula, which involves the gamma function and is incredibly hard to parse, we’ll state the recursive form. Call V_i the volume of the unit sphere in dimension i. Then V_0 = 1 by convention, V_1 = 2 (it’s an interval), and V_n = \frac{2 \pi V_{n-2}}{n}. If you unpack this recursion you can see that the numerator looks like (2\pi)^{n/2} and the denominator looks like a factorial, except it skips every other number. So an even dimension would look like 2 \cdot 4 \cdot \dots \cdot n, and this grows larger than a fixed exponential. So in fact the total volume of the sphere vanishes as the dimension grows! (In addition to the ratio vanishing!)

def sphereVolume(n):
   values = [0] * (n+1)
   for i in range(n+1):
      if i == 0:
         values[i] = 1
      elif i == 1:
         values[i] = 2
      else:
         values[i] = 2*math.pi / i * values[i-2]

   return values[-1]

This should be counterintuitive. I think most people would guess, when asked about how the volume of the unit sphere changes as the dimension grows, that it stays the same or gets bigger.  But at a hundred dimensions, the volume is already getting too small to fit in a float.

&gt;&gt;&gt; sphereVolume(20)
0.025806891390014047
&gt;&gt;&gt; sphereVolume(100)
2.3682021018828297e-40
&gt;&gt;&gt; sphereVolume(1000)
0.0

The scary thing is not just that this value drops, but that it drops exponentially quickly. A consequence is that, if you’re trying to cluster data points by looking at points within a fixed distance r of one point, you have to carefully measure how big r needs to be to cover the same proportional volume as it would in low dimension.

Here’s a related issue. Say I take a bunch of points generated uniformly at random in the unit cube.

from itertools import combinations

def distancesRandomPoints(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   pairwiseDistances = [dist(x,y) for (x,y) in combinations(randomSample, 2)]
   return pairwiseDistances

In two dimensions, the histogram of distances between points looks like this

2d-distances.png

However, as the dimension grows the distribution of distances changes. It evolves like the following animation, in which each frame is an increase in dimension from 2 to 100.

distances-animation.gif

The shape of the distribution doesn’t appear to be changing all that much after the first few frames, but the center of the distribution tends to infinity (in fact, it grows like \sqrt{n}). The variance also appears to stay constant. This chart also becomes more variable as the dimension grows, again because we should be sampling exponentially many more points as the dimension grows (but we don’t). In other words, as the dimension grows the average distance grows and the tightness of the distribution stays the same. So at a thousand dimensions the average distance is about 26, tightly concentrated between 24 and 28. When the average is a thousand, the distribution is tight between 998 and 1002. If one were to normalize this data, it would appear that random points are all becoming equidistant from each other.

So in addition to the issues of runtime and sampling, the geometry of high-dimensional space looks different from what we expect. To get a better understanding of “big data,” we have to update our intuition from low-dimensional geometry with analysis and mathematical theorems that are much harder to visualize.

The Johnson-Lindenstrauss Lemma

Now we turn to proving dimension reduction is possible. There are a few methods one might first think of, such as look for suitable subsets of coordinates, or sums of subsets, but these would all appear to take a long time or they simply don’t work.

Instead, the key technique is to take a random linear subspace of a certain dimension, and project every data point onto that subspace. No searching required. The fact that this works is called the Johnson-Lindenstrauss Lemma. To set up some notation, we’ll call d(v,w) the usual distance between two points.

Lemma [Johnson-Lindenstrauss (1984)]: Given a set X of n points in \mathbb{R}^d, project the points in X to a randomly chosen subspace of dimension c. Call the projection \rho. For any \varepsilon > 0, if c is at least \Omega(\log(n) / \varepsilon^2), then with probability at least 1/2 the distances between points in X are preserved up to a factor of (1+\varepsilon). That is, with good probability every pair v,w \in X will satisfy

\displaystyle \| v-w \|^2 (1-\varepsilon) \leq \| \rho(v) - \rho(w) \|^2 \leq \| v-w \|^2 (1+\varepsilon)

Before we do the proof, which is quite short, it’s important to point out that the target dimension c does not depend on the original dimension! It only depends on the number of points in the dataset, and logarithmically so. That makes this lemma seem like pure magic, that you can take data in an arbitrarily high dimension and put it in a much smaller dimension.

On the other hand, if you include all of the hidden constants in the bound on the dimension, it’s not that impressive. If your data have a million dimensions and you want to preserve the distances up to 1% (\varepsilon = 0.01), the bound is bigger than a million! If you decrease the preservation \varepsilon to 10% (0.1), then you get down to about 12,000 dimensions, which is more reasonable. At 45% the bound drops to around 1,000 dimensions. Here’s a plot showing the theoretical bound on c in terms of \varepsilon for n fixed to a million.

boundplot

 

But keep in mind, this is just a theoretical bound for potentially misbehaving data. Later in this post we’ll see if the practical dimension can be reduced more than the theory allows. As we’ll see, an algorithm run on the projected data is still effective even if the projection goes well beyond the theoretical bound. Because the theorem is known to be tight in the worst case (see the notes at the end) this speaks more to the robustness of the typical algorithm than to the robustness of the projection method.

A second important note is that this technique does not necessarily avoid all the problems with the curse of dimensionality. We mentioned above that one potential problem is that “random points” are roughly equidistant in high dimensions. Johnson-Lindenstrauss actually preserves this problem because it preserves distances! As a consequence, you won’t see strictly better algorithm performance if you project (which we suggested is possible in the beginning of this post). But you will alleviate slow runtimes if the runtime depends exponentially on the dimension. Indeed, if you replace the dimension d with the logarithm of the number of points \log n, then 2^d becomes linear in n, and 2^{O(d)} becomes polynomial.

Proof of the J-L lemma

Let’s prove the lemma.

Proof. To start we make note that one can sample from the uniform distribution on dimension-c linear subspaces of \mathbb{R}^d by choosing the entries of a c \times d matrix A independently from a normal distribution with mean 0 and variance 1. Then, to project a vector x by this matrix (call the projection \rho), we can compute

\displaystyle \rho(x) = \frac{1}{\sqrt{c}}A x

Now fix \varepsilon > 0 and fix two points in the dataset x,y. We want an upper bound on the probability that the following is false

\displaystyle \| x-y \|^2 (1-\varepsilon) \leq \| \rho(x) - \rho(y) \|^2 \leq \| x-y \|^2 (1+\varepsilon)

Since that expression is a pain to work with, let’s rearrange it by calling u = x-y, and rearranging (using the linearity of the projection) to get the equivalent statement.

\left | \| \rho(u) \|^2 - \|u \|^2 \right | \leq \varepsilon \| u \|^2

And so we want a bound on the probability that this event does not occur, meaning the inequality switches directions.

Once we get such a bound (it will depend on c and \varepsilon) we need to ensure that this bound is true for every pair of points. The union bound allows us to do this, but it also requires that the probability of the bad thing happening tends to zero faster than 1/\binom{n}{2}. That’s where the \log(n) will come into the bound as stated in the theorem.

Continuing with our use of u for notation, define X to be the random variable \frac{c}{\| u \|^2} \| \rho(u) \|^2. By expanding the notation and using the linearity of expectation, you can show that the expected value of X is c, meaning that in expectation, distances are preserved. We are on the right track, and just need to show that the distribution of X, and thus the possible deviations in distances, is tightly concentrated around c. In full rigor, we will show

\displaystyle \Pr [X \geq (1+\varepsilon) c] < e^{-(\varepsilon^2 - \varepsilon^3) \frac{c}{4}}

Let A_i denote the i-th column of A. Define by X_i the quantity \langle A_i, u \rangle / \| u \|. This is a weighted average of the entries of A_i by the entries of u. But since we chose the entries of A from the normal distribution, and since a weighted average of normally distributed random variables is also normally distributed (has the same distribution), X_i is a N(0,1) random variable. Moreover, each column is independent. This allows us to decompose X as

X = \frac{k}{\| u \|^2} \| \rho(u) \|^2 = \frac{\| Au \|^2}{\| u \|^2}

Expanding further,

X = \sum_{i=1}^c \frac{\| A_i u \|^2}{\|u\|^2} = \sum_{i=1}^c X_i^2

Now the event X \leq (1+\varepsilon) c can be expressed in terms of the nonegative variable e^{\lambda X}, where 0 < \lambda < 1/2 is parameter, to get

\displaystyle \Pr[X \geq (1+\varepsilon) c] = \Pr[e^{\lambda X} \geq e^{(1+\varepsilon)c \lambda}]

This will become useful because the sum X = \sum_i X_i^2 will split into a product momentarily. First we apply Markov’s inequality, which says that for any nonnegative random variable Y, \Pr[Y \geq t] \leq \mathbb{E}[Y] / t. This lets us write

\displaystyle \Pr[e^{\lambda X} \geq e^{(1+\varepsilon) c \lambda}] \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{(1+\varepsilon) c \lambda}}

Now we can split up the exponent \lambda X into \sum_{i=1}^c \lambda X_i^2, and using the i.i.d.-ness of the X_i^2 we can rewrite the RHS of the inequality as

\left ( \frac{\mathbb{E}[e^{\lambda X_1^2}]}{e^{(1+\varepsilon)\lambda}} \right )^c

A similar statement using -\lambda is true for the (1-\varepsilon) part, namely that

\displaystyle \Pr[X \leq (1-\varepsilon)c] \leq \left ( \frac{\mathbb{E}[e^{-\lambda X_1^2}]}{e^{-(1-\varepsilon)\lambda}} \right )^c

The last thing that’s needed is to bound \mathbb{E}[e^{\lambda X_i^2}], but since X_i^2 \sim N(0,1), we can use the known density function for a normal distribution, and integrate to get the exact value \mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}}. Including this in the bound gives us a closed-form bound in terms of \lambda, c, \varepsilon. Using standard calculus the optimal \lambda \in (0,1/2) is \lambda = \varepsilon / 2(1+\varepsilon). This gives

\displaystyle \Pr[X \geq (1+\varepsilon) c] \leq ((1+\varepsilon)e^{-\varepsilon})^{c/2}

Using the Taylor series expansion for e^x, one can show the bound 1+\varepsilon < e^{\varepsilon - (\varepsilon^2 - \varepsilon^3)/2}, which simplifies the final upper bound to e^{-(\varepsilon^2 - \varepsilon^3) c/4}.

Doing the same thing for the (1-\varepsilon) version gives an equivalent bound, and so the total bound is doubled, i.e. 2e^{-(\varepsilon^2 - \varepsilon^3) c/4}.

As we said at the beginning, applying the union bound means we need

\displaystyle 2e^{-(\varepsilon^2 - \varepsilon^3) c/4} < \frac{1}{\binom{n}{2}}

Solving this for c gives c \geq \frac{8 \log m}{\varepsilon^2 - \varepsilon^3}, as desired.

\square

Projecting in Practice

Let’s write a python program to actually perform the Johnson-Lindenstrauss dimension reduction scheme. This is sometimes called the Johnson-Lindenstrauss transform, or JLT.

First we define a random subspace by sampling an appropriately-sized matrix with normally distributed entries, and a function that performs the projection onto a given subspace (for testing).

import random
import math
import numpy

def randomSubspace(subspaceDimension, ambientDimension):
   return numpy.random.normal(0, 1, size=(subspaceDimension, ambientDimension))

def project(v, subspace):
   subspaceDimension = len(subspace)
   return (1 / math.sqrt(subspaceDimension)) * subspace.dot(v)

We have a function that computes the theoretical bound on the optimal dimension to reduce to.

def theoreticalBound(n, epsilon):
   return math.ceil(8*math.log(n) / (epsilon**2 - epsilon**3))

And then performing the JLT is simply matrix multiplication

def jlt(data, subspaceDimension):
   ambientDimension = len(data[0])
   A = randomSubspace(subspaceDimension, ambientDimension)
   return (1 / math.sqrt(subspaceDimension)) * A.dot(data.T).T

The high-dimensional dataset we’ll use comes from a data mining competition called KDD Cup 2001. The dataset we used deals with drug design, and the goal is to determine whether an organic compound binds to something called thrombin. Thrombin has something to do with blood clotting, and I won’t pretend I’m an expert. The dataset, however, has over a hundred thousand features for about 2,000 compounds. Here are a few approximate target dimensions we can hope for as epsilon varies.

&gt;&gt;&gt; [((1/x),theoreticalBound(n=2000, epsilon=1/x))
       for x in [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]]
[('0.50', 487), ('0.33', 821), ('0.25', 1298), ('0.20', 1901),
 ('0.17', 2627), ('0.14', 3477), ('0.12', 4448), ('0.11', 5542),
 ('0.10', 6757), ('0.07', 14659), ('0.05', 25604)]

Going down from a hundred thousand dimensions to a few thousand is by any measure decreases the size of the dataset by about 95%. We can also observe how the distribution of overall distances varies as the size of the subspace we project to varies.

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The last three frames are for 10, 5, and 2 dimensions respectively. As you can see the histogram starts to beef up around zero. To be honest I was expecting something a bit more dramatic like a uniform-ish distribution. Of course, the distribution of distances is not all that matters. Another concern is the worst case change in distances between any two points before and after the projection. We can see that indeed when we project to the dimension specified in the theorem, that the distances are within the prescribed bounds.

def checkTheorem(oldData, newData, epsilon):
   numBadPoints = 0

   for (x,y), (x2,y2) in zip(combinations(oldData, 2), combinations(newData, 2)):
      oldNorm = numpy.linalg.norm(x2-y2)**2
      newNorm = numpy.linalg.norm(x-y)**2

      if newNorm == 0 or oldNorm == 0:
         continue

      if abs(oldNorm / newNorm - 1) &gt; epsilon:
         numBadPoints += 1

   return numBadPoints

if __name__ == &quot;__main__&quot;
   from data import thrombin
   train, labels = thrombin.load() 

   numPoints = len(train)
   epsilon = 0.2
   subspaceDim = theoreticalBound(numPoints, epsilon)
   ambientDim = len(train[0])
   newData = jlt(train, subspaceDim)

   print(checkTheorem(train, newData, epsilon))

This program prints zero every time I try running it, which is the poor man’s way of saying it works “with high probability.” We can also plot statistics about the number of pairs of data points that are distorted by more than \varepsilon as the subspace dimension shrinks. We ran this on the following set of subspace dimensions with \varepsilon = 0.1 and took average/standard deviation over twenty trials:

   dims = [1000, 750, 500, 250, 100, 75, 50, 25, 10, 5, 2]

The result is the following chart, whose x-axis is the dimension projected to (so the left hand is the most extreme projection to 2, 5, 10 dimensions), the y-axis is the number of distorted pairs, and the error bars represent a single standard deviation away from the mean.

thrombin-worst-case

This chart provides good news about this dataset because the standard deviations are low. It tells us something that mathematicians often ignore: the predictability of the tradeoff that occurs once you go past the theoretically perfect bound. In this case, the standard deviations tell us that it’s highly predictable. Moreover, since this tradeoff curve measures pairs of points, we might conjecture that the distortion is localized around a single set of points that got significantly “rattled” by the projection. This would be an interesting exercise to explore.

Now all of these charts are really playing with the JLT and confirming the correctness of our code (and hopefully our intuition). The real question is: how well does a machine learning algorithm perform on the original data when compared to the projected data? If the algorithm only “depends” on the pairwise distances between the points, then we should expect nearly identical accuracy in the unprojected and projected versions of the data. To show this we’ll use an easy learning algorithm, the k-nearest-neighbors clustering method. The problem, however, is that there are very few positive examples in this particular dataset. So looking for the majority label of the nearest k neighbors for any k > 2 unilaterally results in the “all negative” classifier, which has 97% accuracy. This happens before and after projecting.

To compensate for this, we modify k-nearest-neighbors slightly by having the label of a predicted point be 1 if any label among its nearest neighbors is 1. So it’s not a majority vote, but rather a logical OR of the labels of nearby neighbors. Our point in this post is not to solve the problem well, but rather to show how an algorithm (even a not-so-good one) can degrade as one projects the data into smaller and smaller dimensions. Here is the code.

def nearestNeighborsAccuracy(data, labels, k=10):
   from sklearn.neighbors import NearestNeighbors
   trainData, trainLabels, testData, testLabels = randomSplit(data, labels) # cross validation
   model = NearestNeighbors(n_neighbors=k).fit(trainData)
   distances, indices = model.kneighbors(testData)
   predictedLabels = []

   for x in indices:
      xLabels = [trainLabels[i] for i in x[1:]]
      predictedLabel = max(xLabels)
      predictedLabels.append(predictedLabel)

   totalAccuracy = sum(x == y for (x,y) in zip(testLabels, predictedLabels)) / len(testLabels)
   falsePositive = (sum(x == 0 and y == 1 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 0 for x in testLabels))
   falseNegative = (sum(x == 1 and y == 0 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 1 for x in testLabels))

   return totalAccuracy, falsePositive, falseNegative

And here is the accuracy of this modified k-nearest-neighbors algorithm run on the thrombin dataset. The horizontal line represents the accuracy of the produced classifier on the unmodified data set. The x-axis represents the dimension projected to (left-hand side is the lowest), and the y-axis represents the accuracy. The mean accuracy over fifty trials was plotted, with error bars representing one standard deviation. The complete code to reproduce the plot is in the Github repository [link link link].

thrombin-knn-accuracy

Likewise, we plot the proportion of false positive and false negatives for the output classifier. Note that a “positive” label made up only about 2% of the total data set. First the false positives

thrombin-knn-fp

Then the false negatives

thrombin-knn-fn

As we can see from these three charts, things don’t really change that much (for this dataset) even when we project down to around 200-300 dimensions. Note that for these parameters the “correct” theoretical choice for dimension was on the order of 5,000 dimensions, so this is a 95% savings from the naive approach, and 99.75% space savings from the original data. Not too shabby.

Notes

The \Omega(\log(n)) worst-case dimension bound is asymptotically tight, though there is some small gap in the literature that depends on \varepsilon. This result is due to Noga Alon, the very last result (Section 9) of this paper.

We did dimension reduction with respect to preserving the Euclidean distance between points. One might naturally wonder if you can achieve the same dimension reduction with a different metric, say the taxicab metric or a p-norm. In fact, you cannot achieve anything close to logarithmic dimension reduction for the taxicab (l_1) metric. This result is due to Brinkman-Charikar in 2004.

The code we used to compute the JLT is not particularly efficient. There are much more efficient methods. One of them, borrowing its namesake from the Fast Fourier Transform, is called the Fast Johnson-Lindenstrauss Transform. The technique is due to Ailon-Chazelle from 2009, and it involves something called “preconditioning a sparse projection matrix with a randomized Fourier transform.” I don’t know precisely what that means, but it would be neat to dive into that in a future post.

The central focus in this post was whether the JLT preserves distances between points, but one might be curious as to whether the points themselves are well approximated. The answer is an enthusiastic no. If the data were images, the projected points would look nothing like the original images. However, it appears the degradation tradeoff is measurable (by some accounts perhaps linear), and there appears to be some work (also this by the same author) when restricting to sparse vectors (like word-association vectors).

Note that the JLT is not the only method for dimensionality reduction. We previously saw principal component analysis (applied to face recognition), and in the future we will cover a related technique called the Singular Value Decomposition. It is worth noting that another common technique specific to nearest-neighbor is called “locality-sensitive hashing.” Here the goal is to project the points in such a way that “similar” points land very close to each other. Say, if you were to discretize the plane into bins, these bins would form the hash values and you’d want to maximize the probability that two points with the same label land in the same bin. Then you can do things like nearest-neighbors by comparing bins.

Another interesting note, if your data is linearly separable (like the examples we saw in our age-old post on Perceptrons), then you can use the JLT to make finding a linear separator easier. First project the data onto the dimension given in the theorem. With high probability the points will still be linearly separable. And then you can use a perceptron-type algorithm in the smaller dimension. If you want to find out which side a new point is on, you project and compare with the separator in the smaller dimension.

Beyond its interest for practical dimensionality reduction, the JLT has had many other interesting theoretical consequences. More generally, the idea of “randomly projecting” your data onto some small dimensional space has allowed mathematicians to get some of the best-known results on many optimization and learning problems, perhaps the most famous of which is called MAX-CUT; the result is by Goemans-Williamson and it led to a mathematical constant being named after them, \alpha_{GW} =.878567 \dots. If you’re interested in more about the theory, Santosh Vempala wrote a wonderful (and short!) treatise dedicated to this topic.

randomprojectionbook

The Many Faces of Set Cover

A while back Peter Norvig posted a wonderful pair of articles about regex golf. The idea behind regex golf is to come up with the shortest possible regular expression that matches one given list of strings, but not the other.

“Regex Golf,” by Randall Munroe.

In the first article, Norvig runs a basic algorithm to recreate and improve the results from the comic, and in the second he beefs it up with some improved search heuristics. My favorite part about this topic is that regex golf can be phrased in terms of a problem called set cover. I noticed this when reading the comic, and was delighted to see Norvig use that as the basis of his algorithm.

The set cover problem shows up in other places, too. If you have a database of items labeled by users, and you want to find the smallest set of labels to display that covers every item in the database, you’re doing set cover. I hear there are applications in biochemistry and biology but haven’t seen them myself.

If you know what a set is (just think of the “set” or “hash set” type from your favorite programming language), then set cover has a simple definition.

Definition (The Set Cover Problem): You are given a finite set U called a “universe” and sets S_1, \dots, S_n each of which is a subset of U. You choose some of the S_i to ensure that every x \in U is in one of your chosen sets, and you want to minimize the number of S_i you picked.

It’s called a “cover” because the sets you pick “cover” every element of U. Let’s do a simple. Let U = \{ 1,2,3,4,5 \} and

\displaystyle S_1 = \{ 1,3,4 \}, S_2 = \{ 2,3,5 \}, S_3 = \{ 1,4,5 \}, S_4 = \{ 2,4 \}

Then the smallest possible number of sets you can pick is 2, and you can achieve this by picking both S_1, S_2 or both S_2, S_3. The connection to regex golf is that you pick U to be the set of strings you want to match, and you pick a set of regexes that match some of the strings in U but none of the strings you want to avoid matching (I’ll call them V). If w is such a regex, then you can form the set S_w of strings that w matches. Then if you find a small set cover with the strings w_1, \dots, w_t, then you can “or” them together to get a single regex w_1 \mid w_2 \mid \dots \mid w_t that matches all of U but none of V.

Set cover is what’s called NP-hard, and one implication is that we shouldn’t hope to find an efficient algorithm that will always give you the shortest regex for every regex golf problem. But despite this, there are approximation algorithms for set cover. What I mean by this is that there is a regex-golf algorithm A that outputs a subset of the regexes matching all of U, and the number of regexes it outputs is such-and-such close to the minimum possible number. We’ll make “such-and-such” more formal later in the post.

What made me sad was that Norvig didn’t go any deeper than saying, “We can try to approximate set cover, and the greedy algorithm is pretty good.” It’s true, but the ideas are richer than that! Set cover is a simple example to showcase interesting techniques from theoretical computer science. And perhaps ironically, in Norvig’s second post a header promised the article would discuss the theory of set cover, but I didn’t see any of what I think of as theory. Instead he partially analyzes the structure of the regex golf instances he cares about. This is useful, but not really theoretical in any way unless he can say something universal about those instances.

I don’t mean to bash Norvig. His articles were great! And in-depth theory was way beyond scope. So this post is just my opportunity to fill in some theory gaps. We’ll do three things:

  1. Show formally that set cover is NP-hard.
  2. Prove the approximation guarantee of the greedy algorithm.
  3. Show another (very different) approximation algorithm based on linear programming.

Along the way I’ll argue that by knowing (or at least seeing) the details of these proofs, one can get a better sense of what features to look for in the set cover instance you’re trying to solve. We’ll also see how set cover depicts the broader themes of theoretical computer science.

NP-hardness

The first thing we should do is show that set cover is NP-hard. Intuitively what this means is that we can take some hard problem P and encode instances of P inside set cover problems. This idea is called a reduction, because solving problem P will “reduce” to solving set cover, and the method we use to encode instance of P as set cover problems will have a small amount of overhead. This is one way to say that set cover is “at least as hard as” P.

The hard problem we’ll reduce to set cover is called 3-satisfiability (3-SAT). In 3-SAT, the input is a formula whose variables are either true or false, and the formula is expressed as an OR of a bunch of clauses, each of which is an AND of three variables (or their negations). This is called 3-CNF form. A simple example:

\displaystyle (x \vee y \vee \neg z) \wedge (\neg x \vee w \vee y) \wedge (z \vee x \vee \neg w)

The goal of the algorithm is to decide whether there is an assignment to the variables which makes the formula true. 3-SAT is one of the most fundamental problems we believe to be hard and, roughly speaking, by reducing it to set cover we include set cover in a class called NP-complete, and if any one of these problems can be solved efficiently, then they all can (this is the famous P versus NP problem, and an efficient algorithm would imply P equals NP).

So a reduction would consist of the following: you give me a formula \varphi in 3-CNF form, and I have to produce (in a way that depends on \varphi!) a universe U and a choice of subsets S_i \subset U in such a way that

\varphi has a true assignment of variables if and only if the corresponding set cover problem has a cover using k sets.

In other words, I’m going to design a function f from 3-SAT instances to set cover instances, such that x is satisfiable if and only if f(x) has a set cover with k sets.

Why do I say it only for k sets? Well, if you can always answer this question then I claim you can find the minimum size of a set cover needed by doing a binary search for the smallest value of k. So finding the minimum size of a set cover reduces to the problem of telling if theres a set cover of size k.

Now let’s do the reduction from 3-SAT to set cover.

If you give me \varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m where each C_i is a clause and the variables are denoted x_1, \dots, x_n, then I will choose as my universe U to be the set of all the clauses and indices of the variables (these are all just formal symbols). i.e.

\displaystyle U = \{ C_1, C_2, \dots, C_m, 1, 2, \dots, n \}

The first part of U will ensure I make all the clauses true, and the last part will ensure I don’t pick a variable to be both true and false at the same time.

To show how this works I have to pick my subsets. For each variable x_i, I’ll make two sets, one called S_{x_i} and one called S_{\neg x_i}. They will both contain i in addition to the clauses which they make true when the corresponding literal is true (by literal I just mean the variable or its negation). For example, if C_j uses the literal \neg x_7, then S_{\neg x_7} will contain C_j but S_{x_7} will not. Finally, I’ll set k = n, the number of variables.

Now to prove this reduction works I have to prove two things: if my starting formula has a satisfying assignment I have to show the set cover problem has a cover of size k. Indeed, take the sets S_{y} for all literals y that are set to true in a satisfying assignment. There can be at most n true literals since half are true and half are false, so there will be at most n sets, and these sets clearly cover all of U because every literal has to be satisfied by some literal or else the formula isn’t true.

The reverse direction is similar: if I have a set cover of size n, I need to use it to come up with a satisfying truth assignment for the original formula. But indeed, the sets that get chosen can’t include both a S_{x_i} and its negation set S_{\neg x_i}, because there are n of the elements \{1, 2, \dots, n \} \subset U, and each i is only in the two S_{x_i}, S_{\neg x_i}. Just by counting if I cover all the indices i, I already account for n sets! And finally, since I have covered all the clauses, the literals corresponding to the sets I chose give exactly a satisfying assignment.

Whew! So set cover is NP-hard because I encoded this logic problem 3-SAT within its rules. If we think 3-SAT is hard (and we do) then set cover must also be hard. So if we can’t hope to solve it exactly we should try to approximate the best solution.

The greedy approach

The method that Norvig uses in attacking the meta-regex golf problem is the greedy algorithm. The greedy algorithm is exactly what you’d expect: you maintain a list L of the subsets you’ve picked so far, and at each step you pick the set S_i that maximizes the number of new elements of U that aren’t already covered by the sets in L. In python pseudocode:

def greedySetCover(universe, sets):
   chosenSets = set()
   leftToCover = universe.copy()
   unchosenSets = sets

   covered = lambda s: leftToCover & s

   while universe != 0:
      if len(chosenSets) == len(sets):
         raise Exception("No set cover possible")
      
      nextSet = max(unchosenSets, key=lambda s: len(covered(s)))
      unchosenSets.remove(nextSet)
      chosenSets.add(nextSet)
      leftToCover -= nextSet
      
   return chosenSets

This is what theory has to say about the greedy algorithm:

Theorem: If it is possible to cover U by the sets in F = \{ S_1, \dots, S_n \}, then the greedy algorithm always produces a cover that at worst has size O(\log(n)) \textup{OPT}, where \textup{OPT} is the size of the smallest cover. Moreover, this is asymptotically the best any algorithm can do.

One simple fact we need from calculus is that the following sum is asymptotically the same as \log(n):

\displaystyle H(n) = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} = \log(n) + O(1)

Proof. [adapted from Wan] Let’s say the greedy algorithm picks sets T_1, T_2, \dots, T_k in that order. We’ll set up a little value system for the elements of U. Specifically, the value of each T_i is 1, and in step i we evenly distribute this unit value across all newly covered elements of T_i. So for T_1 each covered element gets value 1/|T_1|, and if T_2 covers four new elements, each gets a value of 1/4. One can think of this “value” as a price, or energy, or unit mass, or whatever. It’s just an accounting system (albeit a clever one) we use to make some inequalities clear later.

In general call the value v_x of element x \in U the value assigned to x at the step where it’s first covered. In particular, the number of sets chosen by the greedy algorithm k is just \sum_{x \in U} v_x. We’re just bunching back together the unit value we distributed for each step of the algorithm.

Now we want to compare the sets chosen by greedy to the optimal choice. Call a smallest set cover C_{\textup{OPT}}. Let’s stare at the following inequality.

\displaystyle \sum_{x \in U} v_x \leq \sum_{S \in C_{\textup{OPT}}} \sum_{x \in S} v_x

It’s true because each x counts for a v_x at most once in the left hand side, and in the right hand side the sets in C_{\textup{OPT}} must hit each x at least once but may hit some x more than once. Also remember the left hand side is equal to k.

Now we want to show that the inner sum on the right hand side, \sum_{x \in S} v_x, is at most H(|S|). This will in fact prove the entire theorem: because each set S_i has size at most n, the inequality above will turn into

\displaystyle k \leq |C_{\textup{OPT}}| H(|S|) \leq |C_{\textup{OPT}}| H(n)

And so k \leq \textup{OPT} \cdot O(\log(n)), which is the statement of the theorem.

So we want to show that \sum_{x \in S} v_x \leq H(|S|). For each j define \delta_j(S) to be the number of elements in S not covered in T_1, \cup \dots \cup T_j. Notice that \delta_{j-1}(S) - \delta_{j}(S) is the number of elements of S that are covered for the first time in step j. If we call t_S the smallest integer j for which \delta_j(S) = 0, we can count up the differences up to step t_S, we get

\sum_{x \in S} v_x = \sum_{i=1}^{t_S} (\delta_{i-1}(S) - \delta_i(S)) \cdot \frac{1}{T_i - (T_1 \cup \dots \cup T_{i-1})}

The rightmost term is just the cost assigned to the relevant elements at step i. Moreover, because T_i covers more new elements than S (by definition of the greedy algorithm), the fraction above is at most 1/\delta_{i-1}(S). The end is near. For brevity I’ll drop the (S) from \delta_j(S).

\displaystyle \begin{aligned} \sum_{x \in S} v_x & \leq \sum_{i=1}^{t_S} (\delta_{i-1} - \delta_i) \frac{1}{\delta_{i-1}} \\ & \leq \sum_{i=1}^{t_S} (\frac{1}{1 + \delta_i} + \frac{1}{2+\delta_i} \dots + \frac{1}{\delta_{i-1}}) \\ & = \sum_{i=1}^{t_S} H(\delta_{i-1}) - H(\delta_i) \\ &= H(\delta_0) - H(\delta_{t_S}) = H(|S|) \end{aligned}

And that proves the claim.

\square

I have three postscripts to this proof:

  1. This is basically the exact worst-case approximation that the greedy algorithm achieves. In fact, Petr Slavik proved in 1996 that the greedy gives you a set of size exactly (\log n - \log \log n + O(1)) \textup{OPT} in the worst case.
  2. This is also the best approximation that any set cover algorithm can achieve, provided that P is not NP. This result was basically known in 1994, but it wasn’t until 2013 and the use of some very sophisticated tools that the best possible bound was found with the smallest assumptions.
  3. In the proof we used that |S| \leq n to bound things, but if we knew that our sets S_i (i.e. subsets matched by a regex) had sizes bounded by, say, B, the same proof would show that the approximation factor is \log(B) instead of \log n. However, in order for that to be useful you need B to be a constant, or at least to grow more slowly than any polynomial in n, since e.g. \log(n^{0.1}) = 0.1 \log n. In fact, taking a second look at Norvig’s meta regex golf problem, some of his instances had this property! Which means the greedy algorithm gives a much better approximation ratio for certain meta regex golf problems than it does for the worst case general problem. This is one instance where knowing the proof of a theorem helps us understand how to specialize it to our interests.
norvig-table

Norvig’s frequency table for president meta-regex golf. The left side counts the size of each set (defined by a regex)

The linear programming approach

So we just said that you can’t possibly do better than the greedy algorithm for approximating set cover. There must be nothing left to say, job well done, right? Wrong! Our second analysis, based on linear programming, shows that instances with special features can have better approximation results.

In particular, if we’re guaranteed that each element x \in U occurs in at most B of the sets S_i, then the linear programming approach will give a B-approximation, i.e. a cover whose size is at worst larger than OPT by a multiplicative factor of B. In the case that B is constant, we can beat our earlier greedy algorithm.

The technique is now a classic one in optimization, called LP-relaxation (LP stands for linear programming). The idea is simple. Most optimization problems can be written as integer linear programs, that is there you have n variables x_1, \dots, x_n \in \{ 0, 1 \} and you want to maximize (or minimize) a linear function of the x_i subject to some linear constraints. The thing you’re trying to optimize is called the objective. While in general solving integer linear programs is NP-hard, we can relax the “integer” requirement to 0 \leq x_i \leq 1, or something similar. The resulting linear program, called the relaxed program, can be solved efficiently using the simplex algorithm or another more complicated method.

The output of solving the relaxed program is an assignment of real numbers for the x_i that optimizes the objective function. A key fact is that the solution to the relaxed linear program will be at least as good as the solution to the original integer program, because the optimal solution to the integer program is a valid candidate for the optimal solution to the linear program. Then the idea is that if we use some clever scheme to round the x_i to integers, we can measure how much this degrades the objective and prove that it doesn’t degrade too much when compared to the optimum of the relaxed program, which means it doesn’t degrade too much when compared to the optimum of the integer program as well.

If this sounds wishy washy and vague don’t worry, we’re about to make it super concrete for set cover.

We’ll make a binary variable x_i for each set S_i in the input, and x_i = 1 if and only if we include it in our proposed cover. Then the objective function we want to minimize is \sum_{i=1}^n x_i. If we call our elements X = \{ e_1, \dots, e_m \}, then we need to write down a linear constraint that says each element e_j is hit by at least one set in the proposed cover. These constraints have to depend on the sets S_i, but that’s not a problem. One good constraint for element e_j is

\displaystyle \sum_{i : e_j \in S_i} x_i \geq 1

In words, the only way that an e_j will not be covered is if all the sets containing it have their x_i = 0. And we need one of these constraints for each j. Putting it together, the integer linear program is

The integer program for set cover.

The integer program for set cover.

Once we understand this formulation of set cover, the relaxation is trivial. We just replace the last constraint with inequalities.

setcoverlp

For a given candidate assignment x to the x_i, call Z(x) the objective value (in this case \sum_i x_i). Now we can be more concrete about the guarantees of this relaxation method. Let \textup{OPT}_{\textup{IP}} be the optimal value of the integer program and x_{\textup{IP}} a corresponding assignment to x_i achieving the optimum. Likewise let \textup{OPT}_{\textup{LP}}, x_{\textup{LP}} be the optimal things for the linear relaxation. We will prove:

Theorem: There is a deterministic algorithm that rounds x_{\textup{LP}} to integer values x so that the objective value Z(x) \leq B \textup{OPT}_{\textup{IP}}, where B is the maximum number of sets that any element e_j occurs in. So this gives a B-approximation of set cover.

Proof. Let B be as described in the theorem, and call y = x_{\textup{LP}} to make the indexing notation easier. The rounding algorithm is to set x_i = 1 if y_i \geq 1/B and zero otherwise.

To prove the theorem we need to show two things hold about this new candidate solution x:

  1. The choice of all S_i for which x_i = 1 covers every element.
  2. The number of sets chosen (i.e. Z(x)) is at most B times more than \textup{OPT}_{\textup{LP}}.

Since \textup{OPT}_{\textup{LP}} \leq \textup{OPT}_{\textup{IP}}, so if we can prove number 2 we get Z(x) \leq B \textup{OPT}_{\textup{LP}} \leq B \textup{OPT}_{\textup{IP}}, which is the theorem.

So let’s prove 1. Fix any j and we’ll show that element e_j is covered by some set in the rounded solution. Call B_j the number of times element e_j occurs in the input sets. By definition B_j \leq B, so 1/B_j \geq 1/B. Recall y was the optimal solution to the relaxed linear program, and so it must be the case that the linear constraint for each e_j is satisfied: \sum_{i : e_j \in S_i} x_i \geq 1. We know that there are B_j terms and they sums to at least 1, so not all terms can be smaller than 1/B_j (otherwise they’d sum to something less than 1). In other words, some variable x_i in the sum is at least 1/B_j \geq 1/B, and so x_i is set to 1 in the rounded solution, corresponding to a set S_i that contains e_j. This finishes the proof of 1.

Now let’s prove 2. For each j, we know that for each x_i = 1, the corresponding variable y_i \geq 1/B. In particular 1 \leq y_i B. Now we can simply bound the sum.

\displaystyle \begin{aligned} Z(x) = \sum_i x_i &\leq \sum_i x_i (B y_i) \\ &\leq B \sum_{i} y_i \\ &= B \cdot \textup{OPT}_{\textup{LP}} \end{aligned}

The second inequality is true because some of the x_i are zero, but we can ignore them when we upper bound and just include all the y_i. This proves part 2 and the theorem.

\square

I’ve got some more postscripts to this proof:

  1. The proof works equally well when the sets are weighted, i.e. your cost for picking a set is not 1 for every set but depends on some arbitrarily given constants w_i \geq 0.
  2. We gave a deterministic algorithm rounding y to x, but one can get the same result (with high probability) using a randomized algorithm. The idea is to flip a coin with bias y_i roughly \log(n) times and set x_i = 1 if and only if the coin lands heads at least once. The guarantee is no better than what we proved, but for some other problems randomness can help you get approximations where we don’t know of any deterministic algorithms to get the same guarantees. I can’t think of any off the top of my head, but I’m pretty sure they’re out there.
  3. For step 1 we showed that at least one term in the inequality for e_j would be rounded up to 1, and this guaranteed we covered all the elements. A natural question is: why not also round up at most one term of each of these inequalities? It might be that in the worst case you don’t get a better guarantee, but it would be a quick extra heuristic you could use to post-process a rounded solution.
  4. Solving linear programs is slow. There are faster methods based on so-called “primal-dual” methods that use information about the dual of the linear program to construct a solution to the problem. Goemans and Williamson have a nice self-contained chapter on their website about this with a ton of applications.

Additional Reading

Williamson and Shmoys have a large textbook called The Design of Approximation Algorithms. One problem is that this field is like a big heap of unrelated techniques, so it’s not like the book will build up some neat theoretical foundation that works for every problem. Rather, it’s messy and there are lots of details, but there are definitely diamonds in the rough, such as the problem of (and algorithms for) coloring 3-colorable graphs with “approximately 3” colors, and the infamous unique games conjecture.

I wrote a post a while back giving conditions which, if a problem satisfies those conditions, the greedy algorithm will give a constant-factor approximation. This is much better than the worst case \log(n)-approximation we saw in this post. Moreover, I also wrote a post about matroids, which is a characterization of problems where the greedy algorithm is actually optimal.

Set cover is one of the main tools that IBM’s AntiVirus software uses to detect viruses. Similarly to the regex golf problem, they find a set of strings that occurs source code in some viruses but not (usually) in good programs. Then they look for a small set of strings that covers all the viruses, and their virus scan just has to search binaries for those strings. Hopefully the size of your set cover is really small compared to the number of viruses you want to protect against. I can’t find a reference that details this, but that is understandable because it is proprietary software.

Until next time!

Linear Programming and the Simplex Algorithm

In the last post in this series we saw some simple examples of linear programs, derived the concept of a dual linear program, and saw the duality theorem and the complementary slackness conditions which give a rough sketch of the stopping criterion for an algorithm. This time we’ll go ahead and write this algorithm for solving linear programs, and next time we’ll apply the algorithm to an industry-strength version of the nutrition problem we saw last time. The algorithm we’ll implement is called the simplex algorithm. It was the first algorithm for solving linear programs, invented in the 1940’s by George Dantzig, and it’s still the leading practical algorithm, and it was a key part of a Nobel Prize. It’s by far one of the most important algorithms ever devised.

As usual, we’ll post all of the code written in the making of this post on this blog’s Github page.

Slack variables and equality constraints

The simplex algorithm can solve any kind of linear program, but it only accepts a special form of the program as input. So first we have to do some manipulations. Recall that the primal form of a linear program was the following minimization problem.

\min \left \langle c, x \right \rangle \\ \textup{s.t. } Ax \geq b, x \geq 0

where the brackets mean “dot product.” And its dual is

\max \left \langle y, b \right \rangle \\ \textup{s.t. } A^Ty \leq c, y \geq 0

The linear program can actually have more complicated constraints than just the ones above. In general, one might want to have “greater than” and “less than” constraints in the same problem. It turns out that this isn’t any harder, and moreover the simplex algorithm only uses equality constraints, and with some finicky algebra we can turn any set of inequality or equality constraints into a set of equality constraints.

We’ll call our goal the “standard form,” which is as follows:

\max \left \langle c, x \right \rangle \\ \textup{s.t. } Ax = b, x \geq 0

It seems impossible to get the usual minimization/maximization problem into standard form until you realize there’s nothing stopping you from adding more variables to the problem. That is, say we’re given a constraint like:

\displaystyle x_7 + x_3 \leq 10,

we can add a new variable \xi, called a slack variable, so that we get an equality:

\displaystyle x_7 + x_3 + \xi = 10

And now we can just impose that \xi \geq 0. The idea is that \xi represents how much “slack” there is in the inequality, and you can always choose it to make the condition an equality. So if the equality holds and the variables are nonnegative, then the x_i will still satisfy their original inequality. For “greater than” constraints, we can do the same thing but subtract a nonnegative variable. Finally, if we have a minimization problem “\min z” we can convert it to \max -z.

So, to combine all of this together, if we have the following linear program with each kind of constraint,

Screen Shot 2014-10-05 at 12.06.19 AM

We can add new variables \xi_1, \xi_2, and write it as

Screen Shot 2014-10-05 at 12.06.41 AM

By defining the vector variable x = (x_1, x_2, x_3, \xi_1, \xi_2) and c = (-1,-1,-1,0,0) and A to have -1, 0, 1 as appropriately for the new variables, we see that the system is written in standard form.

This is the kind of tedious transformation we can automate with a program. Assuming there are n variables, the input consists of the vector c of length n, and three matrix-vector pairs (A, b) representing the three kinds of constraints. It’s a bit annoying to describe, but the essential idea is that we compute a rectangular “identity” matrix whose diagonal entries are \pm 1, and then join this with the original constraint matrix row-wise. The reader can see the full implementation in the Github repository for this post, though we won’t use this particular functionality in the algorithm that follows.

There are some other additional things we could do: for example there might be some variables that are completely unrestricted. What you do in this case is take an unrestricted variable z and replace it by the difference of two unrestricted variables z' - z''.  For simplicity we’ll ignore this, but it would be a fruitful exercise for the reader to augment the function to account for these.

What happened to the slackness conditions?

The “standard form” of our linear program raises an obvious question: how can the complementary slackness conditions make sense if everything is an equality? It turns out that one can redo all the work one did for linear programs of the form we gave last time (minimize w.r.t. greater-than constraints) for programs in the new “standard form” above. We even get the same complementary slackness conditions! If you want to, you can do this entire routine quite a bit faster if you invoke the power of Lagrangians. We won’t do that here, but the tool shows up as a way to work with primal-dual conversions in many other parts of mathematics, so it’s a good buzzword to keep in mind.

In our case, the only difference with the complementary slackness conditions is that one of the two is trivial: \left \langle y^*, Ax^* - b \right \rangle = 0. This is because if our candidate solution x^* is feasible, then it will have to satisfy Ax = b already. The other one, that \left \langle x^*, A^Ty^* - c \right \rangle = 0, is the only one we need to worry about.

Again, the complementary slackness conditions give us inspiration here. Recall that, informally, they say that when a variable is used at all, it is used as much as it can be to fulfill its constraint (the corresponding dual constraint is tight). So a solution will correspond to a choice of some variables which are either used or not, and a choice of nonzero variables will correspond to a solution. We even saw this happen in the last post when we observed that broccoli trumps oranges. If we can get a good handle on how to navigate the set of these solutions, then we’ll have a nifty algorithm.

Let’s make this official and lay out our assumptions.

Extreme points and basic solutions

Remember that the graphical way to solve a linear program is to look at the line (or hyperplane) given by \langle c, x \rangle = q and keep increasing q (or decreasing it, if you are minimizing) until the very last moment when this line touches the region of feasible solutions. Also recall that the “feasible region” is just the set of all solutions to Ax = b, that is the solutions that satisfy the constraints. We imagined this picture:

The constraints define a convex area of "feasible solutions." Image source: Wikipedia.

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

With this geometric intuition it’s clear that there will always be an optimal solution on a vertex of the feasible region. These points are called extreme points of the feasible region. But because we will almost never work in the plane again (even introducing slack variables makes us relatively high dimensional!) we want an algebraic characterization of these extreme points.

If you have a little bit of practice with convex sets the correct definition is very natural. Recall that a set X is convex if for any two points x, y \in X every point on the line segment between x and y is also in X. An algebraic way to say this (thinking of these points now as vectors) is that every point \delta x + (1-\delta) y \in X when 0 \leq \delta \leq 1. Now an extreme point is just a point that isn’t on the inside of any such line, i.e. can’t be written this way for 0 < \delta < 1. For example,

A convex set with extremal points in red. Image credit Wikipedia.

A convex set with extremal points in red. Image credit Wikipedia.

Another way to say this is that if z is an extreme point then whenever z can be written as \delta x + (1-\delta) y for some 0 < \delta < 1, then actually x=y=z. Now since our constraints are all linear (and there are a finite number of them) they won’t define a convex set with weird curves like the one above. This means that there are a finite number of extreme points that just correspond to the intersections of some of the constraints. So there are at most 2^n possibilities.

Indeed we want a characterization of extreme points that’s specific to linear programs in standard form, “\max \langle c, x \rangle \textup{ s.t. } Ax=b, x \geq 0.” And here is one.

Definition: Let A be an m \times n matrix with n \geq m. A solution x to Ax=b is called basic if at most m of its entries are nonzero.

The reason we call it “basic” is because, under some mild assumptions we describe below, a basic solution corresponds to a vector space basis of \mathbb{R}^m. Which basis? The one given by the m columns of A used in the basic solution. We don’t need to talk about bases like this, though, so in the event of a headache just think of the basis as a set B \subset \{ 1, 2, \dots, n \} of size m corresponding to the nonzero entries of the basic solution.

Indeed, what we’re doing here is looking at the matrix A_B formed by taking the columns of A whose indices are in B, and the vector x_B in the same way, and looking at the equation A_Bx_B = b. If all the parts of x that we removed were zero then this will hold if and only if Ax=b. One might worry that A_B is not invertible, so we’ll go ahead and assume it is. In fact, we’ll assume that every set of m columns of A forms a basis and that the rows of A are also linearly independent. This isn’t without loss of generality because if some rows or columns are not linearly independent, we can remove the offending constraints and variables without changing the set of solutions (this is why it’s so nice to work with the standard form).

Moreover, we’ll assume that every basic solution has exactly m nonzero variables. A basic solution which doesn’t satisfy this assumption is called degenerate, and they’ll essentially be special corner cases in the simplex algorithm. Finally, we call a basic solution feasible if (in addition to satisfying Ax=b) it satisfies x \geq 0. Now that we’ve made all these assumptions it’s easy to see that choosing m nonzero variables uniquely determines a basic feasible solution. Again calling the sub-matrix A_B for a basis B, it’s just x_B = A_B^{-1}b. Now to finish our characterization, we just have to show that under the same assumptions basic feasible solutions are exactly the extremal points of the feasible region.

Proposition: A vector x is a basic feasible solution if and only if it’s an extreme point of the set \{ x : Ax = b, x \geq 0 \}.

Proof. For one direction, suppose you have a basic feasible solution x, and say we write it as x = \delta y + (1-\delta) z for some 0 < \delta < 1. We want to show that this implies y = z. Since all of these points are in the feasible region, all of their coordinates are nonnegative. So whenever a coordinate x_i = 0 it must be that both y_i = z_i = 0. Since x has exactly n-m zero entries, it must be that y, z both have at least n-m zero entries, and hence y,z are both basic. By our non-degeneracy assumption they both then have exactly m nonzero entries. Let B be the set of the nonzero indices of x. Because Ay = Az = b, we have A(y-z) = 0. Now y-z has all of its nonzero entries in B, and because the columns of A_B are linearly independent, the fact that A_B(y-z) = 0 implies y-z = 0.

In the other direction, suppose  that you have some extreme point x which is feasible but not basic. In other words, there are more than m nonzero entries of x, and we’ll call the indices J = \{ j_1, \dots, j_t \} where t > m. The columns of A_J are linearly dependent (since they’re t vectors in \mathbb{R}^m), and so let \sum_{i=1}^t z_{j_i} A_{j_i} be a nontrivial linear combination of the columns of A. Add zeros to make the z_{j_i} into a length n vector z, so that Az = 0. Now

A(x + \varepsilon z) = A(x - \varepsilon z) = Ax = b

And if we pick \varepsilon sufficiently small x \pm \varepsilon z will still be nonnegative, because the only entries we’re changing of x are the strictly positive ones. Then x = \delta (x + \varepsilon z) + (1 - \delta) \varepsilon z for \delta = 1/2, but this is very embarrassing for x who was supposed to be an extreme point. \square

Now that we know extreme points are the same as basic feasible solutions, we need to show that any linear program that has some solution has a basic feasible solution. This is clear geometrically: any time you have an optimum it has to either lie on a line or at a vertex, and if it lies on a line then you can slide it to a vertex without changing its value. Nevertheless, it is a useful exercise to go through the algebra.

Theorem. Whenever a linear program is feasible and bounded, it has a basic feasible solution.

Proof. Let x be an optimal solution to the LP. If x has at most m nonzero entries then it’s a basic solution and by the non-degeneracy assumption it must have exactly m nonzero entries. In this case there’s nothing to do, so suppose that x has r > m nonzero entries. It can’t be a basic feasible solution, and hence is not an extreme point of the set of feasible solutions (as proved by the last theorem). So write it as x = \delta y + (1-\delta) z for some feasible y \neq z and 0 < \delta < 1.

The only thing we know about x is it’s optimal. Let c be the cost vector, and the optimality says that \langle c,x \rangle \geq \langle c,y \rangle, and \langle c,x \rangle \geq \langle c,z \rangle. We claim that in fact these are equal, that y, z are both optimal as well. Indeed, say y were not optimal, then

\displaystyle \langle c, y \rangle < \langle c,x \rangle = \delta \langle c,y \rangle + (1-\delta) \langle c,z \rangle

Which can be rearranged to show that \langle c,y \rangle < \langle c, z \rangle. Unfortunately for x, this implies that it was not optimal all along:

\displaystyle \langle c,x \rangle < \delta \langle c, z \rangle + (1-\delta) \langle c,z \rangle = \langle c,z \rangle

An identical argument works to show z is optimal, too. Now we claim we can use y,z to get a new solution that has fewer than r nonzero entries. Once we show this we’re done: inductively repeat the argument with the smaller solution until we get down to exactly m nonzero variables. As before we know that y,z must have at least as many zeros as x. If they have more zeros we’re done. And if they have exactly as many zeros we can do the following trick. Write w = \gamma y + (1- \gamma)z for a \gamma \in \mathbb{R} we’ll choose later. Note that no matter the \gamma, w is optimal. Rewriting w = z + \gamma (y-z), we just have to pick a \gamma that ensures one of the nonzero coefficients of z is zeroed out while maintaining nonnegativity. Indeed, we can just look at the index i which minimizes z_i / (y-z)_i and use \delta = - z_i / (y-z)_i. \square.

So we have an immediate (and inefficient) combinatorial algorithm: enumerate all subsets of size m, compute the corresponding basic feasible solution x_B = A_B^{-1}b, and see which gives the biggest objective value. The problem is that, even if we knew the value of m, this would take time n^m, and it’s not uncommon for m to be in the tens or hundreds (and if we don’t know m the trivial search is exponential).

So we have to be smarter, and this is where the simplex tableau comes in.

The simplex tableau

Now say you have any basis B and any feasible solution x. For now x might not be a basic solution, and even if it is, its basis of nonzero entries might not be the same as B. We can decompose the equation Ax = b into the basis part and the non basis part:

A_Bx_B + A_{B'} x_{B'} = b

and solving the equation for x_B gives

x_B = A^{-1}_B(b - A_{B'} x_{B'})

It may look like we’re making a wicked abuse of notation here, but both A_Bx_B and A_{B'}x_{B'} are vectors of length m so the dimensions actually do work out. Now our feasible solution x has to satisfy Ax = b, and the entries of x are all nonnegative, so it must be that x_B \geq 0 and x_{B'} \geq 0, and by the equality above A^{-1}_B (b - A_{B'}x_{B'}) \geq 0 as well. Now let’s write the maximization objective \langle c, x \rangle by expanding it first in terms of the x_B, x_{B'}, and then expanding x_B.

\displaystyle \begin{aligned} \langle c, x \rangle & = \langle c_B, x_B \rangle + \langle c_{B'}, x_{B'} \rangle \\  & = \langle c_B, A^{-1}_B(b - A_{B'}x_{B'}) \rangle + \langle c_{B'}, x_{B'} \rangle \\  & = \langle c_B, A^{-1}_Bb \rangle + \langle c_{B'} - (A^{-1}_B A_{B'})^T c_B, x_{B'} \rangle \end{aligned}

If we want to maximize the objective, we can just maximize this last line. There are two cases. In the first, the vector c_{B'} - (A^{-1}_B A_{B'})^T c_B \leq 0 and A_B^{-1}b \geq 0. In the above equation, this tells us that making any component of x_{B'} bigger will decrease the overall objective. In other words, \langle c, x \rangle \leq \langle c_B, A_B^{-1}b \rangle. Picking x = A_B^{-1}b (with zeros in the non basis part) meets this bound and hence must be optimal. In other words, no matter what basis B we’ve chosen (i.e., no matter the candidate basic feasible solution), if the two conditions hold then we’re done.

Now the crux of the algorithm is the second case: if the conditions aren’t met, we can pick a positive index of c_{B'} - (A_B^{-1}A_{B'})^Tc_B and increase the corresponding value of x_{B'} to increase the objective value. As we do this, other variables in the solution will change as well (by decreasing), and we have to stop when one of them hits zero. In doing so, this changes the basis by removing one index and adding another. In reality, we’ll figure out how much to increase ahead of time, and the change will correspond to a single elementary row-operation in a matrix.

Indeed, the matrix we’ll use to represent all of this data is called a tableau in the literature. The columns of the tableau will correspond to variables, and the rows to constraints. The last row of the tableau will maintain a candidate solution y to the dual problem. Here’s a rough picture to keep the different parts clear while we go through the details.

tableau

But to make it work we do a slick trick, which is to “left-multiply everything” by A_B^{-1}. In particular, if we have an LP given by c, A, b, then for any basis it’s equivalent to the LP given by c, A_B^{-1}A, A_{B}^{-1} b (just multiply your solution to the new program by A_B to get a solution to the old one). And so the actual tableau will be of this form.

tableau-symbols

When we say it’s in this form, it’s really only true up to rearranging columns. This is because the chosen basis will always be represented by an identity matrix (as it is to start with), so to find the basis you can find the embedded identity sub-matrix. In fact, the beginning of the simplex algorithm will have the initial basis sitting in the last few columns of the tableau.

Let’s look a little bit closer at the last row. The first portion is zero because A_B^{-1}A_B is the identity. But furthermore with this A_B^{-1} trick the dual LP involves A_B^{-1} everywhere there’s a variable. In particular, joining all but the last column of the last row of the tableau, we have the vector c - A_B^T(A_B^{-1})^T c, and setting y = A_B^{-1}c_B we get a candidate solution for the dual. What makes the trick even slicker is that A_B^{-1}b is already the candidate solution x_B, since (A_B^{-1}A)_B^{-1} is the identity. So we’re implicitly keeping track of two solutions here, one for the primal LP, given by the last column of the tableau, and one for the dual, contained in the last row of the tableau.

I told you the last row was the dual solution, so why all the other crap there? This is the final slick in the trick: the last row further encodes the complementary slackness conditions. Now that we recognize the dual candidate sitting there, the complementary slackness conditions simply ask for the last row to be non-positive (this is just another way of saying what we said at the beginning of this section!). You should check this, but it gives us a stopping criterion: if the last row is non-positive then stop and output the last column.

The simplex algorithm

Now (finally!) we can describe and implement the simplex algorithm in its full glory. Recall that our informal setup has been:

  1. Find an initial basic feasible solution, and set up the corresponding tableau.
  2. Find a positive index of the last row, and increase the corresponding variable (adding it to the basis) just enough to make another variable from the basis zero (removing it from the basis).
  3. Repeat step 2 until the last row is nonpositive.
  4. Output the last column.

This is almost correct, except for some details about how increasing the corresponding variables works. What we’ll really do is represent the basis variables as pivots (ones in the tableau) and then the first 1 in each row will be the variable whose value is given by the entry in the last column of that row. So, for example, the last entry in the first row may be the optimal value for x_5, if the fifth column is the first entry in row 1 to have a 1.

As we describe the algorithm, we’ll illustrate it running on a simple example. In doing this we’ll see what all the different parts of the tableau correspond to from the previous section in each step of the algorithm.

example

Spoiler alert: the optimum is x_1 = 2, x_2 = 1 and the value of the max is 8.

So let’s be more programmatically formal about this. The main routine is essentially pseudocode, and the difficulty is in implementing the helper functions

def simplex(c, A, b):
   tableau = initialTableau(c, A, b)

   while canImprove(tableau):
      pivot = findPivotIndex(tableau)
      pivotAbout(tableau, pivot)

   return primalSolution(tableau), objectiveValue(tableau)

Let’s start with the initial tableau. We’ll assume the user’s inputs already include the slack variables. In particular, our example data before adding slack is

c = [3, 2]
A = [[1, 2], [1, -1]]
b = [4, 1]

And after adding slack:

c = [3, 2, 0, 0]
A = [[1,  2,  1,  0],
     [1, -1,  0,  1]]
b = [4, 1]

Now to set up the initial tableau we need an initial feasible solution in mind. The reader is recommended to work this part out with a pencil, since it’s much easier to write down than it is to explain. Since we introduced slack variables, our initial feasible solution (basis) B can just be (0,0,1,1). And so x_B is just the slack variables, c_B is the zero vector, and A_B is the 2×2 identity matrix. Now A_B^{-1}A_{B'} = A_{B'}, which is just the original two columns of A we started with, and A_B^{-1}b = b. For the last row, c_B is zero so the part under A_B^{-1}A_B is the zero vector. The part under A_B^{-1}A_{B'} is just c_{B'} = (3,2).

Rather than move columns around every time the basis B changes, we’ll keep the tableau columns in order of (x_1, \dots, x_n, \xi_1, \dots, \xi_m). In other words, for our example the initial tableau should look like this.

[[ 1,  2,  1,  0,  4],
 [ 1, -1,  0,  1,  1],
 [ 3,  2,  0,  0,  0]]

So implementing initialTableau is just a matter of putting the data in the right place.

def initialTableau(c, A, b):
   tableau = [row[:] + [x] for row, x in zip(A, b)]
   tableau.append(c[:] + [0])
   return tableau

As an aside: in the event that we don’t start with the trivial basic feasible solution of “trivially use the slack variables,” we’d have to do a lot more work in this function. Next, the primalSolution() and objectiveValue() functions are simple, because they just extract the encoded information out from the tableau (some helper functions are omitted for brevity).

def primalSolution(tableau):
   # the pivot columns denote which variables are used
   columns = transpose(tableau)
   indices = [j for j, col in enumerate(columns[:-1]) if isPivotCol(col)]
   return list(zip(indices, columns[-1]))

def objectiveValue(tableau):
   return -(tableau[-1][-1])

Similarly, the canImprove() function just checks if there’s a nonnegative entry in the last row

def canImprove(tableau):
   lastRow = tableau[-1]
   return any(x &gt; 0 for x in lastRow[:-1])

Let’s run the first loop of our simplex algorithm. The first step is checking to see if anything can be improved (in our example it can). Then we have to find a pivot entry in the tableau. This part includes some edge-case checking, but if the edge cases aren’t a problem then the strategy is simple: find a positive entry corresponding to some entry j of B', and then pick an appropriate entry in that column to use as the pivot. Pivoting increases the value of x_j (from zero) to whatever is the largest we can make it without making some other variables become negative. As we’ve said before, we’ll stop increasing x_j when some other variable hits zero, and we can compute which will be the first to do so by looking at the current values of x_B = A_B^{-1}b (in the last column of the tableau), and seeing how pivoting will affect them. If you stare at it for long enough, it becomes clear that the first variable to hit zero will be the entry x_i of the basis for which x_i / A_{i,j} is minimal (and A_{i,j} has to be positve). This is because, in order to maintain the linear equalities, every entry of x_B will be decreased by that value during a pivot, and we can’t let any of the variables become negative.

All of this results in the following function, where we have left out the degeneracy/unboundedness checks.

def findPivotIndex(tableau):
   # pick first nonzero index of the last row
   column = [i for i,x in enumerate(tableau[-1][:-1]) if x &gt; 0][0]
   quotients = [(i, r[-1] / r[column]) for i,r in enumerate(tableau[:-1]) if r[column] &gt; 0]

   # pick row index minimizing the quotient
   row = min(quotients, key=lambda x: x[1])[0]
   return row, column

For our example, the minimizer is the (1,0) entry (second row, first column). Pivoting is just doing the usual elementary row operations (we covered this in a primer a while back on row-reduction). The pivot function we use here is no different, and in particular mutates the list in place.

def pivotAbout(tableau, pivot):
   i,j = pivot

   pivotDenom = tableau[i][j]
   tableau[i] = [x / pivotDenom for x in tableau[i]]

   for k,row in enumerate(tableau):
      if k != i:
         pivotRowMultiple = [y * tableau[k][j] for y in tableau[i]]
         tableau[k] = [x - y for x,y in zip(tableau[k], pivotRowMultiple)]

And in our example pivoting around the chosen entry gives the new tableau.

[[ 0.,  3.,  1., -1.,  3.],
 [ 1., -1.,  0.,  1.,  1.],
 [ 0.,  5.,  0., -3., -3.]]

In particular, B is now (1,0,1,0), since our pivot removed the second slack variable \xi_2 from the basis. Currently our solution has x_1 = 1, \xi_1 = 3. Notice how the identity submatrix is still sitting in there, the columns are just swapped around.

There’s still a positive entry in the bottom row, so let’s continue. The next pivot is (0,1), and pivoting around that entry gives the following tableau:

[[ 0.        ,  1.        ,  0.33333333, -0.33333333,  1.        ],
 [ 1.        ,  0.        ,  0.33333333,  0.66666667,  2.        ],
 [ 0.        ,  0.        , -1.66666667, -1.33333333, -8.        ]]

And because all of the entries in the bottom row are negative, we’re done. We read off the solution as we described, so that the first variable is 2 and the second is 1, and the objective value is the opposite of the bottom right entry, 8.

To see all of the source code, including the edge-case-checking we left out of this post, see the Github repository for this post.

Obvious questions and sad answers

An obvious question is: what is the runtime of the simplex algorithm? Is it polynomial in the size of the tableau? Is it even guaranteed to stop at some point? The surprising truth is that nobody knows the answer to all of these questions! Originally (in the 1940’s) the simplex algorithm actually had an exponential runtime in the worst case, though this was not known until 1972. And indeed, to this day while some variations are known to terminate, no variation is known to have polynomial runtime in the worst case. Some of the choices we made in our implementation (for example, picking the first column with a positive entry in the bottom row) have the potential to cycle, i.e., variables leave and enter the basis without changing the objective at all. Doing something like picking a random positive column, or picking the column which will increase the objective value by the largest amount are alternatives. Unfortunately, every single pivot-picking rule is known to give rise to exponential-time simplex algorithms in the worst case (in fact, this was discovered as recently as 2011!). So it remains open whether there is a variant of the simplex method that runs in guaranteed polynomial time.

But then, in a stunning turn of events, Leonid Khachiyan proved in the 70’s that in fact linear programs can always be solved in polynomial time, via a completely different algorithm called the ellipsoid method. Following that was a method called the interior point method, which is significantly more efficient. Both of these algorithms generalize to problems that are harder than linear programming as well, so we will probably cover them in the distant future of this blog.

Despite the celebratory nature of these two results, people still use the simplex algorithm for industrial applications of linear programming. The reason is that it’s much faster in practice, and much simpler to implement and experiment with.

The next obvious question has to do with the poignant observation that whole numbers are great. That is, you often want the solution to your problem to involve integers, and not real numbers. But adding the constraint that the variables in a linear program need to be integer valued (even just 0-1 valued!) is NP-complete. This problem is called integer linear programming, or just integer programming (IP). So we can’t hope to solve IP, and rightly so: the reader can verify easily that boolean satisfiability instances can be written as linear programs where each clause corresponds to a constraint.

This brings up a very interesting theoretical issue: if we take an integer program and just remove the integrality constraints, and solve the resulting linear program, how far away are the two solutions? If they’re close, then we can hope to give a good approximation to the integer program by solving the linear program and somehow turning the resulting solution back into an integer solution. In fact this is a very popular technique called LP-rounding. We’ll also likely cover that on this blog at some point.

Oh there’s so much to do and so little time! Until next time.