The Reasonable Effectiveness of the Multiplicative Weights Update Algorithm


Christos Papadimitriou, who studies multiplicative weights in the context of biology.

Hard to believe

Sanjeev Arora and his coauthors consider it “a basic tool [that should be] taught to all algorithms students together with divide-and-conquer, dynamic programming, and random sampling.” Christos Papadimitriou calls it “so hard to believe that it has been discovered five times and forgotten.” It has formed the basis of algorithms in machine learning, optimization, game theory, economics, biology, and more.

What mystical algorithm has such broad applications? Now that computer scientists have studied it in generality, it’s known as the Multiplicative Weights Update Algorithm (MWUA). Procedurally, the algorithm is simple. I can even describe the core idea in six lines of pseudocode. You start with a collection of n objects, and each object has a weight.

Set all the object weights to be 1.
For some large number of rounds:
   Pick an object at random proportionally to the weights
   Some event happens
   Increase the weight of the chosen object if it does well in the event
   Otherwise decrease the weight

The name “multiplicative weights” comes from how we implement the last step: if the weight of the chosen object at step t is w_t before the event, and G represents how well the object did in the event, then we’ll update the weight according to the rule:

\displaystyle w_{t+1} = w_t (1 + G)

Think of this as increasing the weight by a small multiple of the object’s performance on a given round.

Here is a simple example of how it might be used. You have some money you want to invest, and you have a bunch of financial experts who are telling you what to invest in every day. So each day you pick an expert, and you follow their advice, and you either make a thousand dollars, or you lose a thousand dollars, or something in between. Then you repeat, and your goal is to figure out which expert is the most reliable.

This is how we use multiplicative weights: if we number the experts 1, \dots, N, we give each expert a weight w_i which starts at 1. Then, each day we pick an expert at random (where experts with larger weights are more likely to be picked) and at the end of the day we have some gain or loss G. Then we update the weight of the chosen expert by multiplying it by (1 + G / 1000). Sometimes you have enough information to update the weights of experts you didn’t choose, too. The theoretical guarantees of the algorithm say we’ll find the best expert quickly (“quickly” will be made concrete later).
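To see the mechanics in the smallest possible setting, here is a minimal sketch of the experts game in Python. The daily gains are made-up numbers purely for illustration, and the generic implementation comes later in the post.

import random

# hypothetical daily gains (in dollars) for three experts; in reality
# these are only revealed one day at a time
dailyGains = [
    [1000, -500, 200],
    [-200, 800, 100],
    [900, -100, 50],
]

weights = [1.0, 1.0, 1.0]
for gains in dailyGains:
    # pick an expert at random, proportionally to the weights
    chosenExpert = random.choices(range(3), weights=weights)[0]
    G = gains[chosenExpert]
    # multiplicative update, scaling the gain into [-1, 1]
    weights[chosenExpert] *= (1 + G / 1000)

print(weights)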

In fact, let’s play a game where you, dear reader, get to decide the rewards for each expert and each day. I programmed the multiplicative weights algorithm to react according to your choices. Click the image below to go to the demo.


This core mechanism of updating weights can be interpreted in many ways, and that’s part of the reason it has sprouted up all over mathematics and computer science. Just a few examples of where this has led:

  1. In game theory, weights are the “belief” of a player about the strategy of an opponent. The most famous algorithm to use this is called Fictitious Play, and others include EXP3 for minimizing regret in the so-called “adversarial bandit learning” problem.
  2. In machine learning, weights are the difficulty of a specific training example, so that higher weights mean the learning algorithm has to “try harder” to accommodate that example. The first result I’m aware of for this is the Perceptron (and similar Winnow) algorithm for learning hyperplane separators. The most famous is the AdaBoost algorithm.
  3. Analogously, in optimization, the weights are the difficulty of a specific constraint, and this technique can be used to approximately solve linear and semidefinite programs. The approximation is because MWUA only provides a solution with some error.
  4. In mathematical biology, the weights represent the fitness of individual alleles, and filtering reproductive success based on this and updating weights for successful organisms produces a mechanism very much like evolution. With modifications, it also provides a mechanism through which to understand sex in the context of evolutionary biology.
  5. The TCP protocol, which basically defined the internet, uses additive and multiplicative weight updates (which are very similar in the analysis) to manage congestion.
  6. You can get easy \log(n)-approximation algorithms for many NP-hard problems, such as set cover.

Additional, more technical examples can be found in this survey of Arora et al.

In the rest of this post, we’ll implement a generic Multiplicative Weights Update Algorithm, we’ll prove its main theoretical guarantees, and we’ll implement a linear program solver as an example of its applicability. As usual, all of the code used in the making of this post is available in a Github repository.

The generic MWUA algorithm

Let’s start by writing down pseudocode and an implementation for the MWUA algorithm in full generality.

In general we have some set X of objects and some set Y of “event outcomes” which can be completely independent. If these sets are finite, we can write down a table M whose rows are objects, whose columns are outcomes, and whose i,j entry M(i,j) is the reward produced by object x_i when the outcome is y_j. We will also write this as M(x, y) for object x and outcome y. The only assumption we’ll make on the rewards is that the values M(x, y) are bounded by some small constant B (by small I mean B should not require exponentially many bits to write down as compared to the size of X). In symbols, M(x,y) \in [0,B]. There are minor modifications you can make to the algorithm if you want negative rewards, but for simplicity we will leave that out. Note the table M just exists for analysis, and the algorithm does not know its values. Moreover, while the values in M are static, the choice of outcome y for a given round may be nondeterministic.

The MWUA algorithm randomly chooses an object x \in X in every round, observing the outcome y \in Y, and collecting the reward M(x,y) (or losing it as a penalty). The guarantee of the MWUA theorem is that the expected sum of rewards/penalties of MWUA is not much worse than if one had picked the best object (in hindsight) every single round.

Let’s describe the algorithm in notation first and build up pseudocode as we go. The input to the algorithm is the set of objects, a subroutine that observes an outcome, a black-box reward function, a learning rate parameter, and a number of rounds.

def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
   ...

We define for object x a nonnegative number w_x we call a “weight.” The weights will change over time so we’ll also sub-script a weight with a round number t, i.e. w_{x,t} is the weight of object x in round t. Initially, all the weights are 1. Then MWUA continues in rounds. We start each round by drawing an object randomly with probability proportional to the weights. Then we observe the outcome for that round and the reward for that round.

import random

# draw: [float] -> int
# pick an index from the given list of floats proportionally
# to the size of the entry (i.e. normalize to a probability
# distribution and draw according to the probabilities).
def draw(weights):
    choice = random.uniform(0, sum(weights))
    choiceIndex = 0

    for weight in weights:
        choice -= weight
        if choice <= 0:
            return choiceIndex

        choiceIndex += 1

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
   weights = [1] * len(objects)
   for t in range(numRounds):
      chosenObjectIndex = draw(weights)
      chosenObject = objects[chosenObjectIndex]

      outcome = observeOutcome(t, weights, chosenObject)
      thisRoundReward = reward(chosenObject, outcome)

      ...

Sampling objects in this way is the same as associating a distribution D_t to each round, where if S_t = \sum_{x \in X} w_{x,t} then the probability of drawing x, which we denote D_t(x), is w_{x,t} / S_t. We don’t need to keep track of this distribution in the actual run of the algorithm, but it will help us with the mathematical analysis.
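If you ever want that distribution explicitly (say, to compute the expected reward or to debug), it's just a normalization of the weights. A quick sketch:

def distribution(weights):
    # D_t(x) = w_{x,t} / S_t, where S_t is the sum of the weights
    total = sum(weights)
    return [w / total for w in weights]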

Next comes the weight update step. Let’s call our learning rate variable parameter \varepsilon. In round t say we have object x_t and outcome y_t, then the reward is M(x_t, y_t). We update the weight of the chosen object x_t according to the formula:

\displaystyle w_{x_t, t+1} = w_{x_t, t} (1 + \varepsilon M(x_t, y_t) / B)

In the more general event that you have rewards for all objects (if not, the reward-producing function can output zero), you would perform this weight update on all objects x \in X. This turns into the following Python snippet, where we hide the division by B into the choice of learning rate:

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
   weights = [1] * len(objects)
   cumulativeReward = 0
   outcomes = []

   for t in range(numRounds):
      chosenObjectIndex = draw(weights)
      chosenObject = objects[chosenObjectIndex]

      outcome = observeOutcome(t, weights, chosenObject)
      outcomes.append(outcome)

      thisRoundReward = reward(chosenObject, outcome)
      cumulativeReward += thisRoundReward

      for i in range(len(weights)):
         weights[i] *= (1 + learningRate * reward(objects[i], outcome))

   return weights, cumulativeReward, outcomes

One of the amazing things about this algorithm is that the outcomes and rewards could be chosen adaptively by an adversary who knows everything about the MWUA algorithm (except which random numbers the algorithm generates to make its choices). This means that the rewards in round t can depend on the weights in that same round! We will exploit this when we solve linear programs later in this post.

But even in such an oppressive, exploitative environment, MWUA persists and achieves its guarantee. And now we can state that guarantee.

Theorem (from Arora et al): The cumulative reward of the MWUA algorithm is, up to constant multiplicative factors, at least the cumulative reward of the best object minus \log(n), where n is the number of objects. (Exact formula at the end of the proof)

The core of the proof, which we’ll state as a lemma, uses one of the most elegant proof techniques in all of mathematics. It’s the idea of constructing a potential function, and tracking the change in that potential function over time. Such a proof usually has the mysterious script:

  1. Define potential function, in our case S_t.
  2. State what seems like trivial facts about the potential function to write S_{t+1} in terms of S_t, and hence get general information about S_T for some large T.
  3. Theorem is proved.
  4. Wait, what?

Clearly, coming up with a useful potential function is a difficult and prized skill.

In this proof our potential function is the sum of the weights of the objects in a given round, S_t = \sum_{x \in X} w_{x, t}. Now the lemma.

Lemma: Let B be the bound on the size of the rewards, and 0 < \varepsilon < 1/2 a learning parameter. Recall that D_t(x) is the probability that MWUA draws object x in round t. Write the expected reward for MWUA for round t as the following (using only the definition of expected value):

\displaystyle R_t = \sum_{x \in X} D_t(x) M(x, y_t)

 Then the claim of the lemma is:

\displaystyle S_{t+1} \leq S_t e^{\varepsilon R_t / B}

Proof. Expand S_{t+1} = \sum_{x \in X} w_{x, t+1} using the definition of the MWUA update:

\displaystyle \sum_{x \in X} w_{x, t+1} = \sum_{x \in X} w_{x, t}(1 + \varepsilon M(x, y_t) / B)

Now distribute w_{x, t} and split into two sums:

\displaystyle \dots = \sum_{x \in X} w_{x, t} + \frac{\varepsilon}{B} \sum_{x \in X} w_{x,t} M(x, y_t)

Using the fact that D_t(x) = \frac{w_{x,t}}{S_t}, we can replace w_{x,t} with D_t(x) S_t, which allows us to get R_t

\displaystyle \begin{aligned} \dots &= S_t + \frac{\varepsilon S_t}{B} \sum_{x \in X} D_t(x) M(x, y_t) \\ &= S_t \left ( 1 + \frac{\varepsilon R_t}{B} \right ) \end{aligned}

And then using the fact that (1 + x) \leq e^x (Taylor series), we can bound the last expression by S_te^{\varepsilon R_t / B}, as desired.

\square

Now using the lemma, we can get a hold on S_T for a large T, namely that

\displaystyle S_T \leq S_1 e^{\varepsilon \sum_{t=1}^T R_t / B}

If |X| = n then S_1 = n, simplifying the above. Moreover, the sum of the weights in round T is certainly greater than any single weight. And since each scaled reward M(x, y_t)/B lies in [0,1], we have (1 + \varepsilon M(x, y_t)/B) \geq (1 + \varepsilon)^{M(x, y_t)/B} for each factor in the product defining w_{x,T}, so that for every fixed object x \in X,

\displaystyle S_T \geq w_{x,T} \geq (1 + \varepsilon)^{\sum_t M(x, y_t) / B}

Squeezing S_T between these two inequalities and taking logarithms (to simplify the exponents) gives

\displaystyle \left ( \sum_t M(x, y_t) / B \right ) \log(1+\varepsilon) \leq \log n + \frac{\varepsilon}{B} \sum_t R_t

Multiply through by B, divide by \varepsilon, rearrange, and use the fact that when 0 < \varepsilon < 1/2 we have \log(1 + \varepsilon) \geq \varepsilon - \varepsilon^2 (Taylor series) to get

\displaystyle \sum_t R_t \geq \left [ \sum_t M(x, y_t) \right ] (1-\varepsilon) - \frac{B \log n}{\varepsilon}

The bracketed term is the payoff of object x, and MWUA’s payoff is at least a fraction of that minus the logarithmic term. The bound applies to any object x \in X, and hence to the best one. This proves the theorem.

\square

Briefly discussing the bound itself, we see that the smaller the learning rate is, the closer you eventually get to the best object, but by contrast the more the subtracted quantity B \log(n) / \varepsilon hurts you. If your target is an absolute error bound against the best performing object on average, you can do more algebra to determine how many rounds you need in terms of a fixed \delta. The answer is roughly: let \varepsilon = O(\delta / B) and pick T = O(B^2 \log(n) / \delta^2). See this survey for more.
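As a rough rule of thumb, that parameter choice might look like the following sketch, with all the hidden constants set to 1; treat the output as a ballpark figure rather than a guarantee.

from math import log

def suggestedParameters(delta, B, n):
    # epsilon = O(delta / B) and T = O(B^2 log(n) / delta^2); the true
    # constants depend on the exact analysis in the survey
    epsilon = delta / B
    numRounds = int(B ** 2 * log(n) / delta ** 2) + 1
    return epsilon, numRounds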

MWUA for linear programs

Now we’ll approximately solve a linear program using MWUA. Recall that a linear program is an optimization problem whose goal is to minimize (or maximize) a linear function of many variables. The objective to minimize is usually given as a dot product c \cdot x, where c is a fixed vector and x = (x_1, x_2, \dots, x_n) is a vector of non-negative variables the algorithm gets to choose. The choices for x are also constrained by a set of m linear inequalities, A_i \cdot x \geq b_i, where A_i is a fixed vector and b_i is a scalar for i = 1, \dots, m. This is usually summarized by putting all the A_i in a matrix, b_i in a vector, as

x_{\textup{OPT}} = \textup{argmin}_x \{ c \cdot x \mid Ax \geq b, x \geq 0 \}

We can further simplify the constraints by assuming we know the optimal value Z = c \cdot x_{\textup{OPT}} in advance, by doing a binary search (more on this later). So, if we ignore the hard constraint Ax \geq b, the “easy feasible region” of possible x‘s includes \{ x \mid x \geq 0, c \cdot x = Z \}.

In order to fit linear programming into the MWUA framework we have to define two things.

  1. The objects: the set of linear inequalities A_i \cdot x \geq b_i.
  2. The rewards: the error of a constraint for a special input vector x_t.

Number 2 is curious (why would we give a reward for error?) but it’s crucial and we’ll discuss it momentarily.

The special input x_t depends on the weights in round t (which is allowed, recall). Specifically, if the weights are w = (w_1, \dots, w_m), we ask for a vector x_t in our “easy feasible region” which satisfies

\displaystyle (A^T w) \cdot x_t \geq w \cdot b

For this post we call the implementation of procuring such a vector the “oracle,” since it can be seen as the black-box problem of, given a vector \alpha and a scalar \beta and a convex region R, finding a vector x \in R satisfying \alpha \cdot x \geq \beta. This allows one to solve more complex optimization problems with the same technique, swapping in a new oracle as needed. Our choice of inputs, \alpha = A^T w, \beta = w \cdot b, are particular to the linear programming formulation.

Two remarks on this choice of inputs. First, the vector A^T w is a weighted average of the constraints in A, and w \cdot b is a weighted average of the thresholds. So this inequality is a “weighted average” inequality (specifically, a convex combination, since the weights are nonnegative). In particular, if no such x exists, then the original linear program has no solution. Indeed, given a solution x^* to the original linear program, each constraint, say A_1 \cdot x^* \geq b_1, remains true after multiplying both sides by the nonnegative weight w_1, and summing the scaled constraints gives (A^T w) \cdot x^* \geq w \cdot b.

Second, and more important to the conceptual understanding of this algorithm, the choice of rewards and the multiplicative updates ensure that easier constraints show up less prominently in the inequality by having smaller weights. That is, if we end up overly satisfying a constraint, we penalize that object for future rounds so we don’t waste our effort on it. The byproduct of MWUA—the weights—identify the hardest constraints to satisfy, and so in each round we can put a proportionate amount of effort into solving (one of) the hard constraints. This is why it makes sense to reward error; the error is a signal for where to improve, and by over-representing the hard constraints, we force MWUA’s attention on them.

At the end, our final output is an average of the x_t produced in each round, i.e. x^* = \frac{1}{T}\sum_t x_t. This vector satisfies all the constraints to a roughly equal degree. We will skip the proof that this vector does what we want, but see these notes for a simple proof. We’ll spend the rest of this post implementing the scheme outlined above.

Implementing the oracle

Fix the convex region R = \{ c \cdot x = Z, x \geq 0 \} for a known optimal value Z. Define \textup{oracle}(\alpha, \beta) as the problem of finding an x \in R such that \alpha \cdot x \geq \beta.

For the case of this linear region R, we can simply find the index i which maximizes \alpha_i Z / c_i. If this value is at least \beta, we can return the vector with Z / c_i in the i-th position and zeros elsewhere. Otherwise, the problem has no solution.

To prove the “no solution” part, say n=2 and you have x = (x_1, x_2) a solution to \alpha \cdot x \geq \beta. Then for whichever index makes \alpha_i Z / c_i bigger, say i=1, you can increase \alpha \cdot x without changing c \cdot x = Z by replacing x_1 with x_1 + (c_2/c_1)x_2 and x_2 with zero. I.e., we’re moving the solution x along the line c \cdot x = Z until it reaches a vertex of the region bounded by c \cdot x = Z and x \geq 0. This must happen when all entries but one are zero. This is the same reason why optimal solutions of (generic) linear programs occur at vertices of their feasible regions.

The code for this becomes quite simple. Note we use the numpy library in the entire codebase to make linear algebra operations fast and simple to read.

import numpy

class InfeasibleException(Exception):
    pass

def makeOracle(c, optimalValue):
    n = len(c)

    def oracle(weightedVector, weightedThreshold):
        def quantity(i):
            return weightedVector[i] * optimalValue / c[i] if c[i] > 0 else -1

        biggest = max(range(n), key=quantity)
        if quantity(biggest) < weightedThreshold:
            raise InfeasibleException

        return numpy.array([optimalValue / c[i] if i == biggest else 0 for i in range(n)])

    return oracle
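
For a concrete (if contrived) call, take the example linear program that appears at the end of this post: c = (1, 2, 1), the rows of A are (1, 2, 3) and (0, 4, 2), b = (5, 6), the optimal value is 3, and suppose the current weights are w = (1, 1). Then:

oracleExample = makeOracle(numpy.array([1, 2, 1]), 3)

weightedVector = numpy.array([1, 6, 5])   # A^T w for w = (1, 1)
weightedThreshold = 11                    # w . b for b = (5, 6)

# the largest value of weightedVector[i] * 3 / c[i] is 15, at index 2; it
# clears the threshold of 11, so the oracle returns (0, 0, 3)
print(oracleExample(weightedVector, weightedThreshold))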

Implementing the core solver

The core solver implements the discussion from previously, given the optimal value of the linear program as input. To avoid too many single-letter variable names, we use linearObjective instead of c.

def solveGivenOptimalValue(A, b, linearObjective, optimalValue, learningRate=0.1):
    m, n = A.shape  # m equations, n variables
    oracle = makeOracle(linearObjective, optimalValue)

    def reward(i, specialVector):
        ...

    def observeOutcome(_, weights, __):
        ...

    numRounds = 1000
    weights, cumulativeReward, outcomes = MWUA(
        range(m), observeOutcome, reward, learningRate, numRounds
    )
    averageVector = sum(outcomes) / numRounds

    return averageVector

First we make the oracle, then the reward and outcome-producing functions, then we invoke the MWUA subroutine. Here are those two functions; they are closures because they need access to A and b. Note that neither c nor the optimal value show up here.

    def reward(i, specialVector):
        constraint = A[i]
        threshold = b[i]
        return threshold - numpy.dot(constraint, specialVector)

    def observeOutcome(_, weights, __):
        weights = numpy.array(weights)
        weightedVector = A.transpose().dot(weights)
        weightedThreshold = weights.dot(b)
        return oracle(weightedVector, weightedThreshold)

Implementing the binary search, and an example

Finally, the top-level routine. Note that the binary search for the optimal value is not sophisticated (though it could be made more so). It takes a max range for the search, and invokes the optimization subroutine, moving the upper bound down if the linear program is feasible and moving the lower bound up otherwise.

def solve(A, b, linearObjective, maxRange=1000):
    optRange = [0, maxRange]

    while optRange[1] - optRange[0] > 1e-8:
        proposedOpt = sum(optRange) / 2
        print("Attempting to solve with proposedOpt=%G" % proposedOpt)

        # Because the binary search starts so high, it results in extreme
        # reward values that must be tempered by a slow learning rate. Exercise
        # to the reader: determine absolute bounds for the rewards, and set
        # this learning rate in a more principled fashion.
        learningRate = 1 / max(2 * proposedOpt * c for c in linearObjective)
        learningRate = min(learningRate, 0.1)

        try:
            result = solveGivenOptimalValue(A, b, linearObjective, proposedOpt, learningRate)
            optRange[1] = proposedOpt
        except InfeasibleException:
            optRange[0] = proposedOpt

    return result

Finally, a simple example:

A = numpy.array([[1, 2, 3], [0, 4, 2]])
b = numpy.array([5, 6])
c = numpy.array([1, 2, 1])

x = solve(A, b, c)
print(x)
print(c.dot(x))
print(A.dot(x) - b)

The output:

Attempting to solve with proposedOpt=500
Attempting to solve with proposedOpt=250
Attempting to solve with proposedOpt=125
Attempting to solve with proposedOpt=62.5
Attempting to solve with proposedOpt=31.25
Attempting to solve with proposedOpt=15.625
Attempting to solve with proposedOpt=7.8125
Attempting to solve with proposedOpt=3.90625
Attempting to solve with proposedOpt=1.95312
Attempting to solve with proposedOpt=2.92969
Attempting to solve with proposedOpt=3.41797
Attempting to solve with proposedOpt=3.17383
Attempting to solve with proposedOpt=3.05176
Attempting to solve with proposedOpt=2.99072
Attempting to solve with proposedOpt=3.02124
Attempting to solve with proposedOpt=3.00598
Attempting to solve with proposedOpt=2.99835
Attempting to solve with proposedOpt=3.00217
Attempting to solve with proposedOpt=3.00026
Attempting to solve with proposedOpt=2.99931
Attempting to solve with proposedOpt=2.99978
Attempting to solve with proposedOpt=3.00002
Attempting to solve with proposedOpt=2.9999
Attempting to solve with proposedOpt=2.99996
Attempting to solve with proposedOpt=2.99999
Attempting to solve with proposedOpt=3.00001
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3  # note %G rounds the printed values
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
[ 0.     0.987  1.026]
3.00000000425
[  5.20000072e-02   8.49831849e-09]

So there we have it. A fiendishly clever use of multiplicative weights for solving linear programs.

Discussion

One of the nice aspects of MWUA is that it’s completely transparent. If you want to know why a decision was made, you can simply look at the weights and look at the history of rewards of the objects. There’s also a clear interpretation of what is being optimized, as the potential function used in the proof is a measure of both quality and adaptability to change. The latter is why MWUA succeeds even in adversarial settings, and why it makes sense to think about MWUA in the context of evolutionary biology.

This even makes one imagine new problems that traditional algorithms cannot solve, but which MWUA handles with grace. For example, imagine trying to solve an “online” linear program in which over time a constraint can change. MWUA can adapt to maintain its approximate solution.

The linear programming technique is known in the literature as the Plotkin-Shmoys-Tardos framework for covering and packing problems. The same ideas extend to other convex optimization problems, including semidefinite programming.

If you’ve been reading this entire post screaming “This is just gradient descent!” then you’re right and wrong. It bears a striking resemblance to gradient descent (see this document for details about how special cases of MWUA are gradient descent by another name), but the adaptivity of the rewards makes MWUA different.

Even though so many people have been advocating for MWUA over the past decade, it’s surprising that it doesn’t show up in the general math/CS discourse on the internet or even in many algorithms courses. The Arora survey I referenced is from 2005 and the linear programming technique I demoed is originally from 1991! I took algorithms classes wherever I could, starting undergraduate in 2007, and I didn’t even hear a whisper of this technique until midway through my PhD in theoretical CS (I did, however, study fictitious play in a game theory class). I don’t have an explanation for why this is the case, except maybe that it takes more than 20 years for techniques to make it to the classroom. At the very least, this is one good reason to go to graduate school. You learn the things (and where to look for the things) which haven’t made it to classrooms yet.

Until next time!


Singular Value Decomposition Part 2: Theorem, Proof, Algorithm

I’m just going to jump right into the definitions and rigor, so if you haven’t read the previous post motivating the singular value decomposition, go back and do that first. This post will be theorem, proof, algorithm, data. The data set we test on is a thousand-story CNN news data set. All of the data, code, and examples used in this post are in a Github repository, as usual.

We start with the best-approximating k-dimensional linear subspace.

Definition: Let X = \{ x_1, \dots, x_m \} be a set of m points in \mathbb{R}^n. The best approximating k-dimensional linear subspace of X is the k-dimensional linear subspace V \subset \mathbb{R}^n which minimizes the sum of the squared distances from the points in X to V.

Let me clarify what I mean by minimizing the sum of squared distances. First we’ll start with the simple case: we have a vector x \in X, and a candidate line L (a 1-dimensional subspace) that is the span of a unit vector v. The squared distance from x to the line spanned by v is the squared length of x minus the squared length of the projection of x onto v. Here’s a picture.

[Figure: the vector x (black), its projection y onto the line spanned by v (green), and the difference vector z = x - y (pink), forming a right triangle]

I’m saying that the pink vector z in the picture is the difference of the black and green vectors x-y, and that the “distance” from x to v is the length of the pink vector. The reason is just the Pythagorean theorem: the vector x is the hypotenuse of a right triangle whose other two sides are the projected vector y and the difference vector z.

Let’s throw down some notation. I’ll call \textup{proj}_v: \mathbb{R}^n \to \mathbb{R}^n the linear map that takes as input a vector x and produces as output the projection of x onto v. In fact we have a brief formula for this when v is a unit vector. If we call x \cdot v the usual dot product, then \textup{proj}_v(x) = (x \cdot v)v. That’s v scaled by the inner product of x and v. In the picture above, since the line L is the span of the vector v, that means that y = \textup{proj}_v(x) and z = x -\textup{proj}_v(x) = x-y.

The dot-product formula is useful for us because it allows us to compute the squared length of the projection by taking a dot product |x \cdot v|^2. So then a formula for the distance of x from the line spanned by the unit vector v is

\displaystyle (\textup{dist}_v(x))^2 = \left ( \sum_{i=1}^n x_i^2 \right ) - |x \cdot v|^2

This formula is just a restatement of the Pythagorean theorem for perpendicular vectors.

\displaystyle \sum_{i} x_i^2 = (\textup{proj}_v(x))^2 + (\textup{dist}_v(x))^2

In particular, the difference vector we originally called z has squared length \textup{dist}_v(x)^2. The vector y, which is perpendicular to z and is also the projection of x onto L, has squared length (\textup{proj}_v(x))^2. And the Pythagorean theorem tells us that summing those two squared lengths gives you the squared length of the hypotenuse x.
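Here is a quick numerical check of that decomposition, a throwaway sketch in numpy with a made-up vector and line:

import numpy as np

x = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])            # a unit vector spanning the line L

y = np.dot(x, v) * v                # proj_v(x) = (x . v) v
z = x - y                           # the difference vector

print(np.dot(x, x))                 # 25.0, the squared length of x
print(np.dot(y, y) + np.dot(z, z))  # 9.0 + 16.0 = 25.0, as promised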

If we were trying to find the best approximating 1-dimensional subspace for a set of data points X, then we’d want to minimize the sum of the squared distances for every point x \in X. Namely, we want the v that solves \min_{|v|=1} \sum_{x \in X} (\textup{dist}_v(x))^2.

With some slight algebra we can make our life easier. The short version: minimizing the sum of squared distances is the same thing as maximizing the sum of squared lengths of the projections. The longer version: let’s go back to a single point x and the line spanned by v. The Pythagorean theorem told us that

\displaystyle \sum_{i} x_i^2 = (\textup{proj}_v(x))^2 + (\textup{dist}_v(x))^2

The squared length of x is constant. It’s an input to the algorithm and it doesn’t change through a run of the algorithm. So we get the squared distance by subtracting (\textup{proj}_v(x))^2 from a constant number,

\displaystyle \sum_{i} x_i^2 - (\textup{proj}_v(x))^2 = (\textup{dist}_v(x))^2

which means if we want to minimize the squared distance, we can instead maximize the squared projection. Maximizing the subtracted thing minimizes the whole expression.

It works the same way if you’re summing over all the data points in X. In fact, we can say it much more compactly this way. If the rows of A are your data points, then Av contains as each entry the (signed) dot products x_i \cdot v. And the squared norm of this vector, |Av|^2, is exactly the sum of the squared lengths of the projections of the data onto the line spanned by v. The last thing is that maximizing a square is the same as maximizing its square root, so we can switch freely between saying our objective is to find the unit vector v that maximizes |Av| and that which maximizes |Av|^2.

At this point you should be thinking,

Great, we have written down an optimization problem: \max_{v : |v|=1} |Av|. If we could solve this, we’d have the best 1-dimensional linear approximation to the data contained in the rows of A. But (1) how do we solve that problem? And (2) you promised a k-dimensional approximating subspace. I feel betrayed! Swindled! Bamboozled!

Here’s the fantastic thing. We can solve the 1-dimensional optimization problem efficiently (we’ll do it later in this post), and (2) is answered by the following theorem.

The SVD Theorem: Computing the best k-dimensional subspace reduces to k applications of the one-dimensional problem.

We will prove this after we introduce the terms “singular value” and “singular vector.”

Singular values and vectors

As I just said, we can get the best k-dimensional approximating linear subspace by solving the one-dimensional maximization problem k times. The singular vectors of A are defined recursively as the solutions to these sub-problems. That is, I’ll call v_1 the first singular vector of A, and it is:

\displaystyle v_1 = \arg \max_{v, |v|=1} |Av|

And the corresponding first singular value, denoted \sigma_1(A), is the maximal value of the optimization objective, i.e. |Av_1|. (I will use this term frequently, that |Av| is the “objective” of the optimization problem.) Informally speaking, (\sigma_1(A))^2 represents how much of the data was captured by the first singular vector. Meaning, how close the vectors are to lying on the line spanned by v_1. Larger values imply the approximation is better. In fact, if all the data points lie on a line, then (\sigma_1(A))^2 is the sum of the squared norms of the rows of A.

Now here is where we see the reduction from the k-dimensional case to the 1-dimensional case. To find the best 2-dimensional subspace, you first find the best one-dimensional subspace (spanned by v_1), and then find the best 1-dimensional subspace, but only considering those subspaces that are the spans of unit vectors perpendicular to v_1. The notation for “vectors v perpendicular to v_1” is v \perp v_1. Restating, the second singular vector v_2 is defined as

\displaystyle v_2 = \arg \max_{v \perp v_1, |v| = 1} |Av|

And the SVD theorem implies the subspace spanned by \{ v_1, v_2 \} is the best 2-dimensional linear approximation to the data. Likewise \sigma_2(A) = |Av_2| is the second singular value. Its squared magnitude tells us how much of the data that was not “captured” by v_1 is captured by v_2. Again, if the data lies in a 2-dimensional subspace, then the span of \{ v_1, v_2 \} will be that subspace.

We can continue this process. Recursively define v_k, the k-th singular vector, to be the vector which maximizes |Av|, when v is considered only among the unit vectors which are perpendicular to \textup{span} \{ v_1, \dots, v_{k-1} \}. The corresponding singular value \sigma_k(A) is the value of the optimization problem.

As a side note, because of the way we defined the singular values as the objective values of “nested” optimization problems, the singular values are decreasing, \sigma_1(A) \geq \sigma_2(A) \geq \dots \geq \sigma_n(A) \geq 0. This is obvious: you only pick v_2 in the second optimization problem because you already picked v_1 which gave a bigger singular value, so v_2‘s objective can’t be bigger.

If you keep doing this, one of two things happens. Either you reach v_n and, since the domain is n-dimensional, there are no remaining vectors to choose from, so the v_i form an orthonormal basis of \mathbb{R}^n. This means that the data in A has full rank: it does not lie in any smaller-dimensional subspace. This is what you’d expect from real data.

Alternatively, you could get to a stage v_k with k < n and when you try to solve the optimization problem you find that every perpendicular v has Av = 0. In this case, the data actually does lie in a k-dimensional subspace, and the first-through-k-th singular vectors you computed span this subspace.

Let’s do a quick sanity check: how do we know that the singular vectors v_i form a basis? Well, formally they only form a basis of the row space of A, i.e. a basis of the subspace spanned by the data contained in the rows of A. But either way the point is that each v_{i+1} adds a new dimension to the span of the previous v_1, \dots, v_i, because we’re choosing v_{i+1} to be orthogonal to all the previous v_i. So the answer to our sanity check is “by construction.”

Back to the singular vectors, the discussion from the last post tells us intuitively that the data is probably never in a small subspace.  You never expect the process of finding singular vectors to stop before step n, and if it does you take a step back and ask if something deeper is going on. Instead, in real life you specify how much of the data you want to capture, and you keep computing singular vectors until you’ve passed the threshold. Alternatively, you specify the amount of computing resources you’d like to spend by fixing the number of singular vectors you’ll compute ahead of time, and settle for however good the k-dimensional approximation is.

Before we get into any code or solve the 1-dimensional optimization problem, let’s prove the SVD theorem.

Proof of SVD theorem.

Recall we’re trying to prove that the first k singular vectors provide a linear subspace W which maximizes the squared-sum of the projections of the data onto W. For k=1 this is trivial, because we defined v_1 to be the solution to that optimization problem. The case of k=2 contains all the important features of the general inductive step. Let W be any best-approximating 2-dimensional linear subspace for the rows of A. We’ll show that the subspace spanned by the two singular vectors v_1, v_2 is at least as good (and hence equally good).

Let w_1, w_2 be any orthonormal basis for W and let |Aw_1|^2 + |Aw_2|^2 be the quantity that we’re trying to maximize (and which W maximizes by assumption). Moreover, we can pick the basis vector w_2 to be perpendicular to v_1. To prove this we consider two cases: either v_1 is already perpendicular to W, in which case any orthonormal basis works, or else v_1 isn’t perpendicular to W and you can choose w_1 to be the unit vector in the direction of \textup{proj}_W(v_1) and choose w_2 to be the unit vector in W perpendicular to w_1.

Now since v_1 maximizes |Av|, we have |Av_1|^2 \geq |Aw_1|^2. Moreover, since w_2 is perpendicular to v_1, the way we chose v_2 also makes |Av_2|^2 \geq |Aw_2|^2. Hence the objective |Av_1|^2 + |Av_2|^2 \geq |Aw_1|^2 + |Aw_2|^2, as desired.

For the general case of k, the inductive hypothesis tells us that the first k terms of the objective for k+1 singular vectors is maximized, and we just have to pick any vector w_{k+1} that is perpendicular to all v_1, v_2, \dots, v_k, and the rest of the proof is just like the 2-dimensional case.

\square

Now remember that in the last post we started with the definition of the SVD as a decomposition of a matrix A = U\Sigma V^T? And then we said that this is a certain kind of change of basis? Well the singular vectors v_i together form the columns of the matrix V (the rows of V^T), and the corresponding singular values \sigma_i(A) are the diagonal entries of \Sigma. When A is understood we’ll abbreviate the singular value as \sigma_i.

To reiterate with the thoughts from last post, the process of applying A is exactly recovered by the process of first projecting onto the (full-rank space of) singular vectors v_1, \dots, v_k, scaling each coordinate of that projection according to the corresponding singular values, and then applying this U thing we haven’t talked about yet.

So let’s determine what U has to be. The way we picked v_i to make A diagonal gives us an immediate suggestion: use the Av_i as the columns of U. Indeed, define u_i = Av_i, the images of the singular vectors under A. We can swiftly show the u_i form a basis of the image of A. The reason is because if v = \sum_i c_i v_i (using all n of the singular vectors v_i), then by linearity Av = \sum_{i} c_i Av_i = \sum_i c_i u_i. It is also easy to see why the u_i are orthogonal (prove it as an exercise). Let’s further make sure the u_i are unit vectors and redefine them as u_i = \frac{1}{\sigma_i}Av_i.

If you put these thoughts together, you can say exactly what A does to any given vector x. Since the v_i form an orthonormal basis, x = \sum_i (x \cdot v_i) v_i, and then applying A gives

\displaystyle \begin{aligned}Ax &= A \left ( \sum_i (x \cdot v_i) v_i \right ) \\  &= \sum_i (x \cdot v_i) A v_i \\ &= \sum_i (x \cdot v_i) \sigma_i u_i \end{aligned}

If you’ve been closely reading this blog in the last few months, you’ll recognize a very nice way to write the last line of the above equation. It’s an outer product. So depending on your favorite symbols, you’d write this as either A = \sum_{i} \sigma_i u_i \otimes v_i or A = \sum_i \sigma_i u_i v_i^T. Or, if you like expressing things as matrix factorizations, as A = U\Sigma V^T. All three are describing the same object.
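As a quick sanity check of that identity, here is a sketch using numpy's built-in svd on a small made-up matrix, showing that the rank-1 outer products sum back to A:

import numpy as np

A = np.array([[2.0, 5, 3], [1, 2, 1], [4, 1, 1]])
U, sigma, Vt = np.linalg.svd(A)

# A = sum_i sigma_i u_i v_i^T, built up one rank-1 piece at a time
reconstructed = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(sigma)))
print(np.round(A - reconstructed, decimals=10))   # all zeros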

Let’s move on to some code.

A black box example

Before we implement SVD from scratch (an urge that commands me from the depths of my soul!), let’s see a black-box example that uses existing tools. For this we’ll use the numpy library.

Recall our movie-rating matrix from the last post:


The code to compute the svd of this matrix is as simple as it gets:

import numpy as np
from numpy.linalg import svd

movieRatings = [
    [2, 5, 3],
    [1, 2, 1],
    [4, 1, 1],
    [3, 5, 2],
    [5, 3, 1],
    [4, 5, 5],
    [2, 4, 2],
    [2, 2, 5],
]

U, singularValues, V = svd(movieRatings)

Printing these values out gives

[[-0.39458526  0.23923575 -0.35445911 -0.38062172 -0.29836818 -0.49464816 -0.30703202 -0.29763321]
 [-0.15830232  0.03054913 -0.15299759 -0.45334816  0.31122898  0.23892035 -0.37313346  0.67223457]
 [-0.22155201 -0.52086121  0.39334917 -0.14974792 -0.65963979  0.00488292 -0.00783684  0.25934607]
 [-0.39692635 -0.08649009 -0.41052882  0.74387448 -0.10629499  0.01372565 -0.17959298  0.26333462]
 [-0.34630257 -0.64128825  0.07382859 -0.04494155  0.58000668 -0.25806239  0.00211823 -0.24154726]
 [-0.53347449  0.19168874  0.19949342 -0.03942604  0.00424495  0.68715732 -0.06957561 -0.40033035]
 [-0.31660464  0.06109826 -0.30599517 -0.19611823 -0.01334272  0.01446975  0.85185852  0.19463493]
 [-0.32840223  0.45970413  0.62354764  0.1783041   0.17631186 -0.39879476  0.06065902  0.25771578]]
[ 15.09626916   4.30056855   3.40701739]
[[-0.54184808 -0.67070995 -0.50650649]
 [-0.75152295  0.11680911  0.64928336]
 [ 0.37631623 -0.73246419  0.56734672]]

Now this is a bit weird, because the matrices U, V are the wrong shape! Remember, there are only supposed to be three vectors since the input matrix has rank three. So what gives? This is a distinction that goes by the name “full” versus “reduced” SVD. The idea goes back to our original statement that U \Sigma V^T is a decomposition with U, V^T both orthogonal and square matrices. But in the derivation we did in the last section, the U and V were not square. The singular vectors v_i could potentially stop before even becoming full rank.

In order to get to square matrices, what people sometimes do is take the two bases v_1, \dots, v_k and u_1, \dots, u_k and arbitrarily choose ways to complete them to a full orthonormal basis of their respective vector spaces. In other words, they just make the matrix square by filling it with data for no reason other than that it’s sometimes nice to have a complete basis. We don’t care about this. To be honest, I think the only place this comes in useful is in the desire to be particularly tidy in a mathematical formulation of something.

We can still work with it programmatically. By fudging around a bit with numpy’s shapes to get a diagonal matrix, we can reconstruct the input rating matrix from the factors.

Sigma = np.vstack([
    np.diag(singularValues),
    np.zeros((5, 3)),
])

print(np.round(movieRatings - np.dot(U, np.dot(Sigma, V)), decimals=10))

And the output is, as one expects, a matrix of all zeros. Meaning that we decomposed the movie rating matrix, and built it back up from the factors.

We can actually get the SVD as we defined it (with rectangular matrices) by passing a special flag to numpy’s svd.

U, singularValues, V = svd(movieRatings, full_matrices=False)
print(U)
print(singularValues)
print(V)

Sigma = np.diag(singularValues)
print(np.round(movieRatings - np.dot(U, np.dot(Sigma, V)), decimals=10))

And the result

[[-0.39458526  0.23923575 -0.35445911]
 [-0.15830232  0.03054913 -0.15299759]
 [-0.22155201 -0.52086121  0.39334917]
 [-0.39692635 -0.08649009 -0.41052882]
 [-0.34630257 -0.64128825  0.07382859]
 [-0.53347449  0.19168874  0.19949342]
 [-0.31660464  0.06109826 -0.30599517]
 [-0.32840223  0.45970413  0.62354764]]
[ 15.09626916   4.30056855   3.40701739]
[[-0.54184808 -0.67070995 -0.50650649]
 [-0.75152295  0.11680911  0.64928336]
 [ 0.37631623 -0.73246419  0.56734672]]
[[-0. -0. -0.]
 [-0. -0.  0.]
 [ 0. -0.  0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [ 0. -0. -0.]]

This makes the reconstruction less messy, since we can just multiply everything without having to add extra rows of zeros to \Sigma.

What do the singular vectors and values tell us about the movie rating matrix? (Besides nothing, since it’s a contrived example) You’ll notice that the first singular value \sigma_1 > 15 while the other two singular values are around 4. This tells us that the first singular vector covers a large part of the structure of the matrix. I.e., a rank-1 matrix would be a pretty good approximation to the whole thing. As an exercise to the reader, write a program that evaluates this claim (how good is “good”?).
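Here is one way you might start on that exercise: a sketch that reuses the movieRatings list from above and measures the relative error of the rank-1 approximation in the Frobenius norm (it comes out to roughly 0.34 for this matrix).

import numpy as np

ratings = np.array(movieRatings, dtype='float64')
U, singularValues, V = np.linalg.svd(ratings, full_matrices=False)

# keep only the first singular value and its pair of singular vectors
rankOne = singularValues[0] * np.outer(U[:, 0], V[0, :])

# relative error of the rank-1 approximation in the Frobenius norm
print(np.linalg.norm(ratings - rankOne) / np.linalg.norm(ratings))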

The greedy optimization routine

Now we’re going to write SVD from scratch. We’ll first implement the greedy algorithm for the 1-d optimization problem, and then we’ll perform the inductive step to get a full algorithm. Then we’ll run it on the CNN data set.

The method we’ll use to solve the 1-dimensional problem isn’t necessarily industry strength (see this document for a hint of what industry strength looks like), but it is simple conceptually. It’s called the power method. Now that we have our decomposition theorem, understanding how the power method works is quite easy.

Let’s work in the language of a matrix decomposition A = U \Sigma V^T, more for practice with that language than anything else (using outer products would give us the same result with slightly different computations). Then let’s observe A^T A, wherein we’ll use the fact that U is orthonormal and so U^TU is the identity matrix:

\displaystyle A^TA = (U \Sigma V^T)^T(U \Sigma V^T) = V \Sigma U^TU \Sigma V^T = V \Sigma^2 V^T

So we can completely eliminate U from the discussion, and look at just V \Sigma^2 V^T. And what’s nice about this matrix is that we can compute its eigenvectors, and eigenvectors turn out to be exactly the singular vectors. The corresponding eigenvalues are the squared singular values. This should be clear from the above derivation. If you apply (V \Sigma^2 V^T) to any v_i, the only parts of the product that aren’t zero are the ones involving v_i with itself, and the scalar \sigma_i^2 factors in smoothly. It’s dead simple to check.
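You can also check this numerically with a few lines of numpy (a throwaway sketch, not part of the algorithm): the eigenvalues of A^T A, square-rooted and sorted, match the singular values of A.

import numpy as np

A = np.array([[2.0, 5, 3], [1, 2, 1], [4, 1, 1]])
eigenvalues, eigenvectors = np.linalg.eigh(A.T.dot(A))   # A^T A is symmetric

print(np.sqrt(eigenvalues[::-1]))   # eigh sorts ascending, so reverse...
print(np.linalg.svd(A)[1])          # ...and compare to the singular values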

Theorem: Let x be a random unit vector and let B = A^TA = V \Sigma^2 V^T. Then with high probability, \lim_{s \to \infty} B^s x is in the span of the first singular vector v_1. If we normalize B^s x to a unit vector at each s, then furthermore the limit is v_1.

Proof. Start with a random unit vector x, and write it in terms of the singular vectors x = \sum_i c_i v_i. That means Bx = \sum_i c_i \sigma_i^2 v_i. If you recursively apply this logic, you get B^s x = \sum_i c_i \sigma_i^{2s} v_i. In particular, the dot product of (B^s x) with any v_j is c_j \sigma_j^{2s}.

What this means is that so long as the first singular value \sigma_1 is sufficiently larger than the second one \sigma_2, and in turn all the other singular values, the part of B^s x  corresponding to v_1 will be much larger than the rest. Recall that if you expand a vector in terms of an orthonormal basis, in this case B^s x expanded in the v_i, the coefficient of B^s x on v_j is exactly the dot product. So to say that B^sx converges to being in the span of v_1 is the same as saying that the ratio of these coefficients, |(B^s x \cdot v_1)| / |(B^s x \cdot v_j)| \to \infty for any j. In other words, the coefficient corresponding to the first singular vector dominates all of the others. And so if we normalize, the coefficient of B^s x corresponding to v_1 tends to 1, while the rest tend to zero.

Indeed, this ratio is just (\sigma_1 / \sigma_j)^{2s} and the base of this exponential is bigger than 1.

\square

If you want to be a little more precise and find bounds on the number of iterations required to converge, you can. The worry is that your random starting vector is “too close” to one of the smaller singular vectors v_j, so that if the ratio of \sigma_1 / \sigma_j is small, then the “pull” of v_1 won’t outweigh the pull of v_j fast enough. Choosing a random unit vector allows you to ensure with high probability that this doesn’t happen. And conditioned on it not happening (or measuring “how far the event is from happening” precisely), you can compute a precise number of iterations required to converge. The last two pages of these lecture notes have all the details.

We won’t compute a precise number of iterations. Instead we’ll just compute until the angle between B^{s+1}x and B^s x is very small. Here’s the algorithm

import numpy as np
from numpy.linalg import norm

from random import normalvariate
from math import sqrt

def randomUnitVector(n):
    unnormalized = [normalvariate(0, 1) for _ in range(n)]
    theNorm = sqrt(sum(x * x for x in unnormalized))
    return [x / theNorm for x in unnormalized]

def svd_1d(A, epsilon=1e-10):
    ''' The one-dimensional SVD '''

    n, m = A.shape
    x = randomUnitVector(m)
    lastV = None
    currentV = x
    B = np.dot(A.T, A)

    iterations = 0
    while True:
        iterations += 1
        lastV = currentV
        currentV = np.dot(B, lastV)
        currentV = currentV / norm(currentV)

        if abs(np.dot(currentV, lastV)) > 1 - epsilon:
            print("converged in {} iterations!".format(iterations))
            return currentV

We start with a random unit vector x, and then loop computing x_{t+1} = Bx_t, renormalizing at each step. The condition for stopping is that the magnitude of the dot product between x_t and x_{t+1} (since they’re unit vectors, this is the cosine of the angle between them) is very close to 1.

And using it on our movie ratings example:

if __name__ == "__main__":
    movieRatings = np.array([
        [2, 5, 3],
        [1, 2, 1],
        [4, 1, 1],
        [3, 5, 2],
        [5, 3, 1],
        [4, 5, 5],
        [2, 4, 2],
        [2, 2, 5],
    ], dtype='float64')

    print(svd_1d(movieRatings))

With the result

converged in 6 iterations!
[-0.54184805 -0.67070993 -0.50650655]

Note that the sign of the vector may be different from numpy’s output because we start with a random vector to begin with.

The recursive step, getting from v_1 to the entire SVD, is equally straightforward. Say you start with the matrix A and you compute v_1. You can use v_1 to compute u_1 and \sigma_1(A). Then you want to ensure you’re ignoring all vectors in the span of v_1 for your next greedy optimization, and to do this you can simply subtract the rank 1 component of A corresponding to v_1. I.e., set A' = A - \sigma_1(A) u_1 v_1^T. Then it’s easy to see that \sigma_1(A') = \sigma_2(A) and basically all the singular vectors shift indices by 1 when going from A to A'. Then you repeat.

If that’s not clear enough, here’s the code.

def svd(A, epsilon=1e-10):
    n, m = A.shape
    svdSoFar = []

    for i in range(m):
        matrixFor1D = A.copy()

        for singularValue, u, v in svdSoFar[:i]:
            matrixFor1D -= singularValue * np.outer(u, v)

        v = svd_1d(matrixFor1D, epsilon=epsilon)  # next singular vector
        u_unnormalized = np.dot(A, v)
        sigma = norm(u_unnormalized)  # next singular value
        u = u_unnormalized / sigma

        svdSoFar.append((sigma, u, v))

    # transform it into matrices of the right shape
    singularValues, us, vs = [np.array(x) for x in zip(*svdSoFar)]

    return singularValues, us.T, vs

And we can run this on our movie rating matrix to get the following

>>> theSVD = svd(movieRatings)
>>> theSVD[0]
array([ 15.09626916,   4.30056855,   3.40701739])
>>> theSVD[1]
array([[ 0.39458528, -0.23923093,  0.35446407],
       [ 0.15830233, -0.03054705,  0.15299815],
       [ 0.221552  ,  0.52085578, -0.39336072],
       [ 0.39692636,  0.08649568,  0.41052666],
       [ 0.34630257,  0.64128719, -0.07384286],
       [ 0.53347448, -0.19169154, -0.19948959],
       [ 0.31660465, -0.0610941 ,  0.30599629],
       [ 0.32840221, -0.45971273, -0.62353781]])
>>> theSVD[2]
array([[ 0.54184805,  0.67071006,  0.50650638],
       [ 0.75151641, -0.11679644, -0.64929321],
       [-0.37632934,  0.73246611, -0.56733554]])

Checking this against our numpy output shows it’s within a reasonable level of precision (considering the power method took on the order of ten iterations!)

>>> np.round(np.abs(npSVD[0]) - np.abs(theSVD[1]), decimals=5)
array([[ -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [  0.00000000e+00,  -1.00000000e-05,   1.00000000e-05],
       [  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   1.00000000e-05],
       [ -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [  0.00000000e+00,  -0.00000000e+00,   0.00000000e+00],
       [ -0.00000000e+00,   1.00000000e-05,  -1.00000000e-05]])
>>> np.round(np.abs(npSVD[2]) - np.abs(theSVD[2]), decimals=5)
array([[  0.00000000e+00,   0.00000000e+00,  -0.00000000e+00],
       [ -1.00000000e-05,  -1.00000000e-05,   1.00000000e-05],
       [  1.00000000e-05,   0.00000000e+00,  -1.00000000e-05]])
>>> np.round(np.abs(npSVD[1]) - np.abs(theSVD[0]), decimals=5)
array([ 0.,  0., -0.])

So there we have it. We added an extra little bit to the svd function, an argument k which stops computing the svd after it reaches rank k.

CNN stories

One interesting use of the SVD is in topic modeling. Topic modeling is the process of taking a bunch of documents (news stories, or emails, or movie scripts, whatever) and grouping them by topic, where the algorithm gets to choose what counts as a “topic.” Topic modeling is just the name that natural language processing folks use instead of clustering.

The SVD can help one model topics as follows. First you construct a matrix A called a document-term matrix whose rows correspond to words in some fixed dictionary and whose columns correspond to documents. The (i,j) entry of A contains the number of times word i shows up in document j. Or, more precisely, some quantity derived from that count, like a normalized count. See this table on wikipedia for a list of options related to that. We’ll just pick one arbitrarily for use in this post.

The point isn’t how we normalize the data, but what the SVD of A = U \Sigma V^T means in this context. Recall that the domain of A, as a linear map, is a vector space whose dimension is the number of stories. We think of the vectors in this space as documents, or rather as an “embedding” of the abstract concept of a document using the counts of how often each word shows up in a document as a proxy for the semantic meaning of the document. Likewise, the codomain is the space of all words, and each word is embedded by which documents it occurs in. If we compare this to the movie rating example, it’s the same thing: a movie is the vector of ratings it receives from people, and a person is the vector of ratings of various movies.

Say you take a rank 3 approximation to A. Then you get three singular vectors v_1, v_2, v_3 which form a basis for a subspace of words, i.e., the “idealized” words. These idealized words are your topics, and you can compute where a “new word” falls by looking at which documents it appears in (writing it as a vector in the domain) and saying its “topic” is the closest of the v_1, v_2, v_3. The same process applies to new documents. You can use this to cluster existing documents as well.
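As a sketch of that last step, classifying a new document might look like the snippet below. The names U and documentCenters refer to the SVD factor and the cluster centers computed in the code later in this section.

import numpy as np

def nearestTopic(newDocumentVector, U, documentCenters):
    # newDocumentVector: normalized word counts for the new document,
    # indexed the same way as the rows of the document-term matrix
    projected = np.dot(newDocumentVector, U)   # coordinates in the "idealized document" space
    distances = [np.linalg.norm(projected - center) for center in documentCenters]
    return int(np.argmin(distances))           # index of the closest cluster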

The dataset we’ll use for this post is a relatively small corpus of a thousand CNN stories picked from 2012. Here’s an excerpt from one of them

$ cat data/cnn-stories/story479.txt
3 things to watch on Super Tuesday
Here are three things to watch for: Romney's big day. He's been the off-and-on frontrunner throughout the race, but a big Super Tuesday could begin an end game toward a sometimes hesitant base coalescing behind former Massachusetts Gov. Mitt Romney. Romney should win his home state of Massachusetts, neighboring Vermont and Virginia, ...

So let’s first build this document-term matrix, with the normalized values, and then we’ll compute its SVD and see what the topics look like.

Step 1 is cleaning the data. We used a bunch of routines from the nltk library that boil down to this loop:

    for filename, documentText in documentDict.items():
        tokens = tokenize(documentText)
        tagged_tokens = pos_tag(tokens)
        wnl = WordNetLemmatizer()
        stemmedTokens = [wnl.lemmatize(word, wordnetPos(tag)).lower()
                         for word, tag in tagged_tokens]

This turns the Super Tuesday story into a list of words (with repetition):

["thing", "watch", "three", "thing", "watch", "big", ... ]

You’ll notice that the name Romney doesn’t show up in the list of words. I’m only keeping the words that show up in the top 100,000 most common English words, and then lemmatizing all of the words to their roots. It’s not a perfect data cleaning job, but it’s simple and good enough for our purposes.

Now we can create the document term matrix.

from collections import Counter
import numpy as np

def makeDocumentTermMatrix(data):
    words = allWords(data)  # get the set of all unique words

    wordToIndex = dict((word, i) for i, word in enumerate(words))
    indexToWord = dict(enumerate(words))
    indexToDocument = dict(enumerate(data))

    matrix = np.zeros((len(words), len(data)))
    for docID, document in enumerate(data):
        docWords = Counter(document['words'])
        for word, count in docWords.items():
            matrix[wordToIndex[word], docID] = count

    return matrix, (indexToWord, indexToDocument)

This creates a matrix with the raw integer counts. But what we need is a normalized count. The idea is that a common word like “thing” shows up disproportionately more often than “election,” and we don’t want raw magnitude of a word count to outweigh its semantic contribution to the classification. This is the applied math part of the algorithm design. So what we’ll do (and this technique together with SVD is called latent semantic indexing) is normalize each entry so that it measures both the frequency of a term in a document and the relative frequency of a term compared to the global frequency of that term. There are many ways to do this, and we’ll just pick one. See the github repository if you’re interested.
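
Here's a sketch of one possible normalization, a simple tf-idf style weighting; the repository may use a different variant, and the function name just matches the call below.

import math
import numpy as np

def normalize(matrix):
    # One possible choice: scale each raw count (term frequency) by the log of
    # the inverse document frequency of that word, so globally common words get
    # dampened relative to rare, topical ones.
    numWords, numDocuments = matrix.shape
    documentFrequency = np.count_nonzero(matrix, axis=1)

    normalized = np.zeros(matrix.shape)
    for i in range(numWords):
        if documentFrequency[i] > 0:
            idf = math.log(numDocuments / documentFrequency[i])
            normalized[i, :] = matrix[i, :] * idf

    return normalized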

So now let's compute a rank-10 decomposition and see how to cluster the results.

    data = load()
    matrix, (indexToWord, indexToDocument) = makeDocumentTermMatrix(data)
    matrix = normalize(matrix)
    sigma, U, V = svd(matrix, k=10)

This uses our svd, not numpy’s. Though numpy’s routine is much faster, it’s fun to see things work with code written from scratch. The result is too large to display here, but I can report the singular values.

>>> sigma
array([ 42.85249098,  21.85641975,  19.15989197,  16.2403354 ,
        15.40456779,  14.3172779 ,  13.47860033,  13.23795002,
        12.98866537,  12.51307445])

Now we take our original inputs and project them onto the subspace spanned by the singular vectors. This is the part that represents each word (resp., document) in terms of the idealized words (resp., documents), the singular vectors. Then we can apply a simple k-means clustering algorithm to the result, and observe the resulting clusters as documents.

    projectedDocuments = np.dot(matrix.T, U)
    projectedWords = np.dot(matrix, V.T)

    documentCenters, documentClustering = cluster(projectedDocuments)
    wordCenters, wordClustering = cluster(projectedWords)

    wordClusters = [
        [indexToWord[i] for (i, x) in enumerate(wordClustering) if x == j]
        for j in range(len(set(wordClustering)))
    ]

    documentClusters = [
        [indexToDocument[i]['text']
         for (i, x) in enumerate(documentClustering) if x == j]
        for j in range(len(set(documentClustering)))
    ]
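
The cluster helper isn't shown here. A minimal sketch of what it might look like, using scikit-learn's k-means (the repository's implementation may differ), is:

from sklearn.cluster import KMeans

def cluster(points, k=10):
    # Run k-means and return the cluster centers along with the index of the
    # cluster assigned to each input point.
    model = KMeans(n_clusters=k)
    clustering = model.fit_predict(points)
    return model.cluster_centers_, clustering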

And now we can inspect individual clusters. Right off the bat we can tell the clusters aren’t quite right simply by looking at the sizes of each cluster.

>>> Counter(wordClustering)
Counter({1: 9689, 2: 1051, 8: 680, 5: 557, 3: 321, 7: 225, 4: 174, 6: 124, 9: 123})
>>> Counter(documentClustering)
Counter({7: 407, 6: 109, 0: 102, 5: 87, 9: 85, 2: 65, 8: 55, 4: 47, 3: 23, 1: 15})

What looks wrong to me is the size of the largest word cluster. If we could group words by topic, then this is saying there's a topic with over nine thousand words associated with it! Inspecting it more closely, it includes words like "vegan," "skunk," and "pope." On the other hand, some word clusters are spot on. Examine, for example, the fifth cluster, which includes words very clearly associated with crime stories.

>>> wordClusters[4]
['account', 'accuse', 'act', 'affiliate', 'allegation', 'allege', 'altercation', 'anything', 'apartment', 'arrest', 'arrive', 'assault', 'attorney', 'authority', 'bag', 'black', 'blood', 'boy', 'brother', 'bullet', 'candy', 'car', 'carry', 'case', 'charge', 'chief', 'child', 'claim', 'client', 'commit', 'community', 'contact', 'convenience', 'court', 'crime', 'criminal', 'cry', 'dead', 'deadly', 'death', 'defense', 'department', 'describe', 'detail', 'determine', 'dispatcher', 'district', 'document', 'enforcement', 'evidence', 'extremely', 'family', 'father', 'fear', 'fiancee', 'file', 'five', 'foot', 'friend', 'front', 'gate', 'girl', 'girlfriend', 'grand', 'ground', 'guilty', 'gun', 'gunman', 'gunshot', 'hand', 'happen', 'harm', 'head', 'hear', 'heard', 'hoodie', 'hour', 'house', 'identify', 'immediately', 'incident', 'information', 'injury', 'investigate', 'investigation', 'investigator', 'involve', 'judge', 'jury', 'justice', 'kid', 'killing', 'lawyer', 'legal', 'letter', 'life', 'local', 'man', 'men', 'mile', 'morning', 'mother', 'murder', 'near', 'nearby', 'neighbor', 'newspaper', 'night', 'nothing', 'office', 'officer', 'online', 'outside', 'parent', 'person', 'phone', 'police', 'post', 'prison', 'profile', 'prosecute', 'prosecution', 'prosecutor', 'pull', 'racial', 'racist', 'release', 'responsible', 'return', 'review', 'role', 'saw', 'scene', 'school', 'scream', 'search', 'sentence', 'serve', 'several', 'shoot', 'shooter', 'shooting', 'shot', 'slur', 'someone', 'son', 'sound', 'spark', 'speak', 'staff', 'stand', 'store', 'story', 'student', 'surveillance', 'suspect', 'suspicious', 'tape', 'teacher', 'teen', 'teenager', 'told', 'tragedy', 'trial', 'vehicle', 'victim', 'video', 'walk', 'watch', 'wear', 'whether', 'white', 'witness', 'young']

As sad as it makes me to see that ‘black’ and ‘slur’ and ‘racial’ appear in this category, it’s a reminder that naively using the output of a machine learning algorithm can perpetuate racism.

Here’s another interesting cluster corresponding to economic words:

>>> wordClusters[6]
['agreement', 'aide', 'analyst', 'approval', 'approve', 'austerity', 'average', 'bailout', 'beneficiary', 'benefit', 'bill', 'billion', 'break', 'broadband', 'budget', 'class', 'combine', 'committee', 'compromise', 'conference', 'congressional', 'contribution', 'core', 'cost', 'currently', 'cut', 'deal', 'debt', 'defender', 'deficit', 'doc', 'drop', 'economic', 'economy', 'employee', 'employer', 'erode', 'eurozone', 'expire', 'extend', 'extension', 'fee', 'finance', 'fiscal', 'fix', 'fully', 'fund', 'funding', 'game', 'generally', 'gleefully', 'growth', 'hamper', 'highlight', 'hike', 'hire', 'holiday', 'increase', 'indifferent', 'insistence', 'insurance', 'job', 'juncture', 'latter', 'legislation', 'loser', 'low', 'lower', 'majority', 'maximum', 'measure', 'middle', 'negotiation', 'offset', 'oppose', 'package', 'pass', 'patient', 'pay', 'payment', 'payroll', 'pension', 'plight', 'portray', 'priority', 'proposal', 'provision', 'rate', 'recession', 'recovery', 'reduce', 'reduction', 'reluctance', 'repercussion', 'rest', 'revenue', 'rich', 'roughly', 'sale', 'saving', 'scientist', 'separate', 'sharp', 'showdown', 'sign', 'specialist', 'spectrum', 'spending', 'strength', 'tax', 'tea', 'tentative', 'term', 'test', 'top', 'trillion', 'turnaround', 'unemployed', 'unemployment', 'union', 'wage', 'welfare', 'worker', 'worth']

One can also inspect the stories, though the clusters are harder to print out here. Interestingly, the first cluster of documents consists exclusively of stories about Trayvon Martin. The second cluster is mostly international military conflicts. The third cluster also appears to be about international conflict, but what distinguishes it from the second is that every story in the third cluster discusses Syria.

>>> len([x for x in documentClusters[1] if 'Syria' in x]) / len(documentClusters[1])
0.05555555555555555
>>> len([x for x in documentClusters[2] if 'Syria' in x]) / len(documentClusters[2])
1.0

Anyway, you can explore the data more at your leisure (and tinker with the parameters to improve it!).

Issues with the power method

Though I mentioned that the power method isn't an industry-strength algorithm, I didn't say why. Let's revisit that before we finish. The problem is that the convergence rate of even the 1-dimensional problem depends on the ratio of the first and second singular values, \sigma_1 / \sigma_2. If that ratio is very close to 1, then convergence takes a long time and needs many, many matrix-vector multiplications.

One way to alleviate that is to do the trick where, to compute a large power of a matrix, you iteratively square B. But that requires computing a matrix square (instead of a bunch of matrix-vector products), and that requires a lot of time and memory if the matrix isn’t sparse. When the matrix is sparse, you can actually do the power method quite quickly, from what I’ve heard and read.
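
For concreteness, here's a sketch of that squaring trick for a generic square matrix; it computes a large power with logarithmically many matrix products, each of which is the expensive step for a dense matrix.

import numpy as np

def matrixPower(B, exponent):
    # Repeated squaring: compute B^exponent using O(log(exponent)) matrix
    # products instead of exponent - 1 of them.
    result = np.eye(B.shape[0])
    square = B
    while exponent > 0:
        if exponent % 2 == 1:
            result = result.dot(square)
        square = square.dot(square)
        exponent //= 2
    return result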

But nevertheless, the industry standard methods involve computing a particular matrix decomposition that is not only faster than the power method, but also numerically stable. That means the algorithm's runtime and accuracy don't depend on slight changes in the entries of the input matrix. Indeed, you can have two matrices where \sigma_1 / \sigma_2 is very close to 1, but changing a single entry will make that ratio much larger. The power method's convergence depends on that ratio, so it's not numerically stable, while the industry standard technique is insensitive to it. This technique involves something called Householder reflections. So while the power method was great for a proof of concept, there's much more work to do if you want true SVD power.

Until next time!

Big Dimensions, and What You Can Do About It

Data is abundant, data is big, and big is a problem. Let me start with an example. Let’s say you have a list of movie titles and you want to learn their genre: romance, action, drama, etc. And maybe in this scenario IMDB doesn’t exist so you can’t scrape the answer. Well, the title alone is almost never enough information. One nice way to get more data is to do the following:

  1. Pick a large dictionary of words, say the most common 100,000 non stop-words in the English language.
  2. Crawl the web looking for documents that include the title of a film.
  3. For each film, record the counts of all other words appearing in those documents.
  4. Maybe remove instances of “movie” or “film,” etc.

After this process you have a length-100,000 vector of integers associated with each movie title. IMDB’s database has around 1.5 million listed movies, and if we have a 32-bit integer per vector entry, that’s 600 GB of data to get every movie.

One way to try to find genres is to cluster this (unlabeled) dataset of vectors, and then manually inspect the clusters and assign genres. With a really fast computer we could simply run an existing clustering algorithm on this dataset and be done. Of course, clustering 600 GB of data takes a long time, but there’s another problem. The geometric intuition that we use to design clustering algorithms degrades as the length of the vectors in the dataset grows. As a result, our algorithms perform poorly. This phenomenon is called the “curse of dimensionality” (“curse” isn’t a technical term), and we’ll return to the mathematical curiosities shortly.

A possible workaround is to try to come up with faster algorithms or be more patient. But a more interesting mathematical question is the following:

Is it possible to condense high-dimensional data into smaller dimensions and retain the important geometric properties of the data?

This goal is called dimension reduction. Indeed, all of the chatter on the internet is bound to encode redundant information, so for our movie title vectors it seems the answer should be "yes." But the questions remain: how does one find a low-dimensional condensification? (Condensification isn't a word; the right word is embedding, but embedding is overloaded, so we'll wait until we define it.) And what mathematical guarantees can you prove about the resulting condensed data? After all, it stands to reason that different techniques preserve different aspects of the data. Only math will tell.

In this post we'll explore this so-called "curse" of dimensionality, explain the formality of why it's seen as a curse, and implement a wonderfully simple technique called "the random projection method" which preserves pairwise distances between points after the reduction. As usual, all the code, data, and tests used in the making of this post are on Github.

Some curious issues, and the “curse”

We start by exploring the curse of dimensionality with experiments on synthetic data.

In two dimensions, take a circle centered at the origin with radius 1 and its bounding square.

circle.png

The circle fills up most of the area in the square, in fact it takes up exactly \pi out of 4 which is about 78%. In three dimensions we have a sphere and a cube, and the ratio of sphere volume to cube volume is a bit smaller, 4 \pi /3 out of a total of 8, which is just over 52%. What about in a thousand dimensions? Let’s try by simulation.

import random

def randUnitCube(n):
   return [(random.random() - 0.5)*2 for _ in range(n)]

def sphereCubeRatio(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   return sum(1 for x in randomSample if sum(a**2 for a in x) <= 1) / numSamples

The result is as we computed for small dimension,

>>> sphereCubeRatio(2,10000)
0.7857
>>> sphereCubeRatio(3,10000)
0.5196

And much smaller for larger dimension

>>> sphereCubeRatio(20,100000) # 100k samples
0.0
>>> sphereCubeRatio(20,1000000) # 1M samples
0.0
>>> sphereCubeRatio(20,2000000)
5e-07

Forget a thousand dimensions, for even twenty dimensions, a million samples wasn't enough to register a single random point inside the unit sphere. This illustrates one concern: when we're sampling random points in the d-dimensional unit cube, we need at least 2^d samples to ensure we're getting an even distribution from the whole space. In high dimensions, this fact basically rules out a naive Monte Carlo approximation, where you sample random points to estimate the probability of an event too complicated to sample from directly. A machine learning viewpoint of the same problem is that in dimension d, if your machine learning algorithm requires a representative sample of the input space in order to make a useful inference, then you require 2^d samples to learn.

Luckily, we can answer our original question because there is a known formula for the volume of a sphere in any dimension. Rather than give the closed form formula, which involves the gamma function and is incredibly hard to parse, we’ll state the recursive form. Call V_i the volume of the unit sphere in dimension i. Then V_0 = 1 by convention, V_1 = 2 (it’s an interval), and V_n = \frac{2 \pi V_{n-2}}{n}. If you unpack this recursion you can see that the numerator looks like (2\pi)^{n/2} and the denominator looks like a factorial, except it skips every other number. So an even dimension would look like 2 \cdot 4 \cdot \dots \cdot n, and this grows larger than a fixed exponential. So in fact the total volume of the sphere vanishes as the dimension grows! (In addition to the ratio vanishing!)

import math

def sphereVolume(n):
   values = [0] * (n+1)
   for i in range(n+1):
      if i == 0:
         values[i] = 1
      elif i == 1:
         values[i] = 2
      else:
         values[i] = 2*math.pi / i * values[i-2]

   return values[-1]

This should be counterintuitive. I think most people would guess, when asked about how the volume of the unit sphere changes as the dimension grows, that it stays the same or gets bigger. But by a hundred dimensions the volume is already minuscule, and by a thousand it's too small to represent with a standard floating point number.

>>> sphereVolume(20)
0.025806891390014047
>>> sphereVolume(100)
2.3682021018828297e-40
>>> sphereVolume(1000)
0.0

The scary thing is not just that this value drops, but that it drops exponentially quickly. A consequence is that, if you’re trying to cluster data points by looking at points within a fixed distance r of one point, you have to carefully measure how big r needs to be to cover the same proportional volume as it would in low dimension.

Here’s a related issue. Say I take a bunch of points generated uniformly at random in the unit cube.

import math
from itertools import combinations

def dist(x, y):
   # Euclidean distance between two points given as coordinate lists
   return math.sqrt(sum((a - b)**2 for (a, b) in zip(x, y)))

def distancesRandomPoints(n, numSamples):
   randomSample = [randUnitCube(n) for _ in range(numSamples)]
   pairwiseDistances = [dist(x,y) for (x,y) in combinations(randomSample, 2)]
   return pairwiseDistances

In two dimensions, the histogram of distances between points looks like this

2d-distances.png

However, as the dimension grows the distribution of distances changes. It evolves like the following animation, in which each frame is an increase in dimension from 2 to 100.

distances-animation.gif

The shape of the distribution doesn’t appear to be changing all that much after the first few frames, but the center of the distribution tends to infinity (in fact, it grows like \sqrt{n}). The variance also appears to stay constant. This chart also becomes more variable as the dimension grows, again because we should be sampling exponentially many more points as the dimension grows (but we don’t). In other words, as the dimension grows the average distance grows and the tightness of the distribution stays the same. So at a thousand dimensions the average distance is about 26, tightly concentrated between 24 and 28. When the average is a thousand, the distribution is tight between 998 and 1002. If one were to normalize this data, it would appear that random points are all becoming equidistant from each other.
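
Here's a quick way to see the \sqrt{n} growth with the helper above: for points drawn uniformly from the cube [-1,1]^d, the mean pairwise distance is about \sqrt{2d/3} once d is large (at d = 1000 that's roughly 26, matching the figure above). The exact numbers vary from run to run.

import statistics

for d in [2, 10, 100, 1000]:
   distances = distancesRandomPoints(d, 100)
   print(d, statistics.mean(distances), statistics.stdev(distances))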

So in addition to the issues of runtime and sampling, the geometry of high-dimensional space looks different from what we expect. To get a better understanding of “big data,” we have to update our intuition from low-dimensional geometry with analysis and mathematical theorems that are much harder to visualize.

The Johnson-Lindenstrauss Lemma

Now we turn to proving dimension reduction is possible. There are a few methods one might first think of, such as looking for suitable subsets of coordinates or taking sums of subsets, but these all appear to take a long time to find, or they simply don't work.

Instead, the key technique is to take a random linear subspace of a certain dimension, and project every data point onto that subspace. No searching required. The fact that this works is called the Johnson-Lindenstrauss Lemma. To set up some notation, we'll write \| v - w \| for the usual Euclidean distance between two points v, w.

Lemma [Johnson-Lindenstrauss (1984)]: Given a set X of n points in \mathbb{R}^d, project the points in X to a randomly chosen subspace of dimension c. Call the projection \rho. For any \varepsilon > 0, if c is at least \Omega(\log(n) / \varepsilon^2), then with probability at least 1/2 the distances between points in X are preserved up to a factor of (1+\varepsilon). That is, with good probability every pair v,w \in X will satisfy

\displaystyle \| v-w \|^2 (1-\varepsilon) \leq \| \rho(v) - \rho(w) \|^2 \leq \| v-w \|^2 (1+\varepsilon)

Before we do the proof, which is quite short, it’s important to point out that the target dimension c does not depend on the original dimension! It only depends on the number of points in the dataset, and logarithmically so. That makes this lemma seem like pure magic, that you can take data in an arbitrarily high dimension and put it in a much smaller dimension.

On the other hand, if you include all of the hidden constants in the bound on the dimension, it's not that impressive. If your dataset has a million points and you want to preserve the distances up to 1% (\varepsilon = 0.01), the bound is bigger than a million! If you loosen the tolerance \varepsilon to 10% (0.1), then you get down to about 12,000 dimensions, which is more reasonable. At 45% the bound drops to around 1,000 dimensions. Here's a plot showing the theoretical bound on c in terms of \varepsilon for n fixed to a million.

boundplot

 

But keep in mind, this is just a theoretical bound for potentially misbehaving data. Later in this post we’ll see if the practical dimension can be reduced more than the theory allows. As we’ll see, an algorithm run on the projected data is still effective even if the projection goes well beyond the theoretical bound. Because the theorem is known to be tight in the worst case (see the notes at the end) this speaks more to the robustness of the typical algorithm than to the robustness of the projection method.

A second important note is that this technique does not necessarily avoid all the problems with the curse of dimensionality. We mentioned above that one potential problem is that “random points” are roughly equidistant in high dimensions. Johnson-Lindenstrauss actually preserves this problem because it preserves distances! As a consequence, you won’t see strictly better algorithm performance if you project (which we suggested is possible in the beginning of this post). But you will alleviate slow runtimes if the runtime depends exponentially on the dimension. Indeed, if you replace the dimension d with the logarithm of the number of points \log n, then 2^d becomes linear in n, and 2^{O(d)} becomes polynomial.

Proof of the J-L lemma

Let’s prove the lemma.

Proof. To start we make note that one can sample from the uniform distribution on dimension-c linear subspaces of \mathbb{R}^d by choosing the entries of a c \times d matrix A independently from a normal distribution with mean 0 and variance 1. Then, to project a vector x by this matrix (call the projection \rho), we can compute

\displaystyle \rho(x) = \frac{1}{\sqrt{c}}A x

Now fix \varepsilon > 0 and fix two points in the dataset x,y. We want an upper bound on the probability that the following is false

\displaystyle \| x-y \|^2 (1-\varepsilon) \leq \| \rho(x) - \rho(y) \|^2 \leq \| x-y \|^2 (1+\varepsilon)

Since that expression is a pain to work with, let's call u = x-y and use the linearity of the projection to rewrite it as the equivalent statement.

\left | \| \rho(u) \|^2 - \|u \|^2 \right | \leq \varepsilon \| u \|^2

And so we want a bound on the probability that this event does not occur, meaning the inequality switches directions.

Once we get such a bound (it will depend on c and \varepsilon) we need to ensure that this bound is true for every pair of points. The union bound allows us to do this, but it also requires that the probability of the bad thing happening tends to zero faster than 1/\binom{n}{2}. That’s where the \log(n) will come into the bound as stated in the theorem.

Continuing with our use of u for notation, define X to be the random variable \frac{c}{\| u \|^2} \| \rho(u) \|^2. By expanding the notation and using the linearity of expectation, you can show that the expected value of X is c, meaning that in expectation, distances are preserved. We are on the right track, and just need to show that the distribution of X, and thus the possible deviations in distances, is tightly concentrated around c. In full rigor, we will show

\displaystyle \Pr [X \geq (1+\varepsilon) c] < e^{-(\varepsilon^2 - \varepsilon^3) \frac{c}{4}}

Let A_i denote the i-th row of A. Define by X_i the quantity \langle A_i, u \rangle / \| u \|. This is a weighted sum of the entries of A_i by the entries of u, scaled down by \| u \|. But since we chose the entries of A from the normal distribution, and since a linear combination of independent normally distributed random variables is also normally distributed, \langle A_i, u \rangle \sim N(0, \| u \|^2), and so X_i is an N(0,1) random variable. Moreover, the rows of A are independent, so the X_i are independent. This allows us to decompose X as

X = \frac{c}{\| u \|^2} \| \rho(u) \|^2 = \frac{\| Au \|^2}{\| u \|^2}

Expanding further,

X = \sum_{i=1}^c \frac{\langle A_i, u \rangle^2}{\|u\|^2} = \sum_{i=1}^c X_i^2

Now the event X \geq (1+\varepsilon) c can be expressed in terms of the nonnegative variable e^{\lambda X}, where 0 < \lambda < 1/2 is a parameter, to get

\displaystyle \Pr[X \geq (1+\varepsilon) c] = \Pr[e^{\lambda X} \geq e^{(1+\varepsilon)c \lambda}]

This will become useful because the sum X = \sum_i X_i^2 will split into a product momentarily. First we apply Markov’s inequality, which says that for any nonnegative random variable Y, \Pr[Y \geq t] \leq \mathbb{E}[Y] / t. This lets us write

\displaystyle \Pr[e^{\lambda X} \geq e^{(1+\varepsilon) c \lambda}] \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{(1+\varepsilon) c \lambda}}

Now we can split up the exponent \lambda X into \sum_{i=1}^c \lambda X_i^2, and using the i.i.d.-ness of the X_i^2 we can rewrite the RHS of the inequality as

\left ( \frac{\mathbb{E}[e^{\lambda X_1^2}]}{e^{(1+\varepsilon)\lambda}} \right )^c

A similar statement using -\lambda is true for the (1-\varepsilon) part, namely that

\displaystyle \Pr[X \leq (1-\varepsilon)c] \leq \left ( \frac{\mathbb{E}[e^{-\lambda X_1^2}]}{e^{-(1-\varepsilon)\lambda}} \right )^c

The last thing that's needed is to bound \mathbb{E}[e^{\lambda X_i^2}], but since X_i \sim N(0,1), we can use the known density function for a normal distribution, and integrate to get the exact value \mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}}. Including this in the bound gives us a closed-form bound in terms of \lambda, c, \varepsilon. Using standard calculus, the optimal \lambda \in (0,1/2) is \lambda = \varepsilon / (2(1+\varepsilon)). This gives

\displaystyle \Pr[X \geq (1+\varepsilon) c] \leq ((1+\varepsilon)e^{-\varepsilon})^{c/2}

Using the Taylor series expansion for e^x, one can show the bound 1+\varepsilon < e^{\varepsilon - (\varepsilon^2 - \varepsilon^3)/2}, which simplifies the final upper bound to e^{-(\varepsilon^2 - \varepsilon^3) c/4}.

Doing the same thing for the (1-\varepsilon) version gives an equivalent bound, and so the total bound is doubled, i.e. 2e^{-(\varepsilon^2 - \varepsilon^3) c/4}.
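
As an aside, the exact value \mathbb{E}[e^{\lambda X_1^2}] = \frac{1}{\sqrt{1-2\lambda}} used above is a one-line Gaussian integral: for 0 < \lambda < 1/2,

\displaystyle \mathbb{E}[e^{\lambda X_1^2}] = \int_{-\infty}^{\infty} e^{\lambda t^2} \frac{e^{-t^2/2}}{\sqrt{2\pi}} \, dt = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(1-2\lambda)t^2/2} \, dt = \frac{1}{\sqrt{1-2\lambda}}

where the last step uses \int_{-\infty}^{\infty} e^{-at^2/2} \, dt = \sqrt{2\pi/a} with a = 1 - 2\lambda.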

As we said at the beginning, applying the union bound means we need

\displaystyle 2e^{-(\varepsilon^2 - \varepsilon^3) c/4} < \frac{1}{\binom{n}{2}}

Solving this for c gives c \geq \frac{8 \log n}{\varepsilon^2 - \varepsilon^3}, as desired.

\square

Projecting in Practice

Let’s write a python program to actually perform the Johnson-Lindenstrauss dimension reduction scheme. This is sometimes called the Johnson-Lindenstrauss transform, or JLT.

First we define a random subspace by sampling an appropriately-sized matrix with normally distributed entries, and a function that performs the projection onto a given subspace (for testing).

import random
import math
import numpy

def randomSubspace(subspaceDimension, ambientDimension):
   return numpy.random.normal(0, 1, size=(subspaceDimension, ambientDimension))

def project(v, subspace):
   subspaceDimension = len(subspace)
   return (1 / math.sqrt(subspaceDimension)) * subspace.dot(v)

We have a function that computes the theoretical bound on the optimal dimension to reduce to.

def theoreticalBound(n, epsilon):
   return math.ceil(8*math.log(n) / (epsilon**2 - epsilon**3))

And then performing the JLT is simply matrix multiplication

def jlt(data, subspaceDimension):
   ambientDimension = len(data[0])
   A = randomSubspace(subspaceDimension, ambientDimension)
   return (1 / math.sqrt(subspaceDimension)) * A.dot(data.T).T

The high-dimensional dataset we’ll use comes from a data mining competition called KDD Cup 2001. The dataset we used deals with drug design, and the goal is to determine whether an organic compound binds to something called thrombin. Thrombin has something to do with blood clotting, and I won’t pretend I’m an expert. The dataset, however, has over a hundred thousand features for about 2,000 compounds. Here are a few approximate target dimensions we can hope for as epsilon varies.

>>> [('%.2f' % (1/x), theoreticalBound(n=2000, epsilon=1/x))
       for x in [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]]
[('0.50', 487), ('0.33', 821), ('0.25', 1298), ('0.20', 1901),
 ('0.17', 2627), ('0.14', 3477), ('0.12', 4448), ('0.11', 5542),
 ('0.10', 6757), ('0.07', 14659), ('0.05', 25604)]

Going down from a hundred thousand dimensions to a few thousand decreases the size of the dataset by about 95%, by any measure. We can also observe how the distribution of overall distances varies as the size of the subspace we project to varies.

The animation proceeds from 5000 dimensions down to 2 (when the plot is at its bulkiest closer to zero).

The last three frames are for 10, 5, and 2 dimensions respectively. As you can see, the histogram starts to beef up around zero. To be honest I was expecting something a bit more dramatic, like a uniform-ish distribution. Of course, the distribution of distances is not all that matters. Another concern is the worst case change in distances between any two points before and after the projection. We can see that indeed, when we project to the dimension specified in the theorem, the distances are within the prescribed bounds.

def checkTheorem(oldData, newData, epsilon):
   numBadPoints = 0

   for (x,y), (x2,y2) in zip(combinations(oldData, 2), combinations(newData, 2)):
      oldNorm = numpy.linalg.norm(x-y)**2
      newNorm = numpy.linalg.norm(x2-y2)**2

      if newNorm == 0 or oldNorm == 0:
         continue

      if abs(newNorm / oldNorm - 1) > epsilon:
         numBadPoints += 1

   return numBadPoints

if __name__ == "__main__":
   from data import thrombin
   train, labels = thrombin.load() 

   numPoints = len(train)
   epsilon = 0.2
   subspaceDim = theoreticalBound(numPoints, epsilon)
   ambientDim = len(train[0])
   newData = jlt(train, subspaceDim)

   print(checkTheorem(train, newData, epsilon))

This program prints zero every time I try running it, which is the poor man’s way of saying it works “with high probability.” We can also plot statistics about the number of pairs of data points that are distorted by more than \varepsilon as the subspace dimension shrinks. We ran this on the following set of subspace dimensions with \varepsilon = 0.1 and took average/standard deviation over twenty trials:

   dims = [1000, 750, 500, 250, 100, 75, 50, 25, 10, 5, 2]

The result is the following chart, whose x-axis is the dimension projected to (so the left hand is the most extreme projection to 2, 5, 10 dimensions), the y-axis is the number of distorted pairs, and the error bars represent a single standard deviation away from the mean.

thrombin-worst-case

This chart provides good news about this dataset because the standard deviations are low. It tells us something that mathematicians often ignore: the predictability of the tradeoff that occurs once you go past the theoretically perfect bound. In this case, the standard deviations tell us that it’s highly predictable. Moreover, since this tradeoff curve measures pairs of points, we might conjecture that the distortion is localized around a single set of points that got significantly “rattled” by the projection. This would be an interesting exercise to explore.

Now all of these charts are really playing with the JLT and confirming the correctness of our code (and hopefully our intuition). The real question is: how well does a machine learning algorithm perform on the original data when compared to the projected data? If the algorithm only "depends" on the pairwise distances between the points, then we should expect nearly identical accuracy in the unprojected and projected versions of the data. To show this we'll use an easy learning algorithm, the k-nearest-neighbors method. The problem, however, is that there are very few positive examples in this particular dataset. So looking for the majority label of the nearest k neighbors for any k > 2 invariably results in the "all negative" classifier, which has 97% accuracy. This happens before and after projecting.

To compensate for this, we modify k-nearest-neighbors slightly by having the label of a predicted point be 1 if any label among its nearest neighbors is 1. So it’s not a majority vote, but rather a logical OR of the labels of nearby neighbors. Our point in this post is not to solve the problem well, but rather to show how an algorithm (even a not-so-good one) can degrade as one projects the data into smaller and smaller dimensions. Here is the code.

def nearestNeighborsAccuracy(data, labels, k=10):
   from sklearn.neighbors import NearestNeighbors
   trainData, trainLabels, testData, testLabels = randomSplit(data, labels) # cross validation
   model = NearestNeighbors(n_neighbors=k).fit(trainData)
   distances, indices = model.kneighbors(testData)
   predictedLabels = []

   for x in indices:
      xLabels = [trainLabels[i] for i in x[1:]]
      predictedLabel = max(xLabels)
      predictedLabels.append(predictedLabel)

   totalAccuracy = sum(x == y for (x,y) in zip(testLabels, predictedLabels)) / len(testLabels)
   falsePositive = (sum(x == 0 and y == 1 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 0 for x in testLabels))
   falseNegative = (sum(x == 1 and y == 0 for (x,y) in zip(testLabels, predictedLabels)) /
      sum(x == 1 for x in testLabels))

   return totalAccuracy, falsePositive, falseNegative

And here is the accuracy of this modified k-nearest-neighbors algorithm run on the thrombin dataset. The horizontal line represents the accuracy of the produced classifier on the unmodified data set. The x-axis represents the dimension projected to (left-hand side is the lowest), and the y-axis represents the accuracy. The mean accuracy over fifty trials was plotted, with error bars representing one standard deviation. The complete code to reproduce the plot is in the Github repository.

thrombin-knn-accuracy

Likewise, we plot the proportions of false positives and false negatives for the output classifier. Note that a "positive" label made up only about 2% of the total data set. First, the false positives:

thrombin-knn-fp

Then the false negatives

thrombin-knn-fn

As we can see from these three charts, things don’t really change that much (for this dataset) even when we project down to around 200-300 dimensions. Note that for these parameters the “correct” theoretical choice for dimension was on the order of 5,000 dimensions, so this is a 95% savings from the naive approach, and 99.75% space savings from the original data. Not too shabby.

Notes

The \Omega(\log(n)) worst-case dimension bound is asymptotically tight, though there is some small gap in the literature that depends on \varepsilon. This result is due to Noga Alon, the very last result (Section 9) of this paper.

We did dimension reduction with respect to preserving the Euclidean distance between points. One might naturally wonder if you can achieve the same dimension reduction with a different metric, say the taxicab metric or a p-norm. In fact, you cannot achieve anything close to logarithmic dimension reduction for the taxicab (l_1) metric. This result is due to Brinkman-Charikar in 2004.

The code we used to compute the JLT is not particularly efficient. There are much more efficient methods. One of them, borrowing its namesake from the Fast Fourier Transform, is called the Fast Johnson-Lindenstrauss Transform. The technique is due to Ailon-Chazelle from 2009, and it involves something called “preconditioning a sparse projection matrix with a randomized Fourier transform.” I don’t know precisely what that means, but it would be neat to dive into that in a future post.

The central focus in this post was whether the JLT preserves distances between points, but one might be curious as to whether the points themselves are well approximated. The answer is an enthusiastic no. If the data were images, the projected points would look nothing like the original images. However, it appears the degradation tradeoff is measurable (by some accounts perhaps linear), and there appears to be some work (also this by the same author) when restricting to sparse vectors (like word-association vectors).

Note that the JLT is not the only method for dimensionality reduction. We previously saw principal component analysis (applied to face recognition), and in the future we will cover a related technique called the Singular Value Decomposition. It is worth noting that another common technique specific to nearest-neighbor is called “locality-sensitive hashing.” Here the goal is to project the points in such a way that “similar” points land very close to each other. Say, if you were to discretize the plane into bins, these bins would form the hash values and you’d want to maximize the probability that two points with the same label land in the same bin. Then you can do things like nearest-neighbors by comparing bins.
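
To illustrate the binning idea, one toy sketch (this only conveys the flavor; it is not the standard, carefully tuned construction) is to project onto a few random directions and discretize:

import numpy

def lshBin(point, projections, binWidth):
   # Hash a point by projecting it onto a handful of random directions (the
   # rows of `projections`) and discretizing each coordinate into bins of
   # width binWidth. Nearby points tend to land in the same bin tuple.
   projected = projections.dot(point)
   return tuple(int(numpy.floor(t / binWidth)) for t in projected)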

Another interesting note, if your data is linearly separable (like the examples we saw in our age-old post on Perceptrons), then you can use the JLT to make finding a linear separator easier. First project the data onto the dimension given in the theorem. With high probability the points will still be linearly separable. And then you can use a perceptron-type algorithm in the smaller dimension. If you want to find out which side a new point is on, you project and compare with the separator in the smaller dimension.

Beyond its interest for practical dimensionality reduction, the JLT has had many other interesting theoretical consequences. More generally, the idea of "randomly projecting" your data onto some small dimensional space has allowed mathematicians to get some of the best-known results on many optimization and learning problems, perhaps the most famous of which is called MAX-CUT; the result is by Goemans-Williamson and it led to a mathematical constant being named after them, \alpha_{GW} = 0.878567 \dots. If you're interested in more about the theory, Santosh Vempala wrote a wonderful (and short!) treatise dedicated to this topic.

randomprojectionbook

The Many Faces of Set Cover

A while back Peter Norvig posted a wonderful pair of articles about regex golf. The idea behind regex golf is to come up with the shortest possible regular expression that matches one given list of strings, but not the other.

“Regex Golf,” by Randall Munroe.

In the first article, Norvig runs a basic algorithm to recreate and improve the results from the comic, and in the second he beefs it up with some improved search heuristics. My favorite part about this topic is that regex golf can be phrased in terms of a problem called set cover. I noticed this when reading the comic, and was delighted to see Norvig use that as the basis of his algorithm.

The set cover problem shows up in other places, too. If you have a database of items labeled by users, and you want to find the smallest set of labels to display that covers every item in the database, you’re doing set cover. I hear there are applications in biochemistry and biology but haven’t seen them myself.

If you know what a set is (just think of the “set” or “hash set” type from your favorite programming language), then set cover has a simple definition.

Definition (The Set Cover Problem): You are given a finite set U called a “universe” and sets S_1, \dots, S_n each of which is a subset of U. You choose some of the S_i to ensure that every x \in U is in one of your chosen sets, and you want to minimize the number of S_i you picked.

It's called a "cover" because the sets you pick "cover" every element of U. Let's do a simple example. Let U = \{ 1,2,3,4,5 \} and

\displaystyle S_1 = \{ 1,3,4 \}, S_2 = \{ 2,3,5 \}, S_3 = \{ 1,4,5 \}, S_4 = \{ 2,4 \}

Then the smallest possible number of sets you can pick is 2, and you can achieve this by picking both S_1, S_2 or both S_2, S_3. The connection to regex golf is that you pick U to be the set of strings you want to match, and you pick a set of regexes that match some of the strings in U but none of the strings you want to avoid matching (I’ll call them V). If w is such a regex, then you can form the set S_w of strings that w matches. Then if you find a small set cover with the strings w_1, \dots, w_t, then you can “or” them together to get a single regex w_1 \mid w_2 \mid \dots \mid w_t that matches all of U but none of V.
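
To make the toy example concrete in code (the variable names are just mine):

U = {1, 2, 3, 4, 5}
S1, S2, S3, S4 = {1, 3, 4}, {2, 3, 5}, {1, 4, 5}, {2, 4}

# the two minimum covers from the example above
print(S1 | S2 == U)   # True
print(S2 | S3 == U)   # True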

Set cover is what’s called NP-hard, and one implication is that we shouldn’t hope to find an efficient algorithm that will always give you the shortest regex for every regex golf problem. But despite this, there are approximation algorithms for set cover. What I mean by this is that there is a regex-golf algorithm A that outputs a subset of the regexes matching all of U, and the number of regexes it outputs is such-and-such close to the minimum possible number. We’ll make “such-and-such” more formal later in the post.

What made me sad was that Norvig didn’t go any deeper than saying, “We can try to approximate set cover, and the greedy algorithm is pretty good.” It’s true, but the ideas are richer than that! Set cover is a simple example to showcase interesting techniques from theoretical computer science. And perhaps ironically, in Norvig’s second post a header promised the article would discuss the theory of set cover, but I didn’t see any of what I think of as theory. Instead he partially analyzes the structure of the regex golf instances he cares about. This is useful, but not really theoretical in any way unless he can say something universal about those instances.

I don’t mean to bash Norvig. His articles were great! And in-depth theory was way beyond scope. So this post is just my opportunity to fill in some theory gaps. We’ll do three things:

  1. Show formally that set cover is NP-hard.
  2. Prove the approximation guarantee of the greedy algorithm.
  3. Show another (very different) approximation algorithm based on linear programming.

Along the way I’ll argue that by knowing (or at least seeing) the details of these proofs, one can get a better sense of what features to look for in the set cover instance you’re trying to solve. We’ll also see how set cover depicts the broader themes of theoretical computer science.

NP-hardness

The first thing we should do is show that set cover is NP-hard. Intuitively what this means is that we can take some hard problem P and encode instances of P inside set cover problems. This idea is called a reduction, because solving problem P will "reduce" to solving set cover, and the method we use to encode instances of P as set cover problems will have a small amount of overhead. This is one way to say that set cover is "at least as hard as" P.

The hard problem we'll reduce to set cover is called 3-satisfiability (3-SAT). In 3-SAT, the input is a formula whose variables are either true or false, and the formula is expressed as an AND of a bunch of clauses, each of which is an OR of three variables (or their negations). This is called 3-CNF form. A simple example:

\displaystyle (x \vee y \vee \neg z) \wedge (\neg x \vee w \vee y) \wedge (z \vee x \vee \neg w)

The goal of the algorithm is to decide whether there is an assignment to the variables which makes the formula true. 3-SAT is one of the most fundamental problems we believe to be hard and, roughly speaking, by reducing it to set cover we include set cover in a class called NP-complete, and if any one of these problems can be solved efficiently, then they all can (this is the famous P versus NP problem, and an efficient algorithm would imply P equals NP).

So a reduction would consist of the following: you give me a formula \varphi in 3-CNF form, and I have to produce (in a way that depends on \varphi!) a universe U and a choice of subsets S_i \subset U in such a way that

\varphi has a true assignment of variables if and only if the corresponding set cover problem has a cover using k sets.

In other words, I’m going to design a function f from 3-SAT instances to set cover instances, such that x is satisfiable if and only if f(x) has a set cover with k sets.

Why do I say it only for k sets? Well, if you can always answer this question then I claim you can find the minimum size of a set cover needed by doing a binary search for the smallest value of k. So finding the minimum size of a set cover reduces to the problem of telling if there's a set cover of size k.
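
That binary search is straightforward; here is a sketch, where hasCoverOfSize is a hypothetical decision oracle and we assume the whole family of sets does cover U:

def minimumCoverSize(universe, sets, hasCoverOfSize):
   # hasCoverOfSize(universe, sets, k) is a hypothetical oracle for the
   # size-k decision question. A cover of size k implies one of every larger
   # size, so the property is monotone in k and binary search applies.
   lo, hi = 1, len(sets)
   while lo < hi:
      mid = (lo + hi) // 2
      if hasCoverOfSize(universe, sets, mid):
         hi = mid
      else:
         lo = mid + 1
   return lo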

Now let’s do the reduction from 3-SAT to set cover.

If you give me \varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m, where each C_i is a clause and the variables are denoted x_1, \dots, x_n, then I will choose my universe U to be the set of all the clauses and all the indices of the variables (these are all just formal symbols). i.e.

\displaystyle U = \{ C_1, C_2, \dots, C_m, 1, 2, \dots, n \}

The first part of U will ensure I make all the clauses true, and the last part will ensure I don’t pick a variable to be both true and false at the same time.

To show how this works I have to pick my subsets. For each variable x_i, I’ll make two sets, one called S_{x_i} and one called S_{\neg x_i}. They will both contain i in addition to the clauses which they make true when the corresponding literal is true (by literal I just mean the variable or its negation). For example, if C_j uses the literal \neg x_7, then S_{\neg x_7} will contain C_j but S_{x_7} will not. Finally, I’ll set k = n, the number of variables.

Now to prove this reduction works I have to prove two things. First: if my starting formula has a satisfying assignment, then the set cover problem has a cover of size k. Indeed, take the sets S_{y} for all literals y that are set to true in the satisfying assignment. There are exactly n true literals, since for each variable exactly one of x_i, \neg x_i is true, so this gives n = k sets. And these sets cover all of U: each index i is covered by whichever of S_{x_i}, S_{\neg x_i} we picked, and every clause has to be satisfied by some true literal (or else the formula isn't true), so every clause is covered by the set for that literal.

The reverse direction is similar: if I have a set cover of size n, I need to use it to come up with a satisfying truth assignment for the original formula. But indeed, the sets that get chosen can't include both an S_{x_i} and its negation set S_{\neg x_i}. There are n elements \{1, 2, \dots, n \} \subset U, and each index i is only contained in the two sets S_{x_i}, S_{\neg x_i}, so just to cover all the indices I need one set from each of the n pairs, and that already accounts for all n sets in the cover. So the cover contains exactly one of S_{x_i}, S_{\neg x_i} for each i, which defines a truth assignment. And finally, since the cover also covers all the clauses, every clause contains a literal whose set was chosen, i.e. a literal set to true, so the assignment satisfies the formula.
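
Here's a small sketch of the construction in code, using the common convention that a clause is a list of nonzero integers, where i stands for x_i and -i for \neg x_i (this representation is my own choice for illustration):

def satToSetCover(clauses, numVariables):
   # Universe: one element per variable index, plus one tagged element per clause.
   universe = set(range(1, numVariables + 1)) | {('C', j) for j in range(len(clauses))}

   # For each literal, its set contains the variable's index plus every
   # clause that the literal satisfies.
   sets = []
   for i in range(1, numVariables + 1):
      for literal in (i, -i):
         s = {i} | {('C', j) for j, clause in enumerate(clauses) if literal in clause}
         sets.append(s)

   # The formula is satisfiable iff there's a cover of size k = numVariables.
   return universe, sets, numVariables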

Whew! So set cover is NP-hard because I encoded this logic problem 3-SAT within its rules. If we think 3-SAT is hard (and we do) then set cover must also be hard. So if we can’t hope to solve it exactly we should try to approximate the best solution.

The greedy approach

The method that Norvig uses in attacking the meta-regex golf problem is the greedy algorithm. The greedy algorithm is exactly what you’d expect: you maintain a list L of the subsets you’ve picked so far, and at each step you pick the set S_i that maximizes the number of new elements of U that aren’t already covered by the sets in L. In python pseudocode:

def greedySetCover(universe, sets):
   chosenSets = []            # a Python set can't contain sets, so use a list
   leftToCover = universe.copy()
   unchosenSets = list(sets)  # copy so we don't mutate the caller's list

   covered = lambda s: leftToCover & s

   while leftToCover:
      if len(chosenSets) == len(sets):
         raise Exception("No set cover possible")

      nextSet = max(unchosenSets, key=lambda s: len(covered(s)))
      unchosenSets.remove(nextSet)
      chosenSets.append(nextSet)
      leftToCover -= nextSet

   return chosenSets

This is what theory has to say about the greedy algorithm:

Theorem: If it is possible to cover U by the sets in F = \{ S_1, \dots, S_n \}, then the greedy algorithm always produces a cover that at worst has size O(\log(n)) \textup{OPT}, where \textup{OPT} is the size of the smallest cover. Moreover, this is asymptotically the best any algorithm can do.

One simple fact we need from calculus is that the following sum is asymptotically the same as \log(n):

\displaystyle H(n) = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} = \log(n) + O(1)

Proof. [adapted from Wan] Let’s say the greedy algorithm picks sets T_1, T_2, \dots, T_k in that order. We’ll set up a little value system for the elements of U. Specifically, the value of each T_i is 1, and in step i we evenly distribute this unit value across all newly covered elements of T_i. So for T_1 each covered element gets value 1/|T_1|, and if T_2 covers four new elements, each gets a value of 1/4. One can think of this “value” as a price, or energy, or unit mass, or whatever. It’s just an accounting system (albeit a clever one) we use to make some inequalities clear later.

In general call the value v_x of element x \in U the value assigned to x at the step where it’s first covered. In particular, the number of sets chosen by the greedy algorithm k is just \sum_{x \in U} v_x. We’re just bunching back together the unit value we distributed for each step of the algorithm.

Now we want to compare the sets chosen by greedy to the optimal choice. Call a smallest set cover C_{\textup{OPT}}. Let’s stare at the following inequality.

\displaystyle \sum_{x \in U} v_x \leq \sum_{S \in C_{\textup{OPT}}} \sum_{x \in S} v_x

It’s true because each x counts for a v_x at most once in the left hand side, and in the right hand side the sets in C_{\textup{OPT}} must hit each x at least once but may hit some x more than once. Also remember the left hand side is equal to k.

Now we want to show that the inner sum on the right hand side, \sum_{x \in S} v_x, is at most H(|S|). This will in fact prove the entire theorem: because each set S_i has size at most n, the inequality above will turn into

\displaystyle k \leq |C_{\textup{OPT}}| H(|S|) \leq |C_{\textup{OPT}}| H(n)

And so k \leq \textup{OPT} \cdot O(\log(n)), which is the statement of the theorem.

So we want to show that \sum_{x \in S} v_x \leq H(|S|). For each j define \delta_j(S) to be the number of elements in S not covered in T_1, \cup \dots \cup T_j. Notice that \delta_{j-1}(S) - \delta_{j}(S) is the number of elements of S that are covered for the first time in step j. If we call t_S the smallest integer j for which \delta_j(S) = 0, we can count up the differences up to step t_S, we get

\sum_{x \in S} v_x = \sum_{i=1}^{t_S} (\delta_{i-1}(S) - \delta_i(S)) \cdot \frac{1}{|T_i \setminus (T_1 \cup \dots \cup T_{i-1})|}

The rightmost term is just the cost assigned to the relevant elements at step i. Moreover, because T_i covers at least as many new elements as S does (by definition of the greedy algorithm), the fraction above is at most 1/\delta_{i-1}(S). The end is near. For brevity I'll drop the (S) from \delta_j(S).

\displaystyle \begin{aligned} \sum_{x \in S} v_x & \leq \sum_{i=1}^{t_S} (\delta_{i-1} - \delta_i) \frac{1}{\delta_{i-1}} \\ & \leq \sum_{i=1}^{t_S} (\frac{1}{1 + \delta_i} + \frac{1}{2+\delta_i} \dots + \frac{1}{\delta_{i-1}}) \\ & = \sum_{i=1}^{t_S} H(\delta_{i-1}) - H(\delta_i) \\ &= H(\delta_0) - H(\delta_{t_S}) = H(|S|) \end{aligned}

And that proves the claim.

\square

I have three postscripts to this proof:

  1. This is basically the exact worst-case approximation that the greedy algorithm achieves. In fact, Petr Slavik proved in 1996 that the greedy gives you a set of size exactly (\log n - \log \log n + O(1)) \textup{OPT} in the worst case.
  2. This is also the best approximation that any set cover algorithm can achieve, provided that P is not NP. This result was basically known in 1994, but it wasn’t until 2013 and the use of some very sophisticated tools that the best possible bound was found with the smallest assumptions.
  3. In the proof we used that |S| \leq n to bound things, but if we knew that our sets S_i (i.e. subsets matched by a regex) had sizes bounded by, say, B, the same proof would show that the approximation factor is \log(B) instead of \log n. However, in order for that to be useful you need B to be a constant, or at least to grow more slowly than any polynomial in n, since e.g. \log(n^{0.1}) = 0.1 \log n. In fact, taking a second look at Norvig’s meta regex golf problem, some of his instances had this property! Which means the greedy algorithm gives a much better approximation ratio for certain meta regex golf problems than it does for the worst case general problem. This is one instance where knowing the proof of a theorem helps us understand how to specialize it to our interests.
norvig-table

Norvig’s frequency table for president meta-regex golf. The left side counts the size of each set (defined by a regex)

The linear programming approach

So we just said that you can’t possibly do better than the greedy algorithm for approximating set cover. There must be nothing left to say, job well done, right? Wrong! Our second analysis, based on linear programming, shows that instances with special features can have better approximation results.

In particular, if we’re guaranteed that each element x \in U occurs in at most B of the sets S_i, then the linear programming approach will give a B-approximation, i.e. a cover whose size is at worst larger than OPT by a multiplicative factor of B. In the case that B is constant, we can beat our earlier greedy algorithm.

The technique is now a classic one in optimization, called LP-relaxation (LP stands for linear programming). The idea is simple. Most optimization problems can be written as integer linear programs: you have n variables x_1, \dots, x_n \in \{ 0, 1 \} and you want to maximize (or minimize) a linear function of the x_i subject to some linear constraints. The thing you're trying to optimize is called the objective. While in general solving integer linear programs is NP-hard, we can relax the "integer" requirement to 0 \leq x_i \leq 1, or something similar. The resulting linear program, called the relaxed program, can be solved efficiently using the simplex algorithm or another more complicated method.

The output of solving the relaxed program is an assignment of real numbers for the x_i that optimizes the objective function. A key fact is that the solution to the relaxed linear program will be at least as good as the solution to the original integer program, because the optimal solution to the integer program is a valid candidate for the optimal solution to the linear program. Then the idea is that if we use some clever scheme to round the x_i to integers, we can measure how much this degrades the objective and prove that it doesn’t degrade too much when compared to the optimum of the relaxed program, which means it doesn’t degrade too much when compared to the optimum of the integer program as well.

If this sounds wishy washy and vague don’t worry, we’re about to make it super concrete for set cover.

We’ll make a binary variable x_i for each set S_i in the input, and x_i = 1 if and only if we include it in our proposed cover. Then the objective function we want to minimize is \sum_{i=1}^n x_i. If we call our elements X = \{ e_1, \dots, e_m \}, then we need to write down a linear constraint that says each element e_j is hit by at least one set in the proposed cover. These constraints have to depend on the sets S_i, but that’s not a problem. One good constraint for element e_j is

\displaystyle \sum_{i : e_j \in S_i} x_i \geq 1

In words, the only way that an e_j will not be covered is if all the sets containing it have their x_i = 0. And we need one of these constraints for each j. Putting it together, the integer linear program is

The integer program for set cover.
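
In symbols, putting the objective and constraints together, the integer program is:

\displaystyle \begin{aligned} \textup{minimize} \quad & \sum_{i=1}^n x_i \\ \textup{subject to} \quad & \sum_{i : e_j \in S_i} x_i \geq 1 \textup{ for every } j \\ & x_i \in \{ 0, 1 \} \textup{ for every } i \end{aligned}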

Once we understand this formulation of set cover, the relaxation is trivial. We just replace the integrality constraint x_i \in \{ 0, 1 \} with the inequalities 0 \leq x_i \leq 1.

setcoverlp

For a given candidate assignment x to the x_i, call Z(x) the objective value (in this case \sum_i x_i). Now we can be more concrete about the guarantees of this relaxation method. Let \textup{OPT}_{\textup{IP}} be the optimal value of the integer program and x_{\textup{IP}} a corresponding assignment to x_i achieving the optimum. Likewise let \textup{OPT}_{\textup{LP}}, x_{\textup{LP}} be the optimal things for the linear relaxation. We will prove:

Theorem: There is a deterministic algorithm that rounds x_{\textup{LP}} to integer values x so that the objective value Z(x) \leq B \textup{OPT}_{\textup{IP}}, where B is the maximum number of sets that any element e_j occurs in. So this gives a B-approximation of set cover.

Proof. Let B be as described in the theorem, and call y = x_{\textup{LP}} to make the indexing notation easier. The rounding algorithm is to set x_i = 1 if y_i \geq 1/B and zero otherwise.

To prove the theorem we need to show two things hold about this new candidate solution x:

  1. The choice of all S_i for which x_i = 1 covers every element.
  2. The number of sets chosen (i.e. Z(x)) is at most B times more than \textup{OPT}_{\textup{LP}}.

Since \textup{OPT}_{\textup{LP}} \leq \textup{OPT}_{\textup{IP}}, if we can prove number 2 we get Z(x) \leq B \textup{OPT}_{\textup{LP}} \leq B \textup{OPT}_{\textup{IP}}, which is the theorem.

So let’s prove 1. Fix any j and we’ll show that element e_j is covered by some set in the rounded solution. Call B_j the number of input sets that contain e_j. By definition B_j \leq B, so 1/B_j \geq 1/B. Recall y was the optimal solution to the relaxed linear program, so it must satisfy the linear constraint for e_j: \sum_{i : e_j \in S_i} y_i \geq 1. This sum has B_j terms and they sum to at least 1, so not all of the terms can be smaller than 1/B_j (otherwise they’d sum to something less than 1). In other words, some y_i in the sum is at least 1/B_j \geq 1/B, so the corresponding x_i is set to 1 in the rounded solution, and the set S_i it represents contains e_j. This finishes the proof of 1.

Now let’s prove 2. For each i with x_i = 1, the corresponding LP value satisfies y_i \geq 1/B, and in particular 1 \leq B y_i. Now we can simply bound the sum.

\displaystyle \begin{aligned} Z(x) = \sum_i x_i &\leq \sum_i x_i (B y_i) \\ &\leq B \sum_{i} y_i \\ &= B \cdot \textup{OPT}_{\textup{LP}} \end{aligned}

The first inequality holds term by term: when x_i = 1 we have B y_i \geq 1, and when x_i = 0 both terms are zero. The second inequality is true because some of the x_i are zero, but when we upper bound we can ignore them and just include all the y_i. This proves part 2 and the theorem.

\square

I’ve got some more postscripts to this proof:

  1. The proof works equally well when the sets are weighted, i.e. your cost for picking a set is not 1 for every set but depends on some arbitrarily given constants w_i \geq 0.
  2. We gave a deterministic algorithm rounding y to x, but one can get the same result (with high probability) using a randomized algorithm. The idea is to flip a coin with bias y_i roughly \log(n) times and set x_i = 1 if and only if the coin lands heads at least once (there’s a short sketch of this after the list). The guarantee is no better than what we proved, but for some other problems randomness can help you get approximations where we don’t know of any deterministic algorithms to get the same guarantees. I can’t think of any off the top of my head, but I’m pretty sure they’re out there.
  3. For step 1 we showed that at least one term in the inequality for e_j would be rounded up to 1, and this guaranteed we covered all the elements. A natural question is: why not also round up at most one term of each of these inequalities? It might be that in the worst case you don’t get a better guarantee, but it would be a quick extra heuristic you could use to post-process a rounded solution.
  4. Solving linear programs is slow. There are faster methods based on so-called “primal-dual” methods that use information about the dual of the linear program to construct a solution to the problem. Goemans and Williamson have a nice self-contained chapter on their website about this with a ton of applications.
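
Here’s what the randomized rounding from postscript 2 might look like; the function name and the roughly \log(n) repetition count are my choices for illustration, and the high-probability analysis isn’t reproduced here.

    import math
    import random

    def randomized_round(y):
        # Flip a coin with bias y_i about log(n) times; keep set i if any flip lands heads.
        trials = max(1, math.ceil(math.log(len(y))))
        return [1 if any(random.random() < y_i for _ in range(trials)) else 0
                for y_i in y]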

Additional Reading

Williamson and Shmoys have a large textbook called The Design of Approximation Algorithms. One problem is that this field is like a big heap of unrelated techniques, so it’s not like the book will build up some neat theoretical foundation that works for every problem. Rather, it’s messy and there are lots of details, but there are definitely diamonds in the rough, such as the problem of (and algorithms for) coloring 3-colorable graphs with “approximately 3” colors, and the infamous unique games conjecture.

I wrote a post a while back giving conditions under which the greedy algorithm gives a constant-factor approximation. This is much better than the worst-case \log(n)-approximation we saw in this post. I also wrote a post about matroids, which characterize the problems where the greedy algorithm is actually optimal.

Set cover is one of the main tools that IBM’s AntiVirus software uses to detect viruses. Similarly to the regex golf problem, they find a set of strings that occur in the source code of some viruses but not (usually) in good programs. Then they look for a small set of strings that covers all the viruses, and their virus scan just has to search binaries for those strings. Hopefully the size of your set cover is really small compared to the number of viruses you want to protect against. I can’t find a reference that details this, but that is understandable because it is proprietary software.

Until next time!