# Earthmover Distance

Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).

For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?

It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of masses of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.

All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.

Solution (Earthmover distance): Treat each sample set $A$ corresponding to a “point” as a discrete probability distribution, so that each sample $x \in A$ has probability mass $p_x = 1 / |A|$. The distance between $A$ and $B$ is the optimal value of the following linear program.

Each $x \in A$ corresponds to a pile of dirt of height $p_x$, and each $y \in B$ corresponds to a hole of depth $p_y$. The cost of moving a unit of dirt from $x$ to $y$ is the Euclidean distance $d(x, y)$ between the points (or whatever hipster metric you want to use).

Let $z_{x, y}$ be a real variable corresponding to an amount of dirt to move from $x \in A$ to $y \in B$, with cost $d(x, y)$. Then the constraints are:

• Each $z_{x, y} \geq 0$, so dirt only moves from $x$ to $y$.
• Every pile $x \in A$ must vanish, i.e. for each fixed $x \in A$, $\sum_{y \in B} z_{x,y} = p_x$.
• Likewise, every hole $y \in B$ must be completely filled, i.e. for each fixed $y \in B$, $\sum_{x \in A} z_{x,y} = p_y$.

The objective is to minimize the cost of doing this: $\sum_{(x, y) \in A \times B} d(x, y) z_{x, y}$.

In Python, using the ortools library (and leaving out a few docstrings, full code on Github):

import math
from collections import Counter, defaultdict

from ortools.linear_solver import pywraplp


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for (a, b) in zip(x, y)))


def earthmover_distance(p1, p2):
    # p1 and p2 are lists of hashable points (e.g. tuples of coordinates)
    dist1 = {x: count / len(p1) for (x, count) in Counter(p1).items()}
    dist2 = {x: count / len(p2) for (x, count) in Counter(p2).items()}
    solver = pywraplp.Solver('earthmover_distance', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

    variables = dict()

    # for each pile in dist1, the constraint that says all the dirt must leave this pile
    dirt_leaving_constraints = defaultdict(lambda: 0)

    # for each hole in dist2, the constraint that says this hole must be filled
    dirt_filling_constraints = defaultdict(lambda: 0)

    # the objective
    objective = solver.Objective()
    objective.SetMinimization()

    for (x, dirt_at_x) in dist1.items():
        for (y, capacity_of_y) in dist2.items():
            amount_to_move_x_y = solver.NumVar(0, solver.infinity(), 'z_{%s, %s}' % (x, y))
            variables[(x, y)] = amount_to_move_x_y
            dirt_leaving_constraints[x] += amount_to_move_x_y
            dirt_filling_constraints[y] += amount_to_move_x_y
            objective.SetCoefficient(amount_to_move_x_y, euclidean_distance(x, y))

    # every pile must vanish
    for x, linear_combination in dirt_leaving_constraints.items():
        solver.Add(linear_combination == dist1[x])

    # every hole must be completely filled
    for y, linear_combination in dirt_filling_constraints.items():
        solver.Add(linear_combination == dist2[y])

    status = solver.Solve()
    if status not in [solver.OPTIMAL, solver.FEASIBLE]:
        raise Exception('Unable to find feasible solution')

    return objective.Value()
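
For tiny inputs you can cross-check the linear program by brute force. Here is a minimal sketch, assuming both sample sets have the same number of samples; in that special case some optimal solution moves each pile whole into a single hole, so it is enough to try every one-to-one matching. The names `euclidean` and `brute_force_emd` are made up for this illustration and are not part of the code above.

```python
import math
from itertools import permutations


def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def brute_force_emd(samples_a, samples_b):
    """Earthmover distance for two equal-size sample lists, found by
    trying every one-to-one matching. Illustrative helper only: the
    running time is factorial, so keep the inputs very small."""
    n = len(samples_a)
    if n != len(samples_b):
        raise ValueError("this shortcut needs equally many samples on each side")
    return min(
        # each sample carries mass 1/n, hence the division by n
        sum(euclidean(samples_a[i], samples_b[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )


print(brute_force_emd([(0, 0), (1, 0)], [(0, 1), (1, 1)]))  # 1.0
```

Here both bottom samples move straight up by one unit, so the distance is 1. The linear program should agree with this on any equal-size input.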


Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.

One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves.
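
To make the same-support issue concrete, here is a minimal sketch on the real line. It assumes equally many samples on each side, in which case the Earthmover distance is obtained by matching the sorted samples in order. The function names `kl_divergence` and `emd_1d` are hypothetical helpers for this illustration.

```python
import math
from collections import Counter


def kl_divergence(samples_p, samples_q):
    """KL(P || Q) for the empirical distributions of two sample lists.
    Returns infinity as soon as P puts mass where Q has none."""
    p, q = Counter(samples_p), Counter(samples_q)
    total = 0.0
    for value, count in p.items():
        if q[value] == 0:
            return math.inf
        px = count / len(samples_p)
        qx = q[value] / len(samples_q)
        total += px * math.log(px / qx)
    return total


def emd_1d(samples_p, samples_q):
    """Earthmover distance on the real line for equally many samples:
    match the sorted samples in order and average the gaps."""
    assert len(samples_p) == len(samples_q)
    pairs = zip(sorted(samples_p), sorted(samples_q))
    return sum(abs(a - b) for a, b in pairs) / len(samples_p)


blue = [0.0, 1.0]
near = [2.0, 3.0]
far = [20.0, 21.0]
print(kl_divergence(blue, near), kl_divergence(blue, far))  # inf inf
print(emd_1d(blue, near), emd_1d(blue, far))  # 2.0 20.0
```

The divergence cannot tell the nearby cloud from the faraway one, because neither shares any support with `blue`. The Earthmover distance grows with the gap.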

Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.

This kind of attention to detail can be very important in certain situations. One that I’ve been paying close attention to recently is the problem of studying gerrymandering from a mathematical perspective. Justin Solomon of MIT is a champion of the Earthmover distance (see his fascinating talk here for more, with slides) which is just one topic in a field called “optimal transport.”

This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.

Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends from computing distances to computing important geometric values like the barycenter, and to computational advantages like parallelism.

Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.

Finally, work of Shirdhonkar and Jacobs provides approximation algorithms that allow linear-time computation, instead of the worst-case-cubic runtime of a linear solver.

# The Reasonable Effectiveness of the Multiplicative Weights Update Algorithm

Christos Papadimitriou, who studies multiplicative weights in the context of biology.

## Hard to believe

Sanjeev Arora and his coauthors consider it “a basic tool [that should be] taught to all algorithms students together with divide-and-conquer, dynamic programming, and random sampling.” Christos Papadimitriou calls it “so hard to believe that it has been discovered five times and forgotten.” It has formed the basis of algorithms in machine learning, optimization, game theory, economics, biology, and more.

What mystical algorithm has such broad applications? Now that computer scientists have studied it in generality, it’s known as the Multiplicative Weights Update Algorithm (MWUA). Procedurally, the algorithm is simple. I can even describe the core idea in six lines of pseudocode. You start with a collection of $n$ objects, and each object has a weight.

Set all the object weights to be 1.
For some large number of rounds:
    Pick an object at random proportionally to the weights
    Some event happens
    Increase the weight of the chosen object if it does well in the event
    Otherwise decrease the weight

The name “multiplicative weights” comes from how we implement the last step: if the weight of the chosen object at step $t$ is $w_t$ before the event, and $G$ represents how well the object did in the event, then we’ll update the weight according to the rule:

$\displaystyle w_{t+1} = w_t (1 + G)$

Think of this as increasing the weight by a small multiple of the object’s performance on a given round.

Here is a simple example of how it might be used. You have some money you want to invest, and you have a bunch of financial experts who are telling you what to invest in every day. So each day you pick an expert, and you follow their advice, and you either make a thousand dollars, or you lose a thousand dollars, or something in between. Then you repeat, and your goal is to figure out which expert is the most reliable.

This is how we use multiplicative weights: if we number the experts $1, \dots, N$, we give each expert a weight $w_i$ which starts at 1. Then, each day we pick an expert at random (where experts with larger weights are more likely to be picked) and at the end of the day we have some gain or loss $G$. Then we update the weight of the chosen expert by multiplying it by $(1 + G / 1000)$. Sometimes you have enough information to update the weights of experts you didn’t choose, too. The theoretical guarantees of the algorithm say we’ll find the best expert quickly (“quickly” will be concrete later).
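
Here is a minimal sketch of that investing game, assuming the full-information variant in which you learn every expert’s gain each day and update all the weights. The function `follow_experts`, its parameters, and the made-up gains table are illustrative assumptions, not code from this post’s repository.

```python
import random


def follow_experts(daily_gains, max_gain=1000, seed=0):
    """daily_gains[d][i] is what expert i's advice earned on day d.
    Returns the final weights and the total earned by following the
    randomly drawn expert each day."""
    rng = random.Random(seed)
    num_experts = len(daily_gains[0])
    weights = [1.0] * num_experts
    earned = 0
    for gains in daily_gains:
        # pick an expert with probability proportional to its weight
        chosen = rng.choices(range(num_experts), weights=weights)[0]
        earned += gains[chosen]
        # full-information variant: every expert's weight is updated
        for i, gain in enumerate(gains):
            weights[i] *= 1 + gain / max_gain
    return weights, earned


# made-up data: expert 0 is reliably good, expert 1 reliably bad, expert 2 mediocre
gains = [[500, -500, 100]] * 10
weights, earned = follow_experts(gains)
print(weights.index(max(weights)))  # 0
```

After ten days the reliable expert’s weight dwarfs the others, so it is drawn almost every time from then on.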

In fact, let’s play a game where you, dear reader, get to decide the rewards for each expert and each day. I programmed the multiplicative weights algorithm to react according to your choices. Click the image below to go to the demo.

This core mechanism of updating weights can be interpreted in many ways, and that’s part of the reason it has sprouted up all over mathematics and computer science. Just a few examples of where this has led:

1. In game theory, weights are the “belief” of a player about the strategy of an opponent. The most famous algorithm to use this is called Fictitious Play, and others include EXP3 for minimizing regret in the so-called “adversarial bandit learning” problem.
2. In machine learning, weights are the difficulty of a specific training example, so that higher weights mean the learning algorithm has to “try harder” to accommodate that example. The first result I’m aware of for this is the Perceptron (and similar Winnow) algorithm for learning hyperplane separators. The most famous is the AdaBoost algorithm.
3. Analogously, in optimization, the weights are the difficulty of a specific constraint, and this technique can be used to approximately solve linear and semidefinite programs. The approximation is because MWUA only provides a solution with some error.
4. In mathematical biology, the weights represent the fitness of individual alleles, and filtering reproductive success based on this and updating weights for successful organisms produces a mechanism very much like evolution. With modifications, it also provides a mechanism through which to understand sex in the context of evolutionary biology.
5. The TCP protocol, which basically defined the internet, uses additive and multiplicative weight updates (which are very similar in the analysis) to manage congestion.
6. You can get easy $\log(n)$-approximation algorithms for many NP-hard problems, such as set cover.

Additional, more technical examples can be found in this survey of Arora et al.

In the rest of this post, we’ll implement a generic Multiplicative Weights Update Algorithm, we’ll prove its main theoretical guarantees, and we’ll implement a linear program solver as an example of its applicability. As usual, all of the code used in the making of this post is available in a Github repository.

## The generic MWUA algorithm

Let’s start by writing down pseudocode and an implementation for the MWUA algorithm in full generality.

In general we have some set $X$ of objects and some set $Y$ of “event outcomes” which can be completely independent. If these sets are finite, we can write down a table $M$ whose rows are objects, whose columns are outcomes, and whose $i,j$ entry $M(i,j)$ is the reward produced by object $x_i$ when the outcome is $y_j$. We will also write this as $M(x, y)$ for object $x$ and outcome $y$. The only assumption we’ll make on the rewards is that the values $M(x, y)$ are bounded by some small constant $B$ (by small I mean $B$ should not require exponentially many bits to write down as compared to the size of $X$). In symbols, $M(x,y) \in [0,B]$. There are minor modifications you can make to the algorithm if you want negative rewards, but for simplicity we will leave that out. Note the table $M$ just exists for analysis, and the algorithm does not know its values. Moreover, while the values in $M$ are static, the choice of outcome $y$ for a given round may be nondeterministic.

The MWUA algorithm randomly chooses an object $x \in X$ in every round, observing the outcome $y \in Y$, and collecting the reward $M(x,y)$ (or losing it as a penalty). The guarantee of the MWUA theorem is that the expected sum of rewards/penalties of MWUA is not much worse than if one had picked the best object (in hindsight) every single round.

Let’s describe the algorithm in notation first and build up pseudocode as we go. The input to the algorithm is the set of objects, a subroutine that observes an outcome, a black-box reward function, a learning rate parameter, and a number of rounds.

def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
    ...


We define for object $x$ a nonnegative number $w_x$ we call a “weight.” The weights will change over time so we’ll also subscript a weight with a round number $t$, i.e. $w_{x,t}$ is the weight of object $x$ in round $t$. Initially, all the weights are $1$. Then MWUA continues in rounds. We start each round by drawing an object randomly with probability proportional to the weights. Then we observe the outcome for that round and the reward for that round.

import random

# draw: [float] -> int
# pick an index from the given list of floats proportionally
# to the size of the entry (i.e. normalize to a probability
# distribution and draw according to the probabilities).
def draw(weights):
    choice = random.uniform(0, sum(weights))
    choiceIndex = 0

    for weight in weights:
        choice -= weight
        if choice <= 0:
            return choiceIndex

        choiceIndex += 1

    # guard against floating point rounding leaving choice barely positive
    return len(weights) - 1

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
    weights = [1] * len(objects)
    for t in range(numRounds):
        chosenObjectIndex = draw(weights)
        chosenObject = objects[chosenObjectIndex]

        outcome = observeOutcome(t, weights, chosenObject)
        thisRoundReward = reward(chosenObject, outcome)

        ...


Sampling objects in this way is the same as associating a distribution $D_t$ to each round, where if $S_t = \sum_{x \in X} w_{x,t}$ then the probability of drawing $x$, which we denote $D_t(x)$, is $w_{x,t} / S_t$. We don’t need to keep track of this distribution in the actual run of the algorithm, but it will help us with the mathematical analysis.
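
As a tiny illustration of $D_t$, here is a sketch that normalizes a list of weights into probabilities. The helper name `distribution` is an assumption for this example; the algorithm itself never needs to call it.

```python
def distribution(weights):
    """Normalize nonnegative weights into the probabilities D_t(x).
    Illustrative helper, used only for the analysis."""
    total = sum(weights)  # this is S_t
    return [w / total for w in weights]


print(distribution([1, 1, 2]))  # [0.25, 0.25, 0.5]
```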

Next comes the weight update step. Let’s call our learning rate variable parameter $\varepsilon$. In round $t$ say we have object $x_t$ and outcome $y_t$, then the reward is $M(x_t, y_t)$. We update the weight of the chosen object $x_t$ according to the formula:

$\displaystyle w_{x_t, t+1} = w_{x_t, t} (1 + \varepsilon M(x_t, y_t) / B)$

In the more general event that you have rewards for all objects (if not, the reward-producing function can output zero), you would perform this weight update on all objects $x \in X$. This turns into the following Python snippet, where we hide the division by $B$ into the choice of learning rate:

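# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
    weights = [1] * len(objects)
    cumulativeReward = 0
    outcomes = []

    for t in range(numRounds):
        chosenObjectIndex = draw(weights)
        chosenObject = objects[chosenObjectIndex]

        outcome = observeOutcome(t, weights, chosenObject)
        outcomes.append(outcome)

        thisRoundReward = reward(chosenObject, outcome)
        cumulativeReward += thisRoundReward

        # update every object's weight, not just the chosen one
        for i in range(len(weights)):
            weights[i] *= (1 + learningRate * reward(objects[i], outcome))

    # the linear program solver later in this post uses all three values
    return weights, cumulativeReward, outcomes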


One of the amazing things about this algorithm is that the outcomes and rewards could be chosen adaptively by an adversary who knows everything about the MWUA algorithm (except which random numbers the algorithm generates to make its choices). This means that the rewards in round $t$ can depend on the weights in that same round! We will exploit this when we solve linear programs later in this post.

But even in such an oppressive, exploitative environment, MWUA persists and achieves its guarantee. And now we can state that guarantee.

Theorem (from Arora et al): The cumulative reward of the MWUA algorithm is, up to constant multiplicative factors, at least the cumulative reward of the best object minus $\log(n)$, where $n$ is the number of objects. (Exact formula at the end of the proof)

The core of the proof, which we’ll state as a lemma, uses one of the most elegant proof techniques in all of mathematics. It’s the idea of constructing a potential function, and tracking the change in that potential function over time. Such a proof usually has the mysterious script:

1. Define potential function, in our case $S_t$.
2. State what seems like trivial facts about the potential function to write $S_{t+1}$ in terms of $S_t$, and hence get general information about $S_T$ for some large $T$.
3. Theorem is proved.
4. Wait, what?

Clearly, coming up with a useful potential function is a difficult and prized skill.

In this proof our potential function is the sum of the weights of the objects in a given round, $S_t = \sum_{x \in X} w_{x, t}$. Now the lemma.

Lemma: Let $B$ be the bound on the size of the rewards, and $0 < \varepsilon < 1/2$ a learning parameter. Recall that $D_t(x)$ is the probability that MWUA draws object $x$ in round $t$. Write the expected reward for MWUA for round $t$ as the following (using only the definition of expected value):

$\displaystyle R_t = \sum_{x \in X} D_t(x) M(x, y_t)$

Then the claim of the lemma is:

$\displaystyle S_{t+1} \leq S_t e^{\varepsilon R_t / B}$

Proof. Expand $S_{t+1} = \sum_{x \in X} w_{x, t+1}$ using the definition of the MWUA update:

$\displaystyle \sum_{x \in X} w_{x, t+1} = \sum_{x \in X} w_{x, t}(1 + \varepsilon M(x, y_t) / B)$

Now distribute $w_{x, t}$ and split into two sums:

$\displaystyle \dots = \sum_{x \in X} w_{x, t} + \frac{\varepsilon}{B} \sum_{x \in X} w_{x,t} M(x, y_t)$

Using the fact that $D_t(x) = \frac{w_{x,t}}{S_t}$, we can replace $w_{x,t}$ with $D_t(x) S_t$, which allows us to get $R_t$

\displaystyle \begin{aligned} \dots &= S_t + \frac{\varepsilon S_t}{B} \sum_{x \in X} D_t(x) M(x, y_t) \\ &= S_t \left ( 1 + \frac{\varepsilon R_t}{B} \right ) \end{aligned}

And then using the fact that $(1 + x) \leq e^x$ (Taylor series), we can bound the last expression by $S_te^{\varepsilon R_t / B}$, as desired.

$\square$
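
The lemma is easy to check numerically for a single round. This is a minimal sketch with made-up weights and rewards; `potential_step` is a hypothetical helper that performs one full-information update and returns the quantities from the proof.

```python
import math


def potential_step(weights, rewards, eps, B):
    """One full-information MWUA update. Returns the new weights,
    the expected reward R_t, and the lemma's upper bound on S_{t+1}.
    Illustrative helper; rewards[i] plays the role of M(x_i, y_t)."""
    S_t = sum(weights)
    R_t = sum(w / S_t * r for w, r in zip(weights, rewards))
    new_weights = [w * (1 + eps * r / B) for w, r in zip(weights, rewards)]
    return new_weights, R_t, S_t * math.exp(eps * R_t / B)


new_weights, R_t, bound = potential_step(
    [1.0, 2.0, 1.0], [3.0, 0.0, 1.5], eps=0.25, B=3.0
)
print(sum(new_weights) <= bound)  # True
```

In this example $S_t = 4$ and $R_t = 1.125$, so $S_{t+1} = 4(1 + 0.25 \cdot 1.125 / 3) = 4.375$, which sits just below $4 e^{0.09375}$.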

Now using the lemma, we can get a hold on $S_T$ for a large $T$, namely that

$\displaystyle S_T \leq S_1 e^{\varepsilon \sum_{t=1}^T R_t / B}$

If $|X| = n$ then $S_1=n$, simplifying the above. Moreover, the sum of the weights in round $T$ is certainly greater than any single weight, and since $1 + \varepsilon a \geq (1 + \varepsilon)^a$ for $a \in [0,1]$, for every fixed object $x \in X$,

$\displaystyle S_T \geq w_{x,T} \geq (1 + \varepsilon)^{\sum_t M(x, y_t) / B}$

Squeezing $S_T$ between these two inequalities and taking logarithms (to simplify the exponents) gives

$\displaystyle \left ( \sum_t M(x, y_t) / B \right ) \log(1+\varepsilon) \leq \log n + \frac{\varepsilon}{B} \sum_t R_t$

Multiply through by $B$, divide by $\varepsilon$, rearrange, and use the fact that when $0 < \varepsilon < 1/2$ we have $\log(1 + \varepsilon) \geq \varepsilon - \varepsilon^2$ (Taylor series) to get

$\displaystyle \sum_t R_t \geq \left [ \sum_t M(x, y_t) \right ] (1-\varepsilon) - \frac{B \log n}{\varepsilon}$

The bracketed term is the payoff of object $x$, and MWUA’s payoff is at least a fraction of that minus the logarithmic term. The bound applies to any object $x \in X$, and hence to the best one. This proves the theorem.

$\square$
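
You can also watch the theorem hold on a small example. The sketch below assumes the full-information update and a fixed, made-up outcome sequence, and it computes the expected rewards $R_t$ exactly rather than sampling. The function `mwua_expected_reward` and the table `M` are illustrative, not from the post’s codebase.

```python
import math


def mwua_expected_reward(M, outcomes, eps, B):
    """Run full-information MWUA on reward table M (rows are objects,
    columns are outcomes) against a fixed outcome sequence.
    Returns (sum of expected rewards R_t, the theorem's lower bound)."""
    n = len(M)
    weights = [1.0] * n
    expected_total = 0.0
    for y in outcomes:
        S_t = sum(weights)
        expected_total += sum(w / S_t * M[x][y] for x, w in enumerate(weights))
        weights = [w * (1 + eps * M[x][y] / B) for x, w in enumerate(weights)]
    # payoff of the best single object in hindsight
    best_total = max(sum(row[y] for y in outcomes) for row in M)
    lower_bound = (1 - eps) * best_total - B * math.log(n) / eps
    return expected_total, lower_bound


# made-up rewards in [0, 1]: two specialists and one steady generalist
M = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
outcomes = [0, 1] * 100
expected_total, lower_bound = mwua_expected_reward(M, outcomes, eps=0.1, B=1.0)
print(expected_total >= lower_bound)  # True
```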

Briefly discussing the bound itself, we see that the smaller the learning rate is, the closer you eventually get to the best object, but by contrast the more the subtracted quantity $B \log(n) / \varepsilon$ hurts you. If your target is an absolute error bound against the best performing object on average, you can do more algebra to determine how many rounds you need in terms of a fixed $\delta$. The answer is roughly: let $\varepsilon = O(\delta / B)$ and pick $T = O(B^2 \log(n) / \delta^2)$. See this survey for more.
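
To see where choices like these come from, here is one way to do that algebra. It is a sketch with constants I picked for convenience, not necessarily the constants used in the survey. Divide the bound by $T$ and use $\sum_t M(x, y_t) \leq BT$:

$\displaystyle \frac{1}{T} \sum_t R_t \geq \frac{1}{T} \sum_t M(x, y_t) - \varepsilon B - \frac{B \log n}{\varepsilon T}$

Choosing $\varepsilon = \delta / (2B)$ makes the middle term $\delta / 2$, and then $T = 4 B^2 \log(n) / \delta^2$ makes the last term $\delta / 2$ as well, so the average reward is within $\delta$ of the best object’s average. This needs $\delta < B$ so that $\varepsilon < 1/2$.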

## MWUA for linear programs

Now we’ll approximately solve a linear program using MWUA. Recall that a linear program is an optimization problem whose goal is to minimize (or maximize) a linear function of many variables. The objective to minimize is usually given as a dot product $c \cdot x$, where $c$ is a fixed vector and $x = (x_1, x_2, \dots, x_n)$ is a vector of non-negative variables the algorithm gets to choose. The choices for $x$ are also constrained by a set of $m$ linear inequalities, $A_i \cdot x \geq b_i$, where $A_i$ is a fixed vector and $b_i$ is a scalar for $i = 1, \dots, m$. This is usually summarized by putting all the $A_i$ in a matrix, $b_i$ in a vector, as

$x_{\textup{OPT}} = \textup{argmin}_x \{ c \cdot x \mid Ax \geq b, x \geq 0 \}$

We can further simplify the constraints by assuming we know the optimal value $Z = c \cdot x_{\textup{OPT}}$ in advance, by doing a binary search (more on this later). So, if we ignore the hard constraint $Ax \geq b$, the “easy feasible region” of possible $x$’s includes $\{ x \mid x \geq 0, c \cdot x = Z \}$.

In order to fit linear programming into the MWUA framework we have to define two things.

1. The objects: the set of linear inequalities $A_i \cdot x \geq b_i$.
2. The rewards: the error of a constraint for a special input vector $x_t$.

Number 2 is curious (why would we give a reward for error?) but it’s crucial and we’ll discuss it momentarily.

The special input $x_t$ depends on the weights in round $t$ (which is allowed, recall). Specifically, if the weights are $w = (w_1, \dots, w_m)$, we ask for a vector $x_t$ in our “easy feasible region” which satisfies

$\displaystyle (A^T w) \cdot x_t \geq w \cdot b$

For this post we call the implementation of procuring such a vector the “oracle,” since it can be seen as the black-box problem of, given a vector $\alpha$ and a scalar $\beta$ and a convex region $R$, finding a vector $x \in R$ satisfying $\alpha \cdot x \geq \beta$. This allows one to solve more complex optimization problems with the same technique, swapping in a new oracle as needed. Our choice of inputs, $\alpha = A^T w, \beta = w \cdot b$, are particular to the linear programming formulation.

Two remarks on this choice of inputs. First, the vector $A^T w$ is a weighted average of the constraints in $A$, and $w \cdot b$ is a weighted average of the thresholds. So this inequality is a “weighted average” inequality (specifically, a convex combination, since the weights are nonnegative). In particular, if no such $x$ exists, then the original linear program has no solution. Indeed, given a solution $x^*$ to the original linear program, each constraint, say $A_1 \cdot x^* \geq b_1$, is unaffected by multiplication by the nonnegative weight $w_1$, and summing the scaled constraints gives $(A^T w) \cdot x^* \geq w \cdot b$.
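
The first remark can be checked by hand or with a few lines of code. This is a minimal sketch in plain Python (no numpy, to keep it standalone); the helpers `dot` and `weighted_constraint`, the weights, and the feasible point `x` are all made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_constraint(A, b, w):
    """Combine the rows of A (and entries of b) with weights w,
    returning alpha = A^T w and beta = w . b. Illustrative helper."""
    num_vars = len(A[0])
    alpha = [sum(w[i] * A[i][j] for i in range(len(A))) for j in range(num_vars)]
    beta = dot(w, b)
    return alpha, beta


A = [[1, 2, 3], [0, 4, 2]]
b = [5, 6]
x = [1, 1, 1]  # satisfies both constraints: A x = (6, 6)

alpha, beta = weighted_constraint(A, b, w=[0.25, 2.0])
print(dot(alpha, x) >= beta)  # True
```

Any nonnegative choice of `w` gives the same verdict for a feasible `x`, which is why an oracle failure certifies that the original program is infeasible.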

Second, and more important to the conceptual understanding of this algorithm, the choice of rewards and the multiplicative updates ensure that easier constraints show up less prominently in the inequality by having smaller weights. That is, if we end up overly satisfying a constraint, we penalize that object for future rounds so we don’t waste our effort on it. The byproduct of MWUA—the weights—identify the hardest constraints to satisfy, and so in each round we can put a proportionate amount of effort into solving (one of) the hard constraints. This is why it makes sense to reward error; the error is a signal for where to improve, and by over-representing the hard constraints, we force MWUA’s attention on them.

At the end, our final output is an average of the $x_t$ produced in each round, i.e. $x^* = \frac{1}{T}\sum_t x_t$. This vector satisfies all the constraints to a roughly equal degree. We will skip the proof that this vector does what we want, but see these notes for a simple proof. We’ll spend the rest of this post implementing the scheme outlined above.

## Implementing the oracle

Fix the convex region $R = \{ x \mid c \cdot x = Z, x \geq 0 \}$ for a known optimal value $Z$. Define $\textup{oracle}(\alpha, \beta)$ as the problem of finding an $x \in R$ such that $\alpha \cdot x \geq \beta$.

For the case of this linear region $R$, we can simply find the index $i$ which maximizes $\alpha_i Z / c_i$. If this value is at least $\beta$, we can return the vector with $Z / c_i$ in the $i$-th position and zeros elsewhere. Otherwise, the problem has no solution.

To prove the “no solution” part, say $n=2$ and you have $x = (x_1, x_2)$ a solution to $\alpha \cdot x \geq \beta$. Then for whichever index makes $\alpha_i Z / c_i$ bigger, say $i=1$, you can increase $\alpha \cdot x$ without changing $c \cdot x = Z$ by replacing $x_1$ with $x_1 + (c_2/c_1)x_2$ and $x_2$ with zero. I.e., we’re moving the solution $x$ along the line $c \cdot x = Z$ until it reaches a vertex of the region bounded by $c \cdot x = Z$ and $x \geq 0$. This must happen when all entries but one are zero. This is the same reason why optimal solutions of (generic) linear programs occur at vertices of their feasible regions.

The code for this becomes quite simple. Note we use the numpy library in the entire codebase to make linear algebra operations fast and simple to read.

import numpy


class InfeasibleException(Exception):
    pass


def makeOracle(c, optimalValue):
    n = len(c)

    def oracle(weightedVector, weightedThreshold):
        # value of weightedVector . x at the vertex supported on coordinate i
        def quantity(i):
            return weightedVector[i] * optimalValue / c[i] if c[i] > 0 else -1

        biggest = max(range(n), key=quantity)
        if quantity(biggest) < weightedThreshold:
            raise InfeasibleException

        return numpy.array([optimalValue / c[i] if i == biggest else 0 for i in range(n)])

    return oracle
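
As a sanity check on the vertex argument, here is a small sketch for $n = 2$ that compares the best vertex against points sampled along the whole segment $c \cdot x = Z$, $x \geq 0$. Both helper functions and the numbers are made up for illustration, and the sketch assumes every $c_i > 0$.

```python
def best_vertex_value(alpha, c, Z):
    """Largest value of alpha . x over the vertices of
    {x >= 0, c . x = Z}, assuming every c[i] > 0."""
    return max(a * Z / ci for a, ci in zip(alpha, c))


def best_sampled_value(alpha, c, Z, steps=1000):
    """For n = 2, walk along the segment c . x = Z, x >= 0 and
    record the largest alpha . x seen."""
    best = float("-inf")
    for k in range(steps + 1):
        x1 = (Z / c[0]) * k / steps
        x2 = (Z - c[0] * x1) / c[1]
        best = max(best, alpha[0] * x1 + alpha[1] * x2)
    return best


alpha, c, Z = [2.0, 5.0], [1.0, 4.0], 3.0
print(best_vertex_value(alpha, c, Z), best_sampled_value(alpha, c, Z))  # 6.0 6.0
```

No interior point of the segment beats the better of the two endpoints, which is all the oracle relies on.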


## Implementing the core solver

The core solver implements the scheme discussed above, given the optimal value of the linear program as input. To avoid too many single-letter variable names, we use linearObjective instead of $c$.

def solveGivenOptimalValue(A, b, linearObjective, optimalValue, learningRate=0.1):
    m, n = A.shape  # m equations, n variables
    oracle = makeOracle(linearObjective, optimalValue)

    def reward(i, specialVector):
        ...

    def observeOutcome(_, weights, __):
        ...

    numRounds = 1000
    weights, cumulativeReward, outcomes = MWUA(
        range(m), observeOutcome, reward, learningRate, numRounds
    )
    averageVector = sum(outcomes) / numRounds

    return averageVector


First we make the oracle, then the reward and outcome-producing functions, then we invoke the MWUA subroutine. Here are those two functions; they are closures because they need access to $A$ and $b$. Note that neither $c$ nor the optimal value show up here.

    def reward(i, specialVector):
        constraint = A[i]
        threshold = b[i]
        return threshold - numpy.dot(constraint, specialVector)

    def observeOutcome(_, weights, __):
        weights = numpy.array(weights)
        weightedVector = A.transpose().dot(weights)
        weightedThreshold = weights.dot(b)
        return oracle(weightedVector, weightedThreshold)


## Implementing the binary search, and an example

Finally, the top-level routine. Note that the binary search for the optimal value is sophisticated (though it could be more sophisticated). It takes a max range for the search, and invokes the optimization subroutine, moving the upper bound down if the linear program is feasible and moving the lower bound up otherwise.

def solve(A, b, linearObjective, maxRange=1000):
    optRange = [0, maxRange]
    result = None  # stays None if no proposed value was ever feasible

    while optRange[1] - optRange[0] > 1e-8:
        proposedOpt = sum(optRange) / 2
        print("Attempting to solve with proposedOpt=%G" % proposedOpt)

        # Because the binary search starts so high, it results in extreme
        # reward values that must be tempered by a slow learning rate. Exercise
        # to the reader: determine absolute bounds for the rewards, and set
        # this learning rate in a more principled fashion.
        learningRate = 1 / max(2 * proposedOpt * c for c in linearObjective)
        learningRate = min(learningRate, 0.1)

        try:
            result = solveGivenOptimalValue(A, b, linearObjective, proposedOpt, learningRate)
            optRange[1] = proposedOpt
        except InfeasibleException:
            optRange[0] = proposedOpt

    return result
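
Stripped of the linear programming details, the search has the following shape. This is a minimal sketch, assuming feasibility is monotone in the proposed value; `smallest_feasible` and the toy predicate are hypothetical stand-ins for the solver call above.

```python
def smallest_feasible(is_feasible, low, high, tolerance=1e-8):
    """Binary search for the smallest value in [low, high] at which
    is_feasible returns True, assuming feasibility is monotone
    (infeasible below some threshold, feasible above it)."""
    while high - low > tolerance:
        middle = (low + high) / 2
        if is_feasible(middle):
            high = middle  # a solution exists; try a smaller objective
        else:
            low = middle  # no solution; the optimum must be larger
    return high


# toy stand-in for "the linear program has a solution with c . x = Z"
print(round(smallest_feasible(lambda Z: Z >= 3, 0, 1000), 6))  # 3.0
```

With a starting range of 1000 and a tolerance of $10^{-8}$ the loop runs 37 times, which matches the number of attempts printed in the example output.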


Finally, a simple example:

A = numpy.array([[1, 2, 3], [0, 4, 2]])
b = numpy.array([5, 6])
c = numpy.array([1, 2, 1])

x = solve(A, b, c)
print(x)
print(c.dot(x))
print(A.dot(x) - b)


The output:

Attempting to solve with proposedOpt=500
Attempting to solve with proposedOpt=250
Attempting to solve with proposedOpt=125
Attempting to solve with proposedOpt=62.5
Attempting to solve with proposedOpt=31.25
Attempting to solve with proposedOpt=15.625
Attempting to solve with proposedOpt=7.8125
Attempting to solve with proposedOpt=3.90625
Attempting to solve with proposedOpt=1.95312
Attempting to solve with proposedOpt=2.92969
Attempting to solve with proposedOpt=3.41797
Attempting to solve with proposedOpt=3.17383
Attempting to solve with proposedOpt=3.05176
Attempting to solve with proposedOpt=2.99072
Attempting to solve with proposedOpt=3.02124
Attempting to solve with proposedOpt=3.00598
Attempting to solve with proposedOpt=2.99835
Attempting to solve with proposedOpt=3.00217
Attempting to solve with proposedOpt=3.00026
Attempting to solve with proposedOpt=2.99931
Attempting to solve with proposedOpt=2.99978
Attempting to solve with proposedOpt=3.00002
Attempting to solve with proposedOpt=2.9999
Attempting to solve with proposedOpt=2.99996
Attempting to solve with proposedOpt=2.99999
Attempting to solve with proposedOpt=3.00001
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3  # note %G rounds the printed values
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
[ 0.     0.987  1.026]
3.00000000425
[  5.20000072e-02   8.49831849e-09]
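
The last two printed lines are easy to verify by hand from the rounded solution vector. Here is a short sketch of that arithmetic in plain Python; the values in `x` are the three-decimal roundings printed above, so the results only match the output approximately.

```python
A = [[1, 2, 3], [0, 4, 2]]
b = [5, 6]
c = [1, 2, 1]
x = [0.0, 0.987, 1.026]  # the printed solution, rounded to three decimals


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


objective = dot(c, x)  # 2 * 0.987 + 1.026, about 3
slack = [dot(row, x) - bound for row, bound in zip(A, b)]  # about (0.052, 0)
print(objective, slack)
```

The first constraint is satisfied with a little room to spare, while the second is essentially tight.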


So there we have it. A fiendishly clever use of multiplicative weights for solving linear programs.

## Discussion

One of the nice aspects of MWUA is it’s completely transparent. If you want to know why a decision was made, you can simply look at the weights and look at the history of rewards of the objects. There’s also a clear interpretation of what is being optimized, as the potential function used in the proof is a measure of both quality and adaptability to change. The latter is why MWUA succeeds even in adversarial settings, and why it makes sense to think about MWUA in the context of evolutionary biology.

This even makes one imagine new problems that traditional algorithms cannot solve, but which MWUA handles with grace. For example, imagine trying to solve an “online” linear program in which over time a constraint can change. MWUA can adapt to maintain its approximate solution.

The linear programming technique is known in the literature as the Plotkin-Shmoys-Tardos framework for covering and packing problems. The same ideas extend to other convex optimization problems, including semidefinite programming.

If you’ve been reading this entire post screaming “This is just gradient descent!” then you’re right and wrong. It bears a striking resemblance to gradient descent (see this document for details about how special cases of MWUA are gradient descent by another name), but the adaptivity for the rewards makes MWUA different.

Even though so many people have been advocating for MWUA over the past decade, it’s surprising that it doesn’t show up in the general math/CS discourse on the internet or even in many algorithms courses. The Arora survey I referenced is from 2005 and the linear programming technique I demoed is originally from 1991! I took algorithms classes wherever I could, starting undergraduate in 2007, and I didn’t even hear a whisper of this technique until midway through my PhD in theoretical CS (I did, however, study fictitious play in a game theory class). I don’t have an explanation for why this is the case, except maybe that it takes more than 20 years for techniques to make it to the classroom. At the very least, this is one good reason to go to graduate school. You learn the things (and where to look for the things) which haven’t made it to classrooms yet.

Until next time!

# Linear Programming and the Simplex Algorithm

In the last post in this series we saw some simple examples of linear programs, derived the concept of a dual linear program, and saw the duality theorem and the complementary slackness conditions which give a rough sketch of the stopping criterion for an algorithm. This time we’ll go ahead and write this algorithm for solving linear programs, and next time we’ll apply the algorithm to an industry-strength version of the nutrition problem we saw last time. The algorithm we’ll implement is called the simplex algorithm. It was the first algorithm for solving linear programs, invented in the 1940’s by George Dantzig, and it’s still the leading practical algorithm, and it was a key part of a Nobel Prize. It’s by far one of the most important algorithms ever devised.

As usual, we’ll post all of the code written in the making of this post on this blog’s Github page.

## Slack variables and equality constraints

The simplex algorithm can solve any kind of linear program, but it only accepts a special form of the program as input. So first we have to do some manipulations. Recall that the primal form of a linear program was the following minimization problem.

$\min \left \langle c, x \right \rangle \\ \textup{s.t. } Ax \geq b, x \geq 0$

where the brackets mean “dot product.” And its dual is

$\max \left \langle y, b \right \rangle \\ \textup{s.t. } A^Ty \leq c, y \geq 0$

The linear program can actually have more complicated constraints than just the ones above. In general, one might want to have “greater than” and “less than” constraints in the same problem. It turns out that this isn’t any harder, and moreover the simplex algorithm only uses equality constraints, and with some finicky algebra we can turn any set of inequality or equality constraints into a set of equality constraints.

We’ll call our goal the “standard form,” which is as follows:

$\max \left \langle c, x \right \rangle \\ \textup{s.t. } Ax = b, x \geq 0$

It seems impossible to get the usual minimization/maximization problem into standard form until you realize there’s nothing stopping you from adding more variables to the problem. That is, say we’re given a constraint like:

$\displaystyle x_7 + x_3 \leq 10,$

we can add a new variable $\xi$, called a slack variable, so that we get an equality:

$\displaystyle x_7 + x_3 + \xi = 10$

And now we can just impose that $\xi \geq 0$. The idea is that $\xi$ represents how much “slack” there is in the inequality, and you can always choose it to make the condition an equality. So if the equality holds and the variables are nonnegative, then the $x_i$ will still satisfy their original inequality. For “greater than” constraints, we can do the same thing but subtract a nonnegative variable. Finally, if we have a minimization problem “$\min z$” we can convert it to $\max -z$.

So, to combine all of this together, if we have the following linear program with each kind of constraint,

We can add new variables $\xi_1, \xi_2$, and write it as

By defining the vector variable $x = (x_1, x_2, x_3, \xi_1, \xi_2)$ and $c = (-1,-1,-1,0,0)$ and $A$ to have $-1, 0, 1$ as appropriate for the new variables, we see that the system is written in standard form.

This is the kind of tedious transformation we can automate with a program. Assuming there are $n$ variables, the input consists of the vector $c$ of length $n$, and three matrix-vector pairs $(A, b)$ representing the three kinds of constraints. It’s a bit annoying to describe, but the essential idea is that we compute a rectangular “identity” matrix whose diagonal entries are $\pm 1$, and then join this with the original constraint matrix row-wise. The reader can see the full implementation in the Github repository for this post, though we won’t use this particular functionality in the algorithm that follows.
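
As a rough illustration of that construction (a minimal sketch, not the implementation from the repository), the function below appends one slack column per inequality. The name `toStandardForm`, its argument order, and the sample constraint data are all assumptions made for this example.

```python
def toStandardForm(cost, leA, leb, geA, geb, eqA, eqb):
    """Hypothetical helper: convert
         min <cost, x>  s.t.  leA x <= leb,  geA x >= geb,  eqA x = eqb,  x >= 0
       into  max <c, x'>  s.t.  A x' = b,  x' >= 0  by adding slack variables."""
    numSlack = len(leA) + len(geA)
    rows = []
    slackIndex = 0
    # +1 slack for each <= row, -1 slack for each >= row, none for = rows
    for constraintRows, sign in ((leA, 1), (geA, -1)):
        for row in constraintRows:
            slack = [0] * numSlack
            slack[slackIndex] = sign
            slackIndex += 1
            rows.append(list(row) + slack)
    for row in eqA:
        rows.append(list(row) + [0] * numSlack)

    c = [-x for x in cost] + [0] * numSlack   # min z becomes max -z
    b = list(leb) + list(geb) + list(eqb)
    return c, rows, b

# made-up data: one constraint of each kind on three variables
c, A, b = toStandardForm([1, 1, 1],
                         [[1, 2, 0]], [10],
                         [[0, 1, 1]], [2],
                         [[1, 0, 1]], [5])
```

The slack columns form exactly the rectangular "identity" with $\pm 1$ entries described above, padded with zero rows for the equality constraints.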

There are some other additional things we could do: for example there might be some variables that are completely unrestricted. What you do in this case is take an unrestricted variable $z$ and replace it by the difference of two nonnegative variables $z' - z''$. For simplicity we’ll ignore this, but it would be a fruitful exercise for the reader to augment the function to account for these.
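
As a small worked instance (the constraint here is made up for illustration), suppose $z$ is unrestricted and appears in $x_1 + z \leq 5$. Writing $z = z' - z''$ with $z', z'' \geq 0$ and adding a slack variable $\xi \geq 0$ gives

$\displaystyle x_1 + z' - z'' + \xi = 5$

Every real value of $z$ is still reachable: for example $z = -3$ corresponds to $z' = 0, z'' = 3$.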

## What happened to the slackness conditions?

The “standard form” of our linear program raises an obvious question: how can the complementary slackness conditions make sense if everything is an equality? It turns out that one can redo all the work one did for linear programs of the form we gave last time (minimize w.r.t. greater-than constraints) for programs in the new “standard form” above. We even get the same complementary slackness conditions! If you want to, you can do this entire routine quite a bit faster if you invoke the power of Lagrangians. We won’t do that here, but the tool shows up as a way to work with primal-dual conversions in many other parts of mathematics, so it’s a good buzzword to keep in mind.

In our case, the only difference with the complementary slackness conditions is that one of the two is trivial: $\left \langle y^*, Ax^* - b \right \rangle = 0$. This is because if our candidate solution $x^*$ is feasible, then it will have to satisfy $Ax = b$ already. The other one, that $\left \langle x^*, A^Ty^* - c \right \rangle = 0$, is the only one we need to worry about.

Again, the complementary slackness conditions give us inspiration here. Recall that, informally, they say that when a variable is used at all, it is used as much as it can be to fulfill its constraint (the corresponding dual constraint is tight). So a solution will correspond to a choice of some variables which are either used or not, and a choice of nonzero variables will correspond to a solution. We even saw this happen in the last post when we observed that broccoli trumps oranges. If we can get a good handle on how to navigate the set of these solutions, then we’ll have a nifty algorithm.

Let’s make this official and lay out our assumptions.

## Extreme points and basic solutions

Remember that the graphical way to solve a linear program is to look at the line (or hyperplane) given by $\langle c, x \rangle = q$ and keep increasing $q$ (or decreasing it, if you are minimizing) until the very last moment when this line touches the region of feasible solutions. Also recall that the “feasible region” is just the set of all solutions to $Ax = b$ with $x \geq 0$, that is, the solutions that satisfy the constraints. We imagined this picture:

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

With this geometric intuition it’s clear that there will always be an optimal solution on a vertex of the feasible region. These points are called extreme points of the feasible region. But because we will almost never work in the plane again (even introducing slack variables makes us relatively high dimensional!) we want an algebraic characterization of these extreme points.

If you have a little bit of practice with convex sets the correct definition is very natural. Recall that a set $X$ is convex if for any two points $x, y \in X$ every point on the line segment between $x$ and $y$ is also in $X$. An algebraic way to say this (thinking of these points now as vectors) is that every point $\delta x + (1-\delta) y \in X$ when $0 \leq \delta \leq 1$. Now an extreme point is just a point that isn’t on the inside of any such line, i.e. can’t be written this way for $0 < \delta < 1$. For example,

A convex set with extremal points in red. Image credit Wikipedia.

Another way to say this is that if $z$ is an extreme point then whenever $z$ can be written as $\delta x + (1-\delta) y$ for some $0 < \delta < 1$, then actually $x=y=z$. Now since our constraints are all linear (and there are a finite number of them) they won’t define a convex set with weird curves like the one above. This means that there are a finite number of extreme points that just correspond to the intersections of some of the constraints. So there are at most $2^n$ possibilities.

Indeed we want a characterization of extreme points that’s specific to linear programs in standard form, “$\max \langle c, x \rangle \textup{ s.t. } Ax=b, x \geq 0$.” And here is one.

Definition: Let $A$ be an $m \times n$ matrix with $n \geq m$. A solution $x$ to $Ax=b$ is called basic if at most $m$ of its entries are nonzero.

The reason we call it “basic” is because, under some mild assumptions we describe below, a basic solution corresponds to a vector space basis of $\mathbb{R}^m$. Which basis? The one given by the $m$ columns of $A$ used in the basic solution. We don’t need to talk about bases like this, though, so in the event of a headache just think of the basis as a set $B \subset \{ 1, 2, \dots, n \}$ of size $m$ corresponding to the nonzero entries of the basic solution.

Indeed, what we’re doing here is looking at the matrix $A_B$ formed by taking the columns of $A$ whose indices are in $B$, and the vector $x_B$ in the same way, and looking at the equation $A_Bx_B = b$. If all the parts of $x$ that we removed were zero then this will hold if and only if $Ax=b$. One might worry that $A_B$ is not invertible, so we’ll go ahead and assume it is. In fact, we’ll assume that every set of $m$ columns of $A$ forms a basis and that the rows of $A$ are also linearly independent. This is without loss of generality because if some rows or columns are not linearly independent, we can remove the offending constraints and variables without changing the set of solutions (this is why it’s so nice to work with the standard form).

Moreover, we’ll assume that every basic solution has exactly $m$ nonzero variables. A basic solution which doesn’t satisfy this assumption is called degenerate, and they’ll essentially be special corner cases in the simplex algorithm. Finally, we call a basic solution feasible if (in addition to satisfying $Ax=b$) it satisfies $x \geq 0$. Now that we’ve made all these assumptions it’s easy to see that choosing $m$ nonzero variables uniquely determines a basic feasible solution. Again calling the sub-matrix $A_B$ for a basis $B$, it’s just $x_B = A_B^{-1}b$. Now to finish our characterization, we just have to show that under the same assumptions basic feasible solutions are exactly the extremal points of the feasible region.

Proposition: A vector $x$ is a basic feasible solution if and only if it’s an extreme point of the set $\{ x : Ax = b, x \geq 0 \}$.

Proof. For one direction, suppose you have a basic feasible solution $x$, and say we write it as $x = \delta y + (1-\delta) z$ for some $0 < \delta < 1$. We want to show that this implies $y = z$. Since all of these points are in the feasible region, all of their coordinates are nonnegative. So whenever a coordinate $x_i = 0$ it must be that both $y_i = z_i = 0$. Since $x$ has exactly $n-m$ zero entries, it must be that $y, z$ both have at least $n-m$ zero entries, and hence $y,z$ are both basic. By our non-degeneracy assumption they both then have exactly $m$ nonzero entries. Let $B$ be the set of the nonzero indices of $x$. Because $Ay = Az = b$, we have $A(y-z) = 0$. Now $y-z$ has all of its nonzero entries in $B$, and because the columns of $A_B$ are linearly independent, the fact that $A_B(y-z) = 0$ implies $y-z = 0$.

In the other direction, suppose that you have some extreme point $x$ which is feasible but not basic. In other words, there are more than $m$ nonzero entries of $x$, and we’ll call the indices $J = \{ j_1, \dots, j_t \}$ where $t > m$. The columns of $A_J$ are linearly dependent (since they’re $t$ vectors in $\mathbb{R}^m$), and so let $\sum_{i=1}^t z_{j_i} A_{j_i} = 0$ be a nontrivial linear dependence among the columns of $A$. Add zeros to make the $z_{j_i}$ into a length $n$ vector $z$, so that $Az = 0$. Now

$A(x + \varepsilon z) = A(x - \varepsilon z) = Ax = b$

And if we pick $\varepsilon$ sufficiently small $x \pm \varepsilon z$ will still be nonnegative, because the only entries we’re changing of $x$ are the strictly positive ones. Then $x = \delta (x + \varepsilon z) + (1 - \delta) (x - \varepsilon z)$ for $\delta = 1/2$, but this is very embarrassing for $x$ who was supposed to be an extreme point. $\square$
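
To see the second half of the proof in action, here is a small numeric sketch. It uses the constraint data from the worked example later in this post; the particular feasible point `x`, the null vector `z`, and the step `eps` are choices made for this illustration.

```python
from fractions import Fraction

A = [[1, 2, 1, 0],
     [1, -1, 0, 1]]
b = [4, 1]

def matVec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def isFeasible(v):
    return matVec(A, v) == b and all(x >= 0 for x in v)

x = [1, 1, 1, 1]     # feasible, but it has 4 > m = 2 nonzero entries
z = [1, 0, -1, -1]   # a nonzero vector with Az = 0
eps = Fraction(1, 2)

plus = [xi + eps * zi for xi, zi in zip(x, z)]
minus = [xi - eps * zi for xi, zi in zip(x, z)]
midpoint = [(p + q) / 2 for p, q in zip(plus, minus)]

print(isFeasible(plus), isFeasible(minus), midpoint == x)
```

Both perturbed points are feasible and distinct, and `x` is their midpoint, so `x` cannot be an extreme point.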

Now that we know extreme points are the same as basic feasible solutions, we need to show that any linear program that has some solution has a basic feasible solution. This is clear geometrically: any time you have an optimum it has to either lie on a line or at a vertex, and if it lies on a line then you can slide it to a vertex without changing its value. Nevertheless, it is a useful exercise to go through the algebra.

Theorem. Whenever a linear program is feasible and bounded, it has a basic feasible solution.

Proof. Let $x$ be an optimal solution to the LP. If $x$ has at most $m$ nonzero entries then it’s a basic solution and by the non-degeneracy assumption it must have exactly $m$ nonzero entries. In this case there’s nothing to do, so suppose that $x$ has $r > m$ nonzero entries. It can’t be a basic feasible solution, and hence is not an extreme point of the set of feasible solutions (by the proposition above). So write it as $x = \delta y + (1-\delta) z$ for some feasible $y \neq z$ and $0 < \delta < 1$.

The only thing we know about $x$ is it’s optimal. Let $c$ be the cost vector, and the optimality says that $\langle c,x \rangle \geq \langle c,y \rangle$, and $\langle c,x \rangle \geq \langle c,z \rangle$. We claim that in fact these are equal, that $y, z$ are both optimal as well. Indeed, say $y$ were not optimal, then

$\displaystyle \langle c, y \rangle < \langle c,x \rangle = \delta \langle c,y \rangle + (1-\delta) \langle c,z \rangle$

Which can be rearranged to show that $\langle c,y \rangle < \langle c, z \rangle$. Unfortunately for $x$, this implies that it was not optimal all along:

$\displaystyle \langle c,x \rangle < \delta \langle c, z \rangle + (1-\delta) \langle c,z \rangle = \langle c,z \rangle$

An identical argument works to show $z$ is optimal, too. Now we claim we can use $y,z$ to get a new solution that has fewer than $r$ nonzero entries. Once we show this we’re done: inductively repeat the argument with the smaller solution until we get down to exactly $m$ nonzero variables. As before we know that $y,z$ must have at least as many zeros as $x$. If they have more zeros we’re done. And if they have exactly as many zeros we can do the following trick. Write $w = \gamma y + (1- \gamma)z$ for a $\gamma \in \mathbb{R}$ we’ll choose later. Note that no matter the $\gamma$, $w$ is optimal. Rewriting $w = z + \gamma (y-z)$, we just have to pick a $\gamma$ that ensures one of the nonzero coefficients of $z$ is zeroed out while maintaining nonnegativity. Indeed, we can just look at the index $i$ (among those with $(y-z)_i \neq 0$) which minimizes $|z_i / (y-z)_i|$ and use $\gamma = - z_i / (y-z)_i$. $\square$

So we have an immediate (and inefficient) combinatorial algorithm: enumerate all subsets of size $m$, compute the corresponding basic feasible solution $x_B = A_B^{-1}b$, and see which gives the biggest objective value. The problem is that, even if we knew the value of $m$, this would take time $n^m$, and it’s not uncommon for $m$ to be in the tens or hundreds (and if we don’t know $m$ the trivial search is exponential).
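
To make the enumeration concrete, here is a minimal sketch of that brute-force search. The names `solveSquare` and `bruteForceLP` are invented for this illustration, the code assumes the program is bounded, and it uses exact fractions to sidestep rounding.

```python
from fractions import Fraction
from itertools import combinations

def solveSquare(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination; return None if M is singular."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivotRow = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivotRow is None:
            return None
        aug[col], aug[pivotRow] = aug[pivotRow], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def bruteForceLP(c, A, b):
    """Try every size-m column subset B; keep the best basic feasible solution."""
    m, n = len(A), len(A[0])
    best = None
    for B in combinations(range(n), m):
        A_B = [[row[j] for j in B] for row in A]
        x_B = solveSquare(A_B, b)
        if x_B is None or any(v < 0 for v in x_B):
            continue  # singular basis or infeasible basic solution
        x = [Fraction(0)] * n
        for j, v in zip(B, x_B):
            x[j] = v
        value = sum(ci * xi for ci, xi in zip(c, x))
        if best is None or value > best[0]:
            best = (value, x)
    return best

# the worked example used later in this post
value, x = bruteForceLP([3, 2, 0, 0], [[1, 2, 1, 0], [1, -1, 0, 1]], [4, 1])
# value == 8 and x == [2, 1, 0, 0]
```

With $n = 4$ and $m = 2$ there are only six subsets to try, which is why this is fine for a toy problem and hopeless for a real one.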

So we have to be smarter, and this is where the simplex tableau comes in.

## The simplex tableau

Now say you have any basis $B$ and any feasible solution $x$. For now $x$ might not be a basic solution, and even if it is, its basis of nonzero entries might not be the same as $B$. We can decompose the equation $Ax = b$ into the basis part and the non basis part:

$A_Bx_B + A_{B'} x_{B'} = b$

and solving the equation for $x_B$ gives

$x_B = A^{-1}_B(b - A_{B'} x_{B'})$

It may look like we’re making a wicked abuse of notation here, but both $A_Bx_B$ and $A_{B'}x_{B'}$ are vectors of length $m$ so the dimensions actually do work out. Now our feasible solution $x$ has to satisfy $Ax = b$, and the entries of $x$ are all nonnegative, so it must be that $x_B \geq 0$ and $x_{B'} \geq 0$, and by the equality above $A^{-1}_B (b - A_{B'}x_{B'}) \geq 0$ as well. Now let’s write the maximization objective $\langle c, x \rangle$ by expanding it first in terms of the $x_B, x_{B'}$, and then expanding $x_B$.

$\displaystyle \begin{aligned} \langle c, x \rangle & = \langle c_B, x_B \rangle + \langle c_{B'}, x_{B'} \rangle \\ & = \langle c_B, A^{-1}_B(b - A_{B'}x_{B'}) \rangle + \langle c_{B'}, x_{B'} \rangle \\ & = \langle c_B, A^{-1}_Bb \rangle + \langle c_{B'} - (A^{-1}_B A_{B'})^T c_B, x_{B'} \rangle \end{aligned}$

If we want to maximize the objective, we can just maximize this last line. There are two cases. In the first, the vector $c_{B'} - (A^{-1}_B A_{B'})^T c_B \leq 0$ and $A_B^{-1}b \geq 0$. In the above equation, this tells us that making any component of $x_{B'}$ bigger can only decrease the overall objective (or leave it unchanged). In other words, $\langle c, x \rangle \leq \langle c_B, A_B^{-1}b \rangle$. Picking $x_B = A_B^{-1}b$ (with zeros in the non basis part) meets this bound and hence must be optimal. In other words, no matter what basis $B$ we’ve chosen (i.e., no matter the candidate basic feasible solution), if the two conditions hold then we’re done.

Now the crux of the algorithm is the second case: if the conditions aren’t met, we can pick a positive index of $c_{B'} - (A_B^{-1}A_{B'})^Tc_B$ and increase the corresponding value of $x_{B'}$ to increase the objective value. As we do this, other variables in the solution will change as well (by decreasing), and we have to stop when one of them hits zero. In doing so, this changes the basis by removing one index and adding another. In reality, we’ll figure out how much to increase ahead of time, and the change will correspond to a single elementary row-operation in a matrix.
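
The two cases can be checked by hand on a tiny instance. The sketch below computes $x_B = A_B^{-1}b$ and the vector $c_{B'} - (A_B^{-1}A_{B'})^T c_B$ for a given basis. It is restricted to $m = 2$ so that an explicit $2 \times 2$ inverse suffices, and the helper names (`inverse2x2`, `matMul`, `reducedCosts`) are assumptions made for this illustration. The data is the worked example from later in this post.

```python
from fractions import Fraction

def inverse2x2(M):
    """Exact inverse of a 2x2 matrix (assumed invertible)."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def matMul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def reducedCosts(c, A, b, B):
    """Return (x_B, r) where x_B = A_B^{-1} b and
       r = c_{B'} - (A_B^{-1} A_{B'})^T c_B for a two-element basis B."""
    nonBasis = [j for j in range(len(c)) if j not in B]
    A_B = [[row[j] for j in B] for row in A]
    A_N = [[row[j] for j in nonBasis] for row in A]
    inv = inverse2x2(A_B)
    x_B = [row[0] for row in matMul(inv, [[v] for v in b])]
    T = matMul(inv, A_N)                      # A_B^{-1} A_{B'}
    c_B = [c[j] for j in B]
    r = [c[j] - sum(T[i][k] * c_B[i] for i in range(len(B)))
         for k, j in enumerate(nonBasis)]
    return x_B, r

c = [3, 2, 0, 0]
A = [[1, 2, 1, 0],
     [1, -1, 0, 1]]
b = [4, 1]

slackBasis = reducedCosts(c, A, b, [2, 3])    # r has positive entries: second case
finalBasis = reducedCosts(c, A, b, [0, 1])    # r <= 0 and x_B >= 0: first case
```

For the slack basis the vector is $(3, 2)$, so the objective can still be improved. For the basis $\{x_1, x_2\}$ it is $(-5/3, -4/3)$ with $x_B = (2, 1)$, so that basis is optimal.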

Indeed, the matrix we’ll use to represent all of this data is called a tableau in the literature. The columns of the tableau will correspond to variables, and the rows to constraints. The last row of the tableau will maintain a candidate solution $y$ to the dual problem. Here’s a rough picture to keep the different parts clear while we go through the details.

But to make it work we do a slick trick, which is to “left-multiply everything” by $A_B^{-1}$. In particular, if we have an LP given by $c, A, b$, then for any basis it’s equivalent to the LP given by $c, A_B^{-1}A, A_{B}^{-1} b$ (just multiply your solution to the new program by $A_B$ to get a solution to the old one). And so the actual tableau will be of this form.

When we say it’s in this form, it’s really only true up to rearranging columns. This is because the chosen basis will always be represented by an identity matrix (as it is to start with), so to find the basis you can find the embedded identity sub-matrix. In fact, the beginning of the simplex algorithm will have the initial basis sitting in the last few columns of the tableau.

Let’s look a little bit closer at the last row. The first portion is zero because $A_B^{-1}A_B$ is the identity. But furthermore with this $A_B^{-1}$ trick the dual LP involves $A_B^{-1}$ everywhere there’s a variable. In particular, joining all but the last column of the last row of the tableau, we have the vector $c - A^T(A_B^{-1})^T c_B$, and setting $y = (A_B^{-1})^T c_B$ we get a candidate solution for the dual. What makes the trick even slicker is that $A_B^{-1}b$ is already the candidate solution $x_B$, since $(A_B^{-1}A)_B^{-1}$ is the identity. So we’re implicitly keeping track of two solutions here, one for the primal LP, given by the last column of the tableau, and one for the dual, contained in the last row of the tableau.

I told you the last row was the dual solution, so why all the other crap there? This is the final slick in the trick: the last row further encodes the complementary slackness conditions. Now that we recognize the dual candidate sitting there, the complementary slackness conditions simply ask for the last row to be non-positive (this is just another way of saying what we said at the beginning of this section!). You should check this, but it gives us a stopping criterion: if the last row is non-positive then stop and output the last column.
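
Here is a hedged sketch of that check, using the final tableau of the worked example from later in this post (typed in with exact fractions). It assumes the slack columns started out as an identity block with zero cost, so the last-row entries under them equal $-y$. It also uses the fact that the dual of the standard form is $\min \langle b, y \rangle$ subject to $A^Ty \geq c$ with $y$ unrestricted.

```python
from fractions import Fraction as F

c = [3, 2, 0, 0]
A = [[1, 2, 1, 0],
     [1, -1, 0, 1]]
b = [4, 1]

# final tableau of the worked example, written with exact fractions
tableau = [[0, 1, F(1, 3), F(-1, 3), 1],
           [1, 0, F(1, 3), F(2, 3), 2],
           [0, 0, F(-5, 3), F(-4, 3), -8]]

slackColumns = [2, 3]   # assumption: these columns began as the identity
y = [-tableau[-1][j] for j in slackColumns]

# dual feasibility: every entry of A^T y - c must be nonnegative
dualSlack = [sum(A[i][j] * y[i] for i in range(len(A))) - c[j]
             for j in range(len(c))]
dualValue = sum(bi * yi for bi, yi in zip(b, y))
primalValue = -tableau[-1][-1]

print(y, dualSlack, dualValue == primalValue)
```

The vector `dualSlack` is exactly the negated last row, so a non-positive last row is the same statement as dual feasibility, and the matching objective values certify optimality.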

## The simplex algorithm

Now (finally!) we can describe and implement the simplex algorithm in its full glory. Recall that our informal setup has been:

1. Find an initial basic feasible solution, and set up the corresponding tableau.
2. Find a positive index of the last row, and increase the corresponding variable (adding it to the basis) just enough to make another variable from the basis zero (removing it from the basis).
3. Repeat step 2 until the last row is nonpositive.
4. Output the last column.

This is almost correct, except for some details about how increasing the corresponding variables works. What we’ll really do is represent the basis variables as pivots (ones in the tableau) and then the first 1 in each row will be the variable whose value is given by the entry in the last column of that row. So, for example, the last entry in the first row may be the optimal value for $x_5$, if the fifth column is the first entry in row 1 to have a 1.

As we describe the algorithm, we’ll illustrate it running on a simple example. In doing this we’ll see what all the different parts of the tableau correspond to from the previous section in each step of the algorithm.

Spoiler alert: the optimum is $x_1 = 2, x_2 = 1$ and the value of the max is 8.

So let’s be more programmatically formal about this. The main routine is essentially pseudocode, and the difficulty is in implementing the helper functions.

def simplex(c, A, b):
    tableau = initialTableau(c, A, b)

    while canImprove(tableau):
        pivot = findPivotIndex(tableau)
        pivotAbout(tableau, pivot)

    return primalSolution(tableau), objectiveValue(tableau)


Let’s start with the initial tableau. We’ll assume the user’s inputs already include the slack variables. In particular, our example data before adding slack is

c = [3, 2]
A = [[1, 2], [1, -1]]
b = [4, 1]

And after adding slack:

c = [3, 2, 0, 0]
A = [[1,  2,  1,  0],
     [1, -1,  0,  1]]
b = [4, 1]


Now to set up the initial tableau we need an initial feasible solution in mind. The reader is recommended to work this part out with a pencil, since it’s much easier to write down than it is to explain. Since we introduced slack variables, our initial feasible solution (basis) $B$ can just be $(0,0,1,1)$. And so $x_B$ is just the slack variables, $c_B$ is the zero vector, and $A_B$ is the 2×2 identity matrix. Now $A_B^{-1}A_{B'} = A_{B'}$, which is just the original two columns of $A$ we started with, and $A_B^{-1}b = b$. For the last row, $c_B$ is zero so the part under $A_B^{-1}A_B$ is the zero vector. The part under $A_B^{-1}A_{B'}$ is just $c_{B'} = (3,2)$.

Rather than move columns around every time the basis $B$ changes, we’ll keep the tableau columns in order of $(x_1, \dots, x_n, \xi_1, \dots, \xi_m)$. In other words, for our example the initial tableau should look like this.

[[ 1,  2,  1,  0,  4],
 [ 1, -1,  0,  1,  1],
 [ 3,  2,  0,  0,  0]]


So implementing initialTableau is just a matter of putting the data in the right place.

def initialTableau(c, A, b):
    tableau = [row[:] + [x] for row, x in zip(A, b)]
    tableau.append(c[:] + [0])
    return tableau


As an aside: in the event that we don’t start with the trivial basic feasible solution of “trivially use the slack variables,” we’d have to do a lot more work in this function. Next, the primalSolution() and objectiveValue() functions are simple, because they just extract the encoded information out from the tableau (some helper functions are omitted for brevity).

def primalSolution(tableau):
    # the pivot columns denote which variables are used
    columns = transpose(tableau)
    indices = [j for j, col in enumerate(columns[:-1]) if isPivotCol(col)]
    # a pivot column has a single 1; the row holding it gives that variable's value
    return [(j, columns[-1][columns[j].index(1)]) for j in indices]

def objectiveValue(tableau):
    return -(tableau[-1][-1])
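
The omitted helpers are small. The versions below are a guess at what they might look like, not the repository’s code: `transpose` flips rows and columns, and `isPivotCol` tests for a column of zeros with a single 1. The exact comparisons are an assumption that works for this post’s small example; a tolerance would be safer with floating point in general.

```python
def transpose(matrix):
    # rows become columns; each column is returned as a list
    return [list(col) for col in zip(*matrix)]

def isPivotCol(col):
    # exactly one entry equal to 1 and zeros everywhere else, including the
    # last-row entry (basis columns have a zero in the last row)
    return col[-1] == 0 and all(x in (0, 1) for x in col) and sum(col) == 1
```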


Similarly, the canImprove() function just checks if there’s a positive entry in the last row:

def canImprove(tableau):
    lastRow = tableau[-1]
    return any(x > 0 for x in lastRow[:-1])


Let’s run the first loop of our simplex algorithm. The first step is checking to see if anything can be improved (in our example it can). Then we have to find a pivot entry in the tableau. This part includes some edge-case checking, but if the edge cases aren’t a problem then the strategy is simple: find a positive entry corresponding to some entry $j$ of $B'$, and then pick an appropriate entry in that column to use as the pivot. Pivoting increases the value of $x_j$ (from zero) to whatever is the largest we can make it without making some other variables become negative. As we’ve said before, we’ll stop increasing $x_j$ when some other variable hits zero, and we can compute which will be the first to do so by looking at the current values of $x_B = A_B^{-1}b$ (in the last column of the tableau), and seeing how pivoting will affect them. If you stare at it for long enough, it becomes clear that the first variable to hit zero will be the entry $x_i$ of the basis for which $x_i / A_{i,j}$ is minimal (and $A_{i,j}$ has to be positive). This is because, in order to maintain the linear equalities, every entry of $x_B$ will be decreased by that value during a pivot, and we can’t let any of the variables become negative.

All of this results in the following function, where we have left out the degeneracy/unboundedness checks.

[UPDATE 2018-04-21]: The pivot choices are not as simple as I thought at the time I wrote this. See the discussion on this issue, but the short story is that I was increasing the variable too much, and to fix it it’s easier to update the pivot column choice to be the smallest positive entry of the last row. The code on github is updated to reflect that, but this post will remain unchanged.

def findPivotIndex(tableau):
    # pick first positive index of the last row
    column = [i for i,x in enumerate(tableau[-1][:-1]) if x > 0][0]
    quotients = [(i, r[-1] / r[column]) for i,r in enumerate(tableau[:-1]) if r[column] > 0]

    # pick row index minimizing the quotient
    row = min(quotients, key=lambda x: x[1])[0]
    return row, column
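
One of the checks left out is easy to sketch. If the chosen column has no positive entry in any constraint row, then nothing stops $x_j$ from increasing forever and the program is unbounded (in the function above, `quotients` would be empty and `min` would fail). The names `isUnboundedColumn` and `findPivotIndexChecked` are hypothetical, and this is not necessarily how the repository handles it.

```python
def isUnboundedColumn(tableau, column):
    """Hypothetical check: True if no constraint row limits the entering variable."""
    return all(row[column] <= 0 for row in tableau[:-1])

def findPivotIndexChecked(tableau):
    # same column choice as findPivotIndex: first positive entry of the last row
    column = [i for i, x in enumerate(tableau[-1][:-1]) if x > 0][0]
    if isUnboundedColumn(tableau, column):
        raise ValueError("linear program is unbounded")

    quotients = [(i, r[-1] / r[column])
                 for i, r in enumerate(tableau[:-1]) if r[column] > 0]
    row = min(quotients, key=lambda x: x[1])[0]
    return row, column
```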


For our example, the minimizer is the $(1,0)$ entry (second row, first column). Pivoting is just doing the usual elementary row operations (we covered this in a primer a while back on row-reduction). The pivot function we use here is no different, and in particular mutates the list in place.

def pivotAbout(tableau, pivot):
    i,j = pivot

    pivotDenom = tableau[i][j]
    tableau[i] = [x / pivotDenom for x in tableau[i]]

    for k,row in enumerate(tableau):
        if k != i:
            pivotRowMultiple = [y * tableau[k][j] for y in tableau[i]]
            tableau[k] = [x - y for x,y in zip(tableau[k], pivotRowMultiple)]


And in our example pivoting around the chosen entry gives the new tableau.

[[ 0.,  3.,  1., -1.,  3.],
 [ 1., -1.,  0.,  1.,  1.],
 [ 0.,  5.,  0., -3., -3.]]


In particular, $B$ is now $(1,0,1,0)$, since our pivot removed the second slack variable $\xi_2$ from the basis. Currently our solution has $x_1 = 1, \xi_1 = 3$. Notice how the identity submatrix is still sitting in there, the columns are just swapped around.

There’s still a positive entry in the bottom row, so let’s continue. The next pivot is (0,1), and pivoting around that entry gives the following tableau:

[[ 0.        ,  1.        ,  0.33333333, -0.33333333,  1.        ],
 [ 1.        ,  0.        ,  0.33333333,  0.66666667,  2.        ],
 [ 0.        ,  0.        , -1.66666667, -1.33333333, -8.        ]]


And because all of the entries in the bottom row are non-positive, we’re done. We read off the solution as we described, so that the first variable is 2 and the second is 1, and the objective value is the opposite of the bottom right entry, 8.

To see all of the source code, including the edge-case-checking we left out of this post, see the Github repository for this post.

An obvious question is: what is the runtime of the simplex algorithm? Is it polynomial in the size of the tableau? Is it even guaranteed to stop at some point? The surprising truth is that nobody knows the answer to all of these questions! Originally (in the 1940’s) the simplex algorithm actually had an exponential runtime in the worst case, though this was not known until 1972. And indeed, to this day while some variations are known to terminate, no variation is known to have polynomial runtime in the worst case. Some of the choices we made in our implementation (for example, picking the first column with a positive entry in the bottom row) have the potential to cycle, i.e., variables leave and enter the basis without changing the objective at all. Doing something like picking a random positive column, or picking the column which will increase the objective value by the largest amount are alternatives. Unfortunately, every pivot-picking rule that has been analyzed so far is known to give rise to exponential-time (or at least super-polynomial) simplex algorithms in the worst case (in fact, some of these results were discovered as recently as 2011!). So it remains open whether there is a variant of the simplex method that runs in guaranteed polynomial time.
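
One variation that is known to terminate is Bland’s rule, which this post does not otherwise use; naming it here is an addition. The entering variable is the lowest-index column with a positive last-row entry, and ties in the ratio test are broken in favor of the row whose basic variable has the lowest index. Here is a minimal sketch against the tableau layout above, with hypothetical helper names:

```python
def basicVariableOfRow(tableau, i):
    """Index of the basis column whose single 1 sits in row i (hypothetical helper)."""
    numRows = len(tableau)
    for j in range(len(tableau[0]) - 1):
        col = [tableau[k][j] for k in range(numRows)]
        if col[i] == 1 and all(col[k] == 0 for k in range(numRows) if k != i):
            return j
    raise ValueError("row %d has no basic variable" % i)

def blandPivotIndex(tableau):
    # entering variable: lowest-index column with a positive last-row entry
    column = min(j for j, x in enumerate(tableau[-1][:-1]) if x > 0)

    candidates = [(r[-1] / r[column], i)
                  for i, r in enumerate(tableau[:-1]) if r[column] > 0]
    smallest = min(q for q, _ in candidates)
    tied = [i for q, i in candidates if q == smallest]

    # leaving variable: among tied rows, the basic variable with the lowest index
    row = min(tied, key=lambda i: basicVariableOfRow(tableau, i))
    return row, column
```

On the example in this post there are no ties, so it picks the same pivots as `findPivotIndex`. The two differ only when several rows achieve the minimum quotient.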

But then, in a stunning turn of events, Leonid Khachiyan proved in the 70’s that in fact linear programs can always be solved in polynomial time, via a completely different algorithm called the ellipsoid method. Following that was a method called the interior point method, which is significantly more efficient. Both of these algorithms generalize to problems that are harder than linear programming as well, so we will probably cover them in the distant future of this blog.

Despite the celebratory nature of these two results, people still use the simplex algorithm for industrial applications of linear programming. The reason is that it’s much faster in practice, and much simpler to implement and experiment with.

The next obvious question has to do with the poignant observation that whole numbers are great. That is, you often want the solution to your problem to involve integers, and not real numbers. But adding the constraint that the variables in a linear program need to be integer valued (even just 0-1 valued!) makes the problem NP-complete. This problem is called integer linear programming, or just integer programming (IP). So we can’t hope to solve IP efficiently, and rightly so: the reader can verify easily that boolean satisfiability instances can be written as integer programs where each clause corresponds to a constraint.
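
For readers who want to see that reduction spelled out, here is a small sketch. A clause such as $x_1 \vee \neg x_2 \vee x_3$ becomes the constraint $x_1 + (1 - x_2) + x_3 \geq 1$ over 0-1 variables. The function names and the DIMACS-style clause encoding (the integer $k$ means $x_k$, and $-k$ means its negation) are choices made for this example, and the solver is a deliberately naive brute force.

```python
from itertools import product

def clauseToConstraint(clause, numVars):
    """Encode a clause as (coeffs, rhs), meaning <coeffs, x> >= rhs for x in {0,1}^n."""
    coeffs = [0] * numVars
    rhs = 1
    for literal in clause:
        index = abs(literal) - 1
        if literal > 0:
            coeffs[index] += 1
        else:
            # (1 - x_k) contributes -x_k and moves the constant 1 to the right side
            coeffs[index] -= 1
            rhs -= 1
    return coeffs, rhs

def satisfiable(clauses, numVars):
    """Brute-force search over 0-1 assignments (exponential; illustration only)."""
    constraints = [clauseToConstraint(cl, numVars) for cl in clauses]
    for x in product([0, 1], repeat=numVars):
        if all(sum(a * v for a, v in zip(coeffs, x)) >= rhs
               for coeffs, rhs in constraints):
            return x
    return None
```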

This brings up a very interesting theoretical issue: if we take an integer program and just remove the integrality constraints, and solve the resulting linear program, how far away are the two solutions? If they’re close, then we can hope to give a good approximation to the integer program by solving the linear program and somehow turning the resulting solution back into an integer solution. In fact this is a very popular technique called LP-rounding. We’ll also likely cover that on this blog at some point.

Oh there’s so much to do and so little time! Until next time.

# Linear Programming and Healthy Diets — Part 1

Optimization is by far one of the richest ways to apply computer science and mathematics to the real world. Everybody is looking to optimize something: companies want to maximize profits, factories want to maximize efficiency, investors want to minimize risk, the list just goes on and on. The mathematical tools for optimization are also some of the richest mathematical techniques. They form the cornerstone of an entire industry known as operations research, and advances in this field literally change the world.

The mathematical field is called combinatorial optimization, and the name comes from the goal of finding optimal solutions more efficiently than an exhaustive search through every possibility. This post will introduce the most central problem in all of combinatorial optimization, known as the linear program. Even better, we know how to efficiently solve linear programs, so in future posts we’ll write a program that computes the most affordable diet while meeting the recommended health standard.

## Generalizing a Specific Linear Program

Most optimization problems have two parts: an objective function, the thing we want to maximize or minimize, and constraints, rules we must abide by to ensure we get a valid solution. As a simple example you may want to minimize the amount of time you spend doing your taxes (objective function), but you certainly can’t spend a negative amount of time on them (a constraint).

The following more complicated example is the centerpiece of this post. Most people want to minimize the amount of money spent on food. At the same time, one needs to maintain a certain level of nutrition. For males ages 19-30, the United States National Institutes of Health recommends 3.7 liters of water per day, 1,000 milligrams of calcium per day, 90 milligrams of vitamin C per day, etc.

We can set up this nutrition problem mathematically, just using a few toy variables. Say we had the option to buy some combination of oranges, milk, and broccoli. Some rough estimates [1] give the following content/costs of these foods. For 0.272 USD you can get 100 grams of orange, containing a total of 40mg of calcium, 53.2mg of vitamin C, and 87g of water. For 0.100 USD you can get 100 grams of whole milk, containing 276mg of calcium, 0mg of vitamin C, and 87g of water. Finally, for 0.381 USD you can get 100 grams of broccoli containing 47mg of calcium, 89.2mg of vitamin C, and 91g of water. Here’s a table summarizing this information:

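Nutritional content and prices for 100g of three foods

| Food       | Calcium (mg) | Vitamin C (mg) | Water (g) | Price (USD per 100g) |
|------------|--------------|----------------|-----------|----------------------|
| Broccoli   | 47           | 89.2           | 91        | 0.381                |
| Whole milk | 276          | 0              | 87        | 0.100                |
| Oranges    | 40           | 53.2           | 87        | 0.272                |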

Some observations: broccoli is the most expensive but has the most vitamin C and water, whole milk doesn’t have any vitamin C but gets a ton of calcium for really cheap, and oranges are somewhere in between. So you could probably tinker with the quantities and figure out what the cheapest healthy diet is. The problem is what happens when we incorporate hundreds or thousands of food items and tens of nutrient recommendations. This simple example is just to help us build up a nice formality.

So let’s continue doing that. If we denote by $b$ the number of 100g units of broccoli we decide to buy, and $m$ the amount of milk and $r$ the amount of oranges, then we can write the daily cost of food as

$\displaystyle \text{cost}(b,m,r) = 0.381 b + 0.1 m + 0.272 r$

In the interest of being compact (and again, building toward the general linear programming formulation) we can extract the price information into a single cost vector $c = (0.381, 0.1, 0.272)$, and likewise write our variables as a vector $x = (b,m,r)$. We’re implicitly fixing an ordering on the variables that is maintained throughout the problem, but the choice of ordering doesn’t matter. Now the cost function is just the inner product (dot product) of the cost vector and the variable vector $\left \langle c,x \right \rangle$. For some reason lots of people like to write this as $c^Tx$, where $c^T$ denotes the transpose of a matrix, and we imagine that $c$ and $x$ are matrices of size $3 \times 1$. I’ll stick to using the inner product bracket notation.

Now for each type of food we get a specific amount of each nutrient, and the sum of those nutrients needs to be bigger than the minimum recommendation. For example, we want at least 1,000 mg of calcium per day, so we require that $1000 \leq 47b + 276m + 40r$. Likewise, we can write out a table of the constraints by looking at the columns of our table above.

$\displaystyle \begin{matrix} 91b & + & 87m & + & 87r & \geq & 3700 & \text{(water)}\\ 47b & + & 276m & + & 40r & \geq & 1000 & \text{(calcium)} \\ 89.2b & + & 0m & + & 53.2r & \geq & 90 & \text{(vitamin C)} \end{matrix}$

In the same way that we extracted the cost data into a vector to separate it from the variables, we can extract all of the nutrient data into a matrix $A$, and the recommended minimums into a vector $v$. Traditionally the letter $b$ is used for the minimums vector, but for now we’re using $b$ for broccoli.

$A = \begin{pmatrix} 91 & 87 & 87 \\ 47 & 276 & 40 \\ 89.2 & 0 & 53.2 \end{pmatrix}$

$v = \begin{pmatrix} 3700 \\ 1000 \\ 90 \end{pmatrix}$

And now the constraint is that $Ax \geq v$, where the $\geq$ means “greater than or equal to in every coordinate.” So now we can write down the more general form of the problem for our specific matrices and vectors. That is, our problem is to minimize $\left \langle c,x \right \rangle$ subject to the constraint that $Ax \geq v$. This is often written in offset form to contrast it with variations we’ll see in a bit:

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to the constraint } Ax \geq v$

In general there’s no reason you can’t have a “negative” amount of one variable. In this problem you can’t buy negative broccoli, so we’ll add the constraints to ensure the variables are nonnegative. So our final form is

$\displaystyle \text{minimize} \left \langle c,x \right \rangle \\ \text{subject to } Ax \geq v \\ \text{and } x \geq 0$

In general, if you have an $m \times n$ matrix $A$, a “minimums” vector $v \in \mathbb{R}^m$, and a cost vector $c \in \mathbb{R}^n$, the problem of finding the vector $x$ that minimizes the cost function while meeting the constraints is called a linear programming problem or simply a linear program.

To satiate the reader’s burning curiosity, the solution for our water/calcium/vitamin C problem is roughly $x = (1.01, 41.47, 0)$. That is, you should have about 100g of broccoli and 4.1kg of milk (like 4 liters), and skip the oranges entirely. The daily cost is about 4.53 USD. If this seems awkwardly large, it’s because there are cheaper ways to get water than milk.
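As a sanity check on the numbers, here is a small sketch in plain Python that evaluates the cost function and tests the constraints $Ax \geq v$, $x \geq 0$ for a candidate diet. The helper names `dot` and `is_feasible` are made up for this sketch. The candidate is obtained by assuming the water and vitamin C constraints hold with equality and that no oranges are bought, which agrees with the rounded solution quoted above.

```python
# Data from the table above. Variable order is (broccoli, milk, oranges),
# each measured in units of 100g.
c = [0.381, 0.100, 0.272]      # price in USD per 100g
A = [
    [91.0, 87.0, 87.0],        # water (g)
    [47.0, 276.0, 40.0],       # calcium (mg)
    [89.2, 0.0, 53.2],         # vitamin C (mg)
]
v = [3700.0, 1000.0, 90.0]     # daily minimums, in the same row order as A


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def is_feasible(x, tol=1e-9):
    """Check x >= 0 and Ax >= v, up to a small floating point tolerance."""
    nonnegative = all(xi >= -tol for xi in x)
    meets_minimums = all(dot(row, x) >= vi - tol for row, vi in zip(A, v))
    return nonnegative and meets_minimums


# Assumption for this sketch: vitamin C and water are tight, oranges are zero.
broccoli = 90.0 / 89.2
milk = (3700.0 - 91.0 * broccoli) / 87.0
x = [broccoli, milk, 0.0]

print(is_feasible(x), round(dot(c, x), 2))

# For contrast, a milk-only diet with enough water fails the vitamin C row.
print(is_feasible([0.0, 42.6, 0.0]))
```

This only confirms that the candidate is feasible and costs about 4.53 USD; it does not by itself show that no cheaper diet exists.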

100g of broccoli (image source: 100-grams.blogspot.com)

## Duality

Now that we’ve seen the general form of a linear program and a cute example, we can ask the real meaty question: is there an efficient algorithm that solves arbitrary linear programs? Despite how widely applicable these problems seem, the answer is yes!

But before we can describe the algorithm we need to know more about linear programs. For example, say you have some vector $x$ which satisfies your constraints. How can you tell if it’s optimal? Without such a test we’d have no way to know when to terminate our algorithm. Another problem is that we’ve phrased the problem in terms of minimization, but what about problems where we want to maximize things? Can we use the same algorithm that finds minima to find maxima as well?

Both of these problems are neatly answered by the theory of duality. In mathematics in general, the best way to understand what people mean by “duality” is that one mathematical object uniquely determines two different perspectives, each useful in its own way. And typically a duality theorem provides one with an efficient way to transform one perspective into the other, and relate the information you get from both perspectives. A theory of duality is considered beautiful because it gives you truly deep insight into the mathematical object you care about.

In linear programming, duality is between maximization and minimization. In particular, every maximization problem has a unique “dual” minimization problem, and vice versa. The really interesting thing is that the variables you’re trying to optimize in one form correspond to the constraints in the other form! Here’s how one might discover such a beautiful correspondence. We’ll use a made up example with small numbers to make things easy.

So you have this optimization problem

$\displaystyle \begin{matrix} \text{minimize} & 4x_1+3x_2+9x_3 & \\ \text{subject to} & x_1+x_2+x_3 & \geq 6 \\ & 2x_1+x_3 & \geq 2 \\ & x_2+x_3 & \geq 1 & \\ & x_1,x_2,x_3 & \geq 0 \end{matrix}$

Just for giggles let’s write out what $A$ and $c$ are.

$\displaystyle A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, c = (4,3,9), v = (6,2,1)$

Say you want to come up with a lower bound on the optimal solution to your problem. That is, you want to know that you can’t make $4x_1 + 3x_2 + 9x_3$ smaller than some number $m$. The constraints can help us derive such lower bounds. In particular, every variable has to be nonnegative, so we know that $4x_1 + 3x_2 + 9x_3 \geq x_1 + x_2 + x_3 \geq 6$, and so 6 is a lower bound on our optimum. Likewise,

$\displaystyle \begin{aligned}4x_1+3x_2+9x_3 & \geq 4x_1+2x_3+3x_2+3x_3 \\ &=2(2x_1 + x_3)+3(x_2+x_3) \\ & \geq 2 \cdot 2 + 3 \cdot 1 \\ &=7\end{aligned}$

and that’s an even better lower bound than 6. We could try to write this approach down in general: find some numbers $y_1, y_2, y_3$ that we’ll use for each constraint to form

$\displaystyle y_1(\text{constraint 1}) + y_2(\text{constraint 2}) + y_3(\text{constraint 3})$

To make it a valid lower bound we need $y_1, y_2, y_3$ to be nonnegative (so that multiplying a constraint by $y_i$ doesn’t flip the inequality), and we need to ensure that the coefficient of each $x_i$ is no larger than its coefficient in the objective function (i.e. that the coefficient of $x_1$ ends up at most 4). And to make it the best lower bound possible we want to maximize what the right-hand side of the inequality would be: $6y_1 + 2y_2 + y_3$. If you write out these equations and the constraints you get our “lower bound” problem written as

$\displaystyle \begin{matrix} \text{maximize} & 6y_1 + 2y_2 + y_3 & \\ \text{subject to} & y_1 + 2y_2 & \leq 4 \\ & y_1 + y_3 & \leq 3 \\ & y_1+y_2 + y_3 & \leq 9 \\ & y_1,y_2,y_3 & \geq 0 \end{matrix}$
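To spell out where these constraints come from, multiply the three constraints of the minimization problem by nonnegative numbers $y_1, y_2, y_3$, add them up, and regroup by variable:

$\displaystyle \begin{aligned} y_1(x_1+x_2+x_3) + y_2(2x_1+x_3) + y_3(x_2+x_3) &= (y_1+2y_2)x_1 + (y_1+y_3)x_2 + (y_1+y_2+y_3)x_3 \\ &\geq 6y_1 + 2y_2 + y_3 \end{aligned}$

Since each $x_i \geq 0$, the regrouped expression is at most $4x_1 + 3x_2 + 9x_3$ as long as $y_1 + 2y_2 \leq 4$, $y_1 + y_3 \leq 3$, and $y_1 + y_2 + y_3 \leq 9$. For instance, $y = (0, 2, 3)$ recovers the bound of 7 derived earlier.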

And wouldn’t you know, the matrix providing the constraints is $A^T$, and the vectors $c$ and $v$ switched places.

$\displaystyle A^T = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

This is no coincidence. All linear programs can be transformed in this way, and it would be a useful exercise for the reader to turn the above maximization problem back into a minimization problem by the same technique (computing linear combinations of the constraints to make upper bounds). You’ll be surprised to find that you get back to the original minimization problem! This is part of what makes it “duality,” because the dual of the dual is the original thing again. Often, when we fix the “original” problem, we call it the primal form to distinguish it from the dual form. Usually the primal problem is the one that is easy to interpret.
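The bookkeeping behind this transformation can be sketched in a few lines of Python. The function names `transpose` and `dual` are made up for illustration, and the sketch only tracks the data $(A, c, v)$: the switch between minimize/maximize and between $\geq$/$\leq$ is implicit and described in the docstring.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dual(A, c, v):
    """Transpose A and swap the roles of c and v.

    If the input describes "minimize <c, x> subject to Ax >= v, x >= 0",
    the output (A^T, v, c) describes
    "maximize <v, y> subject to A^T y <= c, y >= 0".
    The first vector returned is the new objective, the second holds the
    new constraint bounds.
    """
    return transpose(A), v, c


A = [[1, 1, 1], [2, 0, 1], [0, 1, 1]]
c = [4, 3, 9]
v = [6, 2, 1]

dual_A, dual_objective, dual_bounds = dual(A, c, v)
print(dual_A)           # each row is one constraint of the maximization problem
print(dual_objective)   # coefficients of the function to maximize
print(dual_bounds)      # upper bounds for the rows of dual_A

# Applying the same swap a second time recovers the original data.
print(dual(dual_A, dual_objective, dual_bounds) == (A, c, v))
```

This is only a check on the data; the exercise of re-deriving the minimization problem from upper bounds is still worth doing by hand.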

(Note: because we’re done with broccoli for now, we’re going to use $b$ to denote the constraint vector that used to be $v$.)

Now say you’re given the data of a linear program for minimization, that is the vectors $c, b$ and matrix $A$ for the problem, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b; x \geq 0$.” We can make a general definition: the dual linear program is the maximization problem “maximize $\left \langle b, y \right \rangle$ subject to $A^T y \leq c, y \geq 0$.” Here $y$ is the new set of variables and the superscript T denotes the transpose of the matrix. The constraint for the dual is often written $y^T A \leq c^T$, again identifying vectors with single-column matrices, but I find the swamp of transposes pointless and annoying (why do things need to be columns?).

Now we can actually prove that the objective function for the dual provides a bound on the objective function for the original problem. It’s obvious from the work we’ve done, which is why it’s called the weak duality theorem.

Weak Duality Theorem: Let $c, A, b$ be the data of a linear program in the primal form (the minimization problem) whose objective function is $\left \langle c, x \right \rangle$. Recall that the objective function of the dual (maximization) problem is $\left \langle b, y \right \rangle$. If $x,y$ are feasible solutions (satisfy the constraints of their respective problems), then

$\left \langle b, y \right \rangle \leq \left \langle c, x \right \rangle$

In other words, the maximum of the dual is a lower bound on the minimum of the primal problem, and the minimum of the primal is an upper bound on the maximum of the dual. Moreover, any feasible solution for one provides a bound on the other.

Proof. The proof is pleasingly simple. Just inspect the quantity $\left \langle A^T y, x \right \rangle = \left \langle y, Ax \right \rangle$. The constraints from the definitions of the primal and dual give us that

$\left \langle y, b \right \rangle \leq \left \langle y, Ax \right \rangle = \left \langle A^Ty, x \right \rangle \leq \left \langle c,x \right \rangle$

The inequalities follow from the linear algebra fact that if the $u$ in $\left \langle u,v \right \rangle$ is nonnegative, then you can only increase the size of the product by increasing the components of $v$. This is why we need the nonnegativity constraints.
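Here is a small numeric check of the chain of inequalities in the proof, using the made-up example from earlier in this section (with $b = (6, 2, 1)$). The feasible points `x` and `y` are hand-picked for illustration and are not optimal; the helper names are likewise just for this sketch.

```python
A = [[1, 1, 1], [2, 0, 1], [0, 1, 1]]
c = [4, 3, 9]
b = [6, 2, 1]


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def mat_vec(M, x):
    return [dot(row, x) for row in M]


def transpose(M):
    return [list(column) for column in zip(*M)]


def primal_feasible(x):
    # x >= 0 and Ax >= b
    return all(xi >= 0 for xi in x) and all(
        lhs >= bi for lhs, bi in zip(mat_vec(A, x), b)
    )


def dual_feasible(y):
    # y >= 0 and A^T y <= c
    return all(yi >= 0 for yi in y) and all(
        lhs <= ci for lhs, ci in zip(mat_vec(transpose(A), y), c)
    )


# Hand-picked feasible (but not optimal) points, chosen for illustration.
x = [6, 0, 1]
y = [1, 1, 1]
assert primal_feasible(x) and dual_feasible(y)

chain = [
    dot(y, b),                         # <y, b>
    dot(y, mat_vec(A, x)),             # <y, Ax>
    dot(mat_vec(transpose(A), y), x),  # <A^T y, x>
    dot(c, x),                         # <c, x>
]
print(chain)
```

The four numbers come out nondecreasing, with the middle two equal, exactly as the proof predicts.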

In fact, the world is much more pleasing. There is a theorem that says the two optimums are equal!

Strong Duality Theorem: If there are any feasible solutions $x,y$ to the primal (minimization) problem and the dual (maximization) problem, respectively, then the two problems also have optimal solutions $x^*, y^*$, and two candidate solutions $x^*, y^*$ are optimal if and only if they produce equal objective values $\left \langle c, x^* \right \rangle = \left \langle y^*, b \right \rangle$.

The proof of this theorem is a bit more convoluted than the weak duality theorem, and the key technique is a lemma of Farkas and its variations. See the second half of these notes for a full proof. The nice thing is that this theorem gives us a way to tell if an algorithm to solve linear programs is done: maintain a pair of feasible solutions to the primal and dual problems, improve them by some rule, and stop when the two solutions give equal objective values. The hard part, then, is finding a principled and guaranteed way to improve a given pair of solutions.
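The stopping test described here is easy to write down. The sketch below uses the made-up example from earlier in this section; the candidate pair $x = (1, 5, 0)$, $y = (3, 1/2, 0)$ was found by hand for this illustration and is not taken from the post. The function names are also just for this sketch. Exact fractions are used so that the equality test is not affected by floating point rounding.

```python
from fractions import Fraction

A = [[1, 1, 1], [2, 0, 1], [0, 1, 1]]
c = [4, 3, 9]
b = [6, 2, 1]


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def is_primal_feasible(x):
    return all(xi >= 0 for xi in x) and all(
        dot(row, x) >= bi for row, bi in zip(A, b)
    )


def is_dual_feasible(y):
    columns = list(zip(*A))  # the columns of A are the rows of A^T
    return all(yi >= 0 for yi in y) and all(
        dot(column, y) <= ci for column, ci in zip(columns, c)
    )


def certifies_optimality(x, y):
    """True when x and y are feasible and their objective values agree."""
    return (
        is_primal_feasible(x)
        and is_dual_feasible(y)
        and dot(c, x) == dot(b, y)
    )


x_candidate = [1, 5, 0]
y_candidate = [3, Fraction(1, 2), 0]
print(certifies_optimality(x_candidate, y_candidate), dot(c, x_candidate))

# A feasible pair with a gap between the objectives is not certified.
print(certifies_optimality([6, 0, 1], [1, 1, 1]))
```

Both objectives equal 19 for the candidate pair, so by weak duality neither can be improved. The check says nothing about how to find such a pair, which is the hard part.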

On the other hand, you can also prove the strong duality theorem by inventing an algorithm that provably terminates. We’ll see such an algorithm, known as the simplex algorithm, in the next post. Sneak peek: it’s a lot like Gaussian elimination. Then we’ll use the algorithm (or an equivalent industry-strength version) to solve a much bigger nutrition problem.

In fact, you can do a bit better than the strong duality theorem, in terms of coming up with a stopping condition for a linear programming algorithm. You can observe that an optimal solution implies further constraints on the relationship between the primal and the dual problems. In particular, these are called the complementary slackness conditions, and they essentially say that if an optimal solution to the primal has a positive variable then the corresponding constraint in the dual problem must be tight (is an equality) to get an optimal solution to the dual. The contrapositive says that if some constraint is slack, or a strict inequality, then either the corresponding variable is zero or else the solution is not optimal. More formally,

Theorem (Complementary Slackness Conditions): Let $A, c, b$ be the data of the primal form of a linear program, “minimize $\left \langle c, x \right \rangle$ subject to $Ax \geq b, x \geq 0$.” Then $x^*, y^*$ are optimal solutions to the primal and dual problems if and only if all of the following conditions hold.

• $x^*, y^*$ are both feasible for their respective problems.
• Whenever $x^*_i > 0$ the corresponding constraint $A^T_i y^* = c_i$ is an equality.
• Whenever $y^*_j > 0$ the corresponding constraint $A_j x^* = b_j$ is an equality.

Here we denote by $M_i$ the $i$-th row of the matrix $M$ and by $v_i$ the $i$-th entry of a vector $v$. Another way to write the condition using vectors instead of English is

$\left \langle x^*, A^T y^* - c \right \rangle = 0$
$\left \langle y^*, Ax^* - b \right \rangle = 0$

The proof follows from the duality theorems, and just involves pushing around some vector algebra. See section 6.2 of these notes.
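The vector form of the conditions translates directly into code. This sketch reuses the made-up example from earlier in this section, and the pair $x = (1, 5, 0)$, $y = (3, 1/2, 0)$ is a hand-computed candidate for illustration, not something stated in the post. The function name `slackness_products` is likewise an assumption of this sketch.

```python
from fractions import Fraction

A = [[1, 1, 1], [2, 0, 1], [0, 1, 1]]
c = [4, 3, 9]
b = [6, 2, 1]


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def slackness_products(x, y):
    """Return (<x, A^T y - c>, <y, Ax - b>) for candidate solutions x, y."""
    # zip(*A) iterates over the columns of A, i.e. the rows of A^T.
    dual_slack = [dot(column, y) - ci for column, ci in zip(zip(*A), c)]
    primal_slack = [dot(row, x) - bi for row, bi in zip(A, b)]
    return dot(x, dual_slack), dot(y, primal_slack)


# For the hand-computed candidate pair both products vanish.
print(slackness_products([1, 5, 0], [3, Fraction(1, 2), 0]))

# For a feasible but non-optimal pair they do not.
print(slackness_products([6, 0, 1], [1, 1, 1]))
```

For the second pair the products are $-12$ and $12$, and their magnitudes add up to the gap $33 - 9 = 24$ between the two objective values. That is no accident, since $\left \langle c, x \right \rangle - \left \langle b, y \right \rangle = \left \langle c - A^T y, x \right \rangle + \left \langle y, Ax - b \right \rangle$.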

One can interpret complementary slackness in linear programs in a lot of different ways. For us, it will simply be a termination condition for an algorithm: one can efficiently check all of these conditions for the nonzero variables and stop if they’re all satisfied or if we find a variable that violates a slackness condition. Indeed, in more mature optimization analyses, the slackness condition that is most egregiously violated can provide evidence for where a candidate solution can best be improved. For a more intricate and detailed story about how to interpret the complementary slackness conditions, see Section 4 of these notes by Joel Sobel.

Finally, before we close we should note there are geometric ways to think about linear programming. I have my preferred visualization in my head, but I have yet to find a suitable animation on the web that replicates it. Here’s one example in two dimensions. The set of constraints defines a convex geometric region in the plane.

The constraints define a convex area of “feasible solutions.” Image source: Wikipedia.

Now the optimization function $f(x) = \left \langle c,x \right \rangle$ is also a linear function, and if you fix some output value $y = f(x)$ this defines a line in the plane. As $y$ changes, the line moves along its normal vector (that is, all these fixed lines are parallel). Now to geometrically optimize the target function, we can imagine starting with the line $f(x) = 0$, and sliding it along its normal vector in the direction that keeps it in the feasible region. We can keep sliding it in this direction, and the maximum of the function is just the last instant that this line intersects the feasible region. If none of the constraints are parallel to the family of lines defined by $f$, then this is guaranteed to occur at a vertex of the feasible region. Otherwise, there will be a family of optima lying anywhere on the line segment of last intersection.

In higher dimensions, the only change is that the lines become affine subspaces of dimension $n-1$. That means in three dimensions you’re sliding planes, in four dimensions you’re sliding 3-dimensional hyperplanes, etc. The facts about the last intersection being a vertex or a “line segment” still hold. So as we’ll see next time, successful algorithms for linear programming in practice take advantage of this observation by efficiently traversing the vertices of this convex region. We’ll see this in much more detail in the next post.
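The vertex observation already suggests a naive algorithm: list every vertex and keep the best one. The sketch below does this for the made-up minimization example from the duality section. It is a brute-force illustration with hypothetical helper names, not the simplex algorithm, and it relies on two assumptions that hold for this example: the feasible region has at least one vertex, and the objective is bounded on it. A vertex is found by picking three constraints, solving them as equalities, and keeping the point if it satisfies everything else.

```python
from fractions import Fraction
from itertools import combinations

c = [4, 3, 9]
# Every constraint is written as <row, x> >= bound, including x_i >= 0.
constraints = [
    ([1, 1, 1], 6),
    ([2, 0, 1], 2),
    ([0, 1, 1], 1),
    ([1, 0, 0], 0),
    ([0, 1, 0], 0),
    ([0, 0, 1], 0),
]


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def solve(rows, rhs):
    """Solve a square system by Gauss-Jordan elimination; None if singular."""
    n = len(rows)
    M = [[Fraction(a) for a in row] + [Fraction(r)] for row, r in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [entry / M[col][col] for entry in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def is_feasible(point):
    return all(dot(row, point) >= bound for row, bound in constraints)


def vertices():
    """Yield feasible points where three constraints hold with equality."""
    for chosen in combinations(constraints, 3):
        rows = [row for row, _ in chosen]
        bounds = [bound for _, bound in chosen]
        point = solve(rows, bounds)
        if point is not None and is_feasible(point):
            yield point


best = min(vertices(), key=lambda point: dot(c, point))
print(best, dot(c, best))
```

The number of candidate triples grows combinatorially with the number of constraints and variables, which is why a practical algorithm walks from vertex to neighboring vertex instead of enumerating them all.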

Until then!